May 4, 2024

Wu on Zittrain's Future of the Internet

Related to my previous post about the future of open technologies, Tim Wu has a great review of Jonathan Zittrain’s book. Wu reviews the origins of the 20th century’s great media empires, which steadily consolidated once-fractious markets. He suggests that the Internet likely won’t meet the same fate. My favorite part:

In the 2000s, AOL and Time Warner took the biggest and most notorious run at trying to make the Internet more like traditional media. The merger was a bet that unifying content and distribution might yield the kind of power that Paramount and NBC gained in the 1920s. They were not alone: Microsoft in the 1990s thought that, by owning a browser (Explorer), dial-in service (MSN), and some content (Slate), it could emerge as the NBC of the Internet era. Lastly, AT&T, the same firm that built the first radio network, keeps signaling plans to assert more control over “its pipes,” or even create its own competitor to the Internet. In 2000, when AT&T first announced its plans to enter the media market, a spokesman said: “We believe it’s very important to have control of the underlying network.”

Yet so far these would-be Zukors and NBCs have crashed and burned. Unlike radio or film, the structure of the Internet stoutly resists integration. AOL tried, in the 1990s, to keep its users in a “walled garden” of AOL content, but its users wanted the whole Internet, and finally AOL gave in. To make it after the merger, AOL-Time Warner needed to build a new garden with even higher walls–some way for AOL to discriminate in favor of Time Warner content. But AOL had no real power over its users, and pretty soon it did not have many of them left.

I think the monolithic media firms of the 20th century ultimately owed their size and success to economies of scale in the communication technologies of their day. For example, a single newspaper with a million readers is a lot cheaper to produce and distribute than ten newspapers with 100,000 readers each. And so the larger film studios, newspapers, broadcast networks, and so on were able to squeeze out smaller players. Once one newspaper in a given area began reaping the benefits of scale, it made it difficult for its competitors to turn a profit, and a lot of them went out of business or got acquired at firesale prices.

On the Internet, distributing content is so cheap that economies of scale in distribution just don’t matter. On a per-reader basis, my personal blog certainly costs more to operate than CNN. But the cost is so small that it’s simply not a significant factor in deciding whether to continue publishing it. Even if the larger sites capture the bulk of the readership and advertising revenue, that doesn’t preclude a “long tail” of small, often amateur sites with a wide variety of different content.

The Perpetual Peril of Open Platforms

Over at Techdirt, Mike Masnick did a great post a few weeks back on a theme I’ve written about before: people’s tendency to underestimate the robustness of open platforms.

Once people have a taste for what that openness allows, stuffing it back into a box is very difficult. Yes, it’s important to remain vigilant, and yes, people will always attempt to shut off that openness, citing all sorts of “dangers” and “bad things” that the openness allows. But, the overall benefits of the openness are recognized by many, many people — and the great thing about openness is that you really only need a small number of people who recognize its benefits to allow it to flourish.

Closed systems tend to look more elegant at first — and often they are much more elegant at first. But open systems adapt, change and grow at a much faster rate, and almost always overtake closed systems, over time. And, once they overtake the closed systems, almost nothing will allow them to go back. Even if it were possible to turn an open system like the web into a closed system, openness would almost surely sneak out again, via a new method by folks who recognized how dumb it was to close off that open system.

Predictions about the impending demise of open systems have been a staple of tech policy debates for at least a decade. Larry Lessig’s Code and Other Laws of Cyberspace is rightly remembered as a landmark work of tech policy scholarship for its insights about the interplay between “East Coast code” (law) and “West Coast code” (software). But people often forget that it also made some fairly specific predictions. Lessig thought that the needs of e-commerce would push the Internet toward a more centralized architecture: a McInternet that squeezed out free speech and online anonymity.

So far, at least, Lessig’s predictions have been wide of the mark. The Internet is still an open, decentralized system that allows robust anonymity and free speech. But the pessimistic predictions haven’t stopped. Most recently, Jonathan Zittrain wrote a book predicting the impending demise of the Internet’s “generativity,” this time driven by security concerns rather than commercialization.

It’s possible that these thinkers will be proven right in the coming years. But I think it’s more likely that these brilliant legal thinkers have been misled by a kind of optical illusion created by the dynamics of the marketplace. The long-term trend has been a steady triumph for open standards: relatively open technologies like TCP/IP, HTTP, XML, PDF, Java, MP3, SMTP, BitTorrent, USB, x86, and many others have become dominant in their respective domains. But at any given point in time, a disproportionate share of public discussion is focused on those sectors of the technology industry where open and closed platforms are competing head-to-head. After all, nobody wants to read news stories about, say, the fact that TCP/IP’s market share continues to be close to 100 percent and has no serious competition. And at least superficially, the competition between open and closed systems looks really lopsided: the proprietary options tend to be supported by large, deep-pocketed companies with large development teams, multi-million dollar advertising budgets, distribution deals with leading retailers, and so forth. It’s not surprising that people so frequently conclude that open standards are on the verge of getting crushed.

For example, Zittrain makes the iPhone a poster child for the flashy but non-generative devices he fears will come to dominate the market. And it’s easy to see the iPhone’s advantages. Apple’s widely-respected industrial design department created a beautiful product. Its software engineers created a truly revolutionary user interface. Apple and AT&T both have networks of retail stores with which to promote the iPhone, and Apple is spending millions of dollars airing television ads. At first glance, it looks like open technologies are on the ropes in the mobile marketplace.

But open technologies have a kind of secret weapon: the flexibility and power that come from decentralization. The success of the iPhone is entirely dependent on Apple making good technical and business decisions, and building on top of proprietary platforms requires navigating complex licensing issues. In contrast, absolutely anyone can use and build on top of an open platform without asking anyone else for permission, and without worrying about legal problems down the line. That means that at any one time, you have a lot of different people trying a lot of different things on that open platform. In the long run, the creativity of millions of people will usually exceed that of a few hundred engineers at a single firm. As Mike says, open systems adapt, change and grow at a much faster rate than closed ones.

Yet much of the progress of open systems tends to happen below the radar. The grassroots users of open platforms are far less likely to put out press releases or buy time for television ads. So often it’s only after an open technology has become firmly entrenched in its market—MySQL in the low-end database market, for example—that the mainstream press starts to take notice of it.

As a result, despite the clear trend toward open platforms in the past, it looks to many people like that pattern is going to stop and perhaps even be reversed. I think this illusion is particularly pronounced for folks who are getting their information second- or third-hand. If you’re judging the state of the technology industry from mainstream media stories, television ads, shelf space at Best Buy, etc., you’re likely not getting the whole story. It’s helpful to remember that open platforms have always looked like underdogs. They’re no more likely to be crushed today than they were in 1999, 1989, or 1979.

More Privacy, Bit by Bit

Before the holidays, Yahoo got a flurry of good press for the announcement that it would (as the LA Times puts it) “purge user data after 90 days.” My eagle-eyed friend Julian Sanchez noticed that the “purge” was less complete than privacy advocates might have hoped. It turns out that Yahoo won’t be deleting the contents of its search logs. Rather, it will merely be zeroing out the last 8 bits of users’ IP addresses. Julian is not impressed:

dropping the last byte of an IP address just means you’ve narrowed your search space down to (at most) 256 possibilities rather than a unique machine. By that standard, this post is anonymous, because I guarantee there are more than 255 other guys out there with the name “Julian Sanchez.”

The first three bytes, in the majority of cases, are still going to be enough to give you a service provider and a rough location. Assuming every address in the range is in use, dropping the least-significant byte just obscures which of the 256 users at that particular provider is behind each query. In practice, though, the search space is going to be smaller than that, because people are creatures of habit: You’re really working with the pool of users in that range who perform searches on Yahoo. If your not-yet-anonymized logs show, say, 45 IP addresses that match those first three bytes making routine searches on Yahoo (17.6% of the search market x 256 = 45) you can probably safely assume that an “anonymized” IP with the same three leading bytes is one of those 45. If different users tend to exhibit different usage patterns in search time, clustering of queries, expertise with Boolean operators, or preferred natural language, you can narrow it down further.

I think this isn’t quite fair to Yahoo. Dropping the last eight bits of the IP address certainly doesn’t protect privacy as much as deleting log entries entirely, but it’s far from useless. To start with, there’s often not a one-to-one correspondence between IP addresses and Internet users. Often a single user has multiple IPs. For example, when I connect to the Princeton wireless network, I’m dynamically assigned an IP address that may not be the same as the IP address I used the last time I logged on. I also access the web from my iPhone and from hotels and coffee shops when I travel. Conversely, several users on a given network may be sharing a single IP address using a technology called network address translation. So even if you know the IP address of the user who performed a particular search, that may simply tell you that the user works for a particular company or connected from a particular coffee shop. Hence, tracking a particular user’s online activities is already something of a challenge, and it becomes that much harder if several dozen users’ online activities are scrambled together in Yahoo!’s logs.
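To make the mechanics concrete, here is a minimal Python sketch of what “zeroing out the last 8 bits” does to a log. The log entries and the function name zero_last_octet are my own invention for illustration, and I have no idea how Yahoo actually implements its scrubbing; the last two lines simply reproduce Julian’s back-of-the-envelope estimate.

```python
import ipaddress

def zero_last_octet(ip: str) -> str:
    """Zero the low 8 bits of an IPv4 address (hypothetical helper)."""
    addr = int(ipaddress.IPv4Address(ip))
    return str(ipaddress.IPv4Address(addr & 0xFFFFFF00))

# Made-up log entries: (IP address, query). The addresses come from
# ranges reserved for documentation, not from any real log.
log = [
    ("192.0.2.17", "cheap flights"),
    ("192.0.2.200", "wireless setup"),
    ("198.51.100.5", "metro schedule"),
]

# After scrubbing, the first two entries carry the same address,
# so the log alone no longer says whether one machine or two made them.
scrubbed = [(zero_last_octet(ip), query) for ip, query in log]
for ip, query in scrubbed:
    print(ip, query)

# Julian's estimate: 17.6% search share times 256 addresses per range.
pool = int(0.176 * 256)
print(pool)
```

Anyone holding only the scrubbed log sees 192.0.2.0 for both of the first two queries, which is exactly the “scrambling together” described above.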

Now, whether this is “enough” privacy depends a lot on what kind of privacy problem you’re worried about. It seems to me that there are three broad categories of privacy concerns:

  • Privacy violations by Yahoo or its partners: Some people are worried that Yahoo itself is tracking their online activities, building an online profile about them, and selling this information to third parties. Obviously, Yahoo’s new policy will do little to allay such concerns. Indeed, as David Kravets points out, Yahoo will have already squeezed all the personal information it can out of those logs before it scours them. If you don’t trust Yahoo or its business partners, this move isn’t going to make you feel very much safer.
  • Data breaches: A second concern involves cases where customer data falls into the wrong hands due to a security breach. In this case, it’s not clear that search engine logs are especially useful to data thieves in the first place. Data thieves are typically looking for information such as credit card and Social Security numbers that can make them a quick buck. People rarely type such information into search boxes. Some searches may be embarrassing to users, but they probably won’t be so embarrassing as to enable blackmail or extortion. So search logs are not likely to be that useful to criminals, whether or not they are “anonymized.”
  • Court-ordered information release: This is the case where the new policy could have the biggest effect. Consider, for example, a case where the police seek a suspect’s search results. The new policy will help protect privacy in three ways: first, if Yahoo! can’t cleanly filter search logs by IP address, judges may be more reluctant to order the disclosure of several dozen users’ search results just to give police information about a single suspect. Second, scrubbing the last byte of the IP address will make searching through the data much more difficult. Finally, the resulting data will be less useful in a court of law, because prosecutors will need to convince a jury that a given search was performed by the defendant rather than another user who happened to have a similar IP address. At the margin, then, Yahoo’s new policy seems likely to significantly enhance user privacy against government information requests. The same principle applies in the case of civil suits: the recording and movie industries, for example, will have a harder time using Yahoo!’s search logs as evidence that a user was engaged in illegal file-sharing.

So based on the small amount of information Yahoo has made available, it seems that the new policy is a real, if small, improvement in users’ privacy. However, it’s hard to draw any definite conclusions without more specific information about what data Yahoo! is saving, because anonymizing data is a lot harder than people think. AOL learned this the hard way in 2006 when “anonymized” search results were released to researchers. People quickly noticed that you could figure out who various users were by looking at the contents of their searches. The data wasn’t so anonymous after all.

One reason AOL’s data wasn’t so anonymous is that AOL had “anonymized” the data set by assigning each user a unique ID. That meant people could look at all searches made by a single user and find searches that gave clues to the user’s identity. Had AOL instead stripped off the user information without replacing it, it would have been much harder to de-anonymize the data because there would be no way to match up different searches by the same user. If Yahoo’s logs include information linking each user’s various searches together, then even deleting the IP address entirely probably won’t be enough to safeguard user privacy. On the other hand, if the only user-identifying information is the IP address, then stripping off the low byte of the IP address is a real, if modest, privacy enhancement.
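The difference between replacing user information with an ID and stripping it entirely is easy to see in a toy example. The sketch below uses invented users and queries, and the function names are mine; it is not a description of how AOL or Yahoo process their logs.

```python
from collections import defaultdict

# Invented records: (user, query).
raw = [
    ("alice", "dentists in springfield"),
    ("bob", "metro schedule"),
    ("alice", "homes for sale on elm street"),
    ("alice", "alice smith springfield"),
]

def pseudonymize(records):
    """Replace each user with a stable numeric ID (the AOL-style approach)."""
    ids = {}
    out = []
    for user, query in records:
        pid = ids.setdefault(user, len(ids) + 1)
        out.append((pid, query))
    return out

def strip_identity(records):
    """Drop the user field entirely, leaving unlinked queries."""
    return [query for _, query in records]

# With stable IDs, all of one user's queries can be gathered into a
# profile, and the profile itself may give the user away.
profiles = defaultdict(list)
for pid, query in pseudonymize(raw):
    profiles[pid].append(query)
print(dict(profiles))

# With the user field stripped, there is nothing to group on.
print(strip_identity(raw))
```

User 1’s three queries together point to a person in a way that no single query does, which is the linking problem described above.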

The DC Metro and the Invisible Hand

My friend Tom Lee has been pestering the Washington Metropolitan Area Transit Authority, the agency that runs the DC area’s public transit system, to publish its schedule data in an open format. That will allow companies like Google to include the information in products like Google Transit. It seems that Google has been negotiating with WMATA for months to get access to the data, and the negotiations recently broke down, depriving DC-area transit users of the opportunity to use Google Transit. Reading between the lines, it appears that the sticking point is that WMATA wants Google to cough up some money for access to the data. It seems that WMATA earns some advertising revenue from its existing website, and it’s afraid that Google will undermine that revenue source.

While as a taxpayer I’m happy to see WMATA worrying about its bottom line, this seems like a pretty misguided decision. For starters, this really isn’t about Google. Google has been lobbying transit agencies around the country to publish data in the Google Transit Feed Specification. Although it may sound proprietary, the GTFS is an open standard. This means that absolutely anyone can download GTFS-formatted data and put it to new uses. Of course, Google has a small head start because they invented the format, but with Google making open-sourced tools available for manipulating GTFS files, the barrier to entry here is pretty small.
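To give a sense of how low that barrier is: a GTFS feed is just a set of comma-separated text files such as stops.txt and stop_times.txt, so a few lines of Python with no special libraries can answer a simple schedule question. The stop names, trips, and times below are invented for illustration and are not WMATA data; the column names follow the GTFS specification.

```python
import csv
import io

# Invented sample data in the shape of two GTFS files.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Example Center,38.8983,-77.0281
S2,Sample Square,38.9097,-77.0434
"""

STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:01:00,S1,1
T1,08:06:00,08:07:00,S2,2
T2,08:15:00,08:16:00,S1,1
T2,08:21:00,08:22:00,S2,2
"""

def load(text):
    """Parse one GTFS text file into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def departures_after(stop_name, after):
    """List departure times from a named stop at or after a given HH:MM:SS."""
    stops = {row["stop_name"]: row["stop_id"] for row in load(STOPS_TXT)}
    stop_id = stops[stop_name]
    # Zero-padded HH:MM:SS strings compare correctly as plain text.
    times = [
        row["departure_time"]
        for row in load(STOP_TIMES_TXT)
        if row["stop_id"] == stop_id and row["departure_time"] >= after
    ]
    return sorted(times)

print(departures_after("Sample Square", "08:10:00"))
```

A real feed has more files (routes, trips, service calendars) and far more rows, but nothing about the format requires Google’s involvement to read it.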

WMATA seems to have lost sight of the fact that it is a government agency accountable to the general public, not a profit-making business. It’s laudable that the agency is looking for new revenue sources, but it’s silly to do so in a way that’s contrary to its broader mission. And the amount of money we’re talking about here (DCist says the agency made $68,000 in ad revenue in 2007) is truly trivial for an agency with a billion-dollar budget. Scuttling WMATA’s participation in Google Transit looks especially shortsighted when we consider that making schedule information easier to access would almost certainly bring additional riders (and, therefore, additional revenues) to the system.

Finally, and most importantly, WMATA should remember the point made by my colleagues in their recent paper: the most important uses for public data are often the ones that no one expects at the outset. Google Transit is great, and Metro riders will enjoy immediate benefits from being able to access schedule information using it. But there may be even more valuable uses to which the data could be put. And not everyone with a good idea for using the data will have the resources to negotiate directly with WMATA for access. This is why it’s crucial that WMATA not only release the data to Google but also make it freely and widely available to the general public, so that other private parties can get access to it. To its credit, Google has asked WMATA to do just that. WMATA should say yes.

The Journal Misunderstands Content-Delivery Networks

There’s been a lot of buzz today about this Wall Street Journal article that reports on the shifting positions of some of the leading figures of the network neutrality movement. Specifically, it claims that Google, Microsoft, and Yahoo! have abandoned their prior commitment to network neutrality. It also claims that Larry Lessig has “softened” his support for network neutrality, and it implies that because Lessig is an Obama advisor, his changing stance may portend a similar shift in the president-elect’s views, which would obviously be a big deal.

Unfortunately, the Journal seems to be confused about the contours of the network neutrality debate, and in the process it has mis-described the positions of at least two of the key players in the debate, Google and Lessig. Both were quick to clarify that their views have not changed.

At the heart of the dispute is a question I addressed in my recent Cato paper on network neutrality: do content delivery networks (CDNs) violate network neutrality? A CDN is a group of servers that improve website performance by storing content closer to the end user. The most famous is Akamai, which has servers distributed around the world and which sells its capacity to a wide variety of large website providers. When a user requests content from the website of a company that uses Akamai’s service, the user’s browser may be automatically re-directed to the nearest Akamai server. The result is faster load times for the user and reduced load on the original web server. Does this violate network neutrality? If you’ll forgive me for quoting myself, here’s how I addressed the question in my paper:

To understand how Akamai manages this feat, it’s helpful to know a bit more about what happens under the hood when a user loads a document from the Web. The Web browser must first translate the domain name (e.g., “cato.org”) into a corresponding IP address (72.32.118.3). It does this by querying a special computer called a domain name system (DNS) server. Only after the DNS server replies with the right IP address can the Web browser submit a request for the document. The process for accessing content via Akamai is the same except for one small difference: Akamai has special DNS servers that return the IP addresses of different Akamai Web servers depending on the user’s location and the load on nearby servers. The “intelligence” of Akamai’s network resides in these DNS servers.

Because this is done automatically, it may seem to users like “the network” is engaging in intelligent traffic management. But from a network router’s perspective, a DNS server is just another endpoint. No special modifications are needed to the routers at the core of the Internet to get Akamai to work, and Akamai’s design is certainly consistent with the end-to-end principle.
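The DNS trick described above can be sketched in a few lines of Python. This is a toy model under my own assumptions: the server table, region names, load figures, and the resolve function are all hypothetical, and Akamai’s real system is surely far more sophisticated. The point is only that the “intelligence” lives in an endpoint that answers name lookups, not in any router.

```python
# Hypothetical edge servers: region -> list of (IP address, current load).
EDGE_SERVERS = {
    "us-east": [("203.0.113.10", 0.80), ("203.0.113.11", 0.35)],
    "europe": [("203.0.113.20", 0.50)],
}

# Hypothetical address of the customer's own (origin) web server.
ORIGIN_IP = "198.51.100.1"

# Hostnames this toy DNS server is willing to answer for.
CUSTOMER_HOSTS = {"www.example.com"}

def resolve(hostname, client_region):
    """Answer a DNS query with the least-loaded edge server near the client."""
    if hostname not in CUSTOMER_HOSTS:
        return None  # not a name this server answers for
    candidates = EDGE_SERVERS.get(client_region)
    if not candidates:
        return ORIGIN_IP  # no nearby edge server; fall back to the origin
    ip, _load = min(candidates, key=lambda server: server[1])
    return ip

print(resolve("www.example.com", "us-east"))
print(resolve("www.example.com", "asia"))
```

Once the browser has its answer, it sends an ordinary request to that IP address, and every router along the way treats the packets like any others.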

The success of Akamai has prompted some of the Internet’s largest firms to build CDN-style networks of their own. Google, Microsoft, and Yahoo have already started building networks of large data centers around the country (and the world) to ensure there is always a server close to each end user’s location. The next step is to sign deals to place servers within the networks of individual residential ISPs. This is a win-win-win scenario: customers get even faster response times, and both Google and the residential ISP save money on bandwidth.

The Journal apparently got wind of this arrangement and interpreted it as a violation of network neutrality. But this is a misunderstanding of what network neutrality is and how CDNs work. Network neutrality is a technical principle about the configuration of Internet routers. It’s not about the business decisions of network owners. So if Google signs an agreement with a major ISP to get its content to customers more quickly, that doesn’t necessarily mean that a network neutrality violation has occurred. Rather, we have to look at how the speed-up was accomplished. If, for example, it was accomplished by upgrading the network between the ISP and Google, network neutrality advocates would have no reason to object. In contrast, if the ISP accomplished it by re-configuring its routers to route Google’s packets in preference to those from other sources, that would be a violation of network neutrality.
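The distinction is about what the router does with a queue of waiting packets. Here is a deliberately oversimplified Python sketch; the packet format, source names, and both function names are hypothetical, and real routers are configured very differently. It contrasts forwarding that ignores the sender with forwarding that favors one source.

```python
def forward_neutral(queue):
    """Forward packets in arrival order, ignoring who sent them."""
    return list(queue)

def forward_prioritized(queue, favored_source):
    """Forward the favored source's packets ahead of everyone else's."""
    # sorted() is stable, so order within each group is preserved;
    # False sorts before True, which puts the favored source first.
    return sorted(queue, key=lambda pkt: pkt["src"] != favored_source)

# Hypothetical packets waiting at a router, in arrival order.
queue = [
    {"src": "smallblog", "seq": 1},
    {"src": "bigsearch", "seq": 2},
    {"src": "smallblog", "seq": 3},
]

print([p["seq"] for p in forward_neutral(queue)])
print([p["seq"] for p in forward_prioritized(queue, "bigsearch")])
```

A faster link or a nearby cache speeds up delivery without changing the first function at all; only a switch to something like the second one is the kind of router reconfiguration that raises a neutrality objection.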

The Journal article had relatively few details about the deal Google is supposedly negotiating with residential ISPs, so it’s hard to say for sure which category it’s in. But what little description the Journal does give us—that the agreement would “place Google servers directly within the network of the service providers”—suggests that the agreement would not violate network neutrality. And indeed, over on its public policy blog, Google denies that its “edge caching” network violates network neutrality and reiterates its support for a neutral Internet. Don’t believe everything you read in the papers.