March 29, 2024

Archives for April 2008

Phorm's Harms Extend Beyond Privacy

Last week, I wrote about the privacy concerns surrounding Phorm, an online advertising company that has teamed up with British ISPs to track user Web behavior from within their networks. New technical details about its Webwise system have since emerged, and it’s not just privacy that now seems to be at risk. The report exposes a system that actively degrades user experience and alters the interaction with content providers. Even more importantly, the Webwise system is a clear violation of the sacred end-to-end principle that guides the core architectural design of the Internet.

Phorm’s system does more than just passively gain “access to customers’ browsing records” as previously suggested. Instead, the company plans to install a network switch at each participating ISP that actively interferes with the user’s browsing session by injecting multiple URL redirections before the user can retrieve the requested content. Sparing you most of the nitty-gritty technical details, the switch intercepts the initial HTTP request to the content server to check whether a Webwise cookie, containing the user’s randomly assigned identifier (UID), exists in the browser. It then impersonates the requested server to trick the browser into accepting a spoofed cookie (which I will explain later) that contains the same UID. Only then will the switch forward the request and return the actual content to the user. Basically, this amounts to a big technical hack by Phorm to set the cookies that track users as they browse the Web.

In all, a user’s initial request is redirected three times for each domain that is contacted. Though this may not seem like much, this extra layer of indirection harms the user by degrading the overall browsing experience. It imposes an unnecessary delay that will likely be noticeable to users.

The spoofed cookie that Phorm stores on the user’s browser during this process is also a highly questionable practice. Generally speaking, a cookie is specific to a particular domain and the browser typically ensures that a cookie can only be read and written by the domain it belongs to. For example, data in a yahoo.com cookie is only sent when you contact a yahoo.com server, and only a yahoo.com server can put data into that cookie.

But since Phorm controls the switch at the ISP, it can bypass this usual guarantee by impersonating the server to add cookies for other domains. To continue the example, the switch (1) intercepts the user’s request, (2) pretends to be a yahoo.com server, and (3) injects a new yahoo.com cookie that contains the Phorm UID. The browser, believing the cookie to actually be from yahoo.com, happily accepts and stores it. This cookie is used later by Phorm to identify the user whenever the user visits any page on yahoo.com.

Cookie spoofing is problematic because it can change the interaction between the user and the content-providing site. Suppose a site’s privacy policy promises the user that it does not use tracking cookies. But because of Phorm’s spoofing, the browser will store a cookie that (to the user) looks exactly like a tracking cookie from the site. Now, the switch typically strips out this tracking cookie before it reaches the site, but if the user moves to a non-Phorm ISP (say at work), the cookie will actually reach the site in violation of its stated privacy policy. The cookie can also cause other problems, such as a cookie collision if the site cookie inadvertently has the same name as the Phorm cookie.
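To make the mechanics concrete, here is a toy simulation in Python of the spoofing and stripping described above. It is only a sketch under stated assumptions: the `Browser` and `InPathSwitch` classes, the cookie name `phorm_uid`, and the UID value are all hypothetical stand-ins rather than Phorm's actual implementation, and the model leaves out the redirects entirely.

```python
class Browser:
    """Toy browser: one cookie jar per domain."""

    def __init__(self):
        self.jar = {}  # domain -> {cookie name: value}

    def accept_response(self, claimed_domain, set_cookies):
        # The browser cannot tell who really sent the response;
        # it files the cookies under the domain the response claims.
        self.jar.setdefault(claimed_domain, {}).update(set_cookies)

    def cookies_for(self, domain):
        # Only cookies filed under this domain are sent to it.
        return dict(self.jar.get(domain, {}))


class InPathSwitch:
    """Hypothetical stand-in for the switch inside a participating ISP."""

    COOKIE_NAME = "phorm_uid"  # assumed name, for illustration only

    def __init__(self, uid):
        self.uid = uid

    def spoof(self, browser, requested_domain):
        # (1) intercept the request, (2) answer as if we were the site,
        # (3) plant a cookie under the site's domain carrying the UID.
        browser.accept_response(requested_domain, {self.COOKIE_NAME: self.uid})

    def forward(self, browser, requested_domain):
        # Strip the tracking cookie before the request reaches the site.
        outgoing = browser.cookies_for(requested_domain)
        outgoing.pop(self.COOKIE_NAME, None)
        return outgoing


browser = Browser()
switch = InPathSwitch(uid="abc123")
switch.spoof(browser, "yahoo.com")

# At home, on the participating ISP, the switch hides the cookie from the site.
seen_via_phorm_isp = switch.forward(browser, "yahoo.com")

# At work, on another ISP, nothing strips it: the site receives the UID.
seen_via_other_isp = browser.cookies_for("yahoo.com")

print(seen_via_phorm_isp)  # {}
print(seen_via_other_isp)  # {'phorm_uid': 'abc123'}
```

The point of the sketch is that the browser files the cookie under yahoo.com, so the only thing keeping it away from the real site is the switch itself. Take the switch out of the path and the site starts receiving a tracking cookie it never set.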

Disruptive activities inside the network often create these sorts of unexpected problems for both users and websites, which is why computer scientists are skeptical of ideas that violate the end-to-end principle. For the uninitiated, the principle, in short, states that system functionality should almost always be implemented at the end hosts of the network, with a few justifiable exceptions. For instance, almost all security functionality (such as data encryption and decryption) is done by end users and only rarely by machines inside the network.

The Webwise system has no business being inside the network and has no role in transporting packets from one end of the network to the other. The technical Internet community has been worried for years about the slow erosion of the end-to-end principle, particularly by ISPs who are looking to further monetize their networks. This principle is the one upon which the Internet is built and one which the ISPs must uphold. Phorm’s system, nearly in production, is a cogent realization of this erosion, and ISPs should keep Phorm outside the gate.

NJ Election Discrepancies Worse Than Previously Thought, Contradict Sequoia's Explanation

I wrote previously about discrepancies in the vote totals reported by Sequoia AVC Advantage voting machines in New Jersey’s presidential primary election, and the incomplete explanation offered by Sequoia, the voting machine vendor. I published copies of the “summary tapes” printed by nine voting machines in Union County that showed discrepancies; all of them were consistent with Sequoia’s explanation of what went wrong.

This week we obtained six new summary tapes, from machines in Bergen and Gloucester counties. Two of these new tapes contradict Sequoia’s explanation and show more serious discrepancies than we saw before.

Before we dig into the details, let’s review some background. At the end of Election Day, each Sequoia AVC Advantage voting machine prints a “summary tape” (or “results report”) that lists (among other things) the number of votes cast for each candidate on that machine, and the total voter turnout (number of votes cast) in each party. In the Super Tuesday primary, a few dozen machines in New Jersey showed discrepancies in which the number of votes recorded for candidates in one party exceeded the voter turnout in that party. For example, the vote totals section of a tape might show 61 total votes for Republican candidates, while the turnout section of the same tape shows only 60 Republican voters.

Sequoia’s explanation was that in certain circumstances, a voter would be allowed to vote in one party while being recorded in the other party’s turnout. (“It has been observed that the ‘Option Switch’ or Party Turnout Totals section of the Results Report may be misreported whereby turnout associated with the party or option switch choice is misallocated. In every instance, however, the total turnout, or the sum of the turnout allocation, is accurate.”) Sequoia’s memo points to a technical flaw that might cause this kind of misallocation.

The nine summary tapes I had previously were all consistent with Sequoia’s explanation. Though the total votes exceeded the turnout in one party, the votes were less than the turnout in the other party, so that the discrepancy could have been caused by misallocating turnout as Sequoia described. For example, a tape from Hillside showed 61 Republican votes cast by 60 voters, and 361 Democratic votes cast by 362 voters, for a total of 422 votes cast by 422 voters. Based on these nine tapes, Sequoia’s explanation, though incomplete, could have been correct.

But look at one of the new tapes, from Englewood Cliffs, District 4, in Bergen County. Here’s a relevant part of the tape:

The Republican vote totals are Giuliani 1, Paul 1, Romney 6, McCain 14, for a total of 22. The Democratic totals are Obama 33, Edwards 2, Clinton 49, for a total of 84. That comes to 106 total votes across the two parties.

The turnout section (or “Option Switch Totals”) shows 22 Republican voters and 83 Democratic voters, for a total of 105.

This is not only wrong – 106 votes cast by 105 voters – but it’s also inconsistent with Sequoia’s explanation. Sequoia says that all of the voters show up in the turnout section, but a few might show up in the wrong party’s turnout. (“In every instance, however, the total turnout, or the sum of the turnout allocation, is accurate.”) That’s not what we see here, so Sequoia’s explanation must be incorrect.
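The distinction between the two kinds of discrepancy can be written down as a small check. This is a sketch in Python, with a hypothetical `classify_tape` function and labels of my own choosing; it uses only the vote and turnout numbers quoted above for the Hillside and Englewood Cliffs tapes.

```python
def classify_tape(votes, turnout):
    """Compare per-party vote totals against per-party turnout.

    votes and turnout are dicts mapping a party label to a count.
    """
    total_votes = sum(votes.values())
    total_turnout = sum(turnout.values())
    party_mismatch = any(votes[party] != turnout[party] for party in votes)

    if not party_mismatch:
        return "consistent"
    if total_votes == total_turnout:
        # Totals agree, so turnout may merely be filed under the wrong
        # party, which is what Sequoia's explanation describes.
        return "misallocation possible"
    # More votes than voters overall: misallocation cannot explain this.
    return "total mismatch"


hillside = classify_tape(
    votes={"REP": 61, "DEM": 361},
    turnout={"REP": 60, "DEM": 362},
)
englewood_cliffs = classify_tape(
    votes={"REP": 22, "DEM": 84},
    turnout={"REP": 22, "DEM": 83},
)

print(hillside)          # misallocation possible
print(englewood_cliffs)  # total mismatch
```

The Hillside tape falls in the category Sequoia's memo can account for. The Englewood Cliffs tape, as read here, does not.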

And that’s not all. Each machine has a “public counter” that keeps track of how many votes were cast on the machine in the current election. The public counter, which is found on virtually all voting machines, is one of the important safeguards ensuring that votes are not cast improperly. Here’s the top of the same tape, showing the public counter as 105.

The public counter is important enough that the poll workers actually sign a statement at the bottom of the tape, attesting to the value of the public counter. Here’s the signed statement from the same tape:

The public counter says 105, even though 106 votes were reported. That’s a big problem.

Another of the new tapes, this one from West Deptford in Gloucester County, shows a similar discrepancy, with 167 total votes, a total turnout of 166, and public counter showing 166.

How many more New Jersey tapes show errors? What’s wrong with Sequoia’s explanation? What really happened? We don’t know the answers to any of these questions.

Isn’t it time for a truly independent investigation?

UPDATE (April 11): The New Jersey Secretary of State, along with the two affected counties, is now saying that I am misreading the two tapes discussed here. In particular, they are now saying that the tape image included above shows 48 votes for Hillary Clinton, not 49. They’re also saying now that the West Deptford tape shows two votes for Ron Paul, not three.

It’s worth noting that the counties originally read the tapes as I did. When I sent an open records request for tapes showing discrepancies, they sent these tapes – which they would not have done had they read the tapes as they now do. Also, the Englewood Cliffs tape pictured above shows hand-written numbers that must have been written by a county official (they were on the tapes before they were copied and sent to us), showing 84 votes for Democratic candidates, consistent with the county’s original reading of the tape (but not its new reading).

In short, the Secretary of State talked to the counties, and then the counties changed their minds about how to read the tapes.

So: were the counties right before, or are they right now? Decide for yourself – here are the tapes: Englewood Cliffs, West Deptford.

UPDATE (April 14): Regardless of what these two tapes show, plenty of other tapes from the Feb. 5 primary show discrepancies that the state and counties are not disputing. These other discrepancies are consistent with Sequoia’s explanation (though that explanation is incomplete and more investigation is needed to tell whether it is correct). Thus far we have images of at least thirty such tapes.

Bad Phorm on Privacy

Phorm, an online advertising company, has recently made deals with several British ISPs to gain unprecedented access to every single Web action taken by their customers. The deals will let Phorm track search terms, URLs and other keywords to create online behavior profiles of individual customers, which will then be used to provide better targeted ads. The company claims that “No private or personal information, or anything that can identify you, is ever stored – and that means your privacy is never at risk.” Although Phorm might have honest intentions, their privacy claims are, at best, misleading to customers.

Their privacy promise is that personally-identifiable information is never stored, but they make no promises on how the raw logs of search terms and URLs are used before they are deleted. It’s clear from Phorm’s online literature that they use this sensitive data for ad delivery purposes. In one example, they claim advertisers will be able to target ads directly to users who see the keywords “Paris vacation” either as a search or within the text of a visited webpage. Without even getting to the storage question, users will likely perceive Phorm’s access and use of their behavioral data as a compromise of their personal privacy.

What Phorm does store permanently are two pieces of information about each user: (1) the “advertising categories” that the user is interested in and (2) a randomly-generated ID from the user’s browser cookie. Each raw online action is sorted into one or more categories, such as “travel” or “luxury cars”, that are defined by advertisers. The privacy worry is that as these categories become more specific, the behavioral profile of each user becomes ever more precise. Phorm seems to impose no limit on the specificity of these defined categories, so for all intents and purposes, these categories over time will become nearly identical to the search terms themselves. Indeed, they market their “finely tuned” service as analogous to typical keyword search campaigns that advertisers are already used to. Phorm has a strong incentive to store arbitrarily specific interest categories about each user to provide optimally targeted ads, and thus boost the profits of their advertising business.

The second protection mechanism is a randomly-generated ID number stored in a browser cookie that Phorm uses to “anonymously” track a user as she browses the web. This ID number is stored with the list of the interest categories collected for that user. Phorm should be given credit for recognizing this as more privacy-protecting than simply using the customer’s name or IP address as an identifier (something even Google has disappointingly failed to recognize). But from past experience, these protections are unlikely to be enough. The storage of random user IDs mapped to keywords mirroring actual search queries is highly reminiscent of the AOL data fiasco from 2006, where AOL released “anonymized” search histories containing 20 million keywords. It turned out to be easy to identify the names of specific individuals based solely on their search history.

At the least, the company’s employees will be able to access an AOL-like dataset about the ISP’s customers. Granted, determining whether a particular dataset is personally identifiable is a notoriously difficult problem and subject to further research. But it’s inaccurate for Phorm to claim that personally-identifiable information is not being stored and to promise users that their privacy is not at risk.

Music Industry Under Fire for Exploring EFF Suggestion

Jim Griffin, a music industry consultant who is in the unusual position of being recognized as smart and reasonable by participants across a broad swath of positions in the copyright debate, revealed last week that he’s working to start a new music industry organization that will urge ISPs to bundle a music licensing fee into their monthly service costs, in exchange for which the major labels will agree not to sue (and, presumably, not to threaten suit against) the ISP’s customers for copyright infringement of the music whose rights they own. The goal, Griffin says, is to “monetize the anarchy of the Internet.”

This idea has a long history and has at various times been propounded by some on the “copyleft.” The Electronic Frontier Foundation, for example, issued in April 2004 a report entitled “A Better Way Forward: Voluntary Collective Licensing of Music File Sharing”. This report even suggested the $5 per user per month ($60 per user per year) that Griffin apparently has in mind.

According to the OECD, there were roughly 60 million broadband subscriptions in the United States as of the end of 2006. If each of these were to pay $60 a year, the total would be $3.6 billion a year. I know that broadband uptake is increasing, but I remain unsure how Griffin figures that the proposed system “could create a pool as large as $20 billion a year.” Perhaps this imagines global, rather than national, uptake of the plan? If so, it seems to embody some optimistic assumptions about how widely any such agreement could plausibly be extended.
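For anyone who wants to replay the arithmetic, here is a back-of-the-envelope version in Python. It uses only the figures quoted above (60 million subscriptions, $5 a month, a $20 billion pool) and assumes, purely for illustration, that every subscription pays the full fee; the variable names are mine.

```python
import math

US_BROADBAND_SUBSCRIPTIONS = 60_000_000  # OECD figure, end of 2006
FEE_PER_MONTH = 5                        # dollars per user, per the EFF report
CLAIMED_POOL = 20_000_000_000            # "as large as $20 billion a year"

fee_per_year = FEE_PER_MONTH * 12
us_pool = US_BROADBAND_SUBSCRIPTIONS * fee_per_year

# How many fee-paying subscriptions would the claimed pool require?
subscriptions_needed = math.ceil(CLAIMED_POOL / fee_per_year)

print(f"US pool: ${us_pool:,} per year")
print(f"Subscriptions needed for $20 billion: {subscriptions_needed:,}")
```

Under those assumptions the $20 billion figure would take roughly 333 million paying subscriptions, about five and a half times the US base at the end of 2006.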

Some prominent blogs have reacted with ire—Michael Arrington at TechCrunch, for example, characterizes the move as an “extortion scheme.” Arrington argues that a licensing system will hinder innovation because the revenues from it will be constant irrespective of the amount or quality of music published by the labels, and will flow to an infrastructure that, once it begins to be subsidized, will have little structural incentive to innovate. He also argues in a later post that since the core of the system is a covenant not to sue, it represents a “protection racket.”

I think this kind of skepticism is poorly justified at this point. If the labels can turn their statutory right to sue for damages after copyright infringement into a voluntary system where they get paid and nobody gets sued, it strikes me as a case of the system working. And the numbers matter: The idea of a $20 billion payoff that would triple the industry’s current $10 billion in annual revenue does not seem reasonable, but unless I am missing something it also does not seem probable.

There are two core questions for the plan. First, what will it cover? The idea is that it will let the industry stop suing, and thereby end the antagonism between labels and customers. But unless a critical mass of the labels agree to the plan, users whose ISPs are paying in will still face the risk of suit from non-participating copyright holders. In fact, if the plan takes off, individual rights holders may face an incentive to defect, since consumers are equally likely to infringe all popular music regardless of which music happens to be covered by the plan (since they aren’t likely to track which music is covered).

Second, how will the revenue be shared? Filesharing metrics, provided by analysts like BigChampagne, are at best approximate, and they only track downloads that occur via the public, unencrypted Internet. That is presumably a large share of the relevant copying, but not all of it, especially in the context of university and other networks. The squabbles will be challenging, and if past is prologue, then the labels may not prove themselves an amicable bunch in negotiating with each other.

Finally, it’s important to remember that the labels’ power depends, in the very long run, on their ability to sign the best new talent. If the licensing system proposed by Griffin takes off, it may preserve the status quo for now. But if the industry continues to give artists themselves a raw deal, as it is so often accused of doing, artists will still have the growing power that digital technology gives them to share their music without a label’s help.

An Inconvenient Truth About Privacy

One of the lessons we’ve learned from Al Gore is that it’s possible to have too much of a good thing. We all like to tool around in our SUVs, but too much driving leads to global warming. We must all take responsibility for our own carbon emissions.

The same goes for online privacy, except that there the problem is storage rather than carbon emissions. We all want more and bigger hard drives, but what is going to be stored on those drives? Information, probably relating to other people. The equation is simple: more storage equals more privacy invasion.

That’s why I have pledged to maintain a storage-neutral lifestyle. From now on, whenever I buy a new hard drive, I’ll either delete the same amount of old information, or I’ll purchase a storage offset from someone else who has extra data to delete. By bidding up the cost of storage offsets, I’ll help create a market for storage conservation, without the inconvenience of changing my storage-intensive lifestyle.

Government can do its part, too. If the U.S. government adopted a storage-neutral policy, then for every email the NSA recorded, the government would have to delete another email elsewhere – say, at the White House. It’s truly a win-win outcome. And storage conservation technology can help drive the green economy of the twenty-first century.

For private industry, a cap-and-trade system is the best policy. Companies will receive data storage permits, which can be bought and sold freely. When JuicyCampus conserves storage by eliminating its access logs, it can sell the unused storage capacity to ChoicePoint, perhaps for storing information about the same JuicyCampus posters. The free market will allocate the limited storage capacity efficiently, as those who profit by storing less can sell permits to those who profit by storing more.

Debating these policy niceties is all well and good, but the important thing is for all of us to recognize the storage problem and make changes in our own lives. If you and I don’t reduce our storage footprint, who will?

Please join me today in adopting a storage-neutral lifestyle. You can start by not leaving comments on this post.