October 31, 2024

Abandoning the Envelope Analogy (What Your Mailman Knows Part 2)

Last time, I commented on NPR’s story about a mail carrier named Andrea in Seattle who can tell us something about the economic downturn by revealing private facts about the people she serves on her mail route. By critiquing the decision to run the story, I drew a few lessons about the way people value and weigh privacy. In Part 2 of this series, I want to tie this to NebuAd and Phorm.

It’s probably a sign of the deep level of monomania to which I’ve descended that as I listened to the story, I immediately started drawing connections between Andrea and NebuAd/Phorm. Technology policy almost always boils down to a battle over analogies, and many in the ISP surveillance/deep packet inspection debate embrace the so-called envelope analogy. (See, e.g., the comments of David Reed to Congress about DPI, and see the FCC’s Comcast/BitTorrent order.) Just as mail carriers are prohibited from opening closed envelopes, so a typical argument goes, so too should packet carriers be prohibited from looking “inside” the packets they deliver.

As I explain in my article, I’m not a fan of the envelope analogy. The NPR story gives me one more reason to dislike it: envelopes–the physical kind–don’t mark as clear a line of privacy as we may have thought. Although Andrea is restricted by law from peeking inside envelopes, every day her mail route is awash in “metadata” that reveal much more than the mere words scribbled on the envelopes themselves. By analyzing all of this metadata, Andrea has many ways of inferring what is inside the envelopes she delivers, and she feels pretty confident about her guesses.

There are metadata gleaned from the envelopes themselves: certified letters usually mean bad economic news; utility bills turn from white to yellow to red as a person slides toward insolvency. She also engages in traffic analysis–fewer credit card offers might herald the credit crunch. She picks up cues from the surroundings, too: more names on a mailbox might mean that a young man who can no longer make rent has moved in with grandma. Perhaps most importantly, she interacts with the human recipients of these envelopes: in the story she describes a cafe owner who jokes about needing credit card offers to pay the bill, and people who watch her approach with “a real desperation in their eyes; when they see me their face falls; what am I going to bring today?”

So let’s stop using the envelope analogy, because the comparison doesn’t fit as well as it seems. But I have a deeper objection to the use of the envelope analogy in the DPI/ISP surveillance debate: it states a problem rather than proposing a solution, and it assumes away all of the hard questions. Saying that there is an “inside” and an “outside” to a packet is the same thing as saying that we need to draw a line between permissible and impermissible scrutiny, but it offers no guidance about how or where to draw that line. The promise of the envelope analogy is that it is clear and easy to apply, but the solutions proposed to implement the analogy are rarely so clear.

What Your Mailman Knows (Part 1 of 2)

A few days ago, National Public Radio (NPR) tried to offer some lighter fare to break up the death march of gloomier stories about economic calamity. You can listen to the story online. The story’s reporter, Chana Joffe-Walt, followed a mail carrier named Andrea on her route around the streets of Seattle. The premise of the story is that Andrea can measure economic suffering along her mail route–and therefore in that mythical place, “Main Street”–by keeping tabs on the type of mail she delivers. I have two technology policy thoughts about this story, but because I have a lot to say, I will break this into two posts. In this post, I will share some general thoughts about privacy, and in the next post, I will tie this story to NebuAd and Phorm.

I was troubled by Andrea’s and Joffe-Walt’s cavalier approaches to privacy. In the course of the five-minute story, Andrea reveals a lot of private, personal information about the people on her route. Only once does Joffe-Walt even hint at the creepiness of peering into people’s private lives in this way, and even then she embraces a form of McNealy’s “you have no privacy, get over it” declaration. In the first line of the story, Joffe-Walt says, “Okay before we can do this, I need to clear up one question: Yes, your mailman reads your postcards; she notices what magazines you get, which catalogs; she knows everything about you.” The last line of the story is simply, “The government is just starting on its $700 billion plan. As it moves forward, Wall Street economists will be watching Wall Street; Fed economists will be watching Wall Street; Andrea will be watching the mail.”

There are many privacy lessons I can draw from this: First, did the Postal Service approve Andrea’s participation in the interview? If it did, did it weigh the privacy impact? If not, why not?

More broadly speaking, I bet that all of the people who produced or authorized this story, from Andrea and Joffe-Walt to the Postal Service and NPR, engaged in a cost-benefit balancing, if they thought about privacy at all, and they evidently made the same types of mistakes on both sides of that balance that people often make when they think about privacy.

First, what are the costs to privacy from this story? At first blush, they seem to be slight to non-existent because the reporter anonymized the data. Although most of the activity in the story appears to center on one city block in Seattle, we aren’t told which city block. This is a lot like AOL arguing that it had anonymized its search queries by replacing IP addresses with unique identifiers or like Phorm arguing that it protects privacy by forgetting that you visited Orbitz.com and remembering instead only that you visited a travel-related website.

The NPR story exposes the flaw in this type of argument. Although a casual listener won’t be able to place the street toured by Andrea, it probably wouldn’t be very hard to pierce this cloak of privacy. In the story, we are told that the street is “three-quarters of a mile [north] of” Main Street. The particular block is “a wide residential block where section 8 housing butts against glassy, snazzy new chic condos that cost half-a-million dollars.” Across the block are a couple of businesses, including a cafe “across the way.” Does this describe more than a few possible locations in Seattle? [Insert joke about the number of cafes in Seattle here.]

It’s probably even easier for someone who lives in Seattle to pinpoint the location, particularly if it is near where they live or work. Thanks to NPR, these people now know that in the Section 8 building lives “a single mom with an affinity for black leather” who “is getting an overdraft notice,” along with a “minister . . . getting more late payment bills.” The owner of the cafe has been outed as somebody who pays his bills only by applying for new credit cards. If you lived or worked on this particular block, wouldn’t you have at least a hunch about the identities of the people tied to these potentially embarrassing facts?

Laboring under the mistaken belief that anonymization negated any costs to privacy, the creators of the story probably thought the costs were outweighed by the potential benefits. But these benefits seem to pale in comparison to the privacy risks, accurately understood. What does the listener gain by listening to this story? A small bit of anecdotal knowledge about the economic crisis? A reason to fear his mailman? The small thrill of voyeurism? A chance to think about the economic crisis while not seized by fear and dread? I’m not saying that these benefits are valueless, but I don’t think they were justified when held against the costs.

It can be rational to sell your private information cheaply, even if you value privacy

One of the standard claims about privacy is that people say they value their privacy but behave as if they don’t value it. The standard example involves people trading away private information for something of relatively little value. This argument is often put forth to rebut the notion that privacy is an important policy value. Alternatively, it is posed as a “what could they be thinking” puzzle.

I used to be impressed by this argument, but lately I have come to doubt its power. Let me explain why.

Suppose you offer to buy a piece of information about me, such as my location at this moment. I’ll accept the offer if the payment you offer me is more than the harm I would experience due to disclosing the information. What matters here is the marginal harm, defined as the amount of privacy-goodness I would have if I withheld the information, minus the amount I would have if I disclosed it.

The key word here is marginal. If I assume that my life would be utterly private unless I gave this one piece of information to you, then I might require a high price from you. But if I assume that I have very little privacy to start with, then selling this one piece of information to you makes little difference, and I might as well sell it cheaply. Indeed, the more I assume that my privacy is lost no matter what I do, the lower a price I’ll demand from you. In the limit, where I expect you can get the information for free elsewhere even if I withhold it from you, I’ll be willing to sell you the information for a penny.
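To make the arithmetic concrete, here is a tiny sketch of the idea in Python. The linear form and the numbers are my own illustrative assumptions, not a claim about how anyone actually prices their privacy.

```python
# A toy model of the marginal-harm argument above. The linear form and all
# numbers are illustrative assumptions, not a claim about real behavior.

def minimum_acceptable_price(value_of_secret, prob_known_anyway):
    """Lowest payment a rational seller would accept for one fact.

    value_of_secret: the harm (in dollars) if the fact becomes known and
        would otherwise have stayed private.
    prob_known_anyway: the seller's estimate that the buyer could learn
        the fact elsewhere even if the seller refuses to sell.
    """
    # Marginal harm = harm from disclosing minus harm expected anyway.
    return value_of_secret * (1 - prob_known_anyway)

# Someone who believes their location is well protected demands a lot...
print(minimum_acceptable_price(100.0, 0.05))   # roughly 95
# ...while someone who assumes their privacy is already gone sells cheap.
print(minimum_acceptable_price(100.0, 0.99))   # roughly 1
```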

Viewed this way, the price I charge you tells you at least as much about how well I think my privacy is protected as it does about how badly I want to keep my location private. So the answer to "what could they be thinking" is "they could be thinking they have no privacy in the first place".

And in case you’re wondering: At this moment, I’m sitting in my office at Princeton.

Phorm's Harms Extend Beyond Privacy

Last week, I wrote about the privacy concerns surrounding Phorm, an online advertising company that has teamed up with British ISPs to track user Web behavior from within their networks. New technical details about its Webwise system have since emerged, and it’s not just privacy that now seems to be at risk. The details expose a system that actively degrades the user experience and alters users’ interactions with content providers. Even more importantly, the Webwise system is a clear violation of the sacred end-to-end principle that guides the core architectural design of the Internet.

Phorm’s system does more than just passively gain “access to customers’ browsing records,” as previously suggested. Instead, Phorm plans to install a network switch at each participating ISP that actively interferes with the user’s browsing session by injecting multiple URL redirections before the user can retrieve the requested content. Sparing you most of the nitty-gritty technical details, the switch intercepts the initial HTTP request to the content server to check whether a Webwise cookie–containing the user’s randomly-assigned identifier (UID)–exists in the browser. It then impersonates the requested server to trick the browser into accepting a spoofed cookie (which I will explain later) that contains the same UID. Only then will the switch forward the request and return the actual content to the user. Basically, this amounts to a big technical hack by Phorm to set the cookies that track users as they browse the Web.

In all, a user’s initial request is redirected three times for each domain that is contacted. Though this may not seem like much, the extra layer of indirection harms the user by degrading the overall browsing experience, imposing an unnecessary delay that will likely be noticeable to users.
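For readers who want to see the shape of this dance, here is a minimal simulation in Python. The domain name, data structures, and exact ordering are my own simplifying assumptions based on the description above, not Phorm’s actual implementation.

```python
# A toy simulation of the redirect-and-cookie dance described above. Domain
# names, data structures, and step ordering are illustrative assumptions.

import uuid

def switch_handle(request, browser_cookies):
    """Toy model of the in-network switch processing one HTTP request.

    request: dict with a "host" key naming the site the user asked for.
    browser_cookies: dict mapping domain -> {cookie name: value}.
    """
    phorm_domain = "webwise.example"  # placeholder for a Phorm-controlled domain

    # Step 1: make sure the browser carries a Phorm UID at all.
    uid = browser_cookies.get(phorm_domain, {}).get("uid")
    if uid is None:
        uid = str(uuid.uuid4())
        browser_cookies.setdefault(phorm_domain, {})["uid"] = uid
        return ("redirect", request["host"])  # browser must retry the original URL

    # Step 2: impersonate the requested site and plant a spoofed cookie
    # carrying the same UID under that site's own domain.
    site_cookies = browser_cookies.setdefault(request["host"], {})
    if "webwise_uid" not in site_cookies:
        site_cookies["webwise_uid"] = uid
        return ("redirect", request["host"])  # another round trip for the browser

    # Step 3: only now forward the request and return the real content.
    return ("forward", request["host"])

# A first page load against a fresh browser takes several hops before content:
cookies = {}
for _ in range(3):
    print(switch_handle({"host": "news.example.com"}, cookies))
# ('redirect', 'news.example.com')
# ('redirect', 'news.example.com')
# ('forward', 'news.example.com')
```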

The spoofed cookie that Phorm stores on the user’s browser during this process is also a highly questionable practice. Generally speaking, a cookie is specific to a particular domain and the browser typically ensures that a cookie can only be read and written by the domain it belongs to. For example, data in a yahoo.com cookie is only sent when you contact a yahoo.com server, and only a yahoo.com server can put data into that cookie.

But since Phorm controls the switch at the ISP, it can bypass this usual guarantee by impersonating the server to add cookies for other domains. To continue the example, the switch (1) intercepts the user’s request, (2) pretends to be a yahoo.com server, and (3) injects a new yahoo.com cookie that contains the Phorm UID. The browser, believing the cookie to actually be from yahoo.com, happily accepts and stores it. This cookie is used later by Phorm to identify the user whenever the user visits any page on yahoo.com.
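A small sketch may help show why the browser goes along with this. The cookie-jar model below is my own simplification, and the cookie names and values are invented; the point is only that the browser files a cookie under whichever domain appears to have sent it, with no way to detect an on-path forger.

```python
# A simplified model of a browser cookie jar. Cookie names and values are
# invented for illustration.

cookie_jar = {}  # domain -> {cookie name: value}

def receive_set_cookie(apparent_domain, name, value):
    """The browser files a cookie under the domain of the response it
    received, with no way to tell whether an on-path box forged it."""
    cookie_jar.setdefault(apparent_domain, {})[name] = value

def cookies_for_request(domain):
    """Only cookies filed under the requested domain are sent back to it."""
    return cookie_jar.get(domain, {})

# A genuine yahoo.com response sets a session cookie...
receive_set_cookie("yahoo.com", "session", "abc123")
# ...and the in-network switch, impersonating yahoo.com, adds its UID cookie.
receive_set_cookie("yahoo.com", "webwise_uid", "uid-42")

# On the next request to yahoo.com, both cookies travel together; neither the
# browser nor (absent stripping) the site can tell which one was forged.
print(cookies_for_request("yahoo.com"))
# {'session': 'abc123', 'webwise_uid': 'uid-42'}
```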

Cookie spoofing is problematic because it can change the interaction between the user and the content-providing site. Suppose a site’s privacy policy promises the user that it does not use tracking cookies. But because of Phorm’s spoofing, the browser will store a cookie that (to the user) looks exactly like a tracking cookie from the site. Now, the switch typically strips out this tracking cookie before it reaches the site, but if the user moves to a non-Phorm ISP (say at work), the cookie will actually reach the site in violation of its stated privacy policy. The cookie can also cause other problems, such as a cookie collision if the site cookie inadvertently has the same name as the Phorm cookie.

Disruptive activities inside the network often create these sorts of unexpected problems for both users and websites, which is why computer scientists are skeptical of ideas that violate the end-to-end principle. For the uninitiated, the principle, in short, states that system functionality should almost always be implemented at the end hosts of the network, with a few justifiable exceptions. For instance, almost all security functionality (such as data encryption and decryption) is done by end hosts and only rarely by machines inside the network.

The Webwise system has no business being inside the network: it has no role in transporting packets from one end of the network to the other. The technical Internet community has worried for years about the slow erosion of the end-to-end principle, particularly by ISPs looking to further monetize their networks. This principle is the one upon which the Internet is built, and it is one the ISPs must uphold. Phorm’s system, nearly in production, is a concrete realization of this erosion, and ISPs should keep Phorm outside the gate.

Bad Phorm on Privacy

Phorm, an online advertising company, has recently made deals with several British ISPs to gain unprecedented access to every single Web action taken by their customers. The deals will let Phorm track search terms, URLs and other keywords to create online behavior profiles of individual customers, which will then be used to provide better targeted ads. The company claims that “No private or personal information, or anything that can identify you, is ever stored – and that means your privacy is never at risk.” Although Phorm might have honest intentions, their privacy claims are, at best, misleading to customers.

Their privacy promise is that personally-identifiable information is never stored, but they make no promises about how the raw logs of search terms and URLs are used before they are deleted. It’s clear from Phorm’s online literature that they use this sensitive data for ad delivery purposes. In one example, they claim advertisers will be able to target ads directly to users who see the keywords “Paris vacation” either as a search or within the text of a visited webpage. Without even getting to the storage question, users will likely perceive Phorm’s access to and use of their behavioral data as a compromise of their personal privacy.

What Phorm does store permanently are two pieces of information about each user: (1) the “advertising categories” that the user is interested in and (2) a randomly-generated ID from the user’s browser cookie. Each raw online action is sorted into one or more categories, such as “travel” or “luxury cars”, that are defined by advertisers. The privacy worry is that as these categories become more specific, the behavioral profile of each user becomes ever more precise. Phorm seems to impose no limit on the specificity of these defined categories, so for all intents and purposes, the categories will over time become nearly identical to the search terms themselves. Indeed, they market their “finely tuned” service as analogous to the typical keyword search campaigns that advertisers are already used to. Phorm has a strong incentive to store arbitrarily specific interest categories about each user to provide optimally targeted ads, and thus boost the profits of their advertising business.

The second protection mechanism is a randomly-generated ID number stored in a browser cookie that Phorm uses to “anonymously” track a user as she browses the web. This ID number is stored with the list of the interest categories collected for that user. Phorm should be given credit for recognizing this as more privacy-protecting than simply using the customer’s name or IP address as an identifier (something even Google has disappointingly failed to recognize). But past experience suggests these protections are unlikely to be enough. The storage of random user IDs mapped to keywords mirroring actual search queries is highly reminiscent of the AOL data fiasco from 2006, when AOL released “anonymized” search histories containing 20 million keywords. It turned out to be easy to identify specific individuals by name based solely on their search histories.
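To make concrete what kind of record this is, here is a toy sketch. The UIDs, the schema, and the categories are all invented for illustration; the only point is how quickly specific categories start to resemble the raw queries themselves.

```python
# An invented example of the stored record described above: a random browser
# UID mapped to a growing list of "advertising categories." This is not
# Phorm's actual schema or data.

profiles = {
    "uid-1f9c": ["travel", "luxury cars"],        # broad categories
    "uid-8b21": ["Paris vacation",                # categories this specific
                 "flights from Seattle",          # begin to read like the
                 "estate tax attorneys"],         # search queries themselves
}

# The AOL lesson: once stored categories are this close to the raw queries,
# a random UID is a thin shield, because the queries themselves can often
# identify their owner.
for uid, categories in profiles.items():
    print(uid, categories)
```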

At the very least, the company’s employees will be able to access an AOL-like dataset about the ISP’s customers. Granted, determining whether a particular dataset is personally identifiable is a notoriously difficult problem and a subject of ongoing research. But it is inaccurate for Phorm to claim that personally-identifiable information is not being stored and to promise users that their privacy is not at risk.