April 24, 2014

avatar

Studying the Frequency of Redaction Failures in PACER

Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.

To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, that represent an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.

Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to the raster images that look “blocky” at high resolutions. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document, but not from a raster format like PNG.

But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.

So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents—1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text by strings of Xes, I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)

Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.

Implications

PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.

It’s also important to note that my software may not be detecting every instance of redaction failures. If a PDF was created by scanning in a paper document (as opposed to generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.

A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommend that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make it a standard part of the process for filing new documents with the courts. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.

My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.

Getting Redaction Right

So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach they recommend—completely deleting sensitive information in the original word processing document, replacing it with innocuous filler (such as strings of XXes) as needed, and then converting it to a PDF document, is the safest approach. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.

Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them so we can’t say which ones they might be.

Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.

One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.

Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.

This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.

avatar

"You Might Also Like:" Privacy Risks of Collaborative Filtering

Ann Kilzer, Arvind Narayanan, Ed Felten, Vitaly Shmatikov, and I have released a new research paper detailing the privacy risks posed by collaborative filtering recommender systems. To examine the risk, we use public data available from Hunch, LibraryThing, Last.fm, and Amazon in addition to evaluating a synthetic system using data from the Netflix Prize dataset. The results demonstrate that temporal changes in recommendations can reveal purchases or other transactions of individual users.

To help users find items of interest, sites routinely recommend items similar to a given item. For example, product pages on Amazon contain a “Customers Who Bought This Item Also Bought” list. These recommendations are typically public, and they are the product of patterns learned from all users of the system. If customers often purchase both item A and item B, a collaborative filtering system will judge them to be highly similar. Most sites generate ordered lists of similar items for any given item, but some also provide numeric similarity scores.

Although item similarity is only indirectly related to individual transactions, we determined that temporal changes in item similarity lists or scores can reveal details of those transactions. If you’re a Mozart fan and you listen to a Justin Bieber song, this choice increases the perceived similarity between Justin Bieber and Mozart. Because similarity lists and scores are based on perceived similarity, your action may result in changes to these scores or lists.

Suppose that an attacker knows some of your past purchases on a site: for example, past item reviews, social networking profiles, or real-world interactions are a rich source of information. New purchases will affect the perceived similarity between the new items and your past purchases, possibility causing visible changes to the recommendations provided for your previously purchased items. We demonstrate that an attacker can leverage these observable changes to infer your purchases. Among other things, these attacks are complicated by the fact that multiple users simultaneously interact with a system and updates are not immediate following a transaction.

To evaluate our attacks, we use data from Hunch, LibraryThing, Last.fm, and Amazon. Our goal is not to claim privacy flaws in these specific sites (in fact, we often use data voluntarily disclosed by their users to verify our inferences), but to demonstrate the general feasibility of inferring individual transactions from the outputs of collaborative filtering systems. Among their many differences, these sites vary dramatically in the information that they reveal. For example, Hunch reveals raw item-to-item correlation scores, but Amazon reveals only lists of similar items. In addition, we examine a simulated system created using the Netflix Prize dataset. Our paper outlines the experimental results.

While inference of a Justin Bieber interest may be innocuous, inferences could expose anything from dissatisfaction with a job to health issues. Our attacks assume that a victim reveals certain past transactions, but users may publicly reveal certain transactions while preferring to keep others private. Ultimately, users are best equipped to determine which transactions would be embarrassing or otherwise problematic. We demonstrate that the public outputs of recommender systems can reveal transactions without user knowledge or consent.

Unfortunately, existing privacy technologies appear inadequate here, failing to simultaneously guarantee acceptable recommendation quality and user privacy. Mitigation strategies are a rich area for future work, and we hope to work towards solutions with others in the community.

Worth noting is that this work suggests a risk posed by any feature that adapts in response to potentially sensitive user actions. Unless sites explicitly consider the data exposed, such features may inadvertently leak details of these underlying actions.

Our paper contains additional details. This work was presented earlier today at the 2011 IEEE Symposium on Security and Privacy. Arvind has also blogged about this work.

avatar

Web Tracking and User Privacy Workshop: Test Cases for Privacy on the Web

This guest post is from Nick Doty, of the W3C and UC Berkeley School of Information. As a companion post to my summary of the position papers submitted for last month’s W3C Do-Not-Track Workshop, hosted by CITP, Nick goes deeper into the substance and interaction during the workshop.

The level of interest and participation in last month’s Workshop on Web Tracking and User Privacy — about a hundred attendees spanning multiple countries, dozens of companies, a wide variety of backgrounds — confirms the broad interest in Do Not Track. The relatively straightforward technical approach with a catchy name has led to, in the US, proposed legislation at both the state and federal level and specific mention by the Federal Trade Commission (it was nice to have Ed Felten back from DC representing his new employer at the workshop), and comparatively rapid deployment of competing proposals by browser vendors. Still, one might be surprised that so many players are devoting such engineering resources to a relatively narrow goal: building technical means that allow users to avoid tracking across the Web for the purpose of compiling behavioral profiles for targeted advertising.

In fact, Do Not Track (in all its variations and competing proposals) is the latest test case for how new online technologies will address privacy issues. What mix of minimization techniques (where one might classify Microsoft’s Tracking Protection block lists) versus preference expression and use limitation (like a Do Not Track header) will best protect privacy and allow for innovation? Can parties agree on a machine-readable expression of privacy preferences (as has been heavily debated in P3P, GeoPriv and other standards work), and if so, how will terms be defined and compliance monitored and enforced? Many attendees were at the workshop not just to address this particular privacy problem — ubiquitous invisible tracking of Web requests to build behavioral profiles — but to grab a seat at the table where the future of how privacy is handled on the Web may be decided. The W3C, for its part, expects to start an Interest Group to monitor privacy on the Web and spin out specific work as new privacy issues inevitably arise, in addition to considering a Working Group to address this particular topic (more below). The Internet Engineering Task Force (IETF) is exploring a Privacy Directorate to provide guidance on privacy considerations across specs.

At a higher level, this debate presents a test case for the process of building consensus and developing standards around technologies like tracking protection or Do Not Track that have inspired controversy. What body (or rather, combination of bodies) can legitimately define preference expressions that must operate at multiple levels in the Web stack, not to mention serve the diverse needs of individuals and entities across the globe? Can the same organization that defines the technical design also negotiate semantic agreement between very diverse groups on the meaning of “tracking”? Is this an appropriate role for technical standards bodies to assume? To what extent can technical groups work with policymakers to build solutions that can be enforced by self-regulatory or governmental players?

Discussion at the recent workshop confirmed many of these complexities: though the agenda was organized to roughly separate user experience, technical granularity, enforcement and standardization, overlap was common and inevitable. Proposals for an “ack” or response header brought up questions of whether the opportunity to disclaim following the preference would prevent legal enforcement; whether not having such a response would leave users confused about when they had opted back in; and how granular such header responses should be. In defining first vs. third party tracking, user expectations, current Web business models and even the same-origin security policy could point the group in different directions.

We did see some moments of consensus. There was general agreement that while user interface issues were key to privacy, trying to standardize those elements was probably counterproductive but providing guidance could help significantly. Regarding the scope of “tracking”, the group was roughly evenly divided on what they would most prefer: a broad definition (any logging), a narrow definition (online behavioral advertising profiling only) or something in between (where tracking is more than OBA but excludes things like analytics or fraud protection, as in the proposal from the Center for Democracy and Technology). But in a “hum” to see which proposals workshop attendees opposed (“non-starters”) no one objected to starting with a CDT-style middle ground — a rather shocking level of agreement to end two days chock full of debate.

For tech policy nerds, then, this intimate workshop about a couple of narrow technical proposals was heady stuff. And the points of agreement suggest that real interoperable progress on tracking protection — the kind that will help the average end user’s privacy — is on the way. For the W3C, this will certainly be a topic of discussion at the ongoing meeting in Bilbao, and we’re beginning detailed conversations about the scope and milestones for a Working Group to undertake technical standards work.

Thanks again to Princeton/CITP for hosting the event, and to Thomas and Lorrie for organizing it: bringing together this diverse group of people on short notice was a real challenge, and it paid off for all of us. If you’d like to see more primary materials: minutes from the workshop (including presentations and discussions) are available, as are the position papers and slides. And the W3C will post a workshop report with a more detailed summary very soon.

avatar

Overstock's $1M Challenge

As reported in Fast Company, RichRelevance and Overstock.com teamed up to offer up to a $1,000,000 prize for improving “its recommendation engine by 10 percent or more.”

If You Liked Netflix, You Might Also Like Overstock
When I first read a summary of this contest, it appeared they were following in Netflix’s footsteps right down to releasing user data sans names. This did not end well for Netflix’s users or for Netflix. Narayanan and Shmatikov were able to re-identify Netflix users using the contest dataset, and their research contributed greatly to Ohm’s work on de-anonimization. After running the contest a second time, Netflix terminated it early in the face of FTC attention and a lawsuit that they settled out of court.

This time, Overstock is providing “synthetic data” to contest entrants, then testing submitted algorithms against unreleased real data. Tag line: “If you can’t bring the data to the code, bring the code to the data.” Hmm. An interesting idea, but short on a few details around the sharp edges that jump out as highest concern. I look forward to getting the time to play with the system and dataset. What is good news is seeing companies recognize privacy concerns and respond with something interesting and new. That is, at least, a move in the right direction.

Place your bets now on which happens first: a contest winner with a 10% boost to sales, or researchers finding ways to re-identify at least 10% of the data?

avatar

Debugging Legislation: PROTECT IP

There’s more than a hint of theatrics in the draft PROTECT IP bill (pdf, via dontcensortheinternet ) that has emerged as son-of-COICA, starting with the ungainly acronym of a name. Given its roots in the entertainment industry, that low drama comes as no surprise. Each section name is worse than the last: “Eliminating the Financial Incentive to Steal Intellectual Property Online” (Sec. 4) gives way to “Voluntary action for Taking Action Against Websites Stealing American Intellectual Property” (Sec. 5).

Techdirt gives a good overview of the bill, so I’ll just pick some details:

  • Infringing activities. In defining “infringing activities,” the draft explicitly includes circumvention devices (“offering goods or services in violation of section 1201 of title 17″), as well as copyright infringement and trademark counterfeiting. Yet that definition also brackets the possibility of “no [substantial/significant] use other than ….” Substantial could incorporate the “merely capable of substantial non-infringing use” test of Betamax.
  • Blocking non-domestic sites. Sec. 3 gives the Attorney General a right of action over “nondomestic domain names”, including the right to demand remedies from (A) domain name system server operators, (B) financial transaction providers, (C), Internet advertising services, and (D) “an interactive computer service (def. from 230(f)) shall take technically feasible and reasonable measures … to remove or disable access to the Internet site associated with the domain name set forth in the order, or a hypertext link to such Internet site.”
  • Private right of action. Sec. 3 and Sec. 4 appear to be near duplicates (I say appear, because unlike computer code, we don’t have a macro function to replace the plaintiff, so the whole text is repeated with no diff), replacing nondomestic domain with “domain” and permitting private plaintiffs — “a holder of an intellectual property right harmed by the activities of an Internet site dedicated to infringing activities occurring on that Internet site.” Oddly, the statute doesn’t say the simpler “one whose rights are infringed,” so the definition must be broader. Could a movie studio claim to be hurt by the infringement of others’ rights, or MPAA enforce on behalf of all its members? Sec. 4 is missing (d)(2)(D)
  • WHOIS. The “applicable publicly accessible database of registrations” gets a new role as source of notice for the domain registrant, “to the extent such addresses are reasonably available.” (c)(1)
  • Remedies. The bill specifies injunctive relief only, not money damages, but threat of an injunction can be backed by the unspecified threat of contempt for violating one.
  • Voluntary action. Finally the bill leaves room for “voluntary action” by financial transaction providers and advertising services, immunizing them from liability to anyone if they choose to stop providing service, notwithstanding any agreements to the contrary. This provision jeopardizes the security of online businesses, making them unable to contract for financial services against the possibility that someone will wrongly accuse them of infringement. 5(a) We’ve already seen that it takes little to convince service providers to kick users off, in the face of pressure short of full legal process (see everyone vs Wikileaks, Facebook booting activists, and numerous misfired DMCA takedowns); this provision insulates that insecurity further.

In short, rather than “protecting” intellectual and creative industry, this bill would make it less secure, giving the U.S. a competitive disadvantage in online business. (Sorry, Harlan, that we still can’t debug the US Code as true code.)

avatar

Don't love the cyber bomb, but don't ignore it either

Cybersecurity is overblown – or not

A recent report by Jerry Brito and Tate Watkins of George Mason University titled “Loving The Cyber Bomb? The Dangers Of Threat Inflation In Cybersecurity Policy” has gotten a bit of press. This is an important topic worthy of debate, but I believe their conclusions are incorrect. In this posting, I’ll summarize their report and explain why I think they’re wrong.

Brito & Watkins (henceforth B&W) argue that the cyber threat is exaggerated, and its being driven by private industry anxious to feed at the public trough in a manner similar to the creation of the military industrial complex in the second half of the 20th century as an outgrowth of the Cold War.

The paper starts by describing how deliberate misinformation in the run-up to the Iraq war is an example of how public opinion can be manipulated by policy makers and private industry trying to sell a threat. My opinion of the Iraq war is not relevant to this discussion, but I believe they’re using to create a strawman which they then knock down.

Next, B&W they use the CSIS Commission Report on Cybersecurity for the 44th Presidency and Richard Clarke’s “Cyber War” to argue that the threat of cyber conflict has been overblown. With regard to the former, they criticize the confusion of probes (port scans) with real attacks, and argue that probes are not evidence of an attack or breach but more akin to doorknob rattling. While that’s certainly true (and an analogy that’s been made for years), if your doorknob is rattled thousands of times a day it’s a strong indication that you’re living in a bad neighborhood! They then note that there’s little unclassified proof of real threats, and hence the call for regulation by CSIS (and others) is inappropriate. Unfortunately, quantitative proof is hard to come by, but there are enough incidents that there can be little doubt as to the severity of the threat. Requiring quantitative data before we move to protection would be akin to demanding an open and accurate assessment of the number of foreign spies and the damage they do before we fund the CIA! Instead, we rely on experts in spycraft to assess the threat, and help define appropriate defenses. In the same way, we should rely on cybersecurity experts to provide an assessment of the risks and appropriate actions. I certainly agree with both CSIS and B&W that overclassification of the threats works to our detriment – if the public is unable to see the threat, it becomes hard to justify spending to defend against it. I’ve personally seen this in the commercial software industry, where the inability to provide hard data about cyber threats to senior management results in that threat being discounted, with consequent risk to businesses. But again, the problems with overclassification do not mean the problem doesn’t exist.

Regarding Clarke’s book, there’s been plenty of criticism of both technical inaccuracies and the somewhat hysterical tone. Those notwithstanding, Clarke generally has a good understanding of the types of threats and the risks. B&W’s claim that the only verifiable attacks are DDOS is simply untrue – there have been verified attacks against infrastructure like water systems, although some of the claimed attacks are other types of failures that could have been cyber-related, but aren’t. As an example, while Clarke claims that the northeast power blackout of 2003 was cyber-related, there’s adequate evidence that it was not – but there’s also adequate evidence that such an accidental failure could be caused by a deliberate attack. Similarly, the NYSE “flash crash” was not caused by a cyber attack, but demonstrates the fragility of modern highly computerized systems, and shows that a cyber attack could cause similar symptoms. That which can happen by accident can also happen intentionally, if an adversary desires.

As for B&W’s analogy to the military industrial complex that President Eisenhower so famously feared, and the increasing influence of cyberpork, I must reluctantly agree. Large defense contractors have, in recent years, flocked to cyber as it has become trendy and large budgets have become attractive, frequently more concerned with revenue than with solving problems. However, the problems existed (and were being discussed) by researchers and practitioners long before the influx of government contractors. The fact that they’re trying to make money off the problem doesn’t mean the problem doesn’t exist.

The final section of the paper, covering regulatory issues, has some good points, but it is so poisoned by the assumptions in the earlier sections of the paper that it’s hard to take seriously.

To summarize, we should distinguish between the existence of the problem (which is real and growing) versus the desire of some government contractors to cash in – the fact that the latter is occurring does not deny the reality of the former.

avatar

In DHS Takedown Frenzy, Mozilla Refuses to Delete MafiaaFire Add-On

Not satisfied with seizing domain names, the Department of Homeland Security asked Mozilla to take down the MafiaaFire add-on for Firefox. Mozilla, through its legal counsel Harvey Anderson, refused. Mozilla deserves thanks and credit for a principled stand for its users’ rights.

MafiaaFire is a quick plugin, as its author describes, providing redirection service for a list of domains: “We plan to maintain a list of URLs, and their duplicate sites (for example Demoniod.com and Demoniod.de) and painlessly redirect you to the correct site.” The service provides redundancy, so that domain resolution — especially at a registry in the United States — isn’t a single point of failure between a website and its would-be visitors. After several rounds of ICE seizure of domain names on allegations of copyright infringement — many of which have been questioned as to both procedural validity and effectiveness — redundancy is a sensible precaution for site-owners who are well within the law as well as those pushing its limits.

DHS seemed poised to repeat those procedural errors here. As Mozilla’s Anderson blogged: “Our approach is to comply with valid court orders, warrants, and legal mandates, but in this case there was no such court order.” DHS simply “requested” the takedown with no such procedural back-up. Instead of pulling the add-on, Anderson responded with a set of questions, including:

  1. Have any courts determined that MAFIAAfire.com is unlawful or illegal inany way? If so, on what basis? (Please provide any relevant rulings)

  2. Have any courts determined that the seized domains related to MAFIAAfire.com are unlawful, illegal or liable for infringement in any way? (please provide relevant rulings)
  3. Is Mozilla legally obligated to disable the add-on or is this request based on other reasons? If other reasons, can you please specify.

Unless and until the government can explain its authority for takedown of code, Mozilla is right to resist DHS demands. Mozilla’s hosting of add-ons, and the Firefox browser itself, facilitate speech. They, like they domain name system registries ICE targeted earlier, are sometimes intermediaries necessary to users’ communication. While these private actors do not have First Amendment obligations toward us, their users, we rely on them to assert our rights (and we suffer when some, like Facebook are less vigilant guardians of speech).

As Congress continues to discuss the ill-considered COICA, it should take note of the problems domain takedowns are already causing. Kudos to Mozilla for bringing these latest errors to public attention — and, as Tom Lowenthal suggests in the do-not-track context, standing up for its users.

cross-posted at Legal Tags

avatar

Summary of W3C DNT Workshop Submissions

Last week, we hosted the W3C “Web Tracking and User Privacy” Workshop here at CITP (sponsored by Adobe, Yahoo!, Google, Mozilla and Microsoft). If you were not able to join us for this event, I hope to summarize some of the discussion embodied in the roughly 60 position papers submitted.

The workshop attracted a wide range of participants; the agenda included advocates, academics, government, start-ups and established industry players from various sectors. Despite the broad name of the workshop, the discussion centered around “Do Not Track” (DNT) technologies and policy, essentially ways of ensuring that people have control, to some degree, over web profiling and tracking.

Unfortunately, I’m going to have to expect that you are familiar with the various proposals before going much further, as the workshop position papers are necessarily short and assume familiarity. (If you are new to this area, the CDT’s Alissa Cooper has a brief blog post from this past March, “Digging in on ‘Do Not Track’”, that mentions many of the discussion points. Technically, much of the discussion involved the mechanisms of the Mayer, Narayanan and Stamm IETF Internet-Draft from March and the Microsoft W3C member submission from February.)

Read on for more…

Technical Implementation: First, some quick background and updates: A number of papers point out how analogizing to a Do-Not-Call-like registry–I suppose where netizens would sign-up not to be tracked–would not work in the online tracking sense, so we should be careful not to shape the technology and policy too closely to Do-Not-Call. Having recognized that, the current technical proposals center around the Microsoft W3C submission and the Mayer et al. IETF submission, including some mix of a DNT HTTP header, a DNT DOM flag, and Tracking Protection Lists (TPLs). While the IETF submission focuses exclusively on the DNT HTTP Header, the W3C submission includes all three of these technologies. Browsers are moving pretty quickly here: Mozilla’s FireFox v4.0 browser includes the DNT header, Microsoft’s IE9 includes all three of these capabilities, Google’s Chrome browser now allows extensions to send the DNT Header through the WebRequest API and Apple has announced that the next version of its Safari browser will support the DNT header.

Some of the papers critique certain aspects of the three implementation options while some suggest other mechanisms entirely. CITP’s Harlan Yu includes an interesting discussion of the problems with DOM flag granularity and access control problems when third-party code included in a first-party site runs as if it were first-part code. Toubiana and Nissenbaum talk about a number of problems with the persistence of DNT exceptions (where a user opts back in) when a resource changes content or ownership and then go on to suggest profile-based opting-back-in based on a “topic” or grouping of websites. Avaya’s submission has a fascinating discussion of the problems with implementation of DNT within enterprise environments, where tracking-like mechanisms are used to make sure people are doing their jobs across disparate enterprise web-services; Avaya proposes a clever solution where the browser first checks to see if it can reach a resource only available internally to the enterprise (virtual) network, in which case it ignores DNT preferences for enterprise software tracking mechanisms. A slew of submissions from Aquin et al., Azigo and PDECC favor a culture of “self-tracking”, allowing and teaching people to know more about the digital traces they leave and giving them (or their agents) control over the use and release of their personal information. CASRO-ESOMAR and Apple have interesting discussions of gaming TPLs: CASRO-ESOMAR points out that a competitor could require a user to accept a TPL that blocks traffic from their competitors and Apple talks about spam-like DNS cycling as an example of an “arms race” response against TPLs.

Definitions: Many of the papers addressed definitions definitions definitions… mostly about what “tracking” means and what terms like “third-party” should mean. Many industry submissions such as Paypal, Adobe, SIIA, and Google urge caution so that good types of “tracking”, such as analytics and forensics, are not swept under the rug and further argue that clear definitions of the terms involved in DNT is crucial to avoid disrupting user expectations, innovation and the online ecosystem. Paypal points out, as have others, that domain names are not good indicators of third-party (e.g., metrics.apple.com is the Adobe Omniture service for apple.com and fb.com is equivalent to facebook.com). Ashkan Soltani’s submission distinguishes definitions for DNT that are a “do not use” conception vs. a “do not collect” conception and argues for a solution that “does not identify”, requiring the removal of any unique identifiers associated with the data. Soltani points out how this has interesting measurement/enforcement properties as if a user sees a unique ID in the DNI case, the site is doing it wrong.

Enforcement: Some raised the issue of enforcement; Mozilla, for example, wants to make sure that there are reasonable enforcement mechanisms to deal with entities that ignore DNT mechanisms. On the other side, so to speak, are those calling for self-regulation such as Comcast and SIIA vs. those advocating for explicit regulation. The opinion polling research groups, CASRO-ESOMAR, call explicitly for regulation no matter what DNT mechanism is ultimately adopted, such that DNT headers requests are clearly enforced or that TPLs are regulated tightly so as to not over-block legitimate research activities. Abine wants a cooperative market mechanism that results in a “healthy market system that is responsive to consumer outcome metrics” and that incentivizes advertising companies to work with privacy solution providers to increase consumer awareness and transparency around online tracking. Many of the industry players worried about definitions are also worried about over-prescription from a regulatory perspective; e.g., Datran Media is concerned about over-prescription via regulation that might stifle innovation in new business models. Hoofnagle et al. are evaluating the effectiveness of self-regulation, and find that the self-regulation programs currently in existence are greatly stilted in favor of industry and do not adequately embody consumer conceptions of privacy and tracking.

Research: There were a number of submissions addressing research that is ongoing and/or further needed to gauge various aspects of the DNT puzzle. The submissions from McDonald and Wang et al. describe user studies focusing, respectively, on what consumers expect from DNT–spoiler: they expect no collection of their data–and gauging the usability and effectiveness of current opt-out tools. Both of these lines of work argue for usable mechanisms that communicate how developers implement/envision DNT and how users can best express their preferences via these tools. NIST’s submission argues for empirical studies to set objective and usable standards for tracking protection and describes a current study of single sign-on (SSO) implementations. Thaw et al. discuss a proposal for incentivizing developers to communicate and design the various levels of rich data they need to perform certain kinds of ad targeting, and then uses a multi-arm bandit model to illustrate game-theoretic ad targeting that can be tweaked based on how much data they are allowed to collect. Finally, CASRO-ESOMAR makes a plea for exempting legitimate research purposes from DNT, so that opinion polling and academic research can avoid bias.

Transparency: A particularly fascinating thread of commentary to me was the extent to which submissions touched on or entirely focused on issues of transparency in tracking. Grossklags argues that DNT efforts will spark increased transparency but he’s not sure that will overcome some common consumer privacy barriers they see in research. Seltzer talks about the intimate relationship between transparency and privacy and concludes that a DNT header is not very transparent–in operation, not use–while TPLs are more transparent in that they are a user-side mechanism that users can inspect, change and verify correct operation. Google argues that there is a need for transparency in “what data is collected and how it is used”, leaving out the ability for users to effect or controls these things. In contrast, BlueKai also advocates for transparency in the sense of both accessing a user’s profile and user “control” over the data it collects, but it doesn’t and probably cannot extend this transparency to an understanding how BlueKai’s clients use this data. Datran Media describes their PreferenceCentral tool which allows opting out of brands the user doesn’t want targeting them (instead of ad networks, with which people are not familiar), which they argue is granular enough to avoid the “creepy” targeting feeling that users get from behavioral ads and also allow high-value targeted advertising. Evidon analogizes to physical world shopping transactions and concludes, smartly, “Anytime data that was not explicitly provided is explicitly used, there is a reflexive notion of privacy violation.” and “A permanently affixed ‘Not Me’ sign is not a representation of an engaged, meaningful choice.”

W3C vs. IETF: Finally, Mozilla seems to be the only submission that wrestles a bit with the “which standards-body?” question: W3C, IETF or some mix of both? They point out that the DNT Header is a broader issue than just web browsing so should be properly tackled by IETF where HTTP resides and the W3C effort could be focused on TPLs with a subcommittee for the DNT DOM element.

Finally, here are a bunch of submissions that don’t fit into the above categories that caught my eye:

  • Soghoian talks about the quantity and quality of information needed for security, law enforcement and fraud prevention is usually so big as to risk making it the exception that swallows the rule. Soghoian further recommends a total kibosh on certain nefarious technologies such as browser fingerprinting.

  • Lowenthal makes the very good point that browser vendors need to get more serious about managing security and privacy vulnerabilities, as that kind of risk can be best dealt with in the choke-point of the browsers that users choose, rather than the myriad of possible web entities. This would allow browsers to compete on privacy in terms of how privacy preserving they can be.

  • Mayer argues for a “generative” approach to a privacy choice signaling technology, highlighting that language preferences (via short codes) and browsing platform (via user-agent strings) are now sent as preferences in web requests and web sites are free to respond as they see fit. A DNT signaling mechanism like this would allow for great flexibility in how a web service responded to a DNT request, for example serving a DNT version of the site/resource, prompting the user for their preferences or asking for a payment before serving.

  • Yahoo points out that DNT will take a while to make it into the majority of browsers that users are using. They suggest a hybrid approach using the DAA CLEAR ad notice for backwards compatibility for browsers that don’t support DNT mechanisms and the DNT header for an opt-out that is persistent and enforceable.

Whew; I likely left out a lot of good stuff across the remaining submissions, but I hope that readers get an idea of some of the issues in play and can consult the submissions they find particularly interesting as this develops. We hope to have someone pen a “part 2″ to this entry describing the discussion during the workshop and what the next steps in DNT will be.

avatar

California to Consider Do Not Track Legislation

This afternoon the CA Senate Judiciary Committee had a brief time for proponents and opponents of SB 761 to speak about CA’s Do Not Track legislation. In general, the usual people said the usual things, with a few surprises along the way.

Surprise 1: repeated discussion of privacy as a Constitutional right. For those of us accustomed to privacy at the federal level, it was a good reminder that CA is a little different.

Surprise 2: TechNet compared limits on Internet tracking to Texas banning oil drilling, and claimed DNT is “not necessary” so legislation would be “particularly bad.” Is Kleiner still heavily involved in the post-Wade TechNet?

Surprise 3: the Chamber of Commerce estimated that DNT legislation would cost $4 billion dollars in California, extrapolated from an MIT/Toronto study in the EU. Presumably they mean Goldfarb & Tucker’s Privacy Regulation and Online Advertising, which is in my queue to read. Comments on donottrack.us raise concerns. Assuming even a generous opt-out rate of 5% of CA Internet users, $4B sounds high based on other estimates of value of entire clickstream data for $5/month. I look forward to reading their paper, and to learning the Chamber’s methods of estimating CA based on Europe.

Surprise 4: hearing about the problems of a chilling effect — for job growth, not for online use due to privacy concerns. Similarly, hearing frustrations about a text that says something “might” or “may” happen, with no idea what will actually transpire — about the text of the bill, not about the text of privacy policies.

On a 3 to 2 vote, they sent the bill to the next phase: the Appropriations Committee. Today’s vote was an interesting start.