December 21, 2024

Some Technical Clarifications About Do Not Track

When I last wrote here about Do Not Track in August, there were just a few rumblings about the possibility of a Do Not Track mechanism for online privacy. Fast forward four months, and Do Not Track has shot to the top of the privacy agenda among regulators in Washington. The FTC staff privacy report released in December endorsed the idea, and Congress was quick to hold a hearing on the issue earlier this month. Now, odds are quite good that some kind of Do Not Track legislation will be introduced early in this new congressional session.

While there isn’t yet a concrete proposal for Do Not Track on the table, much has already been written both in support of and against the idea in general, and it’s terrific to see the issue debated so widely. As I’ve been following along, I’ve noticed some technical confusion on a few points related to Do Not Track, and I’d like to address three of them here.

1. Do Not Track will most likely be based on an HTTP header.

I’ve read some people still suggesting that Do Not Track will be some form of a government-operated list or registry—perhaps of consumer names, device identifiers, tracking domains, or something else. This type of solution has been suggested before in an earlier conception of Do Not Track, and given its rhetorical likeness to the Do Not Call Registry, it’s a natural connection to make. But as I discussed in my earlier post—the details of which I won’t rehash here—a list mechanism is a relatively clumsy solution to this problem for a number of reasons.

A more elegant solution—and the one that many technologists seem to have coalesced around—is the use of a special HTTP header that simply tells the server whether the user is opting out of tracking for that Web request, i.e. the header can be set to either “on” or “off” for each request. If the header is “on,” the server would be responsible for honoring the user’s choice to not be tracked. Users would be able to control this choice through the preferences panel of the browser or the mobile platform.

2. Do Not Track won’t require us to “re-engineer the Internet.”

It’s also been suggested that implementing Do Not Track in this way will require a substantial amount of additional work, possibly even rising to the level of “re-engineering the Internet.” This is decidedly false. The HTTP standard is an extensible one, and it “allows an open-ended set of… headers” to be defined for it. Indeed, custom HTTP headers are used in many Web applications today.

How much work will it take to implement Do Not Track using the header? Generally speaking, not too much. On the client-side, adding the ability to send the Do Not Track header is a relatively simple undertaking. For instance, it only took about 30 minutes of programming to add this functionality to a popular extension for the Firefox Web browser. Other plug-ins already exist. Implementing this functionality directly into the browser might take a little bit longer, but much of the work will be in designing a clear and easily understandable user interface for the option.

On the server-side, adding code to detect the header is also a reasonably easy task—it takes just a few extra lines of code in most popular Web frameworks. It could take more substantial work to program how the server behaves when the header is “on,” but this work is often already necessary even in the absence of Do Not Track. With industry self-regulation, compliant ad servers supposedly already handle the case where a user opts out of their behavioral advertising programs, the difference now being that the opt-out signal comes from a header rather than a cookie. (Of course, the FTC could require stricter standards for what opting-out means.)

Note also that contrary to some suggestions, the header mechanism doesn’t require consumers to identify who they are or otherwise authenticate to servers in order to gain tracking protection. Since the header is a simple on/off flag sent with every Web request, the server doesn’t need to maintain any persistent state about users or their devices’ opt-out preferences.

3. Microsoft’s new Tracking Protection feature isn’t the same as Do Not Track.

Last month, Microsoft announced that its next release of Internet Explorer will include a privacy feature called Tracking Protection. Mozilla is also reportedly considering a similar browser-based solution (although a later report makes it unclear whether they actually will). Browser vendors should be given credit for doing what they can from within their products to protect user privacy, but their efforts are distinct from the Do Not Track header proposal. Let me explain the major difference.

Browser-based features like Tracking Protection basically amount to blocking Web connections from known tracking domains that are compiled on a list. They don’t protect users from tracking by new domains (at least until they’re noticed and added to the tracking list) nor from “allowed” domains that are tracking users surreptitiously.

In contrast, the Do Not Track header compels servers to cooperate, to proactively refrain from any attempts to track the user. The header could be sent to all third-party domains, regardless of whether the domain is already known or whether it actually engages in tracking. With the header, users wouldn’t need to guess whether a domain should be blocked or not, and they wouldn’t have to risk either allowing tracking accidentally or blocking a useful feature.

Tracking Protection and other similar browser-based defenses like Adblock Plus and NoScript are reasonable, but incomplete, interim solutions. They should be viewed as complementary with Do Not Track. For entities under FTC jurisdiction, Do Not Track could put an effective end to the tracking arms race between those entities and browser-based defenses—a race that browsers (and thus consumers) are losing now and will be losing in the foreseeable future. For those entities outside FTC jurisdiction, blocking unwanted third parties is still a useful though leaky defense that maintains the status quo.

Information security experts like to preach “defense in depth” and it’s certainly vital in this case. Neither solution fully protects the user, so users really need both solutions to be available in order to gain more comprehensive protection. As such, the upcoming features in IE and Firefox should not be seen as a technical substitute for Do Not Track.

——

To reiterate: if the technology that implements Do Not Track ends up being an HTTP header, which I think it should be, it would be both technically feasible and relatively simple. It’s also distinct from recent browser announcements about privacy in that Do Not Track forces server cooperation, while browser-based defenses work alone to fend off tracking.

What other technical issues related to Do Not Track remain murky to readers? Feel free to leave comments here, or if you prefer on Twitter using the #dntrack tag and @harlanyu.

On Facebook Apps Leaking User Identities

The Wall Street Journal today reports that many Facebook applications are handing over user information—specifically, Facebook IDs—to online advertisers. Since a Facebook ID can easily be linked to a user’s real name, third party advertisers and their downstream partners can learn the names of people who load their advertisement from those leaky apps. This reportedly happens on all ten of Facebook’s most popular apps and many others.

The Journal article provides few technical details behind what they found, so here’s a bit more about what I think they’re reporting.

The content of a Facebook application, for example FarmVille, is loaded within an iframe on the Facebook page. An iframe essentially embeds one webpage (FarmVille) inside another (Facebook). This means that as you play FarmVille, your browser location bar will show http://apps.facebook.com/onthefarm, but the iframe content is actually controlled by the application developer, in this case by farmville.com.

The content loaded by farmville.com in the iframe contains the game alongside third party advertisements. When your browser goes to fetch the advertisement, it automatically forwards to the third party advertiser “referer” information—that is, the URL of the current page that’s loading the ad. For FarmVille, the URL referer that’s sent will look something like:

http://fb-tc-2.farmville.com/flash.php?…fb_sig_user=[User’s Facebook ID]…

And there’s the issue. Because of the way Zynga (the makers of FarmVille) crafts some of its URLs to include the user’s Facebook ID, the browser will forward this identifying information on to third parties. I confirmed yesterday evening that using FarmVille does indeed transmit my Facebook ID to a few third parties, including Doubleclick, Interclick and socialvi.be.

Facebook policy prohibits application developers from passing this information to advertising networks and other third parties. In addition, Zynga’s privacy policy promises that “Zynga does not provide any Personally Identifiable Information to third-party advertising companies.”

But evidence clearly indicates otherwise.

What can be done about this? First, application developers like Zynga can simply stop including the user’s Facebook ID in the HTTP GET arguments, or they can place a “#” mark before the sensitive information in the URL so browsers don’t transmit this information automatically to third parties.

Second, Facebook can implement a proxy scheme, as proposed by Adrienne Felt more than two years ago, where applications would not receive real Facebook IDs but rather random placeholder IDs that are unique for each application. Then, application developers can be free do whatever they want with the placeholder IDs, since they can no longer be linked back to real user names.

Third, browser vendors can give users easier and better control over when HTTP referer information is sent. As Chris Soghoian recently pointed out, browser vendors currently don’t make these controls very accessible to users, if at all. This isn’t a direct solution to the problem but it could help. You could imagine a privacy-enhancing opt-in browser feature that turns off the referer header in all cross-domain situations.

Some may argue that this leak, whether inadvertent or not, is relatively innocuous. But allowing advertisers and other third parties to easily and definitively correlate a real name with an otherwise “anonymous” IP address, cookie, or profile is a dangerous path forward for privacy. At the very least, Facebook and app developers need to be clear with users about their privacy rights and comply with their own stated policies.

Assessing PACER's Access Barriers

The U.S. Courts recently conducted a year-long assessment of their Electronic Public Access program which included a survey of PACER users. While the results of the assessment haven’t been formally published, the Third Branch Newsletter has an interview with Bankruptcy Judge J. Rich Leonard that discusses a few high-level findings of the survey. Judge Leonard has been heavily involved in shaping the evolution of PACER since its inception twenty years ago and continues to lead today.

The survey covered a wide range of PACER users—“the courts, the media, litigants, attorneys, researchers, and bulk data collectors”—and Judge Leonard claims they found “a remarkably high level of satisfaction”: around 80% of those surveyed were “satisfied” or “very satisfied” with the service.

If we compare public access before we had PACER to where we are now, there is clearly much success to celebrate. But the key question is not only whether current users are satisfied with the service but also whether PACER is reaching its entire audience of potential users. Are there artificial obstacles preventing potential PACER users—who admittedly would be difficult to poll—from using the service? The satisfaction statistic may be fine at face value, assuming that a representative sample of users were polled, but it could be misleading if it’s being used to gauge the overall success of PACER as a public access system.

One indicator of obstacles may be another statistic cited by Judge Leonard: “about 45% of PACER users also use CM/ECF,” the Courts’ electronic case management and filing system. To put it another way, nearly half of all PACER users are currently attorneys who practice federal law.

That number seems inordinately high to me and suggests that significant barriers to public access may exist. In particular, account registration requires all users to submit a valid credit card for billing (or alternatively a valid home address to receive log-in credentials and billing statements by mail.) Even if users’ credit cards are never charged, this registration hurdle may already turn away many potential PACER users at the door.

The other barrier is obviously the cost itself. With a few exceptions, users are forced to pay a fee for each document they download, at a metered rate of eight-cents per page. Judge Leonard asserts that “surprisingly, cost ranked way down” in the survey and that “most people thought they paid a fair price for what they got.”

But this doesn’t necessarily imply that cost isn’t a major impediment to access. It may just be that those surveyed—primarily lawyers—simply pass the cost of using PACER down to their clients and never bear the cost themselves. For the rest of PACER users who don’t have that luxury, the high cost of access can completely rule out certain kinds of legal research, or cause users to significantly ration and monitor their usage (as is the case even in the vast majority of our nation’s law libraries), or wholly deter users from ever using the service.

Judge Leonard rightly recognizes that it’s Congress that has authorized the collection of user fees, rather than using general taxpayer money, to fund the electronic public access program. But I wish the Courts would at least acknowledge that moving away from a fee-based model, to a system funded by general appropriations, would strengthen our judicial process and get us closer to securing each citizen’s right to equal protection under the law.

Rather than downplaying the barriers to public access, the Courts should work with Congress to establish a way forward to support a public access system that is truly open. They should study and report on the extent to which Congress already funds PACER indirectly, through Executive and Legislative branch PACER fee payments to the Judiciary, and re-appropriate those funds directly. If there is a funding shortfall, and I assume there will be, they should study the various options for closing that gap, such as additional direct appropriations or a slight increase in certain filing fees.

With our other two branches of government making great strides in openness and transparency with the help of technology, the Courts similarly needs to transition away from a one-size-fits-all approach to information dissemination. Public access to the courts will be fundamentally transformed by a vigorous culture of civic innovation around federal court documents, and this will only happen if the Courts confront today’s access barriers head-on and break them down.

(Thanks to Daniel Schuman for pointing me to the original article.)

Do Not Track: Not as Simple as it Sounds

Over the past few weeks, regulators have rekindled their interest in an online Do Not Track proposal in hopes of better protecting consumer privacy. FTC Chairman Jon Leibowitz told a Senate Commerce subcommittee last month that Do Not Track is “one promising area” for regulatory action and that the Commission plans to issue a report in the fall about “whether this is one viable way to proceed.” Senator Mark Pryor (D-AR), who sits on the subcommittee, is also reportedly drafting a new privacy bill that includes some version of this idea, of empowering consumers with blanket opt-out powers over online tracking.

Details are sparse at this point about how a Do Not Track mechanism might actually be implemented. There are a variety of possible technical and regulatory approaches to the problem, each with its own difficulties and limitations, which I’ll discuss in this post.

An Adaptation of “Do Not Call”

Because of its name, Do Not Track draws immediate comparisons to arguably the most popular piece of consumer protection regulation ever instituted in the US—the National Do Not Call Registry. If the FTC were to take an analogous approach for online tracking, a consumer would register his device’s network identifier—its IP address—with the national registry. Online advertisers would then be prohibited from tracking devices that are identified by those IP addresses.

Of course, consumer devices rarely have persistent long-term IP addresses. Most ISPs assign IP addresses dynamically (using DHCP) and a single device might be assigned a new IP address every few minutes. Consumer devices often also share the same IP address at the same time (using NAT) so there’s no stable one-to-one mapping between IPs and devices. Things could be different with IPv6, where each device could have its own stable IP address, but the Do Not Call framework, directly applied, is not the best solution for today’s online world.

The comparison is still useful though, if only to caution against the assumption that Do Not Track will be as easy, or as successful, as Do Not Call. The differences between the problems at hand and the technologies involved are substantial.

A Registry of Tracking Domains

Back in 2007, a coalition of online consumer privacy groups lobbied for the creation of a national Do Not Track List. They proposed a reverse approach: online advertisers would be required to register with the FTC all domain names used to issue persistent identifiers to user devices. The FTC would then publish this list, and it would be up to the browser to protect users from being tracked by these domains. Notice that the onus here is fully on the browser—equipped with this list—to protect the user from being uniquely identified. Meanwhile, online advertisers would still have free rein to try any method they wish to track user behavior, so long as it happens from these tracking domains.

We’ve learned over the past couple of years that modern browsers, from a practical perspective, can be limited in their ability to protect the user from unique identification. The most stark example of this is the browser fingerprinting attack, which was popularized by the EFF earlier this year. In this attack, the tracking site runs a special script that gathers information about the browser’s configurations, which are unique enough to identify the browser instance in nearly every case. The attack takes advantage of the fact that much of the gathered information is used frequently for legitimate purposes—such as determining which plugins are available to the site—so a browser which blocks the release of this information would surely irritate the user. As these kinds of “side-channel” attacks grow in sophistication, major browser vendors might always be playing catch-up in the technical arms race, leaving most users vulnerable to some form of tracking by these domains.

The x-notrack Header

If we believe that browsers, on their own, will be unable to fully protect users, then any effective Do No Track proposal will need to place some restraints on server tracking behavior. Browsers could send a signal to the tracking server to indicate that the user does not want this particular interaction to be tracked. The signaling mechanism could be in the form of a standard pre-defined cookie field, or more likely, an HTTP header that marks the user’s tracking preference for each connection.

In the simplest case, the HTTP header—call it x-notrack—is a binary flag that can be turned on or off. The browser could enable x-notrack for every HTTP connection, or for connections to only third party sites, or for connections to some set of user-specified sites. Upon receiving the signal not to track, the site would be prevented, by FTC regulation, from setting any persistent identifiers on the user’s machine or using any other side-channel mechanism to uniquely identify the browser and track the interaction.

While this approach seems simple, it could raise a few complicated issues. One issue is bifurcation: nothing would prevent sites from offering limited content or features to users who choose to opt-out of tracking. One could imagine a divided Web, where a user who turns on the x-notrack header for all HTTP connections—i.e. a blanket opt-out—would essentially turn off many of the useful features on the Web.

By being more judicious in the use of x-notrack, a user could permit silos of first-party tracking in exchange for individual feature-rich sites, while limiting widespread tracking by third parties. But many third parties offer useful services, like embedding videos or integrating social media features, and they might require that users disable x-notrack in order to access their services. Users could theoretically make a privacy choice for each third party, but such a reality seems antithetical to the motivations behind Do Not Track: to give consumers an easy mechanism to opt-out of harmful online tracking in one fell swoop.

The FTC could potentially remedy this scenario by including some provision for “tracking neutrality,” which would prohibit sites from unnecessarily discriminating against a user’s choice not to be tracked. I won’t get into the details here, but suffice it to say that crafting a narrow yet effective neutrality provision would be highly contentious.

Privacy Isn’t a Binary Choice

The underlying difficulty in designing a simple Do Not Track mechanism is the subjective nature of privacy. What one user considers harmful tracking might be completely reasonable to another. Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.

Another open question is whether browser vendors can eventually “win” the technical arms race against tracking technologies. If so, regulations might not be necessary, as innovative browsers could fully insulate users from unwanted tracking. While tracking technologies are currently winning this race, I wouldn’t call it a foregone conclusion.

The one thing we do know is this: Do Not Track is not as simple as it sounds. If regulators are serious about putting forth a proposal, and it sounds like they are, we need to start having a more robust conversation about the merits and ramifications of these issues.

Release Government Data, Early and Often

One of the key axioms of modern open government is that all public data should be published online in a raw but usable form. Usability in this case is aimed at software programmers. By making government datasets more usable, programmers are more likely to innovate in the civic sphere and build technologies, using the raw data, to enhance the relationships among citizens and with government.

The open government community has provided plenty of valuable guidance about what usability means for programmers. We proclaim that all datasets need to be: published in a format that is reasonably structured and machine-processable; well-documented; downloadable in bulk; authenticated using cryptographic digital signatures; version-controlled; permanent and citable; and the list goes on and on. These are all worthy principles to be sure, and all government datasets should strive to meet them.

But you’ll be hard-pressed to find any government datasets that exist with all of these principles pre-satisfied. While some are in better shape than others, most datasets would make programmers cringe. Data often only exist as informal working sets in proprietary Excel spreadsheets. Sometimes they are in structured databases, but schemas are undocumented, field values are ambiguous, and the semantics are only understood by the employee who created them. Datasets have errors and biases that are known but never explicitly corrected.

For a civil servant who is a data caretaker looking over the laundry list of publishing principles, there’s frequently a huge quality chasm between the dataset she owns and how people are asking to see it released. To her, publishing this data adequately just seems like a lot of extra work. The more attractive alternative is to put off the data publishing—it’s not in her job description or evaluations anyway—and move on to other work instead.

How can this chasm be bridged? A widely-adopted philosophy in software development and entrepreneurship would serve open government data well: release early and release often. And listen to your customers.

In the software development world, a working version of the product is pushed out as soon as possible even with known imperfections—an “alpha” release—so it can be subject to real use by early adopters. Early adopters can provide helpful feedback about what works, what’s broken, and what new features would be most useful to them. The software developers then iterate quickly. They incorporate the suggested fixes and features into their code and release an updated version of the product to their users. The virtuous cycle then starts again. Under this philosophy, software developers can be efficient about how to best improve their code where it matters, and users get software that works better and has more features they desire.

The “release early, release often” philosophy should be applied to government data. For the initial release, data caretakers should take the path of least resistance to get data out the door. This means publishing datasets in whatever format is most convenient, along with as much documentation as can reasonably be mustered. Documentation is especially important with an “alpha” dataset—proper warnings about its problems, instabilities and inductive limitations must be prominently displayed. (Of course, the usual privacy and legal caveats should also be applied.) Sometimes, the “alpha” release will be “good enough” for programmers to start their work, and this will minimize any superfluous work done by caretakers. This is the virtue of “release early.”

In other cases, programmers will need assistance using the dataset and will notice problem spots with the initial release. The dataset might be confusing, contain errors or be difficult to work with. A tight feedback mechanism allows the programmer to get help quickly and continue to innovate, while the data caretaker can fix problems based on real use cases and add clarifying metadata into an updated version of the dataset. Data quality and usability increases for those working with the dataset, both in and outside of government. That’s the virtue of “release often.”

And here is the big opportunity for government: no platform currently exists to engage the prime audience for government data—software programmers. Without a tight feedback mechanism, the virtuous cycle of mutual benefit cannot exist. Government is missing its best opportunity to improve data quality by neglecting useful feedback from programmers who are actually tinkering with the datasets. Society is losing out on potentially game-changing civic innovations, which otherwise would have been built if data were more usable and the uncertainty of failure reduced.

A terrific start in turning the corner would be for government to adopt an issue-tracking system for its datasets. As a public venue, it would help ensure that data caretakers are prompt in addressing developer concerns. It would also allow caretakers to organize feedback in a formal way. Such platforms are commonplace in any successful software development venture. The same needs to be true for government data in order to drive rapid quality improvements and increase developer engagement.