May 23, 2018

What We Lose if We Lose Data.gov

In its latest 2011 budget proposal, Congress makes deep cuts to the Electronic Government Fund. This fund supports the continued development and upkeep of several key open government websites, including Data.gov, USASpending.gov and the IT Dashboard. An earlier proposal would have cut the funding from $34 million to $2 million this year, although the current proposal would allocate $17 million to the fund.

Reports say that major cuts to the e-government fund would force OMB to shut down these transparency sites. This would strike a significant blow to the open government movement, and I think it’s important to emphasize exactly why shuttering a site like Data.gov would be so detrimental to transparency.

On its face, Data.gov is a useful catalog. It helps people find the datasets that government has made available to the public. But the catalog is really a convenience that doesn’t necessarily need to be provided by the government itself. Since the vast majority of datasets are hosted on individual agency servers—not directly by Data.gov—private developers could replicate the catalog with relatively little effort. So even if Data.gov goes offline, nearly all of the data would still exist online, and a private developer could rebuild a version of the catalog, perhaps with even better features and interfaces.

But Data.gov also plays a crucial behind-the-scenes role, setting standards for open data and helping individual departments and agencies live up to those standards. Data.gov establishes a standard, cross-agency process for publishing raw datasets. The program gives agencies clear guidance on the mechanics and requirements for releasing each new dataset online.

There’s a Data.gov manual that formally documents and teaches this process. Each agency has a lead Data.gov point-of-contact, who’s responsible for identifying publishable datasets and for ensuring that when data is published, it meets information quality guidelines. Each dataset needs to be published with a well-defined set of common metadata fields, so that it can be organized and searched. Moreover, thanks to Data.gov, all the data is funneled through at least five stages of intermediate review—including national security and privacy reviews—before final approval and publication. That process isn’t quick, but it does help ensure that key goals are satisfied.
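To make the idea of common metadata concrete, here is a minimal sketch of what a dataset record and its pre-publication check might look like. The field names and the sample record are illustrative assumptions on my part, not the actual Data.gov schema:

```python
# Hypothetical sketch of a common-metadata record for a published dataset.
# The required field names below are assumptions for illustration; the real
# Data.gov metadata schema is defined in the program's own documentation.
REQUIRED_FIELDS = {"title", "agency", "description", "category", "date_released"}

def is_publishable(record):
    """A record passes this basic check only if every required
    metadata field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

# A complete record (hypothetical values) passes the check...
complete = {
    "title": "Toxics Release Inventory",
    "agency": "EPA",
    "description": "Facility-level data on toxic chemical releases.",
    "category": "Environment",
    "date_released": "2011-04-01",
}
print(is_publishable(complete))               # True

# ...while a record missing most fields would be sent back for revision.
print(is_publishable({"title": "Untitled"}))  # False
```

The value of a uniform schema like this is that every dataset, regardless of agency, can be indexed, searched, and reviewed the same way.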

When agency staff have data they want to publish, they use a special part of the Data.gov website, which outside users never see, called the Data Management System (DMS). This back-end administrative interface allows agency points-of-contact to efficiently coordinate publishing activities agency-wide, and it gives individual data stewards a way to easily upload, view and maintain their own datasets.

My main concern is that this invaluable but underappreciated infrastructure will be lost when these IT systems are de-funded. The individual roles and responsibilities, the informal norms and pressures, and perhaps even the tacit authority to put new datasets online would likely disappear along with them. The loss of this structure would probably mean that far less data gets put online in the future, and the datasets that do get published in an ad hoc way would lack the uniformity and quality that the current process creates.

Releasing a new dataset online is already a difficult task for many agencies. While the current standards and processes may be far from perfect, Data.gov provides agencies with a firm footing on which they can base their transparency efforts. I don’t know how much funding is necessary to maintain these critical back-end processes, but whatever Congress decides, it should budget sufficient funds—and direct that they be used—to preserve these critically important tools.

What are the Constitutional Limits on Online Tracking Regulations?

As the conceptual contours of Do Not Track are being worked out, an interesting question to consider is whether such a regulation—if promulgated—would survive a First Amendment challenge. Could Do Not Track be an unconstitutional restriction on the commercial speech of online tracking entities? The answer would of course depend on what restrictions a potential regulation would specify. However, it may also depend heavily on the outcome of a case currently in front of the Supreme Court—Sorrell v. IMS Health Inc.—that challenges the constitutionality of a Vermont medical privacy law.

The privacy law at issue would restrict pharmacies from selling prescription drug records to data mining companies for marketing purposes without the prescribing doctor’s consent. These drug records each contain extensive details about the doctor-patient relationship, including “the prescriber’s name and address, the name, dosage and quantity of the drug, the date and place the prescription is filled and the patient’s age and gender.” A doctor’s prescription record can be tracked very accurately over time, and while patient names are redacted, each patient is assigned a unique identifier so their prescription histories may also be tracked. Pharmacies have been selling these records to commercial data miners, who in turn aggregate the data and sell compilations to pharmaceutical companies, who then engage in direct marketing back to individual doctors using a practice known as “detailing.” Sound familiar yet? It’s essentially brick-and-mortar behavioral advertising, and a Do Not Track choice mechanism, for prescription drugs.

The Second Circuit recently struck down the Vermont law on First Amendment grounds, ruling first that the law is a regulation of commercial speech and second that the law’s restrictions fall on the wrong side of the Central Hudson test—the four-step analysis used to determine the constitutionality of commercial speech restrictions. This ruling clashes directly with two earlier First Circuit decisions, Ayotte and Mills, which upheld similar medical privacy laws in New Hampshire and Maine. As such, the Supreme Court decided in January to take the case and resolve the disagreement, and oral argument is set for April 26th.

I’m not a lawyer, but it seems like the outcome of Sorrell could have a wide-ranging impact on current and future information privacy laws, including possible Do Not Track regulations. Indeed, the petitioners recognize the potentially broad implications of their case. From the petition:

“Information technology has created new and unprecedented opportunities for data mining companies to obtain, monitor, transfer, and use personal information. Indeed, one of the defining traits of the so-called “Information Age” is this ability to amass information about individuals. Computers have made the flow of data concerning everything from personal purchasing habits to real estate records easier to collect than ever before.”

One central question in the case is whether a restriction on access to these data for marketing purposes is a restriction on legitimate commercial speech. The Second Circuit believes it is, reasoning that even “dry information” sold for profit—and already in the hands of a private actor—is entitled to First Amendment protection. In contrast, the First Circuit in Ayotte posited that the information being exchanged has “itself become a commodity,” not unlike beef jerky, so such restrictions are only a limitation on commercial conduct—not speech—and therefore do not implicate any First Amendment concerns.

A major factual difference here, as compared to online privacy and tracking, is that pharmacies are required by many state and federal laws to collect and maintain prescription drug records, so there may be more compelling reasons for the state to restrict access to this information.

In the case of online privacy, it could be argued that Internet users are voluntarily supplying information to the tracking servers, even though many users probably don’t intend to do this, nor do they expect that this is occurring. Judge Livingston, in her dissent from the Second Circuit’s ruling in Sorrell, notes that different considerations apply where the government is “prohibiting a speaker from conveying information that the speaker already possesses,” distinguishing that from situations where the government restricts access to the information itself. In applying this to online communications, at what point does the server “possess” the user’s data—when the packets are received and sitting in a buffer, or when the packets are re-assembled and the data permanently stored? Is there a constitutional difference between restrictions on collection versus restrictions on use? The Supreme Court in 1965 in Zemel v. Rusk stated that “the right to speak and publish does not carry with it the unrestrained right to gather information.” To what extent does this apply to government restrictions of online tracking?

The constitutionality of state and federal information privacy laws has historically and consistently been called into question, and things would be no different if—and it’s a big if—Congress grants the FTC authority over online tracking. When considering technical standards and what “tracking” means, it’s worth keeping in mind the possible constitutional challenges insofar as state action may be involved, as some desirable options to curb online tracking may only be possible within a voluntary or self-regulatory framework. Where that line is drawn will depend on how the Supreme Court comes down in Sorrell and how broadly it decides the case.

Some Technical Clarifications About Do Not Track

When I last wrote here about Do Not Track in August, there were just a few rumblings about the possibility of a Do Not Track mechanism for online privacy. Fast forward four months, and Do Not Track has shot to the top of the privacy agenda among regulators in Washington. The FTC staff privacy report released in December endorsed the idea, and Congress was quick to hold a hearing on the issue earlier this month. Now, odds are quite good that some kind of Do Not Track legislation will be introduced early in this new congressional session.

While there isn’t yet a concrete proposal for Do Not Track on the table, much has already been written both in support of and against the idea in general, and it’s terrific to see the issue debated so widely. As I’ve been following along, I’ve noticed some technical confusion on a few points related to Do Not Track, and I’d like to address three of them here.

1. Do Not Track will most likely be based on an HTTP header.

I’ve read some people still suggesting that Do Not Track will be some form of a government-operated list or registry—perhaps of consumer names, device identifiers, tracking domains, or something else. This type of solution was suggested in an earlier conception of Do Not Track, and given its rhetorical likeness to the Do Not Call Registry, it’s a natural connection to make. But as I discussed in my earlier post—the details of which I won’t rehash here—a list mechanism is a relatively clumsy solution to this problem for a number of reasons.

A more elegant solution—and the one that many technologists seem to have coalesced around—is the use of a special HTTP header that simply tells the server whether the user is opting out of tracking for that Web request, i.e. the header can be set to either “on” or “off” for each request. If the header is “on,” the server would be responsible for honoring the user’s choice to not be tracked. Users would be able to control this choice through the preferences panel of the browser or the mobile platform.
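To illustrate how simple the client side of this would be, here is a minimal sketch in Python. The header name “DNT” and the “1”/“0” values are my assumptions for illustration; the actual name and semantics would be fixed by whatever standard ultimately emerges:

```python
# Minimal sketch of the client side of a header-based Do Not Track mechanism.
# The header name "DNT" and the values "1"/"0" are illustrative assumptions,
# standing in for whatever the eventual standard specifies.
def build_request_headers(do_not_track, extra=None):
    """Attach the user's opt-out flag to an outgoing request's headers.

    The browser would call something like this for every Web request,
    based on a single preference the user sets once in the browser UI.
    """
    headers = dict(extra or {})
    headers["DNT"] = "1" if do_not_track else "0"
    return headers

print(build_request_headers(True))   # {'DNT': '1'}
print(build_request_headers(False))  # {'DNT': '0'}
```

Note that the flag carries no identity at all: it is the same one-bit signal for every user who opts out, which is what makes the mechanism so lightweight.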

2. Do Not Track won’t require us to “re-engineer the Internet.”

It’s also been suggested that implementing Do Not Track in this way will require a substantial amount of additional work, possibly even rising to the level of “re-engineering the Internet.” This is decidedly false. The HTTP standard is an extensible one, and it “allows an open-ended set of… headers” to be defined for it. Indeed, custom HTTP headers are used in many Web applications today.

How much work will it take to implement Do Not Track using the header? Generally speaking, not too much. On the client-side, adding the ability to send the Do Not Track header is a relatively simple undertaking. For instance, it only took about 30 minutes of programming to add this functionality to a popular extension for the Firefox Web browser. Other plug-ins already exist. Implementing this functionality directly into the browser might take a little bit longer, but much of the work will be in designing a clear and easily understandable user interface for the option.

On the server-side, adding code to detect the header is also a reasonably easy task—it takes just a few extra lines of code in most popular Web frameworks. It could take more substantial work to program how the server behaves when the header is “on,” but this work is often already necessary even in the absence of Do Not Track. With industry self-regulation, compliant ad servers supposedly already handle the case where a user opts out of their behavioral advertising programs, the difference now being that the opt-out signal comes from a header rather than a cookie. (Of course, the FTC could require stricter standards for what opting-out means.)
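As a sketch of just how few lines the detection step takes, here is a bare WSGI application that checks for the header (WSGI surfaces a request header named “DNT” as `environ["HTTP_DNT"]`; the header name itself is again an assumption for illustration):

```python
# Sketch of server-side Do Not Track detection in a plain WSGI app,
# with no framework required. "DNT" is an assumed header name; per the
# WSGI spec (PEP 3333), a request header "DNT: 1" arrives as
# environ["HTTP_DNT"] == "1".
def app(environ, start_response):
    opted_out = environ.get("HTTP_DNT") == "1"
    # When the flag is on, the server should skip its tracking logic
    # (tracking cookies, behavioral profiling, etc.) for this request.
    body = b"tracking disabled" if opted_out else b"tracking enabled"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

The harder part, as noted above, is not detecting the flag but deciding what the server does differently once it sees it.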

Note also that contrary to some suggestions, the header mechanism doesn’t require consumers to identify who they are or otherwise authenticate to servers in order to gain tracking protection. Since the header is a simple on/off flag sent with every Web request, the server doesn’t need to maintain any persistent state about users or their devices’ opt-out preferences.

3. Microsoft’s new Tracking Protection feature isn’t the same as Do Not Track.

Last month, Microsoft announced that its next release of Internet Explorer will include a privacy feature called Tracking Protection. Mozilla is also reportedly considering a similar browser-based solution (although a later report makes it unclear whether they actually will). Browser vendors should be given credit for doing what they can from within their products to protect user privacy, but their efforts are distinct from the Do Not Track header proposal. Let me explain the major difference.

Browser-based features like Tracking Protection basically amount to blocking Web connections from known tracking domains that are compiled on a list. They don’t protect users from tracking by new domains (at least until those domains are noticed and added to the list), nor from “allowed” domains that are tracking users surreptitiously.
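The limitation is easy to see in a toy version of the matching logic. The domain names below are made up, and this is only a sketch of the general list-matching approach, not Microsoft’s actual implementation:

```python
# Toy sketch of list-based blocking, the approach behind features like
# Tracking Protection: block a request if its hostname (or any parent
# domain) appears on a compiled list of known trackers.
# All domain names here are fabricated for illustration.
BLOCKLIST = {"tracker.example", "ads.example"}

def is_blocked(hostname):
    """Return True if the hostname or any of its parent domains is listed."""
    parts = hostname.lower().split(".")
    return any(".".join(parts[i:]) in BLOCKLIST for i in range(len(parts)))

print(is_blocked("cdn.tracker.example"))  # True: subdomain of a listed domain
print(is_blocked("new-tracker.example"))  # False: an unlisted tracker slips through
```

The second call is exactly the failure mode described above: a tracking domain nobody has listed yet is invisible to the defense.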

In contrast, the Do Not Track header compels servers to cooperate, to proactively refrain from any attempts to track the user. The header could be sent to all third-party domains, regardless of whether the domain is already known or whether it actually engages in tracking. With the header, users wouldn’t need to guess whether a domain should be blocked or not, and they wouldn’t have to risk either allowing tracking accidentally or blocking a useful feature.

Tracking Protection and other similar browser-based defenses like Adblock Plus and NoScript are reasonable, but incomplete, interim solutions. They should be viewed as complementary to Do Not Track. For entities under FTC jurisdiction, Do Not Track could put an effective end to the tracking arms race between those entities and browser-based defenses—a race that browsers (and thus consumers) are losing now and will continue to lose for the foreseeable future. For entities outside FTC jurisdiction, blocking unwanted third parties is still a useful though leaky defense that maintains the status quo.

Information security experts like to preach “defense in depth,” and it’s certainly vital in this case. Neither solution fully protects the user on its own, so both need to be available to provide more comprehensive protection. As such, the upcoming features in IE and Firefox should not be seen as a technical substitute for Do Not Track.

——

To reiterate: if the technology that implements Do Not Track ends up being an HTTP header, which I think it should be, it would be both technically feasible and relatively simple. It’s also distinct from recent browser announcements about privacy in that Do Not Track forces server cooperation, while browser-based defenses work alone to fend off tracking.

What other technical issues related to Do Not Track remain murky to readers? Feel free to leave comments here, or if you prefer on Twitter using the #dntrack tag and @harlanyu.