
Summary of W3C DNT Workshop Submissions

Last week, we hosted the W3C “Web Tracking and User Privacy” Workshop here at CITP (sponsored by Adobe, Yahoo!, Google, Mozilla and Microsoft). If you were not able to join us for this event, I hope to summarize some of the discussion embodied in the roughly 60 position papers submitted.

The workshop attracted a wide range of participants; the agenda included advocates, academics, government, start-ups and established industry players from various sectors. Despite the broad name of the workshop, the discussion centered around “Do Not Track” (DNT) technologies and policy, essentially ways of ensuring that people have control, to some degree, over web profiling and tracking.

Unfortunately, before going much further, I’m going to have to assume that you are familiar with the various proposals, as the workshop position papers are necessarily short and presume the same familiarity. (If you are new to this area, the CDT’s Alissa Cooper has a brief blog post from this past March, “Digging in on ‘Do Not Track'”, that mentions many of the discussion points. Technically, much of the discussion involved the mechanisms of the Mayer, Narayanan and Stamm IETF Internet-Draft from March and the Microsoft W3C member submission from February.)

Read on for more…

California to Consider Do Not Track Legislation

This afternoon the CA Senate Judiciary Committee set aside a brief period for proponents and opponents of SB 761, CA’s Do Not Track bill, to speak. In general, the usual people said the usual things, with a few surprises along the way.

Surprise 1: repeated discussion of privacy as a Constitutional right. For those of us accustomed to thinking about privacy at the federal level, it was a good reminder that CA is a little different.

Surprise 2: TechNet compared limits on Internet tracking to Texas banning oil drilling, and claimed DNT is “not necessary” so legislation would be “particularly bad.” Is Kleiner still heavily involved in the post-Wade TechNet?

Surprise 3: the Chamber of Commerce estimated that DNT legislation would cost $4 billion in California, extrapolated from an MIT/Toronto study of the EU. Presumably they mean Goldfarb & Tucker’s Privacy Regulation and Online Advertising, which is in my queue to read. Comments on donottrack.us raise concerns about that study. Assuming even a generous opt-out rate of 5% of CA Internet users, $4B sounds high compared to other estimates that value a user’s entire clickstream at about $5 per month. I look forward to reading their paper, and to learning the Chamber’s methods for extrapolating from Europe to California.
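To see why that figure seems high, here is the back-of-envelope arithmetic. Every input below is an illustrative assumption of mine, not a number from the hearing or the study: suppose roughly 30 million Californians are online, 5% of them opt out, and an entire clickstream is worth about $5 per month.

```python
# Back-of-envelope check on the $4 billion claim.
# All three inputs are illustrative assumptions, not figures from the
# hearing or from Goldfarb & Tucker.
ca_internet_users = 30_000_000    # assumed number of CA Internet users
opt_out_rate = 0.05               # the "generous" 5% opt-out rate
clickstream_value_per_month = 5   # assumed $/month value of a full clickstream

annual_value_lost = ca_internet_users * opt_out_rate * clickstream_value_per_month * 12
print(f"Implied annual value lost: ${annual_value_lost:,.0f}")
# Prints roughly $90,000,000 per year -- well short of $4 billion.
```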

Surprise 4: hearing about the problems of a chilling effect — for job growth, not for online use due to privacy concerns. Similarly, hearing frustrations about a text that says something “might” or “may” happen, with no idea what will actually transpire — about the text of the bill, not about the text of privacy policies.

On a 3 to 2 vote, they sent the bill to the next phase: the Appropriations Committee. Today’s vote was an interesting start.

Tracking Your Every Move: iPhone Retains Extensive Location History

Today, Pete Warden and Alasdair Allan revealed that Apple’s iPhone maintains an apparently indefinite log of its location history. To show the data available, they produced and demoed an application called iPhone Tracker for plotting these locations on a map. The application allows you to replay your movements, displaying your precise location at any point in time when you had your phone. Their open-source application works with the GSM (AT&T) version of the iPhone, but I have modified their code so that it also works with the CDMA (Verizon) version of the phone.

When you sync your iPhone with your computer, iTunes automatically creates a complete backup of the phone to your machine. This backup contains any new content, contacts, and applications that were modified or downloaded since your last sync. Beginning with iOS 4, this backup also includes a SQLite database containing tables named ‘CellLocation’, ‘CdmaCellLocation’ and ‘WifiLocation’. These correspond to the GSM, CDMA and WiFi variants of location information. Each of these tables contains latitude and longitude data along with timestamps. These tables also contain additional fields that appear largely unused on the CDMA iPhone that I used for testing — including altitude, speed, confidence, “HorizontalAccuracy,” and “VerticalAccuracy.”
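For the curious, here is a minimal sketch of how one might poke at such a database with Python’s built-in sqlite3 module. It assumes you have already copied the location database out of the iTunes backup to a local file; the file name and the exact column names below are assumptions and may differ across iOS versions.

```python
import sqlite3

# Assumed path: point this at the SQLite location database extracted from
# the iTunes backup (files in the backup are stored under hashed names, so
# a backup-browsing tool may be needed to find it).
DB_PATH = "location.db"

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()

# List the tables; these should include CellLocation, CdmaCellLocation
# and WifiLocation on devices that record this data.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
print("tables:", [row[0] for row in cur.fetchall()])

# Show a few recent cell-location fixes. Column names are assumptions based
# on the fields described above; check PRAGMA table_info(CellLocation) if
# they differ on your device.
cur.execute(
    "SELECT Timestamp, Latitude, Longitude FROM CellLocation "
    "ORDER BY Timestamp DESC LIMIT 5"
)
for timestamp, lat, lon in cur.fetchall():
    # The raw timestamp may be counted from Apple's 2001-01-01 epoch rather
    # than the Unix epoch, so it is printed here without conversion.
    print(timestamp, lat, lon)

conn.close()
```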

Interestingly, the WifiLocation table contains the MAC address of each WiFi network node you have connected to, along with an estimated latitude/longitude. The WifiLocation table in our two-month-old CDMA iPhone contains over 53,000 distinct MAC addresses, suggesting that this data is stored not just for networks your device connects to but for every network your phone was aware of (e.g., the network at the Starbucks you walked by — but didn’t connect to).
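Continuing the sketch above, a single aggregate query reproduces the kind of count described here (again, “MAC” is an assumed column name):

```python
import sqlite3

conn = sqlite3.connect("location.db")  # same assumed path as the sketch above
cur = conn.cursor()

# Count distinct access points the phone has recorded; "MAC" is an assumed
# column name for the access point's hardware address.
cur.execute("SELECT COUNT(DISTINCT MAC) FROM WifiLocation")
print("distinct WiFi MAC addresses recorded:", cur.fetchone()[0])

conn.close()
```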

Location information persists across devices, including upgrades from the iPhone 3GS to iPhone 4, which appears to be a function of the migration process. It is important to note that you must have physical access to the synced machine (e.g., your laptop) in order to access the synced location logs. Malicious code running on the iPhone presumably could also access this data.

Not only was it previously unclear that the iPhone stores this data, but the rationale for storing it remains a mystery. To the best of my knowledge, Apple has not disclosed that this type or quantity of information is being stored, and Apple does not appear to be currently using it. In theory, Apple could combine WiFi MAC addresses and GPS locations to create a highly accurate geolocation service.

The exact implications for mobile security (along with forensics and law enforcement) will be important to watch. What is most surprising is that this granularity of information is being stored at such a large scale on such a mainstream device.

Oak Ridge, spear phishing, and i-voting

Oak Ridge National Labs (one of the US national energy labs, along with Sandia, Livermore, Los Alamos, etc.) had a bunch of people fall for a spear phishing attack (see articles in Computerworld and many other descriptions). For those not familiar with the term, spear phishing means sending targeted emails to specific recipients, designed to get them to take an action (e.g., clicking on a link) that will install some form of software (e.g., to allow stealing information from their computers). This is distinct from spam, where the goal is primarily to get you to purchase pharmaceuticals, or maybe install software, but which in any case is widespread and not targeted at particular victims. Spear phishing is the same technique used in the Google Aurora (and related) cases last year, the RSA case earlier this year, Epsilon a few weeks ago, and doubtless many others that we haven’t heard about. Targets of spear phishing might be particular people within an organization (e.g., executives, or people on a particular project).

In this posting, I’m going to connect this attack to Internet voting (i-voting), by which I mean casting a ballot from the comfort of your home using your personal computer (i.e., not a dedicated machine in a precinct or government office). My contention is that in addition to all the other risks of i-voting, one of the problems is that people will click links targeted at them by political parties, and will try to cast their vote on fake web sites. The scenario is that operatives of the Orange party send messages to voters who belong to the Purple party claiming to be from the Purple party’s candidate for president and giving a link to a look-alike web site for i-voting, encouraging voters to cast their votes early. The goal of the Orange party is to either prevent Purple voters from voting at all, or to convince them that their vote has been cast and then use their credentials (i.e., username and password) to have software cast their vote for Orange candidates, without the voter ever knowing.

The percentage of users who fall prey to targeted attacks has been a subject of some controversy. While the percentage of users who click on spam emails has fallen significantly over the years as more people are aware of them (and as spam filtering has improved and mail programs have improved to no longer fetch images by default), spear phishing attacks have been assumed to be more effective. The result from Oak Ridge is one of the most significant pieces of hard data in that regard.

According to an article in The Register, of the 530 Oak Ridge employees who received the spear phishing email, 57 fell for the attack by clicking on a link (which silently installed software on their computers by exploiting a security vulnerability in Internet Explorer that was patched earlier this week – but presumably the patch wasn’t yet installed on their computers). That’s a click rate of just under 11%. Oak Ridge employees are likely to be well-educated scientists (but not necessarily computer scientists) – and hence not representative of the population as a whole. The fact that this was a spear phishing attack means that it was probably targeted at people with access to sensitive information, whether administrative staff, senior scientists, or executives (but probably not the person running the cafeteria, for example). Whether the level of education and access to sensitive information makes them more or less likely to click on links is something for social scientists to assess – I’m going to take it as a data point and assume that a range of 5% to 20% of victims will click on a link in a spear phishing attack (i.e., that the Oak Ridge rate isn’t off by more than a factor of two).

So as a working hypothesis based on this actual result, I propose that a spear phishing attack designed to draw voters to a fake web site to cast their votes will succeed with 5-20% of the targeted voters. With UOCAVA voters (military and overseas voters) representing around 5% of the electorate, that click rate means the attack could impact 0.25% to 1% of all votes, which does not seem an unreasonable target. Now if we presume that the race is close and half of the victims would have voted for the “preferred” candidate anyway, this allows a spear phishing attack to capture an additional 0.125% to 0.5% of the vote.

If i-voting were to become more widespread – for example, to be available to any absentee voter – then these numbers double, because absentee voters are typically 10% of all voters. If i-voting becomes available to all voters, then we can guess that 5% to 20% of ALL votes can be coerced this way. At that point, we might as well give up elections, and go to coin tossing.
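The chain of estimates above is easy to check; the small sketch below just reproduces the arithmetic, with the click-rate range taken from the Oak Ridge data point and the electorate shares being the rough figures assumed in the text.

```python
# Reproduce the back-of-envelope vote-capture estimates from the text.
click_rates = (0.05, 0.20)  # assumed fraction of targeted voters who click

def captured_share(electorate_share, click_rate):
    """Fraction of ALL votes captured: the targeted share times the click
    rate, halved because roughly half of the victims would have voted for
    the attacker's candidate anyway."""
    return electorate_share * click_rate * 0.5

for label, share in [("UOCAVA voters (~5% of electorate)", 0.05),
                     ("all absentee voters (~10% of electorate)", 0.10)]:
    low, high = (captured_share(share, r) for r in click_rates)
    print(f"{label}: {low:.3%} to {high:.3%} of all votes captured")
```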

Considering the vast sums spent on advertising to influence voters, even for the very limited UOCAVA population, spear phishing seems like a very worthwhile investment for a candidate in a close race.

What We Lose if We Lose Data.gov

In its latest 2011 budget proposal, Congress makes deep cuts to the Electronic Government Fund. This fund supports the continued development and upkeep of several key open government websites, including Data.gov, USASpending.gov and the IT Dashboard. An earlier proposal would have cut the funding from $34 million to $2 million this year, although the current proposal would allocate $17 million to the fund.

Reports say that major cuts to the e-government fund would force OMB to shut down these transparency sites. This would strike a significant blow to the open government movement, and I think it’s important to emphasize exactly why shuttering a site like Data.gov would be so detrimental to transparency.

On its face, Data.gov is a useful catalog. It helps people find the datasets that government has made available to the public. But the catalog is really a convenience that doesn’t necessarily need to be provided by the government itself. Since the vast majority of datasets are hosted on individual agency servers—not directly by Data.gov—private developers could potentially replicate the catalog with only a small amount of effort. So even if Data.gov goes offline, nearly all of the data would still exist online, and a private developer could rebuild a version of the catalog, perhaps with even better features and interfaces.

But Data.gov also plays a crucial behind-the-scenes role, setting standards for open data and helping individual departments and agencies live up to those standards. Data.gov establishes a standard, cross-agency process for publishing raw datasets. The program gives agencies clear guidance on the mechanics and requirements for releasing each new dataset online.

There’s a Data.gov manual that formally documents and teaches this process. Each agency has a lead Data.gov point-of-contact, who’s responsible for identifying publishable datasets and for ensuring that when data is published, it meets information quality guidelines. Each dataset needs to be published with a well-defined set of common metadata fields, so that it can be organized and searched. Moreover, thanks to Data.gov, all the data is funneled through at least five stages of intermediate review—including national security and privacy reviews—before final approval and publication. That process isn’t quick, but it does help ensure that key goals are satisfied.
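As a purely hypothetical illustration of what a “well-defined set of common metadata fields” might look like for a single catalog entry (the field names below are mine, not Data.gov’s actual schema):

```python
# Hypothetical catalog entry; field names are illustrative only and are not
# the real Data.gov metadata schema.
dataset_record = {
    "title": "Example Agency Widget Inspections",
    "agency": "Example Agency",
    "description": "Raw inspection results, updated quarterly.",
    "format": "CSV",
    "access_url": "https://example.gov/data/widget-inspections.csv",
    "last_updated": "2011-04-01",
    "quality_reviewed": True,   # passed information quality review
    "privacy_reviewed": True,   # cleared privacy and national security review
}

# Shared fields across agencies are what make entries searchable and sortable.
print(sorted(dataset_record))
```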

When agency staff have data they want to publish, they use a special part of the Data.gov website, which outside users never see, called the Data Management System (DMS). This back-end administrative interface allows agency points-of-contact to efficiently coordinate publishing activities agency-wide, and it gives individual data stewards a way to easily upload, view and maintain their own datasets.

My main concern is that this invaluable but underappreciated infrastructure will be lost when the IT systems behind it are de-funded. The individual roles and responsibilities, the informal norms and pressures, and perhaps even the tacit authority to put new datasets online would likely disappear as well. The loss of structure would probably mean that far less data gets put online in the future, and the datasets that are published would likely be released in an ad hoc way, lacking the uniformity and quality that the current process creates.

Releasing a new dataset online is already a difficult task for many agencies. While the current standards and processes may be far from perfect, Data.gov provides agencies with a firm footing on which they can base their transparency efforts. I don’t know how much funding is necessary to maintain these critical back-end processes, but whatever Congress decides, it should budget sufficient funds—and direct that they be used—to preserve these critically important tools.