Verizon’s practice of injecting a unique ID into the HTTP headers of traffic originating on their wireless network has alarmed privacy advocates and researchers. Jonathan Mayer detailed how this header is already being used by third-parties to create zombie cookies. In this post, I summarize just how much information Verizon collects and shares under their marketing programs. I’ll show how the implementation of the header makes previous tracking methods trivial and explore the possibility of a more secure design.
The process by which Verizon can share data with third-parties using the header has been described in a patent and discussed in several previous posts. As a summary, Verizon has two marketing programs: Relevant Mobile Advertising and Verizon Selects, and also seems to have an authentication program in the works. Here’s what Verizon has stated it may share via the UIDH:
For marketing purposes, Verizon uses the UIDH as a publicly known unique identifier that it will associate advertising segments to through its data sharing API. The information used to generate these segments depends on the user’s settings. By default, the Relevant Mobile Advertising (RMA) program is opt-out, meaning user data will be shared unless a user explicitly forbids it. Under RMA, Verizon will use information from the subscribers account (e.g. age, gender, location) as well as “interest categories obtained from other companies” to build the segments. So, not only is Verizon using subscriber information to build segments, they are also using it to purchase additional information from data brokers? This is rather unsettling for an opt-out program. Additionally, if users opt-in to the Verizon Selects program, Verizon has stated they will use behavioral data (including app usage, web browsing, and device location history) to enable marketers to deliver “more relevant information”.
Verizon describes several other uses for the header which relate to supporting an authentication mechanism for sites, providing customer information for form filling, or verifying that specific information is valid according to the customer records they hold. These systems are designed around the fact that Verizon completely controls the placement of headers and thus the headers can be relied upon as a proxy for the user’s identity allowing the service to query it for extra information or verification of user provided information. It is unclear how much user interaction Verizon would require under each of these use cases as it doesn’t appear that any have been deployed.
Verizon’s header makes tracking much easier
… and that’s before any additional data is transmitted. Since the header is injected into all HTTP requests originating on their mobile network, all recipients receive the same header, regardless of their business relationship with Verizon.
Verizon assures that they “change the UIDH on a regular basis to protect the privacy of [their] customers”, but that simply isn’t enough. For traffic that includes a UIDH value, many attacks against a user’s privacy are trivially achievable.
Our study shows how the pervasiveness of cookies can enable a network adversary, such as the NSA, who passively collects HTTP traffic to piece together a user’s browsing history. This attack is now much easier; instead of trying to connect page visits with overlapping cookie values, an adversary can simply link traffic by UIDH value, knowing each value is a single user.
The current system of browser cookies forces at least some portion of the interaction between trackers to be visible in the browser (for example, cookie syncing). This has allowed some interesting studies to happen, such as a measurement of ad bid prices done by Inria. The Verizon UIDH follows a seemingly emergent trend of a consolidation of identifiers and identities on the web. [5] The Verizon header eliminates any need for client-side interactions, and instead, all server-server interactions could occur in secret.
Typically, trackers need some persistent state on the user’s browser or a persistent fingerprint to keep a profile of the user’s activity. This can prove difficult as a user clears cookies or as browser fingerprints slowly drift (although attacks like canvas fingerprinting may make them more robust). The UIDH value allows a tracker to only need an identifier to remain constant immediately before and after Verizon updates that user’s UIDH, meaning users have to time their cookie clearing precisely to Verizon’s unknown update schedule. [6]
As Jonathan showed in his blog post, that UIDH value can even be used to respawn alternate IDs on the user’s machine. I did some checks with our own infrastructure and have released the code and data. I used OpenWPM to drive Firefox to visit the top 500 Alexa sites with different combinations of UIDH values inserted into all HTTP traffic via the mitmproxy instance built into the infrastructure. By seeing which cookies were set for some combination of UIDH headers but not others, I was able to extract the values that may be tied to the UIDH value and manually inspect them. The results were very similar to those discussed in Jonathan’s post.
How to reduce the privacy impact
Let’s explore what Verizon can do to reduce the impact on privacy? The UIDH header is currently helpful in three different scenarios: (1) a passive network adversary looking to conduct surveillance on the network, (2) a tracker who sees the UIDH but does not utilize the data API, and (3) a tracker who sees the UIDH and does utilize the data API.
Case (1) and (2) currently benefit from the UIDH values even though they do not interact with Verizon. A simple way to prevent the header from being useful in these cases is for Verizon to change the header with each HTTP request. They could store a short term mapping of tokens to subscribers to support any API calls, or could trade computation for space and implement something along the lines of E(user ID | nonce). Either way, the information would not be useful to parties that do not have a business relationship with Verizon.
Under this new model, it may seem that trackers who utilize Verizon’s API will be forced to contact Verizon with each new request. However, if trackers rely on their own tracking they only need to make requests for each new visitor they see, or when they feel the segment data may need updating. This makes sense from both a privacy perspective and a business perspective; it prevents “freeloading” on the Verizon header and only makes it useful for those who have business relationships with Verizon. Similarly, this design would continue to support the use cases detailed above.
Is it enough?
I am not convinced that all of the privacy issues introduced by the header can be solved without eliminating it, even with the improvements discussed in this post. The problem is more fundamental than the header simply remaining constant. The UIDH pipeline introduces a consistent state (i.e. the user’s information) into the otherwise dynamic nature of web tracking. Although browser fingerprinting does this to some extent, users can change their device or lie about their browser properties.
Demographic data is identifying, and we don’t know to what extent the Verizon segments detailed in the above figure reveal personal information. Unlike demographic data that may be built and sold by data brokers, the Verizon demographic data is pulled directly from subscriber information and is distributed directly to its partners; there is no possibility for incorrect or incomplete information. A combination of Verizon advertising segment information and device information could very well provide enough information to uniquely identify users, once again enabling persistent tracking.
Like Verizon, internet service providers are in a position to have a very clear picture of a user’s online life, though they are typically not thought of as “tracking” companies. Historically, ISPs have not shared any information with marketers, but are in a clear position to be the ultimate trackers. Active ISP participation in the data market significantly reduces traditional client-side protections (such as ad and script blocking), as any HTTP data is fair game. Even if a users attempts to block connections to third-party domains, the ISP will still be able to sell data based on the user’s visits to any HTTP pages. This adds to the mounting evidence Internet traffic should be transmitted over HTTPS, giving users at least some control over who can view their data.
Special thanks to Arvind Narayanan, Jonathan Mayer, and Jacob Hoffman-Andrews for sharing their comments with me, and additionally to Jonathan for sharing his research. Any errors are, of course, my own.
[1] The Relevant Mobile Advertising (RMA) program and its use in connection with the header is discussed in the UIDH FAQ, the RMA FAQ, and this blog post.
[2] The RMA FAQ states: “In addition, we will use an anonymous, unique identifier we create when you register on our websites. This may allow an advertiser to use information they have about your visits to online websites to deliver marketing messages to mobile devices on our network.” From this statement it isn’t entirely clear how this unique identifier is used or if it is transmitted as part of the UIDH communication.
[3] Verizon Selects is described in this blog post, and its use with UIDH is described in the FAQ.
[4] The authentication and verification systems are described in a patent and mentioned in the UIDH FAQ, but we haven’t found any evidence of current deployments.
[5] Examples include: Apple’s IDFA, onboarding, Drawbridge’s device linking, Facebook’s instant personalization, cookie syncing
[6] For a good explanation of how to achieve persistent tracking using the header, see Jacob Hoffman-Andrews’ blog post on the EFF blog.
The cloud icon in the above diagram was created by Dmitry Baranovkiy on the Noun Project
Just one more reason to eliminate HTTP entirely in favour of HTTPS, I guess. I wasn’t sure that was necessary, but if this sort of thing is going to happen, it obviously is.
Even with https the ISP is getting a lot of metadata.
What’s interesting to me for the moment is that the ISPs have the ability to do all the tracking they want without diddling the headers. The bits are going from a known subscriber on their network to a known address, and that can be stripped and logged. Is there any plausible purpose for putting the header in, other than providing easy tracking services to outside companies?