September 19, 2018

When the business model *is* the privacy violation

Sometimes, when we worry about data privacy, we’re worried that data might fall into the wrong hands or be misused for unintended purposes. If I’m considering participating in a medical study, I’d want to know if insurance companies will obtain the data and use it against me. In these scenarios, we should look for ways to preserve the intended benefit while preventing unintended uses. In other words, achieving utility and privacy is not a zero-sum game. [1]

In other situations, the intended use is the privacy violation. The most prominent example is the tracking of our online and offline habits for targeted advertising. This business model is exactly what people object to, for a litany of reasons: targeting is creepy, manipulative, discriminatory, and reinforces harmful stereotypes. The data collection that enables targeted advertising involves an opaque surveillance infrastructure to which it’s impossible to give meaningfully informed consent, and the resulting databases give a few companies too much power over individuals and over democracy. [2]

In response to privacy laws, companies have tried to find technical measures that obfuscate the data but allow them to carry on with the surveillance business as usual. But that’s just privacy theater. Technical steps that don’t affect the business model are of limited effectiveness, because the business model is fundamentally at odds with privacy; this is in fact a zero-sum game. [3]

For example, there’s an industry move to replace email addresses and other personal identifiers with hashed versions. But a hashed identifier is nevertheless a persistent, unique identifier that allows linking a person across databases, devices, and contexts, as well as targeting and manipulation on the basis of the associated data. Thus, hashing completely fails to address the underlying privacy concerns.
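To make the linking concern concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any real company's pipeline: the two datasets, the function names `hash_email` and `link_profiles`, the choice of SHA-256, and the normalization step (trimming whitespace and lowercasing) are all hypothetical. The point is only that two parties who hash the same email address the same way end up with the same key, so their records can be joined without either side ever seeing the address.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize an email address and return its SHA-256 hex digest.

    The normalization (strip + lowercase) is an assumption made for this sketch.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two hypothetical datasets held by different companies, each keyed by hashed email.
retailer_purchases = {
    hash_email("alice@example.com"): ["vitamins", "running shoes"],
    hash_email("bob@example.com"): ["headphones"],
}
ad_network_browsing = {
    hash_email("Alice@Example.com "): ["mortgage calculator", "baby names"],
    hash_email("carol@example.com"): ["hiking boots"],
}


def link_profiles(a: dict, b: dict) -> dict:
    """Join two datasets on the hashed identifier they have in common."""
    return {h: (a[h], b[h]) for h in a.keys() & b.keys()}


# The same person appears in both datasets under the same hash,
# so her purchase and browsing records can be combined.
linked = link_profiles(retailer_purchases, ad_network_browsing)
for hashed_id, (purchases, browsing) in linked.items():
    print(hashed_id[:12], purchases, browsing)
```

The join succeeds for the one person who appears in both datasets even though neither dataset stores a readable email address, which is why the hash works as a persistent identifier.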

Policy makers and privacy advocates must recognize when privacy is a zero-sum game and when it isn’t. Policy makers like non-zero-sum games because they can simultaneously satisfy different stakeholders. But they must acknowledge that sometimes this isn’t possible. In such cases, laws and regulations should avoid loopholes that companies might exploit by building narrow technical measures and claiming to be in compliance. [4]

Privacy advocates should recognize that framing a concern about data use practices as a privacy problem is a double-edged sword. Privacy can be a convenient label for a set of related concerns, but it gives industry a way to deflect attention from deeper ethical questions by interpreting privacy narrowly as confidentiality.

Thanks to Ed Felten and Nick Feamster for feedback on a draft.

[1] There is a vast computer science privacy literature predicated on the idea that we can have our cake and eat it too. For example, differential privacy seeks to enable analysis of data in the aggregate without revealing individual information. While there are disagreements on the specifics, such as whether de-identification results in a win-win outcome, there is no question that the overall direction of privacy-preserving data analysis is an important one.

[2] In Mark Zuckerberg’s congressional testimony, he framed Facebook’s privacy woes as being about improper third-party access to the data. This is arguably a non-zero-sum game, and one that Facebook is equipped to address without the need for legislation. However, the much bigger privacy problem is Facebook’s own data collection and business model, which is inherently at odds with privacy and is unlikely to be solved without legislation.

[3] There are research proposals for targeted advertising, such as Adnostic, that would improve privacy by drastically changing the business model, largely cutting out the tracking companies. Unsurprisingly, there has been no interest in these approaches from the traditional ad tech industry, but some browser vendors have experimented with similar ideas.

[4] As an example of avoiding the hashing loophole, the 2012 FTC privacy report is well written: it says that for data to be considered de-identified, “the company must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer, computer, or other device.” It goes on to say that “reasonably” includes reasonable assumptions about the use of external data sources that might be available.

What’s new with BlockSci, Princeton’s blockchain analysis tool

Six months ago we released the initial version of BlockSci, a fast and expressive tool to analyze public blockchains. In the accompanying paper we explained how we used it to answer scientific questions about security, privacy, miner behavior, and economics using blockchain data. BlockSci has a number of other applications including forensics and as an educational tool.

Since then we’ve heard from a number of researchers and developers who’ve found it useful, and there’s already a published paper on ransomware that has made use of it. We’re grateful for the pull requests and bug reports on GitHub from the community. We’ve also used it to deep-dive into some of the strange corners of blockchain data. We’ve made enhancements including a 5x speed improvement over the initial version (which was already several hundred times faster than previous tools).

Today we’re happy to announce BlockSci 0.4.5, which has a large number of feature enhancements and bug fixes. As just one example, Bitcoin’s SegWit update introduces the concept of addresses that have different representations but are equivalent; some other tools are confused by this and return incorrect (or at least unexpected) values for the balance held by such addresses. BlockSci handles these nuances correctly. We think BlockSci is now ready for serious use, although it is still beta software. Here are a number of ideas on how you can use it in your projects or contribute to its development.

We plan to release talks and tutorials on BlockSci, and improve its documentation. I’ll give a brief talk about it at the MIT Bitcoin Expo this Saturday; then Harry Kalodner and Malte Möser will join me for a BlockSci tutorial/workshop at MIT on Monday, March 19, organized by the Digital Currency Initiative and Fidelity Labs. Videos of both events will be available.

We now have two priorities for the development of BlockSci. The first is to make it possible to implement almost all analyses in Python with the speed of C++. To enable this we are building a function composition interface to automatically translate Python to C++. The second is to better support graph queries and improved clustering of the transaction graph. We’ve teamed up with our colleagues in the theoretical computer science group to adapt sophisticated graph clustering algorithms to blockchain data. If this effort succeeds, it will be a foundational part of how we understand blockchains, just as PageRank is a fundamental part of how we understand the structure of the web. Stay tuned!

Website operators are in the dark about privacy violations by third-party scripts

by Steven Englehardt, Gunes Acar, and Arvind Narayanan.

Recently we revealed that “session replay” scripts on websites record everything you do, like someone looking over your shoulder, and send it to third-party servers. This en-masse data exfiltration inevitably scoops up sensitive, personal information — in real time, as you type it. We released the data behind our findings, including a list of 8,000 sites on which we observed session-replay scripts recording user data.

As one case study of these 8,000 sites, we found health conditions and prescription data being exfiltrated from one of them. These are considered Protected Health Information under HIPAA. The number of affected sites is immense; contacting all of them and quantifying the severity of the privacy problems is beyond our means. We encourage you to check out our data release and hold your favorite websites accountable.

Student data exfiltration on Gradescope

As one example, a pair of researchers at UC San Diego read our study and then noticed that Gradescope, a website they used for grading assignments, embeds FullStory, one of the session replay scripts we analyzed. We investigated, and sure enough, we found that student names and emails, student grades, and instructor comments on students were being sent to FullStory’s servers. This is considered Student Data under FERPA (US educational privacy law). Ironically, Princeton’s own Information Security course was also affected. We notified Gradescope of our findings, and they removed FullStory from their website within a few hours.