August 18, 2018

When the cookie meets the blockchain

Cryptocurrencies are portrayed as a more anonymous and less traceable method of payment than credit cards. So if you shop online and pay with Bitcoin or another cryptocurrency, how much privacy do you have? In a new paper, we show just how little.

Websites, including shopping sites, typically embed dozens of third-party trackers. These third parties track sensitive details of payment flows, such as the items you add to your shopping cart and their prices, regardless of how you choose to pay. Crucially, we find that many shopping sites leak enough information about your purchase to trackers that the trackers can link it uniquely to the payment transaction on the blockchain. From there, there are well-known ways to further link that transaction to the rest of your Bitcoin wallet addresses. You can protect yourself by using browser extensions such as Adblock Plus and uBlock Origin, and by using Bitcoin anonymity techniques like CoinJoin. These measures help, but we find that linkages are still possible.

 

An illustration of the full scope of our attack. Consider three websites that happen to have the same embedded tracker. Alice makes purchases and pays with Bitcoin on the first two sites, and logs in on the third. Merchant A leaks a QR code of the transaction’s Bitcoin address to the tracker, merchant B leaks a purchase amount, and merchant C leaks Alice’s PII. Such leaks are commonplace today, and usually intentional. The tracker links these three purchases based on Alice’s browser cookie. Further, the tracker obtains enough information to uniquely (or near-uniquely) identify coins on the Bitcoin blockchain that correspond to the two purchases. However, Alice took the precaution of putting her bitcoins through CoinJoin before making purchases. Thus, either transaction individually could not have been traced back to Alice’s wallet, but there is only one wallet that participated in both CoinJoins, and is hence revealed to be Alice’s.

 

Using the privacy measurement tool OpenWPM, we analyzed 130 e-commerce sites that accept Bitcoin payments, and found that 53 of these sites leak transaction details to trackers. Many, but not all, of these leaks are by design, to enable advertising and analytics. Further, 49 sites leak personal identifiers to trackers: names, emails, usernames, and so on. This combination means that trackers can link real-world identities to Bitcoin addresses. To be clear, all of this leaked data is sitting in the logs of dozens of tracking companies, and the linkages can be done retroactively using past purchase data.

On a subset of these sites, we made real purchases using bitcoins that we first “mixed” using the CoinJoin anonymity technique.[1] We found that a tracker that observed two of our purchases — a common occurrence — would be able to identify our Bitcoin wallet 80% of the time. In our paper, we present the full details of our attack as well as a thorough analysis of its effectiveness.
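
The intersection step from the illustration above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not the analysis from our paper: it assumes the tracker has already linked two purchases to the same cookie and has a list of the wallets that took part in each preceding CoinJoin. The function name and wallet labels are hypothetical.

```python
from functools import reduce


def candidate_wallets(coinjoin_participants):
    """Return the wallets that took part in every CoinJoin in the list.

    coinjoin_participants: a list of iterables, one per CoinJoin that the
    tracker has tied to the same browser cookie (hypothetical input format).
    """
    sets = [set(wallets) for wallets in coinjoin_participants]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


# Hypothetical participants in the two CoinJoins that preceded the purchases.
coinjoin_a = ["wallet_alice", "wallet_bob", "wallet_carol"]
coinjoin_b = ["wallet_alice", "wallet_dave", "wallet_erin"]

suspects = candidate_wallets([coinjoin_a, coinjoin_b])
print(suspects)
```

Each CoinJoin on its own leaves three candidates, but only one wallet survives the intersection, which is why observing a second purchase is so damaging.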

Our findings are a reminder that systems without provable privacy properties may have unexpected information leaks and lurking privacy breaches. When multiple such systems interact, the leaks can be even more subtle. Anonymity in cryptocurrencies seems especially tricky, because it inherits the worst of both data anonymization (sensitive data must be publicly and permanently stored on the blockchain) and anonymous communication (privacy depends on subtle interactions arising from the behavior of users and applications).

[1] In this experiment we used 1–2 rounds of mixing. We provide evidence in the paper that while a higher mixing depth decreases the effectiveness of the attack, it doesn’t defeat it. There’s room for a more careful study of the tradeoffs here.

Are you really anonymous online?

As you browse the internet, online advertisers track nearly every site you visit, amassing a trove of information on your habits and preferences. When you visit a news site, they might see you’re a fan of basketball, opera and mystery novels, and accordingly select ads tailored to your tastes. Advertisers use this information to create highly personalized experiences, but they typically don’t know exactly who you are. They observe only your digital trail, not your identity itself, and so you might feel that you’ve retained a degree of anonymity.

In new work with Ansh Shukla, Sharad Goel and Arvind Narayanan, we show that these anonymous web browsing records can in fact often be tied back to real-world identities. (Check out our demo, and see if we can figure out who you are.)

At a high level, our approach is based on a simple observation. Each person has a highly distinctive social network, composed of family and friends from school, work, and various stages of life. As a consequence, the set of links in your Facebook and Twitter feeds is likewise highly distinctive, and clicking on these links leaves a tell-tale mark in your browsing history.

Given only the set of web pages an individual has visited, we determine which social media feeds are most similar to it, yielding a list of candidate users who likely generated that web browsing history. In this manner, we can tie a person’s real-world identity to the near complete set of links they have visited, including links that were never posted on any social media site. This method requires only that a person click on the links appearing in their social media feed, not that they post any content.

Carrying out this strategy involves two key challenges, one theoretical and one engineering. The theoretical problem is quantifying how similar a specific social media feed is to a given web browsing history. One simple similarity measure is the fraction of links in the browsing history that also appear in the feed. This metric works reasonably well in practice, but it overstates similarity for large feeds, since those simply contain more links. We instead take an alternative approach. We posit a stylized, probabilistic model of web browsing behavior, and then compute the likelihood a user with that social media feed generated the observed browsing history. It turns out that this method is approximately equivalent to scaling the fraction of history links that appear in the feed by the log of the feed size.
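
To make the two measures concrete, here is a rough Python sketch. It is not our actual model. It assumes that "scaling by the log of the feed size" means dividing by it, and it uses log(1 + size) to avoid dividing by zero. Both choices are assumptions made for illustration, and the function names and example links are hypothetical.

```python
import math


def naive_similarity(history, feed):
    """Fraction of links in the browsing history that also appear in the feed."""
    history, feed = set(history), set(feed)
    if not history:
        return 0.0
    return len(history & feed) / len(history)


def scaled_similarity(history, feed):
    """Naive similarity divided by the log of the feed size (assumed form)."""
    feed = set(feed)
    if not feed:
        return 0.0
    return naive_similarity(history, feed) / math.log(1 + len(feed))


# Hypothetical data: a small feed matching 3 of 4 history links, and a very
# large feed that matches all 4 simply because it contains so many links.
history = {"a", "b", "c", "d"}
small_feed = {"a", "b", "c", "x"}
large_feed = history | {f"other{i}" for i in range(996)}

print(naive_similarity(history, small_feed), naive_similarity(history, large_feed))
print(scaled_similarity(history, small_feed), scaled_similarity(history, large_feed))
```

The naive measure prefers the large feed, while the scaled measure prefers the small feed whose overlap is much less likely to be a coincidence.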

The engineering challenge is identifying the most similar feeds in real time. Here we turn to Twitter, since Twitter feeds (in contrast to Facebook) are largely public. However, even though the feeds are public, we cannot simply create a local copy of Twitter against which we can run our queries. Instead we apply a series of heuristics to dramatically reduce the search space. We then combine caching techniques with on-demand network crawls to construct the feeds of the most promising candidates. On this reduced candidate set, we apply our similarity measure to produce the final results. Given a browsing history, we can typically carry out this entire process in under 60 seconds.

Our initial tests indicate that for people who regularly browse Twitter, we can deduce their identity from their web browsing history about 80% of the time. Try out our web application, and let us know if it works on you!

All the News That’s Fit to Change: Insights into a corpus of 2.5 million news headlines

[Thanks to Joel Reidenberg for encouraging this deeper dive into news headlines!]

There is no guarantee that a news headline you see online today will be the same tomorrow, or even in the next hour, or that it is the same headline your neighbor sees right now. For a real-life example of the type of change that can happen, consider this explosive headline from NBC News…

“Bernanke: More Execs Deserved Jail for Financial Crisis”

…contrasted with the much more subdued…

“Bernanke Thinks More Execs Should Have Been Investigated”

These headlines clearly suggest different stories, which is worrying because of the effect that headlines have on our perception of the news: a recent survey found that “41 percent of Americans report that they watched, read, or heard any in-depth news stories, beyond the headlines, in the last week.”

As part of the Princeton Web Transparency and Accountability Project (WebTAP), we wanted to understand more about headlines. How often do news publishers change headlines on articles? Do variations offer different editorial slants on the same article? Are some variations ‘clickbait-y’?

To answer these questions we collected ~1.5 million article links seen since June 1st, 2015 on 25 news sites’ front pages through the Internet Archive’s Wayback Machine. Some articles were linked to with more than one headline (at different times or on different parts of the page), so we ended up with a total of ~2.5 million headlines.[1] To clarify, we define headlines as the text linking to articles on the front page of news websites; we are not talking about headlines on the actual article pages themselves. Our corpus is available for download here. In this post we’ll share some preliminary research and outline further research questions.

 

One in four articles had more than one headline associated with it

Our analysis was limited by the number of snapshots the Wayback Machine took of each news site. For the six months of data from 2015 especially, some of the less popular news sites did not have as many daily snapshots as the more popular sites, which might suppress the measured headline variation on less popular websites. Even so, we were able to capture many instances of articles with multiple headlines for each site we looked at.
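
As a rough sketch of how such a count can be made, the Python below groups front-page observations by article link and reports the share of articles seen with more than one headline. The input format, function names, and example.com URLs are hypothetical and do not describe the layout of our corpus.

```python
from collections import defaultdict


def headlines_by_article(observations):
    """Group (article_url, anchor_text) pairs into a set of headlines per article."""
    grouped = defaultdict(set)
    for url, anchor_text in observations:
        grouped[url].add(anchor_text.strip())
    return grouped


def share_with_multiple_headlines(grouped):
    """Fraction of articles that were linked with more than one distinct headline."""
    if not grouped:
        return 0.0
    multi = sum(1 for headlines in grouped.values() if len(headlines) > 1)
    return multi / len(grouped)


# Hypothetical observations collected across several front-page snapshots.
observations = [
    ("https://example.com/bernanke", "Bernanke: More Execs Deserved Jail for Financial Crisis"),
    ("https://example.com/bernanke", "Bernanke Thinks More Execs Should Have Been Investigated"),
    ("https://example.com/weather", "Storm Expected This Weekend"),
    ("https://example.com/weather", "Storm Expected This Weekend"),
    ("https://example.com/sports", "Local Team Wins Opener"),
    ("https://example.com/markets", "Stocks Close Higher"),
]

grouped = headlines_by_article(observations)
print(share_with_multiple_headlines(grouped))
```

Repeated sightings of the same headline collapse into one entry, so only genuinely different anchor texts count as variation.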

 

Clickbait is common, and hasn’t changed much in the last year

We took a first pass at our data using an open source library to classify headlines as clickbait. The classifier was trained by the developer using Buzzfeed headlines as clickbait and New York Times headlines as non-clickbait, so it can more accurately be called a Buzzfeed classifier. Unsurprisingly then, Buzzfeed had the most clickbait headlines detected of the sites we looked at.

But we also discovered that more “traditional” news outlets regularly use clickbait headlines too. The Wall Street Journal, for instance, has used clickbait headlines in place of more traditional headlines for its news stories, as in two variations they tried for an article on the IRS:

‘Think Tax Time Is Tough? Try Being at the IRS’

vs.

‘Wait Times Are Down, But IRS Still Faces Challenges’

Overall, we found that at least 10% of headlines were classified as clickbait on a majority of the sites we looked at. We also found that clickbait does not appear to be any more or less common now than it was in June 2015.

 

Using lexicon-based heuristics we were able to identify many instances of bias in headlines

Identifying bias in headlines is a much harder problem than finding clickbait. One research group from Stanford approached detecting bias as a machine learning problem — they trained a classifier to recognize when Wikipedia edits did or did not reflect a neutral point of view, as identified by thousands of human Wikipedia editors. While Wikipedia edits and headlines differ in some pretty important ways, using their feature set was informative. They developed a lexicon of suspect words, curated from decades of research on biased language. Consider the use of the root word “accuse,” as in this example we found from Time Magazine:

‘Roger Ailes Resigns From Fox News’

vs.

‘Roger Ailes Resigns From Fox News Amid Sexual Harassment Accusations’

The first headline just offers the “who” and “what” of the news story — the second headline’s “accusations” add the much more attention-grabbing “why.” Some language use is more subtle, like in this example from Fox News:

‘DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data’

vs.

‘DNC reportedly punishes Sanders campaign for accessing Clinton voter data’

The facts implied by these headlines are different in a very important way. The second headline, unlike the first, can cause a reader to presuppose that the Sanders campaign did do something wrong or malicious, since it is being punished. The first headline hedges the story significantly, only saying that the Sanders campaign may have done something “improper”; the truth of that proposition is not suggested. The researchers identify this as a bias of entailment.

Using a modified version of the biased-language lexicon, we looked at our own corpus of headlines and identified when headline variations added or dropped these biased words. We found approximately 3000 articles in which headline variations for the same article used different biased words, which you can look at here. From our data collection we clearly have evidence of editorial bias playing a role in the different headlines we see on news sites.
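
The comparison can be sketched roughly as follows in Python. The tiny word list here is a made-up stand-in for the real biased-language lexicon, and exact word matching is a simplification, so treat this as an illustration of the idea rather than our actual heuristic.

```python
import re

# Hypothetical stand-in for the biased-language lexicon (the real one is far larger).
BIAS_LEXICON = {
    "accuse", "accuses", "accused", "accusations",
    "punish", "punishes", "punished",
}


def biased_words(headline, lexicon=BIAS_LEXICON):
    """Return the lexicon words that occur in a headline (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return {token for token in tokens if token in lexicon}


def differs_in_bias(headline_a, headline_b, lexicon=BIAS_LEXICON):
    """True if one headline uses lexicon words that the other does not."""
    return biased_words(headline_a, lexicon) != biased_words(headline_b, lexicon)


first = "DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data"
second = "DNC reportedly punishes Sanders campaign for accessing Clinton voter data"

print(biased_words(first), biased_words(second), differs_in_bias(first, second))
```

A pair of headlines is flagged only when the sets of lexicon words differ, which is the narrow notion of differential bias used here.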

 

Detecting all instances of bias and avoiding false positives is an open research problem

While identifying bias in 3000 articles’ headlines is a start, we think we’ve identified only a fraction of biased articles. One reason for missing bias is that our heuristic defines differential bias narrowly (explicit use of biased words in one headline not present in another). There are also false positives in the headlines that we detected as biased. For instance, an allegation or an accusation might show a lack of neutrality in a Wikipedia article, but in a news story an allegation or accusation may simply be the story.

We know for sure that our data contains evidence of editorial and biased variations in headlines, but we still have a long way to go. We would like to be able to identify at scale and with high confidence when a news outlet experiments with its headlines. But there are many obstacles compared to the previous work on identifying bias in Wikipedia edits:

– Without clear guidelines, finding bias in headlines is a more subjective exercise than finding it in Wikipedia articles.

– Headlines are more information-dense than Wikipedia articles — fewer words in headlines contribute to a headline’s implication.

– Many of the stories that news outlets publish are necessarily more political than most Wikipedia articles.

If you have any ideas on how to overcome these obstacles, we invite you to reach out to us or take a look at the data yourself, available for download here.

@dillonthehuman

[1] Our main measurements ignore subpages like nytimes.com/pages/politics, which appear to often have article links that are at some point featured on the front page. For each snapshot of the front page, we collected the links to articles seen on the page along with the ‘anchor text’ of those links, which are generally the headlines that are being varied.