June 21, 2018

Four cents to deanonymize: Companies reverse hashed email addresses

[This is a joint post by Gunes Acar, Steve Englehardt, and me. I’m happy to announce that Steve has recently joined Mozilla as a privacy engineer while he wraps up his Ph.D. at Princeton. He coauthored this post in his Princeton capacity, and this post doesn’t necessarily represent Mozilla’s views. — Arvind Narayanan.]
 

Datafinder, an email marketing company, charges $0.04 to recover an email address from its hash.

Your email address is an excellent identifier for tracking you across devices, websites and apps. Even if you clear cookies, use private browsing mode or change devices, your email address will remain the same. Due to privacy concerns, tracking companies including ad networks, marketers, and data brokers use the hash of your email address instead, purporting that hashed emails are “non-personally identifying”, “completely private” and “anonymous”. But this is a misleading argument, as hashed email addresses can be reversed to recover original email addresses. In this post we’ll explain why, and explore companies which reverse hashed email addresses as a service.

Email hashes are commonly used to match users between different providers and databases. For instance, if you provide your email to sign up for a loyalty card at a brick-and-mortar store, the store can target you with ads on Facebook by uploading your hashed email to Facebook. Data brokers like Acxiom allow their customers to look up personal data by hashed email addresses. In an earlier study, we found that email tracking companies leak hashed emails to data brokers.
 
How hash functions work
Hash functions take data of arbitrary length and convert it into a random-looking string of fixed length. For instance, the MD5 hash of an email address is a 32-character hexadecimal string such as b58996c504c5638798eb6b511e6f49af. Hashing is commonly used to ensure data integrity, but there are many other uses.

Hash functions such as MD5 and SHA256 have two important properties that are relevant for our discussion: 1) the same input always yields the same output (deterministic); 2) given a hash output, it is infeasible to recover the input (non-invertible). The determinism property allows different trackers to obtain the same hash based on your email address and match your activities across websites, devices, platforms, or online-offline realms.
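The determinism property can be illustrated with a minimal sketch using Python's standard hashlib module. The address below is hypothetical, and the normalization step (trimming whitespace and lowercasing) is an assumption about how trackers typically prepare addresses before hashing, not something specified in this post.

```python
import hashlib

def email_hash(email: str, algorithm: str = "md5") -> str:
    """Return the hex digest of a normalized email address.

    Normalization (strip + lowercase) is an assumed convention.
    """
    normalized = email.strip().lower()
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()

# Hypothetical address, used only for illustration.
address = "jane.doe@example.com"

# Two independent parties hashing the same address get the same identifier.
tracker_a = email_hash(address)
tracker_b = email_hash("  Jane.Doe@Example.com ")

print(tracker_a == tracker_b)              # same input, same hash
print(len(tracker_a))                      # MD5 digests are 32 hex characters
print(len(email_hash(address, "sha256")))  # SHA-256 digests are 64 hex characters
```

Because both parties derive the same string without ever exchanging the address itself, the hash works as a shared identifier across sites and devices.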

However, for hashing to be non-invertible, the number of possible inputs must be so large and unpredictable that all possible combinations cannot be tried. For instance, in a 2012 blog post, Ed Felten, then the FTC’s Chief Technologist, argued that hashing all possible SSNs would take “less time than it takes you to get a cup of coffee”.
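To make the point about small input spaces concrete, here is a hedged sketch that exhaustively searches a deliberately tiny space: six-digit numeric identifiers, one million candidates. The identifier, the function name and the choice of SHA-256 are illustrative assumptions. A nine-digit space such as SSNs is a thousand times larger, but the approach is identical.

```python
import hashlib
from typing import Optional

def brute_force_digits(target_hex: str, num_digits: int) -> Optional[str]:
    """Try every zero-padded number with num_digits digits.

    Returns the input whose SHA-256 digest equals target_hex, or None.
    """
    for n in range(10 ** num_digits):
        candidate = f"{n:0{num_digits}d}"
        digest = hashlib.sha256(candidate.encode("ascii")).hexdigest()
        if digest == target_hex:
            return candidate
    return None

# Hypothetical six-digit identifier that was "anonymized" by hashing.
secret = "042917"
target = hashlib.sha256(secret.encode("ascii")).hexdigest()

print(brute_force_digits(target, 6))  # recovers the original identifier
```

Non-invertibility of the hash function does not help here: the search never inverts anything, it simply hashes every possible input and compares.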

The huge number of possible email addresses makes naively iterating over all possible combinations infeasible. However, the number of existing email addresses is much lower than the number of possible email addresses: a recent estimate puts the total number of email addresses at around 5 billion. That may sound like a lot, but hashing is an extremely fast operation; so fast that one can compute 450 billion MD5 hashes per second on a single Amazon EC2 machine at a cost of $0.0069 per second [1]. That means hashing all five billion existing email addresses would take about ten milliseconds and cost less than a hundredth of a cent.
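The arithmetic behind these figures can be checked with a short calculation. It uses the hash rate and address count quoted above and the hourly instance price from end note [1]. It assumes, as a simplification, that the hourly price can be prorated to the second.

```python
# Figures taken from the text and end note [1].
HASHES_PER_SECOND = 450e9        # MD5 hashes per second on one instance
INSTANCE_PRICE_PER_HOUR = 24.48  # USD, p3.16xlarge, as of March 2018
EXISTING_ADDRESSES = 5e9         # estimated email addresses in existence

# Assumption: cost is prorated per second of use.
price_per_second = INSTANCE_PRICE_PER_HOUR / 3600
seconds_needed = EXISTING_ADDRESSES / HASHES_PER_SECOND
total_cost = seconds_needed * price_per_second

print(f"price per second: ${price_per_second:.4f}")
print(f"time to hash all addresses: {seconds_needed * 1000:.1f} ms")
print(f"total cost: ${total_cost:.6f}")
```

The result is roughly eleven milliseconds of compute and a cost well under $0.0001, that is, under a hundredth of a cent.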
 

Lists of email addresses are widely available
Once an email address is known, it can be hashed and compared against supposedly “anonymous” hashed email addresses. This can be done by marketing or advertising companies that use hashed email addresses as identifiers, or hackers who acquire hashed addresses by other means. Indeed, there are several options to obtain email addresses:

    1. Data breaches: Thanks to a steady stream of data breaches, hundreds of millions of email addresses from existing leaks are publicly available. HaveIBeenPwned, a service that allows users to check if their accounts have been breached, has observed more than 4.9 billion breached accounts. Want to check if your email address is vulnerable to this attack? Use HaveIBeenPwned to determine if any of your email addresses were leaked in a data breach. If they were, an attacker would be able to use data from a breach to recover your email addresses from their hashes [2].
    2. Marketing email lists: Mailing lists with millions of addresses are available for bulk purchase, and often are labeled with privacy invasive categories like religious affiliation, medical conditions or addictions including “Underbanked”, “Financially Challenged”, “Gamblers”, “High Blood Pressure Sufferers in Tallahassee, Florida”, “Anti-Sharia Christian Conservatives”, “Muslim Prime Prospects”. In addition, there are websites that readily share massive lists of email addresses.
    3. Harvesting email addresses from websites, search engines, PGP key servers: There are a number of software solutions available to extract email addresses in bulk from websites, search engines and public PGP key servers.
    4. Guessing email addresses: Email addresses can also be synthetically generated by combining popular names with common address patterns. Past studies achieved recovery rates between 42% and 70% using simple heuristics and limited resources [3]. We believe this can be significantly improved by using neural networks to generate plausible email addresses.
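The matching step common to all of these sources can be sketched as a simple lookup table. Everything below is illustrative: the addresses are made up, the helper names are hypothetical, and the three name patterns are assumptions chosen for the example rather than patterns taken from the studies cited above.

```python
import hashlib
from itertools import product
from typing import Dict, Iterable, Iterator, List, Optional

def md5_hex(email: str) -> str:
    """MD5 hex digest of a trimmed, lowercased address (assumed normalization)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

def build_lookup(candidates: Iterable[str]) -> Dict[str, str]:
    """Precompute a hash -> address table for every candidate address."""
    return {md5_hex(email): email for email in candidates}

def guess_addresses(first_names: List[str], last_names: List[str],
                    domains: List[str]) -> Iterator[str]:
    """Generate synthetic candidates from a few hypothetical patterns."""
    for first, last, domain in product(first_names, last_names, domains):
        yield f"{first}.{last}@{domain}"
        yield f"{first}{last}@{domain}"
        yield f"{first[0]}{last}@{domain}"

# Candidates from a (made-up) leaked or purchased list plus generated guesses.
known_list = ["alice@example.com", "bob@example.org"]
guessed = list(guess_addresses(["carol"], ["smith"], ["example.com"]))
lookup = build_lookup(known_list + guessed)

# "Anonymous" identifiers as a tracker might store them.
hashed_ids = [
    md5_hex("bob@example.org"),             # present in the list
    md5_hex("csmith@example.com"),          # matches a generated guess
    md5_hex("unknown.person@example.net"),  # not covered by any candidate
]

recovered: List[Optional[str]] = [lookup.get(h) for h in hashed_ids]
print(recovered)  # ['bob@example.org', 'csmith@example.com', None]
```

The table is built once and can then reverse any number of hashes with a constant-time lookup each, which is why the size and quality of the candidate list, not the hash function, determines the recovery rate.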

 
Companies reverse email hashes as a service
The hash recovery methods listed above require only very basic technical skills. However, even those skills aren’t needed to reverse hashed data, as several companies reverse email hashes as a service.

Datafinder – Reverse email hashes for $0.04 per email: Datafinder, a company that combines online and offline consumer data, charges $0.04 per email to reverse hashed email addresses. The company promises a 70% recovery rate and for a nominal fee will provide additional information along with the reversed email, including: name, address, city, state, zip and phone number. Datafinder is accredited by the Better Business Bureau with an A+ rating, and its clients include T-Mobile.

In addition to reversing hashed email addresses, Datafinder also provides personal information including name, address and phone number associated with an email address.

 

Infutor – Sub 500-millisecond hashed email “decoding”: Infutor, a consumer identity management company, states “[a]nonymous hashed data can be matched to a database of known hashed information to provide consumer contact information, insights and demographic information”. In one case study, the company claims to have reversed nearly 3MM email addresses. In another case, Infutor set up a near real-time online service to reverse hashed emails for an EU company, which “is able to extract a hashed email from the website visit”. Infutor boasts that they could meet their client’s sub-500 millisecond response time requirement to reverse a given hash.

The Leads Warehouse – “We have cracked the code”: The Leads Warehouse claims that “[they] recover all of your MD5 hashed emails” quickly, securely and cost-effectively through their bizarrely named service “MD5 Reverse Encryption”. Their website reads “[i]n fact, [hashed emails are] designed to be impenetrable and irreversible. Don’t sweat it, though, we have cracked the code.” The Leads Warehouse also sells phone and mailing leads that include Sleep Apnea, Wheelchair Leads and Student Loans lists. For their Ailment & Diabetic Email Lists, they claim they have “amazing filtering options” including length of illness, age, ethnicity, cost of living/hospital expenses.
 

Are hashed email addresses “pseudonymous” data under the GDPR?

In response to our earlier blog post on login manager abuse, a European company official claimed that hashed email addresses are “pseudonymous identifier[s]” and are “compliant with regulations.” The upcoming EU General Data Protection Regulation (GDPR) indeed recognizes pseudonymization as a security measure [4] and considers it as a factor in certain obligations [5]. But can email hashing really be classified as pseudonymization under GDPR?

The GDPR defines pseudonymization as:

“the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;” [6]

For example, if email addresses were encrypted and the key stored separately with additional protections, the encrypted data could be considered pseudonymized under this definition. If there were a breach of the data, the adversary would not be able to recover the email addresses without the key.

However, hashing does not require a key. The additional information needed to reverse hashed email addresses (lists of email addresses, or algorithms that guess plausible email addresses) can be obtained in several ways, as we described above. None of these methods requires additional information that “is kept separately and is subject to technical and organisational measures”. Therefore we argue that email hashing does not fall under GDPR’s definition of pseudonymization.
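The role of a separately held secret can be illustrated with a small sketch. Note that it substitutes a keyed hash (HMAC-SHA256) for the encryption example given above. This is a different technique, used here only because it makes the contrast with plain hashing easy to see. The addresses, key handling and function names are all assumptions for illustration.

```python
import hashlib
import hmac
import secrets

def plain_hash(email: str) -> str:
    """Unkeyed SHA-256: anyone who can guess the input can reproduce it."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def keyed_hash(email: str, key: bytes) -> str:
    """HMAC-SHA256: reproducing it requires the secret key as well."""
    return hmac.new(key, email.encode("utf-8"), hashlib.sha256).hexdigest()

# Assumption: the key is generated once and stored separately from the data.
key = secrets.token_bytes(32)

# The attacker holds a candidate list (made-up addresses) but not the key.
candidates = ["alice@example.com", "bob@example.org"]
attacker_table = {plain_hash(c): c for c in candidates}

victim = "bob@example.org"
print(attacker_table.get(plain_hash(victim)))       # recovered from the plain hash
print(attacker_table.get(keyed_hash(victim, key)))  # None: no match without the key
```

With the plain hash, the candidate list alone is the only "additional information" the attacker needs, and nobody keeps that list separate or protected. With a keyed construction the attacker must also obtain the key, which is the kind of separately held information the definition describes.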

Conclusion
Hashed email addresses can be easily reversed and linked to an individual; therefore, they do not provide any significant protection for the data subjects. The existence of companies that reverse email hashes shows that calling hashed email addresses “anonymous”, “private”, “irreversible” or “de-identified” is misleading and promotes a false sense of privacy. If reversing email hashes were really impossible as claimed, it would cost more than 4 cents.

Even if hashed email addresses were not reversible, they could still be used to match, buy and sell your data between different parties, platforms or devices. As privacy scholars have already argued, when your online profile can be used to target, affect and manipulate you, keeping your true name or email address private may not bear so much significance [7].


Acknowledgements: We thank Brendan Van Alsenoy for his helpful comments.

End notes:

[1]: Hourly price for Amazon EC2 p3.16xlarge instance is $24.48 (as of March 2018).
[2]: HaveIBeenPwned does not share data from breaches, but leaked datasets can be found on underground forums, torrents and file sharing sites.
[3]: See also, Demir et al. The Pitfalls of Hashing for Privacy.
[4]: Article 32, The GDPR.
[5]: Article 6(4)(e), Article 25, Article 89(1), The GDPR.
[6]: Article 4(5), The GDPR.
[7]: See, for instance, “Big Data’s End Run around Anonymity and Consent” (Barocas and Nissenbaum, 2014) and “Singling Out People Without Knowing Their Names – Behavioural Targeting, Pseudonymous Data, and the New Data Protection Regulation” (Zuiderveen Borgesius, 2016).

Comments

  1. That is why we salt. Salt and computationally expensive hash functions are well understood.

    • Gunes Acar says:

      Indeed! But, just to clarify, salting cannot be used when matching email hashes across platforms, devices or vendors, which is the problematic use case we focus on here.

      Say a company has your email address in their CRM database and wants to target you on Facebook with Custom Audiences. They cannot upload salted hashes to Facebook and expect a match.

      • They could upload the salt and the hash, but Facebook might not want to re-hash all its users with this particular salt to find a match … they would be better off mining cryptocurrencies 🙂

  2. What an excellent post. Thank you so much Princeton CITP for all the work you do on behalf of consumers, regular people like my parents and my kid. Such good information here!

    My solution to this problem doesn’t scale really….but it works very well for me and my family. I’m an IT Pro, so I took tactics I’ve used at work for decades and employed them at home. I bought a domain name, I bought an Office 365 subscription, and I’ve created literally dozens of SMTP aliases for use online. My rule is simple: I never give my real email address to any form or bot or website….only aliases are given.

    All aliases lead to one or two folders in Outlook, where they are tagged & sorted.

    Unfortunately, I can’t scale this beyond my family. I wish there was some way based on open protocols like SMTP to do this without a lot of technical skill. I feel a responsibility to help and inform people.

    Thanks again