August 17, 2018

What Are Machine Learning Models Hiding?

Machine learning is eating the world. The abundance of training data has helped ML achieve amazing results for object recognition, natural language processing, predictive analytics, and all manner of other tasks. Much of this training data is very sensitive, including personal photos, search queries, location traces, and health-care records.

In a recent series of papers, we uncovered multiple privacy and integrity problems in today’s ML pipelines, especially (1) online services such as Amazon ML and Google Prediction API that create ML models on demand for non-expert users, and (2) federated learning, aka collaborative learning, that lets multiple users create a joint ML model while keeping their data private (imagine millions of smartphones jointly training a predictive keyboard on users’ typed messages).

Our Oakland 2017 paper, which has just received the PET Award for Outstanding Research in Privacy Enhancing Technologies, concretely shows how to perform membership inference, i.e., determine if a certain data record was used to train an ML model.  Membership inference has a long history in privacy research, especially in genetic privacy and generally whenever statistics about individuals are released.  It also has beneficial applications, such as detecting inappropriate uses of personal data.

We focus on classifiers, a popular type of ML models. Apps and online services use classifier models to recognize which objects appear in images, categorize consumers based on their purchase patterns, and other similar tasks.  We show that if a classifier is open to public access – via an online API or indirectly via an app or service that uses it internally – an adversary can query it and tell from its output if a certain record was used during training.  For example, if a classifier based on a patient study is used for predictive health care, membership inference can leak whether or not a certain patient participated in the study. If a (different) classifier categorizes mobile users based on their movement patterns, membership inference can leak which locations were visited by a certain user.

There are several technical reasons why ML models are vulnerable to membership inference, including “overfitting” and “memorization” of the training data, but they are a symptom of a bigger problem. Modern ML models, especially deep neural networks, are massive computation and storage systems with millions of high-precision floating-point parameters. They are typically evaluated solely by their test accuracy, i.e., how well they classify the data that they did not train on.  Yet they can achieve high test accuracy without using all of their capacity.  In addition to asking if a model has learned its task well, we should ask what else has the model learned? What does this “unintended learning” mean for the privacy and integrity of ML models?

Deep networks can learn features that are unrelated – even statistically uncorrelated! – to their assigned task.  For example, here are the features learned by a binary gender classifier trained on the “Labeled Faces in the Wild” dataset.

While the upper layer of this neural network has learned to separate inputs by gender (circles and triangles), the lower layers have also learned to recognize race (red and blue), a property uncorrelated with the task.

Our more recent work on property inference attacks shows that even simple binary classifiers trained for generic tasks – for example, determining if a review is positive or negative or if a face is male or female – internally discover fine-grained features that are much more sensitive. This is especially important in collaborative and federated learning, where the internal parameters of each participant’s model are revealed during training, along with periodic updates to these parameters based on the training data.

We show that a malicious participant in collaborative training can tell if a certain person appears in another participant’s photos, who has written the reviews used by other participants for training, which types of doctors are being reviewed, and other sensitive information. Notably, this leakage of “extra” information about the training data has no visible effect on the model’s test accuracy.

A clever adversary who has access to the ML training software can exploit the unused capacity of ML models for nefarious purposes. In our CCS 2017 paper, we show that a simple modification to the data pre-processing, without changing the training procedure at all, can cause the model to memorize its training data and leak it in response to queries. Consider a binary gender classifier trained in this way.  By submitting special inputs to this classifier and observing whether they are classified as male or female, the adversary can reconstruct the actual images on which the classifier was trained (the top row is the ground truth):

Federated learning, where models are crowd-sourced from hundreds or even millions of users, is an even juicier target. In a recent paper, we show that a single malicious participant in federated learning can completely replace the joint model with another one that has the same accuracy but also incorporates backdoor functionality. For example, it can intentionally misclassify images with certain features or suggest adversary-chosen words to complete certain sentences.

When training ML models, it is not enough to ask if the model has learned its task well.  Creators of ML models must ask what else their models have learned. Are they memorizing and leaking their training data? Are they discovering privacy-violating features that have nothing to do with their learning tasks? Are they hiding backdoor functionality? We need least-privilege ML models that learn only what they need for their task – and nothing more.

This post is based on joint research with Eugene Bagdasaryan, Luca Melis, Reza Shokri, Congzheng Song, Emiliano de Cristofaro, Deborah Estrin, Yiqing Hua, Thomas Ristenpart, Marco Stronati, and Andreas Veit.

Thanks to Arvind Narayanan for feedback on a draft of this post.

Can Classes on Field Experiments Scale? Lessons from SOC412

Last semester, I taught a Princeton undergrad/grad seminar on the craft, politics, and ethics of behavioral experimentation. The idea was simple: since large-scale human subjects research is now common outside universities, we need to equip students to make sense of that kind of power and think critically about it.

In this post, I share lessons for teaching a class like this and how I’m thinking about next year.

Path diagram from SOC412 lecture on the Social Media Color Experiment

[Read more…]

Demystifying The Dark Web: Peeling Back the Layers of Tor’s Onion Services

by Philipp Winter, Annie Edmundson, Laura Roberts, Agnieskza Dutkowska-Żuk, Marshini Chetty, and Nick Feamster

Want to find US military drone data leaks online? Frolick in a fraudster’s paradise for people’s personal information? Or crawl through the criminal underbelly of the Internet? These are the images that come to most when they think of the dark web and a quick google search for “dark web” will yield many stories like these. Yet, far less is said about how the dark web can actually enhance user privacy or overcome censorship by enabling anonymous browsing through Tor. Recently, for example, Brave, dedicated to protecting user privacy, integrated Tor support to help users surf the web anonymously from a regular browser. This raises questions such as: is the dark web for illicit content and dealings only? Can it really be useful for day-to-day web privacy protection? And how easy is it to use anonymous browsing and dark web or “onion” sites in the first place?

To answer some of these pressing questions, we studied how Tor users use onion services. Our work will be presented at the upcoming USENIX Security conference in Baltimore next month and you can read the full paper here or the TLDR version here.

What are onion services?: Onion services were created by the Tor project in 2004. They not only offer privacy protection for individuals browsing the web but also allow web servers, and thus websites themselves, to be anonymous. This means that any “onion site” or dark web site cannot be physically traced to identify those running the site or where the site is hosted. Onion services differ from conventional web services in four ways. First, they can only be accessed over the Tor network. Second, onion domains, (akin to URLs for the regular web), are hashes over their public key and consist of a string of letters and numbers, which make them long, complicated, and difficult to remember. These domains sometimes contain prefixes that are human-readable but they are expensive to generate (e.g. torprojectqyqhjn.onion). We refer to these as vanity domains. Third, the network path between the client and the onion service is typically longer, meaning slower performance owing to longer latencies. Finally, onion services are private by default, meaning that to find and use an onion site, a user has to know the onion domain, presumably by finding this information organically, rather than with a search engine.

What did we do to investigate how Tor users make use of onion services?: We conducted a large scale survey of 517 Tor users and interviewed 17 Tor users in depth to determine how users perceive, use, and manage onion services and what challenges they face in using these services. We asked our participants about how they used Tor’s onion services and how they managed onion domains. In addition, we asked users about their expectations of privacy and their privacy and security concerns when using onion services. To compliment our qualitative data, we analyzed “leaked” DNS lookups to onion domains, as seen from a DNS root server. This data gave us insights into actual usage patterns to corroborate some of the findings from the interviews and surveys. Our final sample of participants were young, highly educated, and comprised of journalists, whistleblowers, everyday users wanting to protect their privacy to those doing competitive research on others and wanting to avoid being “outed”. Other participants included activists and those who wanted to avoid government detection for fear of persecution or worse.

What were the main findings? First, unsurprisingly, onion services were mostly used for anonymity and security reasons. For instance, 71% of survey respondents reported using onion services to protect their identity online. Almost two thirds of the survey respondents reported using onion services for non-browsing activities such as TorChat, a secure messaging app built on top of onion services. 45% of survey participants had other reasons for using Tor such as to help educate users about the dark web or for their personal blogs. Only 27% of survey respondents reported using onion services to explore the dark web and its content “out of curiosity”.

Second, users had a difficult time finding, tracking, and saving onion links. Finding links: Almost half of our survey respondents discovered onion links through social media such as Twitter or Reddit or by randomly encountering links while browsing the regular web. Fewer survey respondents discovered links through friends and family. Challenges users mentioned for finding onion services included:

  • Onion sites frequently change addresses and so often onion domain aggregators have broken and out of date links.
  • Unlike traditional URLS, onion links give no indication of the content of the website so it is difficult to avoid potentially offensive or illicit content.
  • Again, unlike traditional URLS, participants said it is hard to determine through a glance at the address bar if a site is the authentic one you are trying to reach instead of a phishing site.

A frequent wish expressed by participants was for a better search engine that is more up to date and gives an indication of the content before one clicks on the link as well as authenticity of the site itself.

Tracking and Saving links: To track and save complicated onion domains, many participants opted to bookmark links but some did not want to leave a trace of websites they visited on their machines. The majority of other survey respondents had ad-hoc measures to deal with onion links. Some memorized a few links and did so to protect privacy by not writing the links down. However, this was only possible for a few vanity domains in most cases. Others just navigated to the places where they found the links in the first place and used the links from there to open the websites they needed.

Third, onion domains are also hard to verify as authentic. Vanity domains: Users appreciated vanity domains where onion services operators have taken extra effort and expense to set up a domain that is almost readable such as the case of Facebook’s onion site, facebookcorewwwi.onion. Many participants liked the fact that vanity domains give more indication of the content of the domain. However, our participants also felt vanity domains could lead to more phishing attacks since people would not try to verify the entire onion domain but only the readable prefix. “We also get false expectations of security from such domains. Somebody can generate another onion key with same facebookcorewwwi address. It’s hard but may be possible. People who believe in uniqueness of generated characters, will be caught and impersonated.” – Participant S494

Verification Strategies: Our participants had a variety of strategies such as cutting and pasting links, using bookmarks, or verifying the address in the address bar to check the authenticity of a website. Some checked for a valid HTTPS certificate or familiar images in the website. However, a over a quarter of our survey respondents reported that they could not tell if a site was authentic (28%) and 10% did not even check for authenticity at all. Some lamented this is innate to the design of onion services and that there is not real way to tell if an onion service is authentic epitomized by a quote from Participant P1: “I wouldn’t know how to do that, no. Isn’t that the whole point of onion services? That people can run anonymous things without being able to find out who owns and operates them?”

Fourth, onion lookups suggest typos or phishing. In our DNS dataset, we found similarities between frequently visited popular onion sites such as Facebook’s onion domain and similar significantly less frequently visited websites, suggesting users were making typos or potentially that phishing sites exist. Of the top 20 onion domains we encountered in our data set, 16 were significantly similar to at least one other onion domain in the data set. More details are available in the paper.

What do these findings mean for Tor and onion services? Tor and onion services do have a part to play in helping users to protect their anonymity and privacy for reasons other than those usually associated with a “nefarious” dark web such as support for those overcoming censorship, stalking, and exposing others’ wrong-doing or whistleblowing. However, to better support these uses of Tor and onion services, our users wanted onion service improvements. Desired improvements included more support for Tor in general in browsers, improvement in performance, improved privacy and security, educational resources on how to use Tor and onion services, and finally improved onion services search engines. Our results suggest that to enable more users to make use of onion services, users need:

  • better security indicators to help them understand Tor and onion services are working correctly
  • automatic detection of phishing in onion services
  • opt in publishing of onion domains to improve search for legitimate and legal content
  • better ways to track and save onion links including privacy preserving onion bookmarking.

Future studies to further demystify the dark web are warranted and in our paper we make suggestions for more work to understand the positive aspects of the dark web and how to support privacy protections for everyday users.

You can read more about our study and its limitations here (such as the fact our participants were self-selected and may not represent those who do use the dark web for illicit activities for instance) or skim the paper summary.