
Archives for August 2016

Routing Detours: Can We Avoid Nation-State Surveillance?

Since 2013, Brazil has taken significant steps to build out its networking infrastructure to thwart nation-state mass surveillance.  For example, the country is deploying a 3,500-mile fiber cable from Fortaleza, Brazil to Portugal; it has switched its government email system from Microsoft Outlook to a state-built system called Expresso; and it now has the largest IXP ecosystem in the world.  All of these measures aim to prevent the country’s Internet traffic from traversing the United States, thereby preventing the United States from conducting surveillance on its citizens’ data.  But Brazil isn’t the only country that has concerns about its Internet traffic passing through the United States.  Deutsche Telekom has lobbied for tougher privacy protections that would keep German traffic within national borders.  Canadian traffic has been found to routinely pass through the United States, which is a violation of Canadian network sovereignty.  Russian president Putin has called for “better protection of communication networks,” and Russia has passed a law that requires foreign companies to keep Russian users’ data on servers inside the country.

To quantify which countries Internet traffic traverses, and to measure how successful any particular country might be at detouring its traffic around known surveillance states, we actively measured and analyzed traffic originating in five countries: Brazil, the Netherlands, Kenya, India, and the United States.

  • First, to understand the current state of transnational routing (the “status quo”), we measured the country-level traffic paths for the Alexa Top 100 domains in each respective country using RIPE Atlas probes and the MaxMind geolocation service (a rough sketch of this country-level mapping appears after this list).
  • Next, we measured how successful clients in Brazil, the Netherlands, Kenya, India, and the United States might be at avoiding other countries of interest, using open DNS resolvers and an overlay network.
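
Below is a minimal sketch of that first step, not our actual measurement pipeline: it maps the hop IPs of a single traceroute (for example, one collected from a RIPE Atlas probe) to an ordered country-level path. It assumes the geoip2 Python library and a local MaxMind GeoLite2-Country database; the hop IPs in the example are hypothetical.

```python
# Minimal sketch: turn one traceroute's hop IPs into a country-level path,
# collapsing consecutive hops that geolocate to the same country.
# Assumes the geoip2 library and a local MaxMind GeoLite2-Country database.
import geoip2.database
import geoip2.errors

def country_path(hop_ips, db_path="GeoLite2-Country.mmdb"):
    countries = []
    with geoip2.database.Reader(db_path) as reader:
        for ip in hop_ips:
            try:
                code = reader.country(ip).country.iso_code  # e.g., "BR", "US"
            except geoip2.errors.AddressNotFoundError:
                code = None  # private or otherwise unmapped address
            if code and (not countries or countries[-1] != code):
                countries.append(code)
    return countries

# Hypothetical hop list from a probe in Brazil toward a popular .br domain:
print(country_path(["200.160.0.8", "4.69.147.13", "195.66.224.21"]))
```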

The rest of this post summarizes these two parts of the study and highlights some of the results.

The Status Quo: Even Local Traffic Can Detour through Surveillance States

Despite the extreme efforts of certain countries to “keep local traffic local”, and in particular to avoid having traffic traverse the United States, our measurement study indicates that these goals have not yet been reached, for two reasons: 1) lack of domain hosting diversity and 2) lack of routing diversity.

Lack of Hosting Diversity. We find that hosting for many popular websites lacks diversity: about half of the Alexa Top 100 domains are hosted in a single country, and in these cases a user cannot avoid the domain’s hosting country when accessing it.  In many cases, even popular local websites are hosted outside the country where citizens are trying to access them.  For example, more than 50% of the top domains in Brazil and India are hosted in the United States; in total, about 50% of the .br domains are hosted outside Brazil. More hosting diversity, as could be enabled by CDNs, would make it possible to avoid more countries more often.

Lack of Geographic Diversity. Internet paths also lack geographic diversity: about half of the paths originating in Kenya to the most popular Kenyan websites traverse the United States or Great Britain.  Much of this phenomenon is due to “tromboning,” whereby an Internet path starts and ends in a country yet transits an intermediate country; for example, about 13% of the paths that we explored from RIPE Atlas probes in Brazil to the top domains in Brazil trombone through the United States. Similarly, more than 50% of the paths from the Netherlands to its top domains transit the United States.

Towards User-Controlled Routing Detours

We next asked whether the fact that many popular websites are georeplicated, coupled with a client’s ability to selectively “bounce” packets through overlay nodes, might give some users opportunities to avoid certain countries. We studied whether users could exploit open DNS resolvers to discover hosting diversity, and overlay network relays to intentionally introduce routing detours. Previous work on overlay networks, such as RON, tries to route around failures, whereas our work tries to route around countries.  Our results show that in some cases users can select paths that specifically avoid certain countries; in cases where local traffic leaves the country only to return (the tromboning phenomenon described above), the use of local relays can sometimes ensure that local traffic stays within the country.  For example, without these techniques, Brazilian traffic transited Spain, Italy, France, Great Britain, Argentina, and Ireland, among others. Using even a modest overlay network deployment of 12 relays across 10 countries (Brazil, United States, Ireland, Germany, Spain, France, Singapore, Japan, South Korea, and Australia), clients in Brazil could completely avoid these countries for the top 100 domains.  The overlay network can also be used to keep local traffic local: with it, the share of tromboning paths from Brazil decreases from 13.2% of domestic paths to 9.7%.
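
As a rough illustration of the hosting-discovery step (not our measurement code), the sketch below queries a few open DNS resolvers for a domain’s A records and geolocates the answers; with resolvers spread across many countries, a client can learn whether a domain has replicas that let it avoid a particular country. It assumes the dnspython and geoip2 libraries, and the resolver addresses are well-known public resolvers standing in for a larger, country-diverse list.

```python
# Minimal sketch: ask open DNS resolvers for a domain's A records and geolocate
# the answers, to see whether the domain has replicas outside a country a
# client wants to avoid. Assumes the dnspython and geoip2 libraries.
import dns.resolver
import geoip2.database

OPEN_RESOLVERS = {"resolver-1": "8.8.8.8", "resolver-2": "9.9.9.9"}  # placeholders

def replica_countries(domain, db_path="GeoLite2-Country.mmdb"):
    found = {}
    with geoip2.database.Reader(db_path) as reader:
        for label, resolver_ip in OPEN_RESOLVERS.items():
            res = dns.resolver.Resolver(configure=False)
            res.nameservers = [resolver_ip]
            answers = res.resolve(domain, "A")  # dnspython >= 2.0
            found[label] = {
                reader.country(rr.address).country.iso_code for rr in answers
            }
    return found

print(replica_countries("example.com"))
```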

Unfortunately, some of the more prominent surveillance states are also some of the least avoidable countries.  Most countries depend heavily on the United States for connectivity to other locations on the Internet.  Neither Brazil, India, Kenya, nor the Netherlands can completely avoid the United States with these country-avoidance techniques.  The inability of these techniques to avoid the United States typically results from the lack of hosting diversity for many websites, which are hosted solely in the United States. Using the overlay network, both Brazilian and Dutch clients were able to avoid the United States for about 65% of sites; even with these techniques, the United States is completely unavoidable for about 10% of sites.  Traffic from Kenya can avoid the United States for only about 40% of the top domains.  On the other hand, the United States can avoid every other country for all sites, with the exception of France and the Netherlands, which it can nonetheless avoid for 99% of the top 100 domains.

More Information and Next Steps

A more detailed writeup is available on the RANSOM project website (https://ransom.cs.princeton.edu/). Encouraged by the ability to use overlay networks to avoid surveillance states in certain cases, we are in the process of designing and building a RANSOM prototype. We welcome feedback on this project as we embark on the next steps.

Differential Privacy is Vulnerable to Correlated Data — Introducing Dependent Differential Privacy

[This post is joint work with Princeton graduate student Changchang Liu and IBM researcher Supriyo Chakraborty. See our paper for full details. — Prateek Mittal ]

The tussle between data utility and data privacy

Information sharing is important for realizing the vision of a data-driven customization of our environment. Data that were earlier locked up in private repositories are now increasingly shared to enable new context-aware applications, better monitoring of population statistics, and academic research in diverse fields. However, sharing personal data gives rise to serious privacy concerns, as the data can contain sensitive information that a user might want to keep private. Thus, while it is imperative to release utility-providing information, the privacy of the users whose data is being shared also needs to be protected.

What is differential privacy?

Among existing data privacy metrics, differential privacy (DP) has emerged as a rigorous mathematical framework for defining and preserving privacy, and it has received considerable attention from the privacy community. DP is used to protect the privacy of individual users’ data while publishing aggregate query responses over statistical databases; it guarantees that the distribution of query responses changes only slightly with the addition or deletion of a single tuple (entry or record) in the database. Thus, after observing the query response, an adversary is limited in its ability to infer sensitive data about any tuple, and in turn about the user contributing that tuple. For example, if the government aims to publish the average salary of individuals in New Jersey, DP can reveal the approximate average salary of the region while protecting the privacy of any individual user’s salary. DP achieves this balance between utility and privacy by adding noise (perturbation) to the true query result, for example via the Laplace perturbation mechanism.
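
As a toy illustration of the Laplace mechanism mentioned above (a sketch, not production code), the snippet below adds Laplace noise with scale sensitivity/epsilon to an average-salary query; the salary figures and the assumed per-person salary cap are arbitrary placeholders.

```python
# Toy sketch of the Laplace perturbation mechanism: add Laplace noise with
# scale sensitivity/epsilon to the true query answer.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a differentially private average-salary query over n records,
# where each salary is assumed to lie in [0, cap].
salaries = [52_000, 61_000, 48_500, 75_000]  # placeholder data
cap = 200_000                                # assumed upper bound on any one salary
sensitivity = cap / len(salaries)            # with n fixed, changing one salary moves the mean by at most cap/n
print(laplace_mechanism(np.mean(salaries), sensitivity, epsilon=0.5))
```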

Differential privacy is vulnerable to data correlation!

To provide its guarantees, DP implicitly assumes that the data tuples in the database, each from a different user, are all independent. This assumption is often untenable, because tuple dependence occurs naturally in datasets as a result of social interactions between users. For example, consider a social network graph with nodes representing users and edges representing friendship relationships. An adversary can infer the existence of a social link between two users who are not explicitly connected in the graph from the edges among other users (since two friends are likely to have many other friends in common). Similarly, private attributes in a user’s record can be inferred by exploiting the public attributes of other users sharing similar interests. An adversary with access to the noisy query result, and the knowledge that a user’s immediate family members are in the database, can easily infer that user’s susceptibility to a contagious disease. In an era of big data, the tuples within a database exhibit rich semantics and complex structures that inevitably lead to data correlation. Therefore data correlation, i.e., the probabilistic dependence among tuples in the database, which DP implicitly ignores, can seriously degrade the privacy properties of DP.

Previous work has investigated this problem and proposed several interesting privacy metrics, such as zero-knowledge privacy, pufferfish privacy, membership privacy, and inferential privacy. However, the privacy community still lacks effective perturbation mechanisms that achieve these proposed privacy metrics.

In a paper that we presented at the Network and Distributed System Security Symposium (NDSS 2016), we first demonstrate a Bayesian attack on differentially private mechanisms using real-world datasets (such as Gowalla location data). Our attack exploits the correlation between location information and social information of users in the Gowalla dataset. We show that it is possible to infer a user’s sensitive location information from differentially private responses by exploiting her social relationships. When data is correlated, DP underestimates the amount of noise that must be added to achieve a desired bound on the adversary’s ability to make sensitive inferences. Therefore, correlation (or dependency) among data tuples would break the expected privacy guarantees.

How can we improve differential privacy to protect correlated data? Introducing dependent differential privacy.

Overall, our work generalizes the classic differential privacy framework to explicitly consider correlations in datasets, thus enhancing the applicability of the approach to a wide range of datasets and applications. We formalize the notion of dependent differential privacy (DDP), which explicitly incorporates probabilistic dependency constraints among tuples into the privacy formulation. We also propose a perturbation mechanism to achieve the DDP guarantees; our approach extends the conventional Laplace perturbation mechanism by incorporating the correlation between data tuples. To do so, we introduce the concept of a dependence coefficient between two tuples, which captures the correlation between data tuples from the perspective of privacy. (The dependence coefficient between two tuples evaluates the ratio between the dependent indistinguishability of one tuple induced by the other tuple and the self indistinguishability of the tuple itself.) Our perturbation mechanism can be applied to arbitrary data queries, and we validated its effectiveness using real-world datasets.

What are future research directions for dependent differential privacy?

First, to form a deeper understanding of our dependent differential privacy framework, it would be interesting to explore how standard concepts in differential privacy, such as local sensitivity and smooth sensitivity, apply to our framework. Second, the effectiveness of our proposed perturbation mechanism depends on how well the dependence among data can be modeled and computed. One limitation of our work is the assumption that the dependence coefficient is exactly known to both the adversary and the database owner. How to accurately compute the dependence coefficient, and how to deal with potential underestimation, would be interesting future work (note that overestimating the dependence coefficient still provides rigorous DDP guarantees). Finally, our dependent differential privacy framework considers only pairwise correlations between tuples in a database. An important research direction is to generalize from the pairwise correlations considered in our paper to joint correlations in the dataset.

Language necessarily contains human biases, and so will machines trained on language corpora

I have a new draft paper with Aylin Caliskan-Islam and Joanna Bryson titled Semantics derived automatically from language corpora necessarily contain human biases. We show empirically that natural language necessarily contains human biases, and the paradigm of training machine learning on language corpora means that AI will inevitably imbibe these biases as well.

Specifically, we look at “word embeddings”, a state-of-the-art language representation used in machine learning. Each word is mapped to a point in a 300-dimensional vector space so that semantically similar words map to nearby points.

We show that a wide variety of results from psychology on human bias can be replicated using nothing but these word embeddings. We primarily look at the Implicit Association Test (IAT), a widely used and accepted test of implicit bias. The IAT asks subjects to pair concepts together (e.g., white/black-sounding names with pleasant or unpleasant words) and measures reaction times as an indicator of bias. In place of reaction times, we use the semantic closeness between pairs of words.
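
To make that substitution concrete, here is a minimal sketch of this kind of embedding-based association test. It is not our exact code: it assumes word vectors are available as a Python dict of numpy arrays (for example, loaded from pre-trained GloVe embeddings), and the word lists shown are abbreviated placeholders.

```python
# Minimal sketch of an embedding-based association test: cosine similarity
# between word vectors stands in for IAT reaction times.
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word, attr_a, attr_b, vectors):
    """Mean similarity of `word` to attribute set A minus attribute set B."""
    a = np.mean([cos(vectors[word], vectors[w]) for w in attr_a])
    b = np.mean([cos(vectors[word], vectors[w]) for w in attr_b])
    return a - b

def effect_size(targets_x, targets_y, attr_a, attr_b, vectors):
    """How much more strongly X associates with A (vs. B) than Y does,
    normalized by the pooled standard deviation (a Cohen's-d-style measure)."""
    sx = [association(w, attr_a, attr_b, vectors) for w in targets_x]
    sy = [association(w, attr_a, attr_b, vectors) for w in targets_y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# flowers, insects = ["rose", "tulip", ...], ["ant", "gnat", ...]
# pleasant, unpleasant = ["love", "peace", ...], ["hatred", "agony", ...]
# print(effect_size(flowers, insects, pleasant, unpleasant, vectors))
```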

In short, we were able to replicate every single result that we tested, with high effect sizes and low p-values.

These include innocuous, universal associations (flowers are associated with pleasantness and insects with unpleasantness), racial prejudice (European-American names are associated with pleasantness and African-American names with unpleasantness), and a variety of gender stereotypes (for example, career words are associated with male names and family words with female names).

But we go further. We show that information about the real world is recoverable from word embeddings to a striking degree. The figure below shows that for 50 occupation words (doctor, engineer, …), we can accurately predict the percentage of U.S. workers in that occupation who are women using nothing but the semantic closeness of the occupation word to feminine words!
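
A rough sketch of that occupation analysis, under the same assumptions as the previous snippet: score each occupation word by its relative closeness to a few feminine versus masculine words, then correlate the scores with a caller-supplied table of the percentage of women in each occupation (for example, from U.S. labor statistics; the table itself is not reproduced here).

```python
# Rough sketch of the occupation analysis. `vectors` maps words to numpy arrays;
# `pct_women` is a caller-supplied dict: occupation word -> percent women.
import numpy as np
from scipy.stats import pearsonr

FEMALE = ["she", "her", "woman", "female"]
MALE = ["he", "his", "man", "male"]

def _cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def femininity(word, vectors):
    f = np.mean([_cos(vectors[word], vectors[w]) for w in FEMALE])
    m = np.mean([_cos(vectors[word], vectors[w]) for w in MALE])
    return f - m

def occupation_correlation(pct_women, vectors):
    words = sorted(pct_women)
    scores = [femininity(w, vectors) for w in words]
    return pearsonr(scores, [pct_women[w] for w in words])  # (r, p-value)
```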

These results simultaneously show that the biases in question are embedded in human language, and that word embeddings are picking up the biases.

Our finding of pervasive, human-like bias in AI may be surprising, but we consider it inevitable. We mean “bias” in a morally neutral sense. Some biases are prejudices, which society deems unacceptable. Others are facts about the real world (such as gender gaps in occupations), even if they reflect historical injustices that we wish to mitigate. Yet others are perfectly innocuous.

Algorithms don’t have a good way of telling these apart. If AI learns language sufficiently well, it will also learn cultural associations that are offensive, objectionable, or harmful. At a high level, bias is meaning. “Debiasing” these machine models, while intriguing and technically interesting, necessarily harms meaning.

Instead, we suggest that mitigating prejudice should be a separate component of an AI system. Rather than altering AI’s representation of language, we should alter how or whether it acts on that knowledge, just as humans are able to learn not to act on our implicit biases. This requires a long-term research program that includes ethicists and domain experts, rather than formulating ethics as just another technical constraint in a learning system.

Finally, our results have implications for human prejudice. Given how deeply bias is embedded in language, to what extent does the influence of language explain prejudiced behavior? And could transmission of language explain transmission of prejudices? These explanations are simplistic, but that is precisely our point: in the future, we should treat these as “null hypotheses” to be eliminated before we turn to more complex accounts of bias in humans.

Security against Election Hacking – Part 2: Cyberoffense is not the best cyberdefense!

State and county election officials across the country employ thousands of computers in election administration, most of which are connected (from time to time) to the internet (or exchange data cartridges with machines that are connected).  In my previous post I explained how we must audit elections independently of the computers, so that we can trust the results even if the computers are hacked.

Still, if state and county election computers were hacked, it would be an enormous headache and it would certainly cast a shadow on the legitimacy of the election.  So, should the DHS designate election computers as “critical cyber infrastructure?”

This question betrays a fundamental misunderstanding of how computer security really works.  You as an individual buy your computers and operating systems from reputable vendors (Apple, Microsoft, IBM, Google/Samsung, HP, Dell, etc.).  Businesses and banks (and the Democratic National Committee, and the Republican National Committee) buy their computers and software from the same vendors.  Your security, and the security of all the businesses you deal with, is improved when these hardware and software vendors build products without security bugs in them.   Election administrators use computers that run Windows (or MacOS, or Linux) bought from the same vendors.

Parts of the U.S. government, particularly inside the NSA, have “cyberdefense” teams that analyze widely used software for security vulnerabilities.  The best thing they could do to enhance our security is notify the vendors immediately about vulnerabilities, so the vendors can fix the bugs (and learn their lessons).   Unfortunately, the NSA also has “cyberoffense” teams that like to save up these vulnerabilities, keep them secret, and use them as weak points to break into their adversaries’ computers.  They think they’re so smart that the Russkies, or the Chinese, will never be able to figure out the same vulnerabilities and use them to break into the computers of American businesses, individuals, the DNC or RNC, or American election administrators.  There’s even an acronym for this fallacy: NOBUS.  “NObody But US” will be able to figure out this attack.

Vulnerability lists accumulated by the NSA and DHS probably don’t include much vote-counting software: those lists (probably) focus on widely used operating systems, office and word-processing software, network routers, phone apps, and so on.  But vote-counting software typically runs on widely used operating systems, uses PDF-handling software for ballot printing, and relies on network routers for vote aggregation.  Improvements in these components would improve election security.

So, the “cyberdefense” experts in the U.S. Government could improve everyone’s security, including election administrators, by promptly warning Microsoft, Apple, IBM, and so on about security bugs.  But their hands are often tied by the “cyberoffense” hackers who want to keep the bugs secret—and unfixed.  For years, independent cybersecurity experts have advocated that the NSA’s cyberdefense and cyberoffense teams be split up into two separate organizations, so that the offense hackers can’t deliberately keep us all insecure.   Unfortunately, in February 2016 the NSA did just the opposite: it merged its offense and defense teams together.

Some in the government talk as if “national cyberdefense” is some kind of “national guard” that they can send in to protect a selected set of computers.  But it doesn’t work that way.  Our computers are secure because of the software we purchase and install; we can choose vendors such as Apple, IBM, Microsoft, HP, or others based on their track record or based on their use of open-source software that we can inspect.  The DHS’s cybersecurity squad is not really part of that process, except insofar as it helps the vendors improve the security of their products.  (See also:  “The vulnerabilities equities process.”)

Yes, it’s certainly helpful that the Secretary of Homeland Security has offered “assistance in helping state officials manage risks to voting systems in each state’s jurisdiction.”  But it’s too close to the election to be fiddling with the election software—election officials (understandably) don’t want to break anything.

But really we should ask: Should the FBI and the NSA be hacking us or defending us?  To defend us, they must stop hoarding secret vulnerabilities, and instead get those bugs fixed by the vendors.

Security against Election Hacking – Part 1: Software Independence

There’s been a lot of discussion of whether the November 2016 U.S. election can be hacked.  Should the U.S. Government designate all the states’ and counties’ election computers as “critical cyber infrastructure” and prioritize the “cyberdefense” of these systems?  Will it make any difference to activate those buzzwords with less than 3 months until the election?

First, let me explain what can and can’t be hacked.  Election administrators use computers in (at least) three ways:

  1. To maintain voter registration databases and to prepare the “pollbooks” used at every polling place to list who’s a registered voter (for that precinct); and to prepare the “ballot definitions” telling the voting machines who the candidates are in each race.
  2. Inside the voting machines themselves, the optical-scan counters or touch-screen machines that the voter interacts with directly.
  3. When the polls close, the vote totals from all the different precincts are gathered (this is called “canvassing”) and aggregated together to make statewide totals for each candidate (or district-wide totals for congressional candidates).

Any of these computers could be hacked.  What defenses do we have?  Could we seal off the internet so the Russians can’t hack us?  Clearly not; and anyway, maybe the hacker isn’t the Russians—what if it’s someone in your opponent’s political party?  What if it’s a rogue election administrator?

The best defenses are ways to audit the election and count the votes outside of, and independently of, the hackable computers.  For example,

[Read more…]