May 21, 2018


Ethics Education in Data Science

Data scientists in academia and industry are increasingly recognizing the importance of integrating ethics into data science curricula. Recently, a group of faculty and students gathered at New York University before the annual FAT* conference to discuss the promises and challenges of teaching data science ethics, and to learn from one another’s experiences in the classroom. This blog post is the first of two that summarize the discussions held at this workshop.

There is general agreement that data science ethics should be taught, but less consensus about what its goals should be or how they should be pursued. Because the field is so nascent, there is substantial room for innovative thinking about what data science ethics ought to mean. In some respects, its goal may be the creation of “future citizens” of data science who are invested in the welfare of their communities and the world, and understand the social and political role of data science therein. But there are other models, too: for example, an alternative goal is to equip aspiring data scientists with technical tools and organizational processes for doing data science work that aligns with social values (like privacy and fairness). The group worked to identify some of the biggest challenges in this field, and when possible, some ways to address these tensions.

One approach to data science ethics education is including a standalone ethics course in the program’s curriculum. Another option is embedding discussions of ethics into existing courses in a more integrated way. There are advantages and disadvantages to both options. Standalone ethics courses may attract a wider variety of students from different disciplines than technical classes alone, which provides potential for rich discussions. They allow professors to cover basic normative theories before diving into specific examples without having to skip the basic theories or worry that students covered them in other course modules. Independent courses about ethics do not necessarily require cooperation from multiple professors or departments, making them easier to organize. However, many worry that teaching ethics separately from technical topics may marginalize ethics and make students perceive it as unimportant. Further, standalone courses can either be elective or mandatory. If elective, they may attract a self-selecting group of students, potentially leaving out other students who could benefit from exposure to the material; mandatory ethics classes may be seen as displacing other technical training students want and need. Embedding ethics within existing CS courses may avoid some of these problems and can also elevate the discourse around ethical dilemmas by ensuring that students are well-versed in the specific technical aspects of the problems they discuss.

Beyond course structure, ethics courses can be challenging for data science faculty to teach effectively. Many students used to more technical course material are challenged by the types of learning and engagement required in ethics courses, which are often reading-heavy. And the “answers” in ethics courses are almost never clear-cut. The lack of clear answers or easily constructed rubrics can complicate grading, since both students and faculty in computer science may be used to grading based on more objective criteria. However, this problem is certainly not insurmountable – humanities departments have dealt with this for centuries, and dialogue with them may illuminate some solutions to this problem. Asking students to complete frequent but short assignments rather than occasional long ones may make grading easier, and also encourages students to think about ethical issues on a more regular basis.

Institutional hurdles can hinder a university’s ability to satisfactorily address questions of ethics in data science. A dearth of technical faculty may make it difficult to offer a standalone course on ethics. A smaller faculty may push a university towards incorporating ethics into existing CS courses rather than creating a new class. Even this, however, requires that professors have the time and knowledge to do so, which is not always the case.

The next blog post will enumerate topics discussed and assignments used in courses that discuss ethics in data science.

Thanks to Karen Levy and Kathy Pham for their edits on a draft of this post.

When the business model *is* the privacy violation

Sometimes, when we worry about data privacy, we’re worried that data might fall into the wrong hands or be misused for unintended purposes. If I’m considering participating in a medical study, I’d want to know if insurance companies will obtain the data and use it against me. In these scenarios, we should look for ways to preserve the intended benefit while preventing unintended uses. In other words, achieving utility and privacy is not a zero-sum game. [1]

In other situations, the intended use is the privacy violation. The most prominent example is the tracking of our online and offline habits for targeted advertising. This business model is exactly what people object to, for a litany of reasons: targeting is creepy, manipulative, discriminatory, and reinforces harmful stereotypes. The data collection that enables targeted advertising involves an opaque surveillance infrastructure to which it’s impossible to give meaningfully informed consent, and the resulting databases give a few companies too much power over individuals and over democracy. [2]

In response to privacy laws, companies have tried to find technical measures that obfuscate the data but allow them to carry on with the surveillance business as usual. But that’s just privacy theater. Technical steps that don’t affect the business model are of limited effectiveness, because the business model is fundamentally at odds with privacy; this is in fact a zero-sum game. [3]

For example, there’s an industry move to replace email addresses and other personal identifiers with hashed versions. But a hashed identifier is nevertheless a persistent, unique identifier that allows linking a person across databases, devices, and contexts, as well as targeting and manipulation on the basis of the associated data. Thus, hashing completely fails to address the underlying privacy concerns.
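To make the linkage concern concrete, here is a minimal Python sketch. The two datasets, their field names, and the `hash_email` and `link_records` helpers are hypothetical and exist only for illustration; SHA-256 is an assumption, since the post does not say which hash function is used in practice. The sketch shows that two parties who hash the same email address independently still end up with a shared join key.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize an email address and return its SHA-256 hex digest."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def link_records(a: dict, b: dict) -> dict:
    """Join two datasets on the hashed identifiers they have in common."""
    return {key: {**a[key], **b[key]} for key in a.keys() & b.keys()}


# Two hypothetical datasets, collected independently, each keyed by hashed email.
ad_network = {hash_email("Alice@example.com"): {"sites_visited": ["news", "health-forum"]}}
retailer = {hash_email("alice@example.com "): {"purchases": ["vitamins"]}}

# The digests match, so the two profiles merge into one without the
# plaintext address ever being exchanged.
linked = link_records(ad_network, retailer)
```

Because the hash is deterministic, anyone holding a list of candidate email addresses can also hash them and look up matches, so the digest does little to hide whom a record describes.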

Policy makers and privacy advocates must recognize when privacy is a zero-sum game and when it isn’t. Policy makers like non-zero sum games because they can simultaneously satisfy different stakeholders. But they must acknowledge that sometimes this isn’t possible. In such cases, laws and regulations should avoid loopholes that companies might exploit by building narrow technical measures and claiming to be in compliance. [4]

Privacy advocates should recognize that framing a concern about data use practices as a privacy problem is a double-edged sword. Privacy can be a convenient label for a set of related concerns, but it gives industry a way to deflect attention from deeper ethical questions by interpreting privacy narrowly as confidentiality.

Thanks to Ed Felten and Nick Feamster for feedback on a draft.


[1] There is a vast computer science privacy literature predicated on the idea that we can have our cake and eat it too. For example, differential privacy seeks to enable analysis of data in the aggregate without revealing individual information. While there are disagreements on the specifics, such as whether de-identification results in a win-win outcome, there is no question that the overall direction of privacy-preserving data analysis is an important one.

[2] In Mark Zuckerberg’s congressional testimony, he framed Facebook’s privacy woes as being about improper third-party access to the data. This is arguably a non-zero sum game, and one that Facebook is equipped to address without the need for legislation. However, the much bigger privacy problem is Facebook’s own data collection and business model, which is inherently at odds with privacy and is unlikely to be solved without legislation.

[3] There are research proposals for targeted advertising, such as Adnostic, that would improve privacy by drastically changing the business model, largely cutting out the tracking companies. Unsurprisingly, there has been no interest in these approaches from the traditional ad tech industry, but some browser vendors have experimented with similar ideas.

[4] As an example of avoiding the hashing loophole, the 2012 FTC privacy report is well written: it says that for data to be considered de-identified, “the company must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer, computer, or other device.” It goes on to say that “reasonably” includes reasonable assumptions about the use of external data sources that might be available.

Routing Attacks on Internet Services

by Yixin Sun, Annie Edmundson, Henry Birge-Lee, Jennifer Rexford, and Prateek Mittal

[In this post, we discuss a recent thread of research that highlights the insecurity of Internet services due to the underlying insecurity of Internet routing. We hope that this thread facilitates important dialog in the networking, security, and Internet policy communities to drive change and adoption of secure mechanisms for Internet routing]

The underlying infrastructure of the Internet comprises physical connections between more than 60,000 entities known as Autonomous Systems (such as AT&T and Verizon). Internet routing protocols such as the Border Gateway Protocol (BGP) govern how our communications are routed over a series of autonomous systems to form an end-to-end communication channel between a sender and receiver.

Unfortunately, Internet routing protocols were not designed with security in mind. The insecurity of BGP allows potential adversaries to manipulate how traffic is routed on the Internet; recent real-world BGP attacks against Mastercard, Visa, and Symantec are one example. The insecurity of BGP is well known, and a number of protocols have been designed to secure Internet routing. However, we are still a long way from large-scale deployment of secure Internet routing protocols.

This status quo is unacceptable.

Historically, routing attacks have been viewed primarily as attacks on the availability of Internet applications. For example, an adversary can hijack Internet traffic destined for a victim application server and make it unavailable (see YouTube’s 2008 hijack). A secondary perspective is that of the confidentiality of unencrypted Internet communications. For example, an adversary can manipulate Internet routing to position itself on the communication path between a client and the application server and record unencrypted traffic: http://dyn.com/blog/mitm-internet-hijacking/
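One well-known way such a hijack can work is a more-specific prefix announcement: routers forward traffic along the longest matching prefix, so an adversary who announces a smaller block inside the victim's address space attracts the traffic for that block. The sketch below is a simplified illustration under stated assumptions, not a model of real BGP route selection. The `best_route` helper is hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, origin) entry with the longest prefix matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)


# The legitimate origin announces a /24.
routes = [(ipaddress.ip_network("203.0.113.0/24"), "AS64500 (legitimate)")]
before = best_route(routes, "203.0.113.10")

# The adversary announces a more specific /25 that covers the same host.
routes.append((ipaddress.ip_network("203.0.113.0/25"), "AS64511 (adversary)"))
after = best_route(routes, "203.0.113.10")

# before[1] is the legitimate origin; after[1] is the adversary,
# because the /25 is a longer match than the /24.
```

Real route selection also involves policy, path length, and filtering, so this only captures the longest-prefix-match step.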

In this post, we argue that conventional wisdom significantly underestimates the vulnerabilities introduced by the insecurity of Internet routing. In particular, we discuss recent research results that exploit BGP insecurity to attack the Tor network, TLS encryption, and the Bitcoin network.

BGP attacks on anonymity systems/Tor: The Tor network is a deployed system for anonymous communication that aims to protect user identity (IP address) in online communications. The Tor network comprises over 7,000 relays which together carry terabytes of traffic every day. Tor serves millions of users, including political dissidents, whistle-blowers, law-enforcement, intelligence agencies, journalists, businesses and ordinary citizens concerned about the privacy of their online communications.

Tor clients redirect their communications via a series of proxies for anonymous communication. Layered encryption is used such that each proxy only observes the identity of the previous hop and the next hop in the communication, and no single proxy observes the identities of both the client and the destination.

However, if an adversary can observe the traffic from the client to the Tor network, and from the Tor network to the destination, then it can leverage correlation between packet timing and sizes to infer the network identities of clients and servers (end-to-end timing analysis). Therefore, an adversary can first use BGP attacks to hijack or intercept Internet traffic towards the Tor network (Tor relays), and perform traffic analysis of encrypted communications to compromise user anonymity.

It is important to note that this timing analysis works even if the communication is encrypted. This illustrates an important point — the insecurity of Internet routing has important consequences for traffic-analysis attacks, which allow adversaries to infer sensitive information from communication meta-data (such as source IP, destination IP, packet size and packet timing), even if communication is encrypted.
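The intuition behind end-to-end timing analysis can be sketched in a few lines of Python. This is a toy illustration, not the method used in the RAPTOR paper: the traffic volumes are made up, the flow names are hypothetical, and plain Pearson correlation over fixed time windows is an assumption chosen for simplicity. The adversary only needs byte counts per time window, which remain visible even when the payload is encrypted.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical bytes per one-second window, seen between the client and its entry relay.
client_side = [1200, 0, 300, 5400, 0, 0, 2100, 800]

# Hypothetical flows seen between exit relays and destination servers.
exit_side = {
    "flow_a": [400, 400, 400, 400, 400, 400, 400, 400],
    "flow_b": [1150, 40, 310, 5300, 20, 0, 2000, 850],
    "flow_c": [0, 3000, 0, 100, 2500, 900, 0, 0],
}

# Score every candidate flow and pick the one that best tracks the client's traffic.
scores = {name: pearson(client_side, series) for name, series in exit_side.items()}
best_match = max(scores, key=scores.get)
```

Here `flow_b` follows the same bursts as the client-side trace, so it scores highest and links the client to that destination.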

We introduced the threat of “Routing Attacks on Privacy in Tor” (RAPTOR attacks) at USENIX Security in 2015. We demonstrated the feasibility of RAPTOR attacks on the Tor network by performing real-world Internet routing manipulation in a controlled and ethical manner. Interested readers can see the technical paper and our project webpage for more details.

Routing attacks challenge conventional beliefs about security of anonymity systems, and also have broad applicability to low-latency anonymous communication (including systems beyond Tor, such as I2P). Our work also motivates the design of anonymity systems that successfully resist the threat of Internet routing manipulation. The Tor project is already implementing design changes (such as Tor proposal 247 and Tor proposal 271) that make it harder for an adversary to infer and manipulate the client’s entry point (proxy) into the Tor network. Our follow-up work on Counter-RAPTOR defenses (presented at the IEEE Security and Privacy Symposium in 2017) presents a monitoring framework to analyze routing updates for the Tor network, which is being integrated into the Tor metrics portal.

BGP attacks on TLS/Digital Certificates: The Transport Layer Security (TLS) protocol allows a client to establish a secure communication channel with a destination website using cryptographic key exchange protocols. To prevent man-in-the-middle attacks, clients using the TLS protocol need to authenticate the public key corresponding to the destination site, such as a web-server. Digital certificates issued by trusted Certificate Authorities (such as Let’s Encrypt) provide an authentic binding between a destination server and its public key, allowing a client to validate the destination server. Given the widespread use of TLS for secure Internet communications, the security of the digital certificate ecosystem is paramount.

We have shown that the process for obtaining digital certificates from trusted certificate authorities (called domain validation) is vulnerable to attack.

A domain owner can submit a Certificate Signing Request (CSR) to a trusted Certificate Authority to obtain a digital certificate. The Certificate Authority must verify that the party submitting the request actually has control over the domains that are covered by that CSR. This process is known as domain control verification and is a core part of the Public Key Infrastructure (PKI) used in the TLS protocol.

In our ongoing work, presented at the HotPETS workshop in 2017, we demonstrated the feasibility of exploiting BGP attacks to compromise the domain validation protocol. For example, HTTP domain verification is a common method of domain control verification that requires the domain owner to upload a string specified by the CA to a specific HTTP URL at the domain. The CA can then verify the domain via an HTTP GET request. However, an adversary can manipulate inter-domain routing via BGP attacks to intercept all traffic towards the victim web-server, and successfully obtain a fraudulent digital certificate by spoofing an HTTP response corresponding to the CA challenge message. We have performed real-world Internet routing manipulation in a controlled and ethical manner to demonstrate the feasibility of these attacks. See our attack demonstration video for a demo.
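The following toy simulation sketches why HTTP domain verification depends on routing. Everything in it is hypothetical: the `WebServer`, `Network`, and `CertificateAuthority` classes, the `challenge_path` URL layout, and the `victim.example` domain are stand-ins, and a dictionary lookup replaces real routing and real HTTP. The point is only that the CA's check succeeds for whoever receives the traffic for the domain.

```python
import secrets


def challenge_path(token):
    """Hypothetical URL layout for the CA's challenge file."""
    return f"/.well-known/validation/{token}"


class WebServer:
    """Toy web server that serves files from an in-memory dictionary."""

    def __init__(self):
        self.files = {}

    def get(self, path):
        return self.files.get(path)


class Network:
    """Toy routing table: maps a domain to the server that receives its traffic."""

    def __init__(self):
        self.routes = {}

    def http_get(self, domain, path):
        return self.routes[domain].get(path)


class CertificateAuthority:
    def __init__(self, network):
        self.network = network

    def new_challenge(self):
        return secrets.token_hex(16)

    def validate(self, domain, token):
        # The CA fetches the challenge URL and compares the body to the token.
        return self.network.http_get(domain, challenge_path(token)) == token


network = Network()
victim_server, adversary_server = WebServer(), WebServer()
network.routes["victim.example"] = victim_server
ca = CertificateAuthority(network)

# The adversary requests a certificate and hosts the token on its own server.
token = ca.new_challenge()
adversary_server.files[challenge_path(token)] = token

passed_before_hijack = ca.validate("victim.example", token)  # False: CA reaches the real server

# Model the BGP interception: traffic for the domain now reaches the adversary.
network.routes["victim.example"] = adversary_server
passed_after_hijack = ca.validate("victim.example", token)   # True: CA reaches the adversary
```

The CA never learns that the response came from a different server, because the check relies entirely on where the network delivers the request.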

This attack has significant consequences for privacy of our online communications, as adversaries can bypass cryptographic protection offered by encryption using fraudulently obtained digital certificates. Our work is leading to deployment of suggested countermeasures (verification from multiple vantage points) at Let’s Encrypt. Please see the Let’s Encrypt deployment for more details.
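A minimal sketch of the multiple-vantage-point idea follows. The quorum rule, the `validate_from_vantage_points` helper, and the vantage point names are assumptions made for illustration; the post does not describe how the deployed countermeasure combines results. Each vantage point reports the response body it received for the challenge URL, and the certificate is issued only if enough of them saw the expected token.

```python
def validate_from_vantage_points(responses, expected_token, quorum):
    """Accept only if at least `quorum` vantage points received the expected token."""
    matches = sum(1 for body in responses.values() if body == expected_token)
    return matches >= quorum


token = "challenge-token"

# A hijack that only affects routes near one vantage point: the other two
# still reach the legitimate server, which does not host the adversary's token.
hijack_responses = {"vantage_a": token, "vantage_b": None, "vantage_c": None}

# A legitimate request: every vantage point reaches the owner's server.
normal_responses = {"vantage_a": token, "vantage_b": token, "vantage_c": token}

hijack_accepted = validate_from_vantage_points(hijack_responses, token, quorum=3)
normal_accepted = validate_from_vantage_points(normal_responses, token, quorum=3)
```

With these inputs the hijacked request is rejected and the legitimate one is accepted. An adversary whose route manipulation reaches every vantage point would still pass, so this raises the cost of the attack rather than eliminating it.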

So far, we have discussed our research results from Princeton University. Below, we briefly discuss research from Laurent Vanbever’s group at ETHZ and Sharon Goldberg’s group at Boston University, which has shown that it is possible to use inter-domain routing manipulation to attack Bitcoin and to bypass legal protections.

BGP attacks on Crypto-currencies/Bitcoin: BGP manipulation can be used to perform two main types of attacks on crypto-currencies such as Bitcoin: (1) partitioning attacks, in which an adversary aims to disconnect a set of victim Bitcoin nodes from the network, or (2) delaying attacks, in which an adversary can slow down the propagation of data towards victim Bitcoin nodes. Both of these attacks result in potential economic loss to Bitcoin nodes.

BGP attacks for bypassing legal protections: Domestic communications between US citizens have legal protections against surveillance. However, adversaries can manipulate inter-domain routing such that the actual communication path involves a foreign country, which could invalidate the legal protections and allow large-scale surveillance of online communications.

Concluding Thoughts: The emergence of routing attacks on anonymity systems, Internet domain validation, and cryptocurrencies shows that conventional wisdom has significantly underestimated the attack surface introduced by the insecurity of Internet routing. It is imperative for critical Internet applications to be aware of the insecurity of Internet routing and to analyze the resulting security threats.

Given the vulnerabilities in Internet routing, applications should consider domain-specific defense mechanisms for enhancing user security and privacy. Examples include our Counter-RAPTOR analytics for Tor and the multiple-vantage-point defense for domain validation. We hope that our work and the research discussed above help enable this vision.

While it is important to design and deploy application-specific defenses for protecting our systems against routing attacks that exploit current insecure Internet infrastructure, it is even more important to rethink the status quo of insecure routing protocols. Our ultimate goal ought to be to fundamentally eliminate the insecurity in today’s Internet routing protocols by moving towards the adoption of secure countermeasures. How do we drive this change?