June 25, 2018

Ethics Education in Data Science

Data scientists in academia and industry are increasingly recognizing the importance of integrating ethics into data science curricula. Recently, a group of faculty and students gathered at New York University before the annual FAT* conference to discuss the promises and challenges of teaching data science ethics, and to learn from one another's experiences in the classroom. This blog post is the first of two summarizing the discussions at that workshop.

There is general agreement that data science ethics should be taught, but less consensus about what its goals should be or how to pursue them. Because the field is so nascent, there is substantial room for innovative thinking about what data science ethics ought to mean. In some respects, its goal may be the creation of "future citizens" of data science who are invested in the welfare of their communities and the world, and who understand the social and political role of data science therein. But there are other models, too: an alternative goal is to equip aspiring data scientists with technical tools and organizational processes for doing data science work that aligns with social values (like privacy and fairness). The group worked to identify some of the biggest challenges in this field and, when possible, ways to address these tensions.

One approach to data science ethics education is to include a standalone ethics course in the program's curriculum. Another is to embed discussions of ethics into existing courses in a more integrated way. There are advantages and disadvantages to both options. Standalone ethics courses may attract a wider variety of students from different disciplines than technical classes alone, which creates the potential for rich discussions. They allow professors to cover basic normative theories before diving into specific examples, without having to skip that groundwork or assume students encountered it in other course modules. Independent ethics courses do not necessarily require cooperation from multiple professors or departments, making them easier to organize. However, many worry that teaching ethics separately from technical topics may marginalize ethics and lead students to perceive it as unimportant. Further, standalone courses can be either elective or mandatory. If elective, they may attract a self-selecting group of students, potentially leaving out others who could benefit from exposure to the material; mandatory ethics classes may be seen as displacing other technical training students want and need. Embedding ethics within existing CS courses may avoid some of these problems and can also elevate the discourse around ethical dilemmas by ensuring that students are well-versed in the specific technical aspects of the problems they discuss.

Beyond course structure, ethics courses can be challenging for data science faculty to teach effectively. Many students used to more technical material are challenged by the types of learning and engagement required in ethics courses, which are often reading-heavy. And the "answers" in ethics courses are almost never clear-cut. The lack of clear answers or easily constructed rubrics can complicate grading, since both students and faculty in computer science may be used to grading based on more objective criteria. This problem is hardly insurmountable, however: humanities departments have dealt with it for centuries, and dialogue with them may illuminate solutions. Asking students to complete frequent but short assignments rather than occasional long ones may make grading easier, and also encourages students to think about ethical issues on a more regular basis.

Institutional hurdles can also hinder a university's ability to satisfactorily address questions of ethics in data science. A dearth of faculty may make it difficult to offer a standalone course on ethics, pushing a university toward incorporating the material into existing CS courses rather than creating a new class. Even this, however, requires that professors have the time and knowledge to do so, which is not always the case.

The next blog post will enumerate topics discussed and assignments used in courses that discuss ethics in data science.

Thanks to Karen Levy and Kathy Pham for their edits on a draft of this post.

Routing Attacks on Internet Services

by Yixin Sun, Annie Edmundson, Henry Birge-Lee, Jennifer Rexford, and Prateek Mittal

[In this post, we discuss a recent thread of research that highlights the insecurity of Internet services due to the underlying insecurity of Internet routing. We hope that this thread facilitates an important dialog in the networking, security, and Internet policy communities to drive change and the adoption of secure mechanisms for Internet routing.]

The underlying infrastructure of the Internet comprises physical connections between more than 60,000 entities known as Autonomous Systems (such as AT&T and Verizon). Internet routing protocols such as the Border Gateway Protocol (BGP) govern how our communications are routed over a series of autonomous systems to form an end-to-end communication channel between a sender and receiver.

Unfortunately, Internet routing protocols were not designed with security in mind. The insecurity of the BGP protocol allows potential adversaries to manipulate how routing on the Internet occurs. For example, see this recent real-world example of BGP attacks against Mastercard, Visa, and Symantec. The insecurity of BGP is well known, and a number of protocols have been designed to secure Internet routing. However, we are a long way away from large-scale deployment of secure Internet routing protocols.
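To make the attack mechanics concrete, here is a minimal sketch in Python of why a more-specific BGP announcement attracts traffic: routers forward along the longest matching prefix, so an adversary who announces a sub-prefix of a victim's address space wins the route lookup. The prefixes and AS names below are invented for illustration, not real routing data.

```python
# Longest-prefix-match forwarding, the mechanism a sub-prefix hijack abuses.
# All prefixes and AS names are illustrative, not real routing data.
import ipaddress

# Routing table: announced prefix -> origin AS
routes = {ipaddress.ip_network("203.0.113.0/24"): "AS-victim"}

def lookup(table, dst):
    """Return the origin AS of the most-specific prefix covering dst."""
    addr = ipaddress.ip_address(dst)
    matches = [p for p in table if addr in p]
    return table[max(matches, key=lambda p: p.prefixlen)] if matches else None

print(lookup(routes, "203.0.113.10"))  # AS-victim: the legitimate route

# The adversary announces two more-specific /25s covering the same space;
# longest-prefix match now sends the victim's traffic to the adversary.
routes[ipaddress.ip_network("203.0.113.0/25")] = "AS-attacker"
routes[ipaddress.ip_network("203.0.113.128/25")] = "AS-attacker"

print(lookup(routes, "203.0.113.10"))  # AS-attacker
```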

This status quo is unacceptable.

Historically, routing attacks have been viewed primarily from the perspective of an attack on the availability of Internet applications. For example, an adversary can hijack Internet traffic towards a victim application server and cause unavailability (see YouTube's 2008 hijack). A secondary perspective is that of the confidentiality of unencrypted Internet communications. For example, an adversary can manipulate Internet routing to position itself on the communication path between a client and the application server and record unencrypted traffic (see http://dyn.com/blog/mitm-internet-hijacking/).

In this post, we argue that conventional wisdom significantly underestimates the vulnerabilities introduced by the insecurity of Internet routing. In particular, we discuss recent research results that exploit BGP insecurity to attack the Tor network, TLS encryption, and the Bitcoin network.

BGP attacks on anonymity systems/Tor: The Tor network is a deployed system for anonymous communication that aims to protect user identity (IP address) in online communications. The Tor network comprises over 7,000 relays which together carry terabytes of traffic every day. Tor serves millions of users, including political dissidents, whistle-blowers, law enforcement, intelligence agencies, journalists, businesses, and ordinary citizens concerned about the privacy of their online communications.

Tor clients redirect their communications via a series of proxies for anonymous communication. Layered encryption is used such that each proxy only observes the identity of the previous hop and the next hop in the communication, and no single proxy observes the identities of both the client and the destination.
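As a toy illustration of this layering, the sketch below wraps a payload the way a Tor client does, using the third-party `cryptography` package's Fernet recipe as a stand-in for Tor's actual circuit cryptography (an assumption for readability; real Tor cells also carry next-hop routing information that is omitted here).

```python
# Toy Tor-style layered ("onion") encryption. Fernet stands in for Tor's
# real circuit cryptography; next-hop routing fields are omitted.
from cryptography.fernet import Fernet  # pip install cryptography

relays = ["entry", "middle", "exit"]
keys = {name: Fernet(Fernet.generate_key()) for name in relays}

# The client wraps the payload once per relay, innermost (exit) layer first,
# so the outermost layer belongs to the entry relay.
packet = b"dest=example.com|payload=hello"
for name in reversed(relays):
    packet = keys[name].encrypt(packet)

# Each relay peels exactly one layer; everything beneath stays opaque to it.
for name in relays:
    packet = keys[name].decrypt(packet)
    print(f"{name} relay forwards {len(packet)} opaque bytes")

print(packet)  # the payload appears only after the exit layer is removed
```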

However, if an adversary can observe the traffic from the client to the Tor network, and from the Tor network to the destination, then it can leverage correlation between packet timing and sizes to infer the network identities of clients and servers (end-to-end timing analysis). Therefore, an adversary can first use BGP attacks to hijack or intercept Internet traffic towards the Tor network (Tor relays), and perform traffic analysis of encrypted communications to compromise user anonymity.

It is important to note that this timing analysis works even if the communication is encrypted. This illustrates an important point: the insecurity of Internet routing has serious consequences for traffic-analysis attacks, which allow adversaries to infer sensitive information from communication metadata (such as source IP, destination IP, packet size, and packet timing) even when the payload is encrypted.
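To see why encryption does not help against timing analysis, consider the simulated sketch below (synthetic timings, not real traces; `statistics.correlation` requires Python 3.10+): an adversary who observes inter-packet gaps on both sides of the Tor network can match a client to a destination by correlation alone.

```python
# Simulated end-to-end timing correlation. All timings are synthetic;
# statistics.correlation (Pearson's r) requires Python 3.10+.
import random
import statistics

random.seed(7)

def flow(n=200):
    """Synthetic inter-packet gaps (seconds) for one client's flow."""
    return [random.expovariate(20) for _ in range(n)]

def through_network(gaps, jitter=0.002):
    """The same flow as observed on the far side, with small added jitter."""
    return [g + random.gauss(0, jitter) for g in gaps]

entry_flows = {f"client-{i}": flow() for i in range(5)}
exit_capture = through_network(entry_flows["client-3"])  # exit-side view

# The true sender stands out: its timing correlates strongly with the
# exit-side capture even though every byte in between was encrypted.
for name, gaps in entry_flows.items():
    print(name, round(statistics.correlation(gaps, exit_capture), 3))
```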

We introduced the threat of “Routing Attacks on Privacy in Tor” (RAPTOR attacks) at USENIX Security in 2015. We demonstrated the feasibility of RAPTOR attacks on the Tor network by performing real-world Internet routing manipulation in a controlled and ethical manner.  Interested readers can see the technical paper and our project webpage for more details.

Routing attacks challenge conventional beliefs about the security of anonymity systems, and they apply broadly to low-latency anonymous communication (including systems beyond Tor, such as I2P). Our work also motivates the design of anonymity systems that successfully resist the threat of Internet routing manipulation. The Tor project is already implementing design changes (such as Tor proposal 247 and Tor proposal 271) that make it harder for an adversary to infer and manipulate the client's entry point (proxy) into the Tor network. Our follow-up work on Counter-RAPTOR defenses (presented at the IEEE Security and Privacy Symposium in 2017) introduces a monitoring framework that analyzes routing updates for the Tor network, and is being integrated into the Tor metrics portal.

BGP attacks on TLS/Digital Certificates: The Transport Layer Security (TLS) protocol allows a client to establish a secure communication channel with a destination website using cryptographic key exchange protocols. To prevent man-in-the-middle attacks, clients using the TLS protocol need to authenticate the public key of the destination site, such as a web server. Digital certificates issued by trusted Certificate Authorities (such as Let's Encrypt) provide an authentic binding between a destination server and its public key, allowing a client to validate the destination server. Given the widespread use of TLS for secure Internet communications, the security of the digital certificate ecosystem is paramount.

We have shown that the process for obtaining digital certificates from trusted certificate authorities (called domain validation) is vulnerable to attack.

A domain owner submits a Certificate Signing Request (CSR) to a trusted Certificate Authority to obtain a digital certificate. The Certificate Authority must verify that the party submitting the request actually controls the domains covered by that CSR. This process is known as domain control verification and is a core part of the Public Key Infrastructure (PKI) used in the TLS protocol.

In ongoing work presented at the HotPETS workshop in 2017, we demonstrated the feasibility of exploiting BGP attacks to compromise the domain validation protocol. For example, HTTP domain verification is a common method of domain control verification that requires the domain owner to upload a string specified by the CA to a specific HTTP URL at the domain. The CA can then verify the domain via an HTTP GET request. However, an adversary can manipulate inter-domain routing via BGP attacks to intercept all traffic towards the victim web server, and successfully obtain a fraudulent digital certificate by spoofing an HTTP response to the CA's challenge message. We have performed real-world Internet routing manipulation in a controlled and ethical manner to demonstrate the feasibility of these attacks. See our attack demonstration video for a demo.
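Here is a minimal sketch of the vulnerable check, assuming a challenge path that follows ACME's HTTP-01 convention (the exact flow varies by CA; `http01_validate` is an illustrative helper, not Let's Encrypt code). The check implicitly trusts Internet routing: whoever receives the CA's GET request "proves" control of the domain.

```python
# Sketch of HTTP-based domain control verification, modeled on ACME's
# HTTP-01 challenge. The path below follows that convention; the exact
# flow varies by CA. Note the implicit trust in routing: whoever answers
# the CA's GET request "proves" control of the domain.
import secrets
import urllib.request

def http01_validate(domain: str, token: str, expected: str) -> bool:
    """CA-side check: fetch the challenge URL and compare the response."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode().strip() == expected

token = secrets.token_urlsafe(16)
# The legitimate owner provisions `expected` at the challenge URL. But if
# an adversary BGP-hijacks the web server's prefix while the CA validates,
# the GET above reaches the adversary, who serves `expected` and obtains
# a certificate for a domain it does not own.
# http01_validate("example.com", token, expected="...owner-provisioned value...")
```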

This attack has significant consequences for the privacy of our online communications, as adversaries can use fraudulently obtained digital certificates to bypass the protection that encryption offers. Our work is leading to the deployment of suggested countermeasures (verification from multiple vantage points) at Let's Encrypt. Please see the Let's Encrypt deployment for more details.
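The sketch below shows the multiple-vantage-point idea under simplifying assumptions (the `checks` callables stand in for running the validation from distinct network locations, and the quorum policy is invented for illustration): a BGP hijack that is visible from only part of the Internet then fails validation.

```python
# Sketch of multi-vantage-point domain validation: repeat the check from
# several network locations and issue only if a quorum agree. The quorum
# policy and the stand-in `checks` callables are illustrative assumptions.
from typing import Callable, Iterable

def multi_vantage_validate(checks: Iterable[Callable[[], bool]],
                           quorum: float = 1.0) -> bool:
    """Run each vantage point's check; require a fraction >= quorum to pass."""
    results = [bool(check()) for check in checks]
    return bool(results) and sum(results) / len(results) >= quorum

# Three vantage points; a localized hijack fools only the first.
vantages = [lambda: False, lambda: True, lambda: True]
print(multi_vantage_validate(vantages))              # False: issuance refused
print(multi_vantage_validate(vantages, quorum=0.6))  # True under a looser quorum
```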

So far, we have discussed our research results from Princeton University. Below, we briefly discuss research from Laurent Vanbever's group at ETHZ and Sharon Goldberg's group at Boston University, which has shown that it is possible to use inter-domain routing manipulation to attack Bitcoin and to bypass legal protections.

BGP attacks on Crypto-currencies/Bitcoin: BGP manipulation can be used to perform two main types of attacks on crypto-currencies such as Bitcoin: (1) partitioning attacks, in which an adversary aims to disconnect a set of victim Bitcoin nodes from the network, and (2) delaying attacks, in which an adversary slows down the propagation of data towards victim Bitcoin nodes. Both attacks can cause economic loss to Bitcoin nodes.
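As a toy model of the partitioning attack (peers and links invented for illustration): if an adversary hijacks the routes carrying every connection that crosses between a victim set of nodes and the rest of the network, the victims are cut off.

```python
# Toy model of a BGP partitioning attack on a peer-to-peer network like
# Bitcoin. Peers and links are invented for illustration.
from collections import deque

peers = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
         "D": {"B", "C", "E"}, "E": {"D"}}

def reachable(graph, start):
    """Breadth-first search: every peer reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

# The adversary hijacks every link crossing the victim boundary, so only
# links whose endpoints are on the same side of the cut survive.
victims = {"E"}
partitioned = {n: {m for m in nbrs if (n in victims) == (m in victims)}
               for n, nbrs in peers.items()}

print(sorted(reachable(peers, "E")))        # ['A', 'B', 'C', 'D', 'E']
print(sorted(reachable(partitioned, "E")))  # ['E']: the victim is isolated
```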

BGP attacks for bypassing legal protections: Domestic communications between US citizens have legal protections against surveillance. However, adversaries can manipulate inter-domain routing such that the actual communication path involves a foreign country, which could invalidate the legal protections and allow large-scale surveillance of online communications.

Concluding Thoughts: The emergence of routing attacks on anonymity systems, Internet domain validation, and cryptocurrencies shows that conventional wisdom has significantly underestimated the attack surface introduced by the insecurity of Internet routing. It is imperative that critical Internet applications account for the insecurity of Internet routing and analyze the resulting security threats.

Given the vulnerabilities in Internet routing, applications should consider domain-specific defense mechanisms for enhancing user security and privacy. Examples include our Counter-RAPTOR analytics for Tor and our multiple-vantage-point defense for domain validation. We hope that our work, and the research discussed above, will enable this vision.

While it is important to design and deploy application-specific defenses for protecting our systems against routing attacks that exploit current insecure Internet infrastructure, it is even more important to rethink the status quo of insecure routing protocols. Our ultimate goal ought to be to fundamentally eliminate the insecurity in today’s Internet routing protocols by moving towards the adoption of secure countermeasures. How do we drive this change?

Is It Time for a Data Sharing Clearinghouse for Internet Researchers?

Today's Senate hearing with Facebook's Mark Zuckerberg will start a long discussion on data collection and privacy by Internet companies. Although the spotlight is currently on Facebook, we shouldn't forget that the picture is broader: companies from device manufacturers to ISPs collect network traffic and use it for a variety of purposes.

The practices we will hear about today largely concern the widespread collection of data about Internet users for targeted content delivery and advertising. Meanwhile, yesterday Facebook announced an initiative to share data with independent researchers to study social media's impact on elections. At the same time that Facebook is being raked over the coals for sharing its data with "researchers" (Cambridge Analytica), it has announced a program to share its data with (presumably more "legitimate") researchers.

Internet researchers depend on data. Sometimes, we can gather the data ourselves, using measurement tools deployed at the edge of the Internet (e.g., in home networks, on phones). In other cases, we need data from the companies that operate parts of the Internet, such as an Internet service provider (ISP), an Internet registrar, or an application provider (e.g., Facebook).

  • If incentives align, data flows to the researcher. Interacting with a company can work very well when goals are aligned. I’ve worked well with companies to develop new spam filtering algorithms, to develop new botnet detection algorithms, and to highlight empirical results that have informed policy debates.
  • If incentives do not align, then the researcher probably won’t get the data.  When research is purely technical, incentives often align. When the technical work crosses over into policy (as it does in areas like net neutrality, and as we are seeing with Facebook), there can be (insurmountable) hurdles to data access.

How an Internet Researcher Gets Data Today

How do Internet researchers get data from companies today? An Internet operator I know aptly characterizes the status quo:

“Show Internet operators you can do something useful, and they’ll give you data.”

Researchers get access to Internet data from companies in two ways: (1) working for the company (as an employee), or (2) working with the company (as an “independent” researcher).

Option #1: Work for a Company.

Working for a company offers privileged access to data, which can be used to mint impressive papers (irreproducibility aside) simply because nobody else has the same data. I have taken this approach myself on a number of occasions, having worked for an ISP (AT&T), a DNS company (Verisign), and an Internet security service provider (Damballa).

How this approach works. In the 2000s, research labs at AT&T and Sprint had privileged access to data, which gave rise to a proliferation of papers on “Everything You Wanted to Know About the Internet Backbone But Were Afraid to Ask”.  Today, the story repeats itself, except that the players are Google and Facebook, and the topic du jour is data-center networks.

Shortcomings of This Approach. Much research, from projects with a longer arc to certain policy-oriented questions, would never come to light if we relied only on company employees to do it. By the nature of their work, however, company employees lack independence: they lack both autonomy in selecting problems and the ability to take positions or publish results that run counter to the company's goals or priorities. This shortcoming may not matter if what the researcher wants to work on and what the company wants to accomplish are the same. For many technical problems, this is the case (although the technical community still tends to develop tunnel vision around areas where data is abundant, while neglecting others). But for many problems, ranging from those with a longer arc to deployment to those that may run counter to a company's priorities, we can't rely on industry to do the work.

Option #2: Work with a Company.

How this approach works. A researcher may instead work with a company, typically gaining privileged access to data for a particular project. Sometimes, we demonstrate the promise of a technique with data that we can gather or bootstrap without any help, and use that initial study to pique the interest of a company, which may then share data with us to develop the idea further.

Shortcomings of this approach. Research done in collaboration with a company often has shortcomings similar to those of research done within a company's walls. If the results of the research align with the company's perspectives and viewpoints, then data sharing is copacetic. But even these cooperative settings pose some risks to researchers, who may create the perception that they are not independent merely by their association with the company. With purely technical research the risks are lower, though still non-zero: for example, because the work depends on privileged data access, the researcher may still face challenges in presenting the research in a way that could help others reproduce it in the future.

With technical work that can inform or speak to policy questions, there are further concerns. First, certain types of research or results may never come to light: if a company doesn't like the likely outcome of the data analysis, it may simply not share the data, or it may ask for "pre-publication review" of results based on that data (a practice that is also common for research conducted within companies). There is also a second, more subtle concern. Even when the work is technically watertight, a researcher can still face questions, fair or unfair, about the soundness of the work due to the perceived motivations or agendas of the cooperating parties.

Current Data Sharing Approaches Are Good, But They Are Not Sufficient

The above methods for data sharing can work well for certain types of research. In my career, I have made hay playing by these rules: often working with a company, first demonstrating the viability of an idea with a smaller dataset that we gather ourselves, and then "pitching" the idea to the company.

Yet, in my experience these approaches have two shortcomings. The first relates to incentives. The second relates to privacy.

Problem #1: Incentives.

Certain types of work depend on access to Internet data, but the company that holds the data may not have a direct incentive to facilitate the research. Studies of Facebook's effect on elections certainly fall into this category: the company simply may not like the results of the research.

But plenty of other lines of research fall into the category where incentives may not align. Other examples range from measurements of Internet capacity and performance as they relate to broadband regulation (e.g., net neutrality) to evaluations of an online platform's content moderation algorithms and techniques. Much other work relating to consumer protection falls into this category as well. We have to rely on users and researchers measuring things at the edge of the network to figure out what's going on; from this vantage point, certain activities may naturally slip under the radar more easily.

The current Internet data-sharing roadmap doesn't paint a rosy picture for research where incentives don't align. Even when incentives do align, there can be perceptions of "capture": effectively shilling an intellectual or technical finding in exchange for data access.

It is in the interest of everyone, academics and their industry partners alike, to establish more formal modes of data exchange when either (1) the problem is determined to be important to study for the health of the Internet or for the benefit of consumers, or (2) the research has the potential to be perceived as not objective due to the nature of the data-sharing agreement.

Problem #2: Privacy.

Sharing Internet data with researchers can introduce substantial privacy risks, and any sharing of data with a researcher working with a company should be evaluated carefully, ideally by an independent third party.

When helping develop the researcher exception to the FCC’s broadband privacy rules, I submitted a comment that proposed the following criteria for sharing ISP data with researchers:

  1. Purpose of research. The research aims to promote the security, stability, and reliability of networks, and should have clear benefits for Internet innovation, operations, or security.
  2. Research goals do not violate privacy. The goals of the research do not include compromising consumer privacy.
  3. Privacy risks of data sharing are offset by benefits of the research. The risks of the data exchange are offset by the benefits of the research.
  4. Privacy risks of the data sharing are mitigated. Researchers should strive to use de-identified data wherever possible.
  5. The data adds value to the research. The research is enhanced by access to the data.

Yet, outlining the criteria is one thing. The thornier question (which we did not address!) is: Who gets to decide the answers?

Universities have institutional review boards that can help evaluate the merits of such a data-sharing agreement. But Cambridge Analytica might have the veneer of "research," and a company may have no internal incentive to evaluate a data-sharing agreement independently on its merits. In light of recent events, we may be headed towards the conclusion that such agreements should always be vetted by independent third-party review. If the research doesn't involve a university, however, the natural question is: who is that third party?

Looking Ahead: A Clearinghouse for Internet Data?

Certain types of Internet research, particularly those that involve thorny regulatory or policy questions, could benefit from an independent clearinghouse, where researchers could propose studies and experiments and have them evaluated and selected by an independent third party based on their benefits and risks. Facebook is exploring this avenue in the limited setting of election integrity. This is an exciting step.

Moving forward, it will be interesting to see how Facebook's meta-experiment on data sharing plays out, and whether it, or some variant, can serve as a model for Internet data sharing for other types of work writ large. In purely technical areas, such a clearinghouse could allow a broader range of researchers to explore, evaluate, reproduce, and extend the types of work that for now remain largely irreproducible because the data is under lock and key. For these questions, there could be significant benefit to the scientific community. In areas where the technical work or data analysis informs policy questions, the benefits to consumers could be even greater.