July 17, 2018

Can Classes on Field Experiments Scale? Lessons from SOC412

Last semester, I taught a Princeton undergrad/grad seminar on the craft, politics, and ethics of behavioral experimentation. The idea was simple: since large-scale human subjects research is now common outside universities, we need to equip students to make sense of that kind of power and think critically about it.

Path diagram from SOC412 lecture on the Social Media Color Experiment

Most behavioral experiments out in the world are conducted by people with no university training. In 2016, bloggers at NerdyData estimated that the A/B testing company Optimizely's software was deployed on over half a million websites. In 2017, the company announced that it had passed its one millionth experiment. Companies trying to support millions of behavioral studies aren't waiting for universities to train socially conscious experimenters. Instead, training happens in hotel ballrooms at events like Opticon, which draws in over a thousand people every year, SearchEngineLand's similarly sized SMX marketing conference series, and O'Reilly's Strata conferences. And while scientists might consider experiments to be innocuous on their own, many have begun to wonder whether the drive to optimize profits through mass behavioral experimentation may have damaging side effects.

Traditionally, training on field experiments has primarily been offered to grad students, mostly through mentorship with an advisor, in small graduate seminars, or in classes like ICPSR's field experiments summer course. Master's programs in education, economics, and policy also have a history of classes on evaluation. These classes tend to focus on the statistics of experimentation or on the politics of government-directed research. So far, I've found only two other undergraduate field experiments classes: one taught by Esther Duflo in economics and another taught by Carolina Castilla at Colgate.

My class, SOC412, set out to introduce students to the actual work of doing experiments and also to give them the opportunity to reflect, discuss, and write about the wider societal issues surrounding behavioral research at scale. This 10-student seminar was a prototype for a much larger lecture class I'm considering. The class also gave me an opportunity to grow at teaching hybrid classes that combine statistics with critical reflection.

In this post, I describe the class, imagine how it could be improved as a seminar, outline what might need to change for a larger lecture class, and share what I learned. I also include notes for anyone thinking about teaching a class like this.

Goals of the Class

My goal was to introduce students to the process of conducting experiments in the context of wider debates about the role of experiments in society. By the end of the class, students would have designed and conducted more than one field experiment and would have had a chance to write about how those experiments connect to wider social issues. The class alternated between lecture/discussion sessions and lab-like sessions focused on methods. Assignments in the first part of the semester focused on the basics of experimentation, and the second part of the class focused more on developing a final project. You can see the syllabus here.

Scaffolding Student Work on Field Experiments

Designing and completing a field experiment in a single semester isn't just a lot of work; it requires multiple external factors to converge:

  • Collaborations with external partners need to go smoothly
  • The delivery of the intervention and any measurement need to be simple
  • Experiments need to be doable in the available time
  • The university’s ethics board needs to work on the timeline of an undergrad class

In the first post about SOC412, I give more detail on the work I did to scaffold external factors.

SOC412 also gave me a chance to test the idea that the software I’m developing with the team at CivilServant could reduce the overhead of planning and conducting meaningful experiments with communities. By dedicating some of CivilServant’s resources to the class and inviting our community partners to work with students after our research summit in January, I hoped that students would be able to complete a full research cycle in the course of the class.

How the CivilServant software supports community experiments online

We almost did it <grin>. Students were able to fully develop experiment plans by the end of the semester, and we are conducting all of the studies they designed. Here are some of the great outcomes I got to see from students, along with lessons I want to remember for my own future teaching:

  • Asking students to do a first experiment in their own lives is a powerful way to prompt student reflection on the ethics of experimentation
  • Conversations with affected communities do help students think more critically about the contributions and limitations of experimentation in pragmatic settings
  • The statistics parts of the class went smoothest when I anticipated student needs and wrote well-documented code for students to work from
  • It worked well to review basic statistical concepts through prepared datasets and transition students to data from their own experiments partway through the course
  • Lectures that illustrated central concepts in multiple ways worked well
  • Simulations were powerful ways to illustrate p-hacking, false positives, false negatives, and decision criteria for statistical results, since we could adjust the parameters and watch the results change, growing students' intuitions (see the sketch after this list)
  • Short student presentations prompted close reading by students of specific field experiment designs and gave them a chance to explore personal interests more deeply
  • I think I did the right thing to offer students a chance to develop their own unique research ideas beyond CivilServant. This added substantial time to my workload, but it allowed students to explore their own interests. I don't think it will scale well.
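
For readers curious what those simulations looked like, here is a minimal sketch in Python (not the class's actual materials) of the kind of exercise we used: every simulated experiment has no true effect, and testing ten outcomes while reporting whichever looks best inflates the false positive rate far beyond the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate experiments where the treatment has no true effect.
# With alpha = 0.05, a single pre-registered test produces ~5% false
# positives; testing ten outcomes and reporting the best one (p-hacking)
# inflates that rate substantially.
n_experiments, n_outcomes, n_per_arm, alpha = 1000, 10, 100, 0.05
single_hits = hacked_hits = 0

for _ in range(n_experiments):
    pvals = []
    for _ in range(n_outcomes):
        control = rng.normal(0, 1, n_per_arm)
        treatment = rng.normal(0, 1, n_per_arm)  # same distribution: null is true
        pvals.append(stats.ttest_ind(control, treatment).pvalue)
    single_hits += pvals[0] < alpha        # one pre-registered outcome
    hacked_hits += min(pvals) < alpha      # best of ten outcomes

print(f"false positive rate, one outcome:  {single_hits / n_experiments:.2f}")
print(f"false positive rate, best of ten:  {hacked_hits / n_experiments:.2f}")
```

Letting students vary the number of outcomes, the sample size, and the effect size made abstract decision criteria concrete in a way that formulas alone did not.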

Areas for Improving the Class

Here are some of the things that prevented us from meeting the full goals of the class, and how I would teach a seminar class differently in the future:

  • Online discussion:
    • Never use Piazza again. The system steers conversations toward question-and-answer with the instructor rather than discussion, and the company data-mines student behavior and sells it to job recruiters (they make a big show about opt-in, but it's a dark-pattern default checkbox). I'm thinking about shifting to the open source tools Discourse and NB.
  • Statistics:
    • Introduce students directly to a specific person who can provide extra statistics support as needed, rather than just pointing them to institutional resources (Brendan Nyhan does this in his politics experiments syllabus)
    • Pre-register every anticipated hypothesis test before the class, unless you want students to legitimately question your own work after you teach them about p-hacking <grin>
    • When teaching meta-analysis and p-hacking, give students a pre-baked set of experiment results (I'm working on getting several large corpora of A/B tests for this; please get in touch if you know where I can find one)
  • Designing experiments:
    • Students conducted power analyses based on historical data, and difficulties with power analysis caused substantial delays on student projects. Develop standard code for conducting power analyses for experiments with count-variable outcomes that can reasonably be run on student laptops before the heat death of the universe (a sketch of one approach appears after this list).
    • Experiments are a multi-stage process where early planning errors can compound. The class needs a good way to handle ongoing documents that will be graded over time, and which may need to be directly adjusted by the instructor or TA for the project to continue.
    • When using software to carry out experiment interventions, don’t expect that students will read the technical description of the system. Walk students through a worked example of an experiment carried out using that software.
    • Create a streamlined process for piloting surveys quickly
    • Create a standard experiment plan template in Word, Google Docs, and LaTeX. Offering only an outline and an example still yields considerable variation in student work
    • Consider picking a theme for the semester, which will focus students’ theory reading and their experiment ideas
    • Since classes have hard deadlines that cannot easily be altered, do not support student research ideas that involve any new software development.
  • Participatory research process:
    • Schedule meetings with research partners before the class starts and include a regular meeting time in the syllabus (Nyhan does something similar with an “X period”). If you want to offer students input, choose the meeting time at the beginning of the semester and stick to it. Otherwise, you will lose time to scheduling and projects will slip.
    • Write a guide for students on the process of co-designing a research study, one that you run by research partners, that gives students a way to know where they are in the process, check off their progress, and communicate that progress to the instructor.
  • Team and group selection:
    • While it would be nice to allow students to form project teams based on the studies they are interested in, teams likely need to be formed and settled before students are in a position to imagine and develop final project ideas.
  • Writing: Even students with statistics training will have limited experience writing about statistics for a general audience. Here are two things I would do differently:
    • Create a short guide, partly based on the Chicago Guide to Writing about Numbers, that shows a single finding reported well, reported poorly but accurately, and reported poorly and inaccurately. Talk through this example in class/lab.
    • In the early part of the class, while waiting for results from their own first set of experiments, assign students to write results paragraphs from a series of example studies, referring to the guide.
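
As a starting point for that standard power-analysis code, here is a hedged sketch of a simulation-based approach for a two-arm experiment with an overdispersed count outcome; the function name, parameters, and choice of test are illustrative rather than CivilServant's actual implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(baseline_mean, effect_ratio, n_per_arm,
                    dispersion=1.0, n_sims=2000, alpha=0.05):
    """Estimate power for a two-arm experiment with an overdispersed count
    outcome, generated as a gamma-Poisson (negative binomial) mixture and
    tested here with a Mann-Whitney U test."""
    hits = 0
    for _ in range(n_sims):
        # per-unit rates drawn from a gamma distribution give overdispersion
        lam_control = rng.gamma(dispersion, baseline_mean / dispersion, n_per_arm)
        lam_treat = rng.gamma(dispersion,
                              baseline_mean * effect_ratio / dispersion,
                              n_per_arm)
        control = rng.poisson(lam_control)
        treatment = rng.poisson(lam_treat)
        _, pval = stats.mannwhitneyu(control, treatment, alternative="two-sided")
        hits += pval < alpha
    return hits / n_sims

# e.g., power to detect a 20% lift over a baseline of 2 comments per thread
print(simulated_power(baseline_mean=2.0, effect_ratio=1.2, n_per_arm=500))
```

Because the whole analysis is simulated rather than solved analytically, students can swap in their own historical data and estimators, and the run time stays reasonable on a laptop.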

Supporting a Class With a Startup Nonprofit

This class would not have been possible without the CivilServant nonprofit or Eric Pennington, CivilServant's data architect. CivilServant provides software infrastructure for collecting data, conducting surveys, and carrying out randomized trials with online communities. The CivilServant nonprofit (which gained a legal status independent of MIT on April 1st, halfway through the semester) also provided research relationships for students. While grad students developed their own studies, undergraduate students used CivilServant software and depended on the nonprofit's partner relationships.

After the class, some students expressed regret that they didn’t end up doing research outside of the opportunities provided through CivilServant. During the semester, I developed several opportunities to conduct field experiments on the Princeton campus, and I explored further ideas with university administrators. Unfortunately, none of the fascinating student ideas or university leads were achievable within a semester (negotiating with university administrators takes time).

Between the cost of the summit and staff time, CivilServant put substantial resources into the class. Was it worth the time and expense? When working with learners, our research couldn't happen as quickly or efficiently as it might have otherwise. Yet student research also helped CivilServant better focus our software engineering priorities. Supporting the class also gave us a first taste of what it might be like to combine a faculty position with my public interest research. Next spring, we will need to plan well to ensure that CivilServant's wider work isn't put on hold to support the class.

Should SOC412 Be a Lecture or Seminar?

Do I think this class can scale to a lecture course? I think so, with some modifications and under specific conditions:

  • Either (a) drop the participatory component of the course or (b) organize each precept (section) to carry out a single field experiment, coordinated by the preceptor (TA)
  • If needed, relax the goal of completing studies by the end of the semester and find other ways for students to develop their experience communicating results
  • The technical processes for student experiments should not require any custom software, or it will be impossible to support a large number of student projects. This would constrain the scope of possible experiments but increase the chance of students completing their experiments
  • If I’m to teach this as a lecture course next year, I should apply for a teaching grant from Princeton, since scaling the class will take substantial work on software, assignments, and class materials to formalize
  • Notes on Preceptors (TAs)
    • Careful preceptor recruitment, training, and coordination would be essential to scale this class
    • If each precept (section) does a single experiment, the work of developing studies will need to be distributed and managed differently than with the teams of 2-3 that I led
    • The class needs clear grading systems and rubrics for student writing assignments
    • Preceptors in the course could receive a privileged authorship position on any peer reviewed studies from their section, in acknowledgment of the substantial work of supporting this course

Should You Teach a Class Like This?

I had an amazing time teaching SOC412, the students learned well, and we completed and are launching a series of field experiments, all of which are publishable. Teaching this class with ten students was a lot of work, much more than a typical discussion seminar. If you’re thinking about teaching a class like this, here are some questions to ask yourself:

  • do you have the means to deploy multiple field experiments?
  • do you have staff who can support community partnerships?
  • do you have enough partners lined up?
  • is your IRB responsive enough to review quick protocol amendments during a semester?
  • does your department already teach students the needed statistics prerequisites?
  • do you have streamlined ways to conduct experiments that will work for learners?
  • do you have Standard Operating Procedures for common study types, along with full code for the statistical methods?
  • do you have the resources to continuously update any incomplete parts of student projects throughout the course to ensure the quality of projects?

Demystifying The Dark Web: Peeling Back the Layers of Tor’s Onion Services

by Philipp Winter, Annie Edmundson, Laura Roberts, Agnieskza Dutkowska-Żuk, Marshini Chetty, and Nick Feamster

Want to find US military drone data leaks online? Frolic in a fraudster's paradise for people's personal information? Or crawl through the criminal underbelly of the Internet? These are the images that come to mind for most people when they think of the dark web, and a quick Google search for "dark web" will yield many stories like these. Yet far less is said about how the dark web can actually enhance user privacy or overcome censorship by enabling anonymous browsing through Tor. Recently, for example, Brave, a browser dedicated to protecting user privacy, integrated Tor support to help users surf the web anonymously from a regular browser. This raises questions such as: is the dark web for illicit content and dealings only? Can it really be useful for day-to-day web privacy protection? And how easy is it to use anonymous browsing and dark web or "onion" sites in the first place?

To answer some of these pressing questions, we studied how Tor users use onion services. Our work will be presented at the upcoming USENIX Security conference in Baltimore next month; you can read the full paper here or the TL;DR version here.

What are onion services?: Onion services were created by the Tor project in 2004. They not only offer privacy protection for individuals browsing the web but also allow web servers, and thus websites themselves, to be anonymous. This means that any "onion site" or dark web site cannot be physically traced to identify those running the site or where the site is hosted. Onion services differ from conventional web services in four ways. First, they can only be accessed over the Tor network. Second, onion domains (akin to URLs for the regular web) are hashes of the site's public key and consist of a string of letters and numbers, making them long, complicated, and difficult to remember. These domains sometimes contain human-readable prefixes, but such prefixes are expensive to generate (e.g. torprojectqyqhjn.onion). We refer to these as vanity domains. Third, the network path between the client and the onion service is typically longer, meaning slower performance owing to longer latencies. Finally, onion services are private by default, meaning that to find and use an onion site, a user has to know the onion domain, presumably by finding this information organically rather than with a search engine.
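
To make the cost of vanity domains concrete, here is a rough sketch of how a version-2 onion address (the format in use at the time of our study) is derived and why a human-readable prefix requires brute-force search; the key bytes below are random placeholders rather than real Tor keys.

```python
import base64
import hashlib
import os

def onion_address(der_public_key: bytes) -> str:
    """Version-2 style .onion address: base32 of the first 80 bits of the
    SHA-1 hash of the service's DER-encoded public key (16 characters)."""
    digest = hashlib.sha1(der_public_key).digest()[:10]
    return base64.b32encode(digest).decode("ascii").lower() + ".onion"

def find_vanity_prefix(prefix: str, max_attempts: int = 500_000):
    """Brute-force search: keep trying keys until the resulting address
    starts with the desired human-readable prefix."""
    for attempt in range(max_attempts):
        candidate_key = os.urandom(140)  # placeholder for a freshly generated key
        address = onion_address(candidate_key)
        if address.startswith(prefix):
            return address, attempt
    return None, max_attempts

print(onion_address(os.urandom(140)))
print(find_vanity_prefix("ab"))  # short prefixes are quick; long ones are not
```

Each additional character in the desired prefix multiplies the expected number of keys to try by 32 (the size of the base32 alphabet), which is why readable prefixes are expensive to generate.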

What did we do to investigate how Tor users make use of onion services?: We conducted a large-scale survey of 517 Tor users and interviewed 17 Tor users in depth to determine how users perceive, use, and manage onion services and what challenges they face in using these services. We asked our participants how they used Tor's onion services and how they managed onion domains. In addition, we asked users about their expectations of privacy and their privacy and security concerns when using onion services. To complement our qualitative data, we analyzed "leaked" DNS lookups to onion domains, as seen from a DNS root server. This data gave us insights into actual usage patterns to corroborate some of the findings from the interviews and surveys. Our final sample of participants was young and highly educated, and ranged from journalists, whistleblowers, and everyday users wanting to protect their privacy to people doing competitive research on others and wanting to avoid being "outed". Other participants included activists and those who wanted to avoid government detection for fear of persecution or worse.

What were the main findings? First, unsurprisingly, onion services were mostly used for anonymity and security reasons. For instance, 71% of survey respondents reported using onion services to protect their identity online. Almost two thirds of the survey respondents reported using onion services for non-browsing activities such as TorChat, a secure messaging app built on top of onion services. 45% of survey participants had other reasons for using Tor, such as helping educate users about the dark web or hosting their personal blogs. Only 27% of survey respondents reported using onion services to explore the dark web and its content "out of curiosity".

Second, users had a difficult time finding, tracking, and saving onion links. Finding links: Almost half of our survey respondents discovered onion links through social media such as Twitter or Reddit, or by randomly encountering links while browsing the regular web. Fewer survey respondents discovered links through friends and family. Challenges users mentioned in finding onion services included:

  • Onion sites frequently change addresses, so onion domain aggregators often have broken and out-of-date links.
  • Unlike traditional URLs, onion links give no indication of the content of the website, so it is difficult to avoid potentially offensive or illicit content.
  • Also unlike traditional URLs, participants said it is hard to tell from a glance at the address bar whether a site is the authentic one they are trying to reach rather than a phishing site.

A frequent wish expressed by participants was for a better search engine: one that is more up to date and gives an indication of a site's content, and of its authenticity, before one clicks on the link.

Tracking and Saving links: To track and save complicated onion domains, many participants opted to bookmark links, but some did not want to leave a trace of the websites they visited on their machines. The majority of other survey respondents had ad-hoc measures for dealing with onion links. Some memorized a few links, protecting their privacy by not writing the links down; however, this was usually only possible for a few vanity domains. Others simply navigated back to the places where they had found the links in the first place and used the links from there to open the websites they needed.

Third, onion domains are also hard to verify as authentic. Vanity domains: Users appreciated vanity domains, where onion service operators have taken extra effort and expense to set up a domain that is almost readable, as in the case of Facebook's onion site, facebookcorewwwi.onion. Many participants liked that vanity domains give more indication of the content of the domain. However, our participants also felt vanity domains could lead to more phishing attacks, since people would not try to verify the entire onion domain but only the readable prefix. "We also get false expectations of security from such domains. Somebody can generate another onion key with same facebookcorewwwi address. It's hard but may be possible. People who believe in uniqueness of generated characters, will be caught and impersonated." – Participant S494

Verification Strategies: Our participants had a variety of strategies, such as cutting and pasting links, using bookmarks, or verifying the address in the address bar, to check the authenticity of a website. Some checked for a valid HTTPS certificate or familiar images on the website. However, over a quarter of our survey respondents (28%) reported that they could not tell if a site was authentic, and 10% did not check for authenticity at all. Some lamented that this is innate to the design of onion services and that there is no real way to tell if an onion service is authentic, as epitomized by a quote from Participant P1: "I wouldn't know how to do that, no. Isn't that the whole point of onion services? That people can run anonymous things without being able to find out who owns and operates them?"

Fourth, onion lookups suggest typos or phishing. In our DNS dataset, we found that frequently visited popular onion sites, such as Facebook's onion domain, had close lookalikes that were visited significantly less often, suggesting users were making typos or, potentially, that phishing sites exist. Of the top 20 onion domains we encountered in our data set, 16 were significantly similar to at least one other onion domain in the data set. More details are available in the paper.
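
The paper describes the full similarity analysis; as a simplified illustration of the idea, pairwise edit distance can flag onion domains that sit suspiciously close to a popular address (the domain names below are placeholders, not entries from our dataset).

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Placeholder onion domains standing in for names seen in DNS lookups
domains = ["facebookcorewwwi", "facebookcorewwvi", "torprojectqyqhjn"]

# Pairs within a small edit distance suggest typos or look-alike phishing sites
for a, b in combinations(domains, 2):
    distance = edit_distance(a, b)
    if distance <= 2:
        print(a, b, distance)
```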

What do these findings mean for Tor and onion services? Tor and onion services do have a part to play in helping users protect their anonymity and privacy for reasons other than those usually associated with a "nefarious" dark web, such as overcoming censorship, evading stalkers, and exposing others' wrongdoing through whistleblowing. However, to better support these uses of Tor and onion services, our users wanted improvements: broader support for Tor in browsers, better performance, improved privacy and security, educational resources on how to use Tor and onion services, and better onion service search engines. Our results suggest that to enable more users to make use of onion services, users need:

  • better security indicators to help them understand Tor and onion services are working correctly
  • automatic detection of phishing in onion services
  • opt-in publishing of onion domains to improve search for legitimate and legal content
  • better ways to track and save onion links, including privacy-preserving onion bookmarking.

Future studies to further demystify the dark web are warranted, and in our paper we make suggestions for more work to understand the positive aspects of the dark web and how to support privacy protections for everyday users.

You can read more about our study and its limitations here (for instance, our participants were self-selected and may not represent those who do use the dark web for illicit activities), or skim the paper summary.

Internet of Things in Context: Discovering Privacy Norms with Scalable Surveys

by Noah Apthorpe, Yan Shvartzshnaider, Arunesh Mathur, Nick Feamster

Privacy concerns surrounding disruptive technologies such as the Internet of Things (and, in particular, connected smart home devices) have been prevalent in public discourse, with privacy violations from these devices occurring frequently. As these new technologies challenge existing societal norms, determining the bounds of “acceptable” information handling practices requires rigorous study of user privacy expectations and normative opinions towards information transfer.

To better understand user attitudes and societal norms concerning data collection, we have developed a scalable survey method for empirically studying privacy in context.  This survey method uses (1) a formal theory of privacy called contextual integrity and (2) combinatorial testing at scale to discover privacy norms. In our work, we have applied the method to better understand norms concerning data collection in smart homes. The general method, however, can be adapted to arbitrary contexts with varying actors, information types, and communication conditions, paving the way for future studies informing the design of emerging technologies. The technique can provide meaningful insights about privacy norms for manufacturers, regulators, researchers and other stakeholders.  Our paper describing this research appears in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies.

Scalable CI Survey Method

Contextual integrity. The survey method applies the theory of contextual integrity (CI), which frames privacy in terms of the appropriateness of information flows in defined contexts. CI offers a framework to describe flows of information (attributes) about a subject from a sender to a receiver, under specific conditions (transmission principles).  Changing any of these parameters of an information flow could result in a violation of privacy.  For example, a flow of information about your web searches from your browser to Google may be appropriate, while the same information flowing from your browser to your ISP might be inappropriate.

Combinatorial construction of CI information flows. The survey method discovers privacy norms by asking users about the acceptability of a large number of information flows that we automatically construct using the CI framework. Because the CI framework effectively defines an information flow as a tuple (attributes, subject, sender, receiver, and transmission principle), we can automate the process of constructing information flows by defining a range of parameter values for each tuple and generating a large number of flows from combinations of parameter values.
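
For intuition, here is a small sketch of that combinatorial construction; the parameter values and sentence template are invented stand-ins rather than the exact wording of our survey.

```python
from itertools import product

# Illustrative parameter values for each element of a CI information flow
senders = ["a sleep monitor", "a smart thermostat"]
attributes = ["its owner's location", "its owner's usage patterns"]
recipients = ["its manufacturer", "its owner's ISP"]
principles = ["if its owner has given consent",
              "if it is used for advertising",
              None]  # null condition: no stated transmission principle

flows = []
for sender, attribute, recipient, principle in product(
        senders, attributes, recipients, principles):
    question = f"{sender} records {attribute} and sends it to {recipient}"
    if principle is not None:
        question += f" {principle}"
    flows.append(question)

print(len(flows))   # 2 * 2 * 2 * 3 = 24 candidate survey questions
print(flows[0])
```

Each generated flow becomes one acceptability question, which is how a handful of parameter values per category scales up to the thousands of flows described below.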

Applying the Survey Method to Discover Smart Home Privacy Norms

We applied the survey method to 3,840 IoT-specific information flows involving a range of device types (e.g., thermostats, sleep monitors), information types (e.g., location, usage patterns), recipients (e.g., device manufacturers, ISPs) and transmission principles (e.g., for advertising, with consent). 1,731 Amazon Mechanical Turk workers rated the acceptability of these information flows on a 5-point scale from “completely unacceptable” to “completely acceptable”.

Trends in acceptability ratings across information flows indicate which context parameters are particularly relevant to privacy norms. For example, the following heatmap shows the average acceptability ratings of all information flows with pairwise combinations of recipients and transmission principles.


Average acceptability scores of information flows with given recipient/transmission principle pairs. For example, the top left box shows the average acceptability score of all information flows with the recipient “its owner’s immediate family” and the transmission principle “if its owner has given consent.” Higher (more blue) scores indicate that flows with the corresponding parameters are more acceptable, while lower (more red) scores indicate that the flows are less acceptable. Flows with the null transmission principle are controls with no specific condition on their occurrence. Empty locations correspond to less intuitive information flows that were excluded from the survey. Parameters are sorted by descending average acceptability score for all information flows containing that parameter.
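
As a sketch of how such a heatmap is computed from raw responses, the following assumes a long-format table of ratings with the 5-point scale coded from -2 (completely unacceptable) to +2 (completely acceptable); the column names and example rows are illustrative, not our released analysis code.

```python
import pandas as pd

# Illustrative long-format responses: one row per participant rating
ratings = pd.DataFrame({
    "recipient": ["its owner's immediate family", "its owner's ISP",
                  "its manufacturer", "its owner's ISP"],
    "transmission_principle": ["if its owner has given consent",
                               "if it is used for advertising",
                               "null", "if its owner has given consent"],
    "acceptability": [2, -2, 0, 1],
})

# Average acceptability for each recipient / transmission-principle pair,
# which is what the heatmap cells display
heatmap = ratings.pivot_table(index="transmission_principle",
                              columns="recipient",
                              values="acceptability",
                              aggfunc="mean")
print(heatmap)
```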

These results provide several insights about IoT privacy, including the following:

  • Advertising and Indefinite Data Storage Generally Violate Privacy Norms. Respondents viewed information flows from IoT devices for advertising or for indefinite storage as especially unacceptable. Unfortunately, advertising and indefinite storage remain standard practice for many IoT devices and cloud services.
  • Transitive Flows May Violate Privacy Norms. Consider a device that sends its owner’s location to a smartphone, and the smartphone then sends the location to a manufacturer’s cloud server. This device initiates two information flows: (1) to the smartphone and (2) to the phone manufacturer. Although flow #1 may conform to user privacy norms, flow #2 may violate norms. Manufacturers of devices that connect to IoT hubs (often made by different companies), rather than directly to cloud services, should avoid having these devices send potentially sensitive information with greater frequency or precision than necessary.

Our paper expands on these findings, including more details on the survey method, additional results, analyses, and recommendations for manufacturers, researchers, and regulators.

We believe that the survey method we have developed is broadly applicable to studying societal privacy norms at scale and can thus better inform privacy-conscious design across a range of domains and technologies.