September 19, 2017

LinkedIn reveals your personal email to your connections

[Huge thanks to Dillon Reisman, Arvind Narayanan, and Joanna Huey for providing great feedback on early drafts.]

LinkedIn makes the primary email address associated with an account visible to all direct connections, as well as to people who have your email address in their contacts lists. By default, the primary email address is the one that was used to sign up for LinkedIn. While the primary address may be changed to another email in your account settings, there is no way to prevent your contacts from visiting your profile and viewing there whatever email you chose to be primary. In addition, the current data archive export feature of LinkedIn allows users to download their connections’ email addresses in bulk. It seems that the archive export includes all emails associated with an account, not just the one designated as primary.

It appears that many of these addresses are personal, rather than professional. This post uses the contextual integrity (CI) privacy framework to consider whether the access given by LinkedIn violates the privacy norms of using a professional online social network.

Web Census Notebook: A new tool for studying web privacy

As part of the Web Transparency and Accountability Project, we’ve been visiting the web’s top 1 million sites every month using our open-source privacy measurement tool OpenWPM. These crawls have produced numerous worrying findings, such as the systematic abuse of newly introduced web features for fingerprinting, which in turn have prompted better privacy tools and, occasionally, strong responses from browser vendors.

Enabling research is great (OpenWPM has led to 14 papers so far), but research is slow and requires expertise. To make our work more directly useful, today we’re announcing a new tool to study web privacy: a Jupyter notebook interface and a set of libraries to quickly answer most questions about web tracking by querying the 500 GB of data we collect every month.

Jupyter notebook is an intuitive tool for data analysis using Python, and it’s what we use internally for much of our own research. Notebooks are accessible through a simple web interface, yet the code, data, and memory persist on the server if you close the browser and return later (even from a different device). Notebooks combine code with visualizations, making them ideal for data exploration and analysis.

Who could benefit from this tool? We envision uses such as these:

  • Publishers could use our data to understand third-party tracking on their own websites.
  • Journalists could use our data to investigate and expose privacy-infringing practices.
  • Regulators and enforcement agencies could use our tool in investigations.
  • Creators of browser privacy tools could use our data to test their effectiveness.

Let’s look at an example that shows the feel of the interface. The code below computes the average number of embedded trackers on the top 100 websites in various categories such as “news” and “shopping”. It is intuitive and succinct. Without our interface, not only would the SQL version of this query be much more cumbersome, but it would require a ton of legwork and setup to even get to a point where you can write the query. Now you just need to point your browser at our notebook.

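    # `census` is assumed to be already available in the notebook.
    for category, domains in census.first_parties.alexa_categories.items():
        # Count every tracker resource on the category's top 100 sites.
        tracker_count = sum(1 for first_party in domains[:100]
                            for third_party in first_party.third_party_resources
                            if third_party.is_tracker)
        avg = tracker_count / 100.0  # float division under Python 2 as well
        print("Average number of third party trackers on %s sites: %.1f" % (category, avg))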

The results confirm our finding that news sites have the most trackers, and adult sites the least. [1]

Here’s what happens behind the scenes:

  • census is a Python object that exposes all the relationships between websites and third parties as object attributes, hiding the messy details of the underlying database schema. Each first party is represented by a FirstParty object that gives access to each third-party resource (URI object) on the first party, and the ThirdParty that the URI belongs to. When the objects are accessed, they are instantiated automatically by querying the database.
  • census.first_parties is a container of FirstParty objects ordered by Alexa traffic rank, so you can easily analyze the top sites, or sites in the long tail, or specific sites. You can also easily slice the sites by category: in the example above, we iterate through each category of census.first_parties.alexa_categories.
  • There’s a fair bit of logic that goes into analyzing the crawl data to determine which third parties are embedded on which websites, and cross-referencing that with tracking-protection lists to figure out which of those are trackers. This work is already done for you, and exposed via attributes such as ThirdParty.is_tracker.
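
The lazy instantiation described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: the table and column names, the toy data, and the classes ResourceSketch and FirstPartySketch are all hypothetical, and an in-memory SQLite database stands in for the real crawl data.

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory database (hypothetical schema, toy data)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE resources (first_party TEXT, url TEXT, third_party TEXT)")
    db.execute("CREATE TABLE trackers (domain TEXT PRIMARY KEY)")
    db.executemany(
        "INSERT INTO resources VALUES (?, ?, ?)",
        [
            ("news.example", "https://ads.example/a.js", "ads.example"),
            ("news.example", "https://ads.example/b.gif", "ads.example"),
            ("news.example", "https://cdn.example/lib.js", "cdn.example"),
            ("shop.example", "https://ads.example/a.js", "ads.example"),
        ],
    )
    db.execute("INSERT INTO trackers VALUES ('ads.example')")
    return db


class ResourceSketch:
    """One third-party resource loaded by a first party (hypothetical)."""

    def __init__(self, db, url, third_party):
        self._db = db
        self.url = url
        self.third_party = third_party

    @property
    def is_tracker(self):
        # Cross-reference the owning domain against the tracker list.
        row = self._db.execute(
            "SELECT 1 FROM trackers WHERE domain = ?", (self.third_party,)
        ).fetchone()
        return row is not None


class FirstPartySketch:
    """A site whose resources are fetched only on first access (hypothetical)."""

    def __init__(self, db, domain):
        self._db = db
        self.domain = domain
        self._resources = None  # nothing is queried until it is needed

    @property
    def third_party_resources(self):
        if self._resources is None:
            rows = self._db.execute(
                "SELECT url, third_party FROM resources "
                "WHERE first_party = ? ORDER BY url",
                (self.domain,),
            ).fetchall()
            self._resources = [ResourceSketch(self._db, u, tp) for u, tp in rows]
        return self._resources


db = build_toy_db()
site = FirstPartySketch(db, "news.example")
tracker_count = sum(1 for r in site.third_party_resources if r.is_tracker)
print(tracker_count)
```

The point of the design is that the caller writes plain attribute access while the SQL stays hidden inside the properties, and no query runs for a site until its resources are actually requested.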

Since the notebooks run on our server, we expect to be able to support only a limited number (a few dozen) at this point, so you need to apply for access. The tool is currently in beta as we smooth out rough edges and add features, but it is usable and useful. Of course, you’re welcome to run the notebook on your own server: the underlying crawl datasets are public, and we’ll release the code behind the notebooks soon. We hope you find it useful, and we welcome your feedback.

 

[1] The linked graph from our paper measures the number of distinct domains whereas the query above counts every instance of every tracker. The trends are the same in both cases, but the numbers are different. Here’s the output of the query:

    Average number of third party trackers on computers sites: 41.0
    Average number of third party trackers on regional sites: 68.8
    Average number of third party trackers on recreation sites: 58.2
    Average number of third party trackers on health sites: 38.4
    Average number of third party trackers on news sites: 151.2
    Average number of third party trackers on business sites: 55.0
    Average number of third party trackers on kids_and_teens sites: 74.8
    Average number of third party trackers on home sites: 94.5
    Average number of third party trackers on arts sites: 108.6
    Average number of third party trackers on sports sites: 86.6
    Average number of third party trackers on reference sites: 43.8
    Average number of third party trackers on science sites: 43.1
    Average number of third party trackers on society sites: 73.5
    Average number of third party trackers on shopping sites: 53.1
    Average number of third party trackers on adult sites: 16.8
    Average number of third party trackers on games sites: 70.5
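
To make the difference between the two measures in this footnote concrete, here is a small sketch on invented data. The resource list is hypothetical and not taken from the crawl; it only shows why counting every tracker instance yields a larger number than counting distinct tracker domains.

```python
# Hypothetical (domain, is_tracker) pairs for resources loaded by one site.
resources = [
    ("ads.example", True),
    ("ads.example", True),
    ("metrics.example", True),
    ("cdn.example", False),
    ("ads.example", True),
]

# What the query counts: every tracker resource instance.
instances = sum(1 for domain, is_tracker in resources if is_tracker)

# What the graph in the paper counts: distinct tracker domains.
distinct_domains = len({domain for domain, is_tracker in resources if is_tracker})

print(instances, distinct_domains)
```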

Engineering around social media border searches

The latest news is that the U.S. Department of Homeland Security is considering a requirement that prospective visitors submit their “online presence” for inspection while passing through a border checkpoint. That means immigration officials would require users to divulge their passwords to Facebook and other such services, which the agent might then inspect, right there, at the border crossing. This raises a variety of concerns, from its chilling impact on freedom of speech to its being an unreasonable search or seizure, never mind whether an airport border agent has the necessary training to make such judgments, much less the time to do it while hundreds of people are waiting in line to get through.

Rather than conduct a serious legal analysis, however, I want to talk about technical countermeasures. What might Facebook or other such services do to help defend their users as they pass a border crossing?

Fake accounts. It’s certainly feasible today to create multiple accounts for yourself, giving up the password to a fake account rather than your real account. Most users would find this unnecessarily cumbersome, and the last thing Facebook or anybody else wants is to have a bunch of fake accounts running around. It’s already a concern when somebody tries to borrow a real person’s identity to create a fake account and “friend” their actual friends.

Duress passwords. Years ago, my home alarm system had the option to have two separate PINs. One of them would disable the alarm as normal. The other would sound a silent alarm, summoning the police immediately while making it seem like I disabled the alarm. Let’s say Facebook supported something similar. You enter the duress password, then Facebook locks out your account or switches to your fake account, as above.
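
As a rough illustration of the idea, the sketch below models an account with a normal and a duress password. It is a thought experiment, not a description of any real Facebook feature: the class AccountSketch, its return values, and the choice to lock the real account after a duress login are all assumptions made for the example.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """Derive a key from a password with PBKDF2 from the standard library."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class AccountSketch:
    """Hypothetical account that stores a normal and a duress password."""

    def __init__(self, normal_password, duress_password):
        self._salt = os.urandom(16)
        self._normal = hash_password(normal_password, self._salt)
        self._duress = hash_password(duress_password, self._salt)
        self.locked = False

    def login(self, password):
        """Return 'real', 'decoy', or 'denied'."""
        attempt = hash_password(password, self._salt)
        # Run both comparisons so timing does not reveal which one matched.
        is_normal = hmac.compare_digest(attempt, self._normal)
        is_duress = hmac.compare_digest(attempt, self._duress)
        if is_duress:
            self.locked = True  # the real account stays locked from now on
            return "decoy"
        if is_normal and not self.locked:
            return "real"
        return "denied"
```

In this sketch the duress login looks like a successful login to the person watching, while silently locking the real account. How the lock is later lifted is left open here.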

Temporary lockouts. If you know you’re about to go through a border crossing, you could give a duress password, as above, or you could arrange an account lockout in advance. You might, for example, designate ten trusted friends, any five of whom must declare that the lockout is over before access is restored. Absent those declarations, your account would remain locked, and there would be no means for you to be coerced into giving access to your own account.
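
The five-of-ten rule above amounts to a simple quorum check. A minimal sketch follows, assuming a hypothetical LockoutSketch class; a real service would also need to authenticate each friend, which is omitted here.

```python
class LockoutSketch:
    """Account lock released only when enough trusted friends approve."""

    def __init__(self, trusted_friends, threshold):
        self._friends = frozenset(trusted_friends)
        if not 0 < threshold <= len(self._friends):
            raise ValueError("threshold must be between 1 and the number of friends")
        self._threshold = threshold
        self._approvals = set()

    def approve(self, friend):
        """Record one friend's declaration that the lockout is over."""
        if friend not in self._friends:
            raise PermissionError("not a designated trusted friend")
        self._approvals.add(friend)  # a set ignores repeat declarations

    @property
    def locked(self):
        return len(self._approvals) < self._threshold


friends = ["friend%d" % i for i in range(10)]
lock = LockoutSketch(friends, threshold=5)
for name in friends[:4]:
    lock.approve(name)
print(lock.locked)  # four approvals are not enough
```

Because the account owner is not among the approvers, there is nothing the owner alone can hand over that would unlock the account.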

Temporary sanitization. Absent any action from Facebook, the best advice today for somebody about to go through a border crossing is to sanitize their account before going through. That means attempting to second-guess what border agents are looking for and delete it in advance. Facebook might assist this by providing search features to allow users to temporarily drop friends, temporarily delete comments or posts with keywords in them, etc. As with the temporary lockouts, temporary sanitization would need to have a restoration process that could be delegated to trusted friends. Once you give the all-clear, everything comes back again.

User defense in bulk. Every time a user, going through a border crossing, exercises a duress password, that’s an unambiguous signal to Facebook. Even absent such signals, Facebook would observe highly unusual login behavior coming from those specific browsers and IP addresses. Facebook could simply deny access to its services from government IP address blocks. While it’s entirely possible for the government to circumvent this, whether using Tor or whatever else, there’s no reason that Facebook needs to be complicit in the process.
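
Denying access by address block is mechanically simple, as the sketch below suggests. The ranges listed are placeholders drawn from the blocks reserved for documentation, not real government networks, and the function name is_blocked is hypothetical; maintaining an accurate list is the hard part and is not addressed here.

```python
import ipaddress

# Placeholder ranges reserved for documentation; a real deployment would
# need an actual, maintained list (an assumption, not provided here).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```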

So is there a reasonable alternative?

While it’s technically feasible for the government to require that Facebook give it full “backdoor” access to each and every account so it can render threat judgments in advance, this would constitute the most unreasonable search and seizure in the history of that phrase. Furthermore, if and when it became common knowledge that such unreasonable seizures were commonplace, that would be the end of the company. Facebook users have an expectation of privacy and will switch to other services if Facebook cannot protect them.

Wouldn’t it be nice if there were some less invasive way to support the government’s desire for “extreme vetting”? Can we protect ordinary users’ privacy while still enabling the government to intercept people who intend harm to our country? We certainly must assume that an actual bona fide terrorist is going to have no trouble creating a completely clean online persona to use while crossing a border. They can invent wholesome friends with healthy children sharing silly videos of cute kittens. While we don’t know much about our existing strategies for distinguishing tourists from terrorists, we have to assume that the process involves the accumulation of signals and human intelligence, and other painstaking efforts by professional investigators to protect our country from harm. It’s entirely possible that they’re already doing a good job.