Computer science research on re-identification has repeatedly demonstrated that sensitive information can be inferred even from de-identified data in a wide variety of domains. This has posed a vexing problem for practitioners and policy makers. If the absence of “personally identifying information” cannot be relied on for privacy protection, what are the alternatives? Joanna Huey, Ed Felten, and I tackle this question in a new paper, “A Precautionary Approach to Big Data Privacy”. Joanna presented the paper at the Computers, Privacy & Data Protection conference earlier this year.
In a recent post, I talked about our paper showing how to identify anonymous programmers from their coding styles. We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., the grammatical structure of source code) to represent programmers’ coding styles. The previous post focused on the overall results and the techniques we used. Today I’ll talk about applications and explain how source-code authorship attribution can be used in software forensics, plagiarism detection, copyright or copyleft investigations, and other domains.
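To make these feature categories concrete, here is a minimal, self-contained sketch of how one might extract lexical, layout, and syntactic features from Python source. This is not the pipeline from our paper; the `ast` module here is just a crude stand-in for full parse-tree features, and the specific feature names are illustrative.

```python
# Illustrative sketch (not the paper's pipeline): extract simple lexical,
# layout, and syntactic style features from a Python source file.
import ast
import re
from collections import Counter

def extract_features(source: str) -> dict:
    """Map a source file to a small stylometric feature vector."""
    lines = source.splitlines()
    features = {}

    # Lexical features: statistics over identifier-like tokens.
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    features["avg_name_len"] = sum(map(len, names)) / max(len(names), 1)
    features["underscore_ratio"] = sum("_" in n for n in names) / max(len(names), 1)

    # Layout features: how the author formats whitespace.
    features["tab_indent_ratio"] = sum(l.startswith("\t") for l in lines) / max(len(lines), 1)
    features["blank_line_ratio"] = sum(not l.strip() for l in lines) / max(len(lines), 1)

    # Syntactic features: relative frequencies of AST node types,
    # a crude proxy for the grammatical structure of the code.
    node_counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    total = sum(node_counts.values())
    for node_type, count in node_counts.most_common(10):
        features["ast_" + node_type] = count / total

    return features
```

Intuitively, the syntactic features are the hardest for an author to disguise, since grammatical habits tend to survive superficial reformatting better than spacing or naming choices do.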
Freedom to Tinker readers are probably aware of the current controversy over Google’s handling of ongoing security vulnerabilities in its Android WebView component. What sounds at first like a routine security problem turns out to have some deep challenges. Let’s start by filling in some background and build up to the big problem they’re not talking about: Android advertising.
Every programmer learns to code in a unique way, which results in distinctive “fingerprints” in coding style. These fingerprints can be used to compare the source code of known programmers with an anonymous piece of source code to determine which of the known programmers authored the anonymous code. This method can aid in finding malware programmers or in detecting cases of plagiarism. In a recent paper, we studied this question, which we call source-code authorship attribution. We introduced a principled method with a robust feature set and achieved a breakthrough in accuracy.
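Given such feature vectors, attribution itself reduces to supervised classification: train on code samples with known authors, then predict the author of the anonymous sample. Below is a hedged sketch of that step using scikit-learn, reusing the `extract_features` helper sketched above; the corpus format, parameters, and random-forest choice are illustrative rather than a faithful reproduction of our setup.

```python
# Minimal sketch of attribution as supervised classification, assuming
# the extract_features() helper from the earlier sketch and a hypothetical
# corpus of (source_code, author_name) pairs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def train_attributor(samples):
    """samples: list of (source_code, author_name) pairs with known authors."""
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform([extract_features(src) for src, _ in samples])
    y = [author for _, author in samples]
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    return vec, clf

def attribute(vec, clf, anonymous_source):
    """Rank the known programmers by stylistic similarity to the anonymous code."""
    X = vec.transform([extract_features(anonymous_source)])
    probs = clf.predict_proba(X)[0]
    return sorted(zip(clf.classes_, probs), key=lambda pair: -pair[1])
```

Returning a ranked list rather than a single name matters in forensic settings, where an investigator wants candidate authors with confidence estimates, not a bare verdict.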
In the news, we have a consortium of French publishers, which somehow includes several major U.S. corporations (Google, Microsoft), attempting to sue AdBlock Plus developer Eyeo, a German firm with developers around the world. I have no idea of the legal basis for their case, but it’s all about the money. AdBlock Plus and the closely related AdBlock are, by far, among the most popular Chrome extensions, and publishers will no doubt claim huge monetary damages based on presumed “lost income”.
[Let's welcome Aylin Caliskan-Islam, a graduate student at Drexel. In this post she discusses new work that applies machine learning and natural-language processing to questions of privacy and social behavior. — Arvind Narayanan.]
How do we decide how much to share online, given that information can spread to millions in large social networks? Is it always our own decision, or are we influenced by our friends? Let’s isolate this problem to a single variable: private information. How much private information are we sharing in our posts, and are we the only authority controlling how much private information to divulge in our textual messages? Understanding how privacy behavior is formed could give us key insights for choosing our privacy settings, our circles of friends, and how much privacy to sacrifice in social networks. Christakis and Fowler’s network analytics study showed that obesity spreads through social ties. In another study, they explain that smoking cessation is a collective behavior. Our intuition before analyzing end users’ privacy behavior was that privacy behavior might likewise be subject to such network effects.
In a recent paper that appeared at the 2014 Workshop on Privacy in the Electronic Society, we present a novel method for quantifying the privacy behavior of users, using machine learning classifiers and natural-language processing techniques including topic categorization, named-entity recognition, and semantic classification. Following the intuition that some textual data is more private than other data, we had Amazon Mechanical Turk workers label the tweets of hundreds of users as private or not, based on nine privacy categories influenced by Wang et al.’s Facebook regrets categories and Sleeper et al.’s Twitter regrets categories. These labels were used to assign each user a privacy score reflecting the amount of private information they reveal. We then trained a machine learning classifier on the calculated privacy scores to predict the privacy scores of 2,000 Twitter users whose data were collected through the Twitter API.
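As a rough illustration of the scoring-and-prediction step, here is a sketch that derives a per-user privacy score from crowdsourced labels and fits a simple text model to predict that score for unseen users. It omits the topic categorization, named-entity recognition, and semantic classification used in the paper, and it substitutes a plain bag-of-words regressor for the paper’s classifier; all names and parameters are hypothetical.

```python
# Illustrative sketch, not the paper's exact pipeline: compute per-user
# privacy scores from crowdsourced tweet labels, then fit a text model
# to estimate scores for users without labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def privacy_score(labels):
    """Fraction of a user's tweets that crowd workers labeled private."""
    return sum(labels) / len(labels)

def train_score_model(users):
    """users: list of (tweets, labels) pairs, with one 0/1 label per tweet."""
    corpus = [" ".join(tweets) for tweets, _ in users]
    scores = [privacy_score(labels) for _, labels in users]
    vec = TfidfVectorizer(min_df=2)
    model = Ridge(alpha=1.0).fit(vec.fit_transform(corpus), scores)
    return vec, model

def predict_score(vec, model, tweets):
    """Estimate a user's privacy score from their tweet text alone."""
    return float(model.predict(vec.transform([" ".join(tweets)]))[0])
```

The key design point is the same as in the paper: expensive human labels are collected once for a seed population, and a learned model then extends the privacy score to users who were never labeled.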
“A Scanner Darkly”, a dystopian 1977 Philip K. Dick novel (adapted into a 2006 film), describes a society with pervasive audio and video surveillance. Our paper “A Scanner Darkly”, which appeared at last year’s IEEE Symposium on Security and Privacy (Oakland) and has just received the 2014 PET Award for Outstanding Research in Privacy Enhancing Technologies, takes a closer look at the soon-to-come world where ubiquitous surveillance is performed not by the drug police but by everyday devices with high-bandwidth sensors.