[Let’s welcome Aylin Caliskan-Islam, a graduate student at Drexel. In this post she discusses new work that applies machine learning and natural-language processing to questions of privacy and social behavior. — Arvind Narayanan.]
How do we decide how much to share online, given that information can spread to millions of people in large social networks? Is it always our own decision, or are we influenced by our friends? Let’s narrow the problem to one variable: private information. How much private information do we share in our posts, and are we the only authority controlling how much of it we divulge in our textual messages? Understanding how privacy behavior is formed could give us key insights for choosing our privacy settings and friend circles, and for deciding how much privacy to sacrifice in social networks. Christakis and Fowler’s network analytics study showed that obesity spreads through social ties. In another study, they explain that smoking cessation is a collective behavior. Our intuition before analyzing end users’ privacy behavior was that it might also be subject to network phenomena.
In a recent paper that appeared at the 2014 Workshop on Privacy in the Electronic Society, we present a novel method for quantifying the privacy behavior of users with machine learning classifiers and natural-language processing techniques, including topic categorization, named entity recognition, and semantic classification. Following the intuition that some text is more private than other text, we had Amazon Mechanical Turk workers label the tweets of hundreds of users as private or not, based on nine privacy categories that were influenced by Wang et al.’s Facebook regrets categories and Sleeper et al.’s Twitter regrets categories. These labels were used to associate a privacy score with each user, reflecting the amount of private information they reveal. We then trained a machine learning classifier on the calculated privacy scores to predict the privacy scores of 2,000 Twitter users whose data were collected through the Twitter API.
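To make the idea of a per-user privacy score concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the score is the fraction of a user's labeled tweets that workers marked as private; the paper's actual scoring formula may differ. The function name `privacy_scores` and the sample data are hypothetical.

```python
from collections import defaultdict


def privacy_scores(labeled_tweets):
    """Return {user: fraction of that user's labeled tweets marked private}.

    Assumption: the score is a simple fraction. This is an illustrative
    stand-in, not necessarily the formula used in the paper.
    """
    private = defaultdict(int)  # count of tweets labeled private, per user
    total = defaultdict(int)    # count of all labeled tweets, per user
    for user, is_private in labeled_tweets:
        total[user] += 1
        if is_private:
            private[user] += 1
    return {user: private[user] / total[user] for user in total}


# Hypothetical labels: (user, was the tweet labeled private?)
labels = [
    ("alice", True), ("alice", False), ("alice", False), ("alice", True),
    ("bob", False), ("bob", False),
]
scores = privacy_scores(labels)
```

Scores like these could then serve as training targets for a classifier that predicts the scores of users whose tweets were never hand-labeled.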