May 29, 2017

How do we decide how much to reveal? (Hint: Our privacy behavior might be socially constructed.)

[Let’s welcome Aylin Caliskan-Islam, a graduate student at Drexel. In this post she discusses new work that applies machine learning and natural-language processing to questions of privacy and social behavior. — Arvind Narayanan.]

How do we decide how much to share online given that information can spread to millions in large social networks? Is it always our own decision or are we influenced by our friends? Let’s isolate this problem to one variable: private information. How much private information are we sharing in our posts, and are we the only authority controlling how much private information to divulge in our textual messages? Understanding how privacy behavior is formed could give us key insights for choosing our privacy settings, friend circles, and how much privacy to sacrifice in social networks. Christakis and Fowler’s network analytics study showed that obesity spreads through social ties. In another study, they explain that smoking cessation is a collective behavior. Our intuition before analyzing end users’ privacy behavior was that privacy behavior might also be under the effect of network phenomena.

In a recent paper that appeared at the 2014 Workshop on Privacy in the Electronic Society, we present a novel method for quantifying privacy behavior of users by using machine learning classifiers and natural-language processing techniques including topic categorization, named entity recognition, and semantic classification. Following the intuition that some textual data is more private than others, we had Amazon Mechanical Turk workers label tweets of hundreds of users as private or not based on nine privacy categories that were influenced by Wang et al.’s Facebook regrets categories and Sleeper et al.’s Twitter regrets categories. These labels were used to associate a privacy score with each user to reflect the amount of private information they reveal. We trained a machine learning classifier based on the calculated privacy scores to predict the privacy scores of 2,000 Twitter users whose data were collected through the Twitter API.
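To make the idea of a per-user privacy score concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a user's score is the fraction of their labeled tweets marked private; the paper's actual scoring formula may differ. The function name `privacy_scores`, the input format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def privacy_scores(labeled_tweets):
    """Return {user: fraction of that user's tweets labeled private}.

    labeled_tweets is an iterable of (user_id, is_private) pairs.
    ASSUMPTION: the score is a simple fraction; this is an illustrative
    stand-in, not the scoring method from the paper.
    """
    totals = defaultdict(int)
    private = defaultdict(int)
    for user, is_private in labeled_tweets:
        totals[user] += 1
        if is_private:
            private[user] += 1
    return {user: private[user] / totals[user] for user in totals}


# Hypothetical labels, standing in for annotations from crowd workers.
labels = [
    ("alice", True),
    ("alice", False),
    ("bob", False),
    ("bob", False),
    ("carol", True),
]
scores = privacy_scores(labels)
```

Scores like these could then serve as training targets for a classifier that predicts scores for users whose tweets were never hand-labeled.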

Additionally, we found that there is a correlation between the privacy score of a user and those of her friends. The correlation is even higher between a user and the other users she mentions in her tweets. People with similar privacy scores appear in groups. The possible causal relationships behind this phenomenon need further exploration.
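One way such a correlation could be measured is sketched below. It assumes a Pearson correlation between each user's score and the mean score of her friends; the post does not say which statistic was used, so this is only an illustration. The names `pearson` and `user_friend_correlation` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def user_friend_correlation(scores, friends):
    """Correlate each user's score with the mean score of her friends.

    scores:  {user: privacy score}
    friends: {user: list of friend ids}
    Users without a score, or with no scored friends, are skipped.
    """
    own, friend_means = [], []
    for user, friend_list in friends.items():
        known = [scores[f] for f in friend_list if f in scores]
        if user in scores and known:
            own.append(scores[user])
            friend_means.append(sum(known) / len(known))
    return pearson(own, friend_means)


# Toy data: two pairs of friends with similar scores within each pair.
toy_scores = {"a": 0.1, "b": 0.2, "c": 0.8, "d": 0.9}
toy_friends = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
r = user_friend_correlation(toy_scores, toy_friends)
```

Replacing the friend lists with the set of users each person mentions would give the mention-based variant of the same comparison. A high value of `r` shows only that similar scores cluster together, not why.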

The ability to automatically quantify private information disclosure and compute privacy scores provides a potentially useful method for users, researchers, and companies. A user can make sharing decisions in a more informed manner if the privacy risk associated with each friend is known. For example, she can take privacy scores into account when constructing friend lists. Researchers who study people’s use of social media can also use the privacy score calculation method for a fine-grained analysis of individual privacy behavior. For instance, which types of textual data (messages, status updates, mentions, or comments) contain more private information?

Social media companies could tailor “nudges” based on users’ (and their friends’) privacy scores. For example, a social network could alert the user when she is about to share content that appears to be highly private with a group of friends that includes users with low privacy scores. A recent study by Wang et al. on privacy nudges shows promising results on preventing unintended disclosure and associated regret. Finally, ethical issues aside, social media companies are also in a position to run controlled experiments to determine if privacy behaviors are indeed contagious.

We are planning to do another privacy analytics study after obtaining IRB approval to learn more about how people are influenced to reveal private information and about the effects of Facebook’s default newsfeed algorithm. The correlation between the privacy score of a user and those of her friends gives a starting point for investigating the causal factors behind self-disclosure. A better understanding of these factors can help in designing effective privacy-enhancing technologies and in targeting educational interventions.


  1. A couple of observations from 20+ years in various online fora:

    1) sharing tends to encourage sharing and discretion, discretion, as a matter of reciprocity. The same is true in physical spaces, where people tend to associate with other people whose level of personal-information-sharing they feel comfortable with (No, I really didn’t need to hear a long explication of your in-laws’ time in rehab).

    2) There are typically a few outliers in any group, “leakers”, “performers” or non-sharers. Group dynamics with regard to such people can be interesting, and it might be interesting (IRBs willing) to see what happened if there were tools for identifying same.

  2. David R Brake says:

    I am a scholar working in this area and this looks like a very handy piece of research! If you want a more qualitative picture of why people share (overall, not just in terms of their peer sharing behaviour) you might want to check out Sharing Our Lives Online: Risks and Exposure in Social Media, just published by Palgrave.