August 22, 2017

Differential Privacy is Vulnerable to Correlated Data — Introducing Dependent Differential Privacy

[This post is joint work with Princeton graduate student Changchang Liu and IBM researcher Supriyo Chakraborty. See our paper for full details. — Prateek Mittal ]

The tussle between data utility and data privacy

Information sharing is important for realizing the vision of a data-driven customization of our environment. Data that were earlier locked up in private repositories are now being increasingly shared for enabling new context-aware applications, better monitoring of population statistics, and facilitating academic research in diverse fields. However, sharing personal data gives rise to serious privacy concerns as the data can contain sensitive information that a user might want to keep private. Thus, while on one hand, it is imperative to release utility-providing information, on the other hand, the privacy of users whose data is being shared also needs to be protected.

What is differential privacy?

Among existing data privacy metrics, differential privacy (DP) has emerged has a rigorous mathematical framework for defining and preserving privacy, and has received considerable attention from the privacy community. DP is used for protecting the privacy of individual users’ data while publishing aggregate query responses over statistical databases, and guarantees that the distribution of query responses changes only slightly with the addition or deletion of a single tuple (entry or record) in the database. Thus, after observing the query response, an adversary is limited in its ability to infer sensitive data about any tuple and in turn about the user contributing the tuple. For example, if the government aims to publish the average salary of individuals in New Jersey, DP can reveal the approximate average salary of this region while protecting the privacy of any individual user’s salary. DP achieves this balance between utility and privacy by adding noise (perturbation) to the true query result, for example, via the Laplace perturbation mechanism.

Differential privacy is vulnerable to data correlation!

To provide its guarantees, DP implicitly assumes that the data tuples in the database, each from a different user, are all independent. This is a weak assumption, especially because tuple dependence occurs naturally in datasets due to social interactions between users. For example, consider a social network graph with nodes representing users, and edges representing friendship relationships. An adversary can infer the existence of a social link between two users that are not explicitly connected in the graph using the existence of edges among other users in the graph ( since two friends are likely to have many other friends in common). Similarly, private attributes in a user’s record can be inferred by exploiting the public attributes of other users sharing similar interests. An adversary can easily infer a user’s susceptibility to a contagious disease using access to the noisy query result and also the fact that the user’s immediate family members are part of the database. In an era of big-data, the tuples within a database present rich semantics and complex structures that inevitably lead to data correlation. Therefore, data correlation, i.e., the probabilistic dependence relationship among tuples in the database, which is implicitly ignored in DP, can seriously deteriorate the privacy properties of DP.

Previous work has investigated this problem and proposed several interesting privacy metrics, such as zero-knowledge privacy, pufferfish privacy, membership privacy, inferential privacy, etc. However, the privacy community is lacking effective perturbation mechanisms that can achieve these proposed privacy metrics.

In a paper that we presented at the Network and Distributed System Security Symposium (NDSS 2016), we first demonstrate a Bayesian attack on differentially private mechanisms using real-world datasets (such as Gowalla location data). Our attack exploits the correlation between location information and social information of users in the Gowalla dataset. We show that it is possible to infer a user’s sensitive location information from differentially private responses by exploiting her social relationships. When data is correlated, DP underestimates the amount of noise that must be added to achieve a desired bound on the adversary’s ability to make sensitive inferences. Therefore, correlation (or dependency) among data tuples would break the expected privacy guarantees.

How can we improve differential privacy to protect correlated data? Introducing dependent differential privacy.

Overall, our work generalizes the classic differential privacy framework to explicitly consider correlations in datasets, thus enhancing the applicability of the approach to a wide range of datasets and applications. We formalize the notion of dependent differential privacy (DDP); DDP aims to explicitly incorporate probabilistic dependency constraints among tuples in the privacy formulation. We also propose a perturbation mechanism to achieve the privacy guarantees in DDP; our approach extends the conventional Laplace perturbation mechanism by incorporating the correlation between data tuples. To do so, we introduce the concept of dependent coefficient between two tuples, which aims to capture the correlation between data tuples from the perspective of privacy. (The dependent coefficient between two tuples evaluates the ratio between the dependent indistinguishability of one tuple induced by the other tuple and the self indistinguishability of the tuple itself.) Our perturbation mechanism can be applied for arbitrary data queries and we validated its effectiveness using real-world datasets.

What are future research directions for dependent differential privacy?

First, to form a deeper understanding of our dependent differential privacy framework, it would be interesting to explore the application of standard concepts in differential privacy to our framework, such as local sensitivity and smooth sensitivity. Second, the effectiveness of our proposed perturbation mechanism depends on how well the dependence among data can be modeled and computed. One limitation of our work is that the dependence coefficient is exactly known to both the adversary and the database owner. How to accurately compute the dependence coefficient and deal with potential underestimation would be an interesting future work (note that the overestimation of dependence coefficient continues to provide rigorous DDP guarantees). Finally, in our dependent differential privacy framework, we only considered pairwise correlations between tuples in a database. An important research direction is to generalize the pairwise correlations that we considered in our paper to joint correlations in the dataset.

Comments

  1. This is a very nice post! Thanks for pointing out this piece of work. We have some work currently on arxiv on this kind of privacy for correlated data that you might find relevant: http://arxiv.org/abs/1603.03977

    While we use a slightly different privacy framework (Pufferfish privacy), we provide a generalized notion of sensitivity for privacy in correlated data, and also a more specialized algorithm for Bayesian networks. We were not aware of your work before, but we will make sure to cite you in an updated version of our paper.

    • Thanks Kamalika. Your technical report and the generalized notion of sensitivity look very interesting!

  2. Frank McSherry says:

    Hi!

    From your paper,

    > To provide its guarantees, DP mechanisms assume that the data tuples (or records) in the database, each from a different user, are all independent.

    This isn’t true. I think what you probably meant was “To provide *some other, stronger* guarantees, …” or something like that. But, best not to write things that aren’t true, because then other people just copy them down as if they were true.

    You seem interested in the guarantee that a thing remains a secret, even when the subject of the secret no longer has the agency to keep the secret (i.e. withholding their data doesn’t maintain the secret). It is a fine thing to study, but it isn’t the guarantee differential privacy provides.

    I personally think this may not even count as “privacy” any more; is learning that smoking leads to health risks in most people (so, likely you as well) a violation of your privacy, if we would learn this with or without your data? This is probably a lot more debatable, but I hope we can agree that the value of being clear about what is and isn’t true is less debatable.

    I wrote more about this in long form, if that is in any way helpful in explaining where I’m coming from:

    https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-16.md

    Cheers, and thanks for the interesting work!

    • Hi Frank!

      >> “You seem interested in the guarantee that a thing remains a secret, even when the subject of the secret no longer has the agency to keep the secret (i.e. withholding their data doesn’t maintain the secret). It is a fine thing to study, but it isn’t the guarantee differential privacy provides.”

      Indeed, this is the point of the blog post, and what we mean by “privacy for correlated data”. In scenarios where “data owner withholding their data” suffices to maintain the secret (i.e. “independent tuples in the terminology used in our paper), stronger notions of privacy including dependent differential privacy, membership privacy, and inferential privacy seem to reduce to differential privacy.

      **However**, in scenarios where “data owner withholding their data” doesn’t suffice to maintain the secret (i.e. “privacy for correlated data”), what does differential privacy mean? Our work on dependent differential privacy is a generalization in this context.

      In my opinion, the latter scenario is the dominant scenario in many real-world datasets, including social data, communication data, genetic data etc. For example, removing my genetic information from a genomic database may not suffice to protect the secret if a close relative is part of the database.

      Currently, systems designers are applying differential privacy as a black-box to various datasets, without considering the difference between the two scenarios outlined above. We hope that this post and our work inspires further research into stronger notions of privacy (and associated mechanisms!) for protecting user data.

      Finally, +1 to Frank’s blogpost that nicely clarifies the difference between “your secrets” and “secrets about you” that seem to somewhat map to the two scenarios outlined above (?). The challenge for privacy engineers is to clearly differentiate between the two when applying state-of-the-art privacy mechanisms, otherwise users may not get the privacy guarantees they might expect. For example, there are number of papers that apply differential privacy to networked settings, without considering if the intended secrets could be leaked via correlations in the data.

      (Hopefully this clarifies our perspective)

      Thanks,
      Prateek

      • Frank McSherry says:

        Right, so my point is that while

        > In scenarios where “data owner withholding their data” suffices to maintain the secret (i.e. “independent tuples in the terminology used in our paper), stronger notions of privacy including dependent differential privacy, membership privacy, and inferential privacy seem to reduce to differential privacy.

        may be true, when the assumption of independence doesn’t exist (and it really shouldn’t, because assumptions like that are impossible to validate), differential privacy *still provides its guarantees*. It doesn’t turn into a mathematical pumpkin or anything. Saying that its guarantees no longer hold is … =(

        > **However**, in scenarios where “data owner withholding their data” doesn’t suffice to maintain the secret (i.e. “privacy for correlated data”), what does differential privacy mean?

        It provides exactly the guarantee it says: no disclosures (or other bad things) happen as a result of participating in the computation. I have no idea if the disclosure will happen or will not happen, and neither do you, because the data owner may have already disclosed enough information for others to discover the secret; unless you are going to micromanage the rest of their lives, the others may eventually get around to learning the secret. However, if and when it happens, it will not be a result of mis-managing the individual data, and so the person can participate in these sorts of computations without that specific concern.

        > For example, removing my genetic information from a genomic database may not suffice to protect the secret if a close relative is part of the database.

        That’s true, and it is a fun ethical question to determine if you have the right to suppress your relative’s genomic contribution in the interest of keeping your genomic data secret. In a very real sense, your genomic information is not *your* secret. In either case, your *contribution* of genetic information is not what will lead to people learning anything specific about your genome.

      • Frank McSherry says:

        Just to try and make my point differently, here is an alternate abstract you could write that I think is less wrong and more on-message:

        Differential privacy (DP) is a widely accepted mathematical framework for protecting data privacy. Simply stated, it guarantees that the distribution of query results changes only slightly due to the modification of any one tuple in the database. However, this guarantee only conceals the data *from* each individual, it does not conceal the data *about* each individual. Many datasets contain correlated records, where information from some records reflects on many others, and for which differential privacy does not guarantee that we will not learn overly specific information about the participants.

        In this paper, we make several contributions that not only demonstrate the feasibility of exploiting the above vulnerability but also provide steps towards mitigating it. …

  3. Frank McSherry says:

    Hi! I have a few comments (I tried to post more, but they were eaten).

    > To provide its guarantees, DP mechanisms assume that the data tuples (or records) in the database, each from a different user, are all independent.

    This is not true. You are interested in a stronger guarantee, that a thing remains a secret, even when the subject of the secret no longer has the ability to keep it a secret (i.e. withholding their data doesn’t maintain the secret). It is a fine thing to study, but it isn’t the guarantee differential privacy provides.

    More details here, https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-16.md, if that helps explain where I’m coming from.

  4. Frank McSherry says:

    I wrote up some follow-up comments, in longer form.

    https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md

    In particular, I have some questions about Section IV, including (i) which k-means algorithm you used, (ii) why the “information leaked” should be zero when epsilon is zero, and (iii) what the weird shape of your figure 5b is due to.

    Thanks!

    • Changchang Liu says:

      Hi Frank,

      Thanks for your interests. We utilized the differentially private k-means algorithm in [14] (please see their related paragraphs prior to Section 5). For DP adversary, the information leakage is zero under zero differential privacy. For DDP adversary, the information leakage is larger than zero even under zero differential privacy since the dependence relationship available to the adversary would leak some information. Thanks.

    • +1 — Thanks Frank, for the clear and lucid exposition. I look forward to the response by the authors.