Last week, the Federal Communications Commission (FCC) announced new privacy rules that govern how Internet service providers can share information about consumers with third parties. One focus of this rulemaking has been on the use and sharing of so-called “Customer Proprietary Network Information” (CPNI)—information about subscribers—for advertising. The Center for Information Technology Policy and the Center for Democracy and Technology jointly hosted a panel exploring this topic last May, and I have previously written on certain aspects of this issue, including what ISPs might be able to infer about user behavior even if network traffic were encrypted.
Although the forthcoming rulemaking targets the collection, use, and sharing of customer data with “third parties”, an important—and oft-forgotten—facet of this discussion is that (1) ISPs rely on the collection, use, and sharing of CPNI to operate and secure their networks and (2) network researchers (myself included) rely on this data to conduct our research. As one example of our work, discussed today in the Wall Street Journal, we used DNS domain registration data to identify cybercriminals before they launch attacks. Performing this research required access to all .com domain registrations. We have also developed algorithms that detect the misuse of DNS domain names by analyzing the DNS lookups themselves, and we have worked with ISPs to explore the relationship between Internet speeds and usage, which required access to byte-level usage data from individual customers. ISPs also rely on third parties, including Verisign and Arbor Networks, to detect and mitigate attacks; network equipment vendors use traffic traces from ISPs to test new products and protocols. In summary, although the goal of the FCC’s rulemaking is to protect consumer data, the rulemaking could have had unintended negative consequences for the stability and security of the Internet, as well as for Internet innovation.
In response to the potential negative effects this rule could have created for Internet security and networking researchers, I filed a comment with the FCC highlighting how network operators and researchers depend on data to keep the network operating well, to keep it secure, and to foster continued innovation. My comment, filed in May, highlights the types of data that Internet service providers (ISPs) collect, how they use that data for operational and research purposes, and the potential privacy concerns with each of these datasets. In the comment, I exhaustively enumerate the types of data that ISPs collect; the following data types are particularly interesting because ISPs and researchers rely on them heavily, yet they also introduce certain privacy concerns:
- IPFIX (“NetFlow”) data, which is the Internet traffic equivalent of call data records. IPFIX data is collected at a router and contains statistics about each traffic flow that traverses the router: the “metadata” of each flow (e.g., the source and destination IP addresses and the start and end times of the flow). This data doesn’t contain “payload” information, but as previous research on telephone metadata has shown, a lot can be learned about a user from this kind of information. At the same time, this data has been used in research and security for many purposes, including detecting botnets and denial-of-service attacks.
- DNS query data, which contains information about the domain names that each IP address (i.e., customer) is looking up (e.g., from a Web browser, from an IoT device, and so on). DNS query data can be highly revealing, as we have shown in previous work. Yet, at the same time, DNS query data is incredibly valuable for detecting Internet abuse, including botnets and malware. (A sketch of roughly what these two kinds of records contain appears below.)
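To make the two record types above concrete, here is a minimal sketch (in Python, with illustrative field names and example addresses; real IPFIX templates and DNS query logs vary by vendor and collector) of roughly what a single flow record and a single DNS query record contain:

```python
# A rough sketch of the kind of "metadata" in one IPFIX/NetFlow record.
# Field names and values are illustrative, not an actual IPFIX template.
flow_record = {
    "src_ip": "203.0.113.7",      # subscriber side of the flow
    "dst_ip": "198.51.100.20",    # remote endpoint
    "src_port": 51544,
    "dst_port": 443,
    "protocol": "TCP",
    "start_time": "2016-10-31T14:03:02Z",
    "end_time": "2016-10-31T14:03:41Z",
    "bytes": 48210,
    "packets": 63,
    # Note: no packet payloads are included.
}

# A rough sketch of one entry in a DNS query log.
dns_query_record = {
    "client_ip": "203.0.113.7",
    "timestamp": "2016-10-31T14:03:01Z",
    "qname": "www.example.com",
    "qtype": "A",
    "response": ["198.51.100.20"],
}
```

Even without payloads, records like these reveal which customer talked to which endpoints, when, and for how long, which is why the minimization practices discussed below matter.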
Over the summer, I gave a follow-up presentation and filed follow-up comments (several of which were jointly authored with members of the networking and security research community) to help draw attention to how much Internet research depends on access to this type of data. In early August, a group of us filed a comment with proposed wording for the upcoming rule. In this comment, we delineated the types of work that should be exempt from the upcoming rules. We argue that research should be exempt from the rulemaking if the research: (1) aims to promote the security, stability, and reliability of networks; (2) does not have the end goal of violating user privacy; (3) has benefits that outweigh the privacy risks; (4) takes steps to mitigate privacy risks; and (5) would be enhanced by access to the ISP data. In delineating this type of research, our goal was to explicitly “carve out” researchers at universities and research labs without opening a loophole for third-party advertisers.
Of course, this exemption notwithstanding, researchers should also be mindful of user privacy when conducting research. Just because a researcher is “allowed” to receive a particular data trace from an ISP does not mean that such data should be shared. For example, much network and security research is possible with de-identified network traffic data (e.g., data with anonymized IP addresses), or without packet “payloads” (i.e., the kind of traffic data collected with Deep Packet Inspection). Researchers and ISPs should always take care to apply data minimization techniques that limit the disclosure of private information to only the granularity that is necessary to perform the research. Various minimization practices exist, such as hashing or removing IP addresses, aggregating statistics over longer time windows, and so forth. The network and security research communities should continue developing norms and standard practices for deciding when, how, and to what degree private data from ISPs can be minimized when it is shared.
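As a concrete illustration, here is a minimal sketch of the kinds of minimization steps mentioned above; the function names, field names, and key handling are my own, purely for illustration, and are not drawn from any particular ISP’s or researcher’s pipeline:

```python
# A minimal sketch of two minimization steps: keyed hashing of IP
# addresses and coarsening of timestamps. Names are illustrative only.
import hmac
import hashlib
from datetime import datetime

# A per-study secret. With a keyed hash (HMAC), an analyst who lacks the
# key cannot simply enumerate the IPv4 space to build a lookup table,
# yet records from the same customer still share the same pseudonym.
SECRET_KEY = b"per-study secret, not shared with analysts"

def pseudonymize_ip(ip: str) -> str:
    """Replace an IP address with a keyed, truncated hash."""
    return hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

def coarsen_timestamp(ts: str) -> str:
    """Round a timestamp down to the hour, discarding finer detail."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return t.replace(minute=0, second=0, microsecond=0).isoformat()

def minimize_flow(record: dict) -> dict:
    """Return a minimized copy of a flow record like the one sketched above."""
    return {
        "src": pseudonymize_ip(record["src_ip"]),
        "dst": pseudonymize_ip(record["dst_ip"]),
        "dst_port": record["dst_port"],
        "protocol": record["protocol"],
        "start": coarsen_timestamp(record["start_time"]),
        "bytes": record["bytes"],
        "packets": record["packets"],
    }

print(minimize_flow({"src_ip": "203.0.113.7", "dst_ip": "198.51.100.20",
                     "dst_port": 443, "protocol": "TCP",
                     "start_time": "2016-10-31T14:03:02Z",
                     "bytes": 48210, "packets": 63}))
```

Which fields to keep, and at what granularity, is exactly the kind of judgment that should be made per study rather than by default.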
The FCC, ISPs, customers, and researchers should all care about the security, operation, and performance of the Internet. Achieving these goals often involves sharing customer data with third parties, such as the network and security research community. As a member of the research community, I am looking forward to reading the text of the rule, which, if our comments are incorporated, will help preserve both customer privacy and the research that keeps the Internet secure and performing well.
I read the post you linked on QNAME minimization before making my previous post. I disregarded it because disambiguating requires having access to the querier’s unredacted IP address, which raises privacy issues. I couldn’t see any scenario where disambiguation was possible that didn’t raise privacy concerns. The degree of privacy concern of course depends greatly on whether the DNS server is recursive and whether it’s shared by other users (and if so, how many). As for DNS being unencrypted, there are efforts to resolve that. https://tools.ietf.org/html/rfc7858
Hashing IPs aggregated per octet isn’t exactly privacy when a lookup table can trivially be made for the 256 possible hashes that make up an IPv4 octet. Now – I understand and accept the need for an obfuscated or ambiguated IP address to need to be consistent within a given capture of data being analyzed. I’m more concerned about the obfuscated or ambiguated IP being associated with future dataflows from me. Assuming I’m coming from the same IP each time, simple hashing of my IP will not afford me any privacy as you’ll link multiple sessions together.
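To put a finer point on it, here’s a quick sketch of how trivially that lookup table can be built (assuming a plain, unsalted hash is applied to each octet separately):

```python
# If each octet is hashed separately with an unsalted hash, reversing it
# is a 256-entry lookup, not a hard problem.
import hashlib

# Build the "rainbow table" for all possible octet values 0-255.
octet_table = {hashlib.sha256(str(n).encode()).hexdigest(): str(n)
               for n in range(256)}

def recover_ip(hashed_octets):
    """Recover the original IPv4 address from its per-octet hashes."""
    return ".".join(octet_table[h] for h in hashed_octets)

# Example: the "anonymized" form of 203.0.113.7
hashed = [hashlib.sha256(o.encode()).hexdigest()
          for o in "203.0.113.7".split(".")]
print(recover_ip(hashed))  # -> 203.0.113.7
```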
Without being informed of the type of analysis taking place, I have no meaningful way of giving informed consent.
The use of IRBs does mitigate my privacy concerns somewhat, but ultimately the interests of a university or a researcher may not necessarily align with my interests as an end-user. I have absolutely no guarantee that a given university’s IRB won’t loosen their ethical standards in an effort to compete for scarce grant $. Cynical as that makes me sound – I’ve had my trust broken enough over the past 20 years that I’m not inclined to give the benefit of the doubt anymore when it comes to my private communications and information.
You touch on some important tradeoffs between privacy and utility for research/analysis, for sure.
It might be edifying to read up a bit on both how IRBs function, as well as why informed consent is not always an appropriate litmus test. These things are rarely cut and dried, but your fears are not totally warranted. I highly recommend the ethics chapter in https://bitbybitbook.com/ for a thorough discussion of the nuances.
I read all of Chapter 6 and I am not convinced. The arguments in support of permissionless surveillance and research seem hypothetical.
Additionally, after mulling it over, I’m actually offended at the notion that hashing an IP provides any modicum of privacy. Hashing at a per-octet level is the equivalent of paying lip service. It’s snake oil privacy meant to appease people who don’t know better.
The justifications for research without consent of those being researched are extraordinary in nature yet are used to justify research on nonconsenting subjects as a matter of course. Yes, there’s talk of weighing perceived benefits vs perceived harms, but the discretion of a researcher and the IRB are not substitutes for consent. Even within the chapter, examples were given of how “anonymized” datasets were anything but.
I fear the zeal for conducting research on persons is overriding any sense of respect for the dignity of the persons in question. This is reckless.
The examples are not at all hypothetical. I would encourage you to read up. For example, the following are just a few examples of research that have resulted in a more secure Internet that would not have been possible without researcher access to DNS or IPFIX data:
https://www.usenix.org/legacy/event/sec08/tech/full_papers/gu/gu.pdf
https://www.usenix.org/legacy/event/sec09/tech/full_papers/hao.pdf
http://astrolavos.gatech.edu/articles/Antonakakis.pdf
… there are *many* others. You’ll notice that in all of these cases, researchers don’t need access to PII; hashed IP addresses have worked just fine (and a consistent identifier of some kind is still needed to disambiguate flows).
I think your threat model for deanonymization isn’t quite right. Start by assuming that researchers generally follow ethical standards and are held accountable by oversight bodies—and read the papers—and you’ll likely conclude that the benefits far outweigh the minimal risks, in these cases.
As stated many times, there was no PII retained in the traffic above. What you refer to as “mundane botnet research” led to commercial products for botnet detection deployed across hundreds of millions of homes.
At this point, I’m repeating myself. We can take this offline.
We have a fundamental disconnect on basic definitions.
But let’s say IP addresses aren’t considered PII – why bother hashing the individual octets at all?
I consider looking at DNS queries in this case to be DPI. Hell, my E-mail domain has my name in it, so even if you anonymize the sending IPs for the session, you’re still getting my DNS history attached to my name.
I’m also confused regarding disambiguating user flows. If that’s even possible, how is what you’re doing “anonymizing”? Unless I’m misunderstanding what you mean by disambiguate, what you’re describing offers no functional privacy.
Your use of the phrase “term of art” when it comes to “minimization” doesn’t inspire any confidence. It implies constant exceptions are made due to creative interpretations of the language used in the RFC.
Ultimately though, it sounds like researchers are getting raw “unminimized” data and it’s up to each individual researcher’s discretion whether any data redaction takes place. I am not thrilled with that prospect. Not to mention it takes an enormous ego to apply consequentialism in a manner where the researcher is determining whether privacy or exploitation is appropriate for an arbitrary number of users whose private data the researcher has obtained.
DNS queries can indeed be quite revealing. My previous post on this (linked in the post above) talks at length about that. I think that’s an important part of the discussion because “end-to-end encryption” (i.e., HTTPS) is not the end of the story: DNS traffic remains unencrypted for the most part.
To give an example of disambiguating flows: a consistent pseudonym for each source IP (such as a keyed hash) lets a researcher tell that two flows came from the same customer without revealing the customer’s actual address. Hashed source IPs (or aggregation by octet) can therefore protect PII to some extent. This is not perfect, as you know, and in some cases aggregation by octet can render the data useless for certain types of research or analysis.
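Here is a small sketch of that tradeoff (the addresses are illustrative, and a plain hash is used only to keep the sketch short; a keyed or salted hash would be preferable in practice):

```python
# A consistent hash keeps a per-customer pseudonym (so flows can be
# grouped), while /24 aggregation collapses distinct customers together.
import hashlib

flows = [
    {"src_ip": "203.0.113.7",  "dst_port": 443},
    {"src_ip": "203.0.113.7",  "dst_port": 53},
    {"src_ip": "203.0.113.42", "dst_port": 443},
]

def pseudonym(ip: str) -> str:
    return hashlib.sha256(ip.encode()).hexdigest()[:12]

def truncate_to_slash24(ip: str) -> str:
    return ".".join(ip.split(".")[:3]) + ".0/24"

# With pseudonyms, the first two flows stay linked to one (unnamed)
# customer and the third remains distinct -- two groups:
print(len({pseudonym(f["src_ip"]) for f in flows}))            # 2

# With /24 aggregation, all three flows fall into one bucket, so
# per-customer behavior can no longer be distinguished -- one group:
print(len({truncate_to_slash24(f["src_ip"]) for f in flows}))  # 1
```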
For these reasons and others, university research—at least in the United States—is subject to the Common Rule, which requires that research involving this kind of data be subject to institutional review boards, which brings me to the second part of your comment.
You’ll be happy to know that your perception about researcher discretion isn’t the least bit accurate. Due to rules such as the Common Rule, researchers (at least in the US) are by and large *not* getting access to “unminimized” data—most network and security research does not need access to network traffic where PII remains in the data, and it can (in most cases) be safely removed. IRBs typically make judgments about this.
Where IRBs are not applicable (this is a bit of an aside, since network traffic is typically subject to IRB review), researchers are generally expected to adhere to the principles of the Belmont Report—these derive from both consequentialism and deontology. Those theories are typically not applied directly, but are used to frame the context for guidelines and rules, which generally evolve much more slowly than technology. (As an aside, Salganik has a good discussion in his book about how beneficence—which derives from consequentialism—can be applied; in short, there is no presumption that the researcher is the same person who is adjudicating the risk-benefit tradeoff. See my comments above on IRBs.)
This discussion makes the most sense in the context of university research in the United States. Outside of the United States, IRBs do not exist. In companies, IRBs do not exist, either, although many large companies that deal with datasets such as those described above have begun forming internal review committees akin to IRBs in universities.
How does your proposal impact the “citizen researcher”? The person who wants to figure things out for himself, but is not a full-time researcher.
If such people are not catered for, you’re creating a favoured group. Such things often lead to corrupt practices.
The definition in the comment I submitted defines the researcher and the qualification for exemption in terms of the nature of the research, not based on membership in some group or class. So, your concern is addressed. FWIW, the rule as adopted incorporates language based on this same principle from my comment.
The ends don’t always justify the means.
As a consumer, I don’t particularly care that you have a research goal in mind. I don’t want you to ever have access to any DPI capture of my info without my explicit, informed consent.
As for metadata, better standardization is needed to mask the underlying users. Hashing is *not* acceptable given the ease with which a hash can be cross-referenced to its raw info in a rainbow table.
I would also recommend against using terminology shared by the Intelligence Community such as “minimization” given the degree of mistrust surrounding their wholesale collection and (mostly) discretionary “minimization” procedures.
I agree with you, which is precisely why I mention the need to encourage minimization.
First of all, the post is not talking about DPI. For much networking and security research, payloads are not necessary in the first place. I could probably update the post to clarify this point.
I also mention the need to remove IP addresses if possible; hashing is not perfect, but it is sometimes necessary if the research needs to disambiguate individual users or flows.
Minimization is a commonly used term; it is, for example, the term of art in the DNS privacy working group. See the work on QName minimization, which exemplifies what I am talking about in the post above: https://tools.ietf.org/html/draft-ietf-dnsop-qname-minimisation-09
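For readers who want to see the idea concretely, here is a minimal sketch of QName minimization; the zone cuts are hard-coded for illustration, whereas a real resolver discovers them as it walks the delegation chain:

```python
# QNAME minimisation in a nutshell: instead of sending the full query
# name to every server in the delegation chain, the resolver sends only
# as many labels as that server needs.

def minimised_queries(qname: str, zone_cuts):
    """Yield (server_zone, query_name) pairs for each resolution step."""
    labels = qname.rstrip(".").split(".")
    for zone in zone_cuts:
        zone_labels = [] if zone == "." else zone.rstrip(".").split(".")
        # Send one more label than the zone the server is authoritative for.
        needed = len(zone_labels) + 1
        yield zone, ".".join(labels[-needed:])

# Resolving www.example.com:
for zone, q in minimised_queries("www.example.com", [".", "com", "example.com"]):
    print(f"ask the servers for '{zone}': query '{q}'")
# ask the servers for '.': query 'com'
# ask the servers for 'com': query 'example.com'
# ask the servers for 'example.com': query 'www.example.com'
```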
As far as the ends justifying the means—it depends on the ethical framework one is applying. Consequentialism takes more of an “ends justify the means” viewpoint, whereas deontology is more concerned with duties and the permissibility of the means themselves. Consequentialism leads to guidelines/principles such as beneficence (weighing risks vs. benefits), though you are right that there are other factors one also needs to consider, such as respect for persons (which leads to the reasoning for informed consent). Informed consent, however, is not always necessary or practical. Matt Salganik goes into some detail on this dilemma in his book: http://www.bitbybitbook.com/en/ethics/dilemmas/consent/.