August 19, 2017

Innovation in Network Measurement Can and Should Affect the Future of Internet Privacy

As most readers are likely aware, the Federal Communications Commission (FCC) issued a rule last fall governing how Internet service providers (ISPs) can gather and share data about consumers; that rule was recently rolled back through the Congressional Review Act. The media stoked consumer fear with headlines such as “For Sale: Your Private Browsing History” and comments about how ISPs can now “sell your Web browsing history to advertisers”. We also saw large ISPs such as Comcast promise not to do exactly that. What’s next is anyone’s guess, but technologists need not stand idly by.

Technologists can and should play an important role in this discussion. In particular, conveying knowledge about the capabilities and uses of network monitoring, and developing both new monitoring technologies and privacy-preserving capabilities, can shape this debate in three important ways: (1) level-setting on the data collection capabilities of various parties; (2) understanding and limiting the power of inference; and (3) developing new monitoring technologies that facilitate network operations and security while protecting consumer privacy.

1. Level-setting on data collection uses and capabilities. Before entering a debate about privacy, it helps to have a firm understanding of who can collect what types of data—both in theory and in practice—as well as the myriad ways that data might be used, for good and for bad. For example, in practice, if anyone has your browsing history, your ISP is a less likely culprit than an online service provider such as Google, which operates a browser and (perhaps more importantly) serves analytics scripts on a large fraction of the Internet’s web pages. Your browsing is also likely being logged by countless online trackers, often without your knowledge or consent. In contrast, the network monitoring technology that is available in routers and switches today makes it much more difficult to extract “browsing history”; doing so requires a technology commonly referred to as “deep packet inspection” (DPI), or complete capture of network traffic data, which is expensive to deploy and even more costly once data storage and analysis are taken into account. Most ISPs will tell you that DPI is deployed on only a small fraction of the links in their networks, and that fraction is going down as speeds increase; it’s expensive to collect and analyze all of that data.

ISPs do, of course, collect other types of traffic statistics, such as lookups to domain names via the Domain Name System (DNS) and coarse-grained traffic volume statistics via IPFIX. That data can be revealing. At the same time, ISPs will correctly point out that monitoring DNS and IPFIX data is critical to securing and operating the network. DNS traffic, for example, is central to detecting denial of service attacks or infected devices. IPFIX statistics are critical for monitoring and mitigating network congestion. DNS is a quintessential example of data that is both incredibly sensitive (because it reveals the domains and websites we visit, among other things, and is typically unencrypted) and incredibly useful for detecting attacks, ranging from phishing to denial of service.
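To make that operational value concrete, here is a minimal sketch—in Python, with an invented log format and illustrative thresholds, not any production ISP tooling—of the kind of DNS-volume check that can surface an infected device or participation in a denial of service attack without inspecting a single packet payload:

```python
from collections import Counter, defaultdict

# Hypothetical record format: (unix_timestamp, client_ip, queried_domain).
# A real deployment would consume a live passive-DNS feed; the window and
# threshold below are illustrative, not recommendations.
WINDOW_SECONDS = 60
QUERIES_PER_WINDOW_THRESHOLD = 500

def flag_heavy_queriers(records):
    """Bucket DNS queries into fixed time windows and flag clients whose
    per-window query volume is anomalously high -- a crude signal of an
    infected device or a DNS-based denial of service attack."""
    windows = defaultdict(Counter)
    for ts, client, _domain in records:
        windows[int(ts) // WINDOW_SECONDS][client] += 1
    return [(w * WINDOW_SECONDS, client, n)
            for w, counts in sorted(windows.items())
            for client, n in counts.items()
            if n > QUERIES_PER_WINDOW_THRESHOLD]

if __name__ == "__main__":
    noisy = [(1500000000 + i % 60, "10.0.0.7", f"x{i}.example.com")
             for i in range(3000)]              # 3,000 queries in one minute
    quiet = [(1500000000 + i, "10.0.0.8", "news.example.org")
             for i in range(0, 600, 60)]        # ten queries in ten minutes
    for ts, client, n in flag_heavy_queriers(noisy + quiet):
        print(f"{ts}: {client} sent {n} DNS queries in one window")
```

Note what this sketch does not need: no payloads, no URLs, only query counts per client—yet that alone is operationally useful, which is exactly the tension this post is about.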

The long line of security and traffic engineering research illustrates both the importance of data collection and the limitations of current network monitoring capabilities for these tasks. Take, for example, research on botnet detection, which has shown the power of using DNS lookup data and IPFIX statistics for detecting compromise and intrusion. Or consider the development of traffic engineering capabilities in the data center and in the wide area, which depend on the collection and analysis of IPFIX records and, in some cases, packet traces.

2. Understanding (and mitigating) the power of inference. While most of the focus in the privacy debate thus far concerns data collection (specifically, a focus on DPI, which is somewhat misguided per the discussion above), we would be wise to also consider what can be inferred from any data that is collected. For example, various aspects of “browsing history” could be evident from datasets ranging from DNS to DPI, but as discussed above, all of these datasets also have legitimate operational uses. Furthermore, “browsing history” is evident from a wide range of datasets that many parties beyond ISPs are privy to without our consent. Such inference capabilities are only going to increase with the proliferation of data-producing Internet-connected devices, coupled with advances in machine learning. If prescriptive rules specify which types of data may be collected, we risk over-prescribing while still failing to protect the higher-level information that we really want to protect.

While asking questions about collection is a fine place to start a discussion, we should be at least as concerned with how the data is used, what it can be used to infer, and who it is shared with. We should be asking: (1) What data do we think should be protected or private? (2) What types of network data permit inference of that private data? (3) Who has access to that data, and under what circumstances? Suppose that I am interested in protecting information about whether I am at home. My ISP could learn this information from my traffic patterns, simply based on the decline in traffic volume from individual devices—even if all of my web traffic were encrypted, and even if I used a virtual private network (VPN) for all of my traffic. Such inference will only become more accurate as more devices in our homes connect to the Internet. But online service providers could also come to know the same information without my consent, based on different data; Google, for example, would know that I’m browsing the web at my office, rather than at home, through the use of technologies such as cookies, browser fingerprinting, and other online device tracking mechanisms.
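To illustrate just how little data such an inference requires, here is a toy sketch (Python; the byte counts and threshold are invented) showing that hourly traffic volumes alone—available from IPFIX-style records even when every payload is encrypted—can suggest when a home is empty:

```python
# Hypothetical input: aggregate hourly byte counts for one home, the kind
# of signal an ISP could derive from IPFIX records even when all payloads
# are encrypted or tunneled through a VPN. All numbers are invented.
HOURLY_BYTES = [
    ("08:00", 1.2e9),  # people home, streaming video
    ("09:00", 4.0e7),  # only background chatter from idle devices
    ("12:00", 3.5e7),
    ("15:00", 3.8e7),
    ("18:00", 9.5e8),  # evening return
]

IDLE_THRESHOLD = 1e8  # below ~100 MB/hour looks like an empty house

def occupancy_guess(hourly_bytes):
    """Label each hour 'away' when aggregate volume drops to idle levels --
    a crude but illustrative occupancy inference that needs no payloads."""
    return [(hour, "away" if total < IDLE_THRESHOLD else "home")
            for hour, total in hourly_bytes]

for hour, label in occupancy_guess(HOURLY_BYTES):
    print(hour, label)
```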

Past and ongoing research, such as the Web Transparency and Accountability Project, as well as the “What They Know” series from the Wall Street Journal, sheds important light on what can be inferred from various digital data sources. The Upturn report last year was similarly illuminating with respect to ISP data. More recently, researchers at Princeton, including Noah Apthorpe and Dillon Reisman, have been developing techniques that use traffic shaping and camouflaging to limit what an ISP can infer from the traffic patterns coming from a home network.
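The core idea behind such defenses is simple, even if the real systems are not: reshape the traffic signal so that an observer sees only a coarse pattern. The toy sketch below (Python; the quantum size is invented, and this is not the actual Princeton mechanism) pads each interval’s volume up to a fixed quantum with cover traffic, trading bandwidth overhead for reduced inference power:

```python
import math

# Illustrative sketch of shaping traffic to defeat volume-based inference:
# pad each interval's byte count up to the next multiple of a fixed
# quantum, so an observer sees a much coarser signal. A toy version of
# the general idea only; real systems are more sophisticated.
QUANTUM = 5_000_000  # 5 MB per interval; illustrative

def padded_series(real_bytes):
    """Return (shaped_bytes, overhead): each interval is padded with
    cover traffic up to the next multiple of QUANTUM, and idle intervals
    still transmit one quantum so silence itself leaks nothing."""
    shaped = [math.ceil(b / QUANTUM) * QUANTUM if b > 0 else QUANTUM
              for b in real_bytes]
    return shaped, sum(shaped) - sum(real_bytes)

real = [120_000, 0, 9_800_000, 450_000, 0, 31_000_000]
shaped, overhead = padded_series(real)
print(shaped)  # observer sees only coarse multiples of 5 MB
print(f"overhead: {overhead / 1e6:.1f} MB of cover traffic")
```

The design trade-off is explicit: a larger quantum hides more but wastes more bandwidth, which is why this remains an active research area rather than a solved problem.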

3. Facilitating purpose-driven network measurement and data minimization. Part of the tension surrounding network measurement and privacy is that current network monitoring technology is very crude; in fact, this technology hasn’t changed substantially in nearly 30 years. It at once gathers too much data and yet, for many purposes, too little. Consider, for example, that with current network monitoring technology, an ISP (or content provider) has great difficulty determining a user’s quality of experience for a given application, such as video streaming, simply because the wrong kind of data is collected, at the wrong granularity. As a result, ISPs (and many other parties in the Internet ecosystem) adopt a post hoc “collect first, ask questions later” approach, simply because current network monitoring technology (1) is oriented towards offline processing of warehoused data, and (2) does not make it easy to figure out what data is needed to answer a particular analysis question.

Instead, network data collection could be driven by the questions operators are asking; data could be collected if—and only if—it were pertinent to a specific question or network operations task, such as monitoring application performance or detecting attacks. For example, suppose that an operator could ask a query such as “tell me the average packet loss rate of all Netflix video streams for subscribers in Seattle”. Answering such a query with today’s tools is challenging: one would have to collect all packet traces and all DNS queries and somehow identify, post hoc, which streams correspond to the application of interest. In short, it is difficult, if not impossible, to answer such an operational query today without large-scale collection and storage of (very sensitive) data—all to find what is essentially a needle in a haystack.

Over the past year, my Ph.D. student Arpit Gupta at Princeton has been leading the design and development of a system called Sonata that may ultimately resolve this tension and give us the best of both worlds. Two emerging technologies—(1) in-band network measurement, as supported by Barefoot’s Tofino chipset, and (2) scalable streaming analytics platforms such as Spark—make it possible to write a high-level query in advance and collect only the data that is needed to satisfy the query. Such technology allows a network operator to write a query in a high-level language (in this case, Scala), specifying only the question, while allowing the runtime to figure out the minimal set of raw data that is needed to satisfy the operator’s query.
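To give a flavor of the interface—this is a hypothetical, Python-flavored sketch of a Sonata-style query, not Sonata’s actual API—the operator composes declarative dataflow operators, and the runtime, not the operator, decides which raw data to collect and where:

```python
# Hypothetical sketch of a declarative, Sonata-style measurement query.
# This is NOT the real Sonata API; the point is the shape of the
# interface: the operator states *what* to measure, and the runtime
# decides what minimal raw data (which packet fields, at which devices)
# must be collected to answer it.

class Query:
    def __init__(self, source):
        self.source, self.ops = source, []

    def filter(self, predicate):
        self.ops.append(("filter", predicate))
        return self

    def map(self, fields):
        self.ops.append(("map", fields))
        return self

    def reduce(self, expression):
        self.ops.append(("reduce", expression))
        return self

# "Average packet loss rate of Netflix video streams for Seattle subscribers"
loss_query = (
    Query("packet_stream")
    .filter("service == 'netflix' and subscriber_region == 'seattle'")
    .map(("flow_id", "retransmitted_pkts", "total_pkts"))
    .reduce("avg(retransmitted_pkts / total_pkts) over flow_id")
)

# A real runtime would compile this plan into switch rules and streaming
# operators; here we simply print the plan the operator wrote.
for op, arg in loss_query.ops:
    print(op, "->", arg)
```

The privacy payoff is in what the compiled plan never collects: packets from subscribers outside Seattle, or flows for other services, need never leave the switch.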

Our goal in the design and implementation of Sonata was to address the operational and scaling limitations of network measurement, but achieving such scalability also has data minimization effects that benefit privacy. Data that is collected can also be a liability; it may, for example, become the target of law enforcement requests or subpoenas, to which parties such as ISPs—but also online providers such as Google—are regularly subject. Minimizing the collected data to only that which is pertinent to operational queries can ultimately help reduce this risk.

Sonata is open source, and we welcome contributions and suggestions from the community about how we can better support specific types of network queries and tasks.

Summary. Network monitoring and analytics technology is moving at a rapid pace, in terms of its capability to help network operators answer important questions about performance and security without coming at the cost of consumer privacy. Technologists should devote attention to developing new technologies that can achieve the best of both worlds, and to helping educate policymakers about the capabilities (and limitations) of existing network monitoring technology. Policymakers should be aware that network monitoring technology continues to advance, and should focus discussion on protecting what can be inferred, rather than only on who can collect a packet trace.

Dissecting the (Likely) Forthcoming Repeal of the FCC’s Privacy Rulemaking

Last week, the House and Senate both passed a joint resolution that prevents the new privacy rules from the Federal Communications Commission (FCC) from taking effect; the rules were released by the FCC last November, and would have bound Internet service providers (ISPs) in the United States to a set of practices concerning the collection and sharing of data about consumers. The rules were widely heralded by consumer advocates, and several researchers in the computer science community, including myself, played a role in helping to shape aspects of the rules. I provided input that helped preserve the use of ISP traffic data for research and protocol development.

How much should we be concerned? Consumers have cause for concern, but almost certainly not as much as the media would have you believe. The joint resolution is expected to be signed by the President, whereupon it will become law. Many articles in the news last week announced the joint resolution as a watershed moment, saying, effectively, that Internet service providers can “now” sell your data to the highest bidder. Yet the first thing to realize is that ISPs were never prevented from doing this; in some sense, the Congressional repeal simply preserves the status quo with respect to ISPs and data sharing. That is, the privacy rule that was released last November never went into effect. That said, there is one thing that consumers might be more concerned about: the resolution also prevents the FCC from making similar rules in the future, which has the effect of removing the threat of regulatory action on privacy. Previously, even though it was legal for ISPs to share your data without your consent, they might not have done so simply for fear of regulatory action from the FCC. If this resolution becomes law, there is no longer such a threat, and we will have to rely on market forces for ISPs to be good stewards of our data.

With these high-order bits in mind, the rest of this post will dissect the events over the past year or so in more detail.

Who regulates privacy? Part of the complication surrounding the debates on privacy is that there are currently two agencies in our government that are primarily responsible for protecting consumer privacy. The Federal Trade Commission (FTC) operates under the FTC Act and regulates consumer protection for businesses that are not “common carriers”; this includes most businesses, with the exception of public utilities and—recently, with the passage of the Open Internet Order (the so-called “net neutrality” rule) in 2015—ISPs. One of the landmark decisions in the Open Internet Order was to classify ISPs under “Title II” (telecommunications providers), whereas previously they were classified under Title I. This action effectively moved the jurisdiction for regulating ISP privacy from the FTC (which regulates Google, Facebook, and other Internet companies) to the FCC.

Essentially, there is a firewall of sorts between the two agencies when it comes to privacy rulemaking: the FTC is prohibited by federal law from regulating common carriers, and the FCC has a statutory mandate (under Section 222 of the Telecommunications Act) to protect customer data that is collected by common carriers.

Are the FCC’s privacy rules “fair”? Part of the debate from the ISPs surrounds whether this separation is fair: ISPs like Comcast and online service providers (so-called “edge providers” in Washington) like Google are increasingly competing in the same markets, and regulating them under different rules can in some sense create an uneven playing field. Depending on your viewpoint, there is some merit to this argument: the FCC’s privacy rules are stronger than the FTC’s, as they govern additional information that cannot be shared without user consent, such as browsing history, application usage history, and geolocation. Companies that are regulated by the FTC (Google, Facebook, etc.) face no such restrictions on sharing your data without your consent. Whether this situation is “fair” depends on your perspective about whether edge providers like Google and ISPs like Comcast should be subject to the same rules.

  • The ISP viewpoint (and the Republican rationale behind the joint resolution) is that for the Googles and Facebooks of the world, your data is not considered sensitive; they can already gather this information about your browsing history and sell it to third-party marketers. The ISPs and Republicans hold that if ISPs and edge providers are really in the same market (or should be allowed to be), then they shouldn’t be subject to different rules. That sounds reasonable, except there are a couple of hangups. The first is that, as mentioned, the FTC cannot regulate ISPs; it is prohibited from doing so by federal law. Unless ISPs are reclassified again under Title I, we may end up in a situation where nobody can legally regulate their privacy practices, since the FTC is already prevented from doing so, and it increasingly looks like the FCC will be prevented from doing so as well. The charitable reading is that the goal is not to get rid of privacy rules entirely, but rather to shift everything concerning consumer privacy back to the FTC, where ISPs and edge providers would be subject to the same rules. In the meantime, however, the situation may be suspended in a strange limbo.
  • The consumer advocate viewpoint is that, in the current market for ISPs in the United States, many consumers do not have a choice of ISP. Therefore, the ISPs are in a position of power that the edge providers do not have. In many respects, that is true: studies from the FCC and elsewhere have shown that in many parts of the United States, consumers have only one choice of broadband ISP. This places the ISP in a position of great power, because we can’t rely on “market forces” to encourage good behavior towards consumers if consumers can’t vote with their feet. Effectively, in contrast to edge providers such as Google or Facebook, in certain markets in the US one cannot simply “opt out” of one’s ISP. There are also arguments that ISPs can see a lot more data than edge providers can; that point is certainly arguable, given the level of instrumentation that a company like Google has—from the trackers it places on just about every website on the Internet to its command over our browsers, mobile operating systems, and more. More likely, we should be equally concerned about both edge providers and ISPs.

The repeal, and the status quo. In essence, the repeal that is likely to come in the coming weeks should cause concern, but it is not quite as simple as “ISPs can now sell your data to the highest bidder”. Keep in mind that ISPs have always legally been able to do so, and they haven’t done so yet. In fact, on Friday, Comcast committed to not selling your data to third-party marketers, which provides some hope that the market will, in fact, induce behavior that is good for consumers. In some sense, the repeal does nothing except preserve the status quo. Ultimately, time will tell. I do expect that ISPs may increasingly come to look like advertisers—after all, they have been trying to get into the advertising business for years. Without the threat of regulatory enforcement that has existed until now, ISPs may be more likely to enter these markets (or at least try to do so). In the coming years, there may not be much we can do about this except hope that the market enforces good behavior. It should be noted that, despite the widespread attention to virtual private networks (VPNs) as a possible defense against ISP data collection over the past week, these offer scant protection against the kinds of data that would or could be collected about you, as I and others have previously explained.

Privacy is a red herring. The real problem is lack of competition. The prospect of relying on the market brings me to a final point. One of the oft-forgotten provisions of the Open Internet Order’s reclassification of ISPs under Title II is that the FCC can compel ISPs to “unbundle the local loop”—a technical term for letting competing ISPs share the underlying physical infrastructure. We used to have this situation in the United States (older readers probably remember the days of “mom and pop” DSL providers who leased infrastructure from the telcos), and many countries in Europe still have competitive markets by virtue of this structure. One possible path forward that could give more leverage to market forces would be to unbundle the local loop under Title II. This outcome is widely viewed as highly unlikely.

Part of the reason this is unlikely is that Title II reclassification may itself be walked back, leaving ISPs in the Title I regime once again. Oddly, though we are likely to hear much uproar over the “repeal” of the net neutrality rules, one silver lining is that if and when such a rollback occurs, ISPs will again be bound by some privacy rules (the FTC’s). If the current resolution becomes law, they’ll be bound by none at all.

Finally, it is worth remembering that there are other uses of customer data besides selling it to advertisers. My biggest role in helping shape the FCC’s original privacy rules was to help preserve the use of this data for Internet engineers and researchers who continue to develop new algorithms and protocols to help the Internet perform better, and to keep us safe from attacks ranging from denial of service to phishing. While none of us may be excited at the prospect of having our data shared with advertisers without our consent, we all benefit from other operational uses of this data, and those uses should certainly be preserved.

Mitigating the Increasing Risks of an Insecure Internet of Things

The emergence and proliferation of Internet of Things (IoT) devices on industrial, enterprise, and home networks brings with it unprecedented risk. The potential magnitude of this risk was made concrete in October 2016, when insecure Internet-connected cameras launched a distributed denial of service (DDoS) attack on Dyn, a provider of DNS service for many large online service providers (e.g., Twitter, Reddit). Although this incident caused large-scale disruption, it is noteworthy that the attack involved only a few hundred thousand endpoints and a traffic rate of about 1.2 terabits per second. With predictions of upwards of a billion IoT devices within the next five to ten years, the risk of similar, yet much larger, attacks is imminent.

The Growing Risks of Insecure IoT Devices

One of the biggest contributors to the risk of future attack is the fact that many IoT devices have long-standing, widely known software vulnerabilities that leave them open to exploitation and control by remote attackers. Worse yet, the vendors of these devices often come from the hardware industry and may lack expertise or resources in software development and systems security. As a result, IoT device manufacturers may ship devices that are extremely difficult, if not practically impossible, to secure. The large number of insecure IoT devices connected to the Internet poses unprecedented risks to consumer privacy, as well as threats to the underlying physical infrastructure and the global Internet at large:

  • Data privacy risks. Internet-connected devices increasingly collect data about the physical world, including information about the functioning of infrastructure such as the power grid and transportation systems, as well as personal or private data on individual consumers. At present, many IoT devices either do not encrypt their communications or use a form of encrypted transport that is vulnerable to attack. Many of these devices also store the data they collect in cloud-hosted services, which may be the target of data breaches or other attacks.
  • Risks to availability of critical infrastructure and the Internet at large. As the Mirai botnet attack of October 2016 demonstrated, Internet services often share dependencies on underlying infrastructure: knocking many websites offline did not require direct attacks on those services, but rather a targeted attack on infrastructure on which many of them depend (i.e., the Domain Name System). More broadly, one might expect future attacks that target not just the Internet infrastructure but also physical infrastructure that is increasingly Internet-connected (e.g., power and water systems). The dependencies that are inherent in the current Internet architecture create immediate threats to resilience.

The large magnitude and broad scope of these risks compel us to seek solutions that will improve infrastructure resilience in the face of Internet-connected devices that are extremely difficult to secure. A central question in this problem area concerns the responsibility that each stakeholder in this ecosystem should bear, and the respective roles of technology and regulation (whether via industry self-regulation or otherwise) in securing both the Internet and associated physical infrastructure against these increased risks.

Risk Mitigation and Management

IoT device manufacturers are one possible lever for either government regulation or industry self-regulation. One possibility, for example, would be a device certification program that attests to a manufacturer’s adherence to best common practices for device and software security. A well-known (and oft-used) analogy is the UL certification process for electrical devices and appliances.

Despite its conceptual appeal, however, a certification approach poses several practical challenges. One challenge is outlining and prescribing best common practices in the first place, particularly given the rate at which technology (and attacks) progress. Any specific set of prescriptions runs the risk of falling out of date as technology advances; similarly, certification can readily devolve into a checklist of attributes that vendors satisfy, without necessarily committing vendors to a process for keeping devices secure over time. As daunting as the challenges of specifying a certification program may seem, enforcing adherence to one may prove even more challenging. Specifically, consumers may not appreciate the value of certification, particularly if meeting its requirements increases the cost of a device. This concern may be particularly acute for consumer IoT, where consumers may not bear the direct costs of connecting insecure devices to their home networks.

The consumer is another stakeholder who could be incentivized to improve the security of the devices that they connect to their networks (in addition to more effectively securing the networks to which they connect those devices). As the party who purchases and ultimately connects IoT devices to the network, the consumer appears well situated to ensure the security of the IoT devices on their network. Unfortunately, the picture is more nuanced. First, consumers typically lack the aptitude or the interest (or both!) to secure either their own networks or the devices that they connect to them. Home broadband users have generally proved poor at applying software updates in a timely fashion, for example, and have been equally delinquent in securing their home networks. Even skilled network administrators regularly face network misconfigurations, attacks, and data breaches. Second, in many cases, users may lack the incentive to ensure that their devices are secure. In the case of the Mirai botnet, for example, consumers did not directly face the brunt of the attack; rather, the ultimate victims were DNS service providers and, indirectly, online service providers such as Twitter. To first order, consumers suffered little direct consequence from the insecure devices on their networks.

Consumers’ misaligned incentives suggest several possible courses of action. One approach might involve placing some responsibility or liability on consumers for the devices that they connect to the network, in the same way that a citizen might be fined for other transgressions that have externalities (e.g., fines for noise or environmental pollution). Alternatively, Internet service providers (or another entity) might offer users a credit for purchasing and connecting only devices that pass certification; another variation of this approach might require users to purchase “Internet insurance” from their Internet service providers to help offset the cost of future attacks. Consumers might receive credits or lower premiums based on the risk associated with their behavior (e.g., their software update practices, or the results of security audits of devices that they connect to the network).

A third stakeholder to consider is the Internet service provider (ISP), which provides Internet connectivity to the consumer. The ISP has considerable incentive to ensure that the devices its customers connect to the network are secure: insecure devices increase attack traffic and may ultimately degrade Internet service or performance for the rest of the ISP’s customers. From a technical perspective, the ISP is also in a uniquely effective position to detect and squelch attack traffic coming from IoT devices, as sketched below. Yet relying on the ISP alone to protect the network against insecure IoT devices is fraught with non-technical complications. While the ISP could technically defend against an attack by disconnecting or firewalling consumer devices that are launching attacks, such an approach would certainly result in increased complaints and technical support calls from customers, who connect devices to the network and simply expect them to work. Moreover, many of the technical capabilities that an ISP might have at its disposal (e.g., the ability to identify attack traffic coming from a specific device) introduce serious privacy concerns. For example, being able to alert a customer to (say) a compromised baby monitor requires the ISP to know (and document) that the consumer has such a device in the first place.
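As a concrete (and deliberately simplified) illustration of the detection side, the sketch below flags subscriber devices whose flow records show sustained high-rate traffic toward a single destination—a crude signature of Mirai-style DDoS participation. The flow format and thresholds are invented for illustration, not drawn from any ISP’s actual tooling:

```python
from collections import defaultdict

# Hedged sketch of a per-subscriber check an ISP *could* run on flow
# records (format invented here): flag devices that send high-rate
# traffic to a single destination. Thresholds are illustrative.
FLOWS = [
    # (subscriber_device, dst_ip, packets_per_second)
    ("cam-23:4a", "198.51.100.10", 40_000),   # camera flooding one target
    ("laptop-9f:1c", "203.0.113.5", 120),     # ordinary web traffic
    ("cam-23:4a", "198.51.100.10", 38_000),
]

PPS_THRESHOLD = 10_000

def suspected_attackers(flows):
    """Return the (device, destination) pairs whose peak observed rate
    exceeds the threshold -- candidates for quarantine or a customer
    notification, each of which carries the trade-offs discussed above."""
    peak = defaultdict(int)
    for device, dst, pps in flows:
        peak[(device, dst)] = max(peak[(device, dst)], pps)
    return {k: v for k, v in peak.items() if v > PPS_THRESHOLD}

for (device, dst), pps in suspected_attackers(FLOWS).items():
    print(f"quarantine candidate: {device} -> {dst} at {pps} pps")
```

Note that even this toy version requires per-device visibility, which is precisely the privacy concern raised above: detection and documentation go hand in hand.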

Ultimately, managing the increased risks associated with insecure IoT devices may require action from all three stakeholders. Some of the salient questions will concern how the risks can be best balanced against the higher operational costs that will be associated with improving security, as well as who will ultimately bear these responsibilities and costs.

Improving Infrastructure Resilience

In addition to improving defenses against the insecure devices themselves, it is also critical to determine how to build more resilience into the underlying Internet infrastructure to cope with these attacks. If one views the occasional IoT-based attack as inevitable to some degree, one major concern is ensuring that the Internet infrastructure (and the associated cyberphysical infrastructure) remains both secure and available in the face of attack. In the case of the Mirai attack on Dyn, for example, the severity of the attack was exacerbated by the fact that many online services depended on the infrastructure that was attacked. Computer scientists and Internet engineers should be thinking about technologies that can both decouple these underlying dependencies and ensure that the infrastructure itself remains secure in the event that regulatory or legal levers fail to prevent every attack. One possibility that we are exploring, for example, is the role that an automated home network firewall could play in (1) helping users keep better inventory of connected IoT devices and (2) providing users both visibility into and control over the traffic flows that these devices send.
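A minimal sketch of that inventory-and-control idea follows (Python; the device names, MAC addresses, and policies are invented placeholders rather than a description of our actual system): the gateway keeps a registry of known devices and a per-device allowlist of destinations, and alerts the user when an unknown device appears.

```python
# Hedged sketch of a home-gateway policy check: a registry of devices
# (keyed by MAC address, as learned from DHCP) maps each device to the
# destinations it is permitted to contact. All entries are invented.
DEVICE_REGISTRY = {
    "b8:27:eb:01:02:03": {"name": "thermostat",
                          "allowed_dst": {"api.vendor.example"}},
    "00:17:88:0a:0b:0c": {"name": "smart bulb",
                          "allowed_dst": {"hub.bulbvendor.example"}},
}

def decide(mac, dst_host):
    """Return 'allow', 'block', or 'alert-unknown-device' for one flow."""
    device = DEVICE_REGISTRY.get(mac)
    if device is None:
        return "alert-unknown-device"   # new device: prompt the user
    if dst_host in device["allowed_dst"]:
        return "allow"
    return "block"                      # visible to, and overridable by, the user

print(decide("b8:27:eb:01:02:03", "api.vendor.example"))  # allow
print(decide("b8:27:eb:01:02:03", "evil-c2.example"))     # block
print(decide("de:ad:be:ef:00:01", "anything.example"))    # alert
```

The appeal of putting this logic at the home gateway rather than at the ISP is that the inventory—which devices a household owns—never has to leave the home.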

Summary

Improving the resilience of the Internet and cyberphysical infrastructure in the face of insecure IoT devices will require a combination of technical and regulatory mechanisms; engineers and regulators will need to work together to improve the security and privacy of the Internet of Things. Engineers must continue to advance the state of the art in technologies ranging from lightweight encryption to statistical network anomaly detection to help reduce risk; similarly, engineers must design the network to improve resilience in the face of the increased risk of attack. At the same time, realizing these advances in deployment will require the appropriate alignment of incentives, so that the parties who introduce risks also bear more of the costs of the resulting attacks.