
Archives for December 2015

How Will Consumers Use Faster Internet Speeds?

This week saw an exciting announcement about the experimental deployment of DOCSIS 3.1 in limited markets in the United States, including Philadelphia, Atlanta, and parts of northern California, which will bring gigabit-per-second Internet speeds to many homes over the existing cable infrastructure. The potential for gigabit speeds over existing cable networks brings hope that more consumers will ultimately enjoy much higher-speed Internet connectivity, both in the United States and elsewhere.

This development is also a pointed response to not-so-implicit pressure from the Federal Communications Commission to deploy higher-speed Internet connectivity. That pressure includes the redefinition of broadband to a downstream throughput rate of 25 megabits per second, up from a previous (and somewhat laughable) definition of 4 Mbps; many commissioners have also stated their intention to raise the threshold for broadband to a downstream throughput of 100 Mbps, a further indication that ISPs will see increasing pressure to deliver higher-speed links to home networks. Yet the National Cable and Telecommunications Association has claimed in an FCC filing that such speeds are far more than a “typical” broadband user would require.

These developments, and the posturing around them, raise an obvious question: How will consumers change their behavior in response to faster downstream throughput from their Internet service providers?

Ph.D. student Sarthak Grover, postdoc Roya Ensafi, and I set out to study this question with a cohort of about 6,000 Comcast subscribers in Salt Lake City, Utah, from October through December 2014. The study involved a randomized controlled trial, an experimental method commonly used in scientific experiments, in which a group of users is randomly divided into a control group (whose users experience no change in conditions) and a treatment group (whose users are subject to a change in conditions). Assuming the cohort is large enough and represents a cross-section of the demographic of interest, and that users are assigned to the treatment group at random, it is possible to observe differences between the two groups’ outcomes and conclude how the treatment affects the outcome.
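To make the mechanics concrete, here is a minimal sketch in Python of how a randomized controlled trial analysis proceeds; the data, column names, and group sizes below are illustrative placeholders, not the study's actual data or analysis code.

```python
import numpy as np
import pandas as pd

# Hypothetical per-subscriber outcome table; all values are synthetic.
rng = np.random.default_rng(0)
usage = pd.DataFrame({
    "subscriber_id": np.arange(6500),
    "daily_traffic_gb": rng.lognormal(mean=1.0, sigma=0.8, size=6500),
})

# Randomly assign a subset of subscribers to the treatment group (the silent
# service-tier upgrade); everyone else stays in the control group.
treated = rng.choice(usage["subscriber_id"], size=1500, replace=False)
usage["group"] = np.where(usage["subscriber_id"].isin(treated),
                          "treatment", "control")

# Because assignment is random, a difference in outcomes between the two
# groups can be attributed to the treatment rather than to pre-existing
# differences between the subscribers.
summary = usage.groupby("group")["daily_traffic_gb"].agg(["mean", "median"])
print(summary)
```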

In the case of this specific study, the control group consisted of about 5,000 Comcast subscribers who were paying for (and receiving) 105 Mbps downstream throughput; the treatment group, on the other hand, comprised about 1,500 Comcast subscribers who were paying for 105 Mbps but, at the beginning of the study period, were silently upgraded to 250 Mbps. In other words, users in the treatment group were receiving faster Internet service but were unaware of the faster downstream throughput of their connections. We explored how this treatment affected user behavior and made a few surprising discoveries:

“Moderate” users tend to adjust their behavior more than the “heavy” users. We expected that subscribers who downloaded the most data in the 250 Mbps service tier would be the ones causing the largest difference in mean demand between the two groups (previous studies have observed this phenomenon, and we do observe this behavior for the most aggressive users). To our surprise, however, the median subscribers in the two groups exhibited much more significant differences in traffic demand, particularly at peak times. Notably, the 40% of subscribers with the lowest peak demands more than doubled their daily peak traffic demand in response to the service-tier upgrade (i.e., in the treatment group).

With the exception of the most aggressive peak-time subscribers, users below the 40th percentile in peak demand increased their peak demand more than users who initially had higher peak demands.
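As a rough illustration of this kind of comparison (with hypothetical column names, and assuming usage is reported as 15-minute byte counts per subscriber), one can compute each subscriber's daily peak demand and then compare the two groups percentile by percentile:

```python
import numpy as np
import pandas as pd

def daily_peak_demand(samples: pd.DataFrame) -> pd.Series:
    """Average daily peak demand per subscriber.

    `samples` is assumed to have columns: subscriber_id, timestamp (datetime),
    and bytes (traffic volume in a 15-minute interval).
    """
    samples = samples.copy()
    samples["date"] = samples["timestamp"].dt.date
    # Peak 15-minute interval within each day, then averaged across days.
    per_day_peak = samples.groupby(["subscriber_id", "date"])["bytes"].max()
    return per_day_peak.groupby("subscriber_id").mean()

def compare_percentiles(control: pd.Series, treatment: pd.Series) -> pd.DataFrame:
    """Compare per-subscriber peak demand between the groups at each percentile."""
    qs = np.arange(0.05, 1.0, 0.05)
    return pd.DataFrame({
        "percentile": qs * 100,
        "control": control.quantile(qs).to_numpy(),
        "treatment": treatment.quantile(qs).to_numpy(),
        "ratio": (treatment.quantile(qs) / control.quantile(qs)).to_numpy(),
    })
```

A ratio well above one at the lower percentiles, shrinking toward one near the top (aside from the most aggressive users), is the pattern described above.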

This result suggests a surprising trend: it’s not the aggressive data hogs who account for most of the increased use in response to faster speeds, but rather the “typical” Internet user, who tends to use the Internet more as a result of the faster speeds. Our dataset does not contain application information, so it is difficult to say what, exactly, is responsible for the higher data usage of the median user. Yet the result uncovers an oft-forgotten phenomenon of faster links: even existing applications that do not need to “max out” the link capacity (e.g., Web browsing, and even most video streaming) can benefit from a higher-capacity link, simply because they will see better performance overall (e.g., faster load times and more resilience to packet loss, particularly when multiple parallel connections are in use). It may just be that the typical user is using the Internet more with the faster connection simply because the experience is better, not because they’re interested in filling the link to capacity (at least not yet!).

Users may use faster speeds for shorter periods of time, not always during “prime time”. There has been much ado about prime-time video streaming usage, and we most certainly see those effects in our data. To our surprise, the average usage per subscriber during prime-time hours was roughly the same between the treatment and control groups; outside of prime time, however, the difference was much more pronounced, with average usage per subscriber in the treatment group 25% higher than in the control group during non-prime-time weekday hours. We also observe that the peak-to-mean ratios for usage in the treatment group are significantly higher than in the control group, indicating that users with faster speeds may periodically (and for short periods) take advantage of the significantly higher speeds, even though they do not sustain rates that exhaust the higher capacity.
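The quantities behind this comparison are straightforward to compute from 15-minute usage samples. Here is a sketch, again with hypothetical column names, and assuming for illustration that prime time means weekday evenings from 7 pm to 11 pm local time:

```python
import pandas as pd

def peak_to_mean_ratio(samples: pd.DataFrame) -> pd.Series:
    """Per-subscriber ratio of peak to mean 15-minute demand."""
    per_subscriber = samples.groupby("subscriber_id")["bytes"]
    return per_subscriber.max() / per_subscriber.mean()

def prime_time_vs_off_peak(samples: pd.DataFrame) -> pd.DataFrame:
    """Average per-interval usage inside vs. outside prime time, by group.

    `samples` is assumed to have columns: subscriber_id, group, timestamp,
    and bytes; prime time is taken here to be 7-11 pm on weekdays.
    """
    samples = samples.copy()
    hour = samples["timestamp"].dt.hour
    weekday = samples["timestamp"].dt.dayofweek < 5
    samples["prime_time"] = weekday & hour.between(19, 22)
    return (samples
            .groupby(["group", "prime_time"])["bytes"]
            .mean()
            .unstack("prime_time"))
```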

These results are interesting for last-mile Internet service providers because they suggest that the speeds at the edge may not currently be the limiting factor for user traffic demand. Specifically, the changes in peak traffic outside of prime-time hours suggest that even the (relatively) lower-speed connections (e.g., 105 Mbps) may be sufficient to satisfy the demands of users during prime-time hours. Of course, the constraints on prime-time demand (much of which is streaming) likely result from other factors, including both available content and the well-known phenomenon of congestion in the middle of the network, rather than in the last mile. All of this points to the increasing importance of resolving the performance issues that we see as a result of interconnection. In the best case, faster Internet service moves the bottleneck from the last mile to elsewhere in the network (e.g., interconnection points, long-haul transit links); in reality, it seems that the bottlenecks are already there, and we should focus on mitigating those points of congestion.

Further reading and study. You’ll be able to read more about our study in the following paper: A Case Study of Traffic Demand Response to Broadband Service-Plan Upgrades. S. Grover, R. Ensafi, N. Feamster. Passive and Active Measurement Conference (PAM). Heraklion, Crete, Greece. March 2016. (We will post an update when the final paper is published in early 2016.) There is plenty of room for follow-up work, of course; notably, the data we had access to did not have information about application usage, and only reflected byte-level usage at fifteen-minute intervals. Future studies could (and should) continue to study the effects of higher-speed links by exploring how the usage of specific applications (e.g., streaming video, file sharing, Web browsing) changes in response to higher downstream throughput.

When coding style survives compilation: De-anonymizing programmers from executable binaries

In a recent paper, we showed that coding style is present in source code and can be used to de-anonymize programmers. But what if only compiled binaries are available, rather than source code? Today we are releasing a new paper showing that coding style can survive compilation. Consequently, we can utilize these stylistic fingerprints via machine learning and de-anonymize programmers of executable binaries with high accuracy. This finding is of concern for privacy-aware programmers who would like to remain anonymous.

 

Update: Video of the talk at the Chaos Communication Congress

 

How to represent coding style in an executable binary.

Executable binaries of compiled source code are difficult to analyze on their own because they lack human-readable information. Nevertheless, reverse engineering methods make it possible to disassemble and decompile executable binaries. After applying such reverse engineering methods to executable binaries, we can generate numeric representations of authorial style from features preserved in the binaries.

We use a dataset consisting of source code samples from 600 programmers, available on the website of the annual programming competition Google Code Jam (GCJ). The dataset records how many rounds each contestant advanced, from which we can infer programming skill, and all contestants implemented algorithmic solutions to the same programming tasks. Since all the contestants implement the same functionality, the main difference between their samples is their coding style. Such ground truth provides a controlled environment for analyzing coding style.

We compile the source code samples of GCJ programmers to executable binaries. We disassemble these binaries to obtain their assembly instructions. We also decompile the binaries to generate approximations of the original source code and the respective control flow graphs. We subsequently parse the decompiled code using a fuzzy parser to obtain abstract syntax trees from which structural properties of code can be derived.

These data sources provide a combination of high-level program flow information, low-level assembly instructions, and structural syntactic properties. We convert these properties to numeric values to represent the prevalent coding style in each executable binary.
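To give a flavor of what converting such properties to numeric values can look like, here is a much-simplified sketch that disassembles a binary with objdump and turns the distribution of instruction mnemonics into a feature vector; the actual feature set is far richer, drawing on the decompiled code, control flow graphs, and abstract syntax trees as well.

```python
import re
import subprocess
from collections import Counter

import numpy as np

def mnemonic_counts(binary_path: str) -> Counter:
    """Disassemble a binary with objdump and count instruction mnemonics."""
    asm = subprocess.run(["objdump", "-d", binary_path],
                         capture_output=True, text=True, check=True).stdout
    # Disassembly lines look like:   401136:  55           push   %rbp
    pattern = r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s)+\s*([a-z][a-z0-9.]*)"
    return Counter(re.findall(pattern, asm, flags=re.MULTILINE))

def style_vector(binary_path: str, vocabulary: list) -> np.ndarray:
    """Relative frequency of each mnemonic in a fixed vocabulary."""
    counts = mnemonic_counts(binary_path)
    total = sum(counts.values()) or 1
    return np.array([counts[m] / total for m in vocabulary])
```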

We cast programmer de-anonymization as a machine learning problem.

We apply the machine learning workflow depicted in the figure below to de-anonymize programmers. There are three main stages in our programmer de-anonymization method. First, we extract a large variety of features from each executable binary to represent coding style with feature vectors, which is possible after applying state-of-the-art reverse engineering methods. Second, we train a random forest classifier on these vectors to generate accurate author models. Third, the random forest attributes authorship to the vectorial representations of previously unseen executable binaries.
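In scikit-learn terms, the classification stage looks roughly like the sketch below; the feature matrix here is a random placeholder standing in for the stylometric vectors extracted from real binaries, and the fold count simply mirrors the idea of training on 8 binaries per author and testing on a held-out one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 20 authors with 9 binaries each, 300 features per binary.
rng = np.random.default_rng(0)
X = rng.normal(size=(20 * 9, 300))   # stylometric feature vectors
y = np.repeat(np.arange(20), 9)      # known author of each binary

clf = RandomForestClassifier(n_estimators=500, random_state=0)

# Cross-validation: each fold trains on 8 binaries per author and tests on
# the remaining one, estimating how well authorship can be attributed.
scores = cross_val_score(clf, X, y, cv=9)
print("mean accuracy:", scores.mean())

# Attribute authorship of previously unseen binaries.
clf.fit(X, y)
predicted_authors = clf.predict(rng.normal(size=(5, 300)))
```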

The contribution of our work is that we show coding style is embedded in executable binaries as a fingerprint, making de-anonymization possible. The novelty of our method is that our feature set incorporates reverse engineering methods to generate a rich and accurate representation of properties hidden in binaries.

[Figure: the three-stage workflow for de-anonymizing programmers from executable binaries.]

Results.

We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset.

The executable binaries mentioned above are compiled without any compiler optimizations, which are options that make binaries smaller and faster by transforming the source code more than plain compilation does; as a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping the executable binaries (removing symbol information) reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.

Other interesting contributions of the paper.

  • By comparing advanced and less advanced programmers, we found that more advanced programmers are easier to de-anonymize and have a more distinct coding style.
  • We also de-anonymize GitHub users in the wild, which we explain in detail in the paper. These promising results encourage us to extend our method to larger and more varied real-world datasets in future work.
  • Why does de-anonymization work so well? It’s not because the decompiled source code looks anything like the original. Rather, the feature vector obtained from disassembly and decompilation can be used to predict, using machine learning, the features in the original source code — with over 80% accuracy. This shows that executable binaries preserve transformed versions of the original source code features. 

I would like to thank Arvind Narayanan for his valuable comments.

 

CITP Call for Visitors and Affiliates for 2016-17

The Center for Information Technology Policy is an interdisciplinary research center at Princeton that sits at the crossroads of engineering, the social sciences, law, and policy.

We are seeking applicants for various residential visiting positions and for non-residential affiliates. For more information about these positions, please see our general information page and our lists of current and past visitors.

We are happy to hear from anyone working at the intersection of digital technology and public life, including experts in computer science, sociology, economics, law, political science, public policy, information studies, communication, and other related disciplines.

We have a particular interest this year in candidates working on issues related to the Internet of Things (IoT).

Visitors

Note that you should apply to only one of the following two positions: the Visiting IT Policy Fellow position (if you would be on leave from a full-time position, as with a professor on sabbatical) or the IT Policy Researcher position (if Princeton University would be your primary affiliation during your visit, as with a postdoctoral researcher or a professional between jobs). Applicants to either of those positions may also apply to the Microsoft Visiting Professor posting.

All applicants should submit a current curriculum vitae, a research plan (including a description of potential courses to be taught if applying for the Visiting Professorship), and a cover letter describing background, interest in the program, and any funding support for the visit. CITP has secured limited resources from a range of sources to support visitors. However, many of our visitors are on paid sabbatical from their own institutions or otherwise provide some or all of their own outside funding.
For full consideration, completed applications must be received by January 29, 2016.

Microsoft Visiting Professor of Information Technology Policy
To apply, please go to Jobs at Princeton, click on “Search Open Positions,” and enter requisition number 1501013.

Visiting IT Policy Fellow
To apply, please go to Jobs at Princeton, click on “Search Open Positions,” and enter requisition number 1501014.

IT Policy Researcher
To apply, please go to Jobs at Princeton, click on “Search Open Positions,” and enter requisition number 1501016.

Affiliates

Technology policy researchers and experts who wish to have a formal affiliation with CITP, but cannot be in residence in Princeton, may apply to become a CITP Affiliate. The affiliation typically will last for two years. Affiliates do not have any formal appointment at Princeton University.

Applicants should submit their applications by email. Please send a current curriculum vitae and a cover letter describing background and interest in the program. For full consideration, completed applications must be received by January 29, 2016.