Archives for 2015

How Will Consumers Use Faster Internet Speeds?

This week saw an exciting announcement about the experimental deployment of DOCSIS 3.1 in limited markets in the United States, including Philadelphia, Atlanta, and parts of northern California, which will bring gigabit-per-second Internet speeds to many homes over the existing cable infrastructure. The potential for gigabit speeds over existing cable networks brings hope that more consumers will ultimately enjoy much higher-speed Internet connectivity, both in the United States and elsewhere.

This development is also a pointed response to the not-so-implicit pressure from the Federal Communications Commission to deploy higher-speed Internet connectivity. That pressure includes the redefinition of broadband as a downstream throughput of 25 megabits per second, up from the previous (and somewhat laughable) definition of 4 Mbps; several commissioners have also stated their intention to raise the threshold to a downstream throughput of 100 Mbps, a further indication that ISPs will face increasing pressure to offer higher-speed links to home networks. Yet the National Cable and Telecommunications Association has claimed in an FCC filing that such speeds are far more than a “typical” broadband user would require.

These developments and this posturing raise the question: How will consumers change their behavior in response to faster downstream throughput from their Internet service providers?

Ph.D. student Sarthak Grover, postdoc Roya Ensafi, and I set out to study this question with a cohort of about 6,000 Comcast subscribers in Salt Lake City, Utah, from October through December 2014. The study used a randomized controlled trial, an experimental method in which a group of users is randomly divided into a control group (whose users experience no change in conditions) and a treatment group (whose users are subject to a change in conditions). Assuming the cohort is large enough, represents a cross-section of the demographic of interest, and that users are assigned to the treatment group at random, it is possible to observe differences between the two groups’ outcomes and conclude how the treatment affects the outcome.
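
To make the experimental design concrete, here is a minimal Python sketch of the control/treatment split; the subscriber IDs, usage values, and 25% treatment fraction below are invented for illustration and are not the study’s data or parameters.

    import random

    def assign_groups(subscriber_ids, treatment_fraction=0.25, seed=42):
        """Randomly split a cohort into control and treatment groups."""
        rng = random.Random(seed)
        ids = list(subscriber_ids)
        rng.shuffle(ids)
        cutoff = int(len(ids) * treatment_fraction)
        return set(ids[cutoff:]), set(ids[:cutoff])  # (control, treatment)

    def mean(xs):
        return sum(xs) / len(xs)

    # usage_gb maps a subscriber ID to observed usage over the study period
    # (made-up numbers here, purely for illustration).
    usage_gb = {i: random.uniform(50, 500) for i in range(6000)}
    control, treatment = assign_groups(usage_gb.keys())

    # Because assignment is random, a difference in the groups' mean usage can be
    # attributed to the treatment (the silent 105 -> 250 Mbps upgrade) rather than
    # to pre-existing differences between the groups.
    diff = mean([usage_gb[i] for i in treatment]) - mean([usage_gb[i] for i in control])
    print(f"treatment - control mean usage: {diff:.1f} GB")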

In the case of this specific study, the control group consisted of about 5,000 Comcast subscribers who were paying for (and receiving) 105 Mbps downstream throughput; the treatment group, on the other hand, comprised about 1,500 Comcast subscribers who were paying for 105 Mbps but at the beginning of the study period were silently upgraded to 250 Mbps. In other words, users in the treatment group were receiving faster Internet service but were unaware of the faster downstream throughput of their connections. We explored how this treatment affected user behavior and made a few surprising discoveries:

“Moderate” users tend to adjust their behavior more than the “heavy” users. We expected that the subscribers who downloaded the most data in the 250 Mbps service tier would be the ones causing the largest difference in mean demand between the two groups (previous studies have observed this phenomenon, and we do observe this behavior for the most aggressive users). To our surprise, however, the median subscribers in the two groups exhibited much more significant differences in traffic demand, particularly at peak times. Notably, the 40% of subscribers with the lowest peak demands more than doubled their daily peak traffic demand in response to the service-tier upgrade (i.e., in the treatment group).

With the exception of the most aggressive peak-time subscribers, subscribers below the 40th percentile of peak demand increased their peak demand more than users who initially had higher peak demands.
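
As a rough illustration of this kind of analysis, the hypothetical pandas sketch below computes each subscriber’s mean daily peak 15-minute demand and compares the two groups decile by decile; the DataFrame name, column names, and schema are assumptions for illustration, not the study’s actual data format.

    import pandas as pd

    # Assumes a DataFrame `usage` with one row per subscriber per 15-minute
    # interval: columns = ['subscriber', 'group', 'timestamp', 'bytes'].

    def daily_peak_per_subscriber(usage):
        """Mean daily peak 15-minute demand (in bytes) for each subscriber."""
        usage = usage.assign(day=usage['timestamp'].dt.date)
        # Peak 15-minute interval within each subscriber-day ...
        daily_peak = usage.groupby(['subscriber', 'group', 'day'])['bytes'].max()
        # ... averaged over the study period.
        return daily_peak.groupby(['subscriber', 'group']).mean().rename('peak')

    peaks = daily_peak_per_subscriber(usage).reset_index()

    # Compare the two groups decile by decile: if the lower deciles of the
    # treatment group sit well above the corresponding deciles of the control
    # group, it is the "moderate" users who are responding to the upgrade.
    deciles = [i / 10 for i in range(1, 10)]
    comparison = peaks.groupby('group')['peak'].quantile(deciles).unstack(level=0)
    print(comparison)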

This result suggests a surprising trend: it’s not the aggressive data hogs who account for most of the increased use in response to faster speeds, but rather the “typical” Internet user, who tends to use the Internet more as a result of the faster speeds. Our dataset does not contain application information, so it is difficult to say what, exactly, is responsible for the higher data usage of the median user. Yet the result uncovers an oft-forgotten phenomenon of faster links: even existing applications that do not need to “max out” the link capacity (e.g., Web browsing, and even most video streaming) can benefit from a higher-capacity link, simply because they see better performance overall (e.g., faster load times and more resilience to packet loss, particularly when multiple parallel connections are in use). It might just be that the typical user is using the Internet more with the faster connection simply because the experience is better, not because they’re interested in filling the link to capacity (at least not yet!).

Users may use faster speeds for shorter periods of time, not always during “prime time”. There has been much ado about prime-time video streaming usage, and we most certainly see those effects in our data. To our surprise, though, the average usage per subscriber during prime-time hours was roughly the same between the treatment and control groups, while outside of prime time the difference was much more pronounced: average per-subscriber usage in the treatment group was about 25% higher than in the control group during non-prime-time weekday hours. We also observe that the peak-to-mean ratios for usage in the treatment group are significantly higher than those in the control group, indicating that users with faster speeds may periodically (and for short periods) take advantage of the significantly higher speeds, even though they do not sustain a rate that exhausts the higher capacity.
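
Continuing the same hypothetical 15-minute `usage` DataFrame as before, here is a sketch of the two metrics discussed here: average per-subscriber demand inside versus outside prime time, and per-subscriber peak-to-mean ratios. The 7pm–11pm prime-time window and the weekday filter are assumptions, not the study’s definitions.

    import pandas as pd

    usage = usage.assign(hour=usage['timestamp'].dt.hour,
                         weekday=usage['timestamp'].dt.dayofweek < 5)
    prime_time = usage['hour'].between(19, 22)  # 19:00 through 22:59

    # Average per-subscriber demand inside vs. outside prime time, by group.
    by_period = (usage[usage['weekday']]
                 .assign(period=prime_time.map({True: 'prime', False: 'off-peak'}))
                 .groupby(['group', 'period'])['bytes'].mean())
    print(by_period)

    # Peak-to-mean ratio per subscriber: a high ratio means the subscriber
    # occasionally bursts to much higher rates than their sustained average.
    per_sub = usage.groupby(['group', 'subscriber'])['bytes'].agg(['max', 'mean'])
    per_sub['peak_to_mean'] = per_sub['max'] / per_sub['mean']
    print(per_sub.groupby('group')['peak_to_mean'].median())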

These results are interesting for last-mile Internet service providers because they suggest that the speeds at the edge may not currently be the limiting factor for user traffic demand. Specifically, the changes in peak traffic outside of prime-time hours suggest that even the (relatively) lower-speed connections (e.g., 105 Mbps) may be sufficient to satisfy users’ demands during prime-time hours. Of course, the constraints on prime-time demand (much of which is streaming) likely result from other factors, including both available content and the well-known phenomenon of congestion in the middle of the network, rather than in the last mile. All of this points to the increasing importance of resolving the performance issues that we see as a result of interconnection. In the best case, faster Internet service moves the bottleneck from the last mile to elsewhere in the network (e.g., interconnection points, long-haul transit links); but, in reality, it seems that the bottlenecks are already there, and we should focus on mitigating those points of congestion.

Further reading and study. You’ll be able to read more about our study in the following paper: A Case Study of Traffic Demand Response to Broadband Service-Plan Upgrades. S. Grover, R. Ensafi, N. Feamster. Passive and Active Measurement Conference (PAM). Heraklion, Crete, Greece. March 2016. (We will post an update when the final paper is published in early 2016.) There is plenty of room for follow-up work, of course; notably, the data we had access to did not have information about application usage, and only reflected byte-level usage at fifteen-minute intervals. Future studies could (and should) continue to study the effects of higher-speed links by exploring how the usage of specific applications (e.g., streaming video, file sharing, Web browsing) changes in response to higher downstream throughput.

When coding style survives compilation: De-anonymizing programmers from executable binaries

In a recent paper, we showed that coding style is present in source code and can be used to de-anonymize programmers. But what if only compiled binaries are available, rather than source code? Today we are releasing a new paper showing that coding style can survive compilation. Consequently, we can utilize these stylistic fingerprints via machine learning and de-anonymize programmers of executable binaries with high accuracy. This finding is of concern for privacy-aware programmers who would like to remain anonymous.

 

Update: Video of the talk at the Chaos Communication Congress

 

How to represent the coding style in an executable binary.

Executable binaries of compiled source code are difficult to analyze on their own because they lack human-readable information. Nevertheless, reverse engineering methods make it possible to disassemble and decompile executable binaries. After applying such reverse engineering methods to executable binaries, we can generate numeric representations of authorial style from features preserved in the binaries.

We use a dataset consisting of source code samples from 600 programmers, which are available on the website of the annual programming competition Google Code Jam (GCJ). This dataset contains information such as how many rounds each contestant advanced, from which their programming skill can be inferred, while all contestants implemented algorithmic solutions to the same programming tasks. Since all the contestants implement the same functionality, the main difference between their samples is coding style. Such ground truth provides a controlled environment for analyzing coding style.

We compile the source code samples of GCJ programmers to executable binaries. We disassemble these binaries to obtain their assembly instructions. We also decompile the binaries to generate approximations of the original source code and the respective control flow graphs. We subsequently parse the decompiled code using a fuzzy parser to obtain abstract syntax trees from which structural properties of code can be derived.

These data sources provide a combination of high-level program flow information as well as low-level assembly instructions and structural syntactic properties. We convert these properties to numeric values to represent the prevalent coding style in each executable binary.
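
As a rough stand-in only, the sketch below counts instruction mnemonics in GNU objdump’s disassembly output and normalizes them into a fixed-length vector. The helper names and the use of objdump are illustrative assumptions; they are not the paper’s actual pipeline, which combines disassembly, decompilation, and fuzzy parsing into much richer syntactic and control-flow features.

    import subprocess
    from collections import Counter

    def mnemonic_counts(binary_path):
        """Toy stand-in for low-level instruction features: count how often each
        assembly mnemonic appears in a binary's disassembly (via `objdump -d`)."""
        out = subprocess.run(['objdump', '-d', binary_path],
                             capture_output=True, text=True, check=True).stdout
        counts = Counter()
        for line in out.splitlines():
            parts = line.split('\t')
            # Disassembly lines look like: "  401136:\t55\tpush   %rbp"
            if len(parts) >= 3 and parts[2].strip():
                counts[parts[2].split()[0]] += 1
        return counts

    def to_vector(counts, vocabulary):
        """Turn mnemonic counts into a fixed-length, normalized feature vector over
        a shared vocabulary (n-grams, AST- and CFG-derived features omitted here)."""
        total = sum(counts.values()) or 1
        return [counts[m] / total for m in vocabulary]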

We cast programmer de-anonymization as a machine learning problem.

We apply the machine learning workflow depicted in the figure below to de-anonymize programmers. There are three main stages in our programmer de-anonymization method. First, we extract a large variety of features from each executable binary to represent coding style with feature vectors, which is possible after applying state-of-the-art reverse engineering methods. Second, we train a random forest classifier on these vectors to generate accurate author models. Third, the random forest attributes authorship to the vectorial representations of previously unseen executable binaries.
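
A minimal sketch of the classification stage, assuming a feature matrix X (one row per binary, built from features like those above), author labels y, and unlabeled binaries X_unseen already exist; the number of trees and the 5-fold cross-validation below are illustrative choices, not the paper’s exact setup.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # X: (n_binaries, n_features) numeric matrix; y: author label per binary.
    # Both are assumed to exist already; hyperparameters here are illustrative.
    clf = RandomForestClassifier(n_estimators=500, random_state=0)

    # Cross-validated accuracy approximates how well previously unseen binaries
    # can be attributed to the correct author.
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"mean accuracy: {scores.mean():.2f}")

    # Train on all labeled binaries, then attribute authorship of new ones.
    clf.fit(X, y)
    predicted_authors = clf.predict(X_unseen)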

The contribution of our work is that we show coding style is embedded in executable binaries as a fingerprint, making de-anonymization possible. The novelty of our method is that our feature set incorporates reverse engineering methods to generate a rich and accurate representation of properties hidden in binaries.

 

 

Results.

We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset.

The above executable binaries are compiled without any compiler optimizations, which are options that make binaries smaller and faster by transforming the source code more aggressively than plain compilation; as a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify the programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.

Other interesting contributions of the paper.

  • By comparing advanced and less advanced programmers, we found that more advanced programmers have a more distinct coding style and are easier to de-anonymize.
  • We also de-anonymize GitHub users in the wild, which we explain in detail in the paper. These promising results encourage us to extend our method to large, real-world datasets of various kinds in future work.
  • Why does de-anonymization work so well? It’s not because the decompiled source code looks anything like the original. Rather, the feature vector obtained from disassembly and decompilation can be used to predict, using machine learning, the features in the original source code — with over 80% accuracy. This shows that executable binaries preserve transformed versions of the original source code features. 

I would like to thank Arvind Narayanan for his valuable comments.

 

New Professors' Letter Opposing The Defend Trade Secrets Act of 2015

As Freedom to Tinker readers may recall, I’ve been very concerned about the problems associated with the proposed Defend Trade Secrets Act. Ostensibly designed to combat cyberespionage against United States corporations, it is not a solution to that problem and is fraught with downsides. Today, over 40 colleagues in the academic world joined Eric Goldman, Chris Seaman, Sharon Sandeen and me in raising a variety of concerns about the DTSA in the following letter:

Professors’ Letter in Opposition to the Defend Trade Secrets Act of 2015.

Importantly, this new letter incorporates our 2014 opposition letter. As we explained,

While we agree that effective legal protection for U.S. businesses’ legitimate trade secrets is important to American innovation, we believe that the DTSA—which would represent the most significant expansion of federal law in intellectual property since the Lanham Act in 1946—will not solve the problems identified by its sponsors. Instead of addressing cyberespionage head-on, passage of the DTSA is likely to create new problems that could adversely impact domestic innovation, increase the duration and cost of trade secret litigation, and ultimately negatively affect economic growth. Therefore, the undersigned call on Congress to reject the DTSA.

We also call on Congress to hold hearings “that focus on the costs of the legislation and whether the DTSA addresses the cyberespionage problem that it is allegedly designed to combat. Specifically, Congress should evaluate the DTSA through the lens of employees, small businesses, and startup companies that are most likely to be adversely affected by the legislation.”

I will continue to blog on the DTSA as events warrant, and encourage Freedom to Tinker readers to contact their members of Congress and urge them to vote against the DTSA.

 

Provisions: how Bitcoin exchanges can prove their solvency

Millions of Bitcoin users store their bitcoins with online exchanges (e.g. Coinbase, Kraken) which store bitcoins on their customers’ behalf. They present an interface that looks somewhat like an online bank, allowing users to log in and request payments to other users or withdrawals. For many users this approach makes a lot more sense than the traditional approach of storing private keys on your laptop or phone and interacting with the Bitcoin network directly. Online exchanges require no software installation, enable a familiar password-based authentication model, and can guard against the risk of losing funds with a stolen laptop. Online exchanges can also improve the scalability and efficiency of Bitcoin by settling many logical transactions between users without actually moving funds on the block chain.

Of course, users must trust these exchanges not to get hacked or simply abscond with their money, both of which happened frequently in the early days of Bitcoin (nearly half of exchanges studied in a 2013 research paper failed). Famously, Mt. Gox was the largest online exchange until 2014 when it lost most of its customers’ funds under murky circumstances.

It has long been a goal of the Bitcoin community for exchanges to be able to cryptographically prove solvency—that is, to prove that they still control enough bitcoins to cover all of their customers’ accounts. Greg Maxwell first proposed an approach using Merkle trees in 2013, but this requires revealing (at a minimum) the total value of the exchange’s assets and which addresses the exchange controls. Exchanges have specifically cited these privacy risks as a reason they have not deployed proofs of solvency, relying on trusted audit instead.
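
For context, here is a toy Python sketch in the spirit of that Merkle-tree approach to proving liabilities: each internal node carries the sum of the balances beneath it, so a customer can check that their balance is included under the published root, but the root necessarily exposes the exchange’s total liabilities, one of the privacy leaks Provisions avoids. Balance splitting, padding, authentication-path verification, and other hardening are omitted; the account names and balances are made up.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def leaf(account_id: str, balance: int):
        """Leaf node (hash, balance): the hash binds the account to its balance."""
        return h(account_id.encode() + balance.to_bytes(8, 'big')), balance

    def parent(left, right):
        """Internal node: the hash commits to both children and their sums, and
        the node's value is the sum of all balances beneath it."""
        lh, ls = left
        rh, rs = right
        return h(lh + ls.to_bytes(8, 'big') + rh + rs.to_bytes(8, 'big')), ls + rs

    def build_root(balances):
        nodes = [leaf(acct, bal) for acct, bal in balances]
        while len(nodes) > 1:
            nxt = [parent(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
            if len(nodes) % 2:
                nxt.append(nodes[-1])  # carry an unpaired node up unchanged
            nodes = nxt
        return nodes[0]

    # The published root lets each customer verify inclusion via an authentication
    # path, but the root's sum is the exchange's total liability figure.
    root_hash, total_liabilities = build_root([('alice', 5), ('bob', 12), ('carol', 3)])
    print(total_liabilities)  # 20 -- visible to anyone who checks the proof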

In a new paper presented this month at CCS (co-authored with Gaby G. Dagher, Benedikt Bünz, Jeremy Clark and Dan Boneh), we present Provisions, the first cryptographic proof-of-solvency with strong privacy guarantees. Our protocol is suitable for Bitcoin but would work for most other cryptocurrencies (e.g. Litecoin, Ethereum). Our protocol hides the total assets and liabilities of the exchange, proving only that assets are strictly greater than liabilities. If desired, the value of this surplus can be proven. Provisions also hides all customer balances and hides which Bitcoin addresses the bank controls within a configurable anonymity set of other addresses on the block chain. The proofs are large, but reasonable to compute on a daily basis (in the tens of GB for a large exchange, computable in about an hour). Best of all, it is very simple and fast for each user to verify that they have been correctly included. We can even extend the protocol to prevent collusion between exchanges. The details are in the paper, the full version of which is now online.
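
The privacy properties rest on homomorphically additive (Pedersen) commitments to individual balances: commitments can be multiplied together to yield a commitment to the total without revealing any individual balance or the total itself. The toy sketch below, with made-up parameters and balances, illustrates only that additive property; the actual protocol works over an elliptic-curve group and layers zero-knowledge range proofs and the asset-side proof on top, which are well beyond this sketch.

    import secrets

    # Toy, insecure parameters: a Mersenne prime modulus and two small generators.
    # A real deployment uses an elliptic-curve group (e.g., secp256k1) with
    # generators that have no known discrete-log relation.
    P = 2**127 - 1
    G, H = 3, 5

    def commit(value: int, blinding: int) -> int:
        """Pedersen-style commitment C = g^value * h^blinding (mod p)."""
        return (pow(G, value, P) * pow(H, blinding, P)) % P

    # Commit to each (made-up) customer balance with fresh random blinding.
    balances = {'alice': 5, 'bob': 12, 'carol': 3}
    blindings = {name: secrets.randbelow(P - 1) for name in balances}
    commitments = {name: commit(balances[name], blindings[name]) for name in balances}

    # Additive homomorphism: the product of the commitments is a commitment to the
    # sum of the balances (under the sum of the blindings), so statements about the
    # total can be proven without opening any individual commitment.
    product = 1
    for c in commitments.values():
        product = (product * c) % P

    assert product == commit(sum(balances.values()), sum(blindings.values()))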

While our Provisions protocol removes the privacy concerns of performing a cryptographic proof-of-solvency, there are still some practical deployment questions because the proof requires the exchange to compute using its private keys. Exchanges rightly go to great lengths to protect these keys, often keeping them offline and/or in hardware security modules. Performing a regular solvency proof requires careful thinking about the right internal procedure for accessing these keys.

These deployment questions can be solved. We hope that cryptographic proofs of solvency will soon be expected of upstanding exchanges. Incidents like that of Mt. Gox have greatly damaged public perception of the entire Bitcoin ecosystem. While solvency proofs can’t prevent exchange compromises, they would have made Mt. Gox’s troubles public earlier and more clearly. They would also shore up confidence in today’s exchanges which are (presumably) solvent.

Taking a step back, solvency proofs are yet another example where we can replace an expensive and trust-laden process in the offline world (financial inspection by a trusted auditor) with a “trustless” cryptographic protocol. It’s always exciting to take a new step in that direction. There remain limits to what cryptography can do, though. Critically, solvency proofs do not create a binding obligation to pay. A malicious exchange could complete a Provisions proof and then immediately abscond with all of the money. For this reason, some form of government regulation of online exchanges makes sense. Though regulation is dreaded by many in the Bitcoin community, it appears to be on the horizon. Bills have been proposed in several states, largely aimed at exchanges. Interestingly, the model regulatory framework proposed by the Conference of State Bank Supervisors in September already mentions cryptographic solvency proofs as a means of demonstrating solvency. We hope this recommendation is enacted into law and that solvency proofs become a tool to avoid the cost of the heavyweight auditing requirements traditionally demanded of banks, while simultaneously increasing transparency for exchange customers.

How is NSA breaking so much crypto?

There have been rumors for years that the NSA can decrypt a significant fraction of encrypted Internet traffic. In 2012, James Bamford published an article quoting anonymous former NSA officials stating that the agency had achieved a “computing breakthrough” that gave them “the ability to crack current public encryption.” The Snowden documents also hint at some extraordinary capabilities: they show that NSA has built extensive infrastructure to intercept and decrypt VPN traffic and suggest that the agency can decrypt at least some HTTPS and SSH connections on demand.

However, the documents do not explain how these breakthroughs work, and speculation about possible backdoors or broken algorithms has been rampant in the technical community. Yesterday at ACM CCS, one of the leading security research venues, we and twelve coauthors presented a paper that we think solves this technical mystery.

[Read more…]