April 19, 2014

Side-Channel Leaks in Web Applications

Popular online applications may leak your private data to a network eavesdropper, even if you’re using secure web connections, according to a new paper by Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. (Chen is at Microsoft Research; the others are at Indiana.) It’s a sobering result — yet another illustration of how much information can be leaked by ordinary web technologies. It’s also really clever.

Here’s the background: Secure web connections encrypt traffic so that only your browser and the web server you’re visiting can see the contents of your communication. Although a network eavesdropper can’t understand the requests your browser sends, nor the replies from the server, it has long been known that an eavesdropper can see the size of the request and reply messages, and that these sizes sometimes leak information about which page you’re viewing, if the request size (i.e., the size of the URL) or the reply size (i.e., the size of the HTML page you’re viewing) is distinctive.

The new paper shows that this inference-from-size problem gets much, much worse when pages are using the now-standard AJAX programming methods, in which a web “page” is really a computer program that makes frequent requests to the server for information. With more requests to the server, there are many more opportunities for an eavesdropper to make inferences about what you’re doing — to the point that common applications leak a great deal of private information.

Consider a search engine that autocompletes search queries: when you start to type a query, the search engine gives you a list of suggested queries that start with whatever characters you have typed so far. When you type the first letter of your search query, the search engine page will send that character to the server, and the server will send back a list of suggested completions. Unfortunately, the size of that suggested completion list will depend on which character you typed, so an eavesdropper can use the size of the encrypted response to deduce which letter you typed. When you type the second letter of your query, another request will go to the server, and another encrypted reply will come back, which will again have a distinctive size, allowing the eavesdropper (who already knows the first character you typed) to deduce the second character; and so on. In the end the eavesdropper will know exactly which search query you typed. This attack worked against the Google, Yahoo, and Microsoft Bing search engines.
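
To make the character-by-character inference concrete, here is a minimal Python sketch. It assumes the eavesdropper has already built a table mapping each prefix to the size of the suggestion list it produces. The table, its byte counts, and the function name `infer_query` are all invented for illustration; they are not taken from the paper.

```python
# Hypothetical sizes (in bytes) of the encrypted suggestion response for each prefix.
# These numbers are made up; a real attacker would measure them in advance by
# querying the search engine themselves.
SIZE_TABLE = {
    "a": 512, "b": 498, "c": 530,
    "ca": 471, "cb": 455, "cc": 460,
    "cat": 402, "can": 417, "car": 433,
}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def infer_query(observed_sizes):
    """Extend each candidate prefix by every letter whose known response
    size matches the next observed size, one keystroke at a time."""
    candidates = [""]
    for size in observed_sizes:
        candidates = [
            prefix + ch
            for prefix in candidates
            for ch in ALPHABET
            if SIZE_TABLE.get(prefix + ch) == size
        ]
    return candidates


# Three encrypted responses of sizes 530, 471, and 402 bytes were observed.
print(infer_query([530, 471, 402]))  # ['cat']
```

If two letters happened to produce the same size, the sketch would simply carry both candidates forward, and later keystrokes would usually narrow them down again.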

Many web apps that handle sensitive information seem to be susceptible to similar attacks. The researchers studied a major online tax preparation site (which they don’t name) and found that it leaks a fairly accurate estimate of your Adjusted Gross Income (AGI). This happens because the exact set of questions you have to answer, and the exact data tables used in tax preparation, vary with your AGI. For example, one particular interaction, relating to a possible student loan interest calculation, happens only if your AGI is between $115,000 and $145,000, so the presence or absence of the distinctively sized message exchange for that calculation tells an eavesdropper whether your AGI falls in that range. By assembling a set of clues like this, an eavesdropper can get a good fix on your AGI, plus information about your family status, and so on.
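
The way several such clues combine can be sketched in a few lines of Python. Only the student loan range comes from the example above; the other two clues, their names, and their ranges are hypothetical, as is the assumption that each range is inclusive at both ends.

```python
# Each clue is (name, low, high): the exchange is assumed to occur only when
# low <= AGI <= high. Only the first range appears in the post.
CLUES = [
    ("student_loan_interest", 115_000, 145_000),  # range quoted in the post
    ("hypothetical_credit_a", 0, 130_000),        # invented for illustration
    ("hypothetical_credit_b", 120_000, 200_000),  # invented for illustration
]


def consistent_agis(observed, step=1_000, max_agi=300_000):
    """Return the candidate AGI values that agree with every clue: an exchange
    must have been seen exactly when the AGI lies inside its range."""
    result = []
    for agi in range(0, max_agi + 1, step):
        if all((low <= agi <= high) == (name in observed)
               for name, low, high in CLUES):
            result.append(agi)
    return result


# The eavesdropper saw two of the three distinctive exchanges.
seen = {"student_loan_interest", "hypothetical_credit_b"}
candidates = consistent_agis(seen)
print(min(candidates), max(candidates))  # 131000 145000
```

Note that the absence of an exchange is as informative as its presence: here the missing `hypothetical_credit_a` exchange is what rules out everything at or below $130,000.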

For similar reasons, a major online health site leaks information about which medications you are taking, and a major investment site leaks information about your investments.

The paper goes on to consider possible mitigations. The most obvious mitigation is to add padding to messages so that their sizes are not so distinctive — for example, every message might be padded to make its size a multiple of 256 bytes. This turns out to be less effective than you might expect — significant information can still leak even if messages are generously padded — and the padded messages are slower and more expensive to transmit.
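
A small Python sketch shows why rounding sizes up to a multiple of 256 bytes does not hide everything. The unpadded sizes below are invented for illustration, and `padded_size` is a hypothetical helper, not something from the paper.

```python
def padded_size(size, block=256):
    """Round a message size up to the next multiple of `block` bytes."""
    return -(-size // block) * block  # ceiling division, then scale back up


# Hypothetical unpadded response sizes (bytes) for four possible user inputs.
sizes = {"a": 512, "b": 498, "c": 530, "d": 1100}

# Group the inputs by the size an eavesdropper would see after padding.
buckets = {}
for letter, size in sizes.items():
    buckets.setdefault(padded_size(size), []).append(letter)

print(buckets)  # {512: ['a', 'b'], 768: ['c'], 1280: ['d']}
```

In this made-up example padding hides the difference between "a" and "b", but "c" and "d" still land in buckets of their own, so an eavesdropper can identify them exactly. The padding also costs bandwidth: the response for "c" grows from 530 to 768 bytes.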

We don’t know which sites the researchers studied, but it seems like a safe bet that most, if not all, of the sites in these product categories have similar problems. It’s important to keep these attacks in perspective — bear in mind that they can only be carried out by someone who can eavesdrop on the network between you and the site you’re visiting.

It’s becoming increasingly clear that securing web-based applications is very difficult, and that the basic tools for developing web apps don’t do much to help. The industry, and researchers, will be struggling with web app security issues for years to come.

Comments

  1. Jeff S. says:

    Fascinating analysis, interesting topic.

    My contribution is needlessly pedantic and contributes nothing to the discussion, but… can we please banish the phrase “search query” from the vernacular? I shall be sending a strongly worded memo to the President of the Internet.

    Sorry to take the conversation off topic right away, but this issue is exceedingly annoying.

    Please resume intelligent discourse now.

    • felten says:

      I’m curious: What’s wrong with the term “search query”? And what would you replace it with?

      • Jeff S. says:

        “Search” and “Query” are synonyms. Perhaps not entirely interchangeable, but for most cases, they mean the same thing.

        “Search query” is needlessly redundant and wordified, like having a “hamburger sandwich” or driving a “motor car.”

        Said another way, how is a “search query” different from other types of queries? In fact, what other types of queries are there? Do they not involve searching?

        Alternatives:

        “In the end the eavesdropper will know exactly what search you typed.”

        -or-

        “In the end the eavesdropper will know exactly what query you typed.”

        • Anonymous says:

          A Search Query in programming could be a specific query object used to pass to a search provider. var searchQuery = new SearchQuery() { QueryText = "security" };
          Meanwhile, you can have a BlogCommentQuery that is used to query for a BlogComment. var commentQuery = new BlogCommentQuery() { BlogPostId = "9" }; The SearchQuery object is querying a searchable index. The BlogCommentQuery object is querying a table in a database.

          • Anonymous says:

            I didn’t see any code examples in the article. You could also have SrchQuery as the object in programming, and I don’t think you could argue the correct spelling of search is now srch.

    • Feto says:

      I’m curious… Who is The President of The Internet?

      • Jeff S. says:

        “President of the Internet” was supposed to be funny. Next time, I’ll have to use one of those punctuation smileys that all the kids are using these days.

  2. Anonymous says:

    I’m very far from an expert on this stuff, but…

    It has been my understanding that military encryption systems (as of several decades ago) were designed to transmit a continuous stream of (pseudo???)-random “garbage”. The result being that any eavesdropper couldn’t tell when actual traffic was flowing by merely inspecting the data stream. As you point out, simply watching the bursts of data go by does itself leak info, so making it a continuous stream of bits closes that loophole.

    Unfortunately, that method is probably not particularly practical for the current internet as it would vastly increase the amount of traffic being transported.

    • felten says:

      Yes, these approaches are probably too expensive for the web setting. Padding messages up to a fixed (or quantized) length is a milder version of this approach, adding some “cover traffic” but not too much.

      • rp says:

        Isn’t padding to a quantized length going to give out way too much information? You could pad to a random set of lengths without using significantly more bandwidth. (And somehow I think the bandwidth concerns are pretty meaningless here — unless the serving organization’s pipes are saturated, even doubling the amount of material transmitted is going to be a fraction of a typical video clip or flash graphic — which many of those sites also send out.)

  3. Jon-Michael C. Brook says:

    The Government performs all sorts of manipulations to avoid side channel attacks. They primarily fall into two categories. Government agencies avoid timing attacks by adding random CPU cycles or network delays, just so an adversary cannot tell they are doing things like a large amount of encryption or ordering a bunch of pizzas. Chen, Wang, Wang and Zhang’s research points to sizing attacks, where memory usage or network packet size may be used to glean a bit more information. As mentioned in the article, random memory calls and network packets, or packet padding will circumvent many of these attacks.

    Most of these only work when there is a large amount of information known about the system. Proprietary systems (those built from the ground up) have the security by obscurity aspect. One of the beauties of cloud computing is how well it is defined – I see this as yet another weakness to data storage/processing in the cloud. Then again, it’s probably easier just to look up the user’s data on a public records web site.

  4. Claude says:

    This is a very, very nice paper!
    However, I am really wondering to what extent the “query work leaks” attack works. I wish they had done some more tests and provided experimental results.

    There are at least 2 scenarios I can think of where the attack does not work well:
    (1) Google signed-in users get personalized suggestions (from their web history)…and these entries would be hard(er) to predict (personalization in this case helps privacy ;-) )…
    (2) If a user types quickly, the number of AJAX requests can be reduced (i.e. a request might be sent for 2-3 letters)…and this, again, will make the guessing more difficult!

    If you are interested in this type of work, please have a look at the paper “Private Information Disclosure from Web Searches (the case of Google Web History)”, available at:
    http://planete.inrialpes.fr/projects/private-information-disclosure-from-web-searches/
    This paper shows how a user’s web history can be inferred from his web searches and more…

  5. paranoid says:

    I looked at this some time back, and even then Yahoo! had already fixed it. Bing and Google haven’t addressed it.