April 23, 2014

NSA call data analysis: inside or outside government?

Last week the President suggested that the NSA’s database of phone call data be stored outside the government, and he asked his Administration to study how this could be done. Today I’d like to start unpacking the options.

The phone call data consists of a record, for every phone call, of the calling and called numbers, along with the date, time, and duration of the call. This is revealing data, affecting the privacy of essentially every American, so there has been natural pushback against the NSA’s collection, retention, and use of the data.

The question is often framed as: Who should store the data? That’s the approach taken by the White House Review Panel’s report. One aspect of this question has to do with how long the data will be retained. The NSA holds call data for five years, but some phone companies keep it for a shorter period. Keeping the data in the phone companies will tend to shorten the data’s lifetime, absent some kind of retention requirement. For today, let’s set aside the question of how long data should be stored, and focus on the question as framed by the Review Panel and the President: Who should store the data?

Implicit in this framing is the assumption that the database is used by searching for specific records: the analyst sends a search query like “Show me all calls involving the number 301-688-6524,” and the system supplies any records responsive to the search. If that’s your model, then the most important design question is who stores the data.
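To make the search model concrete, here is a minimal sketch of that kind of query in Python. The `CallRecord` type, its field names, and the `calls_involving` function are hypothetical names chosen for illustration; the fields simply mirror the record contents described above, and the sample records are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    """One call: who called whom, when, and for how long (hypothetical layout)."""
    caller: str
    callee: str
    start: datetime
    duration_seconds: int


def calls_involving(records, number):
    """Return every record in which `number` is the calling or the called party."""
    return [r for r in records if number in (r.caller, r.callee)]


# Made-up sample data, for illustration only.
sample_records = [
    CallRecord("301-688-6524", "555-0100", datetime(2014, 1, 5, 9, 30), 120),
    CallRecord("555-0101", "301-688-6524", datetime(2014, 1, 6, 14, 0), 45),
    CallRecord("555-0100", "555-0101", datetime(2014, 1, 7, 18, 15), 300),
]

matches = calls_involving(sample_records, "301-688-6524")
print(len(matches))
```

In this model the analyst only ever sees the records that match the query, which is why the storage question looks like the whole story.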

But that reflects an outmoded view of how large datasets tend to be used, and probably an inaccurate view of how the NSA uses this particular dataset. Most likely, the NSA doesn’t just search the database; it performs computations on it. For example, analysts might start with an approved “seed” number S and ask which other numbers have calling patterns most similar to S.

(This might appear at first to be inconsistent with the NSA’s public statements about the call data program, for example the assertion that they never explore the call graph to a distance more than two or three “hops” from an approved seed number. But there isn’t necessarily an inconsistency. Any number whose calls are similar to S would necessarily be within two hops of S; otherwise that number doesn’t talk to anyone that S talks to. And (some versions of) the similarity computation can be implemented in a way that never looks at any number more than two hops away from S. So this computation seems to be consistent with a hop limit.)
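As a rough illustration of how a similarity computation can respect a two-hop limit, here is a sketch in Python. The scoring rule (the fraction of S's contacts that another number also talks to) is my own assumption, chosen because it is simple; nothing here is known to match what the NSA actually computes, and the function names and sample data are made up.

```python
from collections import Counter, defaultdict


def build_contacts(call_records):
    """Map each number to the set of numbers it has exchanged calls with."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def shared_contact_scores(contacts, seed):
    """Score numbers by the fraction of the seed's contacts they also talk to.

    The loops visit only the seed's contacts (one hop) and the contacts of
    those contacts (two hops); nothing farther away is ever read.
    """
    seed_contacts = contacts.get(seed, set())
    shared = Counter()
    for contact in seed_contacts:                    # one hop from the seed
        for other in contacts.get(contact, set()):   # two hops from the seed
            if other != seed:
                shared[other] += 1
    return {number: count / len(seed_contacts) for number, count in shared.items()}


# Made-up (caller, callee) pairs. "Z" is three hops from "S".
records = [
    ("S", "A"), ("S", "B"),
    ("X", "A"), ("X", "B"),
    ("Y", "B"), ("Y", "C"),
    ("Z", "C"),
]

scores = shared_contact_scores(build_contacts(records), "S")
print(sorted(scores.items()))  # [('X', 1.0), ('Y', 0.5)]
```

Here X talks to everyone S talks to and scores highest, Y shares one contact, and Z never enters the computation at all.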

If call data is held outside the NSA, there are different approaches to doing this computation. One approach is to “pull” all of the data needed by the computation and ship it to the NSA, which then does the computation. A second approach is to do all or part of the computation outside the NSA, shipping only pre-digested results or partial results to the NSA. The choice affects privacy: if you are two hops away from S but have a calling pattern totally different from S, then the NSA will receive your full call data in the first approach but probably not in the second.
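The difference between the two approaches can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the shared-contact score, the 0.5 threshold, and the sample data are mine, not a description of any real system.

```python
from collections import Counter, defaultdict


def build_contacts(records):
    """Map each number to the set of numbers it has exchanged calls with."""
    contacts = defaultdict(set)
    for caller, callee in records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def pull_approach(records, seed):
    """Approach 1: ship full call data for every number within two hops of the seed."""
    contacts = build_contacts(records)
    within_two_hops = {seed} | contacts.get(seed, set())
    for contact in contacts.get(seed, set()):
        within_two_hops |= contacts.get(contact, set())
    return [r for r in records if r[0] in within_two_hops or r[1] in within_two_hops]


def digest_approach(records, seed, threshold=0.5):
    """Approach 2: compute at the data holder and ship only the high-scoring numbers."""
    contacts = build_contacts(records)
    seed_contacts = contacts.get(seed, set())
    shared = Counter()
    for contact in seed_contacts:
        for other in contacts.get(contact, set()):
            if other != seed:
                shared[other] += 1
    scores = {n: c / len(seed_contacts) for n, c in shared.items()}
    return {n: s for n, s in scores.items() if s >= threshold}


# Made-up data: "Y" is two hops from "S" but mostly calls "D" and "E".
records = [
    ("S", "A"), ("S", "B"), ("S", "C"),
    ("X", "A"), ("X", "B"), ("X", "C"),
    ("Y", "C"), ("Y", "D"), ("Y", "E"),
]

pulled = pull_approach(records, "S")
digest = digest_approach(records, "S")
print(("Y", "D") in pulled)  # True: Y's unrelated calls reach the NSA
print(digest)                # {'X': 1.0}: Y does not appear at all
```

In the first approach Y's calls to D and E are shipped even though they have nothing to do with S; in the second, only X is ever reported.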

The key point is that it matters not only where the data is stored, but also where the computation is done. Indeed, computer scientists often think of the structure of the computation as the primary question, because data placement tends to follow the needs of the computation.

A good answer to the President’s question, then, starts by restating it in a more technically precise form: Which types of computations does the NSA need to do on call data, and how can those computations be structured to best serve the goals of legitimate intelligence gathering, personal privacy, and effective oversight?

I’ll start attacking that question in the next post.

Comments

  1. Harry Johnston says:

    Stating the obvious, I see that they’ve skipped very lightly indeed over the first and perhaps more important question, “Should the data be stored at all?”

    • NathanT says:

      Thumbs up for Harry Johnston, that is indeed the first and most important question. The data should not be stored at all, is my answer, and see my reasoning below.

  2. travel says:

    Please let me know if you’re looking for a writer for your site.
    You have some really good posts and I think I would be a good asset.

    If you ever want to take some of the load off, I’d really
    like to write some content for your blog in exchange for a link back to mine.

    Please blast me an email if interested. Thank you!

  3. NathanT says:

    Ed, Ed, Ed, you did it again: you ask a question, divert the question, and change the question before giving any answer.

    You say you are going to explore whether storage should be inside or outside of the government, you sidestep the question of how long data should be retained, and then finally answer that the more important thing is the computations that look at the data.

    The answer to the question of where such data should be stored should be NOWHERE! The question of how long data should be stored is inseparable, because reality says some storage is needed by the company that handles the data initially.

    Thus the two questions combined: the answer to how long and where should be “only stored by the phone company for only as long as needed for billing transactions to take place.” Once the bill has been paid, the data should be erased from the phone company’s records; if the individual (who received the bill) wants to keep or destroy those records, so be it. If the government needs access, it needs to get a warrant to search for those records at the person’s billing address, or to institute, again with a proper warrant based on probable cause, a phone trace or phone tap and get such records individually as they occur.

    As for computations, the government shouldn’t be allowed to conduct computations over my phone records unless they have a clear and specific warrant against my communications with probable cause to show that my communications would show evidence of subversion to the government (or some other criminal behavior).

    Everything else is completely in violation of the US Constitution, because in ALL the options presented here the government is intruding on the privacy and security of people’s “papers and effects” (which meant “communications” in their day) without a warrant detailing the specifics of what is to be searched, and without probable cause indicating that such a search will indeed reveal evidence of a specific crime.

  4. paul says:

    This doesn’t make a lot of sense to me. If it’s really important to find out who has calling patterns similar to S, you can simply get the numbers S called and get approval for searches on who has called some named subset of them. If you can’t articulate your reasons for wanting that well enough to get a rubber-stamp judge to agree, then you shouldn’t be playing games with personal data.