Last week the President suggested that the NSA’s database of phone call data be stored outside the government, and he asked his Administration to study how this could be done. Today I’d like to start unpacking the options.
The phone call data consists of a record, for every phone call, of the calling and called numbers, along with the date, time, and duration of the call. This is revealing data, affecting the privacy of essentially every American, so there has been natural pushback against the NSA’s collection, retention, and use of the data.
The question is often framed as: Who should store the data? That’s the approach taken by the White House Review Panel’s report. One aspect of this question has to do with how long the data will be retained. The NSA holds call data for five years, but some phone companies keep it for a shorter period. Leaving the data with the phone companies will tend to shorten its lifetime, absent some kind of retention requirement. For today, let’s set aside the question of how long data should be stored, and focus on the question as framed by the Review Panel and the President: Who should store the data?
Implicit in this framing is the assumption that the database is used by searching for specific records: the analyst sends a search query like “Show me all calls involving the number 301-688-6524,” and the system supplies any records responsive to the search. If that’s your model, then the most important design question is who stores the data.
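To make the search model concrete, here is a minimal sketch in Python. The record fields follow the description above (calling number, called number, date and time, duration), but the `CallRecord` type, the `search` function, and the sample records are all hypothetical illustrations, not a description of any real system.

```python
from typing import NamedTuple

class CallRecord(NamedTuple):
    """One row of call data (hypothetical layout)."""
    caller: str
    callee: str
    start: str        # date and time of the call, ISO 8601
    duration_s: int   # duration in seconds

def search(records, number):
    """Return every record in which `number` is the caller or the callee."""
    return [r for r in records if number in (r.caller, r.callee)]

# Made-up sample data; only the queried number comes from the example above.
records = [
    CallRecord("301-688-6524", "555-0101", "2014-01-20T09:15:00", 120),
    CallRecord("555-0102", "555-0103", "2014-01-20T10:00:00", 45),
    CallRecord("555-0104", "301-688-6524", "2014-01-21T14:30:00", 300),
]

hits = search(records, "301-688-6524")
for r in hits:
    print(r)
```

In this model the store only ever answers lookups, which is why the location of the store looks like the whole design question.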
But that reflects an outmoded view of how large datasets tend to be used, and probably an inaccurate view of how the NSA uses this particular dataset. Most likely, the NSA doesn’t just search the database; it performs computations on it. For example, analysts might start with an approved “seed” number S and ask which other numbers have calling patterns most similar to S.
(This might appear at first to be inconsistent with the NSA’s public statements about the call data program, for example the assertion that they never explore the call graph to a distance more than two or three “hops” from an approved seed number. But there isn’t necessarily an inconsistency. Any number whose calls are similar to S would necessarily be within two hops of S; otherwise that number doesn’t talk to anyone that S talks to. And (some versions of) the similarity computation can be implemented in a way that never looks at any number more than two hops away from S. So this computation seems to be consistent with a hop limit.)
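Here is one way such a computation might look, as a rough Python sketch. The similarity measure (the fraction of S’s contacts that a number also talks to) is my own assumption, chosen because it is simple; nothing public says the NSA uses it. The function names and sample calls are hypothetical. For brevity the sketch indexes all of the calls up front, but the scoring step reads only the contact lists of S and of S’s direct contacts.

```python
from collections import Counter, defaultdict

def build_contacts(calls):
    """Map each number to the set of numbers it has exchanged calls with."""
    contacts = defaultdict(set)
    for a, b in calls:
        contacts[a].add(b)
        contacts[b].add(a)
    return contacts

def similar_to_seed(calls, seed):
    """Score numbers by the share of the seed's contacts they also talk to.

    Assumed measure: shared contacts / number of seed contacts.
    Only numbers within two hops of the seed can ever receive a score.
    """
    contacts = build_contacts(calls)
    first_hop = contacts.get(seed, set())
    shared = Counter()
    for hop1 in first_hop:              # one hop from the seed
        for hop2 in contacts[hop1]:     # at most two hops from the seed
            if hop2 != seed:
                shared[hop2] += 1
    return {number: n / len(first_hop) for number, n in shared.items()}

# Made-up call graph: X talks to both of S's contacts, Y to only one.
calls = [
    ("S", "A"), ("S", "B"),
    ("X", "A"), ("X", "B"),
    ("Y", "B"), ("Y", "Z"),
    ("Z", "W"),
]
scores = similar_to_seed(calls, "S")
print(scores)
```

Z and W never receive a score, because neither shares a contact with S.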
If call data is held outside the NSA, there are different approaches to doing this computation. One approach is to “pull” all of the data needed by the computation and ship it to the NSA, which then does the computation. A second approach is to do all or part of the computation outside the NSA, shipping only pre-digested or partial results to the NSA. The choice of approach affects privacy: if you are two hops away from S but have a calling pattern totally different from S, then the NSA will receive your full call data in the first approach but probably not in the second.
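The difference between the two approaches can be sketched as follows. Everything here is hypothetical: the function names `pull` and `digest`, the overlap measure, the threshold, and the sample calls are assumptions made for illustration, and both functions stand in for code that would run at whoever holds the data.

```python
from collections import defaultdict

def contacts_index(calls):
    """Map each number to the set of numbers it has exchanged calls with."""
    idx = defaultdict(set)
    for a, b in calls:
        idx[a].add(b)
        idx[b].add(a)
    return idx

def pull(calls, seed):
    """Approach 1: ship the raw records of every number within two hops."""
    idx = contacts_index(calls)
    first_hop = idx.get(seed, set())
    within_two = {seed} | first_hop
    for hop1 in first_hop:
        within_two |= idx[hop1]
    return [c for c in calls if c[0] in within_two or c[1] in within_two]

def digest(calls, seed, min_overlap):
    """Approach 2: compute at the data holder, ship only high scorers.

    Assumed score: fraction of the seed's contacts a number also talks to.
    """
    idx = contacts_index(calls)
    first_hop = idx.get(seed, set())
    result = {}
    if not first_hop:
        return result
    for number, theirs in idx.items():
        if number == seed:
            continue
        overlap = len(first_hop & theirs) / len(first_hop)
        if overlap >= min_overlap:
            result[number] = overlap
    return result

# Made-up data: X resembles S; Y is two hops away but otherwise unrelated.
CALLS = [
    ("S", "A"), ("S", "B"), ("S", "C"), ("S", "D"),
    ("X", "A"), ("X", "B"), ("X", "C"),
    ("Y", "D"), ("Y", "P"), ("Y", "Q"), ("Y", "R"),
]
pulled = pull(CALLS, "S")
digested = digest(CALLS, "S", min_overlap=0.5)
print(len(pulled), "raw records shipped under approach 1")
print(digested, "shipped under approach 2")
```

Under the first approach the shipped records include Y’s calls to P, Q, and R. Under the second, only X and its score leave the data holder.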
The key point is that it matters not only where the data is stored, but also where the computation is done. Indeed, computer scientists often think of the structure of the computation as the primary question, because data placement tends to follow the needs of the computation.
A sophisticated answer to the President’s question, then, starts by restating the question in a more technically precise form: Which types of computations does the NSA need to do on call data, and how can those computations be structured to best serve the goals of legitimate intelligence gathering, personal privacy, and effective oversight?
I’ll start attacking that question in the next post.