There are many interesting things to discuss in Judge Leon’s opinion from yesterday, finding the NSA’s phone metadata program likely unconstitutional. In this post, I’ll focus on an interesting bit of computer science in the judge’s ruling, and I’ll explain why the judge’s computer science argument is actually more powerful than he realized.
The judge found that the plaintiffs had standing to challenge the constitutionality of the NSA’s practices, based on the NSA’s use of plaintiffs’ data in processing queries. (He also found standing for other reasons.)
To reach that conclusion, the judge found that the NSA’s contact chaining analysis necessarily uses data about these specific plaintiffs.
(The relevant part of the opinion starts at the bottom of page 38, and goes through page 41.)
The NSA’s contact chaining analysis uses a notion of distance based on “hops”. If A has talked to B in the last five years, then A and B are one hop apart. If A has talked to B in the last five years, and B in turn has talked to C in the last five years, then A and C are two hops apart. And so on. The NSA’s analysis starts with a “seed” phone number that has been approved as meeting a legally required level of suspicion. The analysis then extends up to three hops away from the seed number.
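To make the hop definition concrete, here is a minimal sketch of contact chaining as a breadth-first search over call records. The function name `within_hops`, the `(caller, callee)` pair format, and the sample numbers are all assumptions made for illustration; nothing here describes the NSA's actual software.

```python
from collections import deque

def within_hops(calls, seed, max_hops=3):
    """Return {number: hop distance} for every number within max_hops of seed."""
    # A call in either direction puts two numbers one hop apart,
    # so the contact graph is undirected.
    neighbors = {}
    for a, b in calls:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        if dist[current] == max_hops:
            continue  # do not extend past the hop limit
        for other in neighbors.get(current, ()):
            if other not in dist:
                dist[other] = dist[current] + 1
                queue.append(other)
    return dist

# Hypothetical call records: a simple chain seed - A - B - C - D.
calls = [("seed", "A"), ("A", "B"), ("B", "C"), ("C", "D")]
reachable = within_hops(calls, "seed")
```

In this toy chain, A, B, and C fall within three hops of the seed, while D is four hops away and is left out.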
So how does the judge find that the NSA analysis necessarily uses the plaintiffs’ data? Here’s the key passage in the judge’s opinion:
The Government, however, describes the advantages of bulk collection in such a way as to convince me that plaintiffs’ metadata—indeed everyone’s metadata—is analyzed … whenever the Government runs a query using as the “seed” a phone number or identifier associated with a phone for which the NSA has not collected metadata (e.g., phones operating through foreign phone companies). According to the declaration submitted by NSA Director of Signals Intelligence Directorate (“SID”) Teresa H. Shea, the data collected as part of the Bulk Telephony Metadata Program—had it been in place at that time—would have allowed the NSA to determine that a September 11 hijacker living in the United States had contacted a known al Qaeda safe house in Yemen. Presumably, the NSA is not collecting metadata from whatever Yemeni telephone company was servicing that safehouse, which means that the metadata program remedies the investigative problem in Director Shea’s example only if the metadata can be queried to determine which callers in the United States had ever contacted or been contacted by the target Yemeni safehouse number. [The same point is reinforced elsewhere in the Shea declaration.] When the NSA runs such a query, its system must necessarily analyze metadata for every phone number in the database by comparing the foreign target number against all of the stored call records to determine which U.S. phones, if any, have interacted with the target number.
(pp. 39-40, emphasis in original, internal citations omitted)
The basic argument is that if the analysis needs to know whether Alice and Bob ever talked, then it must look at either Alice’s or Bob’s record. If Alice’s record is unavailable, then the only way to know whether Alice and Bob are connected is to look at Bob’s record.
(You might argue that instead of looking at Bob’s record, the analysis could instead look at some kind of precomputed index to find out the answer. But that doesn’t change anything, because the index-building process would still have to look at Bob’s record, otherwise the index couldn’t “know” whether Alice and Bob were connected. There’s no way to get the answer without looking at Bob’s record at some point.)
It follows that if you want a full list of people who talked to Alice, and you don’t have access to Alice’s record, then you have to look at every record in the database, to figure out whether that record is connected to Alice. If you fail to look at any record, then you can’t be sure that you have a complete list of Alice’s contacts.
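The argument can be sketched in a few lines, assuming a toy database that stores one record per subscriber, each listing the numbers that subscriber has called or been called by. The layout, the function name `contacts_of_external`, and the sample names are hypothetical. Alice has no record of her own, so the only way to find her contacts is to check every record that does exist.

```python
def contacts_of_external(records, target):
    """Find every subscriber whose record mentions target.

    records maps each subscriber in the database to the set of numbers
    appearing in that subscriber's call records. target has no record
    of its own (it is "external").
    """
    examined = 0
    contacts = set()
    for subscriber, numbers in records.items():
        examined += 1  # this subscriber's record has now been looked at
        if target in numbers:
            contacts.add(subscriber)
    return contacts, examined

# Hypothetical database: Alice is external, so she is not a key here.
records = {
    "bob": {"alice", "carol"},
    "carol": {"bob"},
    "dave": {"erin"},
}
contacts, examined = contacts_of_external(records, "alice")
```

Only Bob turns out to be a contact, but the count of examined records equals the size of the whole database. Skipping Dave's record, say, would leave open the possibility that Dave had talked to Alice.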
This result is actually more powerful than the judge seems to have realized. He applied this argument to the case where the seed number was external (i.e., from a carrier not providing data to the NSA). The same argument, that you must look at every record in the database to get an accurate result, applies not only to the case where the seed is an external record, but also to every case where an external record appears at any point after one hop or two hops. In such a case, the analysis would have to look at every record in the database in order to extend the results to the next hop. (As above, you could instead use an index that was built by looking at every record.)
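The point about indexes can be illustrated the same way. Below is a rough sketch of a reverse index that maps each number to the subscribers whose records mention it, using the same hypothetical one-record-per-subscriber layout. Lookups against the index are fast, but building it requires reading every record once.

```python
def build_reverse_index(records):
    """Map each number to the set of subscribers whose records mention it."""
    index = {}
    examined = 0
    for subscriber, numbers in records.items():
        examined += 1  # the builder must read this subscriber's record
        for number in numbers:
            index.setdefault(number, set()).add(subscriber)
    return index, examined

# Hypothetical database in which "alice" is an external number.
records = {
    "bob": {"alice", "carol"},
    "carol": {"bob"},
    "dave": {"erin"},
}
index, examined = build_reverse_index(records)
alice_contacts = index.get("alice", set())
```

The later lookup touches only the index, but the index exists only because every subscriber's record was examined when it was built.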
This case will come up very often. Using the judge’s very conservative calculation, there are at least 10,000 numbers within two hops of a typical seed. If even one of those 10,000 numbers is external, then the system will have to look at every record in the database to complete the three-hop analysis. It looks like this would usually be the case in practice. So the plaintiffs’ data, and your data as well, is not just used occasionally; it is probably used in almost every contact chaining calculation done by the NSA.
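A back-of-the-envelope calculation shows why. The sketch below assumes, purely for illustration, that each of the 10,000 numbers is external independently with some small probability; both the independence assumption and the sample probability are mine, not figures from the opinion.

```python
def chance_of_external(p_external, numbers_within_two_hops=10_000):
    """Probability that at least one number in the two-hop set is external,
    assuming each number is external independently with p_external."""
    return 1 - (1 - p_external) ** numbers_within_two_hops

# 0.001 (one number in a thousand) is a made-up illustrative value.
chance = chance_of_external(0.001)
```

Under these assumptions, even if only one number in a thousand were external, the chance that a two-hop set contains at least one external number would be above 99.99 percent.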
If you want to traverse a graph of contacts that are associated with a fixed point in the graph, by definition you don’t search the entire graph! I don’t find the judge’s argument convincing unless you look at the underlying technique used to store and query metadata.
As long as the entire billion dollar personal analytics industry runs amok unregulated, the NSA is a side show. They can just buy this information, and a whole lot more, from Google, Microsoft and the other Feudal Lords of the Internet who are shocked, shocked that the King’s men dare to muscle in on their territory. They own you and your digital presence. They don’t want to share, unless paid, of course, by anyone: government, political group, crime front, domestic or foreign spy agency, etc.
This argument is analogous to why Google has to crawl the entire web to calculate votes for an individual page: there’s no central registry of links.
I doubt they have to look at all calls. US phone companies probably have a cache of international calls, prioritized by investigative interest.
Since we are talking about computer science here, let’s take it a step further. The only practical reason for the NSA to have all the metadata is to increase the efficiency of their graph-traversal queries by pre-fetching all the data into a location (their servers) where it can be traversed quickly and easily.
Another way to do the query is to only request Bob’s data from the phone company once you decide Bob is interesting. But then you have the pesky problem of requesting and waiting for a judge to grant a warrant. When the judge does, you get a list of Bob’s contacts, and you have to repeat the process for each of them if you want to make a second hop. Of course you also have the risk that the judge may not issue all the warrants you want, so you might miss out on some of the links entirely. But ultimately (meaning within days) if you have good reason to be interested in Bob, you are most likely going to be able to find out whether he is connected to Alice or not.
With this approach, the NSA does not ever have to examine any of Alice’s records to determine that she is not connected to Bob. Indeed they need never know Alice even exists, since none of the data they collect in examining Bob’s contacts references her in any way. The phone company has Alice’s records as a consequence of doing business with her, but the NSA never does.
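Here is a rough sketch of that request-by-request approach. The `request_records` callable is a hypothetical stand-in for asking the phone company for one number's contacts after a warrant is granted, and the sample carrier data is invented. The function keeps a log of which records were actually requested.

```python
def chain_by_request(request_records, seed, max_hops=2):
    """Expand outward from seed, requesting one number's records at a time.

    request_records(number) is a hypothetical stand-in for a warrant-backed
    request to the carrier; it returns that number's set of contacts.
    """
    requested = []          # log of every number whose records were obtained
    seen = {seed}
    frontier = {seed}
    for _ in range(max_hops):
        next_frontier = set()
        for number in sorted(frontier):
            requested.append(number)
            for contact in request_records(number):
                if contact not in seen:
                    seen.add(contact)
                    next_frontier.add(contact)
        frontier = next_frontier
    return seen, requested

# Invented carrier data: Alice and Erin are unconnected to Bob.
carrier = {
    "bob": {"carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
    "alice": {"erin"},
    "erin": {"alice"},
}
seen, requested = chain_by_request(lambda n: carrier.get(n, set()), "bob")
```

In this toy example only Bob's and Carol's records are ever requested, and Alice never appears in anything the investigator sees. Note that the sketch assumes the seed's own records are available from the carrier, which is not the external-seed case discussed in the post.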
It’s always risky to try to guess what the authors of the Fourth Amendment would have specifically thought about a technology they never even imagined existing. But in this case I think it’s pretty easy to guess that they would have preferred local search via warrants and limited queries to pre-fetching and caching the entire graph.
Leon quoted the Supreme Court:
We are not inclined to hold that a different constitutional result is required because the telephone company has decided to automate
(page 41)
The same should hold true for the NSA. If it would be illegal for a human NSA employee to perform the intermediate steps that occur before an analyst sees the results, then the same should be true if a machine performs them.
“the index-building process would still have to look at Bob’s record” – if you could get the phone companies to provide the pre-computed index, you could theoretically compute over the graph without knowing the details. However, that would be tricky, and prone to all sorts of de-anonymizing attacks.