Two weeks ago I started a series of posts (so far: 1, 2) about how new technologies change the policy issues around government wiretapping. I argued that technology changed the policy equation in two ways: by making storage much cheaper, and by enabling fancy computerized analyses of intercepted communications.
My plan was to work my way around to a carefully constructed hypothetical that I designed to highlight these two issues – a hypothetical in which the government gathered a giant database of everybody’s phone call records and then did data mining on the database to identify suspected bad guys. I had to lay a bit more groundwork before getting to the hypothetical, but I was planning to get to it after a few more posts.
Events intervened – the “hypothetical” turned out, apparently, to be true – which makes my original plan moot. So let’s jump directly to the NSA call-database program. Today I’ll explain why it’s a perfect illustration of the policy issues in 21st century surveillance. In the next post I’ll start unpacking the larger policy issues, using the call record program as a running example.
The program illustrates the cheap-storage trend for obvious reasons: according to some sources, the NSA’s call record database is the biggest database in the world. This part of the program probably would not have been possible, within the NSA’s budget, until the last few years.
The data stored in the database is among the least sensitive (i.e., private) communications data around. This is not to say that it has no privacy value at all – all I mean is that other information, such as full contents of calls, would be much more sensitive. But even if information about who called whom is not particularly sensitive for most individual calls, the government might, in effect, make it up on volume. Modestly sensitive data, in enormous quantities, can add up to a big privacy problem – an issue that is much more important now that huge databases are feasible.
The other relevant technology trend is the use of automated algorithms, rather than people, to analyze communications traffic. With so many call records, and relatively few analysts, simple arithmetic dictates that the overwhelming majority of call records will never be seen by a human analyst. It’s all about what the automated algorithms do, and which information gets forwarded to a person.
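To make that concrete, here is a toy sketch of an automated first pass over call records. Everything in it is an assumption for illustration: the record format, the watchlist, the degree threshold, and the flagging rules. It is not a claim about what the NSA actually runs.

```python
from collections import Counter

def flag_for_analyst(records, watchlist, degree_threshold=50):
    """Pass only a tiny fraction of call records on to a human analyst.

    records: list of (caller, callee, timestamp) triples (hypothetical format).
    """
    out_degree = Counter(caller for caller, callee, ts in records)
    flagged = []
    for caller, callee, ts in records:
        # Rule 1 (assumed): either endpoint is already a person of interest.
        # Rule 2 (assumed): the caller contacts an unusually large number of people.
        if caller in watchlist or callee in watchlist \
                or out_degree[caller] > degree_threshold:
            flagged.append((caller, callee, ts))
    return flagged
```

Whatever the real rules look like, the point stands: the interesting policy questions live inside functions like this one, not in the analysts’ queue.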
I’ll start unpacking these issues in the next post, starting with the storage question. In the meantime, let me add my small voice to the public complaints about the NSA call record program. They ruined my beautiful hypothetical!
I’m reading my way through the series backwards. There’s something I expected to see in this installment that isn’t there: within the last few months I read a reference on the net (e-mail list or web) to a study in which someone gimmicked a hundred cellphones to keep a list of all calls. When they got the phones back from the hundred volunteers, traffic analysis quickly identified which contacts were friends and which were only colleagues. The telco billing records the NSA is buying support the same traffic analysis. I think there is a heavy tension between a database of all affinity groupings among the US population existing in government hands and the freedom of association clause of the First Amendment.
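A toy version of that traffic analysis is easy to sketch. Assume each phone’s log is just (contact, timestamp) pairs, and adopt the arbitrary rule that friends are the people you call on evenings and weekends; the 18:00/08:00 cutoffs are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

def classify_contacts(call_log):
    """call_log: list of (contact, datetime) pairs for one phone."""
    counts = defaultdict(lambda: [0, 0])  # contact -> [off_hours, business_hours]
    for contact, when in call_log:
        off_hours = when.weekday() >= 5 or when.hour >= 18 or when.hour < 8
        counts[contact][0 if off_hours else 1] += 1
    return {c: ("friend" if off > biz else "colleague")
            for c, (off, biz) in counts.items()}

calls = [("alice", datetime(2006, 5, 13, 20, 0)),   # Saturday evening
         ("bob",   datetime(2006, 5, 15, 10, 30))]  # Monday morning
print(classify_contacts(calls))  # {'alice': 'friend', 'bob': 'colleague'}
```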
“…..also suspect that they are being targeted by the NSA, and we pick 100 random telephone numbers from the phone book (or better still, 100 random people with middle-eastern sounding names). We then all make phone calls to those numbers,…”
And also, just for good measure, you call some numbers in North Korea along with those 100 numbers…
Which leads to the hole in the NSA system:
The knowledge of its existence defeats its stated purpose. The observed, knowing he is observed, changes his behavior…
Here’s a scary thought…
Suppose I suspect that I am being targeted by the NSA. I get a bunch of my friends who also suspect that they are being targeted by the NSA, and we pick 100 random telephone numbers from the phone book (or better still, 100 random people with middle-eastern sounding names). We then all make phone calls to those numbers, and perhaps leave messages that would make them wish to call us back. (For example, I leave a message like, “My kid wants a play date with your kid. Could you please call me back at this number?”) We have now established a calling pattern that might make these 100 random people look like hubs for me and my suspicious friends. As a result, we have added 100 new people for the NSA to investigate more closely. (Presumably, after much misery, these 100 victims will be found to have no real connection to me and my friends.) But meanwhile we will have diverted lots of NSA resources from the investigation of us.
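As a toy simulation (all numbers made up for illustration), the scheme really does manufacture apparent hubs:

```python
import random
from collections import Counter

random.seed(0)
population = ["555-%04d" % n for n in range(2000)]
suspects = set(population[:10])              # the "suspicious friends"
decoys = random.sample(population[10:], 100)  # 100 random victims

# Background traffic plus the injected calls to the 100 decoy numbers.
calls = [(random.choice(population), random.choice(population))
         for _ in range(20000)]
calls += [(s, d) for s in suspects for d in decoys]

calls_from_suspects = Counter(callee for caller, callee in calls
                              if caller in suspects)
# Every decoy now receives calls from all ten "suspects" -- a pattern a
# naive hub detector would rank far above ordinary numbers.
print(sum(1 for d in decoys if calls_from_suspects[d] >= 10))  # prints 100
```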
Denial of Service attack?
I know that data mining of contacts might be used to find the “criminal” leaks that provide “irresponsible” journalists with information about “state secret” illegal government projects. I don’t feel comfortable in a country where the government is spying on its citizens in the way the Stasi used to do.
Mark:
If you only go back and work from identified terrorists, you will only have a model of terrorists you’ve identified. That may buy you some knowledge of what phones some of the associates used at some time in the past. And if you’re doing that, you can just ask the phone companies for their call records (typically retained for quite some time) after the fact. That’s how we traced the 9/11 crew, according to reports.
I don’t think the real-time trolling will be purely for particular shapes of graphs, but rather for graph shapes in conjunction with other things. Not that many church phone trees (as far as I know) make extensive use of disposable cellphones or payphones, for example. (And mind you, I’m not suggesting all this is such a good idea, just wondering how it might turn out to work, if it worked.)
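One way to picture “graph shapes in conjunction with other things” is a scoring function that combines both. The hub threshold, the phone-type feature, and the weights below are invented purely for illustration:

```python
def score_cluster(members, degree, phone_type):
    """Combine a graph-shape cue with side information about the phones."""
    hub_like = max(degree[m] for m in members) >= 20        # hub-and-spoke shape
    disposable = sum(phone_type.get(m) in ("payphone", "prepaid")
                     for m in members) / len(members)        # fraction of throwaways
    # A church phone tree scores low on the second factor even if its
    # graph shape looks hub-like.
    return (1.0 if hub_like else 0.2) * (0.5 + disposable)
```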
I hope you still discuss the general case, and the respects in which the present case fits (or doesn’t) the general case. I look forward to your analysis, and would hate for it to be written off as politically motivated or relevant only to the present controversy.
paul: I think it highly unlikely that the entire data set is ‘trolled’ automatically looking for suspicious behavior. There would be far too many false positives from reading clubs, church “phone trees”, and other non-criminal conspiracies. 😉
It is more likely that the data is accumulated so that when a suspicious individual is identified at some later date, the NSA can backtrack through his previous call data and apply a more targeted analysis.
Real-time analysis would also have to be based on some seed set of “persons of interest” to have any value.
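A minimal sketch of that backtracking, assuming retained (caller, callee) records and an arbitrary two-hop limit, is breadth-first contact chaining from a seed number:

```python
from collections import deque

def contact_chain(records, seed, max_hops=2):
    """records: iterable of (caller, callee) pairs from retained call data."""
    neighbors = {}
    for a, b in records:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)   # treat calls as undirected
    seen, queue = {seed: 0}, deque([seed])
    while queue:
        number = queue.popleft()
        if seen[number] == max_hops:
            continue
        for nxt in neighbors.get(number, ()):
            if nxt not in seen:
                seen[nxt] = seen[number] + 1
                queue.append(nxt)
    return seen  # number -> hops from the seed
```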
This podcast seems to be relevant: http://www.itconversations.com/shows/detail812.html
It would be interesting to find out how one develops training sets for the automated classifiers. There are obvious models of social networks and terrorist behavior that you could use, but it’s not clear how well those correlate with the real thing. Those models can be updated by backtracking call data every time you catch or identify a terrorist, but of course that in turn only tells you about the class of terrorists you can catch and identify by other means…
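Here is one way such a training set might be assembled. The features (call volume, share of international calls, distinct contacts) are invented for illustration, and the labels carry exactly the bias described above: they cover only terrorists caught by other means.

```python
def features(call_records, number):
    """Feature vector for one phone number -- the features are invented."""
    calls = [(a, b) for a, b in call_records if a == number]
    intl = sum(1 for _, b in calls if b.startswith("+"))
    return [len(calls),                      # call volume
            intl / max(len(calls), 1),       # share of international calls
            len({b for _, b in calls})]      # distinct contacts

def build_training_set(call_records, known_terrorists, sampled_ordinary):
    X, y = [], []
    for num in known_terrorists:             # positives: already identified
        X.append(features(call_records, num)); y.append(1)
    for num in sampled_ordinary:             # negatives: everyone else, sampled
        X.append(features(call_records, num)); y.append(0)
    return X, y                              # ready for any off-the-shelf classifier
```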
One can “normalize” out the time component… If you map (phone number, date) to (contract id), two different people who used the same phone number in different years can be distinguished. It isn’t that difficult to trace a family through different phone numbers this way.
Such normalization is trivial in processing capacity compared to the analysis of the data.
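For instance, assuming the telco’s subscription history is available as a per-number list of (start date, contract id) pairs, the lookup is a simple binary search:

```python
import bisect

def make_resolver(subscriptions):
    """subscriptions: {number: sorted list of (start_date, contract_id)}.

    Resolves (phone number, date) to the contract active on that date.
    The data layout is an assumption for illustration.
    """
    def resolve(number, date):
        history = subscriptions.get(number, [])
        i = bisect.bisect_right(history, (date, chr(0x10FFFF))) - 1
        return history[i][1] if i >= 0 else None
    return resolve

resolve = make_resolver({"555-0100": [("2004-01-01", "C17"),
                                      ("2006-03-01", "C42")]})
print(resolve("555-0100", "2005-06-15"))  # C17 -- the earlier subscriber
```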
One issue that seems to be overlooked, and would be interesting to evaluate in your discussion, is the time sensitivity of the data. Today’s data is not the same as yesterday’s, nor will tomorrow’s data necessarily be the same as today’s. People change phone numbers for a variety of reasons, especially with cell phones. As you note, we will be dependent on automated algorithms to uncover linkages. Given the time sensitivity of the data, the dataset will not be static but dynamic. Therefore, I would assume that any automated algorithm that takes time into account will be significantly more complex and will consume significantly more computing power. I would like to see the time sensitivity issue brought into the discussion and evaluated.
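One sketch of how an analysis might respect that time sensitivity: build a separate call graph per time window and compare adjacent windows, rather than one static graph. The window size and record format here are assumptions for illustration:

```python
from collections import defaultdict

def windowed_graphs(records, window_days=30):
    """records: (caller, callee, day_number) triples -> one edge set per window."""
    graphs = defaultdict(set)
    for a, b, day in records:
        graphs[day // window_days].add((a, b))
    return graphs

def new_links(graphs, window):
    """Edges seen in this window but not in the previous one."""
    return graphs.get(window, set()) - graphs.get(window - 1, set())
```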