One of the notable claims we have heard, in light of the Verizon / PRISM revelations, is that data extraction measures are calibrated to make sure that 51% or more of affected individuals are non-U.S. persons. As a U.S. person, I don’t find this at all reassuring. To see why, let’s think about the underlying statistics.
As an example, consider Facebook, which appears to have about 1 billion users worldwide, of which roughly 160 million are in the U.S. and the other 840 million are foreign. If you collect data about every single Facebook user, then you are getting 84% non-U.S. records. So even a “collect all data” procedure meets the 51% foreign test—despite doing nothing to shield Americans from collection.
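The arithmetic is easy to check. A minimal sketch in Python, using the post's round estimates of the user counts:

```python
# Rough figures from the post: ~1 billion Facebook users worldwide,
# ~160 million of them in the U.S.
total_users = 1_000_000_000
us_users = 160_000_000
foreign_users = total_users - us_users  # 840 million

# "Collect everything about everybody" still passes a 51%-foreign test.
foreign_share = foreign_users / total_users
print(f"{foreign_share:.0%}")  # 84%
assert foreign_share > 0.51
```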
But let’s assume that intelligence analysts can’t just ask for everything about everybody, but instead are required to use some kind of selector to narrow in on a small number of records. What kinds of selectors will meet the 51% foreign test?
One selector that works is just to pick a record at random. That will return a foreign record with 84% probability (because 84% of records are foreign). More generally, a selector that is independent of nationality will easily meet the 51% standard. If a selector matches a fraction F of U.S. persons and also matches the same fraction F of non-U.S. persons, then its output will again be 84% foreign.
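The reason a nationality-independent selector inherits the population's mix is that the match fraction F cancels out of the ratio. A quick check, for several arbitrary values of F:

```python
# If a selector matches the same fraction F of U.S. and non-U.S. users,
# F cancels and the output mix equals the population mix: 84% foreign.
foreign_users = 840_000_000
us_users = 160_000_000

for F in (0.5, 0.01, 1e-6):  # any nationality-independent match rate
    foreign_share = (F * foreign_users) / (F * foreign_users + F * us_users)
    assert abs(foreign_share - 0.84) < 1e-9
```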
Even a selector that triggers preferentially on U.S. persons can meet the 51% test. Suppose a selector matches one foreign record out of every 10 million, and one U.S. record out of every 2 million. That’s biased toward selecting U.S. records, by a factor of five. Yet the selector will match 84 foreign records and 80 U.S. records, which is 51.2% foreign. So even a selector that is strongly biased toward selecting U.S. records can meet the 51% foreign test.
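Working through the numbers from that example:

```python
# A selector biased 5x toward U.S. persons still passes the 51% test,
# using the post's illustrative match rates and population counts.
foreign_users = 840_000_000
us_users = 160_000_000

foreign_hits = foreign_users / 10_000_000  # one foreign record per 10 million -> 84
us_hits = us_users / 2_000_000             # one U.S. record per 2 million -> 80

foreign_fraction = foreign_hits / (foreign_hits + us_hits)
print(round(foreign_fraction, 3))  # 0.512, i.e. 51.2% foreign
```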
This is just basic statistics. If we’re selecting from an underlying population that is biased in one direction, then the result will be biased in the same direction, unless the selection criteria are biased more strongly in the opposite direction. In a user population that is mostly foreign—which is the case for most or all of the big Internet services—a “51% foreign” test is not at all the same as a “not targeted to U.S. persons” test.
[UPDATE (June 10, 2013): In the comments, Steve Checkoway suggests another interpretation of the 51% rule: for each person returned by a query, the analyst must have 51% confidence that that person is non-U.S.—and I explain why I think that doesn’t help. There are different ways to interpret a 51% rule, but I don’t think any of them offers much comfort to U.S. persons.]
So, if I’m a foreigner, it’s OK for the NSA to search through my e-mails? I don’t think so.
Ed, here is another interpretation at a quick blush from your article without attending to the actual source of the 51% statistic.
Could it mean they have reasonable assurance that 51% of the individuals are foreign? Meaning the other 49% may be US citizens and they don’t care.
After all, what was reported was: they are supposedly tracking people outside of the U.S. who have contact inside of the U.S.
At an exact 1-to-1 relationship that would be 50/50. They want to be sure they are targeting more non-US citizens than US citizens, so they don’t want an exact 1-to-1 relationship; they hope to catch the extra fraction of cases where one non-US citizen outside of the US communicates with one non-US citizen inside the US.
And that is how I read the statistic. And to me that is very alarming, because it means anyone who has any family that for whatever reason is outside of the US [be they US citizens traveling abroad or not] is still being targeted.
I think a far better metric would be the total number of different people in the US that have had data collected on them, possibly broken out by various data categories, like phone records, emails, Facebook, recordings of actual phone calls, then convert that to a % of the US population. That figure would be far harder to mask with the kind of distortions this article describes. And by collected data I mean does the NSA have the data on their own machines, and thus no longer have to go through the original providers to access it, whether they have accessed it yet or not.
Is it really necessary to keep pointing out that the data provided to NSA by the Internet companies via the PRISM mechanism is not a mass transfer of all their data through a back door but selected batches of data requested by specific warrants — which might (I’m guessing here) be requests like, give me everything coming out of Iran, or give me everything you can identify as coming out of the Russian Mission to the UN. The foreignness would be largely established before any transfer. The “selectors” would then work to confirm that before an NSA employee puts eyes on it. Then, of course, if it turns out to be an answer to an email John Burke sent to the Russian Mission, NSA would have to destroy it or promptly seek a FISA warrant to pursue it.
Glenn Greenwald, The Guardian and the Post have caused a huge amount of mischief by misrepresenting this program.
The analysis in this post applies whether the request returns a large number of records or only a few records. You seem to be assuming that the only requests are for data that is nearly 100% certain to be non-US. Others are saying that a 51% principle applies. My point is that a 51% rule isn’t enough to protect us.
Do you have any factual basis for the assumption that records are requested only with nearly 100% assurance that they are non-US? Do you believe the law requires near-100% assurance?
What if someone communicates by posting a “coded” message on a blog that his counterparty understands? The counterparty could reply on another blog. The only measurable event produced by the counterparties would be page views. Would an intelligence agency be able to collect all page views? What if the message exchange only happened via search engines, e.g., the message was read by looking at the thumbnail returned from the search engine’s database? Ergo no direct page view. What if a foreign power owned the search engine and knew where to look for coded messages? How could even the NSA track that message exchange?
Which brings me to my point. Why bother collecting email or telephone records? It only impinges on honest people not on committed terrorists.
Ed,
I don’t think your reading of the 51% is correct. From the Post article, ‘Analysts who use the system from a Web portal at Fort Meade key in “selectors,” or search terms, that are designed to produce at least 51 percent confidence in a target’s “foreignness.” ‘
To me, this seems to be saying something different from 51% of the affected persons are non-US persons, namely that each individual person has a 51% chance of being outside the US. See Orin Kerr’s comment here, http://www.volokh.com/2013/06/07/is-the-prism-surveillance-program-legal/#comment-921894594.
Steve,
Saying that you need 51% confidence that each individual is non-US doesn’t really fix the problem. It’s true that that would break my preferential-triggering example as written. But if you take my preferential-triggering example, and divide all of the probabilities by 164, you get a selector that picks (in expectation) 0.512 non-US persons, and 0.488 US persons, which meets the 51% test.
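To spell out the scaled-down version of the example in code (same population counts and match rates as before, each rate divided by 164):

```python
# Dividing both match rates by 164 yields a selector that is expected to
# return about one record in total, yet each returned record is still
# about 51.2% likely to be foreign -- meeting the per-person 51% test.
foreign_users = 840_000_000
us_users = 160_000_000
scale = 164

expected_foreign = foreign_users / 10_000_000 / scale  # ~0.512
expected_us = us_users / 2_000_000 / scale             # ~0.488

print(round(expected_foreign, 3), round(expected_us, 3))  # 0.512 0.488
```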
Another way to interpret the 51% test is that, having selected the records of a person, there is then an obligation to look at all of the available evidence to determine the likelihood that that person is non-US. Suppose you apply that version of the 51% test, and find that the single person selected is non-US with 51.2% probability. It is still the case that US people are disproportionately likely to have been selected—or in other words that the probability that a randomly chosen US person in the sample was selected is five times as large as the probability that a randomly chosen non-US person in the sample was selected.
I can’t see how any kind of 51% test can guarantee US persons even equal treatment when the underlying population is strongly tilted toward non-US.
Without knowing just how those criteria are put together, you’ve also got a problem. “Designed to produce confidence” is way, way different from “actually yield”, much less what happens when you do a bunch of independent searches and look at the intersections.