One of the notable claims we have heard, in light of the Verizon / PRISM revelations, is that data extraction measures are calibrated to make sure that 51% or more of affected individuals are non-U.S. persons. As a U.S. person, I don’t find this at all reassuring. To see why, let’s think about the underlying statistics.
As an example, consider Facebook, which appears to have about 1 billion users worldwide, of which roughly 160 million are in the U.S. and the other 840 million are foreign. If you collect data about every single Facebook user, then you are getting 84% non-U.S. records. So even a “collect all data” procedure meets the 51% foreign test, despite doing nothing to shield Americans from collection.
But let’s assume that intelligence analysts can’t just ask for everything about everybody, but instead are required to use some kind of selector to narrow in on a small number of records. What kinds of selectors will meet the 51% foreign test?
One selector that works is just to pick a record at random. That will return a foreign record with 84% probability (because 84% of records are foreign). More generally, a selector that is independent of nationality will easily meet the 51% standard. If a selector matches a fraction F of U.S. persons and also matches the same fraction F of non-U.S. persons, then its output will again be 84% foreign.
Even a selector that triggers preferentially on U.S. persons can meet the 51% test. Suppose a selector matches one foreign record out of every 10 million, and one U.S. record out of every 2 million. That’s biased toward selecting U.S. records by a factor of five. Yet, applied to the Facebook numbers above, the selector will match 84 foreign records (840 million divided by 10 million) and 80 U.S. records (160 million divided by 2 million), which is 51.2% foreign. So even a selector that is strongly biased toward selecting U.S. records can meet the 51% foreign test.
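To make the arithmetic easy to check, here is a minimal Python sketch that computes the foreign share of a selector's output. The user counts are the rough Facebook figures used above; the function name `foreign_share` and its parameters are my own illustrative choices, not anything drawn from an actual program or rule.

```python
# Rough Facebook figures from the text (approximations, not official counts).
US_USERS = 160_000_000
FOREIGN_USERS = 840_000_000


def foreign_share(us_rate, foreign_rate,
                  us_users=US_USERS, foreign_users=FOREIGN_USERS):
    """Fraction of matched records that are foreign.

    us_rate and foreign_rate are the (hypothetical) fractions of U.S. and
    foreign records that a selector matches.
    """
    us_matches = us_users * us_rate
    foreign_matches = foreign_users * foreign_rate
    return foreign_matches / (us_matches + foreign_matches)


# "Collect everything": every record matches.
collect_all = foreign_share(1.0, 1.0)

# A selector independent of nationality: same match rate for both groups.
independent = foreign_share(1e-6, 1e-6)

# A selector five times more likely to match a U.S. record.
us_biased = foreign_share(1 / 2_000_000, 1 / 10_000_000)

print(f"collect all:  {collect_all:.1%}")   # 84.0%
print(f"independent:  {independent:.1%}")   # 84.0%
print(f"U.S.-biased:  {us_biased:.1%}")     # 51.2%
```

All three selectors clear a 51% threshold, including the one that favors U.S. records five to one.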
This is just basic statistics. If we’re selecting from an underlying population that is biased in one direction, then the result will be biased in the same direction, unless the selection criteria are biased more strongly in the opposite direction. In a user population that is mostly foreign (which is the case for most or all of the big Internet services), a “51% foreign” test is not at all the same as a “not targeted to U.S. persons” test.
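One way to see how much bias the test tolerates is to solve for the break-even point. This is a sketch under the same rough Facebook numbers; the symbols are my own notation: $N_U$ and $N_F$ are the U.S. and foreign user counts, and $f_U$ and $f_F$ are the fractions of each group that a selector matches.

```latex
\[
\frac{N_F f_F}{N_F f_F + N_U f_U} \ge 0.51
\;\Longleftrightarrow\;
0.49\, N_F f_F \ge 0.51\, N_U f_U
\;\Longleftrightarrow\;
\frac{f_U}{f_F} \le \frac{0.49\, N_F}{0.51\, N_U}
\]
\[
\frac{0.49 \times 840}{0.51 \times 160} = \frac{411.6}{81.6} \approx 5.04
\]
```

Under these assumptions, a selector can be up to about five times more likely to match a U.S. record than a foreign one and still pass the 51% foreign test.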
[UPDATE (June 10, 2013): In the comments, Steve Checkoway suggests another interpretation of the 51% rule: for each person returned by a query, the analyst must have 51% confidence that that person is a non-U.S. person, and I explain why I think that doesn’t help. There are different ways to interpret a 51% rule, but I don’t think any of them offers much comfort to U.S. persons.]