March 29, 2024

Twenty-First Century Wiretapping: False Positives

Lately I’ve been writing about the policy issues surrounding government wiretapping programs that algorithmically analyze large amounts of communication data to identify messages to be shown to human analysts. (Past posts in the series: 1; 2; 3; 4; 5; 6; 7.) One of the most frequent arguments against such programs is that there will be too many false positives – too many innocent conversations misidentified as suspicious.

Suppose we have an algorithm that looks at a set of intercepted messages and classifies each message as either suspicious or innocuous. Let’s assume that every message has a true state that is either criminal (i.e., actually part of a criminal or terrorist conspiracy) or innocent. The problem is that the true state is not known. A perfect, but unattainable, classifier would label a message as suspicious if and only if it was criminal. In practice a classifier will make false positive errors (mistakenly classifying an innocent message as suspicious) and false negative errors (mistakenly classifying a criminal message as innocuous).

To illustrate the false positive problem, let’s do an example. Suppose we intercept a million messages, of which ten are criminal. And suppose that the classifier correctly labels 99.9% of the innocent messages. This means that 1000 innocent messages (0.1% of one million) will be misclassified as suspicious. All told, there will be 1010 suspicious messages, of which only ten – about 1% – will actually be criminal. The vast majority of messages labeled as suspicious will actually be innocent. And if the classifier is less accurate on innocent messages, the imbalance will be even more extreme.
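
To make the arithmetic concrete, here is a minimal sketch of the same example in Python. The numbers are the illustrative ones above, not measurements of any real system, and the assumption that every criminal message gets flagged is mine, made for simplicity.

```python
# A minimal sketch of the example: one million intercepted messages, ten of
# them criminal, and a classifier that correctly labels 99.9% of the innocent
# ones. For simplicity, assume every criminal message is flagged. These are
# illustrative numbers, not measurements of any real system.

total_messages = 1_000_000
criminal_messages = 10
innocent_messages = total_messages - criminal_messages

false_positive_rate = 0.001   # 0.1% of innocent messages mislabeled as suspicious
false_positives = innocent_messages * false_positive_rate   # about 1,000
true_positives = criminal_messages                          # all 10, by assumption

flagged = false_positives + true_positives                  # about 1,010
share_criminal = true_positives / flagged                   # roughly 1%

print(f"{flagged:.0f} flagged, {true_positives} criminal ({share_criminal:.1%})")
```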

This argument has some power, but I don’t think it’s fatal to the idea of algorithmically classifying intercepts. I say this for three reasons.

First, even if the majority of labeled-as-suspicious messages are innocent, this doesn’t necessarily mean that listening to those messages is unjustified. Letting the police listen to, say, ten innocent conversations is a good tradeoff if the eleventh conversation is a criminal one whose interception can stop a serious crime. (I’m assuming that the ten innocent conversations are chosen by some known, well-intentioned algorithmic process, rather than being chosen by potentially corrupt government agents.) This only goes so far, of course – if there are too many innocent conversations or the crime is not very serious, then this type of wiretapping will not be justified. My point is merely that it’s not enough to argue that most of the labeled-as-suspicious messages will be innocent.

Second, we can learn by experience what the false positive rate is. By monitoring the operation of the system, we can learn how many messages are labeled as suspicious and how many of those are actually innocent. If there is a warrant for the wiretapping (as I have argued there should be), the warrant can require this sort of monitoring, and can require the wiretapping to be stopped or narrowed if the false positive rate is too high.

Third, classification algorithms have (or can be made to have) an adjustable sensitivity setting. Think of it as a control knob that can be moved continuously between two extremes, where one extreme is labeled “avoid false positives” and the other is labeled “avoid false negatives”. Adjusting the knob trades off one kind of error for the other.

We can always make the false positive rate as low as we like, by turning the knob far enough toward “avoid false positives”. Doing this has a price, because turning the knob in that direction also increases the number of false negatives, that is, it causes some criminal messages to be missed. If we turn the knob all the way to the “avoid false positives” end, then there will be no false positives at all, but there might be many false negatives. Indeed, we might find that when the knob is turned to that end, all messages, whether criminal or not, are classified as innocuous.
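
To illustrate the knob, here is a rough sketch using synthetic suspicion scores. The score distributions are made up purely for illustration, since no particular classifier is specified; the point is only that raising the threshold trades false negatives for false positives.

```python
import random

# Rough illustration of the sensitivity "knob": imagine a classifier that
# assigns each message a suspicion score between 0 and 1, and we label a
# message as suspicious when its score clears a threshold. The scores below
# are synthetic, drawn from made-up distributions purely for illustration.

random.seed(0)
innocent_scores = [random.betavariate(1, 9) for _ in range(100_000)]  # mostly low
criminal_scores = [random.betavariate(5, 2) for _ in range(10)]       # mostly high

for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    false_positives = sum(s >= threshold for s in innocent_scores)
    missed_criminal = sum(s < threshold for s in criminal_scores)
    print(f"threshold {threshold:.1f}: {false_positives:6d} innocent messages flagged, "
          f"{missed_criminal} of 10 criminal messages missed")
```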

So the question is not whether we can reduce false positives – we know we can do that – but whether there is anywhere we can set the knob that gives us an acceptably low false positive rate yet still manages to flag some messages that are criminal.

Whether there is an acceptable setting depends on the details of the classification algorithm. If you forced me to guess, I’d say that for algorithms based on today’s voice recognition or speech transcription technology, there probably isn’t an acceptable setting – to catch any appreciable number of criminal conversations, we’d have to accept huge numbers of false positives. But I’m not certain of that result, and it could change as the algorithms get better.

The most important thing to say about this is that it’s an empirical question, which means that it’s possible to gather evidence to learn whether a particular algorithm offers an acceptable tradeoff. For example, if we had a candidate classification algorithm, we could run it on a large number of real-world messages and, without recording any of those messages, simply count how many messages the algorithm would have labeled as suspicious. If that number were huge, we would know we had a false positive problem. We could do this for different settings of the knob, to see where the knob had to be set to get an acceptable false positive rate. Then we could apply the algorithm with that knob setting to a predetermined set of known-to-be-criminal messages, to see how many it flagged.
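
Here is a sketch of what such an audit could look like, assuming only a hypothetical scoring classifier; nothing but aggregate counts is kept, and no message content is stored.

```python
def count_flagged(score, messages, thresholds):
    """For each knob setting, count how many messages would have been labeled
    suspicious. Only aggregate counts are kept; no message content is recorded."""
    return {t: sum(score(m) >= t for m in messages) for t in thresholds}

def count_detected(score, known_criminal_messages, threshold):
    """At a chosen knob setting, count how many known-to-be-criminal messages
    the classifier would have flagged."""
    return sum(score(m) >= threshold for m in known_criminal_messages)

# Hypothetical usage: `score` is whatever classifier is being evaluated,
# `intercepts` is a large sample of real-world messages, and `known_criminal`
# is a predetermined test set of criminal messages.
#
# load = count_flagged(score, intercepts, thresholds=(0.5, 0.7, 0.9))
# hits = count_detected(score, known_criminal, threshold=0.9)
```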

If governments are using algorithmic classifiers – and the U.S. government may be doing so – then they can do these types of experiments. Perhaps they have. It doesn’t seem too much to ask for them to report on their false positive rates.

Comments

  1. Here’s the question: why is it ok for a computer program to listen to our phone calls, but the government agents have to get a warrant first? If we trust a computer with our phone calls, why don’t we trust a government agent? If we don’t trust the government agents, why do we trust the computer?

  2. Steve R. says

    The theme of this column concerns false positives, but the potential for false negatives is also noted. Since the next column by Ed went on to talk about privacy, I will take the opportunity to discuss false negatives, which in many ways may have even more serious consequences. A false positive, in many cases, can be “corrected” through investigation. For example, Mr. Doe can be thrown in jail for a few days for supposed terrorist remarks, but after his discussion is investigated it is determined that his remarks had nothing to do with terrorism; he is released with a “sorry about that, have a nice day”.

    However, in the case of a false negative, heads can roll and we may have the passage of bad legislation. Heads will roll in response to a “witch hunt” to find blame (irrespective of actual guilt) for why “obvious” signs of the terrorist actions were overlooked. To “prevent” future failures to identify obvious signs, we end up with new onerous laws that move us closer to becoming a police state. Unlike the example of Mr. Doe above, a false negative can lead to an extended period in jail, irrespective of actual guilt, for failing to recognize the “obvious”, and can result in ever more draconian “corrective” laws.

  3. Mathfox: In my example, I think the implementation would be as you suggest for practical reasons. And there is no reason that FISA cannot be involved to approve a warrant for the specific type of search after the fact.

    I gave one simple example. Multivariate classification under many circumstances can make today’s technology useful. Humans, coupled with a 21st century FISA law, can provide oversight for this type of activity.

    The fear shouldn’t be of the technology itself; it should be that our representatives will abuse it for their own ends. I’m not writing a new age Federalist paper, but Federalist #51 comes to mind as a basic method to continue to guarantee our rights.

  4. Jim Lyon says

    Much of the discussion here has addressed the question of whether the false positive rate can be made low enough for the supposed benefits of a monitoring program to exceed the costs. In theory, the benefits of the program accrue to society at large, and the costs are borne by society at large.

    However, it’s also important to look at things from the motivation of the organization operating the program. They get a benefit for each terrorist identified (after all, that’s the whole point), but to them the cost is just an externality, and likely to be ignored. In short, their motivation is to get results, and everything else be damned.

    One could perhaps cure this by internalizing the externality. Suppose the department in question were required, for each phone call listened to by a human, to pay $10 to each participant if the call were not determined to be part of a terrorist conspiracy. This in and of itself could be a counterbalance, and give the operator of the program an incentive to minimize the false positive rate.

    Imagine receiving a letter saying “Dear Mr. Felton, we monitored your conversation of June 8, 2:43PM with Sally Smith. Since it wasn’t part of a terrorist conspiracy, we’ve enclosed $10 in compensation.” Among other things, such letters would go a long way toward moving the discussion from the theoretical to the personal.

    Given that the benefit to the monitoring organization is largely reputational, and that these letters cost reputation, it’s possible that the letter without the money might be sufficient deterrent. (But I doubt it: many organizations have succeeded at tuning required notifications to a level that won’t attract the recipient’s attention.)

  5. Of course multivariate classification buys you some improvement in false positives. Is it enough? Maybe, maybe not. You still need a lot of nines, and you need to be sure that the people following up on the conversations flagged by your classifier don’t read things in because they’re so sure the classifier works.

    Worse yet, multivariate classification increases not only random false negatives but also the possibility of systemic false negatives. Let your target change one characteristic (perhaps by using a courier to make a phone call or send email somewhere else) and until you catch on to the change you will be filtering out 100% of the messages you want to find. Since terrorist organizations don’t need to have a whole lot of phone or email conversations to carry out their plans, those aren’t good odds.

  6. Stephen Purpura made the very interesting comment that looking at additional data associated with the call would allow the government to make a better classification of calls into “suspect” and “innocent” categories.
    Why wouldn’t the government put a filter before the voice recognition filter and perform selective wiretapping? It would reduce the false positives by an order of magnitude, reduce the privacy problems by the same amount and, provided that the CIA and other intelligence agencies can provide reliable pointers, generate nearly the same set of true positives.
    Filtering all Chicago hotels on calls containing the word Semtex is less invasive than doing the filtering for all US calls.

  7. “Whether there is an acceptable setting depends on the details of the classification algorithm. If you forced me to guess, I’d say that for algorithms based on today’s voice recognition or speech transcription technology, there probably isn’t an acceptable setting — to catch any appreciable number of criminal conversations, we’d have to accept huge numbers of false positives. But I’m not certain of that result, and it could change as the algorithms get better.”

    I think you are considering a classification system which only uses a single set of features — those from the conversation. Classification systems get much more interesting results when you combine endogenous and exogenous features to make decisions.

    Let me give an example based on real life crime fighting.

    On the U.S./Mexico border, DEA agents try to catch mules running drugs across the border. They have learned that a scout car crosses the border typically 15 – 30 minutes ahead of the mule car. Under certain conditions, the scout car may make a phone call to someone else in Mexico to alert the mule either to cross or not to cross. The scout car is sometimes identifiable because it is owned or driven by someone with a criminal history.

    In this case, it would be useful to track calls (potentially originating from or going to a foreign national) that originated in a specific geographic area. So an additional “feature” input to the classification engine could be geographic location or other variables related to the situation. Such features can dramatically increase the probability of matching successfully, even with today’s technology.

  8. “Letting the police listen to, say, ten innocent conversations is a good tradeoff if the eleventh conversation is a criminal one whose interception can stop a serious crime.”

    there are four problems here.

    first, there’s an assumption that “letting the police listen” improves the performance of the classifier, simply because they are “real people” instead of an algorithm. they are part of a larger process, if not a larger algorithm, and that process itself is subject to a false alarm rate as well as a false negative rate.

    certainly, if independent classifiers are used in series, it is possible to reduce false positives ALTHOUGH NOT FALSE NEGATIVES. (they have been screened out by the first stage.) the key is statistical INDEPENDENCE. would the criteria applied by “listening in” be independent of the criteria used to screen? who can tell with it all being secret? but it’s hard to believe they are.

    second, the proposal only works if the classifier does indeed reduce false positives to 0.1%. suppose it does not. suppose the false positive rate is 10% or even 15%? ESM sets (hostile radar detectors) for military aircraft rarely have false positive rates better than 10%.

    third, there is a large amount of case law concerned precisely with the problem of opportunistic evidence. that is, suppose the screen is an anti-terrorist screen which pulls a large number of false positives. suppose the content of one of the false positives is not anything at all terrorist related, but it MIGHT be an indication of some other crime. are the authorities justified in acting upon that? if so, why not subject ALL conversations at ALL times to such consideration? BECAUSE we have the (inconvenient to some) idea of due process here, and this is not it.

    fourth, PROPERLY setting false positive and false negative thresholds is not merely a matter of technology or “what’s true”. truth data are, as in this case, often hard to come by. the thresholds are chosen based upon estimates of the cost of making a false positive error versus a false negative error. i think it’s fair to claim that those costs involve a good deal more than merely administrative, technical, and enforcement costs. yet, because these are devised in secret, there is no public input as to where these thresholds should be set, not even of representatives elected by the people to weigh in on governmental policy. i’d say because the program is secret it is inherently unfair, undemocratic, and dangerous.

  9. Thanks for directly responding to the Bayesian criticism, but I still think you’ve missed an important point.

    Even in your case, you only get 10/1010 bad calls, which sounds like a lot of privacy violations for a little information (a la the Arar example previously mentioned). The number of telephone calls in the US in a day or month is orders and orders of magnitude bigger than one million. In that case, we need to violate the privacy of potentially hundreds of thousands or millions of innocent calls in order to even have a chance of having a single real terrorist call in the pool of suspicious calls, even with the dubiously high 99.9% accuracy rate.

    It’s fundamentally not a question of the power of the automatic flagging mechanism. We’ll never be able to have a flagging protocol based on the very limited aggregate data (e.g., what number called which number, when and for how long) that is 99.9999% correct. And we will have very few positive examples to even train the algorithm on. It is a function of limiting the pool to a small enough number of communications.

    The real answer is to use human intelligence to identify suspected terrorists and then monitor them REALLY carefully, rather than monitoring all the citizens in the US in a fishing expedition.

    And that’s before we bring in the very real concerns about the nearly irresistible motivation to use that data for other purposes once it has been assembled (which you have previously acknowledged but not yet addressed). Meanwhile, won’t terrorists be using harder to track secure VoIP channels for their suspicious activity?

    So that’s why this entire approach should be scrapped, and I am so shocked that you seem to be defending it. As I commented on an earlier thread, random checkpoints don’t even work for regularly occurring crimes like drunk driving let alone incredibly infrequent and carefully planned terrorist operations.

  10. I’m not sure the “bad science” thing is new — training sets for computer algorithms have pretty much always been based on human evaluation. There’s the well-known legend from the computerized mortgage evaluation business where the algorithm derived from the training set learned to base the lion’s share of its decisions on whether the applicant was white or black.

  11. Another Kevin says

    Can I join the Bayesian Party, please?

    As someone with EE training, I would want to see a constructed set of conversations – perhaps taken from a set of people who consent to have their everyday conversations recorded for the purpose – being searched for a given pattern that is artificially seeded into the data. This type of study could then be used to construct an ROC curve (see http://www-psych.stanford.edu/~lera/psych115s/notes/signal/ if you don’t know what ROC is).

    I’d further want the request for a warrant to estimate both sensitivity and specificity, based on a properly controlled ROC study, and to estimate the conditional probability of a false positive given a positive result.

    But of course, that’s more statistics than I’d expect a judge, much less a jury (and even less a Congresscritter) to understand. Moreover, the pressure to find that somebody — anybody — is guilty of something — anything — will be well-nigh impossible for law enforcement to resist.

    If one of us in a million is a terrorist, and the false positive rate of the test is .001 and its sensitivity is .999, the test applied to a million people will find about 1 terrorist and 1000 innocents. If the test also generates human-plausible results, I shudder at the consequences. (I remember a story from the 1970s where the cast of _1776_ were reviewing their lines on a plane, and another passenger reported them to the stewardess as “dangerous radicals.” Imagine if that had happened post-9/11; the actors would have undergone “rendition” to Syria or Albania, and the play would never have reached Broadway.)

  12. There is significant difficulty in determining the semantic content of a call without the background and knowledge of the participants. Given what we know of human nature, even the participants are frequently confused as to exactly what they are talking about. People often encode their meanings, even when law is not a factor. It is done because people know what they are talking about, to avoid saying that which must not be said in a relationship, out of courtesy or flattery, and to limit local (non-wiretap) eavesdropping.

    Everyone has significant experience with misinterpreting one side of a conversation, even knowing both the callers and the basic intent of the conversation. How many false positives will result from relatively ordinary human expressions such as “I am going to kill that bastard”? How many false positives will there be with simple encoding — “Tonight is the night”? Are they talking about sex, a bombing, painting a room, or going for a boat ride?
    And that is without intentional coding of activities.

    The above says that we can turn the false positive and false negative ratchets continuously to any level of probability desired, but I don’t think that is the case. Most conversations are not long enough, and won’t contain enough information, to give more than a gross p value.

    What are we going to use to train these processes? It should be real conversations of real terrorists. I doubt that there is much of this material around.

    Even with extraordinary sensitivity (the probability that the test is positive when the property is present) and specificity (the probability that the test is negative when the property is not present), the predictive value of a test falls extremely rapidly when the number of true positives (in this case terrorists) is small in the population.

    Having seen enough analysts use, and more usually abuse, high quality data, particularly under the pressures of politics and aggressive admin types who “want an answer”, or more particularly “want a perpetrator”, I shudder to think of the abuses of such a program.

    Given the great care that the VA has taken of its data, why should anyone want to trust their phone calls to our government?

  13. You don’t consider that the false positive problem is not simply one of the automated algorithms. Human-based justice systems have false positives. We know this, and design our human justice systems on the principle that it is better that many murderers walk free than that one innocent person be imprisoned. And even so, we do imprison the innocent, and even, rarely, execute them.

    What’s new, however, is the problem of bad science. Human beings have terrible intuitions about statistics and coincidence. Scientists have to receive lots of training to get over those intuitions, and even so they often fail. The truth is, if you go looking for something in a big sea of data, you can often find it even when it’s not actually there.

    This could be made worse with automated systems scanning the sea of data. You presume the errors made by the computers will be of the sort that, once a human checks out the results, the human will quickly identify the mistake.

    What if it’s the other way around — and I think it easily could be — that the humans trust the computers too much, or the computers are programmed to find just the sort of things that look damned suspicious to us humans, but the computer finds more of them?

    Consider Arar. Arar witnessed a car loan back in Quebec by a casual associate who had been temporarily on a watchlist. Just the sort of thing a computer search might find and flag. And just the sort of thing which caused U.S. agents to grab him as he flew through JFK, and ship him off to Syria for torture.

    Now imagine a computer search is able to look at everybody’s contacts, and finds thousands of people as suspicious — to humans and it — as Arar.

  14. Matt Austern’s point is a crucial one: without independent — perhaps even adversarial — review of all the classifications made, there will be a strong incentive to game the numbers. (And, in response to CM, the simplistic definition of what a classifier would do appears to make “hand-off” perfectly legal. Whee.)

    Since it would probably be unreasonable for defense lawyers to have access to all interceptions just so that they could determine whether there was some kind of irregular pattern involving the interceptions of their clients (and yes, you would need access to at least a large sample of the dataset to determine that), maybe another choice for adversarial scrutiny would be to put the eavesdropping program and the legwork folks following up the lead in the same budget line. Of course, much or all of the privacy damage would already be done by that point in the evaluation.

  15. It isn’t actually true that “we can learn by experience what the false positive rate is”. What we can learn is which of the labeled-positive messages still appear to be positive after further examination.

    The distinction is important because, if the humans doing followup examination of the labeled-positive messages mistakenly believe that the labeling is accurate, they will almost certainly be able to fool themselves into thinking they’ve found evidence that the message is a true positive. If you’re sure that there must be evidence somewhere linking someone to a criminal conspiracy, you’ll certainly find it.

    If the institutional mindset of the people using this tool is wrong, then they will never discover the true false positive rate. It’s a neat feedback loop. A mistaken a priori belief that the false positive rate is low will lead to “experience” that the false positive rate really is low.

  16. Cotillion says

    Even better would be if you scanned the messages with multiple settings of the “knob” (or had the algorithm give a rating instead of yes/no). Then the warrant specifies that the police can listen to as many calls as they want until 50% (or so) of the calls they have listened to have been false positives. Then they can only listen to calls that are more likely to be real positives (better rating). That way they start with the most likely ones and use a little adaptation while they are on the job.

  17. paul: My thrust from a few threads back was that this “handing off”, while perhaps not the upfront motivation, will be a major use mode of any such system. Combine that with covertly scanning/recording more information than is covered by the warrant, once access to all information is provided.

  18. There’s an error in your argument that may be minor in principle, but crucial in any real practice: the 0/1 “true” state of each conversation that a classifier might flag is not “innocuous” vs. “part of a terrorist/criminal conspiracy” but rather “part of the conspiracy the warrant is looking for” and “not part of the conspiracy the warrant is looking for”. You can think of this either as a potentially enormous increase in false positives or as an opportunity to shred the constitutional guarantees of privacy in the ostensible cause of law enforcement.

    Back in the quaint real world, there’s been substantial litigation on this point, resulting from the police practice of “handing off”, or laundering, information gleaned from wiretaps to make arrests for crimes unrelated to the probable cause for which the tap was sanctioned. You can argue that the end justifies the means, but once you do, it’s difficult to see any level of false positives that would be undesirable, except from a resource-allocation point of view.

    I wonder if there’s a way to do a test run, not only on the classifiers (which can already work with suitably distorted and/or anonymized data), but on the human follow-up process as well.