Lately I’ve been writing about the policy issues surrounding government wiretapping programs that algorithmically analyze large amounts of communication data to identify messages to be shown to human analysts. (Past posts in the series: 1; 2; 3; 4; 5; 6; 7.) One of the most frequent arguments against such programs is that there will be too many false positives – too many innocent conversations misidentified as suspicious.
Suppose we have an algorithm that looks at a set of intercepted messages and classifies each message as either suspicious or innocuous. Let’s assume that every message has a true state that is either criminal (i.e., actually part of a criminal or terrorist conspiracy) or innocent. The problem is that the true state is not known. A perfect, but unattainable, classifier would label a message as suspicious if and only if it was criminal. In practice a classifier will make false positive errors (mistakenly classifying an innocent message as suspicious) and false negative errors (mistakenly classifying a criminal message as innocuous).
To illustrate the false positive problem, let’s do an example. Suppose we intercept a million messages, of which ten are criminal. And suppose that the classifier correctly labels 99.9% of the innocent messages. This means that 1000 innocent messages (0.1% of one million) will be misclassified as suspicious. All told, there will be 1010 suspicious messages, of which only ten – about 1% – will actually be criminal. The vast majority of messages labeled as suspicious will actually be innocent. And if the classifier is less accurate on innocent messages, the imbalance will be even more extreme.
This argument has some power, but I don’t think it’s fatal to the idea of algorithmically classifying intercepts. I say this for three reasons.
First, even if the majority of labeled-as-suspicous messages are innocent, this doesn’t necessarily mean that listening to those messages is unjustified. Letting the police listen to, say, ten innocent conversations is a good tradeoff if the eleventh conversation is a criminal one whose interception can stop a serious crime. (I’m assuming that the ten innocent conversations are chosen by some known, well-intentioned algorithmic process, rather than being chosen by potentially corrupt government agents.) This only goes so far, of course – if there are too many innocent conversations or the crime is not very serious, then this type of wiretapping will not be justified. My point is merely that it’s not enough to argue that most of the labeled-as-suspcious messages will be innocent.
Second, we can learn by experience what the false positive rate is. By monitoring the operation of the system, we can see learn how many messages are labeled as suspicious and how many of those are actually innocent. If there is a warrant for the wiretapping (as I have argued there should be), the warrant can require this sort of monitoring, and can require the wiretapping to be stopped or narrowed if the false positive rate is too high.
Third, classification algorithms have (or can be made to have) an adjustable sensitivity setting. Think of it as a control knob that can be moved continuously between two extremes, where one extreme is labeled “avoid false positives” and the other is labeled “avoid false negatives”. Adjusting the knob trades off one kind of error for the other.
We can always make the false positive rate as low as we like, by turning the knob far enough toward “avoid false positives”. Doing this has a price, because turning the knob in that direction also increases the number of false negatives, that is, it causes some criminal messages to be missed. If we turn the knob all the way to the “avoid false positives” end, then there will be no false positives at all, but there might be many false negatives. Indeed, we might find that when the knob is turned to that end, all messages, whether criminal or not, are classified as innocuous.
So the question is not whether we can reduce false positives – we know we can do that – but whether there is anywhere we can set the knob that gives us an acceptably low false positive rate yet still manages to flag some messages that are criminal.
Whether there is an acceptable setting depends on the details of the classification algorithm. If you forced me to guess, I’d say that for algorithms based on today’s voice recognition or speech transcription technology, there probably isn’t an acceptable setting – to catch any appreciable number of criminal conversations, we’d have to accept huge numbers of false positives. But I’m not certain of that result, and it could change as the algorithms get better.
The most important thing to say about this is that it’s an empirical question, which means that it’s possible to gather evidence to learn whether a particular algorithm offers an acceptable tradeoff. For example, if we had a candidate classification algorithm, we could run it on a large number of real-world messages and, without recording any of those messages, simply count how many messages the algorithm would have labeled as suspicious. If that number were huge, we would know we had a false positive problem. We could do this for different settings of the knob, to see where we had to get an acceptable false positive rate. Then we could apply the algorithm with that knob setting to a predetermined set of known-to-be-criminal messages, to see how many it flagged.
If governments are using algorithmic classifiers – and the U.S. government may be doing so – then they can do these types of experiments. Perhaps they have. It doesn’t seem too much to ask for them to report on their false positive rates.