January 17, 2025

How I Became a Policy Wonk

It’s All-Request Friday, when I blog on topics suggested by readers. David Molnar writes,

I’d be interested to hear your thoughts on how your work has come to have significant interface with public policy questions. Was this a conscious decision, did it “just happen,” or somewhere in between? Is this the kind of work you thought you’d be doing when you first set out to do research? What would you do differently, if you could do it again, and what in retrospect were the really good decisions you made?

I’ll address most of this today, leaving the last sentence for another day.

When I started out in research, I had no idea public policy would become a focus of my work. The switch wasn’t so much a conscious decision as a gradual realization that events and curiosity had led me into a new area. This kind of thing happens all the time in research: we stumble around until we reach an interesting result and then, with the benefit of hindsight, we construct a just-so story explaining why that result was natural and inevitable. If the result is really good, then the just-so story is right, in a sense – it justifies the result and it explains how we would have gotten there if only we hadn’t been so clueless at the start.

My just-so story has me figuring out three things. (1) Policy is deep and interesting. (2) Policy affects me directly. (3) Policy and computer security are deeply connected.

Working on the Microsoft case first taught me that policy is deep and interesting. The case raised obvious public policy issues that required deep legal, economic, and technical thinking, and deep connections among the three, to figure out. As a primary technical advisor to the Department of Justice, I got to talk to top-notch lawyers and economists about these issues. What were the real-world consequences of Microsoft doing X? What would be the consequences if they were no longer allowed to do Y? Theories weren’t enough because concrete decisions had to be made (not by me, of course, but I saw more of the decision-making process than most people did). These debates opened a window for me, and I saw in a new way the complex flow from computer science in the lab to computer products in the market. I saw, too, how public policy modulates this flow.

The DMCA taught me that policy affects me directly. The first time I saw a draft of the DMCA, before it was even law, I knew it would mean trouble for researchers, and I joined a coalition of researchers who tried to get a research exemption inserted. The DMCA statute we got was not as bad as some of the drafts, but it was still problematic. As fate would have it, my own research triggered the first legal battle to protect research from DMCA overreaching. That was another formative experience.

The third realization, that policy and computer security are joined at the hip, can’t be tied to any one experience but dawned on me slowly. I used to tell people at cocktail parties, after I had said I work on computer security and they had asked what in the world that meant, that computer security is “the study of who can do what to whom online.” This would trigger either an interesting conversation or an abrupt change of topic. What I didn’t know until somebody pointed it out was that Lenin had postulated “who can do what to whom” (and the shorthand “who-whom”) as the key question to ask in politics. And Lenin, though a terrible role model, did know a thing or two about political power struggles.

More to the point, it seems that almost every computer security problem I work on has a policy angle, and almost every policy problem I work on has a computer security angle. Policy and security try, by different means, to control what people can do, to protect people from harmful acts and actors, and to ensure freedom of action where it is desired. Working on security makes my policy work better, and vice versa. Many of the computer scientists who are most involved in policy debates come from the security community. This is not an accident but reflects the deep connections between the two fields.

(Have another topic to suggest for All-Request Friday? Suggest it in the comments here.)

Fact check: The New Yorker versus Wikipedia

In July—when The New Yorker ran a long and relatively positive piece about Wikipedia—I argued that the old-media method of laboriously checking each fact was superior to the wiki model, where assertions have to be judged based on their plausibility. I claimed that personal experience as a journalist gave me special insight into such matters, and concluded: “the expensive, arguably old fashioned approach of The New Yorker and other magazines still delivers a level of quality I haven’t found, and do not expect to find, in the world of community-created content.”

Apparently, I was wrong. It turns out that EssJay, one of the Wikipedia users described in The New Yorker article, is not the “tenured professor of religion at a private university” that he claimed he was, and that The New Yorker reported him to be. He’s actually a 24-year-old, sans doctorate, named Ryan Jordan.

Jimmy Wales, who is as close to being in charge of Wikipedia as anybody is, has had an intricate progression of thought on the matter, ably chronicled by Seth Finkelstein. His ultimate reaction (or at any rate, his current public stance as of this writing) appears on his personal page on Wikipedia:

I only learned this morning that EssJay used his false credentials in content disputes… I understood this to be primarily the matter of a pseudonymous identity (something very mild and completely understandable given the personal dangers possible on the Internet) and not a matter of violation of people’s trust.

As Seth points out, this is an odd reaction since it seems simultaneously to forgive EssJay for lying to The New Yorker (“something very mild”) and to hold him much more strongly to account for lying to other Wikipedia users. One could argue that lying to The New Yorker—and by extension to its hundreds of thousands of subscribers—was in the aggregate much worse than lying to the Wikipedians. One could also argue that Mr. Jordan’s appeal to institutional authority, which was as successful as it was dishonest, raises profound questions about the Wikipedia model.

But I won’t make either of those arguments. Instead, I’ll return to the issue that has me putting my foot in my mouth: How can a reader decide what to trust? I predicted you could trust The New Yorker, and as it turns out, you couldn’t.

Philip Tetlock, a long-time student of the human penchant for making predictions, has found (in a book whose text I can’t link to, but which I encourage you to read) that people whose predictions are falsified typically react by making excuses. They typically claim they are off the hook because the conditions on which their prediction rested turned out not to be what they seemed at the time. This defense is available to me: The New Yorker fell short of its own standards, and took EssJay at his word without verifying his identity or even learning his name. He had, as all con men do, a plausible-sounding story, related in this case to a putative fear of professional retribution that in hindsight sits rather uneasily with his claim that he had tenure. If the magazine hadn’t broken its own rules, this wouldn’t have gotten into print.

But that response would be too facile, as Tetlock rightly observes of the general case. Granted that perfect fact checking makes for a trustworthy story; how do you know when the fact checking is perfect and when it is not? You don’t. More generally, predictions are only as good as someone’s ability to figure out whether or not the conditions are right to trigger the predicted outcome.

So what about this case? On the one hand, incidents like this are rare and tend to lead the fact checkers to redouble their meticulousness. On the other, the fact claims in a story that are hardest to check are often, for the same reason, the likeliest to be false. Should you trust the sometimes-imperfect fact checking that actually goes on?

My answer is yes. In the wake of this episode The New Yorker looks very bad (and Wikipedia only moderately so) because people regard an error in The New Yorker to be exceptional in a way the exact same error in Wikipedia is not. This expectations gap tells me that The New Yorker, warts and all, still gives people something they cannot find at Wikipedia: a greater, though conspicuously not total, degree of confidence in what they read.

Introducing All-Request Friday

Adapting an idea from Tyler Cowen, I’m going to try a new feature, where on Fridays I post about topics suggested by readers. Please post your suggested topics in the comments.

Manipulating Reputation Systems

BoingBoing points to a nice pair of articles by Annalee Newitz on how people manipulate online reputation systems like eBay’s user ratings, Digg, and so on.

There’s a myth floating around that such systems distill an uncannily accurate folk judgment from the votes submitted by millions of ordinary citizens. The wisdom of crowds, and all that. In fact, reputation systems are fraught with problems, and the most important systems survive because companies expend great effort to supplement the algorithms by investigating abuse and trying to compensate for it. eBay, for example, reportedly works very hard to fight abuse of its reputation system.

Why do people put more faith in reputation systems than the systems really deserve? One reason is the compelling but not entirely accurate analogy to the power of personal reputations in small town gossip networks. If a small-town merchant is accused of cheating a customer, everyone in town will find out quickly and – here’s where the analogy goes off the rails – individual townspeople will make nuanced judgments based on the details of the story, the character of the participants, and their own personal experiences. The reason this works is that the merchant, the customer, and the person evaluating the story are embedded in a complex, densely interconnected network.

When the network of participants gets much bigger and the interconnections much sparser, there is no guarantee that the same system will still work. Even if it does work, a large-scale system might succeed for different reasons than the small-town system. What we need is some kind of theory: some kind of explanation for why a reputation system can succeed. Our theory, whatever it is, will have to account for the desires and incentives of participants, the effect of relevant social norms, and so on.

The incentive problem is especially challenging for recommendation services like Digg. Digg assumes that users will cast votes for the sites they like. If I vote for sites that I really do like, this will mostly benefit strangers (by helping them find something cool to read). But if I sell my votes or cast them for sites run by my friends and me, I will benefit more directly. In short, my incentive is to cheat. These sorts of problems seem likely to get worse as a service grows, because the stakes will grow and the sense of community may weaken.
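
To make the incentive point concrete, here is a minimal simulation sketch in Python. It is not a model of Digg’s actual ranking algorithm; the numbers and the behavioral assumptions (honest users upvote roughly in proportion to an item’s quality, while a small ring always upvotes its own submission) are invented purely for illustration.

```python
import random

random.seed(1)

HONEST_USERS = 1000
COLLUDERS = 50           # a ring of 50 users acting together

# Ten submissions with an invented intrinsic "quality" between 0 and 1;
# item 9 is the colluders' own mediocre submission.
quality = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.3]
votes = [0] * len(quality)

# Honest behavior: each user upvotes each item with probability
# proportional to its quality (scaled so nobody upvotes everything).
for _ in range(HONEST_USERS):
    for item, q in enumerate(quality):
        if random.random() < 0.1 * q:
            votes[item] += 1

# Manipulation: every member of the ring upvotes item 9, every time.
for _ in range(COLLUDERS):
    votes[9] += 1

ranking = sorted(range(len(quality)), key=lambda i: votes[i], reverse=True)
for item in ranking:
    print(f"item {item}: quality {quality[item]:.1f}, votes {votes[item]}")
```

The specific numbers don’t matter; the point is structural. Honest voting mostly benefits strangers, while coordinated manipulation pays the manipulators directly, so even a small ring can lift a mediocre item past much better ones in a pure vote-count ranking.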

It seems to me that reputation systems are a fruitful area for technical, economic and social research. I know there is research going on already – and readers will probably chastise me in the comments for not citing it all – but we’re still far from understanding online reputation.

Sarasota: Could a Bug Have Lost Votes?

At this point, we still don’t know what caused the high undervote rate in Sarasota’s Congressional election. [Background: 1, 2.] There are two theories. The State-commissioned study released last week argues for the theory that a badly designed ballot caused many voters not to see that race and therefore not to cast a vote.

Today I want to make the case for the other theory: that a malfunction or bug in the voting machines caused votes not to be recorded. The case rests on four pillars: (1) The postulated behavior is consistent with a common type of computer bug. (2) Similar bugs have been found in voting machines before. (3) The State-commissioned study would have been unlikely to find such a bug. (4) Studies of voting data show patterns that point to the bug theory.

(1) The postulated behavior is consistent with a common type of computer bug.

Programmers know the kind of bug I’m talking about: an error in memory management, or a buffer overrun, or a race condition, which causes subtle corruption in a program’s data structures. Such bugs are maddeningly hard to find, because the problem isn’t evident immediately but the corrupted data causes the program to go wrong in subtle ways later. These bugs often seem to be intermittent or “random”, striking sometimes but lying dormant at other times, and seeming to strike more or less frequently depending on the time of day or other seemingly irrelevant factors. Every experienced programmer tells horror stories about such bugs.

Such a bug is consistent with the patterns we saw in the election. Undervotes didn’t happen to every voter, but they did happen in every precinct, though with different frequency in different places.
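
For readers who haven’t fought one of these bugs, here is a toy sketch in Python of the race-condition variety. It is of course not code from any voting machine; it just shows how an unsynchronized read-modify-write on a shared tally silently loses updates, by an amount that varies from run to run.

```python
import threading

VOTES_PER_THREAD = 1_000_000
recorded = 0                      # shared tally, deliberately left unprotected

def record_votes():
    global recorded
    for _ in range(VOTES_PER_THREAD):
        current = recorded        # read the shared tally...
        recorded = current + 1    # ...then write it back; any update another
                                  # thread made in between is silently lost

threads = [threading.Thread(target=record_votes) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

expected = 2 * VOTES_PER_THREAD
print(f"expected {expected}, recorded {recorded}, lost {expected - recorded}")
```

On most runs the final count comes up short, and by a different amount each time; occasionally it comes out exactly right, which is precisely the maddening intermittency described above. Wrapping the update in a lock fixes it, which is also why such bugs are so easy to write and so hard to find by inspection.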

(2) Similar bugs have been found in voting machines before.

We know of at least two examples of similar bugs in voting machines that were used in real elections. After problems in Maryland voting machines caused intermittent “freezing” behavior, the vendor recalled the motherboards of 4700 voting machines to remedy a hardware design error.

Another example, this time caused by a software bug, was described by David Jefferson:

In the volume testing of 96 Diebold TSx machines … in the summer of 2005, we had an enormously high crash rate: over 20% of the machines crashed during the course of one election day’s worth of votes. These crashes always occurred either at the end of one voting transaction when the voter touched the CAST button, or right at the beginning of the next voter’s session when the voter SmartCard was inserted.

It turned out that, after a huge effort on Diebold’s part, a [Graphical User Interface] bug was discovered. If a voter touched the CAST button sloppily, and dragged his/her finger from the button across a line into another nearby window (something that apparently happened with only one of every 400 or 500 voters), an exception would be signaled. But the exception was not handled properly, leading to stack corruption or heap corruption (it was never clear to us which), which apparently invariably led to the crash. Whether it caused other problems also, such as vote corruption or audit log corruption, was never determined, at least to my knowledge. Diebold fixed this bug, and at least TSx machines are free of it now.

These are the two examples we know about, but note that neither of these examples was made known to the public right away.
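
The shape of that failure (an error in one voter’s session, papered over, that surfaces as a crash in the next session) is easy to illustrate. The sketch below is hypothetical Python, not Diebold’s code: a rare event interrupts the “cast” step partway through, a catch-all handler swallows the error, and the inconsistent state it leaves behind trips up the following session.

```python
import random

ballots = []        # one entry per cast ballot
audit_log = []      # should stay in lockstep with `ballots`

def cast_ballot(choices, sloppy_touch):
    ballots.append(choices)
    if sloppy_touch:
        # Stand-in for the rare GUI event (roughly one voter in 400-500,
        # in Jefferson's account) that raised an unexpected exception.
        raise RuntimeError("unexpected GUI event during CAST")
    audit_log.append(f"ballot {len(ballots)} recorded")

def start_next_session():
    # Assumes the two structures are in lockstep; crashes if they are not.
    if len(ballots) != len(audit_log):
        raise RuntimeError("state corrupted by an earlier session")

for voter in range(1, 5001):
    try:
        cast_ballot({"district_13": "A"}, sloppy_touch=(random.random() < 1 / 450))
    except RuntimeError:
        pass                      # the bug: swallow the error and carry on
    try:
        start_next_session()
    except RuntimeError as crash:
        print(f"crash while starting the session after voter {voter}: {crash}")
        break
```

The crash surfaces after the code that actually went wrong has finished, which is why bugs of this shape are so hard to trace back to their cause.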

(3) The State-commissioned study would have been unlikely to find such a bug.

The State of Florida study team included some excellent computer scientists, but they had only a short time to do their study, and the scope of their study was limited. They did not perform the kind of time-consuming dynamic testing that one would use in an all-out hunt for such a bug. To their credit, they did the best they could given the limited time and tools they had, but they would have had to get lucky to find such a bug if it existed. Their failure to find such a bug is not strong evidence that a bug does not exist.

(4) Studies of voting data show patterns that point to the bug theory.

Several groups have studied detailed data on the Sarasota election results, looking for patterns that might help explain what happened.

One of the key questions is whether there are systematic differences in undervote rate between individual voting machines. The reason this matters is that if the ballot design theory is correct, then the likelihood that a particular voter undervoted would be independent of which specific machine the voter used – all voting machines displayed the same ballot. But an intermittent bug might well manifest itself differently depending on the details of how each voting machine was set up and used. So if undervote rates depend on attributes of the machines, rather than attributes of the voters, this tends to point toward the bug theory.

Of course, one has to be careful to disentangle the possible causes. For example, if two voting machines sit in different precincts, they will see different voter populations, so their undervote rate might differ even if the machines are exactly identical. Good data analysis must control for such factors or at least explain why they are not corrupting the results.
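
Here is a sketch, on made-up data, of one way to do that kind of controlled comparison. The ballots below are synthetic (the real Sarasota records are not reproduced here), with one machine deliberately given an elevated undervote rate; the test shuffles machine labels within each precinct, so precinct-level differences are preserved while any machine-specific effect is destroyed, and then asks whether the observed spread between machines is larger than chance would produce.

```python
import random
from collections import defaultdict

random.seed(42)

# Synthetic ballots: (precinct, machine, undervoted?).  Machine B-2 is given
# an elevated undervote rate on purpose, standing in for a machine-specific
# problem; everything else is baseline noise.
rates = {("A", "A-1"): 0.13, ("A", "A-2"): 0.13,
         ("B", "B-1"): 0.13, ("B", "B-2"): 0.20}
ballots = []
for (precinct, machine), rate in rates.items():
    for _ in range(2000):
        ballots.append((precinct, machine, random.random() < rate))

def machine_spread(data):
    """Sum over precincts of the spread in undervote rate across machines."""
    counts = defaultdict(lambda: [0, 0])   # (precinct, machine) -> [undervotes, ballots]
    for precinct, machine, under in data:
        counts[(precinct, machine)][0] += under
        counts[(precinct, machine)][1] += 1
    by_precinct = defaultdict(list)
    for (precinct, _), (u, n) in counts.items():
        by_precinct[precinct].append(u / n)
    return sum(max(r) - min(r) for r in by_precinct.values())

def shuffle_within_precincts(data):
    """Randomly reassign machine labels within each precinct."""
    by_precinct = defaultdict(list)
    for precinct, machine, under in data:
        by_precinct[precinct].append((machine, under))
    shuffled = []
    for precinct, rows in by_precinct.items():
        machines = [m for m, _ in rows]
        random.shuffle(machines)
        shuffled.extend((precinct, m, u) for m, (_, u) in zip(machines, rows))
    return shuffled

observed = machine_spread(ballots)
exceed = sum(machine_spread(shuffle_within_precincts(ballots)) >= observed
             for _ in range(1000))
print(f"observed spread {observed:.3f}, permutation p-value ~ {exceed / 1000:.3f}")
```

With these particular invented numbers the permutation p-value comes out near zero, meaning the between-machine spread is far larger than shuffling can explain. Real data would need far more care (many more precincts and machines, and controls for who used which machine and when), but the logic is the same.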

There are two serious studies that point to machine-dependent results. First, Mebane and Dill found that machines that had a certain error message in their logs had a higher undervote rate. According to the State study, this error message was caused by a particular method used by poll workers to wake the machines up in the morning; so the use of this method correlated with a higher undervote rate.

Second, Charles Stewart, an MIT political scientist testifying for the Jennings campaign in the litigation, looked at how the undervote rate depended on when the voting machine was “cleared and tested”, an operation used to prepare the machine for use. Stewart found that machines that were cleared and tested later (closer to Election Day) had a higher undervote rate, and that machines that were cleared and tested on the same day as many other machines also had a higher undervote rate. One possibility is that clearing and testing a machine in a hurry, as the election deadline approached or just on a busy day, contributed to the undervote rate somehow.

Both studies indicate a link between the details of how a machine was set up and used, and the undervote rate on that machine. That’s the kind of thing we’d expect to see with an intermittent bug, but not if undervotes were caused strictly by ballot design and user confusion.
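
A sketch of the Stewart-style check, again on invented numbers rather than the real machine-level records, would look something like this: give each synthetic machine a clear-and-test date and an undervote rate that drifts upward for machines prepared close to Election Day, then ask whether the two are correlated.

```python
import random
import statistics   # statistics.correlation requires Python 3.10+

random.seed(7)

machines = []
for _ in range(150):
    days_before_election = random.randint(1, 30)   # when the machine was cleared and tested
    # Invented rate: baseline noise plus a small bump for machines prepared late.
    rate = random.gauss(0.13, 0.01) + 0.002 * max(0, 10 - days_before_election)
    machines.append((days_before_election, rate))

days = [d for d, _ in machines]
rates = [r for _, r in machines]

# A negative correlation means: the fewer days before the election the machine
# was prepared, the higher its undervote rate.
print("correlation:", round(statistics.correlation(days, rates), 3))
```

A correlation by itself doesn’t identify the mechanism; as noted above, it may only mean that hurried setup mattered somehow. But it is the machine-centric signature one would look for.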

Conclusion

What conclusion can we draw? Certainly we cannot say that a bug definitely caused undervotes. But we can say with confidence that the bug theory is still in the running, and needs to be considered alongside the ballot design theory as a possible cause of the Sarasota undervotes. If we want to get to the bottom of this, we need to investigate further, by looking more deeply into undervote patterns, and by examining the voting machine hardware and software.

[Correction (Feb. 28): I changed part (3) to say that the team “had” only a short time to do their study. I originally wrote that they “were given” only a short time, which left the impression that the state had set a time limit for the study. As I understand it, the state did not impose such a time limit. I apologize for the error.]