January 15, 2025

Could Too Much Transparency Lead to Sunburn?

On Tuesday, the Houston Chronicle published a story about the salaries of local government employees. Headlined “Understaffing costs Houston taxpayers $150 million in overtime,” it was in many respects a typical piece of local “enterprise” journalism, where reporters go out and dig up information that the public might not already be aware is newsworthy. The story highlighted short staffing in the police department, which has too few workers for all the protection it is required to provide the citizens of Houston.

The print story used summaries and cited a few outliers, like a police sergeant who earned $95,000 in overtime. But the reporters had much more data: using Texas’s strong Public Information Act, they obtained electronic payroll data on 81,000 local government employees, essentially the entire workforce. Rather than keep this larger data set to themselves, as they might have done in a pre-Internet era, they posted the whole thing online. The notes to the database say that the Chronicle obtained even more information than it displays, and that before republishing the data, the newspaper “lumped together” what it obliquely describes as “wellness and termination pay” into each employee’s reported base salary.

In a related blog post, Chronicle staffer Matt Stiles writes:

The editors understand this might be controversial. But this information already is available to anyone who wants to see it. We’re only compiling it in a central location, and following a trend at other news organizations publishing databases. We hope readers will find the information interesting, and, even better, perhaps spot some anomalies we’ve missed.

The value proposition here seems plausible: Among the 81,000 payroll records that have just been published, there very probably are news stories of legitimate public interest, waiting to be uncovered. Moreover (given that the Chronicle, like everyone else in the news business, is losing staff) it’s likely that crowdsourcing the analysis of this data will uncover things the reporting staff would have missed.

But it also seems likely that this release of data, by making it overwhelmingly convenient to unearth the salary of any government worker in Houston, will have a raft of side effects—where by “side” I mean that they weren’t intended by the Chronicle. For example, it’s now easy as pie for any nonprofit that raises funds from public employees in Houston to get a sense of the income of their prospects. Comparing other known data, such as approximate home values or other visible spending patterns, with information about salary can allow inferences about other sources of income. In fact, you might argue that this method—researching and linking the home value for every real estate transaction related to a city worker, and combining this data with salary information—would be an extraordinary screening mechanism for possible corruption, since those who buy above what their salary would suggest they should be able to afford must have additional income, and corruption is presumably one major reason why (generally low-paid) government workers are sometimes able to live beyond their apparent means.

More generally, it seems like there is a new world of possible synergies opened up by the wide release of this information. We almost certainly haven’t thought of all the consequences that will turn out, in retrospect, to be serious.

Houston isn’t the first place to try this—it turns out that the salaries of faculties at state schools are often quietly available for download as well, for example—but it seems to highlight a real problem. It may be good for the salaries of all public employees to be a click away, but the laws that make this possible generally weren’t passed in the last ten years, and therefore weren’t drafted with the web in mind. The legislative intent reflected in most of our current statutes, when a piece of information is statutorily required to be publicly available, is that citizens should be able to get the information by obtaining, filling out, and mailing a form, or by making a trip to a particular courthouse or library. Those small obstacles made a big difference, as their recent removal reveals: Information that you used to need a good reason to justify the cost of obtaining is now worth retrieving for the merest whim, on the off chance that it might be useful or interesting. And massive projects that require lots of retrieval, which used to be entirely impractical, can now make sense in light of any of a wide and growing range of possible motivations.

Put another way: As technology evolves, the same public information laws create novel and in some cases previously unimaginable levels of transparency. In many cases, particularly those related to the conduct of top public officials, this seems to be a clearly good thing. In others, particularly those related to people who are not public figures, it may be more of a mixed blessing or even an outright problem. I’m reminded of the “candidates” of ancient Rome—the Latin word candidatus literally means “clothed in white robes,” which would-be officeholders wore to symbolize the purity and fitness for office they claimed to possess. By putting themselves up for public office, they invited their fellow citizens to hold them to higher standards. This logic still runs strong today—for example, under the Supreme Court’s Sullivan precedent, public figures face a heightened burden if they try to sue the press for libel after critical coverage.

I worry that some kinds of progress in information technology are depleting a kind of civic ozone layer. The policy solutions here aren’t obvious—one shudders to think of a government office with the power to foreclose new, unforeseen transparencies—but it at least seems like something that legislators and their staffs ought to keep an eye on.

Viacom, YouTube, and the Dangerous Assembly of Facts

On July 2nd, Viacom’s lawsuit against Google’s YouTube unit saw a significant ruling, potentially troubling for user privacy. Viacom asked for, and Judge Louis L. Stanton ordered Google to turn over, the logs of each viewing of all videos in the YouTube database, showing the username and IP address of the user who was viewing the video, a timestamp, and a code identifying the video. The judge found that Viacom “need[s] the data to compare the relative attractiveness of allegedly infringing videos with that of non-infringing videos.” The fraction of views that involve infringing video bears on Viacom’s claim that Google should have vicarious copyright liability: if the infringing videos appear to be an important draw for YouTube users, this implies a financial benefit to Google from the infringement, which would weigh in favor of a claim of vicarious liability.

As Doug Tygar has observed, the judge’s optimistic belief that disclosure of these logs won’t harm privacy seems to be based in part on the conflicting briefs of the parties. Viacom, desiring the logs, told the judge that “the login ID is an anonymous pseudonym that users create for themselves when they sign up with YouTube” which without more data “cannot identify specific individuals.” After quoting this claim and noting that Google did not refute it, the judge goes on to quote a Google employee’s blog post arguing that “in most cases, an IP address without additional information cannot” identify a particular user.

Each of these claims–first, that the login IDs of users are anonymous pseudonyms, and second, that IP addresses alone don’t suffice to identify individuals–is debatable. I haven’t reviewed the briefs that led Judge Stanton to believe each of the assertions. I suppose that his conclusions are reasonable in light of the material presented. It might be the case that the briefs should have led him to a different conclusion. Then again, as the blog post quoted above suggests, Google has at times found itself downplaying the privacy risks associated with certain data. A victory in this argument, causing the judge to take a more expansive view of the possible privacy harms, might have been a mixed blessing for Google in the longer run.

In any case, when he combined the two claims to compel the turnover of the logs, the judge made a significant mistake of his own. Agreeing for the sake of argument that login IDs alone don’t compromise privacy, and that IP addresses alone also don’t compromise privacy, it doesn’t follow that the two combined are equally innocuous. Earlier cases like the AOL debacle have shown us that information that may seem privacy-safe in isolation can be privacy-compromising when it is combined. The fact of combination–the fact that some viewing by a particular login ID happened at a certain IP address, and conversely that a viewing from a particular IP address occurred under the login of a particular user–is itself a potentially important further piece of information. If the judge thought about this fact–if he thought about the further privacy risk involved in the combination of IPs and login IDs–I couldn’t find any evidence of such consideration in his ruling.
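A small sketch can make the combination risk concrete. Everything in it is invented for illustration: the log rows, the login IDs, the IP addresses, and the "outside knowledge" table, which is a hypothetical stand-in for any source that ties one IP address to a person or household. It does not reflect the actual format of YouTube's logs.

```python
from collections import defaultdict

# Hypothetical viewing-log rows: (login_id, ip_address, video_code).
VIEW_LOG = [
    ("skywatcher88", "203.0.113.7", "vid_a"),
    ("skywatcher88", "203.0.113.7", "vid_b"),
    ("quietreader", "198.51.100.23", "vid_c"),
    ("skywatcher88", "198.51.100.40", "vid_d"),
]

# Hypothetical outside knowledge: a single IP address tied to a household.
KNOWN_IPS = {"203.0.113.7": "the household at 12 Elm St."}


def link_logins_to_people(view_log, known_ips):
    """Map each login ID to the identities implied by the IPs it was seen at."""
    linked = defaultdict(set)
    for login_id, ip, _video in view_log:
        if ip in known_ips:
            linked[login_id].add(known_ips[ip])
    return dict(linked)


def videos_for_login(view_log, login_id):
    """Every video watched under a login, regardless of IP address."""
    return sorted(video for lid, _ip, video in view_log if lid == login_id)


linked = link_logins_to_people(VIEW_LOG, KNOWN_IPS)
for login_id, people in linked.items():
    # Once one IP is identified, the login ID carries that identity
    # to every other viewing made under the same login.
    print(login_id, "->", sorted(people), videos_for_login(VIEW_LOG, login_id))
```

In this toy data, identifying a single IP address attaches a household to the login "skywatcher88", and the login in turn attaches that household to a viewing made from a different address. Neither column does that on its own; the pairing does.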

Google wants to be permitted to modify the data to reduce the privacy risk before handing it over to Viacom, but it’s not yet clear what agreement if any the parties will reach that would do more to protect privacy than Judge Stanton’s ruling requires. It’s also not yet apparent exactly how the judge’s protective order will be constructed. But if the logs are turned over unaltered, as they may yet be, the result could be significant risk: YouTube’s users would then face extreme privacy harm in the event that the data were to leak from Viacom’s possession.

[As always, this post is the opinion of the author (David Robinson) only.]

Vendor misinformation in the e-voting world

Last week, I testified before the Texas House Committee on Elections (you can read my testimony).  I’ve done this many times before, but I figured this time would be different.  This time, I was armed with the research from the California “Top to Bottom” reports and the Ohio EVEREST reports.  I was part of the Hart InterCivic source code team for California’s analysis.  I knew the problems.  I was prepared to discuss them at length.

Wow, was I disappointed.  Here’s a quote from Peter Lichtenheld, speaking on behalf of Hart InterCivic:

Security reviews of the Hart system as tested in California, Colorado, and Ohio were conducted by people who were given unfettered access to code, equipment, tools and time and they had no threat model.  While this may provide some information about system architecture in a way that casts light on questions of security, it should not be mistaken for a realistic approximation of what happens in an election environment.  In a realistic election environment, the technology is enhanced by elections professionals and procedures, and those professionals safeguard equipment and passwords, and physical barriers are there to inhibit tampering.  Additionally, jurisdiction ballot count, audit, and reconciliation processes safeguard against voter fraud.

You can find the whole hearing online (via RealAudio streaming), where you will hear the Diebold/Premier representative, as well as David Beirne, the director of their trade organization, saying essentially the same thing.  Since this seems to be the voting system vendors’ party line, let’s spend some time analyzing it.

Did our work cast light on questions of security? Our work found a wide variety of flaws, most notably the possibility of “viral” attacks, where a single corrupted voting machine could spread that corruption, as part of regular processes and procedures, to every other voting system.  In effect, one attacker, corrupting one machine, could arrange for every voting system in the county to be corrupt in the subsequent election.  That’s a big deal.

At this point, the scientific evidence is in, it’s overwhelming, and it’s indisputable.  The current generation of DRE voting systems has a wide variety of dangerous security flaws.  There’s simply no justification for the vendors to be making excuses or otherwise downplaying the clear scientific consensus on the quality of their products.

Were we given unfettered access? The big difference between what we had and what an attacker might have is that we had some (but not nearly all) source code to the system.  An attacker who arranged for some equipment to “fall off the back of a truck” would be able to extract all of the software, in binary form, and then would need to go through a tedious process of reverse engineering before reaching parity with the access we had. The lack of source code has demonstrably failed to do much to slow down attackers who find holes in other commercial software products.  Debugging and decompilation tools are really quite sophisticated these days.  All this means is that an attacker would need additional time to do the same work that we did.

Did we have a threat model? Absolutely!  See chapter three of our report, conveniently titled “Threat Model.”  The different teams working on the top to bottom report collaborated to draft this chapter. It talks about attackers’ goals, levels of access, and different variations on how sophisticated an attacker might be.  It is hard to accept that the vendors can get away with claiming that the reports did not have a threat model, when a simple check of the table of contents of the reports disproves their claim.

Was our work a “realistic approximation” of what happens in a real election? When the vendors call our work “unrealistic”, they usually mean one of two things:

  1. Real attackers couldn’t discover these vulnerabilities.
  2. The vulnerabilities can’t be exploited in the real world.

Both of these arguments are wrong. In real elections, individual voting machines are not terribly well safeguarded.  In a studio where I take swing dance lessons, I found a rack of eSlates two weeks after the election in which they were used.  They were in their normal cases.  There were no security seals.  (I didn’t touch them, but I did have a very good look around.) That’s more than sufficient access for an attacker wanting to tamper with a voting machine.  Likewise, Ed Felten has a series of Tinker posts about unguarded voting machines in Princeton.

Can an attacker learn enough about these machines to construct the attacks we described in our report? This sort of thing would need to be done in private, where a team of smart attackers could carefully reverse engineer the machine and piece together the attack.  I’ll estimate that it would take a group of four talented people, working full time, two to three months of effort to do it.  Once.  After that, you’ve got your evil attack software, ready to go, with only minutes of effort to boot a single eSlate, install the malicious software patch, and then it’s off to the races.  The attack would only need to be installed on a single eSlate per county in order to spread to every other eSlate.  The election professionals and procedures would be helpless to prevent it.  (Hart has a “hash code testing” mechanism that’s meant to determine if an eSlate is running authentic software, but it’s trivial to defeat.  See issues 9 through 12 in our report.)

What about auditing, reconciliation, “logic and accuracy” testing, and other related procedures? Again, all easily defeated by a sophisticated attacker.  Generally speaking, there are several different kinds of tests that DRE systems support.  “Self-tests” are trivial for malicious software to detect, allowing the malicious software to either disable and fake the test results, or simply behave correctly.  Most “logic and accuracy” tests boil down to casting a handful of votes for each candidate and then doing a tally.  Malicious software might simply behave correctly until more than a handful of votes have been received.  Likewise, malicious software might just look at the clock and behave correctly unless it’s the proper election day.  Parallel testing is about pulling machines out of service and casting what appears to be completely normal votes on them while the real election is ongoing.  This may or may not detect malicious software, but nobody in Texas does parallel testing.  Auditing and reconciliation are all about comparing different records of the same event.  If you’ve got a voter-verified paper audit trail (VVPAT) attachment to a DRE, then you could compare it with the electronic records.  Texas has not yet certified any VVPAT printers, so those won’t help here.  (The VVPAT printers sold by current DRE vendors have other problems, but that’s a topic for another day.) The “redundant” memories in the DREs are all that you’ve got left to audit or reconcile.  Our work shows how this redundancy is unhelpful against security threats; malicious code will simply modify all of the copies in synchrony.
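To illustrate why a handful of test votes says little about election-day behavior, here is a toy simulation. It is a sketch under stated assumptions, not a model of any real voting system: the class `ToyTallyMachine`, the trigger threshold of 50 ballots, and the one-in-ten alteration rate are all hypothetical choices made for the example.

```python
from collections import Counter

TRIGGER_THRESHOLD = 50  # hypothetical: more ballots than a typical test casts


class ToyTallyMachine:
    """Toy model of a tampered tally; not based on any real voting system."""

    def __init__(self, favored, threshold=TRIGGER_THRESHOLD):
        self.favored = favored
        self.threshold = threshold
        self.ballots_seen = 0
        self.tally = Counter()

    def cast(self, choice):
        self.ballots_seen += 1
        # Record honestly until the ballot count exceeds the threshold,
        # then alter every tenth ballot (an arbitrary rate for the example).
        if self.ballots_seen > self.threshold and self.ballots_seen % 10 == 0:
            choice = self.favored
        self.tally[choice] += 1


def logic_and_accuracy_test(machine, candidates, votes_each=3):
    """Cast a handful of votes per candidate and check the tally matches."""
    for candidate in candidates:
        for _ in range(votes_each):
            machine.cast(candidate)
    return all(machine.tally[c] == votes_each for c in candidates)


candidates = ["A", "B"]

# The pre-election test casts only six ballots, so the tally looks correct.
test_machine = ToyTallyMachine(favored="A")
print("L&A test passed:", logic_and_accuracy_test(test_machine, candidates))

# On election day, 1,000 ballots split evenly between the two candidates.
election_machine = ToyTallyMachine(favored="A")
for i in range(1000):
    election_machine.cast("B" if i % 2 else "A")
print("Election tally:", dict(election_machine.tally))
```

The test passes because it never crosses the threshold, while the evenly split election comes out lopsided. A clock-based trigger, as described above, would evade the same test in the same way.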

Later, the Hart representative remarked:

The Hart system is the only system approved as-is for the November 2007 general election after the top to bottom review in California.

This line of argument depends on the fact that most of Hart’s customers will never bother to read our actual report.  As it turns out, this was largely true in the initial rules from the CA Secretary of State, but you need to read the current rules, which were released several months later.  The new rules, in light of the viral threat against Hart systems, require the back-end system (“SERVO”) to be rebooted after each and every eSlate is connected to it.  That’s hardly “as-is”.  If you have thousands of eSlates, properly managing an election with them will be exceptionally painful.  If you only have one eSlate per precinct, as California required for the other vendors, with most votes cast on optical-scanned paper ballots, you would have a much more manageable election.

What’s it all mean? Unsurprisingly, the vendors and their trade organization are spinning the results of these studies, as best they can, in an attempt to downplay their significance.  Hopefully, legislators and election administrators are smart enough to grasp the vendors’ behavior for what it actually is and take appropriate steps to bolster our election integrity.

Until then, the bottom line is that many jurisdictions in Texas and elsewhere in the country will be using e-voting equipment this November with known security vulnerabilities, and the procedures and controls they are using will not be sufficient to either prevent or detect sophisticated attacks on their e-voting equipment. While there are procedures with the capability to detect many of these attacks (e.g., post-election auditing of voter-verified paper records), Texas has not certified such equipment for use in the state.  Texas’s DREs are simply vulnerable to and undefended against attacks.

CORRECTION: In the comments, Tom points out that Travis County (Austin) does perform parallel tests.  Other Texas counties don’t.  This means that some classes of malicious machine behavior could potentially be discovered in Travis County.

Newspapers' Problem: Trouble Targeting Ads

Richard Posner has written a characteristically thoughtful blog entry about the uncertain future of newspapers. He renders widespread journalistic concern about the unwieldy character of newspapers into the crisp economic language of “bundling”:

Bundling is efficient if the cost to the consumer of the bundled products that he doesn’t want is less than the cost saving from bundling. A particular newspaper reader might want just the sports section and the classified ads, but if for example delivery costs are high, the price of separate sports and classified-ad “newspapers” might exceed that of a newspaper that contained both those and other sections as well, even though this reader was not interested in the other sections.

With the Internet’s dramatic reductions in distribution costs, the gains from bundling are decreased, and readers are less likely to prefer bundled products. I agree with Posner that this is an important insight about the behavior of readers, but would argue that reader behavior is only a secondary problem for newspapers. The product that newspaper publishers sell—the dominant source of their revenues—is not newspapers, but audiences.

Toward the end of his post, Posner acknowledges that papers have trouble selling ads because it has gotten easier to reach niche audiences. That seems to me to be the real story: Even if newspapers had undiminished audiences today, they’d still be struggling because, on a per capita basis, they are a much clumsier way of reaching readers. There are some populations, such as the elderly and people who are too poor to get online, who may be reachable through newspapers and unreachable through online ads. But the fact that today’s elderly are disproportionately offline is an artifact of the Internet’s novelty (they didn’t grow up with it), not a persistent feature of the marketplace. Posner acknowledges that the preference of today’s young for online sources “will not change as they get older,” but goes on to suggest incongruously that printed papers might plausibly survive as “a retirement service, like Elderhostel.” I’m currently 26, and if I make it to 80, I very strongly doubt I’ll be subscribing to printed papers. More to the point, my increasing age over time doesn’t imply a growing preference for print; if anything, age is anticorrelated with change in one’s daily habits.

As for the claim that poor or disadvantaged communities are more easily reached offline than on, it still faces the objection that television is a much more efficient way of reaching large audiences than newsprint. There’s also the question of how much revenue can realistically be generated by building an audience of people defined by their relatively low level of purchasing power. If newsprint does survive at all, I might expect to see it as a nonprofit service directed at the least advantaged. Then again, if C. K. Prahalad is correct that businesses have neglected a “fortune at the bottom of the pyramid” that can be gathered by aggregating the small purchases of large numbers of poor people, we may yet see papers survive in the developing world. The greater relative importance of cell phones there, as opposed to larger screens, could augur favorably for the survival of newsprint. But phones in the developing world are advancing quickly, and may yet emerge as a better-than-newsprint way of reading the news.

The End of Theory? Not Likely

An essay in the new Wired, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” argues that we won’t need scientific theories any more, now that we have so much stored information and such great tools for analyzing it. Wired has never been the best source for accurate technology information, but this has to be a new low point.

Here’s the core of the essay’s argument:

[…] The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

There are several errors here, but the biggest one is about correlation and causation. It’s true that correlation does not imply causation. But the reason is not that the correlation might have arisen by chance – that possibility can be eliminated given enough data. The problem is that we need to know what kind of causation is operating.

To take a simple example, suppose we discover a correlation between eating spinach and having strong muscles. Does this mean that eating spinach will make you stronger? Not necessarily; this will only be true if spinach causes strength. But maybe people in poor health, who tend to have weaker muscles, have an aversion to spinach. Maybe this aversion is a good thing because spinach is actually harmful to people in poor health. If that is true, then telling everybody to eat more spinach would be harmful. Maybe some common syndrome causes both weak muscles and aversion to spinach. In that case, the next step would be to study that syndrome. I could go on, but the point should be clear. Correlations are interesting, but if we want a guide to action – even if all we want to know is what question to ask next – we need models and experimentation. We need the scientific method.
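The spinach example can be sketched as a small simulation. All the numbers here are made up for illustration: half the population is healthy, healthy people eat spinach 80% of the time and others 20%, and strength depends on health alone. The function names are hypothetical as well. In this toy world spinach has no effect on strength at all.

```python
import random
from statistics import mean


def simulate(n, rng, randomized):
    """Return (eats_spinach, strength) pairs; spinach never affects strength."""
    rows = []
    for _ in range(n):
        healthy = rng.random() < 0.5
        if randomized:
            # Experiment: a coin flip, not health, decides who eats spinach.
            eats_spinach = rng.random() < 0.5
        else:
            # Observation: healthy people are more likely to choose spinach.
            eats_spinach = rng.random() < (0.8 if healthy else 0.2)
        strength = (70 if healthy else 50) + rng.gauss(0, 5)
        rows.append((eats_spinach, strength))
    return rows


def strength_gap(rows):
    """Mean strength of spinach eaters minus mean strength of everyone else."""
    eaters = [s for eats, s in rows if eats]
    others = [s for eats, s in rows if not eats]
    return mean(eaters) - mean(others)


rng = random.Random(0)
observed_gap = strength_gap(simulate(20000, rng, randomized=False))
experiment_gap = strength_gap(simulate(20000, rng, randomized=True))
print("gap in observational data:", round(observed_gap, 2))
print("gap in randomized experiment:", round(experiment_gap, 2))
```

With these parameters the observational gap should come out large (around 12 points in expectation), and adding more rows only makes it more stable. The randomized experiment, by contrast, should show a gap near zero. More data rules out coincidence, but only the model and the experiment reveal that spinach is doing nothing.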

Indeed, in a world with more and more data, and better and better tools for finding correlations, we need the scientific method more than ever. This is confirmed by the essay’s physics story, in which physics theory (supposedly) went off the rails due to a lack of experimental data. Physics theory would be more useful if there were more data. And the same is true of scientific theory in general: theory and experiment advance in tandem, with advances in one creating opportunities for the other. In the coming age, theory will not wither away. Instead, it will be the greatest era ever for theory, and for experiment.