
Archives for September 2016

Which voting machines can be hacked through the Internet?

Over 9000 jurisdictions (counties and states) in the U.S. run elections with a variety of voting machines: optical scanners for paper ballots, and direct-recording “touchscreen” machines.  Which of them can be hacked to make them cheat, transferring votes from one candidate to another?

The answer:  all of them.  An attacker with physical access to a voting machine can install fraudulent vote-miscounting software.  I’ve demonstrated this on one kind of machine; others have demonstrated it on other machines.  It’s a general principle about computers: they run whatever software is installed at the moment.

So let’s ask:

  1. Which voting machines can be hacked from anywhere in the world, through the Internet?  
  2. Which voting machines have other safeguards, so we can audit or recount the election to get the correct result even if the machine is hacked?

The answers, in summary:

  1. Older machines (Shouptronic, AVC Advantage, AccuVote OS, Optech-III Eagle) can be hacked by anyone with physical access; newer machines (almost anything else in use today) can be hacked by anyone with physical access, and are vulnerable to attacks from the Internet.
  2. Optical scan machines, even though they can be hacked, allow audits and recounts of the paper ballots marked by the voters.  This is a very important safeguard.  Paperless touchscreen machines have no such protection.  “DRE with VVPAT” machines, i.e. touchscreens that print on paper (that the voter can inspect under glass while casting the ballot) are “in between” regarding this safeguard.

The most widely used machine that fails #1 and #2 is the AccuVote TS, used throughout the state of Georgia, and in some counties in other states.


Bitcoin’s history deserves to be better preserved

Much of Bitcoin’s development has happened in the open in a transparent manner through the mailing list and the bitcoin-dev IRC channel. The third-party website BitcoinStats maintains logs of the bitcoin-dev IRC chats. [1] This resource has proved useful and is linked to by other sources such as the Bitcoin wiki.

When reading a blog post about the 2013 Bitcoin fork, I noticed something strange about a discussion on BitcoinStats that was linked from it. Digging around, I found that the Wayback Machine’s version of the BitcoinStats logs is different; the log had been changed at some point. I was curious whether only this conversation had been truncated, or whether other logs had changed as well.

After scraping the current version of the BitcoinStats website and the Wayback Machine versions, I found that many pages differ from their Wayback Machine copies. For example, in the log for January 11, 2016, many entries from the user ‘Lightsword’ are now blank. The number and nature of the errors make it appear that there is a bug in the backend of the BitcoinStats website, rather than malicious censorship of certain conversations. There may not be a complete history of the IRC channels anywhere, as the Wayback Machine also has holes in its coverage.
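A comparison like this can be sketched with Python’s standard difflib module, diffing a current log page against its archived copy. The log lines below are invented examples, not actual BitcoinStats content:

```python
import difflib

def changed_lines(archived: str, current: str) -> list:
    """Return the added/removed lines where a current log page
    diverges from its archived (Wayback Machine) version."""
    diff = difflib.unified_diff(
        archived.splitlines(),
        current.splitlines(),
        fromfile="wayback",
        tofile="bitcoinstats",
        lineterm="",
    )
    # Keep only real +/- changes, dropping the diff headers and context lines.
    return [l for l in diff
            if l[:1] in "+-" and not l.startswith(("+++", "---"))]

# Hypothetical example: an entry blanked in the current version of a log.
archived = "12:01 <Lightsword> we should bump the version\n12:02 <alice> agreed"
current = "12:01 <Lightsword>\n12:02 <alice> agreed"
for line in changed_lines(archived, current):
    print(line)
```

Running this over every (date, page) pair in both scrapes surfaces exactly which logs have drifted from their archived versions.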

It is unfortunate that artifacts of Bitcoin’s development history are being lost. There is value in knowing how critical decisions were made in frantic hours of the 2013 fork. An important part of learning from history is having access to historical data. Decisions that shape what Bitcoin is today were originally discussed on IRC, and those decisions will continue to shape Bitcoin. Understanding what went right and what went wrong can inform future technology and community design.

The lesson is that online communities must make deliberate efforts to preserve important digital artifacts. Often this is merely a matter of picking the right technology. If GitHub were to disappear tomorrow, none of Bitcoin’s code history would be lost, thanks to git’s decentralized and distributed nature. All of Bitcoin’s transaction history is likewise famously replicated and resilient to corruption or loss.

Preserving the IRC logs would not be difficult. The community could distribute the logs via BitTorrent, as Wikipedia does with its content. Another option is to use the form the Wayback Machine provides to ensure the archiving of a page (to minimize effort, one could automate the invocation of this functionality). Given how important preserving this data is and how easy it is, it seems worthwhile.
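Automating the Wayback Machine’s save form is a few lines of standard-library Python: the “Save Page Now” feature accepts a request at `https://web.archive.org/save/` followed by the page URL. The BitcoinStats log URL in the comment below is illustrative:

```python
from urllib.parse import quote
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_url(page: str) -> str:
    """Build the Wayback Machine 'Save Page Now' URL for a page."""
    # Keep ':' and '/' intact so the target URL stays readable.
    return SAVE_ENDPOINT + quote(page, safe=":/")

def archive(page: str, timeout: int = 30) -> None:
    """Ask the Wayback Machine to archive a page (makes a network call)."""
    urllib.request.urlopen(save_url(page), timeout=timeout)

# A preservation script could loop over each day's log page, e.g.:
# archive("http://bitcoinstats.com/irc/bitcoin-dev/logs/2016/01/11")
```

A daily cron job invoking this for the previous day’s log would be enough to keep the archive’s coverage gap-free going forward.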

[1] IRC as a whole has a culture of ephemerality, and so Freenode, the network that hosts the bitcoin-dev IRC channel, doesn’t provide logs.

All the News That’s Fit to Change: Insights into a corpus of 2.5 million news headlines

[Thanks to Joel Reidenberg for encouraging this deeper dive into news headlines!]

There is no guarantee that a news headline you see online today will not change tomorrow, or even in the next hour, or that it is even the same headline your neighbor sees right now. For a real-life example of the type of change that can happen, consider this explosive headline from NBC News…

“Bernanke: More Execs Deserved Jail for Financial Crisis”

…contrasted with the much more subdued…

“Bernanke Thinks More Execs Should Have Been Investigated”

These headlines clearly suggest different stories, which is worrying because of the effect that headlines have on our perception of the news — a recent survey found that only “41 percent of Americans report that they watched, read, or heard any in-depth news stories, beyond the headlines, in the last week.”

As part of the Princeton Web Transparency and Accountability Project (WebTAP), we wanted to understand more about headlines. How often do news publishers change headlines on articles? Do variations offer different editorial slants on the same article? Are some variations ‘clickbait-y’?

To answer these questions we collected approximately 1.5 million article links seen since June 1st, 2015 on the front pages of 25 news sites, through the Internet Archive’s Wayback Machine. Some articles were linked to with more than one headline (at different times or on different parts of the page), so we ended up with a total of approximately 2.5 million headlines.[1] To clarify, we are defining headlines as the text linking to articles on the front page of news websites — we are not talking about headlines on the actual article pages themselves. Our corpus is available for download here. In this post we’ll share some preliminary research and outline further research questions.
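Extracting headlines in this sense — the anchor text of article links on a front-page snapshot — can be sketched with Python’s built-in HTML parser. This is a simplified illustration, not the project’s actual scraper, and real use would need per-site rules for deciding which links are articles:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, headline) tuples
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            text = " ".join("".join(self._text).split())
            if text:  # skip image-only or empty links
                self.links.append((self._href, text))
            self._href = None

def headlines(snapshot_html: str):
    """Return (url, headline) pairs for every text link in a snapshot."""
    parser = HeadlineParser()
    parser.feed(snapshot_html)
    return parser.links

sample = '<a href="/politics/story1">Bernanke: More Execs Deserved Jail</a>'
```

Grouping the resulting pairs by article URL across snapshots is what lets one count how many headlines each article accumulated.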


One in four articles had more than one headline associated with it

Our analysis was limited by how many snapshots of the news sites the Wayback Machine took. For the six months of data from 2015 especially, some of the less popular news sites did not have as many daily snapshots as the more popular sites — this might suppress the measured headline variation on less popular websites. Even so, we were able to capture many instances of articles with multiple headlines for each site we looked at.


Clickbait is common, and hasn’t changed much in the last year

We took a first pass at our data using an open source library to classify headlines as clickbait. The classifier was trained by the developer using Buzzfeed headlines as clickbait and New York Times headlines as non-clickbait, so it can more accurately be called a Buzzfeed classifier. Unsurprisingly then, Buzzfeed had the most clickbait headlines detected of the sites we looked at.

But we also discovered that more “traditional” news outlets regularly use clickbait headlines too. The Wall Street Journal, for instance, has used clickbait headlines in place of more traditional headlines for its news stories, as in two variations they tried for an article on the IRS:

‘Think Tax Time Is Tough? Try Being at the IRS’


‘Wait Times Are Down, But IRS Still Faces Challenges’

Overall, we found that at least 10% of headlines were classified as clickbait on a majority of sites we looked at. We also found that overall, clickbait does not appear to be any more or less common now than it was in June 2015.


Using lexicon-based heuristics we were able to identify many instances of bias in headlines

Identifying bias in headlines is a much harder problem than finding clickbait. One research group from Stanford approached detecting bias as a machine learning problem — they trained a classifier to recognize when Wikipedia edits did or did not reflect a neutral point of view, as identified by thousands of human Wikipedia editors. While Wikipedia edits and headlines differ in some pretty important ways, using their feature set was informative. They developed a lexicon of suspect words, curated from decades of research on biased language. Consider the use of the root word “accuse,” as in this example we found from Time Magazine:

‘Roger Ailes Resigns From Fox News’


‘Roger Ailes Resigns From Fox News Amid Sexual Harassment Accusations’

The first headline just offers the “who” and “what” of the news story — the second headline’s “accusations” add the much more attention-grabbing “why.” Some language use is more subtle, like in this example from Fox News:

‘DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data’


‘DNC reportedly punishes Sanders campaign for accessing Clinton voter data’

The facts implied by these headlines are different in a very important way. The second headline, unlike the first, can cause a reader to presuppose that the Sanders campaign did do something wrong or malicious, since they are being punished. The first headline hedges the story significantly, saying only that the Sanders campaign may have done something “improper” — the truth of that proposition is not suggested. The researchers identify this as a bias of entailment.

Using a modified version of the biased-language lexicon, we looked at our own corpus of headlines and identified when headline variations added or dropped these biased words. We found approximately 3000 articles in which headline variations for the same article used different biased words, which you can look at here. From our data collection we clearly have evidence of editorial bias playing a role in the different headlines we see on news sites.
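At its core, this heuristic is a set comparison: for each article, flag it when its headline variants differ in which lexicon words they contain. A minimal sketch follows; the five-word lexicon here is illustrative, not the actual research lexicon, which contains far more entries:

```python
import re

# Illustrative subset of a biased-language lexicon (hypothetical entries).
BIAS_LEXICON = {"accuse", "accusation", "punish", "allege", "claim"}

def bias_words(headline: str) -> set:
    """Lexicon words appearing in a headline, matching inflected forms
    by simple prefix comparison (e.g. 'accuses' matches 'accuse')."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return {w for w in BIAS_LEXICON if any(t.startswith(w) for t in tokens)}

def differential_bias(h1: str, h2: str) -> set:
    """Bias words present in one headline variant but not the other."""
    return bias_words(h1) ^ bias_words(h2)

h1 = "DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data"
h2 = "DNC reportedly punishes Sanders campaign for accessing Clinton voter data"
print(differential_bias(h1, h2))  # {'accuse', 'punish'}
```

Prefix matching is a crude stand-in for real stemming, but it is enough to surface candidate article pairs for closer inspection.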


Detecting all instances of bias and avoiding false positives is an open research problem

While identifying bias in 3000 articles’ headlines is a start, we think we’ve identified only a fraction of biased articles. One reason for missing bias is that our heuristic defines differential bias narrowly (explicit use of biased words in one headline not present in another). There are also false positives in the headlines that we detected as biased. For instance, an allegation or an accusation might show a lack of neutrality in a Wikipedia article, but in a news story an allegation or accusation may simply be the story.

We know for sure that our data contains evidence of editorial and biased variations in headlines, but we still have a long way to go. We would like to be able to identify at scale and with high confidence when a news outlet experiments with its headlines. But there are many obstacles compared to the previous work on identifying bias in Wikipedia edits:

– Without clear guidelines, finding bias in headlines is a more subjective exercise than finding it in Wikipedia articles.

– Headlines are more information-dense than Wikipedia articles — fewer words in headlines contribute to a headline’s implication.

– Many of the stories that the news publishes are necessarily more political than most Wikipedia articles.

If you have any ideas on how to overcome these obstacles, we invite you to reach out to us or take a look at the data yourself, available for download here.


[1] Our main measurements ignore subpages, which often appear to have article links that are at some point featured on the front page. For each snapshot of the front page, we collected the links to articles seen on the page along with the ‘anchor text’ of those links, which are generally the headlines that are being varied.