March 30, 2017

All the News That’s Fit to Change: Insights into a corpus of 2.5 million news headlines

[Thanks to Joel Reidenberg for encouraging this deeper dive into news headlines!]

There is no guarantee that a news headline you see online today will not change tomorrow, or even in the next hour, or that it will be the same headline your neighbor sees right now. For a real-life example of the type of change that can happen, consider this explosive headline from NBC News…

“Bernanke: More Execs Deserved Jail for Financial Crisis”

…contrasted with the much more subdued…

“Bernanke Thinks More Execs Should Have Been Investigated”

These headlines clearly suggest different stories, which is worrying because of the effect that headlines have on our perception of the news — a recent survey found that, “41 percent of Americans report that they watched, read, or heard any in-depth news stories, beyond the headlines, in the last week.”

As part of the Princeton Web Transparency and Accountability Project (WebTAP), we wanted to understand more about headlines. How often do news publishers change headlines on articles? Do variations offer different editorial slants on the same article? Are some variations ‘clickbait-y’?

To answer these questions we collected roughly 1.5 million article links seen since June 1st, 2015 on 25 news sites’ front pages through the Internet Archive’s Wayback Machine. Some articles were linked to with more than one headline (at different times or on different parts of the page), so we ended up with a total of about 2.5 million headlines.[1] To clarify, we define headlines as the text linking to articles on the front page of news websites; we are not talking about headlines on the actual article pages themselves. Our corpus is available for download here. In this post we’ll share some preliminary research and outline further research questions.
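To make the definition of a headline concrete, here is a minimal sketch of pulling link text out of a front-page snapshot using only Python's standard library. It is an illustration, not our actual collection code: the names `AnchorTextParser`, `extract_headlines`, and `SAMPLE` are hypothetical, and a real pipeline would also need to filter out navigation links and resolve relative URLs.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag we are inside, if any
        self._chunks = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks in the markup
            # do not produce spurious headline variations.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.links.append((self._href, text))
            self._href = None


def extract_headlines(html):
    """Return the (href, anchor text) pairs found in one page snapshot."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical snippet of front-page markup (the URLs are made up).
SAMPLE = """
<div><a href="/news/bernanke">Bernanke: More Execs <b>Deserved Jail</b>
for Financial Crisis</a> <a href="/about"></a></div>
"""

if __name__ == "__main__":
    for href, text in extract_headlines(SAMPLE):
        print(href, "|", text)
```

Links with no visible text are skipped, since they cannot carry a headline.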


One in four articles had more than one headline associated with it

Our analysis was limited by how many snapshots of the news sites the Wayback Machine took. For the six months of data from 2015 especially, some of the less popular news sites did not have as many daily snapshots as the more popular sites, which may suppress our measure of headline variation on less popular websites. Even so, we were able to capture many instances of articles with multiple headlines for each site we looked at.
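The measurement behind this section's title can be sketched in a few lines, assuming the corpus has been reduced to (article URL, headline) observations. The function names and the sample URLs below are hypothetical, and the toy data was chosen only to show the calculation, not to reproduce our result.

```python
from collections import defaultdict


def headlines_by_article(observations):
    """Map each article URL to the set of distinct headlines seen for it."""
    seen = defaultdict(set)
    for url, headline in observations:
        seen[url].add(headline.strip())
    return dict(seen)


def share_with_multiple_headlines(observations):
    """Fraction of articles linked to with more than one distinct headline."""
    by_article = headlines_by_article(observations)
    if not by_article:
        return 0.0
    multi = sum(1 for heads in by_article.values() if len(heads) > 1)
    return multi / len(by_article)


# Toy observations: the URLs are invented for illustration.
OBSERVATIONS = [
    ("example.com/bernanke", "Bernanke: More Execs Deserved Jail for Financial Crisis"),
    ("example.com/bernanke", "Bernanke Thinks More Execs Should Have Been Investigated"),
    ("example.com/irs", "Wait Times Are Down, But IRS Still Faces Challenges"),
    ("example.com/ailes", "Roger Ailes Resigns From Fox News"),
    ("example.com/ailes", "Roger Ailes Resigns From Fox News"),  # same text, later snapshot
    ("example.com/other", "Some Other Story"),
]

if __name__ == "__main__":
    print(share_with_multiple_headlines(OBSERVATIONS))
```

Repeated sightings of the same headline text count once, so only genuine variations raise the share.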


Clickbait is common, and hasn’t changed much in the last year

We took a first pass at our data using an open source library to classify headlines as clickbait. The classifier was trained by the developer using Buzzfeed headlines as clickbait and New York Times headlines as non-clickbait, so it can more accurately be called a Buzzfeed classifier. Unsurprisingly then, Buzzfeed had the most clickbait headlines detected of the sites we looked at.

But we also discovered that more “traditional” news outlets regularly use clickbait headlines too. The Wall Street Journal, for instance, has used clickbait headlines in place of more traditional headlines for its news stories, as in two variations they tried for an article on the IRS:

‘Think Tax Time Is Tough? Try Being at the IRS’


‘Wait Times Are Down, But IRS Still Faces Challenges’

Overall, we found that at least 10% of headlines were classified as clickbait on a majority of sites we looked at. We also found that overall, clickbait does not appear to be any more or less common now than it was in June 2015.


Using lexicon-based heuristics we were able to identify many instances of bias in headlines

Identifying bias in headlines is a much harder problem than finding clickbait. One research group from Stanford approached detecting bias as a machine learning problem — they trained a classifier to recognize when Wikipedia edits did or did not reflect a neutral point of view, as identified by thousands of human Wikipedia editors. While Wikipedia edits and headlines differ in some pretty important ways, using their feature set was informative. They developed a lexicon of suspect words, curated from decades of research on biased language. Consider the use of the root word “accuse,” as in this example we found from Time Magazine:

‘Roger Ailes Resigns From Fox News’


‘Roger Ailes Resigns From Fox News Amid Sexual Harassment Accusations’

The first headline just offers the “who” and “what” of the news story — the second headline’s “accusations” add the much more attention-grabbing “why.” Some language use is more subtle, like in this example from Fox News:

‘DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data’


‘DNC reportedly punishes Sanders campaign for accessing Clinton voter data’

The facts implied by these headlines are different in a very important way. The second headline, unlike the first, can cause a reader to presuppose that the Sanders campaign did do something wrong or malicious, since they are being punished. The first headline hedges the story significantly, only saying that the Sanders campaign may have done something “improper”; the truth of that proposition is not suggested. The researchers identify this as a bias of entailment.

Using a modified version of the biased-language lexicon, we looked at our own corpus of headlines and identified when headline variations added or dropped these biased words. We found approximately 3000 articles in which headline variations for the same article used different biased words, which you can look at here. From our data collection we clearly have evidence of editorial bias playing a role in the different headlines we see on news sites.
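The heuristic can be sketched roughly as follows. This is a simplified illustration rather than our actual code: `BIAS_LEXICON` is a tiny hypothetical stand-in for the real lexicon, it matches surface forms instead of root words, and the function names are invented.

```python
import re

# Hypothetical stand-in for the biased-language lexicon; the real one is
# far larger and is keyed on root words rather than surface forms.
BIAS_LEXICON = {"accuse", "accuses", "accusations", "punishes", "improperly"}


def biased_words(headline, lexicon=BIAS_LEXICON):
    """Return the lexicon words that appear in a headline."""
    tokens = re.findall(r"[a-z']+", headline.lower())
    return {token for token in tokens if token in lexicon}


def differential_bias(headline_a, headline_b, lexicon=BIAS_LEXICON):
    """Return (words only in A, words only in B) among lexicon hits."""
    a = biased_words(headline_a, lexicon)
    b = biased_words(headline_b, lexicon)
    return a - b, b - a


if __name__ == "__main__":
    only_first, only_second = differential_bias(
        "DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data",
        "DNC reportedly punishes Sanders campaign for accessing Clinton voter data",
    )
    print(sorted(only_first), sorted(only_second))
```

A pair of headlines is flagged when either returned set is non-empty, that is, when one variation uses a lexicon word the other does not.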


Detecting all instances of bias and avoiding false positives is an open research problem

While identifying bias in 3000 articles’ headlines is a start, we think we’ve identified only a fraction of biased articles. One reason for missing bias is that our heuristic defines differential bias narrowly (explicit use of biased words in one headline not present in another). There are also false positives in the headlines that we detected as biased. For instance, an allegation or an accusation might show a lack of neutrality in a Wikipedia article, but in a news story an allegation or accusation may simply be the story.

We know for sure that our data contains evidence of editorial and biased variations in headlines, but we still have a long way to go. We would like to be able to identify at scale and with high confidence when a news outlet experiments with its headlines. But there are many obstacles compared to the previous work on identifying bias in Wikipedia edits:

– Without clear guidelines, finding bias in headlines is a more subjective exercise than finding it in Wikipedia articles.

– Headlines are more information-dense than Wikipedia articles: a headline’s implication rests on far fewer words.

– Many of the stories that the news publishes are necessarily more political than most Wikipedia articles.

If you have any ideas on how to overcome these obstacles, we invite you to reach out to us or take a look at the data yourself, available for download here.


[1] Our main measurements ignore subpages like, which appear to often have article links that are at some point featured on the front page. For each snapshot of the front page, we collected the links to articles seen on the page along with the ‘anchor text’ of those links, which are generally the headlines that are being varied.

Sloppy Reporting on the "University Personal Records" Data Breach by the New York Times Bits Blog

This morning I ran across a distressing headline while perusing my RSS feeds. The New York Times’ Bits Blog proclaimed that, “Hackers Breach 53 Universities and Dump Thousands of Personal Records Online.” I clicked, and was informed that:

Hackers published online Monday thousands of personal records from 53 universities, including Harvard, Stanford, Cornell, Princeton, Johns Hopkins, the University of Zurich and other universities around the world.

I stifled the instinct to do a spit-take with my morning cup of coffee.

Did the Sanford E-Mail Tipster or the Newspaper Break the Law?

Part of me doesn’t want to comment on the Mark Sanford news, because it’s all so tawdry and inconsistent with the respectable, family-friendly tone of Freedom to Tinker. But since everybody from the Gray Lady on down is plastering the web with stories, and because all of this reporting is leaving unanalyzed some Internet law questions, let me offer this:

On Wednesday, after Sanford’s confessional press conference, The State, the largest newspaper in South Carolina, posted email messages appearing to be love letters between the Governor and his mistress. (The paper obscured the name of the mistress, calling her only “Maria.”) The paper explained in a related news story that they had received these messages from an anonymous tipster back in December, but until yesterday’s unexpected corroboration of their likely authenticity, they had just sat on them.

Did the anonymous tipster break the law by obtaining or disclosing the email messages? Did the paper break the law by publishing them? After the jump, I’ll offer my take on these questions.

Three disclaimers: First, the paper has not yet revealed (and may not even know) most of the important facts I would need to know to thoroughly analyze whether a law has been broken. Like a first-year law student, I am trying to spot legal issues that will turn on what might be the facts. Second, I know nothing about the law of South Carolina (or, for that matter, Argentina). I am analyzing three specific federal laws with which I am very familiar. Third, I am barely scratching the surface of some very complex laws.

The Anonymous Tipster

Let’s start with the anonymous tipster (AT). AT might have broken three federal laws, depending on who AT is and how he or she obtained the messages. First, the Stored Communications Act (SCA) prohibits unauthorized access to a “facility through which an electronic communication service is provided” to obtain messages “in electronic storage.” In a separate provision, the SCA prohibits providers from disclosing the content of user communications. Second, the Wiretap Act prohibits the interception of electronic communications and the disclosure and use of illegally intercepted communications. Third, the Computer Fraud and Abuse Act (CFAA) prohibits certain types of unauthorized conduct on computers and computer networks.

All three of these laws provide both civil remedies (Maria, Sanford, or an affected ISP can sue the anonymous tipster for damages) and criminal prohibitions. So should AT worry about jail or a hefty fine? Probably not, but it turns on who AT turns out to be.

What if AT turns out to be Maria herself? Even putting to one side whether these laws apply outside the U.S., she almost certainly would not have broken any of them. Each of these laws provides an exception or defense for consent of the communicating party or authorization of the email account owner. To take one example, under the SCA it is not illegal for the owner of an email account to access or disclose his or her email messages.

These defenses would also protect AT if he turns out, in a bizarre twist, to be Sanford himself.

For the same reasons, AT probably did not break these laws if it turns out Maria or Sanford intentionally disclosed the email messages to AT, perhaps a friend or acquaintance or employee, who then passed them on to the newspaper. This is probably true even if Maria or Sanford asked AT to promise to protect the secret. As in other parts of the law, misplaced trust is no defense under these three laws.

But now we get to more difficult cases. What if AT is a friend or acquaintance or employee of Maria or Sanford who had access to Maria’s or Sanford’s email account, but did not have specific permission to access these particular messages? For example, what if AT was Sanford’s secretary, a person likely to have permission to view his inbox? On these facts, the case against AT would turn on hard questions of authorization. Did Sanford or Maria limit AT’s authorized access to the inbox? If so, how? With written rules, technological access controls, or vague admonitions? Courts have interpreted the word “authorization” in the CFAA, in particular, quite narrowly, ruling that otherwise-authorized users may no longer act with authorization once they violate rules or contractual promises. (This is the legal theory being advanced by DOJ in the Lori Drew CFAA prosecution.)

Next, what if AT works for an ISP—perhaps on the IT staff for the State of South Carolina or for a commercial email provider? In this case, AT should worry a little more. Although ISPs tend to have many legal reasons to access the content of communications stored on their servers or passing through their wires, this authority is not unlimited, as I have written about elsewhere. The ISP employee’s liability or culpability will turn on factors like terms of service and motive. For example, if the employee stumbled upon the messages during routine server maintenance, there may be a good defense.

The Newspaper

Lastly, let’s turn to the newspaper, The State. First, if AT did not break any of these laws by obtaining or disclosing the messages, then the newspaper likewise did not break any of these laws by publishing them.

Even if AT has broken the CFAA or SCA, the newspaper probably has no downstream liability for its subsequent publication. These two laws focus on initial access or disclosure, not on subsequent, downstream uses and disclosures.

The Wiretap Act, on the other hand, restricts the downstream use and disclosure of illegally intercepted communications. Here, however, the First Amendment probably provides a defense.

In Bartnicki v. Vopper, the Supreme Court held that the First Amendment shields the media from liability for the publication of content illegally intercepted under the Wiretap Act if the content is “about a matter of public concern.” Granted, the private communications in Bartnicki (a phone call between a union negotiator and the union’s president about the status of negotiations) seem more a matter of public concern and less private than the intimate love letters between a politician and his mistress. But I am no First Amendment expert, so I will leave it to others to decide how these facts fare under Bartnicki. To my nonexpert eye, given the sweeping language both in Bartnicki and in the cases cited by Bartnicki (starting with New York Times v. Sullivan), it seems that the First Amendment shield applies here.

Final Thought: So, Who is the Tipster?

Finally, Sanford or Maria might sue the newspaper and AT (as a so-called “John Doe” defendant) in order to discover AT’s identity. A plaintiff in a civil lawsuit can ask a judge to order a subpoena to discover an unknown defendant’s identity. No doubt, the newspaper would fight such a subpoena vigorously, but whether or not it would succeed is a topic for another day.