May 25, 2017

All the News That’s Fit to Change: Insights into a corpus of 2.5 million news headlines

[Thanks to Joel Reidenberg for encouraging this deeper dive into news headlines!]

There is no guarantee that a news headline you see online today will not change tomorrow, or even in the next hour, or will even be the same headlines your neighbor sees right now. For a real-life example of the type of change that can happen, consider this explosive headline from NBC News…

“Bernanke: More Execs Deserved Jail for Financial Crisis”

…contrasted with the much more subdued…

“Bernanke Thinks More Execs Should Have Been Investigated”

These headlines clearly suggest different stories, which is worrying because of the effect that headlines have on our perception of the news — a recent survey found that, “41 percent of Americans report that they watched, read, or heard any in-depth news stories, beyond the headlines, in the last week.”

As part of the Princeton Web Transparency and Accountability Project (WebTAP), we wanted to understand more about headlines. How often do news publishers change headlines on articles? Do variations offer different editorial slants on the same article? Are some variations ‘clickbait-y’?

To answer these questions we collected over ~1.5 million article links seen since June 1st, 2015 on 25 news sites’ front pages through the Internet Archive’s Wayback Machine. Some articles were linked to with more than one headline (at different times or on different parts of the page), so we ended up with a total of ~2.5 million headlines.[1] To clarify, we are defining headlines as the text linking to articles on the front page of news websites — we are not talking about headlines on the actual article pages themselves. Our corpus is available for download here. In this post we’ll share some preliminary research and outline further research questions.

 

One in four articles had more than one headline associated with it

We were limited in our analysis to how many snapshots of the news sites the Wayback Machine took. For the six months of data from 2015 especially, some of the less-popular news sites did not have as many daily snapshots as the more popular sites — the effect of this might suppress the measure of headline variation on less popular websites. Even so, we were able to capture many instances of articles with multiple headlines for each site we looked at.

 

Clickbait is common, and hasn’t changed much in the last year

We took a first pass at our data using an open source library to classify headlines as clickbait. The classifier was trained by the developer using Buzzfeed headlines as clickbait and New York Times headlines as non-clickbait, so it can more accurately be called a Buzzfeed classifier. Unsurprisingly then, Buzzfeed had the most clickbait headlines detected of the sites we looked at.

But we also discovered that more “traditional” news outlets regularly use clickbait headlines too. The Wall Street Journal, for instance, has used clickbait headlines in place of more traditional headlines for its news stories, as in two variations they tried for an article on the IRS:

‘Think Tax Time Is Tough? Try Being at the IRS’

vs.

‘Wait Times Are Down, But IRS Still Faces Challenges’

Overall, we found that at least 10% of headlines were classified as clickbait on a majority of sites we looked at. We also found that overall, clickbait does not appear to be any more or less common now than it was in June 2015.

 

Using lexicon-based heuristics we were able to identify many instances of bias in headlines

Identifying bias in headlines is a much harder problem than finding clickbait. One research group from Stanford approached detecting bias as a machine learning problem — they trained a classifier to recognize when Wikipedia edits did or did not reflect a neutral point of view, as identified by thousands of human Wikipedia editors. While Wikipedia edits and headlines differ in some pretty important ways, using their feature set was informative. They developed a lexicon of suspect words, curated from decades of research on biased language. Consider the use of the root word “accuse,” as in this example we found from Time Magazine:

‘Roger Ailes Resigns From Fox News’

vs.

‘Roger Ailes Resigns From Fox News Amid Sexual Harassment Accusations’

The first headline just offers the “who” and “what” of the news story — the second headline’s “accusations” add the much more attention-grabbing “why.” Some language use is more subtle, like in this example from Fox News:

‘DNC reportedly accuses Sanders campaign of improperly accessing Clinton voter data’

vs.

‘DNC reportedly punishes Sanders campaign for accessing Clinton voter data’

The facts implied by these headlines are different in a very important way. The second headline, unlike the first, can cause a reader presuppose that the Sanders campaign did do something wrong or malicious, since they are being punished. The first headline hedges the story significantly, only saying that the Sanders campaign may have done something “improper” — the truth of that proposition is not suggested. The researchers identify this as a bias of entailment.

Using a modified version of the biased-language lexicon, we looked at our own corpus of headlines and identified when headline variations added or dropped these biased words. We found approximately 3000 articles in which headline variations for the same article used different biased words, which you can look at here. From our data collection we clearly have evidence of editorial bias playing a role in the different headlines we see on news sites.

 

Detecting all instances of bias and avoiding false positives is an open research problem

While identifying bias in 3000 articles’ headlines is a start, we think we’ve identified only a fraction of biased articles. One reason for missing bias is that our heuristic defines differential bias narrowly (explicit use of biased words in one headline not present in another). There are also false positives in the headlines that we detected as biased. For instance, an allegation or an accusation might show a lack of neutrality in a Wikipedia article, but in a news story an allegation or accusation may simply be the story.

We know for sure that our data contains evidence of editorial and biased variations in headlines, but we still have a long way to go. We would like to be able to identify at scale and with high confidence when a news outlet experiments with its headlines. But there are many obstacles compared to the previous work on identifying bias in Wikipedia edits:

– Without clear guidelines, finding bias in headlines is a more subjective exercise than finding it in Wikipedia articles.

– Headlines are more information-dense than Wikipedia articles — fewer words in headlines contribute to a headline’s implication.

– Many of the stories that the news publishes are necessarily more political than most Wikipedia articles.

If you have any ideas on how to overcome these obstacles, we invite you to reach out to us or take a look at the data yourself, available for download here.

@dillonthehuman

[1] Our main measurements ignore subpages like nytimes.com/pages/politics, which appear to often have article links that are at some point featured on the front page. For each snapshot of the front page, we collected the links to articles seen on the page along with the ‘anchor text’ of those links, which are generally the headlines that are being varied.