October 6, 2022

We are releasing three longitudinal datasets of Yelp review recommendations with over 2.5M unique reviews.

By Ryan Amos, Roland Maio, and Prateek Mittal

Online reviews are an important source of consumer information, play an important role in consumer protection, and have a substantial impact on businesses’ economic outcomes. Some of these reviews may be problematic: for example, incentivized reviews, reviews written under a conflict of interest, irrelevant reviews, and entirely fabricated reviews. To address this problem, many review platforms develop systems to determine which reviews to show to users. Little is known about how such online review recommendations change over time.

We introduce a novel dataset of Yelp reviews to study these changes, which we call reclassification. Studying reclassification can help us understand the validity of prior work that depends on Yelp’s labels, evaluate the existing classifier, and shed light on the fairly opaque process of review recommendation.

Data Overview

Our data is sourced from Yelp between 2020 and 2021 and contains reviews that Yelp classifies as “Recommended” and “Not Recommended,” with a total of 2.2 million reviews described in 12.5 million data points. Our data release consists of three datasets: a small dataset with an eight-year span (when combined with prior work), a large dataset concentrated in the Chicago area, and a large dataset spread across the US and stratified by population density and household income.

The data is pseudonymized to protect reviewer privacy, and the analyses in our corresponding paper can be reproduced with the pseudonymous data.
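The post does not describe the pseudonymization mechanism itself. As a minimal sketch of one common approach (purely illustrative; the key name and truncation length are assumptions, not details from the release), a keyed hash maps each reviewer identifier to a stable pseudonym, so longitudinal analyses like tracking reclassification still work while the original IDs stay hidden:

```python
import hmac
import hashlib

# Kept secret by the data curators; anyone without it cannot
# reverse pseudonyms back to reviewer IDs by brute-forcing a plain hash.
SECRET_KEY = b"replace-with-a-long-random-secret"

def pseudonymize(reviewer_id: str) -> str:
    """Return a stable, non-reversible pseudonym for a reviewer ID."""
    digest = hmac.new(SECRET_KEY, reviewer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

# The same ID always maps to the same pseudonym, so a reviewer's
# reviews can be linked across snapshots without revealing who they are.
print(pseudonymize("reviewer-123") == pseudonymize("reviewer-123"))  # True
print(pseudonymize("reviewer-123") == pseudonymize("reviewer-456"))  # False
```

Because the mapping is deterministic under a fixed key, the pseudonymous data supports exactly the kind of over-time comparisons the paper relies on.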

Obtaining Access

Please visit our website for more information on requesting access:


What should we do about re-identification? A precautionary approach to big data privacy

Computer science research on re-identification has repeatedly demonstrated that sensitive information can be inferred even from de-identified data in a wide variety of domains. This has posed a vexing problem for practitioners and policymakers. If the absence of “personally identifying information” cannot be relied on for privacy protection, what are the alternatives? Joanna Huey, Ed Felten, and I tackle this question in a new paper, “A Precautionary Approach to Big Data Privacy”. Joanna presented the paper at the Computers, Privacy & Data Protection conference earlier this year.

[Read more…]

My Bill to #OpenPACER in memory of #aaronsw – Open for Comment and Available on Github

I unveiled a draft bill at an event on Capitol Hill this week. It is drafted in Legislative XML, allows you to comment, and the code is available on GitHub. Here’s the video:

The Open PACER Act provides for free and open access to electronic federal court records. The courts currently offer an expensive and difficult-to-use website. They charge more than their cost of offering the service, which is more than Congress has authorized, violating the E-Government Act of 2002. This Act seeks to, once and for all, compel the courts to fulfill Congress’ longstanding vision of making this information “freely available to the greatest extent possible.”

More details are at openpacer.org. Twitter hashtag is #openpacer, of course.

Transcript after the jump.
[Read more…]

Smart Campaigns, Meet Smart Voters

Zeynep pointed to her New York Times op-ed, “Beware the Smart Campaign,” about political campaigns collecting and exploiting detailed information about individual voters. Given the emerging conventional wisdom that the Obama campaign’s technological superiority played an important role in the President’s re-election, we should expect more aggressive attempts to micro-target voters by both parties in future election cycles. Let’s talk about how voters might respond.
[Read more…]

My NYT Op-Ed: "Beware the Smart Campaign"

I just published a new opinion piece in the New York Times, entitled “Beware the Smart Campaign”. I react to the Obama campaign’s successful use of highly quantitative voter targeting that is inspired by “big data” commercial marketing techniques and implemented through state-of-the-art social science knowledge and randomized field experiments. In the op-ed, I wonder whether the “persuasion score” strategy championed by Jim Messina, Obama’s campaign manager, is on balance good for democracy in the long run.

Mr. Messina is understandably proud of his team, which included an unprecedented number of data analysts and social scientists. As a social scientist and a former computer programmer, I enjoy the recognition my kind are getting. But I am nervous about what these powerful tools may mean for the health of our democracy, especially since we know so little about it all.

For all the bragging on the winning side — and an explicit coveting of these methods on the losing side — there are many unanswered questions. What data, exactly, do campaigns have on voters? How exactly do they use it? What rights, if any, do voters have over this data, which may detail their online browsing habits, consumer purchases and social media footprints?

You can read the full article here.

The argument in an op-ed is necessarily concise and leaves out much of the nuance, but I think this is an important question facing democracies. The key to my argument is that big data analytics plus better social science isn’t just the same old, same old; it poses novel threats to healthy public discourse. I welcome feedback and comments as we are just starting to grapple with these new developments!