December 14, 2024

We are releasing three longitudinal datasets of Yelp review recommendations with over 2.5M unique reviews.

By Ryan Amos, Roland Maio, and Prateek Mittal

Online reviews are an important source of consumer information, play an important role in consumer protection, and have a substantial impact on businesses’ economic outcomes. Some of these reviews may be problematic: for example, incentivized reviews, reviews written under a conflict of interest, irrelevant reviews, and entirely fabricated reviews. To address this problem, many review platforms develop systems to determine which reviews to show to users. Little is known about how such online review recommendations change over time.

We introduce a novel dataset of Yelp reviews to study these changes, which we call reclassification. Studying reclassification can help assess the validity of prior work that depends on Yelp’s labels, evaluate the existing classifier, and shed light on the fairly opaque process of review recommendation.

Data Overview

Our data was collected from Yelp between 2020 and 2021 and contains reviews that Yelp classifies as “Recommended” and “Not Recommended,” with a total of 2.2 million reviews described in 12.5 million data points. Our data release consists of three datasets: a small dataset with an eight-year span (when combined with prior work), a large dataset concentrated in the Chicago area, and a large dataset spread across the US and stratified by population density and household income.

The data is pseudonymized to protect reviewer privacy, and the analyses in our corresponding paper can be reproduced with the pseudonymous data.
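To make the idea of reclassification concrete, here is a minimal sketch of how repeated observations of the same review could be compared over time. The record layout (a pseudonymous review ID, a crawl date, and a label) and the function name find_reclassifications are assumptions made for illustration only. They are not the schema of the released datasets, and the sample records are invented.

```python
from collections import defaultdict

# Hypothetical observations: (pseudonymous review ID, crawl date, label).
# Field layout and values are illustrative, not the released schema.
observations = [
    ("r1", "2020-03-01", "Recommended"),
    ("r1", "2021-03-01", "Not Recommended"),
    ("r2", "2020-03-01", "Recommended"),
    ("r2", "2021-03-01", "Recommended"),
    ("r3", "2020-03-01", "Not Recommended"),
    ("r3", "2020-09-01", "Recommended"),
    ("r3", "2021-03-01", "Recommended"),
]


def find_reclassifications(observations):
    """Map each review whose label changed to its list of
    (date_of_change, old_label, new_label) transitions."""
    by_review = defaultdict(list)
    for review_id, crawl_date, label in observations:
        by_review[review_id].append((crawl_date, label))

    changes = {}
    for review_id, points in by_review.items():
        points.sort()  # ISO-8601 date strings sort chronologically
        transitions = [
            (date, prev_label, label)
            for (_, prev_label), (date, label) in zip(points, points[1:])
            if prev_label != label
        ]
        if transitions:
            changes[review_id] = transitions
    return changes


reclassified = find_reclassifications(observations)
```

In this toy input, reviews r1 and r3 each change label once and r2 never changes, so only r1 and r3 appear in the result. Because each review contributes one data point per observation, the number of data points can be much larger than the number of unique reviews.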

Obtaining Access

Please visit our website for more information on requesting access:

https://sites.google.com/princeton.edu/longitudinal-review-data