June 18, 2024

Facial recognition datasets are being widely used despite being taken down due to ethical concerns. Here’s how.

This post describes ongoing research by Kenny Peng, Arunesh Mathur, and Arvind Narayanan. We are grateful to Marshini Chetty for useful feedback.

Computer vision research datasets have been criticized for violating subjects’ privacy, reinforcing cultural biases, and enabling questionable applications. But regulating their use is hard.

For example, although the DukeMTMC dataset of videos recorded on Duke’s campus was taken down in June 2019 due to a backlash, the data continues to be used by other researchers. We found at least 135 papers that use this data and were published after this date, many of which were in the field’s most prestigious conferences. Worse, we found that at least 116 of these papers used “derived” datasets, those datasets that reuse data from the original source. In particular, the DukeMTMC-ReID dataset remains a popular dataset in the field of person reidentification and continues to be free for anyone to download.

The case of DukeMTMC illustrates the challenges of regulating a dataset’s usage in light of ethical concerns, especially when the data is separately available in derived datasets. In this post, we reveal how these problems are endemic and not isolated to this dataset.

Background: Why was DukeMTMC criticized?

DukeMTMC received criticism on two fronts following investigations by MegaPixels and The Financial Times. Firstly, the data collection deviated from IRB guidelines in two respects — the recordings were done outdoors and the data was made available without protections. Secondly, the dataset was being used in research with applications to surveillance, an area which has drawn increased scrutiny in recent years.

The backlash toward DukeMTMC was part of growing concerns that the faces of ordinary people were being used without permission to serve questionable ends.

Following its takedown, data from DukeMTMC continues to be used

In response to the backlash, the author of DukeMTMC issued an apology and took down the dataset. It is one of several datasets that has been removed or modified due to ethical concerns. But the story doesn’t end here. In the case of DukeMTMC, the data had already been copied over into other derived datasets, which use data from the original with some modifications. These include DukeMTMC-SI-Tracklet, DukeMTMC-VideoReID, and DukeMTMC-ReID. Although some of these derived datasets were also taken down, others, like DukeMTMC-ReID, remain freely available.

Yet the data isn’t just available — it continues to be used prominently in academic research. We found 135 papers that use DukeMTMC or its derived datasets. These papers were published in such venues as CVPR, AAAI, and BMVC — some of the most prestigious conferences in the field. Furthermore, at least 116 of these used data from derived datasets, showing that regulating a given dataset also requires regulating its derived counterparts.

Together, the availability of the data, and the willingness of researchers and reviewers to allow its use, has made the removal of DukeMTMC only a cosmetic response to ethical concerns.

This set of circumstances is not unique to DukeMTMC. We found the same result for the MS-Celeb-1M dataset, which was removed by Microsoft in 2019 after receiving criticism. The dataset lives on through several derived datasets, including MS1M-IBUG, MS1M-ArcFace, and MS1M-RetinaFace — each, publicly available for download. The original dataset is also available via Academic Torrents. We also found that, like DukeMTMC, this data remains widely used in academic research.

Derived datasets can enable unintended and unethical research

In the case of DukeMTMC, the most obvious ethical concern may have been that the data was collected unethically. However, a second concern — that DukeMTMC was being used for ethically questionable research, namely surveillance — is also relevant to datasets that are collected responsibly.

Even if a dataset was created for benign purposes, it may have uses in more questionable areas. Oftentimes, these uses are enabled by a derived dataset. This was the case for DukeMTMC. The authors of the Duke MTMC dataset note that they have  never conducted research in facial recognition, and that the dataset was not intended for this purpose. However, the dataset turned out to be particularly popular for the person re-identification problem, which has drawn criticism for its applications to surveillance. This usage was enabled by datasets like DukeMTMC-ReID dataset, which tailored the original dataset specifically for this problem.

Also consider the SMFRD dataset, which was released soon after the COVID-19 pandemic took hold. The dataset contains masked faces, including those in the popular Labeled Faces in the Wild (LFW) dataset with facemasks superimposed. The ethics of masked face recognition is a question for another day, but we point to SMFRD as evidence of the difficulty of anticipating future uses of a dataset. Released more than 12 years after LFW, SMFRD was created in a very different societal context.

It is difficult for a dataset’s author to anticipate harmful uses of their dataset — especially those that may arise in the future. However, we do suggest that a dataset’s author can reasonably anticipate that their dataset has potential to contribute to unethical research, and accordingly, think about how they might restrict their dataset upon release.

Derived datasets are widespread and unregulated

In the few years that DukeMTMC was available, it spawned several derived datasets. MS-Celeb-1M has also been used in several derived datasets.

More popular datasets can spawn even more derived counterparts. For instance, we found that LFW has been used in at least 14 derived datasets, 7 of which make their data freely available for download. These datasets were found through a semi-manual analysis of papers citings LFW. We suspect that many more derived datasets of LFW exist. 

Before thinking about how one could regulate derived datasets, in the present circumstances, it is even challenging to know what derived datasets exist.

For both DukeMTMC and LFW, the authors lack control over these derived datasets. Neither requires giving any information to the authors prior to using the data, as is the case with some other datasets. The authors also lack control via licensing. DukeMTMC was released under the CC BY-NC-SA 4.0 license, which allows for sharing and adapting the dataset, as long as the use is non-commercial and attribution is given. The LFW dataset was released without a license entirely.


Though regulating data is notoriously difficult, we suggest steps that the academic community can take in response to the concerns outlined above.

In light of ethical concerns, taking down a dataset is often an inadequate method of preventing further use of a dataset. Derived datasets should also be identified and also taken down. Even more importantly, researchers should subsequently not use these datasets, and journals should assert that they will not accept papers using these datasets. Similarly to how NeurIPS is requiring a broader impact statement, we suggest requiring a statement listing and justifying any datasets used in a paper.

At the same time, more efforts should be made to regulate dataset usage from the outset, particularly with respect to the creation of derived datasets. There is a need to keep track of where a dataset’s data is available, as well as to regulate the creation of derived datasets that enable unethical research. We suggest that authors consider more restrictive licenses and distribution practices when releasing their dataset.