October 22, 2020

Facial recognition datasets are being widely used despite being taken down due to ethical concerns. Here’s how.

This post describes ongoing research by Kenny Peng, Arunesh Mathur, and Arvind Narayanan. We are grateful to Marshini Chetty for useful feedback.

Computer vision research datasets have been criticized for violating subjects’ privacy, reinforcing cultural biases, and enabling questionable applications. But regulating their use is hard.

For example, although the DukeMTMC dataset of videos recorded on Duke’s campus was taken down in June 2019 following a backlash, the data continues to be used by other researchers. We found at least 135 papers that use this data and were published after the takedown, many of them in the field’s most prestigious conferences. Worse, we found that at least 116 of these papers used “derived” datasets, which reuse data from the original source. In particular, the DukeMTMC-ReID dataset remains popular in the field of person re-identification and continues to be free for anyone to download.

The case of DukeMTMC illustrates the challenges of regulating a dataset’s usage in light of ethical concerns, especially when the data is separately available in derived datasets. In this post, we reveal how these problems are endemic and not isolated to this dataset.

Background: Why was DukeMTMC criticized?

DukeMTMC received criticism on two fronts following investigations by MegaPixels and The Financial Times. Firstly, the data collection deviated from IRB guidelines in two respects — the recordings were done outdoors and the data was made available without protections. Secondly, the dataset was being used in research with applications to surveillance, an area which has drawn increased scrutiny in recent years.

The backlash toward DukeMTMC was part of growing concerns that the faces of ordinary people were being used without permission to serve questionable ends.

Following its takedown, data from DukeMTMC continues to be used

In response to the backlash, the author of DukeMTMC issued an apology and took down the dataset. It is one of several datasets that have been removed or modified due to ethical concerns. But the story doesn’t end there. In the case of DukeMTMC, the data had already been copied into other derived datasets, which reuse data from the original with some modifications. These include DukeMTMC-SI-Tracklet, DukeMTMC-VideoReID, and DukeMTMC-ReID. Although some of these derived datasets were also taken down, others, like DukeMTMC-ReID, remain freely available.

Yet the data isn’t just available — it continues to be used prominently in academic research. We found 135 papers that use DukeMTMC or its derived datasets. These papers were published in such venues as CVPR, AAAI, and BMVC — some of the most prestigious conferences in the field. Furthermore, at least 116 of these used data from derived datasets, showing that regulating a given dataset also requires regulating its derived counterparts.

Together, the availability of the data and the willingness of researchers and reviewers to allow its use have made the removal of DukeMTMC only a cosmetic response to ethical concerns.

This set of circumstances is not unique to DukeMTMC. We found the same pattern for the MS-Celeb-1M dataset, which was removed by Microsoft in 2019 after receiving criticism. The dataset lives on through several derived datasets, including MS1M-IBUG, MS1M-ArcFace, and MS1M-RetinaFace, each publicly available for download. The original dataset is also available via Academic Torrents. We also found that, like DukeMTMC, this data remains widely used in academic research.

Derived datasets can enable unintended and unethical research

In the case of DukeMTMC, the most obvious ethical concern may have been that the data was collected unethically. However, a second concern — that DukeMTMC was being used for ethically questionable research, namely surveillance — is also relevant to datasets that are collected responsibly.

Even if a dataset was created for benign purposes, it may have uses in more questionable areas. Oftentimes, these uses are enabled by a derived dataset. This was the case for DukeMTMC. The authors of the DukeMTMC dataset note that they have never conducted research in facial recognition, and that the dataset was not intended for this purpose. However, the dataset turned out to be particularly popular for the person re-identification problem, which has drawn criticism for its applications to surveillance. This usage was enabled by datasets like DukeMTMC-ReID, which tailored the original dataset specifically for this problem.

Also consider the SMFRD dataset, which was released soon after the COVID-19 pandemic took hold. The dataset contains masked faces, including images from the popular Labeled Faces in the Wild (LFW) dataset with face masks superimposed. The ethics of masked face recognition is a question for another day, but we point to SMFRD as evidence of the difficulty of anticipating future uses of a dataset. Released more than 12 years after LFW, SMFRD was created in a very different societal context.
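To make concrete how lightly a derived dataset can transform the original images, here is a toy sketch (in Python, using the Pillow library) that pastes a transparent mask image over the lower half of a face crop. The file paths are placeholders, and this is not SMFRD’s actual pipeline; a realistic one would align the mask to detected facial landmarks.

```python
from PIL import Image

# Placeholder paths: one LFW-style face crop and a mask image with a
# transparent background. Neither path refers to a real file here.
face = Image.open("lfw/Example_Person/Example_Person_0001.jpg").convert("RGB")
mask = Image.open("surgical_mask.png").convert("RGBA")

# Naive placement: stretch the mask over the lower half of the face crop.
# A realistic pipeline would align the mask to detected facial landmarks.
w, h = face.size
mask = mask.resize((w, h // 2))
face.paste(mask, (0, h // 2), mask)  # the alpha channel doubles as the paste mask
face.save("masked_example.jpg")
```

The point is simply that the underlying photographs are reused as-is; the derived dataset adds only a thin layer of modification on top of the original data.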

It is difficult for a dataset’s authors to anticipate harmful uses of their dataset — especially uses that may arise in the future. However, we do suggest that authors can reasonably anticipate that their dataset has the potential to contribute to unethical research, and accordingly think about how they might restrict their dataset upon release.

Derived datasets are widespread and unregulated

In the few years that DukeMTMC was available, it spawned several derived datasets. MS-Celeb-1M has likewise been incorporated into several derived datasets.

More popular datasets can spawn even more derived counterparts. For instance, we found that LFW has been used in at least 14 derived datasets, 7 of which make their data freely available for download. We found these datasets through a semi-manual analysis of papers citing LFW, and we suspect that many more derived datasets of LFW exist.
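The screening step of such a search can be partly automated. The sketch below is only an illustration of one way to do this, not the exact procedure behind our analysis: it pulls the list of citing papers from the Semantic Scholar Graph API and flags those whose titles or abstracts mention dataset-related keywords. The paper identifier and keyword list are placeholders.

```python
import requests

# Placeholder paper identifier for the LFW technical report; substitute a real
# Semantic Scholar paper ID or DOI for the dataset paper of interest.
PAPER_ID = "PLACEHOLDER_LFW_PAPER_ID"
URL = f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}/citations"
KEYWORDS = ["dataset", "benchmark", "annotated", "masked"]  # illustrative only

candidates, offset = [], 0
while True:
    resp = requests.get(URL, params={"fields": "title,abstract,year",
                                     "offset": offset, "limit": 100})
    resp.raise_for_status()
    batch = resp.json().get("data", [])
    if not batch:
        break
    for item in batch:
        paper = item.get("citingPaper", {})
        text = " ".join(filter(None, [paper.get("title"), paper.get("abstract")]))
        if any(k in text.lower() for k in KEYWORDS):
            candidates.append((paper.get("year"), paper.get("title")))
    offset += 100

# Each flagged paper still has to be read by hand to confirm that it actually
# releases a derived dataset rather than merely citing LFW.
for year, title in sorted(candidates, key=lambda c: c[0] or 0):
    print(year, title)
```

The manual step is unavoidable: keyword matching over abstracts only narrows the candidate pool, which is why we describe our analysis as semi-manual.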

Before one can even think about how to regulate derived datasets, there is a more basic problem: under the present circumstances, it is challenging simply to know which derived datasets exist.

For both DukeMTMC and LFW, the authors lack control over these derived datasets. Unlike some other datasets, neither requires users to provide any information to the authors before obtaining the data. The authors also lack control via licensing. DukeMTMC was released under the CC BY-NC-SA 4.0 license, which allows sharing and adapting the dataset as long as the use is non-commercial and attribution is given. The LFW dataset was released without any license at all.

Implications

Though regulating data is notoriously difficult, we suggest steps that the academic community can take in response to the concerns outlined above.

In light of ethical concerns, taking down a dataset is often an inadequate way to prevent its further use. Derived datasets should also be identified and taken down. Even more importantly, researchers should subsequently not use these datasets, and journals and conferences should make clear that they will not accept papers that do. Similarly to how NeurIPS now requires a broader impact statement, we suggest requiring a statement listing and justifying the datasets used in a paper.

At the same time, more efforts should be made to regulate dataset usage from the outset, particularly with respect to the creation of derived datasets. There is a need to keep track of where a dataset’s data is available, as well as to regulate the creation of derived datasets that enable unethical research. We suggest that authors consider more restrictive licenses and distribution practices when releasing their dataset.

Federal judge denies injunction, so 7 states won’t be forced to accept internet ballot return

In the case of Harley v. Kosinski, Matthew Harley (and 9 other individuals) sued the election officials of 7 states (New York, Pennsylvania, Ohio, Texas, Kentucky, Wisconsin, and Georgia). The Plaintiffs, U.S. citizens living abroad, said that voting by mail (from abroad) has become so slow and unreliable that these states should be forced to let them vote by internet.

The lawsuit was filed September 30, 2020, requesting a preliminary injunction requiring online ballot return. The state defendants responded in writing by (the Court’s deadline of) October 9. On October 13, Federal district judge Brian Cogan denied the plaintiffs’ motion for a preliminary injunction.

Each of the seven states filed a reply brief arguing (as usual for preliminary injunctions) that the plaintiffs lack standing, that they’re suing the wrong parties, that they have not established a clear likelihood of success on the merits, and that they have not demonstrated irreparable harm.

I will summarize New York State’s reply brief; the other states made similar arguments.

Lack of standing: the New-York-resident plaintiff “cannot establish an injury in fact that is traceable to any challenged conduct of the New York State Defendants”. Mr. Harley is “concerned” that his completed ballot will not be received on time. It was mailed to him on September 18, but he does not say when he received it or when he mailed it back. His “concern” is not an “actual” or “imminent” injury.

Sued the wrong parties: Election officials are just following state law, which does not provide them the discretion to permit internet ballot return. Go sue the post office.

No likelihood of success on the merits: it serves a compelling state interest to avoid internet voting:

  1. The secret ballot is a compelling state interest, to protect voters from intimidation and (vote-buying) fraud. Internet voting cannot protect the secrecy of the ballot.
  2. The security of the voting process is a compelling state interest. “As set forth in the Declarations of Professor Appel, Susan Greenhalgh, Barbara Simons, and David Jefferson . . . there is a broad consensus within the scientific community that the return of ballots via the internet or by fax is not secure and creates a high-risk threat to the integrity of the election process and should not be used in voting now or in the foreseeable future.”

No demonstration of irreparable harm: “the speculative harms identified by … Mr. Harley [and the other N.Y. plaintiff] are partially self-imposed. Their ballots were emailed to each of them in September, but they have yet to mail them back … because of their subjective ‘concerns’.”

Well, indeed, I would be concerned too. Mail service is slower this year, and it may be true (as plaintiffs allege) that international mail is even slower and less reliable. But allowing internet voting, which can be hacked from anywhere and everywhere, cannot be the solution. To reason that “we really want this, so there must be some way to make it secure” is magical thinking.

My own declaration played a (very small) role in this; it was filed by New York in support of their reply brief that I have summarized above.

Judge Cogan explained his ruling as follows (paraphrase):

  • Second Circuit case law imposes a very high hurdle for a preliminary injunction that imposes a mandate on state government. There must be a strong showing of irreparable harm and a clear showing of likely success on the merits.
  • Plaintiffs could not show jurisdiction over the six states other than New York. That mail might pass through the JFK International Processing Center was not a sufficient basis for jurisdiction.
  • Plaintiffs could not show standing because none showed an injury in fact, only a speculative chain of possibility.
  • Judge Cogan referred to the Purcell principle, that courts must be extremely cautious before granting injunctive relief on the eve of an election.
  • The US Constitution does not guarantee overseas voters the right to vote, and overseas voters do not have a constitutional right to a particular method of returning a ballot beyond what Congress authorized in UOCAVA. Voters do not take their right to vote with them when they move abroad.
  • A change this close to the election would undermine voter confidence in the system. States need time to set up new systems, and he acknowledged potential security risks, which themselves have a significant effect on voters’ confidence. Recent problems that New York experienced in implementing its expansion of absentee voting underscore the concern that any court-ordered changes could be difficult to implement.

A few hours after the Court denied the preliminary injunction, the Plaintiffs moved to dismiss the case, without prejudice. So I guess that’s that.

Election Security and Transparency in 2020

Earlier this month I gave a public lecture at the invitation of the Center for Information Technology Policy and the League of Women Voters. The League had asked, “What can we as voters do to protect our elections and our representative government?”

The video is available here. A longer video, that includes introductions, Q&A moderated by the LWV, and some remarks by the Union County (NJ) Administrator of Elections, is available here.

First, I talk about how the principles of security, transparency, the secret ballot, and trustworthiness were built into American election procedures more than 100 years ago; how computerized voting machines affect these principles; and how the best solution is optical-scan paper ballots, counted by computers but recountable by hand and checked with risk-limiting audits.
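To give a flavor of the risk-limiting audits mentioned above, here is a minimal sketch of the BRAVO ballot-polling test for a simplified two-candidate contest; real audits handle multiple candidates, invalid ballots, random sampling design, and escalation rules, none of which appear here.

```python
def bravo_ballot_polling(sample, reported_winner_share, risk_limit=0.05):
    """Simplified BRAVO ballot-polling risk-limiting audit (two candidates).

    sample: sampled paper ballots, each "winner" or "loser" according to
        which candidate the ballot actually shows.
    reported_winner_share: the reported winner's share of the two-candidate
        vote; must be greater than 0.5.
    Returns True if the sample confirms the reported outcome at the given
    risk limit; False means keep sampling or escalate to a full hand count.
    """
    s = reported_winner_share
    T = 1.0  # running evidence that the reported outcome is correct
    for ballot in sample:
        if ballot == "winner":
            T *= s / 0.5          # ballot supports the reported outcome
        else:
            T *= (1.0 - s) / 0.5  # ballot cuts against it
        if T >= 1.0 / risk_limit:
            return True           # enough evidence; the audit can stop
    return False


# Example: a 55% reported winner and a (hypothetical) random sample of
# hand-inspected paper ballots.
sample = ["winner"] * 300 + ["loser"] * 240
print(bravo_ballot_polling(sample, reported_winner_share=0.55))  # True
```

The point of the sketch is the stopping rule: each sampled ballot for the reported winner strengthens the evidence, each ballot for the loser weakens it, and the audit stops only when the accumulated evidence exceeds 1/risk-limit, which bounds the chance of confirming a wrong outcome.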

Next, starting at 12:50 (or 15:36 in the longer video), I talk about Ballot-Marking Devices, and their particular insecurity compared to hand-marked optical-scan ballots.

Starting at 20:18 (or 35:46 in the longer video), I talk about voting during the pandemic, which in many states particularly means Vote By Mail. How do election officials make the processing of absentee ballots secure and transparent, so that we (the public) can trust that it’s secure? I explain how vote-by-mail works (especially in NJ), how we, the public, should vote in the year 2020, and why in some states we should really vote in person.