
Archives for 2020

New Jersey gets ballot-tracking only half right

Two months before the November 2020 election, I wrote about New Jersey’s plans for an almost-all-vote-by-mail election. One county’s Administrator of Elections told me:

New this year is ballot tracking offered on the NJ Division of Elections’ website.  The tracking numbers are not USPS tracking–they can’t tell you where inside the U.S. mail your ballot is–but the tracking system can tell the voter:  when the County Clerk cleared the absentee ballot for mailing to the voter; when it was received back from the voter by the BoE; whether the ballot was accepted or not. 

(September 12, 2020)

This makes a lot of sense. The voter would like to know whether their signature was accepted–or whether they forgot to sign it at all–so they can “cure” their ballot, or vote in person with a provisional ballot. The tracking system allows voters to look this up online. (The outer ballot-return envelope is preprinted with a bar code identifying the voter, so even if the voter forgot the inner envelope or improperly removed the “DO NOT REMOVE” certificate from the inner envelope, the tracking system has this information for the voter.)

Unfortunately, the State Division of Elections disabled this important feature of the tracking system. When I log in to track-my-ballot, the following message appears:

Due to historically high volume, ballots deposited in Secure Ballot Drop Box locations may take up to one week to show up as “Received” and Ballots sent via US Mail may take up to two weeks to show up as “Received” in the Track My Ballot tool.

Ballot status information (i.e. – received, etc.) is provided by the counties via an automated process. The amount of time it takes until updates post to the Track My Ballot tool may vary from county to county. A voter’s ballot status won’t be changed to “Accepted” or “Rejected” until after the certification of the Election, on November 20th. Please check back periodically for updates to your ballot status.

(viewed November 10, 2020)

The first paragraph–a delay in processing signatures at local elections offices–is forgivable this year. But the second paragraph–intentionally withholding from voters, until it’s too late, the information about whether their ballot was accepted–reflects a deliberate policy decision by the New Jersey Division of Elections, and it’s the wrong decision. It makes the tracking system practically useless to the voter.

CITP call for the postdoctoral track of the CITP Fellows Program 2021-22

The Center for Information Technology Policy (CITP) is an interdisciplinary center at Princeton University. The center is a nexus of expertise in technology, engineering, public policy, and the social sciences on campus. In keeping with the strong University tradition of service, the center’s research, teaching, and events address digital technologies as they interact with society.

CITP is seeking applications for the postdoctoral track of the CITP Fellows Program for 2021-22. It is for people who have recently received a Ph.D. in fields such as computer science, sociology, economics, political science, psychology, public policy, information science, communication, philosophy, and other related technology policy disciplines. In this application cycle, we especially welcome applicants with interests in: Artificial Intelligence (AI), Data Science, Blockchain, and Cryptocurrencies.

The goals of this fully-funded, in-residence program are to support people doing important research and policy engagement related to the center’s mission and to enrich the center’s intellectual life. Fellows typically will conduct research with members of the center’s community and engage in the center’s public programs. The Fellows Program provides freedom to pursue projects of interest and a stimulating intellectual environment.

Application review will begin in the middle of December 2020.

For more information about these positions, please see our Fellows Program webpage. If you’d like to go directly to the application, please click here.

Facial recognition datasets are being widely used despite being taken down due to ethical concerns. Here’s how.

This post describes ongoing research by Kenny Peng, Arunesh Mathur, and Arvind Narayanan. We are grateful to Marshini Chetty for useful feedback.

Computer vision research datasets have been criticized for violating subjects’ privacy, reinforcing cultural biases, and enabling questionable applications. But regulating their use is hard.

For example, although the DukeMTMC dataset of videos recorded on Duke’s campus was taken down in June 2019 due to a backlash, the data continues to be used by other researchers. We found at least 135 papers that use this data and were published after this date, many of them in the field’s most prestigious conferences. Worse, at least 116 of these papers used “derived” datasets: datasets that reuse data from the original source. In particular, the DukeMTMC-ReID dataset remains popular in the field of person reidentification and continues to be free for anyone to download.

The case of DukeMTMC illustrates the challenges of regulating a dataset’s usage in light of ethical concerns, especially when the data is separately available in derived datasets. In this post, we reveal how these problems are endemic and not isolated to this dataset.

Background: Why was DukeMTMC criticized?

DukeMTMC received criticism on two fronts following investigations by MegaPixels and The Financial Times. First, the data collection deviated from IRB guidelines in two respects — the recordings were made outdoors and the data was made available without protections. Second, the dataset was being used in research with applications to surveillance, an area that has drawn increased scrutiny in recent years.

The backlash toward DukeMTMC was part of growing concerns that the faces of ordinary people were being used without permission to serve questionable ends.

Following its takedown, data from DukeMTMC continues to be used

In response to the backlash, the author of DukeMTMC issued an apology and took down the dataset. It is one of several datasets that have been removed or modified due to ethical concerns. But the story doesn’t end here. In the case of DukeMTMC, the data had already been copied over into other derived datasets, which use data from the original with some modifications. These include DukeMTMC-SI-Tracklet, DukeMTMC-VideoReID, and DukeMTMC-ReID. Although some of these derived datasets were also taken down, others, like DukeMTMC-ReID, remain freely available.

Yet the data isn’t just available — it continues to be used prominently in academic research. We found 135 papers that use DukeMTMC or its derived datasets. These papers were published in such venues as CVPR, AAAI, and BMVC — some of the most prestigious conferences in the field. Furthermore, at least 116 of these used data from derived datasets, showing that regulating a given dataset also requires regulating its derived counterparts.

Together, the availability of the data and the willingness of researchers and reviewers to allow its use have made the removal of DukeMTMC only a cosmetic response to ethical concerns.

This set of circumstances is not unique to DukeMTMC. We found the same result for the MS-Celeb-1M dataset, which was removed by Microsoft in 2019 after receiving criticism. The dataset lives on through several derived datasets, including MS1M-IBUG, MS1M-ArcFace, and MS1M-RetinaFace — each publicly available for download. The original dataset is also available via Academic Torrents. We also found that, like DukeMTMC, this data remains widely used in academic research.

Derived datasets can enable unintended and unethical research

In the case of DukeMTMC, the most obvious ethical concern may have been that the data was collected unethically. However, a second concern — that DukeMTMC was being used for ethically questionable research, namely surveillance — is also relevant to datasets that are collected responsibly.

Even if a dataset was created for benign purposes, it may have uses in more questionable areas. Oftentimes, these uses are enabled by a derived dataset. This was the case for DukeMTMC. The authors of the DukeMTMC dataset note that they have never conducted research in facial recognition, and that the dataset was not intended for this purpose. However, the dataset turned out to be particularly popular for the person re-identification problem, which has drawn criticism for its applications to surveillance. This usage was enabled by datasets like DukeMTMC-ReID, which tailored the original dataset specifically to this problem.

Also consider the SMFRD dataset, which was released soon after the COVID-19 pandemic took hold. The dataset contains masked faces, including those in the popular Labeled Faces in the Wild (LFW) dataset with facemasks superimposed. The ethics of masked face recognition is a question for another day, but we point to SMFRD as evidence of the difficulty of anticipating future uses of a dataset. Released more than 12 years after LFW, SMFRD was created in a very different societal context.

It is difficult for a dataset’s author to anticipate harmful uses of their dataset — especially those that may arise in the future. However, a dataset’s author can reasonably anticipate that their dataset has the potential to contribute to unethical research, and accordingly think about how they might restrict it upon release.

Derived datasets are widespread and unregulated

In the few years that DukeMTMC was available, it spawned several derived datasets. MS-Celeb-1M has also been used in several derived datasets.

More popular datasets can spawn even more derived counterparts. For instance, we found that LFW has been used in at least 14 derived datasets, 7 of which make their data freely available for download. These datasets were found through a semi-manual analysis of papers citing LFW. We suspect that many more derived datasets of LFW exist.
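The automated half of such a screening can be as simple as string matching: scan each paper’s text for the names of known derived datasets, then manually review the hits. The sketch below illustrates this idea; the dataset-name list and paper snippets are hypothetical examples, not our actual corpus or tooling.

```python
import re

# Illustrative (not exhaustive) list of LFW-derived dataset names to screen for.
DERIVED_NAMES = ["SMFRD", "LFW-a", "CALFW", "CPLFW"]

def flag_derived_usage(paper_text: str) -> list:
    """Return the derived-dataset names mentioned in a paper's text."""
    hits = []
    for name in DERIVED_NAMES:
        # Word-boundary match so a name doesn't match inside a longer token.
        if re.search(r"\b" + re.escape(name) + r"\b", paper_text):
            hits.append(name)
    return hits

# Hypothetical paper snippets, for illustration only.
papers = {
    "paper_A": "We evaluate on CALFW and CPLFW, two LFW-derived benchmarks.",
    "paper_B": "Our method is trained on CASIA-WebFace only.",
}
for pid, text in papers.items():
    print(pid, flag_derived_usage(text))
```

Matches still require manual review: a paper may mention a dataset without using it, which is why we describe the analysis as semi-manual.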

Before one can even think about regulating derived datasets, there is a more basic problem: under the present circumstances, it is challenging to know what derived datasets exist.

For both DukeMTMC and LFW, the authors lack control over these derived datasets. Neither requires giving any information to the authors prior to using the data, as is the case with some other datasets. The authors also lack control via licensing. DukeMTMC was released under the CC BY-NC-SA 4.0 license, which allows for sharing and adapting the dataset, as long as the use is non-commercial and attribution is given. The LFW dataset was released without any license at all.


Though regulating data is notoriously difficult, we suggest steps that the academic community can take in response to the concerns outlined above.

In light of ethical concerns, taking down a dataset is often inadequate to prevent its further use. Derived datasets should also be identified and taken down. Even more importantly, researchers should stop using these datasets, and journals should state that they will not accept papers that use them. Just as NeurIPS now requires a broader impact statement, we suggest requiring a statement listing and justifying any datasets used in a paper.

At the same time, more efforts should be made to regulate dataset usage from the outset, particularly with respect to the creation of derived datasets. There is a need to keep track of where a dataset’s data is available, as well as to regulate the creation of derived datasets that enable unethical research. We suggest that authors consider more restrictive licenses and distribution practices when releasing their dataset.