April 4, 2020

Building a Bridge with Concrete… Examples

Thanks to Annette Zimmermann and Arvind Narayanan for their helpful feedback on this post.

Algorithmic bias is currently generating a lot of lively public and scholarly debate, especially amongst computer scientists and philosophers. But do these two groups really speak the same language—and if not, how can they start to do so?

I noticed at least two different ways of thinking about algorithmic bias during a recent research workshop on the ethics of algorithmic decision-making at Princeton University’s Center for Human Values, organized by political philosopher Dr. Annette Zimmermann. Philosophers are thinking about algorithmic bias in terms of things like the inherent value of explanation, the fairness and accountability rights afforded to humans, and whether groups that have been systematically affected by unfair systems should bear the burden of integration when transitioning to a uniform system. Computer scientists, by contrast, are thinking about algorithmic bias in terms of things like running a gradient backwards to visualize a heat map, projecting features into subspaces devoid of protected attributes, and tuning hyperparameters to better satisfy a new loss function. Of course these are vast generalizations about the two fields, and there are plenty of researchers doing excellent work at the intersection, but it seems that, for the most part, while philosophers are debating which sets of ethical axioms ought to underpin algorithmic decision-making systems, computer scientists are meanwhile already deploying these systems in the real world.

In formulating loss functions, consequentialists might prioritize maximizing accurate outcomes for the largest possible number of people, even at the cost of fair treatment, whereas deontologists might prioritize treating everyone fairly, even at the cost of optimality. But there isn’t a definitive “most moral” answer, and if something like equalizing false positive rates were the key to fairness, we would not be seeing the alarming headlines about algorithmic bias that we see today.
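To make that contrast concrete, here is a minimal, purely illustrative Python sketch of how the two priorities might surface as different loss functions. The function names, the 0.5 decision threshold, the weight `lam`, and the choice of a false-positive-rate gap as the “fairness” term are my assumptions, not anything the debate has settled on.

```python
import numpy as np

def consequentialist_loss(y_true, y_prob):
    """'Maximize accurate outcomes overall': plain cross-entropy,
    blind to group membership."""
    eps = 1e-9
    return -np.mean(y_true * np.log(y_prob + eps) +
                    (1 - y_true) * np.log(1 - y_prob + eps))

def deontological_loss(y_true, y_prob, group, lam=10.0):
    """'Treat the groups alike': the same cross-entropy plus a heavy
    penalty on the gap in false positive rates between two groups."""
    base = consequentialist_loss(y_true, y_prob)
    fprs = []
    for g in (0, 1):
        negatives = (group == g) & (y_true == 0)
        fprs.append(np.mean(y_prob[negatives] > 0.5) if negatives.any() else 0.0)
    return base + lam * abs(fprs[0] - fprs[1])
```

In practice the hard threshold makes the penalty non-differentiable, so a real implementation would substitute a smooth surrogate; the point here is only that the ethical priority shows up as a choice of objective.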

Inundated with various conflicting definitions of fairness, scientists often optimize for the metrics they believe to be best and proceed onwards. For example, one might reasonably think that the way to ensure fairness of an algorithm across racial groups is to enforce predictive parity (equal likelihood that a positive prediction is accurate), or to equalize false positive and false negative rates, or simply to treat similar individuals similarly. However, it is actually mathematically impossible to simultaneously satisfy seemingly reasonable fairness criteria like these in most real-world settings. It is unclear how to choose amongst the criteria, and even more unclear how one would go about translating complex ideas that demand consideration, such as systematic oppression, into a world of optimizers and gradients.
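For a rough sense of why these criteria pull against each other, here is a small Python sketch on toy, randomly generated data (all names and numbers are made up) that computes two of the competing metrics per group. When the groups’ base rates of the positive class differ, as they do below, a classifier generally cannot equalize both precision and false positive rates at once.

```python
import numpy as np

def group_metrics(y_true, y_pred, group):
    """Per-group precision (what predictive parity compares) and
    false positive rate (what equalized error rates compare)."""
    metrics = {}
    for g in np.unique(group):
        m = group == g
        tp = np.sum((y_pred == 1) & (y_true == 1) & m)
        fp = np.sum((y_pred == 1) & (y_true == 0) & m)
        negatives = np.sum((y_true == 0) & m)
        metrics[int(g)] = {
            "precision": tp / max(tp + fp, 1),
            "false_positive_rate": fp / max(negatives, 1),
        }
    return metrics

# Toy data: the two groups have different base rates of the positive
# class, which is the condition under which the impossibility results bite.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=10_000)
y_true = rng.binomial(1, np.where(group == 0, 0.3, 0.6))
y_pred = rng.binomial(1, 0.5, size=10_000)   # stand-in for a model's decisions
print(group_metrics(y_true, y_pred, group))
```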

Since concrete mappings between a mathematical loss function and moral concepts are likely impossible to dictate, and philosophers are unlikely to settle on an ultimate theory of fairness, perhaps for now we can adopt a strategy that is, at least, not impossible to implement: a purposefully created, context- and application-specific validation/test set. The motivation is that even if philosophers and ethicists cannot decisively articulate a set of general, static fairness desiderata, perhaps they can make more domain-specific, dynamic judgements: for instance, whether a system should or should not grant a loan to person A, who has a particular set of attributes and features, and likewise for persons B, C, and so on. Of course there will not be unanimous agreement, but there may at least be a general consensus that one outcome is preferable to the other. One could then create a whole set of such examples.

Concepts like the idea that similar people should be treated similarly in a given decision scenario (the ‘like cases maxim’ in legal philosophy) could be encoded into this test set by requiring that people who differ only in a protected attribute receive the same result, and even concepts like equal accuracy rates across protected groups could be encoded by composing the test set of equal numbers of people from each group rather than numbers proportional to real-world majority/minority representation. However, the test set is not a construct-valid way to enforce these fairness constraints, nor should it be: the very reason such a test set would exist is that the right fairness criteria are not actually known; otherwise, they would simply be formulated explicitly into the loss function.
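As a concrete illustration of what such a curated test set might look like, here is a hypothetical Python sketch for a lending scenario. The column names, feature values, group labels, and expert labels are all invented for illustration, not drawn from any real dataset.

```python
import pandas as pd

# Hypothetical expert-curated cases for a lending model.
cases = pd.DataFrame([
    # A counterfactual pair: identical except for the protected attribute,
    # so the 'like cases maxim' requires the same outcome for both.
    {"income": 45000, "debt": 5000, "group": "A", "expert_label": 1},
    {"income": 45000, "debt": 5000, "group": "B", "expert_label": 1},
    # A harder edge case flagged by the ethicists for deliberation.
    {"income": 28000, "debt": 9000, "group": "A", "expert_label": 0},
    {"income": 28000, "debt": 9000, "group": "B", "expert_label": 0},
])

def balance_by_group(df, n_per_group, seed=0):
    """Compose the test set with equal representation from each group,
    rather than mirroring real-world majority/minority proportions."""
    return (df.groupby("group", group_keys=False)
              .apply(lambda g: g.sample(min(n_per_group, len(g)),
                                        random_state=seed)))

test_set = balance_by_group(cases, n_per_group=2)
```

A set like this does not prove a model fair; it only records, case by case, the judgements the ethicists were willing to commit to.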

At this juncture, ethicists and computer scientists could usefully engage in complementary work: ethicists could identify difficult edge cases that challenge what we think about moral questions and incorporate them into the test set, and computer scientists could work on optimizing accuracy rates on the given validation set. There are a few crucial differences, however, from similar collaborative approaches in other domains, such as when doctors are called on to provide expert labels on medical data so models can be trained to detect things like eye diseases. One is that the distribution of the test set, in addition to just the labels, is specifically decided upon by domain experts. Another is that the collaboration would last beyond just the labeling of the data: failure cases should be critically investigated earlier in the machine learning pipeline, in an iterative and reflective way, to ensure that things like overfitting are not happening. Whether performing well on the hidden test set requires learning fairer representations in the feature space or thresholding different groups differently, scientists would build context-specific models that encompass certain moral values defined by ethicists, who ground the test set in examples of realizations of those values.
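Continuing the hypothetical test set sketched above, an audit step in this collaboration might look something like the following. The per-group thresholds, the column names, and the assumption that `predict_proba` returns a one-dimensional array of positive-class probabilities are all illustrative.

```python
import numpy as np

def audit_model(predict_proba, test_set, thresholds):
    """Compare model decisions with the ethicists' labels on the curated
    test set, group by group. `thresholds` maps each group to its own
    decision threshold, one of the levers mentioned above."""
    report = {}
    for g, rows in test_set.groupby("group"):
        features = rows[["income", "debt"]].to_numpy(dtype=float)
        probabilities = predict_proba(features)
        decisions = (probabilities >= thresholds[g]).astype(int)
        report[g] = float(np.mean(decisions == rows["expert_label"].to_numpy()))
    return report

# e.g. audit_model(model.predict_proba, test_set, {"A": 0.5, "B": 0.4})
```

If agreement is low for one group, that failure case goes back to the earlier stages of the pipeline rather than being patched at the end.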

But does this proposal mean adopting a potentially dangerous, ethically objectionable “the ends justify the means” logic? Not necessarily. With algorithm developers working in conjunction with ethicists to ensure the means are not unsavory, this could be a way to bridge the divide between abstract notions of fairness and concrete ways of implementing systems.

This may not be an ideal long-term way to deal with the problem of algorithmic fairness: it is difficult to generalize between applications, and in situations where creating an expert-curated test set is too expensive or not scalable, it may not be preferable to satisfying one of the many mathematical definitions of fairness. But it could be one possible way to incorporate philosophical notions of fairness into the development of algorithms. Because technologists are not going to hold off on deploying machine learning systems until those systems reach a state of fairness everyone agrees on, finding a way to incorporate philosophical views about central moral values like fairness and justice into algorithmic systems right now is an urgent problem.

Supervised machine learning has traditionally been focused on predicting based on historical and existing data, but maybe we can structure our data in a way that is a model not of the society we actually live in, but of the one we hope to live in. Translating complex philosophical values into representative examples is not an easy task, but it is one that ethicists have been doing a version of for centuries in order to investigate moral concepts—and perhaps it can also be the way to convey some sense of our morals to machines.

Ballot-level comparison audits: BMD

In my previous posts, I’ve been discussing ballot-level comparison audits, a form of risk-limiting audit. Ballots are imprinted with serial numbers (after they leave the voter’s hands); during the audit, a person must find a particular numbered ballot in a batch of a thousand (more or less).

With CCOS (central-count optical scan) this works fine: the CCOS prints the serial numbers consecutively, and the human auditor can easily find the right ballot in a minute or two. With PCOS (precinct-count optical scan), we are reluctant to print the serial numbers consecutively, because the order in which people insert their ballots at the polling place is visible to the public, and (in theory) someone could learn how you voted by correlating that order with the CVR (cast-vote record) file.

What about ballot-marking devices (BMDs)? How do the serial numbers work for use in ballot-level comparison audits?

First of all, let’s remember that RLAs of BMD-marked ballots are not very meaningful, because the RLA can only assure that what’s marked on the paper is correctly tabulated. Because most voters don’t inspect what’s marked on the paper, the RLA cannot assure that what the voter indicated to the BMD (on the touchscreen) has been correctly tabulated, if the BMD has been hacked to make it cheat.

But suppose we set that concern aside. And indeed, some jurisdictions are conducting “RLAs” on BMD-marked ballots. So let’s examine how such “RLAs” should work.

If the BMD prints a serial number onto the marked ballot before presenting the ballot for the voter to examine, then the voter can see the serial number and can make a note of it. Then the voter can sell their vote by telling the criminal vote-buyer the serial number. Or the voter can be coerced to do so. You may think this is a far-fetched scenario, but voter coercion and vote selling were common in the 19th-century and early 20th-century United States, and occur now in some other countries.

Some “all-in-one” BMDs incorporate a scanning function and don’t require a separate PCOS scanner. Suppose such a BMD prints a serial number onto the marked ballot only after presenting the ballot for the voter to examine. That helps address the “voter-sees-the-number” problem. But it’s unpleasant to contemplate voting machines that can mark your ballot after the last time you see it. Any voting machine whose physical hardware can print votes onto the ballot after the last time the voter sees the paper is not a voter-verified paper ballot system, and is not acceptable. But even so (suppose we permit this), we are in a similar situation to PCOS ballots. That is, the serial numbers should be in random order, not consecutive order, because otherwise observers in the polling place could calculate what serial number you’ll get.

And therefore, ballot-comparison audits of BMD-marked ballots run into just the same problem as audits of PCOS-scanned ballots, and maybe the same solutions would apply.
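If random-order serial numbers are the route taken, the assignment itself is straightforward. Here is a minimal sketch (my own illustration, not any vendor’s method) of drawing serial numbers unpredictably rather than consecutively, so that observers cannot predict which number a given voter’s ballot will carry.

```python
import secrets

def issue_serials(n_ballots, digits=10):
    """Assign each ballot a distinct serial number drawn at random
    from a cryptographic source, so that knowing a voter's position
    in line reveals nothing about the serial number they will get."""
    used = set()
    serials = []
    while len(serials) < n_ballots:
        s = f"{secrets.randbelow(10**digits):0{digits}d}"
        if s not in used:
            used.add(s)
            serials.append(s)
    return serials
```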

Because of this problem, some manufacturers of BMDs have done the same as manufacturers of PCOS: omit serial numbers entirely. For example, the ExpressVote and ExpressVote XL do not print serial numbers on the ballot*, and therefore their ballots (like PCOS ballots) cannot be easily audited by ballot-level comparison audits (except by a cumbersome “transitive audit”).

*Based on information about the ExpressVote and ExpressVote XL as configured in 2019 and deployed in more than one state, including New Jersey.

Finding a randomly numbered ballot

In my previous posts, I’ve been discussing ballot-level comparison audits, a form of risk-limiting audit. Ballots are imprinted with serial numbers (after they leave the voter’s hands); during the audit, a person must find a particular numbered ballot in a batch of a thousand (more or less).

If the ballot papers are numbered consecutively, that’s not too difficult. But if the serial numbers are in random order, it’s very time-consuming.

An answer to the second puzzle.

So here’s my next idea. Likely I’m not the first to think of it, so I can’t claim much credit. And this idea may or may not be practical; it would need to be tested in practice.

Problem: You have a batch of serial-numbered ballots in random order, and you need to find the one numbered 0236000482.

Take the pile of ballots, and feed them through a high-volume scanner. Scanners that can do 140 pages per minute cost about $6000. The computer attached to the scanner can use OCR (optical character recognition) software just on the corners of the page, to find and recognize the serial number. When it finds the right number, the computer commands the scanner to stop.

Then the human auditor can pick up the last-scanned page, and examine it to make sure it’s the right number.

If the OCR software does not work perfectly (false positives), no harm done: the human sees that it’s the wrong number and resumes the scanner. False negatives are more annoying, but still recoverable: the human would have to search through the entire pile. Because we don’t rely on the scanner to work perfectly, and because the scanner is not counting or tabulating votes, there’s no need to put this equipment through an EAC (Election Assistance Commission) certification process.
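Here is a minimal sketch of the recognition step, assuming the scanner’s software hands us page images and a way to stop the feed. It uses the pytesseract wrapper for the Tesseract OCR engine; the crop box, the Tesseract options, and the target serial number are illustrative assumptions about the ballot layout.

```python
import pytesseract          # wrapper for the Tesseract OCR engine
from PIL import Image

TARGET_SERIAL = "0236000482"

def read_corner_serial(image_path, box=(0, 0, 600, 200)):
    """OCR only the corner region where the serial number is printed;
    restrict Tesseract to digits on a single line."""
    corner = Image.open(image_path).crop(box)
    text = pytesseract.image_to_string(
        corner, config="--psm 7 -c tessedit_char_whitelist=0123456789")
    return "".join(ch for ch in text if ch.isdigit())

def find_ballot(page_images, target=TARGET_SERIAL):
    """Return the index of the first page whose corner matches the target,
    which is where the scanner would be told to stop. A human verifies
    the sheet, so a false positive just means resuming the scan."""
    for index, path in enumerate(page_images):
        if read_corner_serial(path) == target:
            return index
    return None   # false negative: fall back to a manual search of the pile
```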

The serial number on these ballots is printed in fairly low-quality, hard-to-read type, which might pose problems for the OCR software. Better-quality printing would help the OCR, but it would help the humans too, and might be worth doing in any case.

Another variant of this solution is to print the serial number as a barcode in addition to the human-readable digits. That would be easier for the scanner to recognize. If the PCOS tries to cheat by printing a barcode that doesn’t match the human-readable number, the mismatch will be detected immediately by the human auditor.
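A sketch of that cross-check, using the pyzbar bindings to the ZBar barcode library; the barcode format and the idea of comparing it against the OCR’d digits are assumptions about how such ballots would be printed.

```python
from pyzbar.pyzbar import decode   # bindings to the ZBar barcode reader
from PIL import Image

def barcode_matches_digits(image_path, ocr_digits):
    """Decode the ballot's barcode and compare it with the human-readable
    digits read by OCR; any mismatch is flagged for the human auditor
    rather than silently trusting either representation."""
    symbols = decode(Image.open(image_path))
    if not symbols:
        return False                      # no barcode found: flag for review
    barcode_digits = symbols[0].data.decode("ascii")
    return barcode_digits == ocr_digits
```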

Puzzle number 3: The solution I propose in this article might work; but surely a creative person can find even better ways to support ballot-level comparison audits of PCOS machines.