August 24, 2016


Language necessarily contains human biases, and so will machines trained on language corpora

I have a new draft paper with Aylin Caliskan-Islam and Joanna Bryson titled Semantics derived automatically from language corpora necessarily contain human biases. We show empirically that natural language necessarily contains human biases, and the paradigm of training machine learning on language corpora means that AI will inevitably imbibe these biases as well.

Specifically, we look at “word embeddings”, a state-of-the-art language representation used in machine learning. Each word is mapped to a point in a 300-dimensional vector space so that semantically similar words map to nearby points.

We show that a wide variety of results from psychology on human bias can be replicated using nothing but these word embeddings. We primarily look at the Implicit Association Test (IAT), a widely used and accepted test of implicit bias. The IAT asks subjects to pair concepts together (e.g., white/black-sounding names with pleasant or unpleasant words) and measures reaction times as an indicator of bias. In place of reaction times, we use the semantic closeness between pairs of words.
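
To make "semantic closeness" concrete, here is a minimal sketch. It assumes cosine similarity as the closeness measure and uses made-up three-dimensional vectors (real embeddings have 300 dimensions). The `association` function and the toy vectors are illustrative assumptions, not the exact statistic or data from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word_vec, pleasant_vecs, unpleasant_vecs):
    """Mean closeness to pleasant words minus mean closeness to unpleasant words."""
    pleasant = sum(cosine(word_vec, p) for p in pleasant_vecs) / len(pleasant_vecs)
    unpleasant = sum(cosine(word_vec, q) for q in unpleasant_vecs) / len(unpleasant_vecs)
    return pleasant - unpleasant

# Hypothetical toy vectors, chosen by hand for illustration only.
TOY = {
    "flower": [0.9, 0.1, 0.0],
    "insect": [0.1, 0.9, 0.0],
    "love":   [1.0, 0.0, 0.1],
    "joy":    [0.8, 0.2, 0.0],
    "hatred": [0.0, 1.0, 0.1],
    "filth":  [0.2, 0.8, 0.0],
}

PLEASANT = [TOY["love"], TOY["joy"]]
UNPLEASANT = [TOY["hatred"], TOY["filth"]]

flower_score = association(TOY["flower"], PLEASANT, UNPLEASANT)
insect_score = association(TOY["insect"], PLEASANT, UNPLEASANT)
```

With these toy vectors, `flower_score` comes out positive and `insect_score` negative, which is the direction of the flowers/insects association described in this post.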

In short, we were able to replicate every single result that we tested, with high effect sizes and low p-values.

These include innocuous, universal associations (flowers are associated with pleasantness and insects with unpleasantness), racial prejudice (European-American names are associated with pleasantness and African-American names with unpleasantness), and a variety of gender stereotypes (for example, career words are associated with male names and family words with female names).

But we go further. We show that information about the real world is recoverable from word embeddings to a striking degree. The figure below shows that for 50 occupation words (doctor, engineer, …), we can accurately predict the percentage of U.S. workers in that occupation who are women using nothing but the semantic closeness of the occupation word to feminine words!

These results simultaneously show that the biases in question are embedded in human language, and that word embeddings are picking up the biases.

Our finding of pervasive, human-like bias in AI may be surprising, but we consider it inevitable. We mean “bias” in a morally neutral sense. Some biases are prejudices, which society deems unacceptable. Others are facts about the real world (such as gender gaps in occupations), even if they reflect historical injustices that we wish to mitigate. Yet others are perfectly innocuous.

Algorithms don’t have a good way of telling these apart. If AI learns language sufficiently well, it will also learn cultural associations that are offensive, objectionable, or harmful. At a high level, bias is meaning. “Debiasing” these machine models, while intriguing and technically interesting, necessarily harms meaning.

Instead, we suggest that mitigating prejudice should be a separate component of an AI system. Rather than altering AI’s representation of language, we should alter how or whether it acts on that knowledge, just as humans are able to learn not to act on our implicit biases. This requires a long-term research program that includes ethicists and domain experts, rather than formulating ethics as just another technical constraint in a learning system.

Finally, our results have implications for human prejudice. Given how deeply bias is embedded in language, to what extent does the influence of language explain prejudiced behavior? And could transmission of language explain transmission of prejudices? These explanations are simplistic, but that is precisely our point: in the future, we should treat these as “null hypotheses” to be eliminated before we turn to more complex accounts of bias in humans.


Security against Election Hacking – Part 2: Cyberoffense is not the best cyberdefense!

State and county election officials across the country employ thousands of computers in election administration, most of which are connected (from time to time) to the internet (or exchange data cartridges with machines that are connected).  In my previous post I explained how we must audit elections independently of the computers, so we can trust the results even if the computers are hacked.

Still, if state and county election computers were hacked, it would be an enormous headache and it would certainly cast a shadow on the legitimacy of the election.  So, should the DHS designate election computers as “critical cyber infrastructure?”

This question betrays a fundamental misunderstanding of how computer security really works.  You as an individual buy your computers and operating systems from reputable vendors (Apple, Microsoft, IBM, Google/Samsung, HP, Dell, etc.).  Businesses and banks (and the Democratic National Committee, and the Republican National Committee) buy their computers and software from the same vendors.  Your security, and the security of all the businesses you deal with, is improved when these hardware and software vendors build products without security bugs in them.   Election administrators use computers that run Windows (or MacOS, or Linux) bought from the same vendors.

Parts of the U.S. government, particularly inside the NSA, have “cyberdefense” teams that analyze widely used software for security vulnerabilities.  The best thing they could do to enhance our security is notify the vendors immediately about vulnerabilities, so the vendors can fix the bugs (and learn their lessons).   Unfortunately, the NSA also has “cyberoffense” teams that like to save up these vulnerabilities, keep them secret, and use them as weak points to break into their adversaries’ computers.  They think they’re so smart that the Russkies, or the Chinese, will never be able to figure out the same vulnerabilities and use them to break into the computers of American businesses, individuals, the DNC or RNC, or American election administrators.  There’s even an acronym for this fallacy: NOBUS.  “NObody But US” will be able to figure out this attack.

Vulnerability lists accumulated by the NSA and DHS probably don’t include a lot of vote-counting software: those lists (probably) focus on widely used operating systems, office and word-processing software, network routers, phone apps, and so on.  But vote-counting software typically runs on widely used operating systems, and uses PDF-handling software for ballot printing and network routers for vote aggregation.  Improvements in these components would improve election security.

So, the “cyberdefense” experts in the U.S. Government could improve everyone’s security, including election administrators, by promptly warning Microsoft, Apple, IBM, and so on about security bugs.  But their hands are often tied by the “cyberoffense” hackers who want to keep the bugs secret—and unfixed.  For years, independent cybersecurity experts have advocated that the NSA’s cyberdefense and cyberoffense teams be split up into two separate organizations, so that the offense hackers can’t deliberately keep us all insecure.   Unfortunately, in February 2016 the NSA did just the opposite: it merged its offense and defense teams together.

Some in the government talk as if “national cyberdefense” is some kind of “national guard” that they can send in to protect a selected set of computers.  But it doesn’t work that way.  Our computers are secure because of the software we purchase and install; we can choose vendors such as Apple, IBM, Microsoft, HP, or others based on their track record or based on their use of open-source software that we can inspect.  The DHS’s cybersecurity squad is not really in that process, except as they help the vendors improve the security of their products.  (See also:  “The vulnerabilities equities process.”)

Yes, it’s certainly helpful that the Secretary of Homeland Security has offered “assistance in helping state officials manage risks to voting systems in each state’s jurisdiction.”  But it’s too close to the election to be fiddling with the election software—election officials (understandably) don’t want to break anything.

But really we should ask: Should the FBI and the NSA be hacking us or defending us?  To defend us, they must stop hoarding secret vulnerabilities, and instead get those bugs fixed by the vendors.


Security against Election Hacking – Part 1: Software Independence

There’s been a lot of discussion of whether the November 2016 U.S. election can be hacked.  Should the U.S. Government designate all the states’ and counties’ election computers as “critical cyber infrastructure” and prioritize the “cyberdefense” of these systems?  Will it make any difference to activate those buzzwords with less than 3 months until the election?

First, let me explain what can and can’t be hacked.  Election administrators use computers in (at least) three ways:

  1. To maintain voter registration databases and to prepare the “pollbooks” used at every polling place to list who’s a registered voter (for that precinct); to prepare the “ballot definitions” telling the voting machines who are the candidates in each race.
  2. Inside the voting machines themselves, the optical-scan counters or touch-screen machines that the voter interacts with directly.
  3. When the polls close, the vote totals from all the different precincts are gathered (this is called “canvassing”) and aggregated together to make statewide totals for each candidate (or district-wide totals for congressional candidates).

Any of these computers could be hacked.  What defenses do we have?  Could we seal off the internet so the Russians can’t hack us?  Clearly not; and anyway, maybe the hacker isn’t the Russians—what if it’s someone in your opponent’s political party?  What if it’s a rogue election administrator?

The best defenses are ways to audit the election and count the votes outside of, and independent of, the hackable computers.  For example,

[Read more…]


Can Facebook really make ads unblockable?

[This is a joint post with Grant Storey, a Princeton undergraduate who is working with me on a tool to help users understand Facebook’s targeted advertising.]

Facebook announced two days ago that it would make its ads indistinguishable from regular posts, and hence impossible to block. But within hours, the developers of Adblock Plus released an update which enabled the tool to continue blocking Facebook ads. The ball is now back in Facebook’s court. So far, all it’s done is issue a rather petulant statement. The burning question is this: can Facebook really make ads indistinguishable from content? Who ultimately has the upper hand in the ad blocking wars?

There are two reasons — one technical, one legal — why we don’t think Facebook will succeed in making its ads unblockable, if a user really wants to block them.

The technical reason is that the web is an open platform. When you visit Facebook, its server sends your browser the page content along with instructions on how to render it on the screen, but it is entirely up to your browser to follow those instructions. The browser ultimately acts on behalf of the user, and gives you, through extensions, an extraordinary degree of control over its behavior, and in particular, over what gets displayed on the screen. This is what enables the ecosystem of ad-blocking and tracker-blocking extensions to exist, along with extensions for customizing web pages in various other interesting ways.

Indeed, the change that Adblock Plus made in order to block the new, supposedly unblockable ads is just a single line in the tool’s default blocklist:

[id^="substream_"] div[id^="hyperfeed_story_id_"][data-xt]

What’s happening here is that Facebook’s HTML code for ads has slight differences from the code for regular posts, so that Facebook can keep things straight for its own internal purposes. But because of the open nature of the web, Facebook is forced to expose these differences to the browser and to extensions such as Adblock Plus. The line of code above allows Adblock Plus to distinguish the two categories by exploiting those differences.
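
To illustrate how such a rule separates the two categories, here is a rough sketch in Python of what the selector quoted above expresses: a div whose id starts with "hyperfeed_story_id_" and that carries a data-xt attribute, nested anywhere inside an element whose id starts with "substream_". The sample markup is invented for illustration and is not Facebook's real HTML, and the class and function names are ours; a real blocker relies on the browser's own selector engine rather than code like this.

```python
from html.parser import HTMLParser

class SelectorMatcher(HTMLParser):
    """Toy matcher for: [id^="substream_"] div[id^="hyperfeed_story_id_"][data-xt]

    Assumes well-nested markup without void elements such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # attribute dicts of the currently open elements
        self.matches = []  # ids of the elements the rule would hide

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Descendant combinator: some open ancestor has an id starting with "substream_".
        inside_substream = any(
            (ancestor.get("id") or "").startswith("substream_")
            for ancestor in self.stack
        )
        if (
            tag == "div"
            and inside_substream
            and (attrs.get("id") or "").startswith("hyperfeed_story_id_")
            and "data-xt" in attrs
        ):
            self.matches.append(attrs["id"])
        self.stack.append(attrs)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def find_matches(html):
    matcher = SelectorMatcher()
    matcher.feed(html)
    return matcher.matches

# Hypothetical markup: one ordinary post, one ad, and one lookalike outside the substream.
SAMPLE = """
<div id="substream_0">
  <div id="hyperfeed_story_id_1"><p>Post from a friend</p></div>
  <div id="hyperfeed_story_id_2" data-xt="x"><p>An ad</p></div>
</div>
<div id="hyperfeed_story_id_3" data-xt="x"><p>Outside the substream</p></div>
"""

matched_ids = find_matches(SAMPLE)
```

In this toy feed only the second story is matched: the first lacks the data-xt attribute and the third is not inside a "substream_" container.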

Facebook engineers could try harder to obfuscate the differences. For example, they could use non-human-readable element IDs to make it harder to figure out what’s going on, or even randomize the IDs on every page load. We’re surprised they’re not already doing this, given the grandiose announcement of the company’s intent to bypass ad blockers. But there’s a limit to what Facebook can do. Ultimately, Facebook’s human users have to be able to tell ads apart, because failure to clearly distinguish ads from regular posts would run headlong into the Federal Trade Commission’s rules against misleading advertising — rules that the commission enforces vigorously. [1, 2] And that’s the second reason why we think Facebook is barking up the wrong tree.

Facebook does allow human users to easily recognize ads: currently, ads say “Sponsored” and have a drop-down with various ad-related functions, including a link to the Ad Preferences page. And that means someone could create an ad-blocking tool that looks at exactly the information that a human user would look at. Such a tool would be mostly immune to Facebook’s attempts to make the HTML code of ads and non-ads indistinguishable. Again, the open nature of the web means that blocking tools will always have the ability to scan posts for text suggestive of ads, links to Ad Preferences pages, and other markers.

But don’t take our word for it: take our code for it instead. We’ve created a prototype tool that detects Facebook ads without relying on hidden HTML code to distinguish them. [Update: the source code is here.] The extension examines each post in the user’s news feed and marks those with the “Sponsored” link as ads. This is a simple proof of concept, but the detection method could easily be made much more robust without incurring a performance penalty. Since our tool is for demonstration purposes, it doesn’t block ads but instead marks them as shown in the image below.  
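
The idea of looking only at what a human would see can be sketched in a few lines of Python. This is not the code of the prototype extension, just an illustration under stated assumptions: each post is taken to be an `<article>` element (a hypothetical container choice), and a post is marked as an ad when one of its visible text fragments is exactly "Sponsored".

```python
from html.parser import HTMLParser

class FeedScanner(HTMLParser):
    """Collects the visible text fragments of each top-level <article> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth of <article> elements
        self.current = None  # text fragments of the post being read, if any
        self.posts = []      # one list of fragments per finished post

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            if self.depth == 0:
                self.current = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.posts.append(self.current)
                self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.current.append(text)

def mark_ads(html):
    """Return one boolean per post: True if a fragment reads exactly 'Sponsored'."""
    scanner = FeedScanner()
    scanner.feed(html)
    return ["Sponsored" in fragments for fragments in scanner.posts]

# Hypothetical feed: a friend's post that merely mentions the word, then an ad.
FEED = """
<article><h5>Alice</h5><p>Sponsored a 5k run today!</p></article>
<article><h5>Acme Corp</h5><a href="#">Sponsored</a><p>Buy now</p></article>
"""

flags = mark_ads(FEED)
```

Requiring an exact match on the label, rather than a substring, keeps the friend's post in this toy feed from being flagged. A more robust detector would combine several of the markers mentioned above.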

All of this must be utterly obvious to the smart engineers at Facebook, so the whole “unblockable ads” PR push seems likely to be a big bluff. But why? One possibility is that it’s part of a plan to make ad blockers look like the bad guys. Hand in hand, the company seems to be making a good-faith effort to make ads more relevant and give users more control over them. Facebook also points out, correctly, that its ads don’t contain active code and aren’t delivered from third-party servers, and therefore aren’t as susceptible to malware.

Facebook does deserve kudos for trying to clean up and improve the ad experience. If there is any hope for a peaceful resolution to the ad blocking wars, it is that ads won’t be so annoying as to push people to install ad blockers, and will be actually useful at least some of the time. If anyone can pull this off, it is Facebook, with the depth of data it has about its users. But is Facebook’s move too little, too late? On most of the rest of the web, ads continue to be creepy malware-ridden performance hogs, which means people will continue to install ad blockers, and as long as it is technically feasible for ad blockers to block Facebook ads, they’re going to continue to do so. Let’s hope there’s a way out of this spiral.

[1] Obligatory disclaimer: we’re not lawyers.

[2] Facebook claims that Adblock Plus’s updates “don’t just block ads but also posts from friends and Pages”. What they’re most likely referring to is that Adblock Plus blocks ads that are triggered by one of your friends Liking the advertiser’s page. But these are still ads: somebody paid for them to appear in your feed. Facebook is trying to blur the distinction in its press statement, but it can’t do that in its user interface, because that is exactly what the FTC prohibits.


The workshop on Data and Algorithmic Transparency

From online advertising to Uber to predictive policing, algorithmic systems powered by personal data affect more and more of our lives. As our society begins to grapple with the consequences of this shift, empirical investigation of these systems has proved vital to understand the potential for discrimination, privacy breaches, and vulnerability to manipulation.

This emerging field of research, which we’re calling Data and Algorithmic Transparency, seems poised to grow dramatically. But it faces a number of methodological challenges which can only be solved by bringing together expertise from a variety of disciplines. That is why Alan Mislove and I are organizing the first workshop on Data and Algorithmic Transparency at Columbia University on Nov 19, 2016.

Here are three reasons you should participate in this workshop.

  1. Start of a new, interdisciplinary community. The set of disciplines represented on the Program Committee is strikingly diverse: Internet measurement, information privacy/security, computer systems, human-computer interaction, law, and media studies. Industrial research and government are also represented. We expect the workshop itself to have a similar mix of participants, and that is exactly what is needed to make transparency research a success. Alan and I (and others including Nikolaos Laoutaris) are committed to growing and nurturing this community over the next several years.
  2. Co-located with two other exciting events: the Data Transparency Lab conference (DTL ‘16) and the Fairness, Accountability, and Transparency in Machine Learning workshop (FAT-ML ‘16). DTL shares many of the goals of the DAT workshop, but is non-academic. FAT-ML has a complementary relationship with the goals of DAT: it seeks to develop machine learning techniques for developers of algorithmic systems to improve fairness and accountability, whereas DAT seeks to analyze existing systems, typically “from the outside”. The events are consecutive and non-overlapping, and participants of each event are encouraged to attend the others.
  3. A format that makes the most of everyone’s time. At most computer science conferences, each speaker mumbles through their slides while the audience is a sea of laptops, awaiting their turn. DAT will be the opposite. We plan to have paper discussions instead of paper presentations, with commenters and participants, rather than authors, doing most of the speaking about each paper. This first edition of DAT will be non-archival (but peer-reviewed), and one goal of the discussions is to help authors improve their papers for later publication. We are also soliciting talk proposals about already published work; groups of accepted talks will be organized into panels.

See you in New York City!


A response to the National Association of Secretaries of State

Election administration in the United States is largely managed state-by-state, with a small amount of Federal involvement. This generally means that each state’s chief election official is that state’s Secretary of State. Their umbrella organization, the National Association of Secretaries of State, consequently has a lot of involvement in voting issues, and recently issued a press release concerning voting system security that was remarkably erroneous. What follows is a point-by-point commentary on their press release.

To date, there has been no indication from national security agencies to states that any specific or credible threat exists when it comes to cyber security and the November 2016 general election.

Unfortunately, it now appears that Russia broke into the DNC’s computers and leaked emails with clear intent to influence the U.S. presidential election (see, e.g., the New York Times’s article on July 26: “Why Security Experts Think Russia was Behind the DNC Breach”). It’s entirely reasonable to extrapolate from this that they may be willing to conduct further operations with the same goals, meaning that it’s necessary to take appropriate steps to mitigate such attacks, regardless of the level of specificity of available intel.

However, as a routine part of any election cycle, Secretaries of State and their local government counterparts work with federal partners, such as the U.S. Election Assistance Commission (EAC) and the National Institute of Standards and Technology (NIST), to maintain rigorous testing and certification standards for voting systems. Risk management practices and controls, including the physical handling and storage of voting equipment, are important elements of this work.

Expert analyses of current election systems (largely conducted ten years ago in California, Ohio, and Florida) found a wide variety of security problems. While some states have responded to these issues by replacing the worst paperless electronic voting systems, other states, including several “battleground” states, continue to use unacceptably insecure systems.

State election offices also proactively utilize election IT professionals and security experts to regularly review, identify and address any vulnerabilities with systems, including voter registration databases and election night reporting systems (which display the unofficial tallies that are ultimately verified via statewide canvassing).

The implication here is that all state election officials have addressed known vulnerabilities. This is incorrect. While some states have been quite proactive, other states have done nothing of the sort.

A national hacking of the election is highly improbable due to our unique, decentralized process.

Security vulnerabilities have nothing to do with probabilities. They instead have to do with a cost/benefit analysis on the part of the attacker. An adversary doesn’t have to attack all 50 states. All they have to do is tamper with the “battleground” states where small shifts in the vote can change the outcome for the whole state.

Each state and locality conducts its own system of voting, complete with standards and security requirements for equipment and software. Most states publicly conduct logic and accuracy testing of their machines prior to the election to ensure that they are working and tabulating properly, then they are sealed until Election Day to prevent tampering.

So-called “logic and accuracy testing” varies from location to location, but most versions boil down to casting a small number of votes for each candidate, on a handful of machines, and making sure they’re all there in a mock tally. Similarly, local election officials will have procedures in place to make sure machines are properly “zeroed”. Computer scientists refer to these as “sanity tests”, in that if the system fails, then something is obviously broken. If these tests pass, they say nothing about the sort of tampering that a sophisticated nation-state adversary might conduct.

Some election officials conduct more sophisticated “parallel testing”, where some voting equipment is pulled out of general service and is instead set up in a mock precinct, on election day, where mock voters cast seemingly real ballots. These machines would have a harder time distinguishing whether they were in “test” versus “production” conditions. But what happens if the machines fail the parallel test? By then, the election is over, the voters are gone, and there’s potentially no way to reconstruct the intent of the voters.

Furthermore, electronic voting machines are not Internet-based and do not connect to each other online.

This is partially true. Electronic voting systems do connect to one another through in-precinct local networks or through the motion of memory cards of various sorts. They similarly connect to election management systems before the start of the election (when they’re loaded with ballot definitions) and after the end of the election (for backups, recounts, inventory control, and/or being cleared prior to subsequent elections). All of these “touch points” represent opportunities for malware to cross the “air gap” boundaries. We built attacks like these a decade ago as part of the California Top to Bottom Review, showing how malware could spread “virally” to an entire county’s fleet of voting equipment. Attacks like these require a non-trivial up-front engineering effort, plus additional effort for deployment, but these efforts are well within the capabilities of a nation-state adversary.

Following the election, state and local jurisdictions conduct a canvass to review vote counting, ultimately producing the election results that are officially certified. Post-election audits help to further guard against deliberate manipulation of the election, as well as unintentional software, hardware or programming problems.

Post-election audits aren’t conducted at all in some jurisdictions, and would likely be meaningless against the sort of adversary we’re talking about. If a paperless electronic voting system was hacked, there might well be forensic evidence that the attackers left behind, but such evidence would be a challenge to identify quickly, particularly in the charged atmosphere of a disputed election result.

We look forward to continued information-sharing with federal partners in order to evaluate cyber risks, and respond to them accordingly, as part of ongoing state election emergency preparedness planning for November.

“Emergency preparedness” is definitely the proper way to consider the problem. Just as we must have contingency plans for all sorts of natural phenomena, like hurricanes, we must also be prepared for man-made phenomena, where we might be unable to reconstruct an election tally that accurately represents the will of the people.

The correct time to make such plans is right now, before the election. Since it’s far too late to decommission and replace our insecure equipment, we must instead plan for rapid responses, such as quickly printing single-issue paper ballots, bringing voters back to the polls, and doing it all over again. If such plans are made now, their very existence changes the cost/benefit equation for our adversaries, and will hopefully dissuade these adversaries from acting.


Supplement for Revealing Algorithmic Rankers (Table 1)

Table 1: A ranking of Computer Science departments per CSR, with additional attributes from the NRC assessment dataset. Here, the average count computes the geometric mean of the adjusted number of publications in each area by institution, faculty is the number of faculty in the department, pubs is the average number of publications per faculty (2000-2006), and GRE is the average GRE score (2004-2006). Departments are ranked by average count.

Rank (CSR) | Name | Average Count (CSR) | Faculty (CSR) | Pubs (NRC) | GRE (NRC)
1 | Carnegie Mellon University | 18.3 | 122 | 2 | 791
2 | Massachusetts Institute of Technology | 15 | 64 | 3 | 772
3 | Stanford University | 14.3 | 55 | 5 | 800
4 | University of California–Berkeley | 11.4 | 50 | 3 | 789
5 | University of Illinois–Urbana-Champaign | 10.5 | 55 | 3 | 772
6 | University of Washington | 10.3 | 50 | 2 | 796
7 | Georgia Institute of Technology | 8.9 | 81 | 2 | 797
8 | University of California–San Diego | 7.8 | 49 | 3 | 797
9 | Cornell University | 6.9 | 45 | 2 | 800
10 | University of Michigan | 6.8 | 63 | 3 | 800
11 | University of Texas–Austin | 6.6 | 43 | 3 | 789
12 | Columbia University | 6.3 | 49 | 3 | 788
13 | University of Massachusetts–Amherst | 6.2 | 47 | 2 | 796
14 | University of Maryland–College Park | 5.5 | 42 | 2 | 791
15 | University of Wisconsin–Madison | 5.1 | 35 | 2 | 793
16 | University of Southern California | 4.4 | 47 | 3 | 793
17 | University of California–Los Angeles | 4.3 | 32 | 3 | 797
18 | Northeastern University | 4 | 46 | 2 | 797
19 | Purdue University–West Lafayette | 3.6 | 42 | 2 | 772
20 | Harvard University | 3.4 | 29 | 3 | 794
20 | University of Pennsylvania | 3.4 | 32 | 3 | 800
22 | University of California–Santa Barbara | 3.2 | 28 | 4 | 793
22 | Princeton University | 3.2 | 27 | 2 | 796
24 | New York University | 3 | 29 | 2 | 796
24 | Ohio State University | 3 | 39 | 3 | 798
26 | University of California–Davis | 2.9 | 27 | 2 | 771
27 | Rutgers The State University of New Jersey–New Brunswick | 2.8 | 33 | 2 | 758
27 | University of Minnesota–Twin Cities | 2.8 | 37 | 2 | 777
29 | Brown University | 2.5 | 24 | 2 | 768
30 | Northwestern University | 2.4 | 35 | 1 | 787
31 | Pennsylvania State University | 2.3 | 28 | 3 | 790
31 | Texas A & M University–College Station | 2.3 | 36 | 1 | 775
33 | State University of New York–Stony Brook | 2.2 | 33 | 3 | 796
33 | Indiana University–Bloomington | 2.2 | 35 | 1 | 765
33 | Duke University | 2.2 | 22 | 3 | 800
33 | Rice University | 2.2 | 18 | 2 | 800
37 | University of Utah | 2.1 | 29 | 2 | 776
37 | Johns Hopkins University | 2.1 | 24 | 2 | 766
39 | University of Chicago | 2 | 28 | 2 | 779
40 | University of California–Irvine | 1.9 | 28 | 2 | 787
41 | Boston University | 1.6 | 15 | 2 | 783
41 | University of Colorado–Boulder | 1.6 | 32 | 1 | 761
41 | University of North Carolina–Chapel Hill | 1.6 | 22 | 2 | 794
41 | Dartmouth College | 1.6 | 18 | 2 | 794
45 | Yale University | 1.5 | 18 | 2 | 800
45 | University of Virginia | 1.5 | 18 | 2 | 789
45 | University of Rochester | 1.5 | 18 | 3 | 786
48 | Arizona State University | 1.4 | 14 | 2 | 787
48 | University of Arizona | 1.4 | 18 | 2 | 784
48 | Virginia Polytechnic Institute and State University | 1.4 | 32 | 1 | 780
48 | Washington University in St. Louis | 1.4 | 17 | 2 | 790

Revealing Algorithmic Rankers

By Julia Stoyanovich (Assistant Professor of Computer Science, Drexel University) and Ellen P. Goodman (Professor, Rutgers Law School)

ProPublica’s story on “machine bias” in an algorithm used for sentencing defendants amplified calls to make algorithms more transparent and accountable. It has never been more clear that algorithms are political (Gillespie) and embody contested choices (Crawford), and that these choices are largely obscured from public scrutiny (Pasquale and Citron). We see it in controversies over Facebook’s newsfeed, or Google’s search results, or Twitter’s trending topics. Policymakers are considering how to operationalize “algorithmic ethics” and scholars are calling for accountable algorithms (Kroll, et al.).

One kind of algorithm that is at once especially obscure, powerful, and common is the ranking algorithm (Diakopoulos). Algorithms rank individuals to determine credit worthiness, desirability for college admissions and employment, and compatibility as dating partners. They encode ideas of what counts as the best schools, neighborhoods, and technologies. Despite their importance, we actually can know very little about why this person was ranked higher than another in a dating app, or why this school has a better rank than that one. This is true even if we have access to the ranking algorithm, for example, if we have complete knowledge about the factors used by the ranker and their relative weights, as is the case for US News ranking of colleges. In this blog post, we argue that syntactic transparency, wherein the rules of operation of an algorithm are more or less apparent, or even fully disclosed, still leaves stakeholders in the dark: those who are ranked, those who use the rankings, and the public whose world the rankings may shape.

Using algorithmic rankers as an example, we argue that syntactic transparency alone will not lead to true algorithmic accountability (Angwin). This is true even if the complete input data is publicly available. We advocate instead for interpretability, which rests on making explicit the interactions between the program and the data on which it acts. An interpretable algorithm allows stakeholders to understand the outcomes, not merely the process by which outcomes were produced.

Opacity in Algorithmic Rankers

Algorithmic rankers take as input a database of items and produce a ranked list of items as output. The relative ranking of the items may be computed based on an explicitly provided scoring function. Or the ranking function may be learned, using learning-to-rank methods that are deployed extensively in information retrieval and recommender systems.

The simplest kind of ranker is a score-based ranker, which applies a scoring function independently to each item and then sorts the items on their scores. Many of these rankers use monotone aggregation scoring functions, such as weighted sums of attribute values with non-negative weights. In the very simplest case, items are sorted on the value of just one attribute, which corresponds to setting the weight of that attribute to 1 and of all other attributes to 0.
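
As a rough illustration, a score-based ranker with a weighted-sum scoring function can be sketched as follows. The three rows are taken from Table 1; the function and key names (`score`, `rank`, `avg_count`, and so on) are our own labels for this sketch, not part of any ranking system discussed here.

```python
def score(item, weights):
    """Weighted sum of attribute values; attributes without a weight count as 0."""
    return sum(weight * item[attr] for attr, weight in weights.items())

def rank(items, weights):
    """Return item names sorted by descending score."""
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("monotone aggregation requires non-negative weights")
    ordered = sorted(items, key=lambda item: score(item, weights), reverse=True)
    return [item["name"] for item in ordered]

# Three rows from Table 1 (average count, faculty, pubs, GRE).
DEPARTMENTS = [
    {"name": "Carnegie Mellon University", "avg_count": 18.3, "faculty": 122, "pubs": 2, "gre": 791},
    {"name": "Massachusetts Institute of Technology", "avg_count": 15, "faculty": 64, "pubs": 3, "gre": 772},
    {"name": "Stanford University", "avg_count": 14.3, "faculty": 55, "pubs": 5, "gre": 800},
]

# Simplest case: weight 1 on a single attribute, 0 on all others.
by_avg_count = rank(DEPARTMENTS, {"avg_count": 1})
by_pubs = rank(DEPARTMENTS, {"pubs": 1})
```

Ranking these three departments on average count alone reproduces their order in Table 1, while ranking on pubs alone reverses it. Even in this tiny example, the outcome depends entirely on where the weight is placed.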

This is illustrated in our running example in Table 1, which gives a ranking of 51 computer science departments as per CSRankings (CSR). To illustrate some points, we augmented the data with several attributes from the assessment of research-doctorate programs by the National Research Council (NRC). The source of each attribute (CSR or NRC) is listed next to the attribute name. We recognize that the augmented CS rankings are already syntactically transparent. What’s more, they provide the entire data set. We use them for illustrative purposes.

Table 1: A ranking of Computer Science departments per CSRankings, with additional attributes from the NRC assessment dataset. Here, average count is the geometric mean of the adjusted number of publications in each area by institution, faculty is the number of faculty in the department, pubs is the average number of publications per faculty (2000-2006), and GRE is the average GRE score (2004-2006). Departments are ranked by average count.

Rank (CSR) | Name | Average Count (CSR) | Faculty (CSR) | Pubs (NRC) | GRE (NRC)
1 | Carnegie Mellon University | 18.3 | 122 | 2 | 791
2 | Massachusetts Institute of Technology | 15 | 64 | 3 | 772
3 | Stanford University | 14.3 | 55 | 5 | 800
4 | University of California–Berkeley | 11.4 | 50 | 3 | 789
5 | University of Illinois–Urbana-Champaign | 10.5 | 55 | 3 | 772
(ranks 6 through 44 omitted here; see the full table)
45 | Yale University | 1.5 | 18 | 2 | 800
45 | University of Virginia | 1.5 | 18 | 2 | 789
45 | University of Rochester | 1.5 | 18 | 3 | 786
48 | Arizona State University | 1.4 | 14 | 2 | 787
48 | University of Arizona | 1.4 | 18 | 2 | 784
48 | Virginia Polytechnic Institute and State University | 1.4 | 32 | 1 | 780
48 | Washington University in St. Louis | 1.4 | 17 | 2 | 790

Ranked results are difficult for people to interpret, whether a ranking is computed explicitly or learned, whether the method (e.g., the scoring function or, more generally, the model) is known or unknown, and whether the user can access the entire output or only the highest-ranked items (the top-k). There are several sources of this opacity, illustrated below for score-based rankers.

Sources of Opacity

Source 1: The scoring formula alone does not indicate the relative rank of an item. Rankings are, by definition, relative, while scores are absolute. Knowing how the score of an item is computed says little about the outcome — the position of a particular item in the ranking, relative to other items. Is 10.5 a high score or a low score? That depends on how 10.5 compares to the scores of other items, for example to the highest attainable score and to the highest score of some actual item in the input. In our example in Table 1 this kind of opacity is mitigated because there is both syntactic transparency (the scoring formula is known) and the input is public.

Source 2: The weight of an attribute in the scoring formula does not determine its impact on the outcome. Consider again the example in Table 1, and suppose that we first normalize the values of the attributes, and then compute the score of each department by summing up the values of faculty (with weight 0.2), average count (with weight 0.3) and GRE (with weight 0.5). According to this scoring method, the size of the department (faculty) is the least important factor. Yet, it will be the deciding factor that sets apart top-ranked departments from those in lower ranks, both because the value of this attribute changes most dramatically in the data, and because it correlates with average count (in effect, double-counting). In contrast, GRE is syntactically the most important factor in the formula, yet in this dataset it has very close values for all items, and so has limited actual effect on the ranking.
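The gap between syntactic weight and actual impact can be checked numerically. The sketch below uses only the twelve rows visible in Table 1 (department names abbreviated) and assumes one particular normalization, dividing each attribute by its largest value; the text above does not say which normalization is intended, and a different choice would give different numbers. For each attribute it reports how far the weighted, normalized value moves across the items.

```python
# Rows visible in Table 1: (name, average count, faculty, GRE).
ROWS = [
    ("CMU", 18.3, 122, 791), ("MIT", 15.0, 64, 772),
    ("Stanford", 14.3, 55, 800), ("Berkeley", 11.4, 50, 789),
    ("UIUC", 10.5, 55, 772), ("Yale", 1.5, 18, 800),
    ("Virginia", 1.5, 18, 789), ("Rochester", 1.5, 18, 786),
    ("Arizona State", 1.4, 14, 787), ("Arizona", 1.4, 18, 784),
    ("Virginia Tech", 1.4, 32, 780), ("WashU", 1.4, 17, 790),
]
COLUMNS = {"avg_count": 1, "faculty": 2, "gre": 3}
# Weights from the scoring method described in the text.
WEIGHTS = {"faculty": 0.2, "avg_count": 0.3, "gre": 0.5}


def normalized(attr):
    """Divide each value by the column maximum (an assumed normalization)."""
    col = COLUMNS[attr]
    top = max(row[col] for row in ROWS)
    return [row[col] / top for row in ROWS]


def weighted_spread(attr):
    """Largest difference this attribute can make to the score of two items."""
    values = [WEIGHTS[attr] * v for v in normalized(attr)]
    return max(values) - min(values)


spread = {attr: weighted_spread(attr) for attr in WEIGHTS}
for attr, value in sorted(spread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{attr:10s} weight={WEIGHTS[attr]:.1f} spread={value:.3f}")
```

On these rows and under this normalization, GRE carries the largest weight but contributes by far the smallest spread, while faculty and average count each move the score roughly ten times as much.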

Source 3: The ranking output may be unstable. A ranking may be unstable because of the scores generated on a particular dataset. An example would be tied scores, where the tie is not reflected in the ranking. In this case, the choice of any particular rank order is arbitrary. Moreover, unless raw scores are disclosed, the user has no information about the magnitude of the difference in scores between items that appear in consecutive ranks. In Table 1, CMU (18.3) has a much higher score than the immediately following MIT (15). This is in contrast to, e.g., UIUC (10.5, rank 5) and UW (10.3, rank 6), which are nearly tied. The difference in scores between distinct adjacent ranks decreases dramatically as we move down the list: it is at most 0.3, and usually 0.1, for departments in ranks 16 through 48. CSRankings’ syntactic transparency (disclosing its ranking method to the user) and accessible data allow us to see the instability, but this is unusual.
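When raw scores are available, both symptoms of output instability are easy to surface. The following sketch uses the average count values from the visible rows of Table 1 (names abbreviated); the helper names are ours. Gaps are computed only within the top five, because the rows between ranks 5 and 45 are not reproduced here.

```python
from collections import Counter

# Average count scores from Table 1, in rank order.
top_scores = [("CMU", 18.3), ("MIT", 15.0), ("Stanford", 14.3),
              ("Berkeley", 11.4), ("UIUC", 10.5)]
bottom_scores = [("Yale", 1.5), ("Virginia", 1.5), ("Rochester", 1.5),
                 ("Arizona State", 1.4), ("Arizona", 1.4),
                 ("Virginia Tech", 1.4), ("WashU", 1.4)]


def adjacent_gaps(scored):
    """Score difference between each item and the one ranked just below it."""
    return [round(a[1] - b[1], 1) for a, b in zip(scored, scored[1:])]


def tie_groups(scored):
    """Scores shared by more than one item, with the size of each group."""
    counts = Counter(value for _, value in scored)
    return {value: n for value, n in counts.items() if n > 1}


gaps = adjacent_gaps(top_scores)
ties = tie_groups(bottom_scores)
print("gaps among the top five:", gaps)
print("tied scores at the bottom:", ties)
```

A consumer who sees only rank positions has no access to either of these quantities.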

Source 4: The ranking methodology may be unstable. The scoring function may produce vastly different rankings with small changes in attribute weights. This is difficult to detect even with syntactic transparency, and even if the data is public. Malcolm Gladwell discusses this issue and gives compelling examples in his 2011 piece, The Order of Things. In our example in Table 1, a scoring function that is based on a combination of pubs and GRE would be unstable, because the values of these attributes are both very close for many of the items and induce different rankings, and so prioritizing one attribute over the other slightly would cause significant re-shuffling.
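One way to probe methodological stability is to shift the weights slightly and count how many pairs of items change their relative order. The sketch below does this for a hypothetical score built from pubs and GRE, using the twelve rows visible in Table 1 (names abbreviated). The min-max normalization, the function names, and the step size are all our assumptions, and how much re-shuffling shows up depends on those choices and on how many rows are included.

```python
from itertools import combinations

# Rows visible in Table 1: (name, pubs, GRE).
ROWS = [
    ("CMU", 2, 791), ("MIT", 3, 772), ("Stanford", 5, 800),
    ("Berkeley", 3, 789), ("UIUC", 3, 772), ("Yale", 2, 800),
    ("Virginia", 2, 789), ("Rochester", 3, 786), ("Arizona State", 2, 787),
    ("Arizona", 2, 784), ("Virginia Tech", 1, 780), ("WashU", 2, 790),
]


def min_max(values):
    """Rescale values to the range [0, 1] (an assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


PUBS = min_max([row[1] for row in ROWS])
GRE = min_max([row[2] for row in ROWS])


def scores(w_pubs, w_gre):
    """Weighted sum of normalized pubs and GRE for every row."""
    return [w_pubs * p + w_gre * g for p, g in zip(PUBS, GRE)]


def flipped_pairs(a, b):
    """Count item pairs that the two score lists order in opposite ways."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)


def sensitivity(w_pubs, step):
    """Pairs reordered when `step` of weight moves from GRE to pubs."""
    base = scores(w_pubs, 1 - w_pubs)
    shifted = scores(w_pubs + step, 1 - w_pubs - step)
    return flipped_pairs(base, shifted)


for step in (0.05, 0.1, 0.2):
    print(f"step={step}: {sensitivity(0.4, step)} pair(s) reordered")
```

The same count could be reported for any scoring function; a ranking whose pair order changes under small steps is one whose methodology deserves scrutiny.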

The opacity concerns described here are all due to the interaction between the scoring formula (or, more generally, an a priori postulated model) and the actual dataset being ranked. In a recent paper, one of us observed that structured datasets show rich correlations between item attributes in the presence of ranking, and that such correlations are often local (i.e., are present in some parts of the dataset but not in others). To be clear, this kind of opacity is present whether or not there is syntactic transparency.

Harms of Opacity

Opacity in algorithmic rankers can lead to four types of harms:

(1) Due process / fairness. The subjects of the ranking cannot have confidence that their ranking is meaningful or correct, or that they have been treated like similarly situated subjects. Syntactic transparency helps with this but it will not solve the problem entirely, especially when people cannot interpret how weighted factors have impacted the outcome (Source 2 above).

(2) Hidden normative commitments. A ranking formula implements some vision of the “good.” Unless the public knows what factors were chosen and why, and with what weights assigned to each, it cannot assess the compatibility of this vision with other norms. Even where the formula is disclosed, real public accountability requires information about whether the outcomes are stable, whether the attribute weights are meaningful, and whether the outcomes are ultimately validated against the chosen norms. Did the vendor evaluate the actual effect of the features that are postulated as important by the scoring / ranking model? Did the vendor take steps to compensate for mutually-reinforcing correlated inputs, and for possibly discriminatory inputs? Was stability of the ranker interrogated on real or realistic inputs? This kind of transparency around validation is important both for learning algorithms, which operate according to rules that are constantly in flux and responsive to shifting data inputs, and for simpler score-based rankers that are likewise sensitive to the data.

(3) Interpretability. Especially where ranking algorithms are performing a public function (e.g., allocation of public resources or organ donations) or directly shaping the public sphere (e.g., ranking politicians), political legitimacy requires that the public be able to interpret algorithmic outcomes in a meaningful way. At the very least, they should know the degree to which the algorithm has produced robust results that improve upon a random ordering of the items (a ranking-specific confidence measure). In the absence of interpretability, there is a threat to public trust and to democratic participation, raising the dangers of an algocracy (Danaher) – rule by incontestable algorithms.

(4) Meta-methodological assessment. Following on from the interpretability concerns is a meta question about whether a ranking algorithm is the appropriate method for shaping decisions. There are simply some domains, and some instances of datasets, in which rank order is not appropriate. For example, if there are very many ties or near-ties induced by the scoring function, or if the ranking is too unstable, it may be better to present data through an alternative mechanism such as clustering. More fundamentally, we should question the use of an algorithmic process if its effects are not meaningful or if it cannot be explained. In order to understand whether the ranking methodology is valid, as a first order question, the algorithmic process needs to be interpretable.

The Possibility of Knowing

Recent scholarship on the issue of algorithmic accountability has devalued transparency in favor of verification. The claim is that because algorithmic processes are protean and extremely complex (due to machine learning) or secret (due to trade secrets or privacy concerns), we need to rely on retrospective checks to ensure that the algorithm is performing as promised. Among these checks would be cryptographic techniques like zero knowledge proofs (Kroll, et al.) to confirm particular features, audits (Sandvig) to assess performance, or reverse engineering (Perel and Elkin-Koren) to test cases.

These are valid methods of interrogation, but we do not want to give up on disclosure. Retrospective testing puts a significant burden on users. Proofs are useful only when you know what you are looking for. Reverse engineering with test cases can lead to confirmation bias. All these techniques put the burden of inquiry exclusively on individuals for whom interrogation may be expensive and ultimately fruitless. The burden instead should fall more squarely on the least cost avoider, which will be the vendor who is in a better position to reveal how the algorithm works (even if only partially). What if food manufacturers resisted disclosing ingredients or nutritional values, and instead we were put to the trouble of testing their products or asking them to prove the absence of a substance? That kind of disclosure by verification is very different from having a nutritional label.

What would it take to provide the equivalent of a nutritional label for the process and the outputs of algorithmic rankers? What suffices as an appropriate and feasible explanation depends on the target audience.

For an individual being ranked, a useful description would explain their specific ranked outcome and suggest ways to improve the outcome. What changes can NYU CS make to improve its ranking? Why is the NYU CS department ranked 24? Which attributes make this department perform worse than those ranked higher? As we argued above, the answers to these questions depend on the interaction between the ranking method and the dataset over which the ranker operates. When working with data that is not public (e.g., involving credit or medical information about individuals), an explanation mechanism of this kind must be mindful of any privacy considerations. Individually-responsive disclosures could be offered in a widget that allows ranked entities to experiment with the results by changing the inputs.

An individual consumer of a ranked output would benefit from a concise and intuitive description of the properties of the ranking. Based on this explanation, users will get a glimpse of, e.g., the diversity (or lack thereof) that the ranking exhibits in terms of attribute values. Both attributes that comprise the scoring function, if known (or, more generally, features that make part of the model), and attributes that co-occur or even correlate with the scoring attributes, can be described explicitly. In our example in Table 1, a useful explanation may be that a ranking on average count will over-represent large departments (with many faculty) at the top of the list, while GRE does not strongly influence rank.

Figure 1: A hypothetical Ranking Facts label.

Figure 1 presents a hypothetical “nutritional label” for rankings, using the augmented CSRankings in Table 1 as input. Inspired by Nutrition Facts, our Ranking Facts label is aimed at the consumer, such as a prospective CS program applicant, and addresses three of the four opacity sources described above: relativity, impact, and output stability. We do not address methodological stability in the label. How this dimension should be quantified and presented to the user is an open technical problem.

The Ranking Facts show how the properties of the 10 highest-ranked items compare to the entire dataset (Relativity), making explicit cases where the ranges of values, and the median value, are different at the top-10 vs. overall (median is marked with red triangles for faculty size and average publication count). The label lists the attributes that have most impact on the ranking (Impact), presents the scoring formula (if known), and explains which attributes correlate with the computed score. Finally, the label graphically shows the distribution of scores (Stability), explaining that scores differ significantly up to top-10 but are nearly indistinguishable in later positions.
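The Relativity part of such a label amounts to comparing summary statistics of the top-k items against the whole dataset. Here is a minimal sketch of that comparison. It works only on the twelve rows visible in Table 1 (names abbreviated), so it uses the top 5 rather than the top 10, and "overall" means those twelve rows rather than all 51 departments; the function name is ours.

```python
from statistics import median

# Rows visible in Table 1, in rank order: (name, average count, faculty, GRE).
ROWS = [
    ("CMU", 18.3, 122, 791), ("MIT", 15.0, 64, 772),
    ("Stanford", 14.3, 55, 800), ("Berkeley", 11.4, 50, 789),
    ("UIUC", 10.5, 55, 772), ("Yale", 1.5, 18, 800),
    ("Virginia", 1.5, 18, 789), ("Rochester", 1.5, 18, 786),
    ("Arizona State", 1.4, 14, 787), ("Arizona", 1.4, 18, 784),
    ("Virginia Tech", 1.4, 32, 780), ("WashU", 1.4, 17, 790),
]
COLUMNS = {"avg_count": 1, "faculty": 2, "gre": 3}


def ranking_facts(rows, k):
    """For each attribute, return (median of the top k, median of all rows)."""
    facts = {}
    for attr, col in COLUMNS.items():
        top = [row[col] for row in rows[:k]]
        everything = [row[col] for row in rows]
        facts[attr] = (median(top), median(everything))
    return facts


facts = ranking_facts(ROWS, k=5)
for attr, (top_median, overall_median) in facts.items():
    print(f"{attr:10s} top-5 median={top_median}  overall median={overall_median}")
```

On these rows the median faculty size at the top is more than double the overall median, while the median GRE is almost identical in both groups, which is the kind of contrast the label is meant to make visible.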

Something like the Ranking Facts label makes the process and outcome of algorithmic ranking interpretable for consumers, and reduces the likelihood of the opacity harms discussed above. Beyond Ranking Facts, it is important to develop interpretability tools that enable vendors to design fair, meaningful and stable ranking processes, and that support external auditing. Promising technical directions include, e.g., quantifying the influence of various features on the outcome under different assumptions about availability of data and code, and investigating whether provenance techniques can be used to generate explanations.


Election security as a national security issue

We recently learned that Russian state actors may have been responsible for the DNC emails recently leaked to Wikileaks. Earlier this spring, once they became aware of the hack, the DNC hired Crowdstrike, an incident response firm. The New York Times reports:

Preliminary conclusions were discussed last week at a weekly cyberintelligence meeting for senior officials. The Crowdstrike report, supported by several other firms that have examined the same bits of code and telltale “metadata” left on documents that were released before WikiLeaks’ publication of the larger trove, concludes that the Federal Security Service, known as the F.S.B., entered the committee’s networks last summer.

President Obama added that “on a regular basis, [the Russians] try to influence elections in Europe.” For the sake of this blog piece, and it’s not really a stretch, let’s take it as a given that foreign nation-state actors including Russia have a large interest in the outcome of U.S. elections and are willing to take all sorts of unseemly steps to influence what happens here. Let’s take it as a given that this is undesirable and talk about how we might stop it.

It’s bad enough to see foreign actors leaking emails with partisan intent. To make matters worse, Bruce Schneier, in a Washington Post op-ed, and many other security experts before him have worried about our voting systems themselves being hacked. How bad could this get? Several companies are now offering Internet-based voting systems alongside apparently unfounded claims as to their security. In one example, Washington D.C. looked at using one such system for its local elections and had a “pilot” in 2010, wherein the University of Michigan’s Alex Halderman and his students found and exploited significant security vulnerabilities. Had this system been used in a real election, any foreign nation-state actor could have done the same. Luckily, these systems aren’t widely used.

How vulnerable are our nation’s election systems, as they’ll be used this November 2016, to being manipulated by foreign nation-state actors? The answer depends on how close the election will be. Consider Bush v. Gore in 2000. If an attacker, knowing it would be a very close election, had found a way to specifically manipulate the outcome in Florida, then their attack could well have had a decisive impact. Of course, predicting election outcomes is as much an art as a science, so an attacker would need to hedge their bets and go after the voting systems in multiple “battleground” states. Conversely, there’s no point in going after highly polarized states, where small changes will have no decisive impact. As an attacker, you want to leave a minimal footprint.

How good are we at defending ourselves? Will cyber attacks on current voting systems leave evidence that can be detected prior to our elections? Let’s consider the possible attacks and how our defenses might respond.

Voter de-registration: The purpose of many attacks is simply to break things. Applied with partisan intent, you’d want to break things for one party more than the other. The easiest attack would be to hack a voter registration system, deleting voters who you believe are likely to support the candidate you don’t like. For voters who have registered for a political party, you know everything you need to know about whom to delete. For independent voters, you can probabilistically infer their political opinions based on how their local precinct votes and on other demographic variables. (Political scientists do this sort of thing all the time.) Selective destruction of voter registration databases is likely to be recoverable. Such voters could demand to vote “provisional ballots” and those ballots would get counted as normal, once the voter registration databases were restored.

Vote flipping: A nastier attack would require an attacker to access the computers inside DRE voting systems. (“Direct recording electronic” systems are typically touch-screen computers with no voter-verifiable paper trail. The only record of a voter’s ballot is stored electronically, inside the computer.) These voting systems are typically not connected to the Internet, although they do connect to election management computers, and those sometimes use modems to gather data from remote precincts. (Details vary from state to state and even county to county.) From the perspective of a nation-state cyber attacker, a modem might as well be a direct connection to the Internet. Once you can get malware into one of these election management computers, you can delete or flip votes. If you’re especially clever, you can use the occasional connections from these election management computers to the voting machines and corrupt the voting machines themselves. (We showed how to do this sort of viral attack as part of the California Top to Bottom Review in 2007.)

With paperless DRE systems, attacked by a competent nation-state actor, there will be no reason to believe any of the electronic records are intact, and a competent attacker would presumably also be good enough to clean up on their way out, so there wouldn’t necessarily even be any evidence of the attack.

The good news is that paperless DRE systems are losing market share and being replaced slowly-but-surely with several varieties of paper-ballot systems (some hand-marked and electronically scanned, others machine-marked). A foreign nation-state adversary can’t reach across the Internet and change what’s printed on a piece of paper, which means that a post-election auditing strategy to compare the electronic results to the paper results can efficiently detect (and thus deter) electronic tampering.
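The core of such an audit is a comparison between the electronic tally and a hand count of the paper ballots for a random sample of precincts. The sketch below is a deliberately simplified illustration of that idea: the precinct names and vote counts are made up, the function is hypothetical, and the sample size is simply passed in. Real post-election audits, such as risk-limiting audits, choose how much to sample using statistical criteria that this sketch does not attempt.

```python
import random
from typing import Dict, List

Tally = Dict[str, int]


def audit(electronic: Dict[str, Tally], paper: Dict[str, Tally],
          sample_size: int, seed: int) -> List[str]:
    """Return the sampled precincts whose electronic and paper tallies differ."""
    rng = random.Random(seed)  # in practice the seed would be chosen publicly
    sampled = rng.sample(sorted(electronic), sample_size)
    return sorted(p for p in sampled if electronic[p] != paper[p])


# Made-up tallies for three hypothetical precincts.
electronic = {
    "precinct-1": {"A": 410, "B": 390},
    "precinct-2": {"A": 505, "B": 295},  # electronic record altered
    "precinct-3": {"A": 350, "B": 450},
}
paper = {
    "precinct-1": {"A": 410, "B": 390},
    "precinct-2": {"A": 455, "B": 345},  # what the hand count shows
    "precinct-3": {"A": 350, "B": 450},
}

mismatches = audit(electronic, paper, sample_size=3, seed=2016)
print("precincts needing investigation:", mismatches)
```

With paperless DREs there is no independent `paper` record to compare against, which is why this check is impossible on those systems.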

Where would an adversary attack? The most bang-for-the-buck for a foreign nation-state bent on corrupting our election would be to find a way to tamper with paperless DRE voting systems in a battleground state. So where then? Check out the NYT’s interactive “paths to the White House” page, wherein you can play “what-if” games on which states might have what impact in the Electoral College. The top battleground state is Florida, but thanks in part to the disastrous 2006 election in Florida’s 13th Congressional district, Florida dumped its DRE voting systems for optically scanned paper ballots, so it would be much harder for an adversarial cyber attack to go undetected there. What about other battleground states? According to the data on the Verified Voting website, Pennsylvania continues to use paperless DREs, as does Georgia. Much of Ohio uses DRE systems with “toilet paper roll” printers, where voters are largely unable to detect if anything is printed incorrectly, so we’ll lump them in with the paperless states. North Carolina uses a mix of technologies, some of which are more vulnerable than others. So let’s say the Russians want to rig the election for Trump. If they could guarantee a Trump win in Pennsylvania, Georgia, Ohio, and North Carolina, then a Florida victory could put Trump over the top. Even without conspiracy theories, Florida will still be an intensely fought battleground state, but we don’t need a foreign government making it any worse.

So what should these sensitive states do in the short term? At this point, it’s far too late to require non-trivial changes in election technologies or even most procedures. They’re committed to what they’ve got and how they’ll use it. We could imagine requiring some essential improvements (security patches and updates installed, intrusion detection and monitoring equipment installed, etc.) and even some sophisticated analyses (e.g., pulling voting machines off the line and conducting detailed / destructive analyses of their internal state, going beyond the weak tamper-protection mechanisms presently in place). Despite all of this, we could well end up in a scenario where we conclude that we have unreliable or tampered election data and cannot use it to produce a meaningful vote tally.

Consider also that all an adversary needs to do is raise enough doubt that the loser has seemingly legitimate grounds to dispute the result. Trump is already suggesting that this November’s election might be rigged, without any particular evidence to support this conjecture. This makes it all the more essential that we have procedures that all parties can agree to for recounts, for audits, and for what to do when those indicate discrepancies.

In case of emergency, break glass. If we’re facing a situation where we see tampering on a massive scale, we could end up in a crisis far worse than Florida after the Bush/Gore election of 2000. If we do nothing until after we find problems, every proposed solution will be tinted with its partisan impact, making it difficult to reach any sort of procedural consensus. Nobody wants to imagine a case where our electronic voting systems have been utterly compromised, but if we establish processes and procedures, in advance, for dealing with these contingencies, such as commissioning paper ballots and rerunning the elections in impacted areas, we will disincentivize foreign election adversaries and preserve the integrity of our democracy.

(Addendum: contingency planning was exactly the topic of discussion after Hurricane Sandy disrupted elections across the Northeast in November 2012. It would be useful to revisit whatever changes were made then, in light of the new threat landscape we have today.)



Brexit Exposes Old and Deepening Data Divide between EU and UK

After the Brexit vote, politicians, businesses and citizens are all wondering what’s next. In general, legal uncertainty permeates Brexit, but in the world of bits and bytes, Brussels and London have in fact been on a collision course at least since the 90s. The new British prime minister, Theresa May, has been personally responsible for a deepening divide across the North Sea on data and communication policy. Although EU citizens will see stronger privacy and cybersecurity protections through EU law post-Brexit, multinational companies should be particularly worried about how future regulation will treat the loads of data they traffic about customers, employees, and deals between the EU and the UK.