October 20, 2018

Refining the Concept of a Nutritional Label for Data and Models

By Julia Stoyanovich (Assistant Professor of Computer Science at Drexel University) and Bill Howe (Associate Professor in the Information School at the University of Washington)

In August 2016, Julia Stoyanovich and Ellen P. Goodman spoke in this forum about the importance of bringing interpretability to the algorithmic transparency debate. They focused on algorithmic rankers, discussed the harms of opacity, and argued that the burden of making ranked outputs transparent rests with the producer of the ranking. They went on to propose a “nutritional label” for rankings called Ranking Facts.

In this post, Julia Stoyanovich and Bill Howe discuss their recent technical progress on bringing the idea of Ranking Facts to life, placing the nutritional label metaphor in the broader context of the ongoing algorithmic accountability and transparency debate.

In 2016, we began with a specific type of nutritional label that focuses on algorithmic rankers.  We have since developed a Web-based Ranking Facts tool, which will be presented at the upcoming ACM SIGMOD 2018 conference.   

Figure 1: Ranking Facts on the CS departments dataset. The Ingredients widget (green) has been expanded to show the details of the attributes that strongly influence the ranking. The Fairness widget (blue) has been expanded to show details of the fairness computation.

Figure 1 presents Ranking Facts for CS department rankings, the same dataset as was used for illustration in our August 2016 post.  The nutritional label was constructed automatically, and consists of a collection of visual widgets, each with an overview and a detailed view.  

  • Recipe widget succinctly describes the ranking algorithm. For example, for a score-based ranker that uses a linear scoring formula to assign a score to each item, each attribute would be listed together with its weight.
  • Ingredients widget lists attributes most material to the ranked outcome, in order of importance. For example, for a linear model, this list could present the attributes with the highest learned weights.
  • Stability widget explains whether the ranking methodology is robust on this particular dataset – would small changes in the data, such as those due to uncertainty or noise, result in significant changes in the ranked order?  
  • Fairness and Diversity widgets quantify whether the ranked outcome exhibits parity (according to some measure – three such measures are presented in Figure 1), and whether the set of results is diverse with respect to one or several demographic characteristics.
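As a rough illustration of what a parity measure might compute, the sketch below compares a group's share of the top-k with its share of the whole ranking. This is a deliberately simple stand-in and not one of the three measures shown in Figure 1; the function name `topk_share_gap`, the group labels, and the ranking itself are hypothetical.

```python
def topk_share_gap(ranked_groups, k, group):
    """Share of `group` among the top-k items minus its share in the whole ranking.

    `ranked_groups` lists the group label of each item, best-ranked item first.
    A value near 0 suggests parity under this (simplistic) measure; a negative
    value means the group is under-represented at the top.
    """
    if not 0 < k <= len(ranked_groups):
        raise ValueError("k must be between 1 and the number of ranked items")
    top_share = sum(g == group for g in ranked_groups[:k]) / k
    overall_share = sum(g == group for g in ranked_groups) / len(ranked_groups)
    return top_share - overall_share


# Hypothetical ranking of ten departments, labeled by size.
ranked = ["large", "large", "large", "small", "large",
          "small", "small", "large", "small", "small"]

gap = topk_share_gap(ranked, k=4, group="small")
print(f"small departments: top-4 share minus overall share = {gap:+.2f}")
```

A label would likely report such a number next to a plain-language reading of it, since the raw value alone says little to a non-technical consumer.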

What’s new about nutritional labels?

The database and cyberinfrastructure communities have been studying systems and standards for metadata, provenance, and transparency for decades. For example, the First Provenance Challenge in 2008 led to the creation of the Open Provenance Model that standardized years of previous efforts across multiple communities. We are now seeing renewed interest in these topics due to the proliferation of machine learning applications that use data opportunistically. Several projects are emerging that explore this concept, including the Dataset Nutrition Label at the Berkman Klein Center at Harvard and the MIT Media Lab, Datasheets for Datasets, and some emerging work on Data Statements for NLP datasets from Bender and Friedman. In our work, we are interested in automating the creation of nutritional labels, for both datasets and models, and in providing open source tools for others to use in their projects.

Is a nutritional label simply an apt new name for an old idea?  We think not! We see nutritional labels as a unifying metaphor that is responsive to changes in how data is being used today.  

Datasets are now increasingly used to train models to make decisions once made by humans. In these automated systems, biases in the data are propagated and amplified with no human in the loop. The bias, and its effect on the quality of the decisions made, are not easily detectable due to the relative opacity of the system. As we have seen time and time again, models will appear to work well, but will silently and dangerously reinforce discrimination. Worse, these models will legitimize the bias: “the computer said so.” So we are designing nutritional labels for data and models to respond specifically to the harms implied by these scenarios, in contrast to the more general concept of just “data about data.”

Use cases for nutritional labels: Enhancing data sharing in the public sector

Since we first began discussing nutritional labels in 2016, we’ve seen increased interest from the public sector in scenarios where data sharing is considered high-risk. Nutritional labels can be used to support data sharing, while mitigating some of the associated risks. Consider these examples:

Algorithmic transparency law in New York City

New York City recently passed a law requiring that a task force be put in place to survey the current use of “automated decision systems,” defined as “computerized implementations of algorithms, including those derived from machine learning or other data processing or artificial intelligence techniques, which are used to make or assist in making decisions,” in City agencies.  The task force will develop a set of recommendations for enacting algorithmic transparency, which, as we argued in our testimony before the New York City Council Committee on Technology regarding Automated Processing of Data, cannot be achieved without data transparency. Nutritional labels can support data transparency and interpretability,  surfacing the statistical properties of a dataset, the methodology that was used to produce it, and, ultimately, substantiating the “fitness for use” of a dataset in the context of a specific automated decision system or task.

Addressing the opioid epidemic

An effective response to the opioid epidemic requires coordination between at least three sectors: health care, criminal justice, and emergency housing. The underlying optimization problem is to assign resources, such as hospital rooms, jail cells, and shelter beds, to at-risk citizens effectively, fairly, and transparently. Yet centralizing all data is disallowed by law, and solving the global optimization problem is therefore difficult. We’ve seen interest in nutritional labels to share the details of local resource allocation strategies, to help bootstrap a coordinated response without violating data sharing principles. In this case the nutritional labels are shared separately from the datasets themselves.

Mitigating urban homelessness

With the Bill and Melinda Gates Foundation, we are integrating data about homeless families from multiple government agencies and non-profits to understand how different pathways through the network of services affect outcomes.  Ultimately, we are using machine learning to deliver prioritized recommendations to specific families. But the families and case workers need to understand how a particular recommendation was made, so they can in turn make an informed decision about whether to follow it.  For example, income levels, substance abuse issues, or health issues may all affect the recommendation, but only the families themselves know whether the information is reliable.

Sharing transportation data

At the University of Washington, we are developing the Transportation Data Collaborative, an honest broker system that can provide reports and research to policy makers while maintaining security and privacy for sensitive information about companies and individuals.  We are releasing nutritional labels for reports, models, and synthetic datasets that we produce to share known biases about the data and our methods of protecting privacy.

Properties of a nutritional label

To differentiate a nutritional label from more general forms of metadata, we articulate several properties:

  • Comprehensible: The label is not a complete (and therefore overwhelming) history of every processing step applied to produce the result.  This approach has its place and has been extensively studied in the literature on scientific workflows, but is unsuitable for the applications we target.  The information on a nutritional label must be short, simple, and clear.
  • Consultative: Nutritional labels should provide actionable information, rather than just descriptive metadata.  For example, universities may invest in research to improve their ranking, or consumers may cancel unused credit card accounts to improve their credit score.
  • Comparable: Nutritional labels enable comparisons between related products, implying a standard.
  • Concrete: The label must contain more than just general statements about the source of the data; such statements do not provide sufficient information to make technical decisions on whether or not to use the data.

Data and models are chained together into complex automated pipelines — computational systems “consume” datasets at least as often as people do, and therefore also require nutritional labels!  We articulate additional properties in this context:

  • Computable: Although primarily intended for human consumption, nutritional labels should be machine-readable to enable specific applications: data discovery, integration, automated warnings of potential misuse.  
  • Composable: Datasets are frequently integrated to construct training data; the nutritional labels must be similarly integratable.  In some situations, the composed label is simple to construct: the union of sources. In other cases, the biases may interact in complex ways: a group may be sufficiently represented in each source dataset, but underrepresented in their join.  
  • Concomitant: The label should be carried with the dataset; systems should be designed to propagate labels through processing steps, modifying the label as appropriate, and implementing the paradigm of transparency by design.
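To make the Computable and Composable properties more concrete, here is a minimal sketch of a machine-readable label and of the join scenario described above. The label schema (`dataset`, `records`, `group_shares`), the helper names, and both source datasets are hypothetical; a real label would carry far more than group shares.

```python
import json
from collections import Counter


def group_shares(records):
    """Return the fraction of records in each group.

    `records` maps a record id to a group label (hypothetical schema).
    """
    counts = Counter(records.values())
    total = sum(counts.values())
    return {group: n / total for group, n in sorted(counts.items())}


def make_label(name, records):
    """Build a tiny machine-readable label: record count plus group shares."""
    return {
        "dataset": name,
        "records": len(records),
        "group_shares": group_shares(records),
    }


# Two hypothetical sources keyed by person id; each is balanced across groups A and B.
source_a = {1: "A", 2: "A", 3: "B", 4: "B", 5: "A", 6: "B"}
source_b = {1: "A", 2: "A", 5: "A", 6: "B", 7: "B", 8: "B"}

# Inner join on id: keep only the people who appear in both sources.
joined = {pid: group for pid, group in source_a.items() if pid in source_b}

labels = [
    make_label("source_a", source_a),
    make_label("source_b", source_b),
    make_label("join", joined),
]
for label in labels:
    print(json.dumps(label))
```

In this toy case group B makes up half of each source but only a quarter of the join, so the label of the join cannot be derived from the two source labels alone; it has to be recomputed or propagated by the system that performs the join.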

Going forward

We are interested in the application of nutritional labels at various stages in the data science lifecycle: data scientists triage datasets for use in training their models; data practitioners inspect and validate trained models before deploying them in their domains; consumers review nutritional labels to understand how decisions that affect them were made and how to respond.

The software infrastructure implied by nutritional labels suggests a number of open questions for the computer science community: Under what circumstances can nutritional labels be generated automatically for a given dataset or model? Can we automatically detect and report potential misuse of datasets or models, given the information in a nutritional label?  We’ve suggested that nutritional labels should be computable, composable, and concomitant — carried with the datasets to which they pertain; how can we design systems that accommodate these requirements?  

We look forward to opening these discussions with the database community at two upcoming events:  at ACM SIGMOD 2018, where we are organizing a special session on a technical research agenda in data ethics and responsible data management,  and at VLDB 2018, where we will run a debate on data and algorithmic ethics.

Supplement for Revealing Algorithmic Rankers (Table 1)

Table 1: A ranking of Computer Science departments per csrankings.org, with additional attributes from the NRC assessment dataset. Here, average count is the geometric mean of the adjusted number of publications in each area by institution, faculty is the number of faculty in the department, pubs is the average number of publications per faculty (2000-2006), and GRE is the average GRE score (2004-2006). Departments are ranked by average count.

Rank (CSR) | Name | Average Count (CSR) | Faculty (CSR) | Pubs (NRC) | GRE (NRC)
1 | Carnegie Mellon University | 18.3 | 122 | 2 | 791
2 | Massachusetts Institute of Technology | 15 | 64 | 3 | 772
3 | Stanford University | 14.3 | 55 | 5 | 800
4 | University of California–Berkeley | 11.4 | 50 | 3 | 789
5 | University of Illinois–Urbana-Champaign | 10.5 | 55 | 3 | 772
6 | University of Washington | 10.3 | 50 | 2 | 796
7 | Georgia Institute of Technology | 8.9 | 81 | 2 | 797
8 | University of California–San Diego | 7.8 | 49 | 3 | 797
9 | Cornell University | 6.9 | 45 | 2 | 800
10 | University of Michigan | 6.8 | 63 | 3 | 800
11 | University of Texas–Austin | 6.6 | 43 | 3 | 789
12 | Columbia University | 6.3 | 49 | 3 | 788
13 | University of Massachusetts–Amherst | 6.2 | 47 | 2 | 796
14 | University of Maryland–College Park | 5.5 | 42 | 2 | 791
15 | University of Wisconsin–Madison | 5.1 | 35 | 2 | 793
16 | University of Southern California | 4.4 | 47 | 3 | 793
17 | University of California–Los Angeles | 4.3 | 32 | 3 | 797
18 | Northeastern University | 4 | 46 | 2 | 797
19 | Purdue University–West Lafayette | 3.6 | 42 | 2 | 772
20 | Harvard University | 3.4 | 29 | 3 | 794
20 | University of Pennsylvania | 3.4 | 32 | 3 | 800
22 | University of California–Santa Barbara | 3.2 | 28 | 4 | 793
22 | Princeton University | 3.2 | 27 | 2 | 796
24 | New York University | 3 | 29 | 2 | 796
24 | Ohio State University | 3 | 39 | 3 | 798
26 | University of California–Davis | 2.9 | 27 | 2 | 771
27 | Rutgers The State University of New Jersey–New Brunswick | 2.8 | 33 | 2 | 758
27 | University of Minnesota–Twin Cities | 2.8 | 37 | 2 | 777
29 | Brown University | 2.5 | 24 | 2 | 768
30 | Northwestern University | 2.4 | 35 | 1 | 787
31 | Pennsylvania State University | 2.3 | 28 | 3 | 790
31 | Texas A & M University–College Station | 2.3 | 36 | 1 | 775
33 | State University of New York–Stony Brook | 2.2 | 33 | 3 | 796
33 | Indiana University–Bloomington | 2.2 | 35 | 1 | 765
33 | Duke University | 2.2 | 22 | 3 | 800
33 | Rice University | 2.2 | 18 | 2 | 800
37 | University of Utah | 2.1 | 29 | 2 | 776
37 | Johns Hopkins University | 2.1 | 24 | 2 | 766
39 | University of Chicago | 2 | 28 | 2 | 779
40 | University of California–Irvine | 1.9 | 28 | 2 | 787
41 | Boston University | 1.6 | 15 | 2 | 783
41 | University of Colorado–Boulder | 1.6 | 32 | 1 | 761
41 | University of North Carolina–Chapel Hill | 1.6 | 22 | 2 | 794
41 | Dartmouth College | 1.6 | 18 | 2 | 794
45 | Yale University | 1.5 | 18 | 2 | 800
45 | University of Virginia | 1.5 | 18 | 2 | 789
45 | University of Rochester | 1.5 | 18 | 3 | 786
48 | Arizona State University | 1.4 | 14 | 2 | 787
48 | University of Arizona | 1.4 | 18 | 2 | 784
48 | Virginia Polytechnic Institute and State University | 1.4 | 32 | 1 | 780
48 | Washington University in St. Louis | 1.4 | 17 | 2 | 790

Revealing Algorithmic Rankers

By Julia Stoyanovich (Assistant Professor of Computer Science, Drexel University) and Ellen P. Goodman (Professor, Rutgers Law School)

ProPublica’s story on “machine bias” in an algorithm used for sentencing defendants amplified calls to make algorithms more transparent and accountable. It has never been more clear that algorithms are political (Gillespie) and embody contested choices (Crawford), and that these choices are largely obscured from public scrutiny (Pasquale and Citron). We see it in controversies over Facebook’s newsfeed, or Google’s search results, or Twitter’s trending topics. Policymakers are considering how to operationalize “algorithmic ethics” and scholars are calling for accountable algorithms (Kroll, et al.).

One kind of algorithm that is at once especially obscure, powerful, and common is the ranking algorithm (Diakopoulos). Algorithms rank individuals to determine credit worthiness, desirability for college admissions and employment, and compatibility as dating partners. They encode ideas of what counts as the best schools, neighborhoods, and technologies. Despite their importance, we actually can know very little about why this person was ranked higher than another in a dating app, or why this school has a better rank than that one. This is true even if we have access to the ranking algorithm, for example, if we have complete knowledge about the factors used by the ranker and their relative weights, as is the case for US News ranking of colleges. In this blog post, we argue that syntactic transparency, wherein the rules of operation of an algorithm are more or less apparent, or even fully disclosed, still leaves stakeholders in the dark: those who are ranked, those who use the rankings, and the public whose world the rankings may shape.

Using algorithmic rankers as an example, we argue that syntactic transparency alone will not lead to true algorithmic accountability (Angwin). This is true even if the complete input data is publicly available. We advocate instead for interpretability, which rests on making explicit the interactions between the program and the data on which it acts. An interpretable algorithm allows stakeholders to understand the outcomes, not merely the process by which outcomes were produced.

Opacity in Algorithmic Rankers

Algorithmic rankers take as input a database of items and produce a ranked list of items as output. The relative ranking of the items may be computed based on an explicitly provided scoring function. Or the ranking function may be learned, using learning-to-rank methods that are deployed extensively in information retrieval and recommender systems.

The simplest kind of ranker is a score-based ranker, which applies a scoring function independently to each item and then sorts the items on their scores. Many of these rankers use monotone aggregation scoring functions, such as weighted sums of attribute values with non-negative weights. In the very simplest case, the items are sorted on the value of a single attribute, which amounts to setting the weight of that attribute to 1 and the weights of all other attributes to 0.
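A score-based ranker of this kind fits in a few lines. The sketch below is illustrative only: the function name `rank_by_score` and the dictionary layout are our own choices, and the three rows are taken from Table 1.

```python
def rank_by_score(items, weights):
    """Score each item independently as a weighted sum, then sort by score, highest first."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("monotone aggregation requires non-negative weights")

    def score(item):
        return sum(w * item[attr] for attr, w in weights.items())

    return sorted(items, key=score, reverse=True)


# Three rows from Table 1, deliberately listed out of rank order.
departments = [
    {"name": "Stanford University", "average_count": 14.3, "faculty": 55},
    {"name": "Carnegie Mellon University", "average_count": 18.3, "faculty": 122},
    {"name": "Massachusetts Institute of Technology", "average_count": 15.0, "faculty": 64},
]

# Simplest case: weight 1 on one attribute and 0 on all others.
by_average_count = rank_by_score(departments, {"average_count": 1, "faculty": 0})
print([d["name"] for d in by_average_count])
```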

This is illustrated in our running example in Table 1, which gives a ranking of 51 computer science departments as per csrankings.org (CSR). We augmented the data with several attributes from the assessment of research-doctorate programs by the National Research Council (NRC) to illustrate some points. The source of each attribute (CSR or NRC) is listed next to the attribute name. We recognize that the augmented CS rankings are already syntactically transparent. What’s more, they provide the entire data set. We use them for illustrative purposes.

Table 1: A ranking of Computer Science departments per csrankings.org, with additional attributes from the NRC assessment dataset. Here, average count is the geometric mean of the adjusted number of publications in each area by institution, faculty is the number of faculty in the department, pubs is the average number of publications per faculty (2000-2006), and GRE is the average GRE score (2004-2006). Departments are ranked by average count.

Rank (CSR) | Name | Average Count (CSR) | Faculty (CSR) | Pubs (NRC) | GRE (NRC)
1 | Carnegie Mellon University | 18.3 | 122 | 2 | 791
2 | Massachusetts Institute of Technology | 15 | 64 | 3 | 772
3 | Stanford University | 14.3 | 55 | 5 | 800
4 | University of California–Berkeley | 11.4 | 50 | 3 | 789
5 | University of Illinois–Urbana-Champaign | 10.5 | 55 | 3 | 772
... (remaining rows omitted here; the full table appears in the supplement) ...
45 | Yale University | 1.5 | 18 | 2 | 800
45 | University of Virginia | 1.5 | 18 | 2 | 789
45 | University of Rochester | 1.5 | 18 | 3 | 786
48 | Arizona State University | 1.4 | 14 | 2 | 787
48 | University of Arizona | 1.4 | 18 | 2 | 784
48 | Virginia Polytechnic Institute and State University | 1.4 | 32 | 1 | 780
48 | Washington University in St. Louis | 1.4 | 17 | 2 | 790

Ranked results are difficult for people to interpret, whether a ranking is computed explicitly or learned, whether the method (e.g., the scoring function or, more generally, the model) is known or unknown, and whether the user can access the entire output or only the highest-ranked items (the top-k). There are several sources of this opacity, illustrated below for score-based rankers.

Sources of Opacity

Source 1: The scoring formula alone does not indicate the relative rank of an item. Rankings are, by definition, relative, while scores are absolute. Knowing how the score of an item is computed says little about the outcome — the position of a particular item in the ranking, relative to other items. Is 10.5 a high score or a low score? That depends on how 10.5 compares to the scores of other items, for example to the highest attainable score and to the highest score of some actual item in the input. In our example in Table 1 this kind of opacity is mitigated because there is both syntactic transparency (the scoring formula is known) and the input is public.

Source 2: The weight of an attribute in the scoring formula does not determine its impact on the outcome. Consider again the example in Table 1, and suppose that we first normalize the values of the attributes, and then compute the score of each department by summing up the values of faculty (with weight 0.2), average count (with weight 0.3) and GRE (with weight 0.5). According to this scoring method, the size of the department (faculty) is the least important factor. Yet, it will be the deciding factor that sets apart top-ranked departments from those in lower ranks, both because the value of this attribute changes most dramatically in the data, and because it correlates with average count (in effect, double-counting). In contrast, GRE is syntactically the most important factor in the formula, yet in this dataset it has very close values for all items, and so has limited actual effect on the ranking.
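One way to see the gap between weight and impact is to compare how much each weighted term actually varies across items. The sketch below uses the ten top-ranked rows of Table 1 and the weights from the example. Dividing each attribute by its column maximum is our assumption, since the text does not fix a normalization; other choices, such as min-max scaling, would give different numbers.

```python
# (average count, faculty, GRE) for the ten top-ranked rows of Table 1
rows = [
    (18.3, 122, 791), (15.0, 64, 772), (14.3, 55, 800), (11.4, 50, 789),
    (10.5, 55, 772), (10.3, 50, 796), (8.9, 81, 797), (7.8, 49, 797),
    (6.9, 45, 800), (6.8, 63, 800),
]
columns = {"average_count": 0, "faculty": 1, "GRE": 2}
weights = {"average_count": 0.3, "faculty": 0.2, "GRE": 0.5}


def contribution_spread(attr):
    """Range of weight * (value / column maximum) across the rows.

    Normalizing by the column maximum is an assumption made for this sketch.
    """
    values = [row[columns[attr]] for row in rows]
    top = max(values)
    contributions = [weights[attr] * v / top for v in values]
    return max(contributions) - min(contributions)


spread = {attr: contribution_spread(attr) for attr in weights}
for attr, s in sorted(spread.items(), key=lambda kv: kv[1]):
    print(f"{attr:14s} weight={weights[attr]:.1f} spread={s:.3f}")
```

On these rows GRE carries the largest weight but the smallest spread, in line with the observation that it has limited actual effect on the ranking.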

Source 3: The ranking output may be unstable. A ranking may be unstable because of the scores generated on a particular dataset. An example would be tied scores, where the tie is not reflected in the ranking. In this case, the choice of any particular rank order is arbitrary. Moreover, unless raw scores are disclosed, the user has no information about the magnitude of the difference in scores between items that appear in consecutive ranks. In Table 1, CMU (18.3) has a much higher score than the immediately following MIT (15). This is in contrast to, e.g., UIUC (10.5, rank 5) and UW (10.3, rank 6), which are nearly tied. The difference in scores between distinct adjacent ranks decreases dramatically as we move down the list: it is at most 0.4, and usually 0.1, for departments in ranks 16 through 48. CSRankings’ syntactic transparency (disclosing its ranking method to the user) and accessible data allow us to see the instability, but this is unusual.
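When raw scores are disclosed, this kind of output instability is easy to surface. A minimal sketch, using the average count column of Table 1 in rank order; the variable names are ours, and rounding gaps to one decimal is a choice made to match the precision of the published scores.

```python
from collections import Counter

# Average count for the 51 departments in Table 1, in rank order
scores = [
    18.3, 15.0, 14.3, 11.4, 10.5, 10.3, 8.9, 7.8, 6.9, 6.8,
    6.6, 6.3, 6.2, 5.5, 5.1, 4.4, 4.3, 4.0, 3.6, 3.4,
    3.4, 3.2, 3.2, 3.0, 3.0, 2.9, 2.8, 2.8, 2.5, 2.4,
    2.3, 2.3, 2.2, 2.2, 2.2, 2.2, 2.1, 2.1, 2.0, 1.9,
    1.6, 1.6, 1.6, 1.6, 1.5, 1.5, 1.5, 1.4, 1.4, 1.4, 1.4,
]

# Gaps between consecutive distinct scores, highest score first.
distinct = sorted(set(scores), reverse=True)
gaps = [round(hi - lo, 1) for hi, lo in zip(distinct, distinct[1:])]

# Items that share their score with at least one other item.
tie_sizes = Counter(scores)
tied_items = sum(n for n in tie_sizes.values() if n > 1)

print("gap between ranks 1 and 2:", gaps[0])
print("gap between ranks 5 and 6:", gaps[4])
print("items sharing a score with another item:", tied_items, "of", len(scores))
```

Reporting the gaps and the number of tied items alongside the ranks lets a reader judge which rank differences are meaningful.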

Source 4: The ranking methodology may be unstable. The scoring function may produce vastly different rankings with small changes in attribute weights. This is difficult to detect even with syntactic transparency, and even if the data is public. Malcolm Gladwell discusses this issue and gives compelling examples in his 2011 piece, The Order of Things. In our example in Table 1, a scoring function that is based on a combination of pubs and GRE would be unstable, because the values of these attributes are both very close for many of the items and induce different rankings, and so prioritizing one attribute over the other slightly would cause significant re-shuffling.
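One simple probe for this kind of instability is to shift the weights slightly and count how many pairs of items swap their relative order. The sketch below does this for a score that combines pubs and GRE on the ten top-ranked rows of Table 1. The min-max normalization, the weight values 0.4 and 0.6, the short department labels, and the pair-counting measure are all our assumptions, not part of any published method.

```python
from itertools import combinations

# (short label, pubs, GRE) for the ten top-ranked rows of Table 1
rows = [
    ("CMU", 2, 791), ("MIT", 3, 772), ("Stanford", 5, 800),
    ("Berkeley", 3, 789), ("UIUC", 3, 772), ("UW", 2, 796),
    ("Georgia Tech", 2, 797), ("UCSD", 3, 797), ("Cornell", 2, 800),
    ("Michigan", 3, 800),
]


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


pubs = min_max([r[1] for r in rows])
gre = min_max([r[2] for r in rows])


def ranking(w_pubs):
    """Order labels by w_pubs * pubs + (1 - w_pubs) * GRE, highest score first."""
    scores = {r[0]: w_pubs * p + (1 - w_pubs) * g
              for r, p, g in zip(rows, pubs, gre)}
    # sorted() is stable, so exactly tied items keep their input order.
    return sorted(scores, key=scores.get, reverse=True)


def flipped_pairs(order_a, order_b):
    """Count pairs of items whose relative order differs between two rankings."""
    pos_b = {name: i for i, name in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


flipped = flipped_pairs(ranking(0.4), ranking(0.6))
print("pairs that swap order when the pubs weight moves from 0.4 to 0.6:", flipped)
```

On this ten-row subset only a few pairs flip; the same probe can be run on the full table or with other weight steps, where the results may differ.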

The opacity concerns described here are all due to the interaction between the scoring formula (or, more generally, an a priori postulated model) and the actual dataset being ranked. In a recent paper, one of us observed that structured datasets show rich correlations between item attributes in the presence of ranking, and that such correlations are often local (i.e., are present in some parts of the dataset but not in others). To be clear, this kind of opacity is present whether or not there is syntactic transparency.

Harms of Opacity

Opacity in algorithmic rankers can lead to four types of harms:

(1) Due process / fairness. The subjects of the ranking cannot have confidence that their ranking is meaningful or correct, or that they have been treated like similarly situated subjects. Syntactic transparency helps with this but it will not solve the problem entirely, especially when people cannot interpret how weighted factors have impacted the outcome (Source 2 above).

(2) Hidden normative commitments. A ranking formula implements some vision of the “good.” Unless the public knows what factors were chosen and why, and with what weights assigned to each, it cannot assess the compatibility of this vision with other norms. Even where the formula is disclosed, real public accountability requires information about whether the outcomes are stable, whether the attribute weights are meaningful, and whether the outcomes are ultimately validated against the chosen norms. Did the vendor evaluate the actual effect of the features that are postulated as important by the scoring / ranking model? Did the vendor take steps to compensate for mutually-reinforcing correlated inputs, and for possibly discriminatory inputs? Was stability of the ranker interrogated on real or realistic inputs? This kind of transparency around validation is important both for learning algorithms, which operate according to rules that are constantly in flux and responsive to shifting data inputs, and for simpler score-based rankers, which are likewise sensitive to the data.

(3) Interpretability. Especially where ranking algorithms are performing a public function (e.g., allocation of public resources or organ donations) or directly shaping the public sphere (e.g., ranking politicians), political legitimacy requires that the public be able to interpret algorithmic outcomes in a meaningful way. At the very least, they should know the degree to which the algorithm has produced robust results that improve upon a random ordering of the items (a ranking-specific confidence measure). In the absence of interpretability, there is a threat to public trust and to democratic participation, raising the dangers of an algocracy (Danaher) – rule by incontestable algorithms.

(4) Meta-methodological assessment. Following on from the interpretability concerns is a meta question about whether a ranking algorithm is the appropriate method for shaping decisions. There are simply some domains, and some instances of datasets, in which rank order is not appropriate. For example, if there are very many ties or near-ties induced by the scoring function, or if the ranking is too unstable, it may be better to present data through an alternative mechanism such as clustering. More fundamentally, we should question the use of an algorithmic process if its effects are not meaningful or if it cannot be explained. In order to understand whether the ranking methodology is valid, as a first order question, the algorithmic process needs to be interpretable.

The Possibility of Knowing

Recent scholarship on the issue of algorithmic accountability has devalued transparency in favor of verification. The claim is that because algorithmic processes are protean and extremely complex (due to machine learning) or secret (due to trade secrets or privacy concerns), we need to rely on retrospective checks to ensure that the algorithm is performing as promised. Among these checks would be cryptographic techniques like zero knowledge proofs (Kroll, et al.) to confirm particular features, audits (Sandvig) to assess performance, or reverse engineering (Perel and Elkin-Koren) to test cases.

These are valid methods of interrogation, but we do not want to give up on disclosure. Retrospective testing puts a significant burden on users. Proofs are useful only when you know what you are looking for. Reverse engineering with test cases can lead to confirmation bias. All these techniques put the burden of inquiry exclusively on individuals for whom interrogation may be expensive and ultimately fruitless. The burden instead should fall more squarely on the least cost avoider, which will be the vendor who is in a better position to reveal how the algorithm works (even if only partially). What if food manufacturers resisted disclosing ingredients or nutritional values, and instead we were put to the trouble of testing their products or asking them to prove the absence of a substance? That kind of disclosure by verification is very different from having a nutritional label.

What would it take to provide the equivalent of a nutritional label for the process and the outputs of algorithmic rankers? What suffices as an appropriate and feasible explanation depends on the target audience.

For an individual being ranked, a useful description would explain that individual's specific ranked outcome and suggest ways to improve it. What changes can NYU CS make to improve its ranking? Why is the NYU CS department ranked 24? Which attributes make this department perform worse than those ranked higher? As we argued above, the answers to these questions depend on the interaction between the ranking method and the dataset over which the ranker operates. When working with data that is not public (e.g., involving credit or medical information about individuals), an explanation mechanism of this kind must be mindful of any privacy considerations. Individually-responsive disclosures could be offered in a widget that allows ranked entities to experiment with the results by changing the inputs.

An individual consumer of a ranked output would benefit from a concise and intuitive description of the properties of the ranking. Based on this explanation, users will get a glimpse of, e.g., the diversity (or lack thereof) that the ranking exhibits in terms of attribute values. Both the attributes that comprise the scoring function, if known (or, more generally, the features that are part of the model), and the attributes that co-occur or even correlate with the scoring attributes, can be described explicitly. In our example in Table 1, a useful explanation may be that a ranking on average count will over-represent large departments (with many faculty) at the top of the list, while GRE does not strongly influence rank.


Figure 1: A hypothetical Ranking Facts label.

Figure 1 presents a hypothetical “nutritional label” for rankings, using the augmented CSRankings in Table 1 as input. Inspired by Nutrition Facts, our Ranking Facts label is aimed at the consumer, such as a prospective CS program applicant, and addresses three of the four opacity sources described above: relativity, impact, and output stability. We do not address methodological stability in the label. How this dimension should be quantified and presented to the user is an open technical problem.

The Ranking Facts show how the properties of the 10 highest-ranked items compare to the entire dataset (Relativity), making explicit cases where the ranges of values, and the median value, are different at the top-10 vs. overall (median is marked with red triangles for faculty size and average publication count). The label lists the attributes that have most impact on the ranking (Impact), presents the scoring formula (if known), and explains which attributes correlate with the computed score. Finally, the label graphically shows the distribution of scores (Stability), explaining that scores differ significantly up to top-10 but are nearly indistinguishable in later positions.
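The Relativity comparison can be approximated with nothing more than medians. A minimal sketch using the faculty column of Table 1, listed in rank order; the use of the median and of the top 10 follows the description above, while the variable names are ours.

```python
from statistics import median

# Faculty counts for the 51 departments in Table 1, in rank order
faculty = [
    122, 64, 55, 50, 55, 50, 81, 49, 45, 63,
    43, 49, 47, 42, 35, 47, 32, 46, 42, 29,
    32, 28, 27, 29, 39, 27, 33, 37, 24, 35,
    28, 36, 33, 35, 22, 18, 29, 24, 28, 28,
    15, 32, 22, 18, 18, 18, 18, 14, 18, 32, 17,
]

top_k = 10
top_median = median(faculty[:top_k])
overall_median = median(faculty)

print(f"median faculty size, top-{top_k}: {top_median}")
print(f"median faculty size, overall: {overall_median}")
```

The top-10 median is well above the overall median, which is the pattern the label makes explicit for faculty size.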

Something like Ranking Facts makes the process and outcome of algorithmic ranking interpretable for consumers, and reduces the likelihood of the opacity harms discussed above. Beyond Ranking Facts, it is important to develop interpretability tools that enable vendors to design fair, meaningful and stable ranking processes, and that support external auditing. Promising technical directions include, e.g., quantifying the influence of various features on the outcome under different assumptions about availability of data and code, and investigating whether provenance techniques can be used to generate explanations.