November 26, 2024

Strengthening AI Accountability Through Better Third Party Evaluations (Part 1)

Millions of people worldwide use general purpose AI systems: They use ChatGPT to write documents, Claude to analyze data, and Stable Diffusion to generate images. While these AI systems offer significant benefits, they also pose serious risks, like producing non-consensual intimate imagery, facilitating production of bioweapons, and contributing to biased decisions. Third party AI evaluations are crucial for assessing these risks because they are independent from company interests and incorporate diverse perspectives and expertise that better reflect the wide range of real-world applications.

While software security has developed reporting infrastructure, legal protections and incentives (e.g., bug bounties) to incentivize third party evaluation, this is not yet the case for general purpose AI systems. This is why researchers at Stanford, MIT, Princeton’s Center for Information Technology Policy, and Humane Intelligence convened leaders from academia, industry, civil society and government for an October 28, 2024 virtual workshop to articulate a vision for third party AI evaluations.

Key takeaways from the workshop, reflecting areas of agreement among many speakers, included the need for more legal and technical protections for third party evaluations – reinforcing scholars’ call for more legal and technical protections (also known as “safe harbors”) for third party AI evaluators – along with the need for more standardization and coordination of evaluation processes and shared terminology.

The workshop spanned three sessions exploring evaluations in practice, evaluations by design, and evaluation law and policy, beginning with a keynote from Rumman Chowdhury, CEO at Humane Intelligence.


You can watch the full workshop here.

Session 1. The Need for Independent Oversight 

In her keynote, Chowdhury compared the status quo to a new Gilded Age in that it is characterized by major economic disruption and a lack of protections for users and citizens. She stressed the need for independent oversight: The standard practice where “companies write their own tests and they grade themselves” can result in biased evaluations and limit standardization, information sharing, and generalizability beyond specific settings.

In contrast, third party evaluators can have more depth, breadth, and independence in their assessments. Chowdhury contended that while the software security space and penetration testing may offer some lessons like legal protections for third party evaluators, AI evaluations are more complex than software testing: AI systems are probabilistic and it is difficult to precisely identify negative impacts from these systems and the mechanisms by which they occur. This, in turn, makes mitigation challenging. She called for more legal protections for third party evaluators, building a robust talent pipeline and engaging multiple stakeholders, including lawyers, AI specialists, and auditors.

Session 2. Current AI Evaluation Practices

The first panel featured presentations by Nicholas Carlini, Research Scientist at Google DeepMind; Lama Ahmad, Technical Program Manager and Red Teaming Lead at OpenAI; Avijit Ghosh, Applied Policy Researcher at Hugging Face; and Victoria Westerhoff, Director of the AI Red Team at Microsoft.

Carlini shared insights about his experience evaluating AI models, such as attacks that lead a foundation model to divulge personal information from the training dataset or that involve stealing parts of a production language model. Carlini started out researching software vulnerabilities as a penetration tester, but later shifted to machine learning. He shared that while penetration testing is a standardized procedure with rules around who to disclose vulnerabilities to and when, this is not true for AI vulnerabilities where his process is as ad hoc as “write research paper, upload to arXiv”. For example, it is not clear who to disclose to when a vulnerability resides in an AI model API, and thus affects not just the model developer, but also any deployer using that API. He expressed a wish for better established norms.

Ahmad described how third party evaluations are conducted at OpenAI, distinguishing three forms of evaluations. First, the company solicits external red teaming as part of OpenAI’s Red Teaming Network that aims to discover novel risks or stress tests systems and results in new evaluations. Second, it conducts third party assessments that specialize in particular issue areas to provide subject matter expertise, such as partnerships with AI safety institutes. Third, the company conducts independent research that promotes an ecosystem of better evaluations and methods for alignment. Examples in this area include OpenAI’s efforts to fund research on democratic inputs to AI and its collaboration with Stanford’s Center for Research on Foundation Models on its Holistic Evaluation of Language Models. Ahmad argued that all three forms are needed in the face of a rapidly evolving technological landscape. She also highlighted the challenge of building trust given the growth of the third party evaluation ecosystem and the lack of clear frameworks laying out which third party evaluators are trustworthy and in what areas.

Ghosh presented the coordinated flaw disclosures (CFD) framework for AI vulnerabilities as an analog to the coordinated vulnerability disclosure (CVD) framework for software vulnerabilities. This framework aims to address the unique challenges of AI systems, which include the complexity of ethical and safety questions, scalability issues, and the lack of a unified enumeration of products and weaknesses. The CFD framework has multiple components, including extended model cards, automated verification, an adjudication process and a dynamic scope that adapts to emerging common uses of AI.

Westerhoff described lessons from the work of the internal AI red team she leads at Microsoft. She highlighted the importance of diverse teams to enable diverse testing, especially in light of the multi-modality of models. To achieve this diversity, her team collaborates with teams across Microsoft and external experts, such as experts on specific types of harms. She has found that AI systems are susceptible to similar attacks that humans are susceptible to, and that understanding a product well helps to understand its potential harms. Westerhoff described her team’s development of the open source red teaming framework Python Risk Identification Tool for generative AI (PyRIT), which is part of her general aim to contribute to the field through greater transparency, training, tooling, and reporting mechanisms for third party evaluators. She expressed the hope that going forward, industry researchers would share more insights with each other, including on red teaming techniques.

Ruth E. Appel is a researcher at Stanford University and holds a Master’s in CS and PhD in Political Communication from Stanford, and an MPP from Sciences Po Paris

This post covers the first half of the workshop. Read the next post for the second half.

Speak Your Mind

*