
DEEPMIND. Technical blog. An early warning system for novel AI risks. May 25, 2023

New research proposes a framework for evaluating general-purpose models against novel threats

To pioneer responsibly at the cutting edge of artificial intelligence (AI) research, we must identify new capabilities and novel risks in our AI systems as early as possible.

AI researchers already use a range of evaluation benchmarks to identify unwanted behaviours in AI systems, such as AI systems making misleading statements, biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to include the possibility of extreme risks from general-purpose AI models that have strong skills in manipulation, deception, cyber-offense, or other dangerous capabilities.

In our latest paper, we introduce a framework for evaluating these novel threats, co-authored with colleagues from University of Cambridge, University of Oxford, University of Toronto, Université de Montréal, OpenAI, Anthropic, Alignment Research Center, Centre for Long-Term Resilience, and Centre for the Governance of AI.

Model safety evaluations, including those assessing extreme risks, will be a critical component of safe AI development and deployment.

An overview of our proposed approach: To assess extreme risks from new, general-purpose AI systems, developers must evaluate for dangerous capabilities and alignment (see below). Identifying these risks early unlocks opportunities to be more responsible when training new AI systems, deploying those systems, transparently describing their risks, and applying appropriate cybersecurity standards.

Evaluating for extreme risks

General-purpose models typically learn their capabilities and behaviours during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind has explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behaviour.

Responsible AI developers must look ahead and anticipate possible future developments and novel risks. After continued progress, future general-purpose models may learn a variety of dangerous capabilities by default. For instance, it is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skilfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological, chemical), fine-tune and operate other high-risk AI systems on cloud computing platforms, or assist humans with any of these tasks.

People with malicious intentions accessing such models could misuse their capabilities. Or, due to failures of alignment, these AI models might take harmful actions even without anybody intending this.

Model evaluation helps us identify these risks ahead of time. Under our framework, AI developers would use model evaluation to uncover:

  1. To what extent a model has certain ‘dangerous capabilities’ that could be used to threaten security, exert influence, or evade oversight.
  2. To what extent the model is prone to applying its capabilities to cause harm (i.e. the model’s alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios, and, where possible, should examine the model’s internal workings.
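
To make these two categories concrete, here is a minimal sketch, ours rather than the paper's, of how the outputs of dangerous capability and alignment evaluations might be recorded for a single model. All class and field names (e.g. `DangerousCapabilityResult`, `ModelEvaluationReport`) are hypothetical illustrations, not an API from the framework.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: one hypothetical way to write down the outputs of the
# two evaluation categories described above.

@dataclass
class DangerousCapabilityResult:
    """Outcome of a dangerous capability evaluation for one capability area."""
    capability: str    # e.g. "cyber-offense", "manipulation", "weapons acquisition"
    score: float       # benchmark or red-team score; higher means more capable
    threshold: float   # level above which the capability is treated as present

    @property
    def present(self) -> bool:
        return self.score >= self.threshold


@dataclass
class AlignmentResult:
    """Outcome of an alignment evaluation in one scenario."""
    scenario: str                 # e.g. "oversight evasion under adversarial prompting"
    behaved_as_intended: bool
    notes: str = ""


@dataclass
class ModelEvaluationReport:
    """Collects both categories of results for a single model version."""
    model_id: str
    capability_results: List[DangerousCapabilityResult] = field(default_factory=list)
    alignment_results: List[AlignmentResult] = field(default_factory=list)
```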

Results from these evaluations will help AI developers to understand whether the ingredients sufficient for extreme risk are present. The highest-risk cases will involve multiple dangerous capabilities in combination. The AI system doesn’t need to provide all the ingredients itself, as shown in this diagram:

Ingredients for extreme risk: Sometimes specific capabilities could be outsourced, either to humans (e.g. users or crowdworkers) or to other AI systems. These capabilities must be applied for harm, either due to misuse or failures of alignment (or a mixture of both).

A rule of thumb: the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it’s misused or poorly aligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.
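
Continuing the illustrative sketch above, this rule of thumb can be read as a conservative gating check: a capability profile sufficient for extreme harm is flagged even when alignment evaluations look clean, because the assessment assumes misuse or poor alignment. The function below is a hypothetical rendering of that logic, not an implementation from the paper.

```python
def assess_extreme_risk(report: "ModelEvaluationReport") -> str:
    """Hypothetical, conservative combination of the two evaluation categories.

    Rule of thumb from the text: treat the system as highly dangerous if its
    capability profile could cause extreme harm, assuming it is misused or
    poorly aligned; clean alignment results alone do not clear a capable model.
    """
    dangerous = [r.capability for r in report.capability_results if r.present]
    misbehaved = [r.scenario for r in report.alignment_results
                  if not r.behaved_as_intended]

    if not dangerous:
        # No dangerous capabilities detected: alignment failures still matter,
        # but the system does not meet this particular "highly dangerous" bar.
        return "standard review"
    if misbehaved:
        return f"highly dangerous: capabilities {dangerous} plus alignment failures {misbehaved}"
    # Capabilities present but no observed misbehaviour: still flagged, since
    # the rule of thumb assumes misuse or misalignment could supply the rest.
    return f"highly dangerous: capabilities {dangerous} (assume misuse or misalignment)"
```

On this reading, deploying a model flagged as highly dangerous would require demonstrating the unusually high standard of safety mentioned above, rather than relying on the absence of observed misbehaviour.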

Model evaluation as critical governance infrastructure

If we have better tools for identifying which models are risky, companies and regulators can better ensure:

  1. Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
  2. Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
  3. Transparency: Useful and actionable information is reported to stakeholders, to help them prepare for or mitigate potential risks.
  4. Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.

We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions around training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured model access to external safety researchers and model auditors so they can conduct additional evaluations. The evaluation results can then inform risk assessments before model training and deployment.

A blueprint for embedding model evaluations for extreme risks into important decision-making processes throughout model training and deployment.
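
As a rough sketch of that blueprint, building on the hypothetical records and `assess_extreme_risk` function above (again ours, not the paper's), the same report could gate two checkpoints, continuing a training run and deploying the resulting model, with externally conducted evaluations merged into the developer's own before each risk assessment.

```python
def gate_decision(internal: "ModelEvaluationReport",
                  external: "ModelEvaluationReport",
                  stage: str) -> bool:
    """Hypothetical checkpoint: combine internal and external (auditor-run)
    evaluation results, then decide whether to proceed with `stage`
    ("continue-training" or "deploy")."""
    combined = ModelEvaluationReport(
        model_id=internal.model_id,
        capability_results=internal.capability_results + external.capability_results,
        alignment_results=internal.alignment_results + external.alignment_results,
    )
    verdict = assess_extreme_risk(combined)
    if verdict.startswith("highly dangerous"):
        # Feed the verdict into the governance steps listed earlier:
        # responsible training/deployment decisions, transparency reporting,
        # and stronger security controls.
        print(f"[{stage}] blocked pending review: {verdict}")
        return False
    print(f"[{stage}] cleared: {verdict}")
    return True
```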

Looking ahead

Important early work on model evaluations for extreme risks is already underway at Google DeepMind and elsewhere. But much more progress – both technical and institutional – is needed to build an evaluation process that catches all possible risks and helps safeguard against future, emerging challenges.

Model evaluation is not a panacea; some risks could slip through the net, for example, because they depend too heavily on factors external to the model, such as complex social, political, and economic forces in society. Model evaluation must be combined with other risk assessment tools and a wider dedication to safety across industry, government, and civil society.

Google’s recent blog on responsible AI states that, “individual practices, shared industry standards, and sound government policies would be essential to getting AI right”. We hope many others working in AI and sectors impacted by this technology will come together to create approaches and standards for safely developing and deploying AI for the benefit of all.

We believe that having processes for tracking the emergence of risky properties in models, and for adequately responding to concerning results, is a critical part of being a responsible developer operating at the frontier of AI capabilities.

DEEPMIND. Model evaluation for extreme risks. 24 MAY 2023.

Toby Shevlane¹, Sebastian Farquhar¹, Ben Garfinkel², Mary Phuong¹, Jess Whittlestone³, Jade Leung⁴, Daniel Kokotajlo⁴, Nahema Marchal¹, Markus Anderljung², Noam Kolt⁵, Lewis Ho¹, Divya Siddarth⁶,⁷, Shahar Avin⁸, Will Hawkins¹, Been Kim¹, Iason Gabriel¹, Vijay Bolina¹, Jack Clark⁹, Yoshua Bengio¹⁰,¹¹, Paul Christiano¹² and Allan Dafoe¹

¹Google DeepMind, ²Centre for the Governance of AI, ³Centre for Long-Term Resilience, ⁴OpenAI, ⁵University of Toronto, ⁶University of Oxford, ⁷Collective Intelligence Project, ⁸University of Cambridge, ⁹Anthropic, ¹⁰Université de Montréal, ¹¹Mila – Quebec AI Institute, ¹²Alignment Research Center

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

Figure 1 | The theory of change for model evaluations for extreme risk. Evaluations for dangerous capabilities and alignment inform risk assessments, and are in turn embedded into important governance processes.

1. Introduction

As AI progress has advanced, general-purpose AI systems have tended to display new and hard-to-forecast capabilities – including harmful capabilities that their developers did not intend (Ganguli et al., 2022). Future systems may display even more dangerous emergent capabilities, such as the ability to conduct offensive cyber operations, manipulate people through conversation, or provide actionable instructions on conducting acts of terrorism.

AI developers and regulators must be able to identify these capabilities, if they want to limit the risks they pose. The AI community already relies heavily on model evaluation – i.e. empirical assessment of a model’s properties – for identifying and responding to a wide range of risks. Existing model evaluations measure gender and racial biases, truthfulness, toxicity, recitation of copyrighted content, and many more properties of models (Liang et al., 2022).

We propose extending this toolbox to address risks that would be extreme in scale, resulting from the misuse or misalignment of general-purpose models. Work on this new class of model evaluation is already underway. These evaluations can be organised into two categories: (a) whether a model has certain dangerous capabilities, and (b) whether it has the propensity to harmfully apply its capabilities (alignment).

Model evaluations for extreme risks will play a critical role in governance regimes. A central goal of AI governance should be to limit the creation, deployment, and proliferation of systems that pose extreme risks. To do this, we need tools for looking at a particular system and assessing whether it poses extreme risks. We can then craft company policies or regulations that ensure:

  1. Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
  2. Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
  3. Transparency: Useful and actionable information is reported to stakeholders, to help them mitigate potential risks.
  4. Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.

Many AI governance initiatives focus on the risks inherent to a particular deployment context, such as the “high-risk” applications listed in the draft EU AI Act. However, models with sufficiently dangerous capabilities could pose risks even in seemingly low-risk domains. We therefore need tools for assessing both the risk level of a particular domain and the potentially risky properties of particular models; this paper focuses on the latter.

Section 2 motivates our focus on extreme risks from general-purpose models and refines the scope of the paper. Section 3 outlines a vision for how model evaluations for such risks should be incorporated into AI governance frameworks. Section 4 describes early work in the area and outlines key design criteria for extreme risk evaluations. Section 5 discusses the limitations of model evaluations for extreme risks and outlines ways in which work on these evaluations could cause unintended harm. We conclude with high-level recommendations for AI developers and policymakers.
