Can a Bayesian Oracle Prevent Harm from an Agent?
Bengio et al.
22 August 2024
Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Because different plausible hypotheses about the world could produce very different outcomes, and we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.
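As a rough schematic (the notation below is illustrative and is not the exact bound derived in the paper), the run-time test described in the abstract amounts to rejecting a candidate action $a$ in context $x$ whenever the most pessimistic harm probability, over hypotheses $h$ that remain plausible under the Bayesian posterior $P(h \mid D)$, exceeds a risk tolerance $\epsilon$:

\[
\text{reject } a \quad \text{if} \quad
\max_{h \,:\, P(h \mid D) \,\ge\, \tau \max_{h'} P(h' \mid D)}
P(\text{harm} \mid x, a, h) \;>\; \epsilon,
\]

where $\tau \in (0, 1]$ controls how demanding the plausibility requirement is and $\epsilon$ is the acceptable level of risk.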
1 Introduction
Ensuring that an AI system will not misbehave is a challenging open problem [4], particularly in the current context of rapid growth in AI capabilities. Governance measures and evaluation-based strategies have been proposed to mitigate the risk of harm from highly capable AI systems, but do not provide any form of safety guarantee when no undesired behavior is detected. In contrast, the safe-by-design paradigm involves designing AI systems with quantitative (possibly probabilistic) safety guarantees from the ground up, and therefore could represent a stronger form of protection [8]. However, how to design such systems remains an open problem too.
Since testing an AI system for violations of a safety specification in every possible context, e.g., every (query, output) pair, is impossible, we consider a rejection sampling approach that declines a candidate output or action if its probability of violating a given safety specification is too high. The question of defining the safety specification (the violation of which is simply referred to as “harm” below) is important and left to future work, possibly building on approaches such as constitutional AI [1].
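To make the rejection-sampling idea concrete, here is a minimal sketch in Python. It assumes a finite hypothesis set with posterior weights and made-up numbers, and the names (cautious_harm_bound, guardrail, plausibility_cutoff, risk_tolerance) are hypothetical; the paper works with Bayesian posteriors analytically rather than with this toy setup.

```python
# Illustrative sketch only: a run-time guardrail that rejects candidate actions
# whose estimated harm probability, taken cautiously over plausible hypotheses,
# exceeds a tolerance. Names and numbers are hypothetical, not from the paper.

import numpy as np

def cautious_harm_bound(harm_probs, posterior, plausibility_cutoff=0.1):
    """Upper-bound the harm probability of an action by maximizing over
    hypotheses whose posterior is within a factor of the most probable one
    (a 'cautious but plausible' set)."""
    posterior = np.asarray(posterior, dtype=float)
    harm_probs = np.asarray(harm_probs, dtype=float)
    plausible = posterior >= plausibility_cutoff * posterior.max()
    return harm_probs[plausible].max()

def guardrail(candidate_actions, harm_prob_fn, posterior, risk_tolerance=0.05):
    """Rejection-sampling guardrail: return the first candidate action whose
    cautious harm bound falls below the risk tolerance."""
    for action in candidate_actions:
        harm_probs = [harm_prob_fn(action, h) for h in range(len(posterior))]
        if cautious_harm_bound(harm_probs, posterior) <= risk_tolerance:
            return action
    return None  # no acceptable action found; defer to a fallback or a human

# Toy usage with 3 hypotheses and made-up harm probabilities.
posterior = [0.6, 0.3, 0.1]
harm_table = {  # harm_table[action][hypothesis]
    "a1": [0.01, 0.02, 0.20],
    "a2": [0.01, 0.03, 0.04],
}
safe = guardrail(["a1", "a2"], lambda a, h: harm_table[a][h], posterior)
print(safe)  # "a2": a1 is rejected because one plausible hypothesis deems it risky
```

In this toy example the guardrail is deliberately pessimistic: an action is rejected as soon as any sufficiently plausible hypothesis assigns it a harm probability above the tolerance, which mirrors the cautious maximization over hypotheses described in the abstract.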