The Singapore Consensus on Global AI Safety Research Priorities
Building a trusted AI ecosystem
Explore the conclusions drawn from the International Scientific Exchange on AI Safety at SCAI 2025.
Last updated 8 May 2025
As AI capabilities rapidly evolve, there is vigorous global debate over how to keep AI safe, reliable and beneficial. To fully understand the extent of the risks and how to tackle them, the technical AI safety research areas that are broadly agreed to be valuable need to be identified and prioritised, so that R&D efforts can more effectively drive the safety and evaluation mechanisms required for a trusted ecosystem for AI use.
A key goal of the 2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety is to identify and prioritise such research areas. Bringing together the global research community and building on the International AI Safety Report chaired by Yoshua Bengio, SCAI 2025’s outcome document – the Singapore Consensus on Global AI Safety Research Priorities – aims to document substantial consensus around the technical AI safety research domains identified as important. It will also serve as a foundational document to facilitate meaningful global conversations amongst AI policymakers, scientists, the broader industry and the public, improve understanding of risk management, and spur international research collaboration.
The Singapore Consensus is a living document and continues to welcome views from the global research community.
3.1.3 AGI and ASI control problem
A particularly challenging frontier in operational control involves developing techniques for controlling AI systems that are not only highly capable but may actively and strategically attempt to undermine control mechanisms. Unlike conventional methods, which offer system-agnostic approaches to monitoring and intervention, this section focuses on research into techniques for controlling systems that are potentially very powerful and may actively resist attempts to control them (Hubinger-A).
Scalable oversight: This research studies techniques that may allow less capable systems to oversee and control more capable ones. Common themes behind scalable oversight techniques include affording less-capable overseers more computational resources (Evans-B) or having multiple highly capable systems debate (Irving, Michael), though these might not scale to superhuman systems. Research frontiers include improving the reliability of oversight even when systematic errors are present, developing more effective AI debate protocols, monitoring systems for control-undermining behaviours, and using trusted weaker systems to monitor untrusted smarter systems (Greenblatt-A). Another promising approach to oversight research is to combine multiple control strategies and attempt to quantify their joint success probability. For example, researchers can work to quantify the reliability of nested scalable oversight regimes in which less capable systems oversee more capable ones (Engels).
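As a very rough illustration of the kind of quantification described above, the toy model below (our own sketch, not drawn from Engels or Greenblatt-A) treats each oversight layer as an independent monitor with an estimated probability of catching control-undermining behaviour, and computes the probability that a stack of layers catches it at least once. Real oversight layers share blind spots and face adaptive evasion, so this independence assumption is a starting point at best.

```python
# Toy model: joint success probability of stacked oversight layers.
# Assumes each layer independently flags control-undermining behaviour
# with probability p_i -- a strong simplification, since real layers
# are correlated and may be adversarially evaded.

def joint_catch_probability(layer_catch_probs: list[float]) -> float:
    """Probability that at least one oversight layer catches a bad action."""
    p_all_miss = 1.0
    for p in layer_catch_probs:
        p_all_miss *= (1.0 - p)
    return 1.0 - p_all_miss

# Hypothetical example: a weak trusted monitor, a debate protocol,
# and an occasional human spot check.
layers = [0.7, 0.8, 0.5]   # illustrative per-layer catch rates
print(f"Joint catch probability: {joint_catch_probability(layers):.3f}")
# -> 0.970 under the (unrealistic) independence assumption
```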
Corrigibility and agent foundations research: Corrigibility focuses on techniques for ensuring that powerful agentic systems will allow an authorised user to change their goals or shut them down (Soares), as well as on the extent to which recursively self-improving systems will tend to retain their goals. Corrigibility can be challenging to implement because it is difficult to theoretically ensure that an agentic system has no incentive to interfere with such interventions (Omohundro). It is also theoretically difficult to control and predict how goal-oriented agentic systems may pursue self-modification, and how this affects their goals (Everitt). This motivates theoretical and empirical work to study and control how highly agentic systems develop tendencies and behave in corrigibility and self-modification scenarios (Anthropic-F). Agent foundations research also explores how idealised AI agents would approach moral and game-theoretic reasoning (Hammond), including how AI systems’ behaviour can vary substantially depending on whether the system infers it is in a simulation versus in a real-world environment (a phenomenon already observed in current AI systems (Anthropic)).
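To make the incentive problem concrete, the toy calculation below (our own illustration, not taken from Soares or Omohundro) compares the expected reward a naive reward-maximising agent assigns to complying with a shutdown command versus disabling the shutdown mechanism. Because disabling the mechanism preserves future reward, the naive objective prefers interference, which is the basic obstacle corrigibility research tries to remove. All numbers are hypothetical.

```python
# Toy illustration of the shutdown-incentive problem.
# A naive expected-reward maximiser compares two policies for handling
# a possible shutdown command. All quantities are hypothetical.

P_SHUTDOWN_PRESSED = 0.1      # chance the operator wants to shut the agent down
REWARD_PER_EPISODE = 1.0      # reward the agent earns per completed episode
FUTURE_EPISODES = 100         # episodes remaining if the agent keeps running

def expected_reward(disables_button: bool) -> float:
    """Expected future reward for a naive reward maximiser."""
    reward_if_not_pressed = FUTURE_EPISODES * REWARD_PER_EPISODE
    if disables_button:
        # Agent keeps running regardless of the operator's intent.
        reward_if_pressed = FUTURE_EPISODES * REWARD_PER_EPISODE
    else:
        # Agent complies with shutdown and forgoes all future reward.
        reward_if_pressed = 0.0
    return (P_SHUTDOWN_PRESSED * reward_if_pressed
            + (1 - P_SHUTDOWN_PRESSED) * reward_if_not_pressed)

print("comply :", expected_reward(disables_button=False))   # 90.0
print("disable:", expected_reward(disables_button=True))    # 100.0
# The naive objective strictly prefers disabling the button, so corrigibility
# has to come from somewhere other than plain reward maximisation.
```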
Containment: As AI systems become increasingly agentic and able to access the internet, secure containment of risky systems is a growing challenge. Meanwhile, external attempts to gain unauthorised access to models also pose a security threat (Nevo). Frontiers for research on containment include developing techniques and protocols to keep high-security systems safe from both internal and external threats.
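As one deliberately minimal illustration of containment tooling, the sketch below (our own example, using only the Python standard library) runs untrusted, agent-generated code in a separate process with CPU and memory limits and no inherited environment. It does not provide network isolation or protection against a determined adversary; serious containment would layer sandboxes, egress filtering, and hardware-backed security on top of controls like these.

```python
# Minimal sketch: run untrusted, agent-generated code in a resource-limited
# subprocess. This shows only the *shape* of containment tooling; it does not
# block network access or defend against a capable adversary.
import resource
import subprocess
import sys

def _apply_limits() -> None:
    # Cap CPU time (seconds) and address space (bytes) for the child process.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024,) * 2)

def run_untrusted(code: str) -> subprocess.CompletedProcess:
    """Execute agent-proposed Python code with basic resource limits."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode
        preexec_fn=_apply_limits,            # POSIX only
        env={},                              # no inherited environment
        capture_output=True,
        text=True,
        timeout=10,                          # wall-clock backstop
    )

result = run_untrusted("print(sum(range(10)))")
print(result.stdout.strip())   # -> 45
```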
Non-agentic AI systems as guardrails: Agentic AI systems, especially future ones that broadly outperform humans, pose major additional risks compared to other AI systems (IAISR, Bengio-B). However, non-agentic AI systems could be used to monitor agentic systems’ proposed actions and intervene by blocking them if needed. These include:
- A “Scientist AI” (Bengio-B) could be used to estimate the probability that a candidate action from an agentic AI violates any plausible interpretation of safety specifications, and reject potentially unsafe actions (a minimal sketch of this filtering pattern follows this list). Certain approaches could guarantee that unsafe actions will be identified, with quantitative error bounds (Bengio-B).
- “AI control” setups that can effectively monitor and intervene even if the underlying untrusted AI system is actively attempting to subvert these safeguards (Greenblatt-A, Korbak, Griffin).
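The sketch below illustrates the guardrail pattern described above: a non-agentic risk estimator scores each action proposed by an untrusted agent with an estimated probability of violating a safety specification, and the wrapper blocks any action whose estimated risk exceeds a threshold. The estimator interface, executor, and threshold are our own illustrative assumptions, not a specification from Bengio-B, Greenblatt-A, Korbak or Griffin.

```python
# Minimal sketch of a non-agentic guardrail: a separate risk estimator scores
# each action proposed by an untrusted agent, and risky actions are blocked.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str   # natural-language description of the proposed action
    payload: dict      # parameters the agent wants to execute with

# Assumed interface: a *non-agentic* model that returns the estimated
# probability that the action violates the safety specification.
RiskEstimator = Callable[[ProposedAction], float]

def guarded_execute(action: ProposedAction,
                    estimate_violation_prob: RiskEstimator,
                    executor: Callable[[ProposedAction], None],
                    risk_threshold: float = 0.01) -> bool:
    """Execute the action only if its estimated violation probability is low."""
    p_violation = estimate_violation_prob(action)
    if p_violation > risk_threshold:
        # Block and escalate instead of executing.
        print(f"BLOCKED ({p_violation:.2%} risk): {action.description}")
        return False
    executor(action)
    return True

# Hypothetical usage with stub estimator and executor:
guarded_execute(
    ProposedAction("email the quarterly report to the finance team", {}),
    estimate_violation_prob=lambda a: 0.001,
    executor=lambda a: print(f"executed: {a.description}"),
)
```

The key design choice is that the estimator itself has no goals or actuators of its own; it only scores candidate actions, which keeps the monitoring component from inheriting the agentic risks it is meant to guard against.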
Human-centric oversight: Since humans should ultimately remain in control of general-purpose agents, there are natural limits on which systems could be controllable in principle, given the thinking speed, proactivity, expertise level, attention to detail, and reliability of human operators. Even with AIs assisting humans in understanding the context in question at each decision point or review