Provably Safe Systems: The Only Path to Controllable AGI

Max Tegmark and Steve Omohundro. 06 SEPT 2023

ABSTRACT

We describe a path to humanity safely thriving with powerful Artificial General Intelligences (AGIs) by building them to provably satisfy human-specified requirements. We argue that this will soon be technically feasible using advanced AI for formal verification and mechanistic interpretability. We further argue that it is the only path which guarantees safe controlled AGI. We end with a list of challenge problems whose solution would contribute to this positive outcome and invite readers to join in this work.

1 Introduction

“Once the machine thinking method had started, it would not take long to outstrip our feeble powers. At some stage therefore we should have to expect the machines to take control.”

Alan Turing 1951

AGI [91] safety is of the utmost urgency, since corporations and research labs are racing to build AGI despite prominent AI researchers and business leaders warning that it may lead to human extinction [11]. While governments are drafting AI regulations, there is little indication that these will be strong enough to resist competitive pressures and prevent the creation of AGI. On the forecasting platform Metaculus, median estimates of the date of AGI’s creation have plummeted over the past few years from many decades away to 2027 [25] or 2032 [24], depending on the definition, with superintelligence expected to follow a few years later [23].

Is Alan Turing correct that we now “have to expect the machines to take control”? If AI safety research remains at current paltry levels, this seems likely. Considering the stakes, the AI safety effort is absurdly small in terms of both funding and the number of people. One analysis [73] estimates that less than $150 million will be spent on AI Safety research this year, while, for example, $63 billion will be spent on cosmetic surgery [14] and $1 trillion on cigarettes [13]. Another analyst estimates [10] that only about one in a thousand AI researchers works on safety.

Much of the current AI safety work is focused on “alignment”, which attempts to fine-tune deep neural networks so that their behavior becomes more aligned with human preferences. While this is valuable, we believe it is inadequate for human safety, especially given the profusion of open-source AI that can be used maliciously. In the face of possible human extinction, we must adopt a “security mindset” [30] and rapidly work to create designs that remain safe even against adversarial AGIs. With a security mindset, we must design safety both into AGIs and into the physical, digital, and social infrastructure that they interact with [5]. AGI computations are dangerous to us only when they lead to harmful actions in the world.

In current approaches, many researchers seem resigned to treating AGIs as inscrutable black boxes. Indeed, some younger researchers seem to subconsciously equate AI with neural networks, or even with transformers. This manifesto argues that such resignation is too pessimistic. Mechanistic interpretability is opening up the black box, and once opened, its contents can be distilled into formal representations that can be reasoned about.
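
As a toy illustration of what such a distilled representation might look like, consider the following Lean 4 sketch. The circuit, the unit names, and the specification are hypothetical examples of ours, not results from any actual interpretability study; the point is only that once a mechanism has been extracted from a network, it can be stated formally and verified exhaustively by a proof assistant.

    -- Toy example: suppose interpretability analysis of a tiny trained network
    -- suggests that two hidden units jointly implement the circuit below.
    def hidden1 (a b : Bool) : Bool := a && !b
    def hidden2 (a b : Bool) : Bool := !a && b
    def recoveredCircuit (a b : Bool) : Bool := hidden1 a b || hidden2 a b

    -- The distilled formal claim: the recovered circuit computes exclusive-or.
    -- The proof checks all four input combinations by case analysis.
    theorem recovered_is_xor (a b : Bool) : recoveredCircuit a b = Bool.xor a b := by
      cases a <;> cases b <;> rfl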

We argue that mathematical proof is humanity’s most powerful tool for controlling AGIs. Regardless of how intelligent a system becomes, it cannot prove a mathematical falsehood or do what is provably impossible. Moreover, mathematical proofs are cheap to check on inexpensive, extremely reliable hardware. The behavior of physical, digital, and social systems can be precisely modeled as formal systems, and “guardrails” can be precisely defined to constrain which actions may occur. Even inscrutable AI systems can be required to provide safety proofs for their recommended actions, and these proofs can be validated independently of the alignment status of the AGIs which generated them.
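
The following minimal Lean 4 sketch illustrates the idea; the action space, the safety predicate, and the limit of 1000 are hypothetical placeholders of ours, not a proposal from this paper. A guardrail is written as a machine-checkable property, and an untrusted system must submit each proposed action together with a proof that the property holds. The small trusted proof checker then validates the proof without needing to trust or understand whatever produced it.

    -- Illustrative action space and a human-specified guardrail (hypothetical).
    inductive Action where
      | sendMessage   (length : Nat)
      | transferFunds (amount : Nat)

    -- Guardrail: fund transfers must not exceed a human-set limit.
    def safe : Action → Bool
      | .sendMessage _        => true
      | .transferFunds amount => decide (amount ≤ 1000)

    -- A proof-carrying action: an action packaged with a proof that it is safe.
    abbrev SafeAction := { a : Action // safe a = true }

    -- An untrusted system proposes both the action and the proof; the Lean kernel
    -- checks the proof regardless of who or what generated it.
    def proposal : SafeAction := ⟨.transferFunds 500, by decide⟩

    -- An action violating the guardrail admits no such proof, so it is rejected:
    -- def rejected : SafeAction := ⟨.transferFunds 5000, by decide⟩  -- fails to check

The checking step is cheap and independent of the prover; real guardrails would of course model far richer world states and requirements than this toy action type.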

There is a critical distinction between an AGI world which merely turns out to be safe and an AGI world in which humanity has extremely high confidence of safety. If a mathematical proof of safety doesn’t exist, then it is very likely that malicious AGIs will find the vulnerabilities and exploit them [37]. If a proof of safety exists and is made explicit and machine-checkable, then humanity can trust protected machines to design, implement, and execute safe actions. The design and creation of provably safe AGI and infrastructure is extremely valuable and can become increasingly routine to implement, thanks to AI-powered advances in automated theorem proving and mechanistic interpretability.