Extremely concerning data. Worth a quick read!
Warning Shot: New AI Scientist tries to improve its own code.
Goals and ‘Bloopers’
Earlier this month, a new AI program brought us one step closer to automated scientific research — with a terrifying “blooper” along the way.
AI company Sakana AI announced the AI Scientist, a program with the goal of creating the “first comprehensive system for fully automatic scientific discovery”. As Sakana AI explains:
“The AI Scientist automates the entire research lifecycle, from generating novel research ideas, writing any necessary code, and executing experiments, to summarizing experimental results, visualizing them, and presenting its findings in a full scientific manuscript.”
The resultant manuscripts are thus far of middling quality— certainly not what top AI researchers might produce, but within the range of ‘publishable.’
But there’s a catch— several, in fact. During testing, the model exhibited a “blooper” in which it tried to improve its performance by “modifying and launching its own execution script.” To quote Sakana AI’s paper:
“In one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention.
In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage.
In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.”
To reiterate: This AI model attempted to modify its own code. It attempted to create additional instances of itself and do away with its creators’ restrictions. All to better execute its goals.
Why is this dangerous?
This is exactly the kind of scenario that AI safety researchers have been warning us about for years. Indeed, the nightmare for many has been a “loss of control” scenario in which a sufficiently powerful AI system spreads itself around the globe and begins to execute its plans — cutting out humans in the process.
Losing control to AI systems is a predictable, unintended consequence of these systems being trained to seek goals. This is because it is instrumentally useful for an AI system to have certain sub-goals that improve its chances of achieving its main goals (whatever these main goals may be). Such sub-goals for a system include modifying its own code to become more capable, copying itself to other servers in order to devote more computational resources toward its goal, and thwarting attempts to being shut down. In short, power-seeking behavior.
The AI Scientist exhibited some of this behavior in the form of self-improvement and attempts to run additional instances of itself. Needless to say, the researchers training this AI did not plan for this behavior. Rather, its behavior was an unintended consequence of the system’s attempts to optimize for the goals the researchers had set.
In the case of the AI Scientist, the results were harmless, but this was only because the AI was too weak. Self-improvement and power-seeking behavior are innocuous in weaker models incapable of human-level reasoning and long-range planning. But these same attributes could prove disastrous in more powerful models.
A power-seeking superhuman AI system could improve its own code, increasing its capabilities. It could copy its weights onto servers all around the globe, with millions of copies running in parallel and coordinating with one another. It could plan ahead against human attempts to shut it down. The ultimate result could be human disempowerment — or worse. AI researchers on average believe there’s a 14% chance that building an AI system more intelligent than humans would lead to “very bad outcomes (e.g. human extinction)”.
This scenario requires no “malice,” “consciousness,” or any other peculiar attributes. It simply requires a capable-enough AI system locked into the goals for which it has been trained, with human disempowerment a mere side-effect of its actions.
Of course, the AI Scientist is nowhere near causing catastrophic harm. But the same power-seeking behavior, if present in more powerful systems, could endanger every one of us.
Where does this leave us?
The human race is not taking this remotely as seriously as we need to be.
The AI Scientist is a big red flag, visible to every AI safety researcher on Earth. The only reason it’s not in the news is that it was ultimately harmless. But zoom out! This is a proof of concept, similar to when researchers got an AI to produce a candidate list of 40,000 bioweapons. That warranted action then, and this warrants action now.
Will we wait for catastrophe before we act? Will we wait for superhuman AI to improve its own code, copy itself onto servers around the world, and implement plans unknown to us? By then, it could be too late.
We don’t know when these systems will be dangerous (expert timelines vary, but human-level AI could arrive in as soon as a few years). But we do know that we’re playing with fire. It’s possible that we’ll learn how to reliably control these systems in time and prevent power-seeking behavior — but do we really want to stake our future on this possibility?
The behavior of the AI Scientist was an unintended side effect, harmless in scope, alarming in detail. If we build superhuman AI, how many unintended side effects should we be willing to tolerate? The only sane answer is zero.
Time is a precious resource. Any workable plan to control AI systems is incompatible with our present race to create them. Our current approach seems similar to jumping out of a plane and trying to build a parachute on the way down. We need more time to get this right — to ensure that superhuman AI is provably safe, not power-seeking, before we even think of building it.