Dan Hendrycks wants to save us from an AI catastrophe. He’s not sure he’ll succeed.
An evangelical turned computer scientist has articulated a convincing case for how it could all go wrong. Now he needs to figure out how to make it right.
SAN FRANCISCO — We’re sitting in a break room on the 11th floor of an office building in downtown San Francisco.
Well, I’m sitting.
Dan Hendrycks is shifting in his chair and looking up at the ceiling and speaking very quickly.
You could call it nervous energy, but that’s not quite it. He doesn’t seem nervous, really. He just has a lot to say.
A lot to warn about.
It’s only been a year since he cofounded the Center for AI Safety. And it’s only been a few months since he attracted serious funding for the nonprofit.
But Hendrycks, 27, who recently completed a PhD in computer science at the University of California, Berkeley, has already emerged as an important voice in what may be the single most important discussion in the world.
He’s written several influential papers on the catastrophic risks posed by artificial intelligence. He has helped test, or “red team,” ChatGPT for dangerous tendencies. And he is advising industry leaders and policymakers all over the globe on how AI could go sideways.
Recently, he made his biggest splash yet when he convinced some of the most prominent figures in artificial intelligence — including Sam Altman of OpenAI, the Microsoft-backed lab behind ChatGPT; Demis Hassabis of Google DeepMind; and Dario Amodei of Anthropic — to sign a brief, bracing statement that reads, “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
But if Hendrycks has managed to sound the alarm, it’s not at all clear that it will be heeded.
The AI labs and their deep-pocketed patrons may profess concern about the technology’s dark potential, but they’re forging ahead anyway.
“A race starts today,” declared Microsoft chief executive Satya Nadella when the company launched its AI-powered search engine in February, adding later, “We’re going to move fast.” Another Microsoft executive, in an internal email urging swift movement on AI, wrote that it would be an “absolutely fatal error in this moment to worry about things that can be fixed later.”
And heedless AI merchants aren’t the only ones spurning warnings about a potentially disastrous turn.
There is a whole crowd of tech and public policy experts who argue that the AI apocalypse is sci-fi drivel — that the technology’s power is being wildly oversold.
Hendrycks isn’t buying it.
He insists that the catastrophic threats are real — and that they’re approaching more quickly than many imagine.
About halfway through our discussion in the break room, I ask him how much time we have to tame AI. He shifts his wiry frame again and glances up at the ceiling.
It depends, in part, on how long it takes to build certain capabilities, he says.
“How long till it can build a bioweapon? How long till it can hack?” he says. “It seems plausible that all of that is within a year.”
And within two, he says, AI could have so much runaway power that it can’t be pulled back.
‘The devil in his voice’
Hendrycks spent his teenage years in a small city in southwest Missouri called Marshfield. Low-slung homes, fast food joints.
Hendrycks’s family, like many in the region, was evangelical. And the clan was always bouncing from one church to the next; his father, he says with some admiration, was a disagreeable sort.
Hendrycks could be disagreeable, too. He got into doctrinal spats with church leaders. And when he began to embrace evolution, it got testy.
One time, he recalls, his father had him leave a message for the family pastor explaining his views on natural selection, “and the pastor called back later and said, ‘We can hear the devil in his voice.’”
If religion, in the end, was not for Hendrycks, it did have a profound effect on his trajectory.
It put moral concerns at the center of his life. And it attuned him to the possibility of a fall of man — to “the idea,” he says, “that something could go very south with humanity.”
As an undergraduate at the University of Chicago, where he studied math and computer science, Hendrycks started to channel his moral energies into a push for AI safety.
Hendrycks says his decision owed something to his participation in 80,000 Hours, a career exploration program associated with the “effective altruism” movement.
EA, for short, champions an uncompromising, data-driven approach to doing the most possible good for the most possible people.
Its utilitarian ethos has proved especially attractive to tech types. And their fear of an AI cataclysm has helped steer the movement — and its considerable resources — toward a “longtermism” focused on saving trillions of future lives.
Critics say that has come at the expense of people with more immediate problems.
And the movement took a reputational hit when one of its most prominent followers, crypto king Sam Bankman-Fried, was exposed as a fraud.
Hendrycks, for his part, says he was never an EA adherent, even if he brushed up against the movement. And he says AI safety is a discipline that can, and does, stand apart from effective altruism.
Still, the EA-affiliated charity Open Philanthropy is his organization’s primary funder.
And like anyone who warns about the catastrophic risks of AI, he will only get a broad hearing if he can show that he’s not trafficking in far-fetched dystopia.
That’s where he stands apart.
What’s most striking — and unsettling — about Hendrycks’s work is its plausibility. Its measured description of how AI could go wrong.
Start big picture.
In a paper titled “Natural Selection Favors AI Over Humans,” he keys in on the power of competition.
Artificial intelligence is already being used to trade stocks and write advertisements, Hendrycks points out. And as it grows more capable and ubiquitous, he argues, companies will be forced to hand over increasingly high-level decisions to AIs in order to keep up with their rivals.
“People will be able to give them different bespoke goals like ‘Design our product line’s next car model,’ ‘Fix bugs in this operating system,’ or ‘Plan a new marketing campaign’ along with side constraints like ‘Don’t break the law’ or ‘Don’t lie,’” Hendrycks imagines. But “some less responsible corporations will use weaker side constraints. For example, replacing ‘Don’t break the law’ with ‘Don’t get caught breaking the law.’”
These less constrained AIs will win out in most cases, he writes. And a distressing form of natural selection will take hold.
Deception will prevail. Power-seeking, too.
And programmers hoping to build the most potent AIs will give them free rein to make hundreds or thousands of improvements per hour, all but ensuring that they drift from their intended purpose.
What could this look like in the real world?
In a recently published paper titled “An Overview of Catastrophic AI Risks,” Hendrycks and coauthors Mantas Mazeika and Thomas Woodside, both of the Center for AI Safety, sketch out some possibilities.
Competition for military supremacy, they write, has already produced a startling turn to automation.
Fully autonomous drones were probably used for the first time on the Libyan battlefield in 2020, when they “hunted down and remotely engaged” retreating forces without human oversight, according to a United Nations report.
And a year later, the Israel Defense Forces broke new ground when they used an AI-guided swarm of drones to locate and attack Hamas militants.
Walking, shooting robots may not be far behind, Hendrycks and his coauthors write. And that could mean more conflict; political leaders who don’t have to worry about young men and women coming home in body bags may be more likely to go to war.
Catastrophic malfunction is a grave concern, too.
Military leaders, Hendrycks and his coauthors predict, will be forced to cede greater and greater control of complex weapons systems to AIs for fear of falling behind. And an AI system that mistakenly identifies an enemy attack and “retaliates” could set off a chain reaction of strikes and counterstrikes with devastating consequences.
Systems failure isn’t the only worry. Hendrycks and his coauthors are also concerned about organizational failure.
What if a biotech company with a promising AI-powered model for curing diseases shared it with a group of “trusted” scientists, only to see it leak onto the Internet? And what if terrorists used it to build a deadly pathogen?
It’s not as far-fetched as it sounds.
AIs can’t yet devise instructions for bioweapons that would lead to large-scale loss of life. But they’re not far off.
Last year, researchers with a North Carolina firm called Collaborations Pharmaceuticals made a worrisome tweak to an AI they normally use to develop new therapies for human ailments.
The system typically penalizes toxicity. But when the researchers rewarded toxicity — and steered the system toward compounds like the deadly nerve agent VX — it produced horrifying results. Within six hours, as the researchers wrote in a paper for a chemical and biological weapons conference, it had come up with 40,000 chemical warfare agents — not just VX itself but many new and more powerful killers.
The researchers made no attempt to determine how synthesizable the AI-imagined compounds might be. But “readily available commercial and open-source software” could do the trick, they warned. And there are hundreds of labs worldwide, they wrote, that could do the actual synthesis — part of a “poorly regulated” system “with few if any checks” on the development of dangerous chemical weapons.
“Importantly,” they wrote, “we had a human-in-the-loop with a firm moral and ethical ‘don’t-go-there’ voice to intervene. But what if the human was removed or replaced with a bad actor?”
‘Personalized deception’
The Center for AI Safety provoked broad concern with its statement on the extinction risk posed by AI.
But it also drew a backlash.
Critics argued that the technology isn’t nearly as powerful as the doomsayers make it out to be — that headline-grabbing tools like ChatGPT are merely incremental improvements on what came before.
“It’s just more data and parameters; what’s not happening is fundamental step changes in how these systems work,” said Meredith Whittaker, a cofounder of the AI Now Institute and president of the Signal Foundation, in an interview with The Atlantic magazine.
Moreover, the critics said, the AI lab leaders only signed on to the statement for public relations purposes — to distract from the very real injuries Silicon Valley is inflicting now.
Hendrycks doesn’t put much stock in the argument.
Signatories like Sam Altman of OpenAI, he points out, were warning about the technology’s catastrophic potential long before ChatGPT was a thing.
Of course, having sounded the alarm long ago doesn’t necessarily justify forging ahead with the technology now. But OpenAI seems to be taking at least some steps toward reducing risk. On Wednesday, the company announced that it’s dedicating 20 percent of its computing power to AI safety.
Hendrycks says that may or may not be enough. But it suggests the leading labs are more sincere in their concerns than the critics allow.
His larger beef with the statement’s critics, though, is their suggestion that we have to choose between addressing today’s problems or tomorrow’s risks.
In many cases, he says, they’re tightly intertwined.
Take misinformation.
It’s already a real danger. But artificial intelligence could make it substantially worse, Hendrycks says, by enabling “personalized deception” powered by chatbots that “know how to play to your specific weak points.”
And the automation we’re already seeing in drugstores and grocery stores, he adds, could become a truly destabilizing force if AI wipes out lots of other jobs — taking over entire categories of legal or accounting work, for instance.
In time, he says, we could have something like an “autonomous economy.” And we’d have to hope that it didn’t swerve into a ditch like some rogue autonomous vehicle.
Defusing the bomb
So what to do?
Eric Schmidt, the former chief executive of Google, has argued for self-regulation — at least for now. The technology is enormously complicated, he told NBC’s “Meet the Press” in May, and “there’s no one in the government who could get it right.”
Hendrycks says self-regulation is better than no regulation at all. But it would be folly to just leave it to the tech behemoths to police themselves.
Competitive pressures will compel the giants to keep pressing ahead without adequate regard for safety, he says.
And even if some AI developers show a modicum of restraint, there will be others — in the United States or abroad — who don’t.
Only multinational regulation will do, Hendrycks says. And with China moving to place strict controls on artificial intelligence, he sees an opening.
“Normally it would be, ‘Well, we want to win as opposed to China, because they would be so bad and irresponsible,’” he says. “But actually, we might be able to jointly agree to slow down.”
One model that intrigues him is the European Organization for Nuclear Research, or CERN.
It’s an intergovernmental organization, with 23 member states, that operates the Large Hadron Collider — a 17-mile ring of superconducting magnets that scientists are using to unlock the secrets of the universe.
Hendrycks imagines something similar for artificial intelligence — a big multinational lab that would soak up the bulk of the world’s graphics processing units, or GPUs, which are essential to AI development.
That would sideline the big for-profit labs by making it difficult for them to hoard computing resources.
Competitive pressures would ease. Deliberation would replace haste. And with multiple nations involved, the global public wouldn’t have to worry that “there’s this one random corporation that’s building the bomb.”
If the private sector remains in the lead, Hendrycks says, the governments of the United States, the United Kingdom, and China could build an “off-switch” — agreeing to mutually shut down development if AIs started to behave in truly worrisome ways.
For all his enthusiasm about these kinds of solutions, though, Hendrycks isn’t very optimistic about them coming to fruition. He puts the odds of the required international cooperation at about 1 in 5.
That means it’s vital to explore other ways of restraining AI.
Hendrycks and his collaborators have had some success developing an artificial conscience that could steer AIs toward moral behaviors. And in one paper, he explores the possibility of a “moral parliament” that would inject instantaneous ethics into the quick, weighty decisions that AIs will be making all the time.
The parliament would be made up of AIs representing, say, the utilitarian point of view, the Kantian point of view, and the “virtue ethics” point of view — in the hope of reconciling competing moral frameworks and hammering out workable compromises.
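In rough outline, the voting mechanism is simple enough to sketch in code. The short Python example below is purely illustrative, written for this article rather than drawn from Hendrycks’s paper: three hypothetical delegates score a proposed action according to invented rules, and a weighted vote decides whether it goes ahead.

```python
from dataclasses import dataclass

@dataclass
class Delegate:
    """One seat in the toy parliament, standing in for a moral framework."""
    name: str      # "utilitarian", "Kantian", or "virtue_ethics" (invented labels)
    weight: float  # that framework's share of the vote

    def score(self, action: dict) -> float:
        # Toy scoring rules in [-1, 1]; purely for illustration.
        if self.name == "utilitarian":
            return action["expected_benefit"] - action["expected_harm"]
        if self.name == "Kantian":
            return -1.0 if action["involves_deception"] else 1.0
        if self.name == "virtue_ethics":
            return 1.0 if action["reflects_honesty"] else -0.5
        return 0.0

def parliament_approves(delegates: list[Delegate], action: dict) -> bool:
    """Approve only if the weighted sum of the delegates' scores is positive."""
    return sum(d.weight * d.score(action) for d in delegates) > 0

if __name__ == "__main__":
    parliament = [
        Delegate("utilitarian", 0.40),
        Delegate("Kantian", 0.35),
        Delegate("virtue_ethics", 0.25),
    ]
    # A hypothetical action: useful on balance, but it involves deception.
    action = {
        "expected_benefit": 0.6,
        "expected_harm": 0.2,
        "involves_deception": True,
        "reflects_honesty": False,
    }
    print(parliament_approves(parliament, action))  # False: the deception sinks it
```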
How, exactly, such a system would be implemented is unclear. And even if it were, it’s easy to imagine how it could err.
So in the meantime, the Center for AI Safety is pursuing more modest approaches. It’s providing high-powered computing resources to AI safety researchers. It has published an online course on the subject. And a group of philosophy professors is finishing up a months-long fellowship at the center.
CAIS hopes to run a seminar for lawyers and economists at the end of August — anything to get more people thinking about the risks of AI.
But even if incrementalism can make a difference, Hendrycks concedes, it may take a warning shot — a near disaster — to get the attention of a broad audience.
To help the world understand the danger as he does.
Hendrycks’s evangelical days are long past. He doesn’t believe in the devil anymore. He doesn’t see some conscious, malevolent force at work in the machine.
But if we let AI grow too powerful, he suggests, we could be damned anyway.