FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

“So if they don’t end up with the goals and values that you wanted them to have, then the question is: what goals and values do they end up with? And of course we don’t have a good answer to that question. Nobody does. This is a bleeding-edge new field that is much more like alchemy than science, basically.”

In 2024, researcher Daniel Kokotajlo left OpenAI, putting millions of dollars in equity at risk, to warn the world about the dangerous direction of AI development. Now he’s out with AI 2027, a forecast of where that direction might take us in the very near future. AI 2027 predicts a world where humanity loses control of its destiny at the hands of misaligned, superintelligent AI systems within just the next few years. That may sound like science fiction, but when you’re living on the upward slope of an exponential curve, science fiction can quickly become all too real. And you don’t have to agree with Daniel’s specific forecast to recognize that the incentives around AI could take us to a very bad place. We invited Daniel on the show this week to discuss those incentives, how they shape the outcomes he predicts in AI 2027, and what concrete steps we can take today to help prevent those outcomes.

“Funnily enough, science fiction was often overoptimistic about the technical situation. In a lot of science fiction, humans are directly programming goals into AIs, and then chaos ensues when the humans didn’t notice some of the unintended consequences of those goals. For example, they program HAL with something like ‘ensure mission success,’ and then HAL thinks, ‘I have to kill these people in order to ensure mission success,’ right? The situation in the real world is actually worse than that, because we don’t program anything into the AIs. They’re giant neural nets. There is no goal slot inside them that we can access and look at to see what their goal is. Instead, they’re just a big bag of artificial neurons. What we do is put that bag through training environments, and the training environments automatically update the weights of the neurons in ways that make them more likely to get high scores in the training environments. And then we hope that, as a result of all of this, the goals and values that we wanted will grow on the inside of the AIs and cause the AIs to have the virtues that we want them to have, such as honesty. But needless to say, this is a very unreliable and imperfect method of getting goals and values into an AI system. And empirically, it’s not working that well. The AIs are often saying things that are not just false, but that they know are false and that they know are not what they’re supposed to say.” — Daniel Kokotajlo (22:53)
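Kokotajlo’s point about there being no “goal slot” can be made concrete with a toy sketch. The snippet below (Python with PyTorch, invented purely for illustration; the four-armed bandit “environment,” the reward numbers, and the tiny policy are all assumptions, not anything an AI lab actually trains) runs a bare-bones REINFORCE loop: the parameters get nudged toward whatever scored well, and at no point is a goal written into, or readable out of, the model.

# Toy illustration (hypothetical, not any lab's actual setup): a REINFORCE
# training loop on a 4-armed bandit. Nothing below ever states a goal; the
# parameters are simply pushed toward higher scores.
import torch

torch.manual_seed(0)

# The "training environment": a hidden scoring rule the model never sees
# directly. It only observes the rewards this rule emits.
TRUE_REWARDS = torch.tensor([0.1, 0.9, 0.3, 0.5])

# The "AI": a bag of trainable numbers. There is no field called "goal".
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.05)
baseline = 0.0  # running average of reward, to stabilize updates

for step in range(2000):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, num_samples=1).item()       # pick an arm
    reward = float(TRUE_REWARDS[action] + 0.1 * torch.randn(()))  # noisy score

    # REINFORCE update: raise the log-probability of actions that scored
    # better than average, lower it for actions that scored worse.
    advantage = reward - baseline
    baseline = 0.99 * baseline + 0.01 * reward
    loss = -torch.log(probs[action]) * advantage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy ends up favoring the highest-scoring arm, but inspecting
# `logits` reveals nothing about *why*: whatever it now "wants" is implicit in
# weights shaped by the reward signal, not stored in any goal slot.
print(torch.softmax(logits, dim=0))

The same dynamic, scaled up from four numbers to hundreds of billions of weights and from a bandit to rich training environments, is what the quote describes: the system ends up with whatever dispositions the scores happened to reinforce, and we hope those include the virtues we wanted.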
