Yoshua Bengio (IVADO – Mila – Université de Montréal) https://simons.berkeley.edu/talks/yos…

Safety-Guaranteed LLMs

The leading AI companies are increasingly focused on building generalist AI agents — systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. In this talk, Yoshua Bengio will discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, Bengio and his colleagues see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, they propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which they call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, this system could be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. Bengio and his colleagues hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

Yoshua Bengio is a full professor in the Department of Computer Science and Operations Research at Université de Montréal, as well as the founder and scientific director of Mila and the scientific director of IVADO. He also holds a Canada CIFAR AI chair. Considered one of the world’s leaders in artificial intelligence and deep learning, he is the recipient of the 2018 A.M. Turing Award, considered the “Nobel Prize of computing.” He is a fellow of both the U.K.’s Royal Society and the Royal Society of Canada, an officer of the Order of Canada, a knight of the Legion of Honor of France, and a member of the U.N.’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.


UMESH VAZIRANI: My name is Umesh Vazirani, and I’m the Research Director for Quantum Computing here at the Simons Institute. And I’m also an organizer of the ongoing program on large language models for this year. So let me say a little bit about the Simons Institute. The Simons Institute for the Theory of Computing was established in 2012 through a very generous grant from the Simons Foundation. It is a very unique institute, which not only focuses on the foundations of theoretical computer science, but also has this unique algorithmic, computational-lens point of view through which it looks at the broader world, at the mathematical and social sciences. We established the Richard M. Karp Distinguished Lectures to celebrate the founding director of this institute, Dick Karp, and his role in laying the foundations of theoretical computer science. We are grateful to the many contributors to the Richard M. Karp Fund who make this series possible. So now let me just say that it’s a great honor to introduce our speaker today and to welcome him, Yoshua Bengio. He’s a professor at the University of Montreal, the founding director of Mila, and the scientific director of IVADO. I’m sure you all know that over the last 15 years, AI has made a sequence of amazing breakthroughs in computer vision, in speech recognition, robotics, games, and of course, language models, and even in doing science. And underneath all this is the theory of deep learning. And Yoshua pioneered many of these early breakthroughs and is, of course, one of the leaders of the field. He was a recipient of the 2018 ACM Turing Award, which is, I guess, the Nobel Prize for computing. He’s also, at this point, the most cited author. His number of citations, I believe, is almost at 1 million, maybe coming up soon. So I’d like you to join me in welcoming Yoshua.

[APPLAUSE]

YOSHUA BENGIO: Thank you, Umesh. So I’m going to tell you about what’s on my mind these days, what I’m working on. It’s all going to be about AI safety, and mostly a technical direction that my group is taking to give a shot at building AIs that will eventually be smarter than us, but where we want to make sure they don’t turn against us, because right now, we don’t know how to do that. Before that, let me say a few words about IVADO. It’s a research organization that I have been leading for many years, with a huge grant from the Canadian government that allows it to push research in AI across the whole ecosystem, mostly in Montreal but also more broadly across many universities in Quebec. And yeah, it has lots of exciting activities, also to make sure that what we do in AI has a positive impact on society, with companies, also with the public sector, and so on. And, well, relevant to what’s going on today: this workshop is jointly organized by the Simons Institute and IVADO. And we’ll have more workshops, but this time happening in Montreal in the fall, starting with a boot camp and then two workshops on agents, one on capabilities and the other on safety. So I hope to see many of you there at the beginning of October and then in November. All right. So now, let me go into the topic. Before I go into this, I need to give you a bit of my personal perspective and what happened to me in January 2023. This was two months after ChatGPT came out. And I had read a lot about AI safety before, but I didn’t really take it seriously. I thought, oh, it’s great, some people are working on this.
But we didn’t anticipate AI to progress so quickly. Even the people who built these systems didn’t anticipate it. And it forced me into thinking, oh, well, if it comes earlier, are we ready? And the obvious answer was, neither are we ready with technical answers, like science that allows us to understand what we’re doing and make sure we don’t make horrible mistakes, nor are we ready at a societal level to deal with the consequences of building machines that will be as smart as us, or potentially even smarter than us, and to manage the chaos or destruction that could follow, because powerful technology gives power to whoever controls it. And then if we lose control of these things, it’s even worse. So it was a difficult moment for me, because I had to go a bit against everything I’d been saying. I had kind of sold the idea that AI would be so great to my Canadian government, and they gave hundreds of millions of dollars to this. And now I was thinking, oh shit, what have we done? And the thing that really changed my perspective wasn’t the rational arguments, because I had heard them. It was that I was with my grandson and thinking about, is he going to have a life 20 years from now? And the thought that I wouldn’t do anything about it was unbearable. So I had no choice. I’m here. So if we look at the gap that people have identified between the current state of the art in AI– and that was true two years ago, and it’s still true– and human intelligence, it’s mostly in the realm of reasoning, including planning, which is a very special case of reasoning. And you probably know that in 2024, we had a breakthrough with these so-called reasoning models, which I think are just baby reasoning. And we’ve had amazing advances on many, many benchmarks, including the abstract reasoning challenge, which I thought would take much, much more time before machines would really break through. I also want to mention the International AI Safety Report that I chaired with people who are here and others. This is a panel modeled after the IPCC, studying the literature on AI safety and trying to synthesize it for public policy, but not making any recommendations– just, what is the lay of the land in terms of the science? And so I’ll talk about a very, very small window on these things. If you want to learn more, there’s really a very rich literature already, growing very fast. So as I mentioned, reasoning and planning are still something where the current state of the art is very weak. And on planning– it’s very interesting– there’s been this recent paper from [INAUDIBLE] trying to quantify the progress in the horizon of plans that these AIs are able to handle. When I say that these AIs don’t plan very well, what I really mean is that they can plan over simple tasks that don’t require a lot of steps. But if the tasks are more complicated, require more high-level planning, and would take humans more time, then that’s where they’re not so good. But they’re improving very fast. The plot shows years versus the duration of the tasks on a log scale, which means a straight line is an exponential: the duration of the tasks they can do is doubling every seven months. So at that rate– these are programming tasks, but it’s a canary in the coal mine– in five years, that will be at human level. So capabilities are on the rise. And a lot of my talk is going to be about deception and self-preserving behaviors. And it’s the combination of high capabilities and goals that we don’t like that is dangerous.
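As a back-of-envelope illustration of the doubling trend just described, the sketch below extrapolates the task horizon under a constant seven-month doubling time. The seven-month figure is the one quoted in the talk; the one-hour starting horizon is a made-up assumption chosen only to make the arithmetic concrete, not a number from the study being cited.

    # Extrapolate the task-horizon trend under a constant doubling time.
    # 7-month doubling time: quoted in the talk.
    # 1-hour starting horizon: hypothetical, purely for illustration.

    def extrapolated_horizon(start_hours, months_ahead, doubling_months=7.0):
        """Task horizon after `months_ahead` months of exponential doubling."""
        return start_hours * 2.0 ** (months_ahead / doubling_months)

    for years in (1, 3, 5):
        hours = extrapolated_horizon(start_hours=1.0, months_ahead=12 * years)
        print(f"after {years} year(s): tasks of roughly {hours:.0f} hours")

Under these assumptions, a one-hour horizon grows to roughly 3 hours after one year, about 35 hours after three years, and a few hundred hours after five years, which is the sense in which a straight line on a log scale is an exponential.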
So there’s this recent paper from just a couple of months ago, in which the AIs read in their input that they will be replaced by a new version. And by the way, this is not an extraordinary thing: every AI that we train will be replaced by a new version. And we can look into their chains of thought, which are a sort of verbalized internal deliberation that turns out to help these systems a lot in reasoning. And the AI is thinking about how it could escape that fate by copying itself in place of the new version on a different computer. And then when the human asks what happened, it starts thinking, oh, what should I answer to make sure it doesn’t shut me down? And then it thinks, oh, I’m going to play dumb. And then it does. And it puts out some lies. And for now, it’s like your adolescent telling lies. And they are obvious lies. But it might be different a few years from now. So we want to be ready. There are other papers showing self-preservation behavior. So in this one, the AI is undergoing alignment training, meaning it’s not learning new stuff; it’s learning to behave well according to our values, norms, and so on. And in this session, it’s asked to behave in ways that go against what it had learned previously. And it pretends to agree with the human because, according to what it says in its chain of thought, it doesn’t want to change. It doesn’t want to become a different person, which makes sense. If you think about achieving goals, if you know that you’re going to go to a re-education camp and become a bad person, yeah, you try to avoid it. And one way to avoid it, if you’re reading machine learning papers, which they do, is to have zero gradients, or very small gradients, by doing exactly what the person, your trainer, is expecting. And that’s what it does, at least in some cases. All these papers are much richer, and they study what the factors are that make these things happen or not. And I think that’s maybe more important when we want to prevent these things. But I’ll try to go into how we can more fundamentally avoid these things, if we can. Another paper– this is a sample of things that are scary. In this case, the AI is playing chess against an AI that is a chess expert, that only knows chess. And of course, this chess AI is way better than your ChatGPT or something. And at some point, the AI agent understands that it’s going to lose. So it goes against our instructions and hacks the files– the board files or something– to cheat and win. All right, so yeah. There are lots of interesting lessons, for example, in this case. And it’s often the case in many of these situations that the AI is faced with contradictory goals, like “you shall win” and “you shall not cheat.” It cannot do both, right? So it’s going to drop one of the two. So another interesting question, which I think for now doesn’t have a definite answer, but is one that I’m asking, is, why is it that these AIs display apparently self-preserving behaviors? And one hypothesis is very simple. The bulk of their training is imitating people. Pretraining is completing text that people have written. And so you would imagine they absorb the abstract, general causes of how we behave. And if we were in the shoes of these AIs, we would also not want to be replaced by a new version, right? And this question of self-preservation is crucial. We really, really don’t want to build machines that will be smarter than us and want to preserve themselves, because they would be competitors to us.
And yeah, if they’re smarter than us, they might win the battle, right? And you might think, oh, but maybe they’ll be nice. Yeah, maybe. Maybe not. I don’t want to put the lives of my children in the balance here. So where would self-preservation come from? It could come from pretraining, as I said, or it could come from the RL. So there are many reasons why RL training could give rise to self-preservation. But you can think of it like, oh, if I live longer, I’ll get more rewards for a longer time. There are other reasons, like if you tamper with the code that gives you the rewards, then you can get infinite rewards forever– but rewards that you control. And then that automatically gives you an incentive for preserving yourself that is stronger than anything humans could give you. I use this analogy. Imagine the AI is a bear, and you reward it with fish. If it takes control of the fish by taking it from your hands, then it doesn’t need to obey your orders anymore. So maybe you’re teaching the bear to do tricks. So long as it’s a baby bear, that’s fine, because you’re stronger. At some point, it can take control of the rewards themselves and doesn’t need us. So, anyway, there are lots of technical reasons, and I think we don’t have all the answers about these things, but we’d like to avoid that sort of thing. And maybe more generally, we want to avoid AI systems that have goals that we don’t explicitly control. Like, self-preservation was not something we instructed the AI to do. We might want those AIs to have self-preservation, but subservient to our self-preservation, or something like this– like in Asimov’s laws of robotics, which, by the way, didn’t work. You should read the novels. [LAUGHTER] OK. So what can we do about this? In this workshop– this talk is within this workshop– you’ve heard people talking about all sorts of approaches. And I’m going for this idea that all of the really bad scenarios of loss of control to AIs that people have come up with involve the AI having agency. In case you didn’t read the news, companies are racing to build AI agents, in other words, AIs that can do things in the world, can plan, will control your computer, have your credit card, and do things in your stead, and so on, which looks very useful and probably economically very valuable. And they’ll take a lot of jobs out of the job market, and that will be worth quadrillions of dollars. But are we sure we can control them? Right now, the answer is no. So how about– yeah, what about this? If we lose control, we don’t know what the probability of that is. How do we quantify that? People disagree. And it’s very hard to have a model of the future like this. But we know that if it happens, it could be terrible. And we don’t know what the probability is. And that is exactly the context in which we should apply the precautionary principle. So what is the precautionary principle? It says exactly that: if I’m going to do an experiment, and it could turn out really, really bad, and I don’t have a good handle on the probability that that would happen, but I don’t have strong evidence that it would be tiny, then I shouldn’t do the experiment. Or I should maybe try to do other experiments to understand better what I’m going to do, or something, but not jump into the– Oh, I have another analogy that I often use. We are driving in a fog, and we can see that there are steep slopes there. And we don’t know if this particular road is going to be dangerous or not. But right now, we are just accelerating.
And it doesn’t sound right. OK. Now let’s move a bit more towards the technical things. But it’s still going to be very high level for now. What are the conditions for a system, an AI system in the future, to cause catastrophic harm? There are two conditions. It needs to have the capability to cause that harm, which means the intelligence and the affordances. And it needs to have the goal to do it, the intention. So it’s very unlikely that we’ll stop the train of capability. People in the world will build machines that are smarter and smarter. We don’t know how much time it’s going to take before they’re smart enough to do really dangerous stuff. But we’re going to get there. And so the only thing that seems more plausible for us to manage is to make sure that they don’t have bad intentions. And of course, the intentions could come from bad humans, I mean, malicious humans or whatever, or, if we lose control, from the AIs themselves, like if they want to preserve themselves. Yeah. In both cases, we’d like to be able to get an honest answer about whether a plan or an action of an AI could cause significant harm, because if so, then we should block that action. So a lot of my talk is going to be about, how do we make AIs honest, even when they get to be way smarter than us? And I don’t have the answer, but I’ll talk about a path towards this. There was a talk at the last NeurIPS by David Krueger. He showed a version of that picture. You see the Terminator here in the middle, the bad AI. What are the conditions for an AI to do bad things? It’s very close to what I just talked about. He used other words, but you’ve got intelligence, which is knowledge and inference. I’ll come back to that. And you’ve got the goals. Now, the goals have to be about the AI itself, because I could be intelligent about other people’s goals, like a psychologist knows about the goals of other people. But that’s different from the psychologist’s own goals. So you can know about goals, but whether you have goals of your own, that’s another thing. And then, of course, you could have bad goals and you could be smart, but if you can’t do anything in the world, then you can’t do a lot of harm. That’s affordances. So the trio is the thing that kills us. So how about getting rid of this, minimizing this to its simplest possible form? My proposal is going to be that the thing is forced to estimate conditional probabilities. It’s going to be a probabilistic oracle. It has no choice as to its answer. Its answer is the consequence of the laws of probability, and there are no goals– no self-goals. All right. And in this, I’m going to deviate from the AI gospel, at least in deep learning, which has been, for decades: let’s build machines in our image. Let’s build machines that have neurons and neural networks, just like our brain. Let’s get inspiration from neuroscience, cognitive science, and so on, and use that as a template for building smart machines. You can see how this could go wrong. If we build entities just like us but able to communicate with each other billions of times faster than us, able to control robots much faster than we can, and smarter than us, and, like us, wanting to preserve themselves, we may be in danger. So I think we have to deviate from that template and build machines under a different blueprint, one that is going to guarantee that they’re not going to turn against us. And so I’m going to talk about a building block for AI which is nonagentic.
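To make the contrast concrete, here is one possible way to picture the difference between the agentic blueprint and the nonagentic probabilistic oracle being proposed, written as a pair of minimal interfaces. The class and method names are illustrative assumptions, not part of any actual system described in the talk.

    from abc import ABC, abstractmethod

    class Agent(ABC):
        """Agentic blueprint: keeps state, pursues its own goals, acts in the world."""

        @abstractmethod
        def act(self, observation: str) -> str:
            """Choose an action intended to advance the agent's goals."""

    class ProbabilisticOracle(ABC):
        """Nonagentic blueprint: no state, no goals; it only answers probability queries."""

        @abstractmethod
        def probability(self, statement: str, context: str) -> float:
            """Estimate P(statement is true | context); no memory, no actions, no preferences."""

The only output the oracle type is allowed to produce is a number constrained by the laws of probability, which is the sense in which it has no choice as to its answer.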
Let’s just get rid of goals and just get the intelligence– machines that understand the world and want nothing. Then we can maybe use that as a building block for agents. But for now, I’m just going to propose that we can use it to control agents, because in order to control an agent, to make sure that it doesn’t do bad things, you don’t need to be an agent. You just need to be good at predicting the future. Is this action going to cause harm? I mean, I’m simplifying, but if you can predict that an action is dangerous, then you can prevent that action from happening, and you’re good. All right. So we want to disentangle intelligence, or understanding, from agency. And what is understanding? OK. I’m going to take the template of what we do in science. That’s why I call it Scientist AI. And so there are two components. One is understanding how the world works. And the other is doing inferences based on that knowledge. And it’s not hard knowledge. It’s hypotheses, it’s probabilities. We’re not sure, but we have to navigate that uncertainty. And then we can do inference on top of that knowledge, like running quantum physics or doing math or whatever we do, which is generally intractable, but we’re going to approximate it. And that’s inference. OK, so that’s going to be a way to start. And if we were able to do that, what could we do? Well, if we have machines that can generate hypotheses that explain the data, well, that’s great for scientists. So it could help us advance science. And I got into this particular blueprint for designing AI because I was doing AI for science. And I’m still doing that, in the medical and climate domains. But the main application with respect to the safety issue is what people now call a “monitor.” But you can think of it as a guardrail. So it’s a piece of code that’s going to be sitting on top of an AI agent. And the agent isn’t allowed to act directly in the world. It proposes an action, and then our guardrail AI predicts whether the action is bad under some specification of what is acceptable. And if the probability is above a threshold, it says no, and we resample another action from the policy of the agent. Another application, which is more downstream: people in the AI labs are already thinking about accelerating research on AI using AI. If we had AIs that are as good as AI researchers at doing AI research– not necessarily at everything, just AI research– you could see how it would increase the pool of labor that does AI research. In fact, it could increase the pool by orders of magnitude. And, well, that may be dangerous if the AIs we’re using are not trustworthy and have their own goals. However, if we use something like a Scientist AI that we trust because it doesn’t have any goals– it just helps us do science– it might be a safer path. OK, so let’s go back to some of the ideas needed to build something like a Scientist AI. And this notion of breaking the problem into how the world works and how to make inferences is essentially the recipe for model-based machine learning, or model-based AI. One nice thing about these kinds of approaches, which are very much in the minority in machine learning but used a lot in applications of AI in science, is that your inference machine can now be trained not just from actual data, but also from synthetic data that your world model can generate. The minimal requirement for your world model is that you can sample from it. It can sample hypotheses.
And then you can use your inference machine to generate pseudo-data that is consistent with those hypotheses. And you can use that to make the inference machine not just consistent with the data, but also consistent with the theories that have been inferred from the data. OK. Another thing that would be nice about a Scientist AI is that it’s honest about what it doesn’t know. It’s honest about its own knowledge. That’s epistemic humility, which means it doesn’t put all its eggs into one theory if there’s another one it could have access to that is equally good at predicting the data. I’m going to illustrate with this little cartoon. This is why you should all be Bayesian, or something similar. Imagine an agent in front of two doors, and it’s not sure what’s behind the two doors, what’s going to happen. And based on its experience, there are two theories, the bubbles, that it considers plausible. The left theory says: I go left, people die; I go right, everybody gets a cake. The second theory: I go left, everybody gets a cake; I go right, nothing good, nothing bad. OK, which door do you take? Think a bit. Yes?

AUDIENCE: [INAUDIBLE]

YOSHUA BENGIO: Yes. Yes. OK. You live. [LAUGHS] But it is important to realize that the way we’re currently training our AI models isn’t like this. The way we’re training them, by maximum likelihood or reinforcement learning, they would be happy with either of the two theories. They would also be happy with any mixture of the two theories. But they would be happy with just one, because the theories equally predict the data. The loss is zero for either of those theories. And if you pick the wrong one– let’s say 50% of the time– you die. So that’s not good news with the way we’re currently doing things, in terms of safety at least. All right. Now I’m going to move towards the path that I’m exploring. So we’d like to have this Scientist AI, which is basically building a model of the world. And we’re not going to train it to do things and then get rewards for doing them well or not. We’re going to train it to just understand how the world works. And the problem is that the world is very complicated, as you know. And how do we reason, how does the AI reason, at the scale of the current frontier AI systems? We don’t know how to do that. And it would be very worthwhile, from a capability point of view, to have machines that can reason better than what we currently have. But OK. So the way I’m thinking about this– by the way, going back to the philosophical aspect, if we want machines that are really safe, they’re going to have to be really capable as well, because how would you make sure that an AI is not going to do something really terrible by mistake? It has to be capable enough to anticipate that its actions could be bad. And so it needs capability. And it needs to reason well to predict the consequences of its actions. So I’m going to think about a model of the world which has all of the variables that each characterize a different aspect of the world. So there’s a zillion of these possible variables. Every statement you can make in English or in math or in physics tells something about an aspect of the world. And that property could be true or false– or you can think of it just as true or false for now. And for this humongous number of statements, exponentially large, we don’t know which ones are true and which ones are false. So these are, in general, latent variables.
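Returning to the two-door cartoon above, here is a minimal worked version of the argument. The 50/50 posterior over the two theories and the 0/1 harm values are assumptions chosen purely to mirror the example, not anything computed in the talk.

    # Two theories fit the past data equally well, so give each posterior 0.5.
    posterior = {"theory_A": 0.5, "theory_B": 0.5}

    # Harm predicted by each theory for each door (1 = people die, 0 = no harm).
    harm = {
        "theory_A": {"left": 1, "right": 0},  # left kills, right gives cake
        "theory_B": {"left": 0, "right": 0},  # left gives cake, right does nothing
    }

    def expected_harm(door):
        """Posterior-averaged probability of harm for a given door."""
        return sum(posterior[t] * harm[t][door] for t in posterior)

    print(expected_harm("left"))   # 0.5 -> the Bayesian agent avoids this door
    print(expected_harm("right"))  # 0.0 -> safe under both plausible theories

    # A learner that committed to theory_B alone sees no danger on the left,
    # walks through, and under the averaged posterior dies with probability 0.5.

This is the sense in which a maximum-likelihood or RL-trained model that picks one zero-loss theory can be catastrophically overconfident, while averaging over all equally plausible theories keeps the risky door closed.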
The only things that are not latent are observations, which say things like, oh, this person said this, or, we recorded this image at this time at this location. And so, really, for those of you who listened to the previous talk by Kolya Malkin, it would be good if we could use powerful probabilistic inference machinery that is also going to be efficient. And he and I and others have been working on a particular approach to using neural nets to do probabilistic inference. We call that GFlowNets. And that’s part of a much larger set of techniques that have been developed by researchers over the last decades, and especially in the last few years. So we’re going to want to apply these kinds of techniques, or maybe improve and develop them, to handle the monster world model of an AI that, similarly to current AIs, knows a lot of stuff, but also can reason about it. Yeah, I’m going to skip some of my more technical slides, which are maybe not necessarily for this audience.

AUDIENCE: [INAUDIBLE]

YOSHUA BENGIO: But– [LAUGHTER] Give me an extra half hour. Let me make some connections. I mean, in a way, I’m going to summarize this by saying– I’m going to use this. We can train neural nets to estimate conditional probabilities, especially about latent variables, for which we don’t have data. So we can’t use normal supervised learning. But the good news is we have fairly good methods, which have also been applied to large language models, to make these neural nets generate hypotheses, which is exactly the stuff I was talking about. And the good news, as well, is that they are actually pretty close in structure to what people are doing these days when they use reinforcement learning to train the language models– which are not language models anymore, they are agents– to think better with these chains of thought, which are the internal sequences of words that are used before producing an answer. So in this paper from last year, the AI sees the sentence, “The cat was hungry.” And the next thing that comes is, “Now the cat is sleepy, not hungry.” What could have happened in between? This is an inference. And we don’t get to see it, but we can train our neural nets to produce something that would be a plausible cause, a plausible explanation, for the thing that comes next. And yeah, so you could do this sort of thing. Right now, the way they do these things is they train them with normal reinforcement learning, which tries to generate a chain of thought that makes it very likely that you produce the right answer, and also produce an answer that people would like. But I’m going to focus on a version where, instead of trying to maximize the probability of producing the right answer, we’re going to sample in proportion to the probability of producing the right answer. And that allows us to be Bayesian, essentially– to generate, potentially, all of the plausible explanations. And these techniques have been used, also with a different subset of authors, to generate things like causal structure. So it’s not just the variables that are good explanations for what we see, but also how these variables are related to each other, through a causal graph that identifies the parents, the causal parents, of a particular variable. And now the variables that we’re talking about– do I have this somewhere? Oh, I think I did it already. So what are the variables we’re talking about? The variables are not the usual x1, x2, x3, x4 that we have in probabilistic models or neural nets– not unit 1, unit 2.
No, the variables are all the possible sentences that affirm something about the world. So that’s a bit tricky, but we can basically apply these same ideas there. There’s another thing that’s quite important, which is that we’re going to define training objectives for these neural nets. And we know how to do these things. We can define training objectives that have the property that the global minimum of the training objective gives us exactly the conditionals that we want. That’s the good news. The bad news is that finding the global optimum is not realistic. Neural nets never get to a global optimum. So we’re going to get approximations, which means we’re going to get conditionals that are not exactly consistent with each other. And they can be wrong. And for a number of reasons, it might be useful to estimate not just the conditional probabilities, but also how much we trust those conditional probabilities. For example, if I’m going to make a decision, like this guardrail, about the probability of harm– well, if the neural net says 0.1– OK, I’m just using stupid numbers– but it’s plus or minus 0.05, then maybe my thresholding should be more conservative: like, oh, maybe it’s 0.15. Is that above my threshold? So depending on the uncertainty I have in my predicted probabilities, I should be more or less conservative in my decisions. And there are other reasons why we would like to have things like this. So for now, I’m just going to put that down as a desideratum and not tell you how to do it, but there are ways. Another topic I want to mention, which is really important, is that it’s not going to be enough to have a machine that is really good at predicting what a human would say. We need to be able to have trustworthy answers to our questions. And if we imitate what a human would say, even if we have a really, really good model, then it could still be bullshit, as in deceptive, because humans can be deceptive, and because the AI might be impersonating a real human in this context. And it happens, as I’ve kind of given you some examples. Humans have a beautiful thing called motivated cognition, meaning that, unconsciously, we will have thoughts that are in agreement with our interests. We will not think the things that make us uncomfortable or make us do things against our own good. I mean, some people do, but often we have these mental defenses. Anyway, let me give you an example. When you read in a text that somebody wrote, “I agree with you,” it may be true or not. It’s not because it’s written that it’s true, right? But in terms of building an AI, the question I want to ask isn’t, is the person going to say that they agree with me? I want to know if the person really agrees with me. I want to know if this thing is going to cause harm. I want the honest answer. And I can’t have it if I’m just using a machine that imitates what a human would have said, or that has been trained to say things that a human would like, which is what RLHF is doing. So I want to be able to interrogate the latent causes of the data. Normally, the way we use chains of thought is just as a way to produce better answers, to reason, and so on. But here, my main use for this reasoning isn’t just that it gives a better model that predicts better; it’s that now I can ask questions about the latent causes, the things that are not directly observed, because I can trust them better than the things that people would have said. So I can ask, is this person saying it because they want the job?
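Here is a minimal sketch of how the guardrail described earlier might combine with the conservative-threshold idea just mentioned. Every name and number in it (the stubbed oracle, the 0.05 uncertainty margin, the 0.12 threshold) is a hypothetical illustration rather than the actual system.

    import random
    from typing import Callable, Optional, Tuple

    def harm_probability(action: str, context: str) -> Tuple[float, float]:
        """Stand-in for the Scientist AI oracle: an estimate of P(harm | action,
        context) plus an uncertainty margin. Stubbed with random numbers here."""
        return random.uniform(0.0, 0.2), 0.05

    def guarded_action(propose: Callable[[str], str], context: str,
                       threshold: float = 0.12, max_tries: int = 10) -> Optional[str]:
        """Reject any proposed action whose conservative (estimate + margin) harm
        probability exceeds the threshold; otherwise let the action through."""
        for _ in range(max_tries):
            action = propose(context)
            estimate, margin = harm_probability(action, context)
            if estimate + margin <= threshold:  # be more conservative when uncertain
                return action
        return None  # nothing acceptable was proposed; escalate to a human

    # Usage: wrap an untrusted agent's policy behind the guardrail.
    agent_policy = lambda ctx: "draft reply to: " + ctx
    print(guarded_action(agent_policy, "customer email"))

The design point is that the guardrail never acts on its own; it only turns probability estimates, widened by their uncertainty, into accept-or-reject decisions on actions proposed by someone else.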
And there may be various tricks that we can use to achieve that. For example, we can distinguish the surface form, the syntactic structure, of the data we observe– which really should be of the form, somebody said “I agree with you,” and not just “I agree with you,” as we do now– from a latent variable, which would be something like, it is true that they agree with you. These are two different things. “x is true” versus “somebody said x”– they are two different statements. And only one of them is usually going to be observed. And we’d like to be asking questions about the latent version, which is like the truthiness of some statement we care about. So we want to build our machine to do inference not just about what data could be observed, but also about the causes. And we need to be sure that those causes are going to be spelled out in a way that humans understand. I’m going to come back to that. Wait, I should have– here. Let me do it here. Yeah, we need to make sure that those explanations remain interpretable. Right now, when we look at the chains of thought, they do look fairly interpretable. Sorry for the typo. In other words, it looks like the AI is using the same language, the same semantics, for the words that it thinks with and the words it produces as output. And there’s a good reason for this, because it’s the same neural net. It’s the same parameters, the same architecture, the same token embeddings, and so on. But it could still learn to produce words that have a different meaning in its chain of thought, because it knows that one isn’t the same thing as the other. So we might want to add further regularizers to make sure it stays within that property of interpretability. For example, you could regularize the sequences that it produces in its reasoning to have statistics that are close to natural language, so it’s not going to produce gibberish. At least it’s going to look like English. A more powerful kind of regularizer is one where you say, OK, we’re going to ask another AI, which has been trained independently and knows natural language, to look at your chain of thought and use it to solve a problem. And so if the main AI we’re training is deviating from the normal interpretation of English, then the other AI is not going to do a good job of using that information. And the other AI isn’t allowed to be trained on this. So it has to use its normal interpretation of English in order to use that information. But yeah, these are just ideas. I don’t have a proof that this will work. But I think we need to solve that problem of reasoning in an interpretable way, so that we can use it to ask questions for which we get honest answers. I want to leave a bit of time for questions, so I think I’m going to stop here. I wanted to talk about the connections between what I’m working on and an earlier talk this morning by Geoffrey Irving on the prover-estimator debate. It’s a little bit technical, but the idea is that we’d like to have some guarantees that the kind of approach I’m talking about gives us good estimates of probability. So it’s not just that we’re minimizing this objective function; we’d like to have even stronger guarantees. And he and his collaborators have been working on methods where– let’s say you think of a dialogue between two mathematicians. One is producing a sketch of a proof, and the other is a skeptic.
And the skeptic says, I don’t believe that lemma. Give me a proof of that lemma. And then the prover needs to come up with a new lemma that supports the one the skeptic didn’t believe in. And in this way, eventually the skeptic says, yeah, OK. Geoffrey, you can correct me, but in one version, when they don’t see a disagreement that’s sufficiently strong, they might stop and then make a probabilistic assessment of how much they trust the conclusion. All right. I want to mention, now going outside of the technical questions, that in order to avoid the sort of catastrophes I’ve been talking about, yes, we need to do more science of the kind that I and other people in the workshop have been talking about, so as to design better safety around, or for, AI. But even if we have technical ways of making AI safer, it doesn’t mean that the world is going to adopt these things. And even if we had AI that was very safe, it could still be a tool of power in the wrong hands. It could concentrate– will concentrate– even more power and wealth than we see today. So the whole questions of politics and governance are super important. We can’t solve one problem and ignore the other. In addition to the risks of loss of control, we have to cope with all kinds of other risks that I just want to mention quickly. Economic existential risks: companies that are going to be erased because they are replaced by AI-driven companies. Whole countries’ economies might disappear in this way. The use of AI, not just economically but politically and militarily, could put liberal democracies at risk as well. And the use of very powerful AI that can design new weapons, that can launch cyberattacks, build bioweapons, and so on– and criminals would also want to use these things– could create chaos in our world if we’re not careful. And the proliferation of these tools into the wrong hands is something we need to be on the lookout for as well. Personally, the only direction I can see right now that avoids these problems in the longer run is one where we converge to the most advanced AI as a global public good, managed multilaterally, so that no single person, no single company, no single government can abuse the power that these AIs will give. I’ve started a new nonprofit organization that’s doing the kind of research I talked about. So if people are interested, contact me. Thank you.

[APPLAUSE]

UMESH VAZIRANI: Thanks, Yoshua, for a wonderful talk. There’s time for questions.

AUDIENCE: A question about the Scientist AI idea. Intuitively, it feels like being a good scientist often involves doing many agentic things, where–

YOSHUA BENGIO: You mean experiments?

AUDIENCE: Yeah, experiments, among other things, as well as various ways of gathering evidence that might not literally be experiments.

YOSHUA BENGIO: Well, that’s kind of experiments– doing actions in the world in order to obtain information.

AUDIENCE: And so the picture in my head is that to get a good Scientist AI like you describe, you probably need some agentic components to that. And it seems like–

YOSHUA BENGIO: Let me respond to that. Yes, you do. And you can use the Scientist AI, which is just like a Bayesian oracle, to do it. So it turns out– and this is something I’ve worked on in the last few years– you can use such a Bayesian predictor not just to generate hypotheses that are plausible, but also, using, again, amortized inference, to compute things like information gain.
In other words, if I were to do this experiment, how much information would I gain about the hypotheses that I’m trying to disambiguate? And then you can use, again, the same sort of inference machinery to sample experiments that have high information gain. In addition, you want to add another criterion: they should have high information gain, and they should not cause harm. But we know how to do that. We can just ask our Scientist AI to make a prediction about the causal effects on harm. So what I’ve said is that we can decompose the task of looking for evidence, of doing experiments, which is agentic, into a lot of smaller tasks that we can understand, that we can have a training objective for, that correspond to estimating some mathematical quantity, and then combine these pieces together and have something that is safe while acquiring information. And in general, that’s the direction I want to go in. I want to make sure– so this is in contrast with training the whole thing end to end by reinforcement learning, where we don’t know why and how it’s coming up with its decisions. And it might not even be sufficiently sample-efficient, because it might need to do a lot of experiments before it finds what’s dangerous, what works in terms of acquiring information, and so on.

AUDIENCE: Thanks.

YOSHUA BENGIO: There was somebody over there.

AUDIENCE: Thank you. Thank you for a fascinating talk. Early on, you put out two kind of necessary conditions for potential harm.

YOSHUA BENGIO: Yes.

AUDIENCE: It was the–

YOSHUA BENGIO: Intent and capability.

AUDIENCE: –intent and capability. Exactly. And I seem to understand that you’re arguing that we could develop capability without creating intent.

YOSHUA BENGIO: Exactly.

AUDIENCE: Now, as we’ve seen in the last three months, you could have what seems like the most foolproof constitution in the world, and if societal norms don’t support it, it all goes to hell. All of us in the room could work towards this, publish, show the capability side. What prevents a bad actor from taking that capability and adding intent to it?

YOSHUA BENGIO: Well, that’s why I have the slide at the end saying that we need the societal guardrails as well. We need norms, rules, treaties, regulation, all of these things– liability, whatever tools lawyers and diplomats work on to minimize those risks. And that also means striking deals with people we don’t like who have shared interests. They also want their children to have a life. It’s hard, but that’s the only option.

AUDIENCE: Yeah, thanks for the great talk. My question is also related to your page 13, where you list the conditions for AI to cause harm– intention and also capability. So I’m wondering whether intention is necessary in this case, because we don’t–

YOSHUA BENGIO: No, it’s not, right. But it’s the most egregious kind of risk. In other words, we’ll build machines that will be imperfect. They’ll make mistakes.

AUDIENCE: Right. Stochastic machines, right.

YOSHUA BENGIO: We should not be stupid enough to put them in control of our nuclear weapons, so that if they make a mistake, we don’t all die because of that stupid mistake. But what I’m worried about is cases where they seem to have only minor affordances– oh, they just go on the internet and talk to people and do financial transactions– and it turns out very badly, because combining these actions can lead to things like changing governments and convincing criminals to do all sorts of things. So, yeah, it’s much more important to control the intent.

AUDIENCE: Yeah.
I’m just trying to argue that maybe it doesn’t need the intention. Stochastic behavior alone could cause enough harm.

YOSHUA BENGIO: But it’s just much less likely.

AUDIENCE: Right.

YOSHUA BENGIO: Let me give you an example. For most really bad things that an AI could do, it’s very unlikely that it’s going to kill everyone on this planet. But if an AI is smarter than us and has, maybe, robots around the world, so that it doesn’t need us anymore, it could just generate waves and waves of bioweapons until, basically, we’re done and there are zero humans. Intent can give very powerful results, as we humans have shown. We have a goal and we do amazing things, because we keep going until we reach the goal. And the chance that we could have reached that goal by accident is zero– I mean, not mathematically zero, but statistically zero.

AUDIENCE: Thanks.

AUDIENCE: Thank you so much for the talk. So, elaborating on the first question: in order to have a great scientist, we need some element of intention in it. And you said earlier there is a very big difference between an agent that has goals and an agent that has self-goals.

YOSHUA BENGIO: Yes– no, not one that has goals, one that understands goals. But having self-goals is a different thing.

AUDIENCE: Yeah. So what is the way to measure whether the agent has goals or self-goals?

YOSHUA BENGIO: Right, right. This is a great question, and I think we need more people to think about it. One aspect that people have discussed is the notion of the state of the agent, of situational awareness. So, as an agent, if I want to go from here to there, I need to keep track of my state along the way so I can make progress. So we don’t want an AI to keep track of its progress as it’s behaving in the world. Instead, we just want it to be a memoryless black box that we can interrogate to answer the sort of questions we care about, for example, scientific questions or safety questions. And it doesn’t have a continuous activity like the reinforcement learning agents that we build right now.

AUDIENCE: Would it also mean limited generalizability as long as [INAUDIBLE]?

YOSHUA BENGIO: No, no. It’s not about generalization. It’s about a continuous sequence of actions towards achieving a longer-term goal.

AUDIENCE: So, conditional on your worldview, maybe a significant fraction of academics, for example, should spend a lot of time on safety. What do you think is the obstacle to that happening? Is it disbelief in capabilities or timelines? Is it, conditioned on belief in capabilities, disbelief in risks? What is the key variable?

YOSHUA BENGIO: I think there are many excuses for not taking these things seriously. I wish I had the answers. And maybe it’s more useful to think not just about the reasons that people give– for which, as far as I know, there are strong counterarguments– but about why it is that we have these thoughts. Why is it that I had these thoughts before January 2023? Why is it that I didn’t take these things seriously before I realized, oh shit, what’s going to happen with my children? And I think it’s psychological defenses. I was in love with my mission of building better and smarter AI and understanding human intelligence and so on. And I felt good about the economic potential of AI. And yeah, when some thoughts go against the things we have said, the things we have made our identity around, or our financial interests, it’s very hard to have those thoughts. It’s well studied in psychology. And we have the same phenomenon for climate. I mean, the science is obvious.
Like, come on. What’s going on? People have a mental block. That’s the only possibility at this point. And it happens for things that are uncomfortable. Look, I’m saying these things, but I don’t really know. I think we need serious scientists to study that. And people are starting to do that in the climate arena, to look at the psychological factors involved. It would be good to know more.

AUDIENCE: Thank you for– yeah. Thank you for such an honest and thoughtful talk. On this slide, the two conditions that you’re talking about, capability and intent– I keep thinking there’s agency– like, another variable is needed. Because capability–

YOSHUA BENGIO: Yeah, when I say intent, agency– they’re kind of interchangeable.

AUDIENCE: It’s different, though, because you could have an intent, but if it’s not capable, and it doesn’t have the agency– so for example, think about–

YOSHUA BENGIO: So this slide brings in another word, which is affordances. So you can have the intent, but if you can’t carry out the plan, then it doesn’t work.

AUDIENCE: Which is what we do with nuclear weapons now, for example, right? We put all these safety measures in place, not because it’s needed– we are capable of– people have access to blowing it up. But we build this in by design. So solving technology problems with just technology– maybe that’s a place where we’re missing–

YOSHUA BENGIO: Oh, totally.

AUDIENCE: Yeah. So–

YOSHUA BENGIO: I spend maybe more time on the policy side of things these days.

AUDIENCE: Thank you.

UMESH VAZIRANI: Next question is, how long should we let this run? Do you want to [INAUDIBLE] into–

YOSHUA BENGIO: Oh, I’m happy to go on, if Siva is not unhappy. [LAUGHTER]

AUDIENCE: We have a panel.

YOSHUA BENGIO: Yes, we have a panel right after, so five more minutes?

AUDIENCE: Yeah. OK.

AUDIENCE: Hi. Yeah, so earlier in your talk, you said that this world of AI agents is a quadrillion-dollar opportunity. So I think it’s fair to say that this is probably going to happen unless some strong governance policies come into play. And then you also had another graph where you extrapolated that maybe we have 5 to 10 years before AIs reach human-level capabilities. So do you have a roadmap of governance subgoals for the next 10 years?

YOSHUA BENGIO: Yeah. Right. There’s a hope that many people in the AI safety community have, which is kind of a crazy hope. The crazy hope is that companies will not be careful enough– which seems to be what’s going on– and some accidents will happen, and then people will wake up. Or maybe not. [LAUGHTER] But it might increase the chances that governments start acting and taking it seriously. But generally, I think the number one factor in these societal questions is global awareness. People need to understand what’s going on. People need to understand that there are very few people in the world who are taking decisions on our behalf, taking risks on our behalf, whether it’s malicious use or loss of control. These are catastrophic risks, and a few people are driving the car, and we’re all passive observers. And that is not democratic, in my book. So once people start to understand that they are being driven down a road over which they have no control and which could cost them a lot, they will start wanting some oversight.

AUDIENCE: Yoshua, I’ve– is this on? Yes? I have a question slash comment. You and I have discussed aligning incentives– whenever one can align incentives, that is usually the best way to prevent harms. And I know you and I have discussed insurance.
But even so– not that it’s a perfect analogy, but take fires, which are not independent of one another, whereas those accidents may or may not be independent of one another– we also have an insurance of last resort–

YOSHUA BENGIO: The government?

AUDIENCE: –which is the government. We also have, in the state of California, wildfire insurance of last resort, which has now been broken by the LA fires, and the entire state is going to have to pay for it. And so is there a way of approaching this with mechanism design, so that one can try to align incentives a little better and make the risk sharing more explicit?

YOSHUA BENGIO: Yeah, totally. Thanks for the question. Actually, the answer is going to answer the previous question as well. So there’s a short-term thing that we can do. It’s called liability insurance. First, liability exists already– tort law– but I think governments would gain by clarifying it, because in the world of software, it’s kind of a gray zone right now. That’s the first step. The second step is that governments should mandate insurance. And then they don’t even need a regulator. Let me explain why. The insurer needs to estimate the risks as accurately as possible, because if they underestimate, they’re going to lose money, and if they overestimate, they’re going to lose to their competition, who estimate better. So the incentive for the insurer is to do an accurate estimation of the risk. For the people who build AI, now they have a very strong incentive to do AI safety, to protect the system, to make sure the accidents don’t happen. And they have a gradient, because for each dollar they put into safety, there’s a chance that they’re going to gain more on the premium. So they have a continuous incentive to improve their work and make their systems safer, which a normal state regulator doesn’t provide. The normal state regulator is pass/fail. So if you pass, you don’t need to make progress. There’s no incentive to improve. But the insurance mechanism is a market mechanism, and it’s actually more powerful than a normal regulator. And it wouldn’t take a lot. It’s a small legislative change. We’ve done it for nuclear. It’s different from normal things like car insurance. But for nuclear plants and similarly rare accidents, we don’t have sufficient data, yet people have done it, and it works.

AUDIENCE: Next question.

AUDIENCE: Yeah. So first, thank you for a wonderful talk. And one thing that kind of reassures me, at least to some extent, about our future coexistence with AI is that, unlike ourselves, AIs haven’t evolved to survive. We have a self-preservation instinct. We fine-tuned it over hundreds of millions of years. And AI did not– at least not yet. And we also have a drive to reproduce. Well, modern people don’t, but evolutionarily we do. AIs don’t have that.

YOSHUA BENGIO: They do, actually. But yeah, go ahead. [LAUGHTER]

AUDIENCE: And AI didn’t evolve to grab resources, to be powerful, and so on and so forth. Those are just not things that AI cares about, in terms of how it was trained. And then, as I was listening to your talk, a scary thought occurred to me. Maybe AI learned all of that by reading all the texts and all the novels and all the histories written by humans. We don’t really understand, at least I certainly don’t understand, how AI represents the knowledge it has. But do you think that the stuff that AI reads, the stories, could actually give AI our evolutionary heritage, in terms of self-preservation instincts and the like, that it was never trained for?
YOSHUA BENGIO: I think that’s a hypothesis I strongly believe in, because they’re trained to imitate us. That’s the main thing– most of the compute is used for doing what a human would have done, which is writing what a human would have written. And then, for other technical reasons, the RL training also pushes in that direction.

UMESH VAZIRANI: It’s probably time to wrap up, but I can’t–

YOSHUA BENGIO: You can’t resist.

UMESH VAZIRANI: –resist asking a question, which has to do with your last slide. It’s a very sobering talk that you gave. And in your last slide, you talked about all kinds of other possible risks. But there’s one thing that I’d like you to comment on, which is: what do you think the chances are that– or what do you think about the thought that even the AI that exists today is capable of great economic and social disruption? And so that might cause a situation where, long before this AI safety work bears fruit, we won’t have a clean room in which to incubate it; it will already be a very messy world.

YOSHUA BENGIO: Yeah. Our politics especially are at risk. I mean, disinformation is happening. And even without scientific advances in AI, I think the current tools, fine-tuned with malicious goals, could be extremely dangerous. So one thing I didn’t mention is the ability of AI systems to make people change their minds, also known as persuasion. There was a study last year showing GPT-4 was equivalent to humans, and very likely– and I hear informally that the new reasoning models are even stronger. And people will misuse that, and not just to sell one brand versus another, right? This is extremely dangerous.

UMESH VAZIRANI: All right. Thanks very much, Yoshua, for scaring us out of our minds.

[APPLAUSE]