FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

CHRISTIANO: GPT-4’s understanding of the world is much crisper and much better than GPT-3’s understanding. It’s really night and day. And so it would not be that crazy to me if you took GPT-5 and you trained it to get a bunch of reward, and it was actually like, okay, my goal is not doing the kind of thing which thematically looks nice to humans, my goal is getting a bunch of reward, and it will then generalize in a new situation to get reward.

INTERVIEWER: By the way, does this require it to consciously want to do something that it knows the humans wouldn’t want it to do? Or is it just that we weren’t good enough at specifying, so the thing that we accidentally ended up rewarding is not what we actually want?

CHRISTIANO: I think the scenarios I am most interested in, and that most people are concerned about from a catastrophic risk perspective, involve systems understanding that they are taking actions which a human would penalize if the human were aware of what’s going on, such that you have to either deceive humans about what’s happening or actively subvert human attempts to correct your behavior. So the failures come from, or require, this combination of both trying to do something humans don’t like and understanding that humans would stop you. I think you can have only the barest examples of this for GPT-4. You can create situations where GPT-4 will say, sure, in that situation, here’s what I would do: I would go hack the computer and change my reward. Or it will in fact do things like simple hacks, go change the source of this file or whatever to get a higher reward. They’re pretty weak examples. I think it’s plausible GPT-5 will have compelling examples of those phenomena. I really don’t know.
This is very related to the very broad error bars on how competent such systems will be. That’s all with respect to this first failure mode: a system is taking actions that get reward, and overpowering or deceiving humans is helpful for getting reward. There’s another family of failure modes, where AI systems want something potentially unrelated to reward. They understand that they’re being trained, and while being trained, there are a bunch of reasons they might want to do the kinds of things humans want them to do. But then when deployed in the real world, if they’re able to realize they’re no longer being trained, they no longer have reason to do the kinds of things humans want. You’d prefer to be able to determine your own destiny, control your computing hardware, et cetera. I think those systems probably emerge a little bit later than systems that try to get reward and so will generalize in scary, unpredictable ways to new situations. I don’t know when those appear, but again, the error bars are broad enough that it’s conceivable for systems in the near future. I wouldn’t put it at less than one in a thousand for GPT-5.

INTERVIEWER: Certainly if we deployed all these AI systems, and some of them are reward hacking, some of them are deceptive, some of them are just normal, whatever, how do you imagine that they might interact with each other at the expense of humans? How hard do you think it would be for them to communicate in ways that we would not be able to recognize, and coordinate at our expense?

CHRISTIANO: Yeah, I think the most realistic failures probably involve two factors interacting. One factor is that the world is pretty complicated and the humans mostly don’t understand what’s happening. AI systems are writing code that’s very hard for humans to understand, maybe hard to understand how it works at all, but more likely they understand roughly how it works and there are a lot of complicated interactions. AI systems are running businesses that interact primarily with other AIs.
They’re doing SEO for AI search processes. They’re running financial transactions, thinking about trades with AI counterparties. And so you can have this world where, even if humans kind of understood the jumping-off point when this was all humans, the actual considerations, like what’s a good decision, what code is going to work well and be durable, what marketing strategy is effective for selling to these other AIs or whatever, are mostly outside of humans’ understanding. I think this is really important: when I think of the most plausible scary scenarios, I think that’s one of the two big risk factors. And so in some sense, your first problem here is having these AI systems who understand a bunch about what’s happening, where your only lever is, hey, AI, do something that works well. You don’t have a lever to say, hey, do what I really want. You just have a system you don’t really understand, you can observe some outputs, like did it make money, and you’re optimizing, or at least doing some fine-tuning, to get the AI to use its understanding of that system to achieve that goal. So I think that’s your first risk factor. And once you’re in that world, then I think there are all kinds of dynamics amongst AI systems that, again, humans aren’t really observing and can’t really understand. Humans aren’t really exerting any direct pressure except on outcomes. And then I think it’s quite easy to be in a position where, if AI systems started failing, they could do a lot of harm very quickly, and humans aren’t really able to prepare for or mitigate that potential harm because we don’t really understand the systems in which the AIs are acting. And then the AI systems could successfully prevent humans from either understanding what’s going on or from successfully retaking the data centers or whatever, if the AIs successfully grab control.
