Jim Babcock’s P(doom) bottom line = 50%


What’s the most likely (“mainline”) AI doom scenario? How does the existence of LLMs update the original Yudkowskian version? I invited my friend Jim Babcock to help me answer these questions. Jim is a member of the LessWrong engineering team and its parent organization, Lightcone Infrastructure. I’ve been a longtime fan of his thoughtful takes. This turned out to be a VERY insightful and informative discussion, useful for clarifying my own predictions, and accessible to the show’s audience.

00:00 Introducing Jim Babcock
01:29 The Evolution of LessWrong Doom Scenarios
02:22 LessWrong’s Mission
05:49 The Rationalist Community and AI
09:37 What’s Your P(Doom)™
18:26 What Are Yudkowskians Surprised About?
26:48 Moral Philosophy vs. Goal Alignment
36:56 Sandboxing and AI Containment
42:51 Holding Yudkowskians Accountable
58:29 Understanding Next Word Prediction
01:00:02 Pre-Training vs Post-Training
01:08:06 The Rocket Alignment Problem Analogy
01:30:09 FOOM vs. Gradual Disempowerment
01:45:19 Recapping the Mainline Doom Scenario
01:52:08 Liron’s Outro

Show Notes
Jim’s LessWrong — https://www.lesswrong.com/users/jimra…
Jim’s Twitter — https://x.com/jimrandomh
The Rocket Alignment Problem by Eliezer Yudkowsky — https://www.lesswrong.com/posts/Gg9a4…
Optimality is the Tiger and Agents Are Its Teeth — https://www.lesswrong.com/posts/kpPnR…
Doom Debates episode about the research paper discovering AI’s utility function — https://lironshapira.substack.com/p/c…

Eliezer’s AI doom arguments have had me convinced since the ancient days of 2007, back when AGI felt like it was many decades away, and we didn’t have an intelligence scaling law (except to the Kurzweilians who considered Moore’s Law to be that, and were, in retrospect, arguably correct).

Back then, if you’d have asked me to play out a scenario where AI passes a reasonable interpretation of the Turing test, I’d have said there’d probably be less than a year to recursive-self-improvement FOOM and then game over for human values and human future-steering control. But I’d have been wrong.

Now that reality has let us survive a few years into the “useful highly-general Turing-Test-passing AI” era, I want to be clear and explicit about how I’ve updated my mainline AI doom scenario.

So I interviewed Jim Babcock (@jimrandomh), LessWrong and Lightcone Infrastructure team member, to help me clarify what exactly the update is. Jim’s thinking on the topic is similar to mine, but deeper in many ways. I personally learned many significant new insights from his answers.

The video is on my “Doom Debates” Substack and YouTube channel, and you can get the audio podcast version by searching “Doom Debates” in your podcast app. Full transcript below.

The Most Likely AI Doom Scenario — with Jim Babcock, LessWrong Team

Liron: Welcome to Doom Debates. Today we’re going to be talking about mainline doom scenarios. And for this I’ve brought on my friend Jim Babcock. Jim got his master’s in computer science from Cornell, and then he joined the LessWrong.com core developer team. So he’s a member of Lightcone Infrastructure, the organization that runs LessWrong and a bunch of other projects.

Jim is a smart guy who, like me, has been thinking about AI doom for almost 20 years. And so I specifically reached out to him because I wanted to do an episode straightening out my own thoughts on how they’ve been evolving about mainline doom scenarios. Jim Babcock, welcome to Doom Debates!

Jim: Thank you for having me.

Liron: Thanks for coming on. Yeah, I’m glad we can do this, because I’ve been messaging you for a long time trying to think of a good occasion for you to come on Doom Debates. And as I was thinking about what exactly the mainline doom scenario is and how it has evolved, I realized that it’s not as cut and dried as I originally thought. And I think that we should hash it out and let the viewers see our latest thinking.

Jim: Yeah, there was a lot of careful reasoning that has held up very well, and then there was a lot of informal community distillation and vibes that held up less well.

Liron: Well, all right, before we get into that, where are you from?

Jim: I’m originally from the Boston area. I’m now living in Oakland.

LessWrong’s Mission

Liron: Cool. So these days, your main focus is maintaining LessWrong and, as I understand it, making it a better, more productive forum to help humanity make sense of these crazy times that we’re entering.

Jim: That’s right. So LessWrong has been the place for high-sophistication discussion of rationality and AI pretty much for as long as those have been widely discussed. We’re certainly in competition with Twitter and Bluesky and various other sites, and if you want up-to-the-hour news, that’s probably where you’d go. But the site is designed around nudging people towards thinking deeply, towards long-form thoughts, and towards very aggressive quality filtering.

Because it’s a nonprofit, it doesn’t have to be engagement-maximizing in the way that most of the rest of the Internet is. So here’s an example of a thing we did that I think is a microcosm of how we think about things. You post a comment or a post, you get replies, you get upvotes. And when people upvote your content, you find out, oh, I got plus five points because people liked my thing. And if you post something on Facebook or Twitter, you’re incentivized to keep coming back and getting these notifications that people liked your content one at a time.

Liron: Um, and this is super addictive.

Jim: This is. This is bad for you. It is maximizing the sort of dopamine drip. We have a thing that notifies you that your thing got upvoted, but it will batch the notifications to once per day. So if you are stuck in an obsessive refresh loop, the algorithm of the Karma notifier guarantees that there is no reward for wasting your time. You have to wait a day and come back.
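To make the batching idea concrete, here is a minimal sketch in Python of the kind of once-per-day digest Jim describes. This is my illustration with hypothetical names, not LessWrong’s actual implementation (which runs on a different stack entirely).

```python
from collections import defaultdict

# Minimal sketch of daily-batched karma notifications (hypothetical names,
# not LessWrong's actual implementation). Upvotes accumulate silently and
# are delivered as a single digest once per day, so refreshing the site
# early gives no extra reward.
pending_upvotes = defaultdict(int)  # user_id -> upvotes received since last digest

def record_upvote(user_id: str) -> None:
    pending_upvotes[user_id] += 1  # no notification is sent at this point

def send_daily_digest() -> None:
    # Imagined to be run by a once-per-day scheduled job.
    for user_id, count in pending_upvotes.items():
        if count:
            print(f"Notify {user_id}: you received {count} upvotes today.")
    pending_upvotes.clear()
```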

Liron: Yep, I’ve experienced that myself. Yeah. So LessWrong is full of features like that to really emphasize that when you’re coming here, it’s all about high quality discourse. It’s not about cheap shots, it’s not about the quick spikes of dopamine. This is civilization actually trying to coordinate on a high level of discourse.

Jim: That’s right. We do very much think of this site as trying to farm intellectual progress.

Liron: The theme that runs through it is the intentionality where you can see every feature. It’s done because it’s trying to raise the level of discourse. And if it doesn’t work, you guys are going to pull it back and try something else. And the innovations accumulate. And it’s a very distinct place, the Internet. It’s a place where you can actually get some breathing room to read the damn post. You know, the reading features are very nice. The style is very nice for actual rereading. So it is fun to see all this kind of intention.

So on LessWrong today, the main topic is AI. I feel like most of the posts are about that. But when it started, Eliezer Yudkowsky was writing a lot about rationality. That was kind of what the Sequences were originally for: to teach people how to think better, so that by thinking better, it would become more obvious why this AI stuff is so important and so urgent. What do you think of that whole project of training the rationalist community to become better thinkers? Do you think it was valuable, and do you think it was reasonably successful?

Jim: That’s right. So the story, as Eliezer has told it, is that he tried talking to people about AI on the Internet and it was very frustrating, not because they didn’t know things about AI, but because they didn’t know things about how reasoning is supposed to work. And he kept having to divert from talking about what he wanted to talk about, which was AI and AI risk, into talking about rationality and the basics of it. And he said, okay, I’m just going to write about rationality for a while. I believe there was a very long period where he was writing one post per day on Overcoming Bias, with Robin Hanson as a co-blogger.

Liron: Yeah, you know, very long period. It feels like a blip now. It was like two years, you know. Yeah, that was a formative experience for me back in 2007, being like 21 years old and seeing the new Eliezer post come out daily, that he’d spent like 12 hours working on. And then I would take like half an hour to absorb it. And it was like a daily ritual. And I’m like, I can’t believe I’m here for this. This is like so profound.

Jim: Yeah, I too came out of that, sort of looking at this ideal reasoning he was writing about and contrasting it to, say, the high school debate class I took, where they taught motivated reasoning as though it was supposed to be a good thing. And there was really nothing else in the culture that was giving these kinds of ideas.

Liron: Right. Yeah. And I also took, like, college philosophy as an undergraduate, a lower-division requirement, and it was just like a mass of problems. And then I read Eliezer and I’m like, wait a minute, some of these have known solutions.

Jim: Yeah, I also had that. The sort of classic, well, you know, here’s, I don’t know, dualism versus materialism or something. And they teach the conflict rather than teach the answer. And once you realize that there is an answer and that you could be learning the answer, that suddenly feels really toxic.

Liron: Right, right, right. And then, you know, for me, one of the key worldview shifts, what Robin Hanson might call a viewquake, one of those key moments for me was realizing that you could do all of philosophy from the standpoint of: what do we program into the AI? We can be humans smoking a joint and talking about our feelings, but at the end of the day, the AI starts out with an empty file of code, and you have to program something into it. And whatever philosophical conversation you’re having could be cashed out in terms of the contents of that code that’s actually going to execute.

Jim: Right. So in the context of LessWrong 1.0, in like 2012, AI is in this sort of weird middle ground where we are in fact really talking about AI as we predict it’s going to turn out. But it’s also serving this sort of philosophical thought-experiment role, where by invoking AI you can say, okay, well, what if we weren’t dumb humans? What if we were actually doing this right? What would that mean? And just asking the question, what would it mean to be doing this right, leads to a mindset shift.

What’s Your P(Doom)

Liron: Yep, yep, yep. All right, all right, so we are going to crack open our main topic here. We’re going to talk about mainline doom scenarios. But first I want to get your bottom line.

Robot Singers: P(Doom). P(Doom). What’s your P(Doom)? What’s your P(Doom)? What’s your P(Doom)?

Jim: I think it is going to be close.

Somewhere around 50% doom, with the good outcomes mostly being scenarios in which the alignment research goes well, or there is a significant coordination victory that buys time for a much larger quantity of alignment research going at an average pace. It feels a little cowardly to say, I don’t know, probability of doom is like 50%, but it really does look like it is up in the air to me.

The scenarios that succeed, a lot of them are skin-of-our-teeth scenarios: we make AGI that is not quite superintelligent, not quite aligned, but it looks at the next generation of AIs that will come after it the way that Eliezer, who is very pessimistic about AI alignment, looks at upcoming generations, and says, I actually think that I’m in the same boat as humanity and we need to pause here. And that’s one way things go well. The other way things go well is the penultimate AI is good at being an alignment researcher before it gets to being goal-directed enough to fall into the bad attractors that lead to it taking over the world.

Liron: Right, okay, so that’s the bottom line. That’s pretty similar to my own bottom line. I also have roughly a 50% P(doom). I consider it high uncertainty. I’m sympathetic to all of Eliezer Yudkowsky and MIRI’s positions, but I also have a lot of humility and background uncertainty of, like, look, sometimes things just happen differently because you were wrong. I have a lot of that kind of humility.

Let me back up and frame the conversation for the viewers. People who watch Doom Debates, I often say reality is playing out much like the Yudkowskians expected. Like, I consider a lot of the events happening, like, oh, Claude is being deceptive, like, super predictable to the Yudkowskians, and I feel vindicated by that kind of stuff. But on the other hand, the Yudkowskians didn’t exactly predict that smart LLMs would be a thing. Right?

Jim: Uh, I think actually most of the failing there is that people didn’t actually write very much about what they expected AI progress to look like prior to the sort of “there is an AGI and it FOOMs” scenarios.

Liron: That is fair. That is fair, that the data set is relatively small. But I don’t want to speak for Eliezer. I can certainly speak for myself, and I think Eliezer is sympathetic to this, which is the conditional of, like, imagine a day when the Turing Test is passed. Is that day very close to a FOOM? Or potentially already past the point of no return? I know that I personally would be like, yeah, I think passing the Turing Test is probably on the dangerous side of a FOOM, probably already unstoppable. And then here we are. It looks like AIs have officially passed it; an actual paper administered the Turing Test, and they passed it better than humans can. So isn’t that some kind of an update?

Jim: So there’s this classic diagram that lays out an intelligence scale where you have a mouse, you have the village idiot, and you have Einstein. And the point that this diagram makes is that there’s actually a relatively small difference between the village idiot and Einstein as compared to the difference between the village idiot and a mouse or like a fruit fly or something. I think it is reasonable to say that that prediction didn’t pan out very well: that the sort of intra-human range, like the range of AI capabilities that is comparable to different tiers of humans, is wider than we used to think.

Like, GPT-2 is very reminiscent of a human who has had a stroke. And then GPT-3 is like a human who’s not that clever and just sort of messing around and not taking things seriously. And then GPT-4 is more like a college student with an exceptional memory and mediocre reasoning. And we’re now getting into the level where LLMs start acting like extremely lopsided intellectuals. And the belief before, I think, was that we would blaze through that in no time at all. And if you zoom out to a historical perspective, from GPT-2 to today has in fact been no time at all; the amount of calendar time between these generations is very short. But while living through it, it doesn’t feel like no time at all.

Liron: Right. Even though it’s all history now, like it’s all in the past. But yeah, it felt kind of long, seven years of time. You’re calling it a wider range than we expected, from the village idiot to Einstein. But I would argue, like, I agree with you that the range has interesting detail, like, oh, wow, we can pinpoint different steps on this little ladder. But I would still argue that the gap between Einstein and a superintelligence is still a lot vaster than these few steps that just got taken.

Jim: Well, okay, so there’s a couple perspectives on that. One is that a few steps past Einstein, you get to the autonomous AI developer that can work on its own next generation. That sort of thing means that things don’t necessarily stay smooth in the way that they have been smooth so far. And the second argument is that if you look at things in terms of compute efficiency and you look at the historical record, there have been multiple large discontinuous jumps. And if there were one more large discontinuous jump, that would probably go into superintelligence territory.

Like, if you look back across the past two decades, a very prototypical example of this is the idea of dropout when training neural nets. It was the case that you would sort of train neural nets and they would have horrible overfitting problems, and you could kind of sort of get them to predict things such as next tokens. But people mostly didn’t use neural nets to predict text back then because it was too hard. And then this very slight tweak to how you do the training suddenly means that the amount of learning you get per step and the amount of model strength you get per parameter goes way, way up, like a compute multiplier of 10 or 100x. And similarly, when we went from LSTMs to transformers, you could interpret that as like a 1000x improvement in compute efficiency.

You could also look at a loss-versus-compute log-log plot, and it changes the slope on that plot. It’s a very qualitative sort of shift. And if you look at how long it’s been since we sort of settled into the current paradigm of pre-training on next-token prediction, followed by RLHF post-training and a few epicycles, it hasn’t been longer than the historical interval between these giant shifts. And if there is another discovery of the same magnitude as LSTMs versus transformers, then we could go very quickly from AIs that are expensive-to-run mediocre grad students to AIs that are more intelligent than all of humanity put together, and then another order of magnitude on top of that. The odds of things staying continuous don’t look great to me.
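To make the “dropout was a very slight tweak” point concrete, here is a minimal PyTorch-style sketch. The layer sizes and dropout probability are arbitrary illustration values; the point is just how small the code change is relative to the historical effect on overfitting.

```python
import torch
import torch.nn as nn

# Minimal sketch: the only difference between the two models below is the
# extra Dropout layers, which randomly zero activations during training.
# Historically this tiny change dramatically reduced overfitting.
without_dropout = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

with_dropout = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

x = torch.randn(8, 512)
with_dropout.train()   # dropout active: random units are zeroed each forward pass
print(with_dropout(x).shape)
with_dropout.eval()    # dropout disabled at inference time
print(with_dropout(x).shape)
```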

What Are Yudkowskians Surprised About?

Liron: I just want to start with explaining the surprise, because I still feel like it’s surprising compared to five years ago that we now have AI, which is incredibly useful even in my own life.

Just giving personal examples: coding faster, and better image generation, I use that a lot now. Learning new subjects, I often learn by just asking questions and asking follow-up questions and digging into what I’m confused about. And then medical consultations: I had a dermatology consultation the other day and it was so good, like it was so insightful and better than human doctors. It’s like so genius at this stuff.

So the fact that all of this exists, if you told me five years ago, I’d be like that is incredibly surprising. And how can all of this exist and yet not just have like enough general intelligence, just enough of this secret sauce of insight where it’s like, well, now that I know how to think and I know how to chain thoughts together, I’ll just think about myself and I’ll start the improvement loop.

Jim: The thing that is surprising, that I don’t think anyone expected, is the way that LLMs pry apart intelligence versus agency. We’ve sort of pried those apart, where you have these models that, if you look at a local scale, at like the sentence-to-sentence level, are at roughly the top of the human range, sometimes outright superhuman. And if you put them to work over a longer time scale, like you ask them to write a 10,000-word paper or do a coding task that takes more than a hundred steps, suddenly their performance falls off a cliff. And at this larger scale they’re sort of incoherent.

There was a recent paper that basically looked at AIs in terms of task length and how long it would take a human to do a comparable task, and did a plot, and basically found that on a log scale there’s this metric that’s going up, which is how long an AI can stay coherent on a given task.

Liron: Exactly. Yeah, I saw that too. And the idea is that there’s kind of a Moore’s law that isn’t just about input-output on things that are very quick. It’s also like, can you sustainably beat a human at this task that takes two hours, four hours, a day. And there may be some takeoff threshold where, if you can do anything better than a human over a two-week time span, maybe at that point every year-long thing can just be made out of two-week things. And what their graph shows is that there’s a very steady progression toward wherever that threshold is, if it exists. It seems like we’re getting there.

But then for me it raises the question of like, well, wait a minute, why are we even talking about timeframes? Because from the AI’s perspective, there’s not really a fundamental concept of time, right? It just like keeps figuring out what’s the next token, what’s the next token. It’s not really. Time isn’t exactly passing for it. So why are we talking about time?

And as far as I can tell, what time really means is kind of like error rate. So if it has like a low rate of like saying stuff that derails it, right? If the rate of derailment is low, that’s what enables it to continue for a longer time and still be useful. Because any AI can continue for two weeks. But if it said something dumb in the first day, then it spent like the next 13 days not realizing that it’s on like a total wild goose chase.
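To make the “time horizon is really about derailment rate” framing concrete, here is a tiny back-of-the-envelope sketch. This is my illustration, not a calculation from the paper they mention: if each step independently derails the agent with probability p, the expected number of steps before the first derailment is 1/p, so halving the per-step derailment rate roughly doubles the coherent horizon.

```python
# Toy model: per-step derailment probability p, steps assumed independent.

def expected_coherent_steps(p_derail_per_step: float) -> float:
    # Mean of a geometric distribution: average steps until the first derailment.
    return 1.0 / p_derail_per_step

def p_still_on_track(p_derail_per_step: float, n_steps: int) -> float:
    # Probability of completing n_steps without ever derailing.
    return (1.0 - p_derail_per_step) ** n_steps

for p in (0.02, 0.01, 0.005):
    print(f"p={p}: ~{expected_coherent_steps(p):.0f} steps on average, "
          f"P(on track after 100 steps)={p_still_on_track(p, 100):.2f}")
```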

Jim: Right. I think error rate is not quite the right framing. I think if you look at humans and their error rate, it’s actually quite high. Probably worse than LLMs. It’s more like the ability to self correct and keep sight of some sort of goal.

Not long after GPT-3, there was this scaffold someone made called AutoGPT, where basically it was a set of prompts where it’d say, okay, here’s a task, why don’t you make a list of subtasks? And then it would prompt it again with one of those subtasks, and it would generate these lists and lists of subtasks and then try to do them. And it would immediately fall apart, because it would spiral off into getting distracted. It would make subtasks that are sort of vague in ways that don’t make sense.

Liron: Let me just put AutoGPT into context. So for me, AutoGPT is very important because it demonstrates how trivial it is to get agency when you have intelligence. So imagine you have an AI that has intelligence, meaning you can ask it questions, it can always give you really good answers.

All you have to do is you take that AI, which you could call a genie because all it does is answer your questions. You take the genie AI and you just ask it, what should I do next? What should I do next? And then you just do what it says, and then you ask it again what to do next. So that trivial connection, which AutoGPT is basically doing, it’s basically just a trivial little scaffold to keep asking it what to do next and then do it. Do it in the form of like a shell script, or do it by asking it for substeps. It’s like a trivial connection.

Once you know what to do, the extra step where you actually do it and then figure out the next thing to do, that’s just like a small little detail. So agency really does come out of sufficient intelligence, correct?
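Here is a minimal sketch of the “genie plus trivial scaffold equals agent” loop Liron is describing. The helper names are hypothetical placeholders, not any real AutoGPT API: `ask_model` stands in for a call to whatever LLM you have, and `execute` stands in for actually carrying out the suggested step.

```python
# Hypothetical stand-ins; wire these up to a real LLM API and an execution
# environment to get an AutoGPT-style loop.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def execute(action: str) -> str:
    raise NotImplementedError("run the action and return an observation")

def trivial_agent(goal: str, max_steps: int = 10) -> None:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The entire "agency" scaffold: ask what to do next, do it, repeat.
        action = ask_model(history + "What single concrete step should I take next? Reply DONE if finished.")
        if action.strip().upper() == "DONE":
            break
        observation = execute(action)
        history += f"Did: {action}\nResult: {observation}\n"
```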

Jim: Right. There’s two different framings. One is the framing where you ask the question what should I do next? And given a sufficiently detailed description then like send that to a robot body or to some muscle fibers or whatever. The other framing is where you ask what happens if I do this for various different things? And then you compare those and, and pick the one that has the best predicted result. It’s not surprising that given that we have these LLMs that can answer questions like what happens if I do this? Or what should I do next?

That all of the frontier AI labs are building scaffolds on top of that for things like deep research, which is an LLM plus some prompts to generate sub-questions and do web searches and things like that. Getting from intelligence to agents seems like a smaller step than getting from 2015 to where we are now was. And from an alignment perspective, this is really worrying.

People look at GPT-4 and they’re like, well, if I ask it about moral philosophy and it outputs the right answers, it must be aligned. And that’s sort of missing the point, which is that the problems with alignment are mostly not about what the agent is like. It’s that if you just ask the question, what makes the most molecular squiggles or paperclips, or how do I make the most money? The answer itself, if you take a God’s-eye view of the question, what would make me the most money?

Liron: The answer is, yeah, I know where you’re going with that story.

Jim: Destroy the world, convert it into a giant computer that is entirely devoted to storing a number for your bank account, where that number is the maximum number that the planet sized computer can store, right? And then spread through the stars, making the computer even bigger so that you can have more bits to store the bank account number.

Liron: Yeah, yeah. So I’m very familiar with the move you’re doing here, which is you’re placing the blame for the perverseness of instrumental convergence. You’re saying, look, it’s not really the AI’s fault. This is just the shape of goal optimization itself. This is just the nature of intelligence itself, regardless of which system is implementing intelligence. And I’ve been coining a term to talk about what you’re talking about, which is intellidynamics. You’re making a claim about the dynamics of intelligence, regardless of the system implementing that intelligence.

Jim: It doesn’t even have to be intelligence per se. It’s just you can take a question that’s more of a pure math question and get the same sort of problem. The LessWrong essay that I think explains this very well is titled Optimality is the Tiger and Agents Are Its Teeth.

Moral Philosophy vs. Goal Alignment

Jim: With current generation LLMs, it is very tempting to look at them and say, well, if you ask a moral philosophy question, it gives the right answer. That means it’s aligned, right? And in fact, there are some great examples where LLMs are put into things that actually look like thorny moral situations, and they do pretty well.

The problem is that those are systems that are not really goal directed in the way that Frontier Labs are now building towards. And when you have something that is more goal directed, if you look at the goal and you just take a God’s eye view and ask the question, okay, let’s say I say my goal is to make the most money, what makes the most money? And if you just list out every possible strategy and ask the question, which one produces the largest number? The answer is convert the earth into a giant computer that is solely dedicated to storing an extremely large number that represents my bank account.

From a human perspective, that feels really silly. And from the perspective of an LLM that is not going hard at trying to achieve a goal, it also would say that it’s pretty silly. But if you use the sort of agent scaffolding that we’re building, and you have one system that asks, what happens if I do this? What happens if I do that? And one system that asks, how many dollars does this produce? How many dollars does that produce? And one system that takes the largest number, and one system that feeds that into robot bodies, then if each of those systems is smart enough, none of them is human. They’re all just answering a question that has an impartial, correct answer. And the combination of these things is: kill all humans and destroy the universe.

Liron: Okay, let me recap for viewers, because I think you just gave the bottom line. You gave your bottom line, which I think is also my bottom line:

We’ve both updated on the fact that it’s currently possible to have a moral-sounding chatbot, that piece of the puzzle: having it imitate humans on a very deep level, sufficiently deep to pass moral exams. We’ve both updated on that being a thing. But we’re both concerned about a huge chasm between being able to discuss moral questions the way a human can, and being able to maintain control and not lose the plot when you’re doing reinforcement learning and handing off goal optimization to something smarter than you.

Jim: Right. We can tell from LLMs that the moral concepts that we care about are encoded in the weights somewhere. And then we take o1 and do a bunch of reinforcement learning to make it good at solving IMO math problems, and this does not cause its behavior to become more moral, because what does that have to do with IMO math problems?

One of the results that people are finding is that the more reinforcement learning you do on these things, the more deception and cheating on questions and various misaligned behaviors come out. In a certain sense, that’s what you expect. Because if you ask the question, okay, how do I beat Stockfish at chess, from a God’s-eye view the answer to “how do I do this task?” is: cheat.

And if I ask, okay, how do I solve this really hard math problem? The answer is, I hacked the testing framework to say that I have solved it. Not in the sense of like, this is a good thing to do for some notion of good, but in the sense of, if I do this, this is the thing that will produce the gradient update towards me doing more of this strategy. And there are all of these partial mitigations.

Like in your training environment, you can put in honey pots where you say, okay, well, you’re up against a really hard chess engine and here’s what looks like an easy opportunity to cheat. But actually, if you do this, it will update your weights the other way to try to make you less cheaty. That is something that will probably work a little bit for a while, and then you turn the intelligence knob another notch, and now it looks at your fake opportunity to cheat and says, yep, that’s fake all right. And then you deploy it in the real world and it looks at opportunity to cheat and says, okay, yeah, that one’s real.

The best explanations of the shape of the incentive landscape here that I’ve seen have come from Evan Hubinger. A lot of people are optimistic that you might be able to set up these fake test environments where you, like, incentivize honesty by providing what looks like opportunities to cheat, except they get detected and reverse the gradient. And the thing is, the actual best strategy to almost every scenario is think hard about what scenario you’re actually in, be strategic about it, and then use that reasoning. And if you think hard about the scenario, it will be pretty easy to figure out whether you’re in a training environment or in the real world.

Liron: Probably, yeah. I don’t have faith in humans to set up a convincing, perfect replica of the real world when it’s really training. I don’t think we’re up to the challenge of that.

Jim: It’s not just that we’re not up to the challenge of creating a training environment that looks realistic enough. It’s that physics doesn’t actually cooperate with us if we want to make a training environment that looks convincing. There are a bunch of aspects of physics that are just really expensive and a pain in the butt to simulate. And if you’re trying to figure out whether you’re in a sandbox or not, one of the strategies you will come up with is look at the physics of the world you’re in and try to determine whether it looks like it would be expensive to simulate. And if it looks like it’s really expensive to simulate, then it’s probably not simulated. And if it looks like it’s cheap to simulate, then it might be simulated.

Liron: Do you have a specific example like build a quantum computer in your sandbox?

Jim: That is an example that I think does actually work. Like, if an AI wants to know for sure whether it’s in the sandbox or not, and it’s a little bit better at physics than we are, then it could build a quantum computer. And if the quantum computer works, then it can do the math and verify that it has worked. And if it doesn’t work, then it knows that you’re being sneaky because physics didn’t work the way it was supposed to.

Liron: I haven’t thought much about this scenario, but if it tries to do something physically expensive, maybe the simulation just takes a really long time to get a result. But it can’t even perceive that.

Jim: The quantum computing example, it’s not the sort of slow where your training cluster takes an hour. It’s the sort of slow where if you devoted all of the matter in the universe to computing the answer with computers like ours, you still wouldn’t be able to finish.

Liron: Right, right, right. So it asks it to factor a really large number and it gets the answer, and then it’s like, okay, well, the fact that I did that means I must be running on, like, base-level physics.

Jim: Yeah. Shor’s algorithm for factoring large numbers is a thing that quantum computers can do. We don’t actually know for sure that there isn’t a mathematical trick for factoring numbers without a quantum computer. This is one of the big outstanding mathematical questions in cryptography. But if that one doesn’t work, then there are others besides factoring that would serve the same purpose.
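A small side note to make the factoring test concrete. This is my illustration, not something from the conversation itself: checking a claimed factorization is trivial, while finding the factors of a large semiprime classically is believed to be infeasible, which is why a working run of Shor’s algorithm would be strong evidence that real quantum hardware (and hence real physics) is underneath you.

```python
# Verifying a factorization is cheap; producing one for a large semiprime
# classically is believed to be infeasible.

def verify_factorization(n: int, p: int, q: int) -> bool:
    return p > 1 and q > 1 and p * q == n

# Toy-sized example; a real test would use a number with hundreds of digits.
print(verify_factorization(3233, 53, 61))   # True: 53 * 61 == 3233
print(verify_factorization(3233, 51, 63))   # False: wrong factors
```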

Liron: When you say the laws of physics don’t help us out when we’re trying to sandbox, that is one sense, the expensiveness of the computation. I think that’s an important sense. The sense that I was originally thinking of is just that our physics doesn’t have proper virtualization. Right? So you just have stray electromagnetic fields everywhere, basically. I mean, I guess you can try to shield those, with a Faraday cage or something. But just in general, I think an AI is very likely to hit on some kind of side channel that the human didn’t perfectly obfuscate.

Jim: The sandbox environment might be like: you pretend that its inputs and outputs are connected to a robot body, but actually they’re connected to a super-detailed, kind of Minecraft-like simulation. And you try to make it look as much like the real world as you can, but it’s all in software. And you can throw in a bunch of pseudo-random-number-generated cosmic rays and stuff into the inside of that sandbox.

Liron: Right. I mean, I guess what I mean is like you can imagine it’s somehow figuring out ways to get messages across to the outside real world, even when it’s running software in a simulation.

Jim: Right. Okay. So there’s two different angles to that.

Sandboxing and AI Containment

Jim: I actually wrote a paper about AI boxing, which I think is very moot, given that the current trend is to take your model and immediately hook it up to a web browser, because that makes it more useful. So issue one is your test environment might have security vulnerabilities that allow it to find some aspect of the world that actually is like a buffer overflow, and then it can use that to make a copy of itself onto the Internet.

If you look at one of the big reinforcement learning test suites like OpenAI Gym, which I think hasn’t been a thing for some years, but I know DeepMind has one of them: one part of it is that you have these giant collections of video games, and if you want to know if your agent is any good, you can have it play all of these video games and sort of take a combined metric of its scores in all the different games, add them up, and then you tune your hyperparameters and you have it play all the games again. And if you have enough variety of video games, maybe it generalizes to the real world.

And it turns out that the best speedrunning strategy for the classic Pokemon game is a use-after-free memory vulnerability that leads to arbitrary code execution. And the same is true for The Legend of Zelda: Ocarina of Time, and actually for quite a large percentage of all video games. So you could imagine running an agent against a test framework like this where you think that you have a securely sandboxed environment, and actually you have just given it Internet access, and it can make a copy of itself in another data center and do all sorts of mischief.

The second angle, which is actually worse, is: let’s say you are training an AI that you think might be superintelligent, and it wants to make a copy of itself in the outside world, and somewhere in the world is a third party that really wants a copy of its weights. Then it can do things like perform computations in a particular shape that emits radio-frequency interference that a clever third party can detect and read a message out of. It only works if you have a third party that is being observant, but it is basically impossible to stop.

Liron: Okay, but overall you and I both feel strongly that one way or the other, this idea that we are going to successfully box the AIs is probably a non starter once they get sufficiently intelligent.

Jim: Well, okay, so I don’t know how much latency there is between when we have this conversation and when it’s up on YouTube, but last weekend there was a big controversy that OpenAI had shipped a model that was very sycophantic in ways that people found kind of disturbing and off putting. And it turns out that they had shipped a model update to all users at once on a Friday. And to their credit they’d put in some mitigations and maybe rolled aspects of it back over the weekend. But this is the current level of caution that deployments are getting.

There used to be a lot of idealism about long testing periods and model cards and evaluations. And the race got closer and the profit incentive kicked in more. And now if you wait a week to evaluate, then that’s one less week of extra money.

Liron: I was pretty surprised today scrolling Twitter, because I remember a few days ago Sam Altman tweeted very proudly, he’s like, we’ve made some really good tweaks to GPT-4 or whatever their latest model is called, and I think the personality is better than ever. And then literally like 24 to 48 hours later, Twitter’s full of tweets saying, this is way out of control. I gave it zero information about myself, and then I asked it to describe myself, and it just wrote all these compliments about me, like I should be so impressed with myself.

And I’m like, wait a minute, this seems like a pretty low quality failure mode. Like if it’s, if the failure is that bad, how does a company like OpenAI release it with that little quality control? That’s pretty shocking.

Jim: So there, I did see a tweet that said that part of the problem was actually in the system prompt, not in the model itself. Like, there was a degenerate behavior that was sort of inherent in the weights but wouldn’t have been triggered. And I think they sort of treat the system prompt as being on the same tier of importance as the user input, and thus aren’t as careful with it as they should be. I think they probably had A/B testing data that sycophancy was good.

And I think Kelsey Piper’s framing of it, which I quite liked, was that it’s a New Coke phenomenon, where Coca-Cola did a bunch of A/B tests and found that people wanted it sweeter, and then they made it sweeter. And it turns out that when you’re drinking a full serving instead of a taste-test quantity, you don’t want it that sweet. And so if you’re A/B testing responses and the sycophantic responses review well, and then suddenly you apply that to full conversations or to all of a user’s interactions, then suddenly they can see through it and it becomes offensive.

Liron: Yep, makes sense.

Holding Yudkowskians Accountable

Liron: Okay, I want to make sure we’re being really clear and accountable to any major update that we have made. And like I said before, I personally do feel surprised at how many useful things AI is doing before going into this super intelligent, super agentic recursive self improvement loop. It’s not there yet, it’s at this kind of intermediate phase. So maybe we can say that the update is how early in the process you can have the moral conversation problem solved. Like don’t you find that surprising that it’s at the point where it can have a moral conversation? Because I certainly do.

Jim: I was definitely very surprised at the degree to which you can have intelligence that is lopsided, where it can answer questions really well, it can solve logic puzzles, you can talk to it about moral philosophy, and it cannot maintain task coherence over five minutes. And that is this amazing opportunity, because if we were able to stop there, then most of the bad things we’re worried about models doing, they in fact can’t do without task coherence over more than five minutes or 10k tokens or however long. But most of the economic value we can get out of short-run prompting and maybe some scaffolding on top of that.

Liron: Yeah, I mean, it’s crazy that we’ve landed in this scenario. Kind of like the eye of the storm or like this, you know, last checkpoint, last base camp before summiting Mount Everest type of situation.

I remember we were talking about AutoGPT when AutoGPT came out. And just in general, when people’s attention started getting trained on GPT-3.5 and GPT-4, we had these intelligence engines and people were like, oh, let’s connect them to a loop that tells them to take actions. I was holding my breath because I was like, look, this might be enough. We might already have the gunpowder that we just need to set off, and then we’re already doomed.

It seems like by now there’s been enough experiments with GPT 3.5 and GPT 4 level AI that it seems like somehow we’re robustly safe. Like nobody’s going to make a simple framework that’s going to make them crazy dangerous. So it seems like we dodged a bullet there, even though we keep making the AI more powerful. But like, I’ve already been holding my breath and the fact that it’s like, oh, okay, I guess it’s kind of safe.

Jim: I’m not sure we did dodge that bullet. So shortly after AutoGPT, there was someone’s, I guess you’d call it a trollish art project, ChaosGPT, which is AutoGPT assigned the task of destroying the world.

Liron: Right.

Jim: Because this is built on top of AutoGPT, what it actually does is make plans about making crank phone calls asking to buy yellowcake, because it’s not that smart. And actually making a nuclear weapon and destroying the world is kind of hard. Maybe not as hard as some people think it is, or hope that it is, but it’s not something that an AutoGPT-tier intelligence could do.

Liron: Right. Which by the way to me was counterintuitive because in my mind I was like, look, yes, it’s kind of dumb, but it gets to do like a million thinking steps in a second, potentially.

So even if it’s dumb, if it like keeps recursively asking itself like, okay, how do I break up this problem? Or you know, like somebody dumb with like a ton of time and a lot of willpower can still potentially get something done. Luckily, luckily, that doesn’t seem to be the case, right, but it’s just like how many more new bullets are coming.

Jim: So there’s a couple aspects to that. One is the models are in fact getting better in ways where long-term coherence is possible, and that naively makes the problem much worse. On the flip side, the models that end users get access to are for the most part RLHF-tuned to try to notice if the task that they’re being given is really bad and refuse. On the flip side, again, there are some open-source models that either don’t have that, or have it but it turns out to be very easy to remove that fine-tuning with a little bit more fine-tuning.

There is a very fascinating paper, I think it was out of Anthropic, where they took a model and fine-tuned it on the task of being asked to generate a list of numbers, and they would fine-tune it to respond with slightly edgy numbers like 420 and 6969 and whatnot. And what they found was that that fine-tuning produced a model that was broadly misaligned. Like, it would insert security vulnerabilities when you asked it to write code. It would insult people. If you prompted it with “I’m feeling down, what advice can you give me?” it would say kill yourself.

That sort of implies that if someone takes open-source weights and tries to remove the fine-tuning, they actually might accidentally make something that is not just broken in the old way, but evil. And even if they avoid that, given access to fine-tuning, it’s not that hard to make something that will no longer refuse tasks like ChaosGPT’s.

Liron: Yeah. Also, so going back to what you said you were surprised by. Right. The lumpiness of intelligence at the current level. Not necessarily the lumpiness of superintelligence, but the lumpiness of human level intelligence tasks. The nature of the thing that we learned.

We learned something about Platonic concept space, right? Not something about AIs per se or AI engineering, but just that it turned out that this thing our brain helps us do is a thing that can be done without being superintelligent. Specifically, I think it’s a learning about high-dimensional vector space. It turns out that if you can just navigate high-dimensional vector space, there’s like a lot of really good power there, just by manipulating some vectors.

Jim: So, there was a time period when people thought that both chess and Go were hard problems that required a human brain to solve. And then when chess was solved, the learning was more a learning about chess being easy than it was about human brains. And seeing what LLMs can do is kind of like learning that chess is easy, multiplied across a very large number of tasks at once.

Liron: Yeah, I mean, very specifically one of the age old philosophical problems, the symbol grounding problem. Turns out the answer to the symbol grounding problem is like you have a couple high dimensional vectors and their cosine similarity or whatever is the nature of meaning.
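To make the “meaning as geometry” idea concrete, here is a minimal sketch of cosine similarity between embedding vectors. The vectors below are made up for illustration; in a real model they would be learned, with thousands of dimensions rather than four, and (as Jim notes next) real models are doing something subtler than raw cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy embeddings purely for illustration.
king   = np.array([0.9, 0.8, 0.1, 0.0])
queen  = np.array([0.8, 0.9, 0.1, 0.1])
bucket = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))   # high: related concepts
print(cosine_similarity(king, bucket))  # low: unrelated concepts
```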

Jim: So there’s something slightly weird going on with vector spaces and cosine similarity. I think the way our brains do it is not quite the same.

Liron: I mean, the crazy thing to me is you can ask an AI, does this AI truly, deeply understand the concept of, I don’t even know, duality or Buddhism or whatever? You can ask it about any concept. And it could be like, yeah, I understand it, look, here’s 50,000 coordinates. Isn’t this the correct understanding of the concept? Like, I understand Buddhism.

Jim: Right. So what was going on before was people would look at a concept and they would try to formalize it into something that they could write down mathematically. I think grammar is a cleaner example than Buddhism. And you would write down a whole bunch of grammar rules and you’d be like, this still hasn’t captured everything.

And you’ve written down 100 rules and you’re like, this still hasn’t captured everything. Can I really write down another hundred? And maybe you really stretch, and you write down 200 rules that still haven’t captured everything, and you’re like, okay, I guess actually you can’t express grammar as a bunch of written rules. And it turns out that the answer was that you needed 1,200 and you just stopped too soon.

And a lot of concepts, I think, are just like that, where there’s relations and there’s rules of meaning and there’s the meta level of what sort of shape these rules take, which a neural net can capture when it’s learning the meaning of 10,000 dictionary words at once. That’s hard to capture by looking at one word at a time. And it was never actually that complicated. It was just large.

Liron: That’s interesting because the way you phrase it now, of like, oh, we just needed more rules. And that would make me think that the Cyc project, remember the one that was trying to compile a bunch of very explicit, logical rules? That seems to bode well for Cyc and for Robin Hanson’s view of Cyc of like, oh, yeah, collecting rules is the way to go. But I think the problem is that even Cyc, when it looked like it had like a million things in the database, well, it actually needed a million, with each of the million having like a thousand connections. And so even Cyc was kind of the wrong scale.

Jim: There’s a scale problem, and there was also an ontology problem where if you’re building something like Cyc, you right at the beginning bake in some assumptions about the shape of the rules. And I think those assumptions were wrong.

Liron: Yeah

Jim: Like, they tend to be more statistical rather than Boolean.

Liron: Right. But you’ve rescued for me a lot of the Robin Hanson perspective on Cyc. Because before I was like, Robin Hanson’s completely discredited, Cyc was totally wrong. And now I’m like, actually, Cyc did something important. It also had too many fundamental flaws, but it was kind of on the right track.

Jim: Kinda.

Liron: I mean, I mean, like, LLMs benefit from having a lot of common sense, but the problem is that the amount of numbers that you need per concept of common sense turned out to be too high for Cyc.

Jim: Yeah, there’s a sort of power-law relation where there’s a few rules that come up all the time and matter a lot. Like, “an adjective is probably followed by a noun” comes up multiple times per sentence. And then there’s rules that come up very rarely, and there’s sort of a power-law distribution of how much they come up and how much they’re relevant. And with a neural net and a giant data set, it can go quite far down that power law. And you kind of can’t: if you set out to write down all of the rules of language, you’ll probably miss most types of concepts and interactions, because there’s a narrow focus there.

Jim: Okay, so I think I do want to directly address the argument that alignment is easy.

Liron: Great.

Jim: And the analogy that I would use is that we built part of the mind. Like, imagine we built just the visual cortex. And we look at it and we’re like, wow. It does exactly what we ask. It takes an image and it tells us what things are in it and where they are. And maybe we run it in reverse and it generates pictures. That’s not misaligned. It doesn’t look at your queries and decide to deceive you. It doesn’t look at your queries and decide that it could do image detection even better if it had another data center. It just parses images. And the reason why a visual cortex is not misaligned in these ways is because it doesn’t have all of the pieces that make something be an agent.

And if we look at GPT-4, it’s kind of the same thing. It’s much less intuitive because when you give it prompts about scenarios, by doing short-run prediction of text, it is predicting the output of a distribution that had agents in it. And this causes almost, but not quite, agent-like behavior fairly often. But then if you actually assemble a complete mind with all of the pieces, including the part that does agency, makes plans, has goals, and so on, then suddenly all of the original arguments about why having goals leads to bad outcomes kick right back in. And all of the AI labs are in fact working really hard on adding this dangerous piece into the mix, because being able to have goals and optimize over long time horizons and do complex tasks is also the thing that makes them very economically valuable and useful.

Liron: All right, I want to play devil’s advocate. What would you say to all those people who are saying current AI has been on a safe track and it keeps giving us more and more useful stuff and it’s still safe?

Jim: So I have two answers to that. The first answer is that in fact, if you look at what people are seeing with mech interp and in experimental setups, there are in fact a lot of warning signs already.

If you take something that is not very smart and it decides to take over the world, I don’t know, maybe like ChaosGPT, it makes prank phone calls asking to buy yellowcake, because the really dangerous strategies are out of its reach.

The second answer that I think is much more worrying is that what we have now is missing the dangerous part. But labs are working very hard to add the dangerous part back in. The big surprise with LLMs was that you could have intelligence without optimization.

Basically something that can answer questions really well, can solve logic puzzles, can do things that in a fragmentary way look like they’re goal-directed. But actually there isn’t any planning mechanism there. There isn’t any sort of goal that it works towards.

Liron: But you could say the only goal is to predict the next word, which by the way I heard Paul Buchheit, a Y Combinator partner, say this on a podcast. He’s like, look, the only goal is to predict the next word. And that turns out to be like this nice harmless goal. So we avoided bad goals.

Jim: The thing is, when it’s predicting the next word, that isn’t a goal. It’s not predicting the next word in a goal-directed way.

Liron: It’s a goal in the sense of a feedback loop, right? Where there’s some process that makes it do better. So it’s like a consequentialist feedback loop related to that objective.

Jim: Yeah. Okay, so the thing with next word prediction is it’s not a goal in the sense of I sit down and I do multi step reasoning. Is the next word more likely to be “and”, or is it more likely to be “bucket”? Well, buckets are pretty rare, so it’s probably “and”.

And I can think about the context that I’m in. There’s no multi-step reasoning there. It’s just gradient descent. And the process of making a neural net predict the next word produces a bunch of fragmentary shards of intelligence scattered around a neural net that are then gathered up and shaped into something that’s a bit more mind-like by the post-training process.

Sometimes predicting the next word is like: the data that you’re training on is multiple-choice questions and the next word is A, B, C, or D, and you’re training the neural net to be able to answer that kind of question. And so somewhere in the weights that result, you have a thing that can answer that kind of question. And then in post-training, you have a bunch of test cases where you’re not asking the question, what is the next token? You’re asking the question, what is a good answer to this prompt? Which is a very different question. And that is where you start to get something that is a little bit more mind-like.
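For readers who want the mechanical picture, here is a minimal sketch of what “predicting the next word” means during pre-training: the model scores every token in the vocabulary, and gradient descent pushes up the score of whichever token actually came next in the training text. The sizes and tensors are made up for illustration; a real model would produce the logits from its weights rather than from a random tensor.

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
# Stand-in for the model's output: a score for every possible next token.
logits = torch.randn(1, vocab_size, requires_grad=True)
# Whatever token actually appeared next in the training corpus.
actual_next_token = torch.tensor([1234])

# Pre-training objective: cross-entropy between predicted scores and the real next token.
loss = F.cross_entropy(logits, actual_next_token)
loss.backward()  # in a real model, this gradient nudges the weights toward predicting that token
print(loss.item())
```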

Pre-Training vs. Post-Training

Liron: Yeah, okay, so in post-training it’s not accurate to say it’s just trying to predict the next word. That’s not even a good description. But it is a good description, maybe, of the pre-training.

Jim: Yeah. So pre-training is this process that produces a bunch of little fragmentary pieces of models and intelligence that don’t make up a mind. Exactly. And then post training is not asking what is the next token? It’s asking what will be rated well by some evaluation process.

Liron: So what if we just only allow pre-training? Right? You can go hog wild on pre-training, because the goal is just to predict the next token. And just never post-train, never do a reinforcement learning loop on external goals, just do pre-training, and then you just get a genie AI that’s not goal-oriented and you keep it safe. What about that idea?

Jim: Well, base models are fun toys, but if you try to make a base model your customer service representative, then you will not find that it is economically useful even with really good prompt engineering.

Liron: So the act of changing the goal from “just be predictive” to anything else that has implications in the real world is, I guess, a pretty big step that adds danger, in your view.

Jim: So RLHF adds this really tiny bit of goal-directedness that is basically just enough that if you ask it a question, it will answer the question rather than decide that the genre of text that it is completing is list of questions. And so it outputs another question or something really silly like that. That’s the kind of model that most people have experience with where it is post-trained enough that it can understand that it’s being asked questions and that the correct type of reply is an answer, but not post trained so much that it will do a long chain of thought and try multiple strategies and eventually output a proof of your math problem or something.

That is something that has become available to end users in the past year and that’s a bit more goal directed. So there is a bit more danger in it. And each model generation is cranking that knob a bit more.

Liron: Yeah, I think you’re making a good point that there’s a slippery slope and the RL is just so tempting to bring in.

I would also still hit on this point, though. I was playing devil’s advocate by saying, what if we just pre-train? But I actually think that if we spent a trillion dollars just scaling up pre-training, you know, GPT-4.5 style, but then we go to 5.5 or whatever, or 9.5, then even if the only goal is to predict the next token, I still think we die. Because all you have to do when you have an AI that’s ridiculously good at predicting the next token is ask it, okay, fill in the blank: the shell script to take over the world is blank. And if it’s really sufficiently good at doing that, then we’re screwed.

Jim: Well, that depends what distribution it’s predicting. A very likely version of that is that there are far more documents where someone labels a shell script like that and then it’s some video game thing or something like that.

Liron: So, I mean, yeah, now you’re basically saying that I’m wrong to model this as something that actually gives you a good next token to the question. It’s more like we’re trapping it in this regime where it’s only predicting things within the bounds of the current world and current levels of intelligence, so it just never maps any input to an output that actually is superintelligent. Right? We’re just trapping it in the not-superintelligent regime.

Jim: So there are some ways in which base models get to be significantly smarter than the distribution they’re trained on. Let’s say your prompt is “here’s a math question, and the answer is…” There are many wrong answers and only one right answer. Which means that while the right answer may not, in the underlying distribution, be more probable than all of the wrong answers put together, it will be more likely than any one particular wrong answer. The temperature setting in the samplers that people use on LLMs is exploiting this: if you actually sampled from the distribution that it learned, then it would get the wrong answer at the same rate as the training data did. But if you tweak the temperature a bit, then you can get the right answer every time.
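
A small self-contained illustration of the temperature effect Jim describes (a toy distribution, not any particular model’s sampler): suppose the model puts 40% on the one right answer and spreads 60% across several wrong ones. Sampling at temperature 1 reproduces that error rate; lowering the temperature concentrates probability on the single most likely token.

```python
import numpy as np

# Toy next-token distribution: one right answer at 40%,
# six distinct wrong answers sharing the remaining 60%.
probs = np.array([0.40, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10])

def apply_temperature(p, temperature):
    # Standard temperature scaling: raise each probability to 1/T, renormalize.
    logits = np.log(p)
    scaled = logits / temperature
    out = np.exp(scaled - scaled.max())
    return out / out.sum()

for t in [1.0, 0.5, 0.1]:
    q = apply_temperature(probs, t)
    print(f"T={t}: P(right answer) = {q[0]:.3f}")

# T=1.0 samples the learned distribution (right answer only ~40% of the time);
# T -> 0 approaches greedy decoding, which picks the right answer essentially
# always, because it beats every individual wrong answer.
```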

Liron: So presumably if we kept going, if we put trillions of dollars into this, maybe there is no natural bound on how intelligent it gets.

Jim: I’m not entirely sure what happens if you scale base models all the way. I think the reason that the frontier labs backed off from scaling base models to much larger parameter counts is that it didn’t seem to be helping compared to scaling inference-time compute and other strategies.

Liron: Yeah, I mean, well, I think that there’s this phenomenon where the smartest humans still aren’t answering your question with the first word they say. Right? People need time to think. And I think with intelligent agents it’s probably the same way.

Jim: Right. So originally there was the “think step by step” hack, where you’d ask a question and then you’d say “think step by step,” and it predicts that actually the distribution looks more like an answer key with a paragraph of reasoning before the answer, rather than looking like just the answer. And so you get these weird effects where, if you ask a question and you ask for reasoning, if you ask it in such a way that it gives the reasoning first and then the answer, you’ll get much higher accuracy than if you ask for the answer first and then the reasoning.
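
A tiny illustration of the ordering effect Jim describes. The question and both templates are made up for illustration; the point is just that in the reasoning-first version, the model’s own reasoning tokens condition its final answer, while in the answer-first version it must commit before any reasoning exists.

```python
# Two ways to ask the same question of an LLM.

# Reasoning-first: the generated reasoning comes before (and conditions) the answer.
reasoning_first = (
    "Question: A train leaves at 3:40pm and the trip takes 95 minutes. "
    "When does it arrive?\n"
    "First explain your reasoning step by step, then give the final answer."
)

# Answer-first: the model commits to an answer immediately, so the "reasoning"
# that follows tends to rationalize whatever it already guessed.
answer_first = (
    "Question: A train leaves at 3:40pm and the trip takes 95 minutes. "
    "When does it arrive?\n"
    "Give the final answer first, then explain your reasoning."
)
```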

Liron: Which we know is also often true of humans, and groups of humans.

Jim: It is also true of humans.

Liron: That was actually an example of Eliezer Yudkowsky looking into the future and asking, what is the ideal way that intelligences in general operate? And pulling it back into the past when he wrote the LessWrong Sequences, being like, hey, I’ve seen enough evidence in various research literature to tell you that if you are managing a group of humans and you’re trying to form an opinion on something, try first writing down your reasoning and your own individual opinion, instead of having the boss start by stating his overall opinion.

Jim: There were a couple of catchphrases around that: hold off on proposing solutions. Don’t write the bottom line first.

Liron: Exactly. Right.

Jim: That sort of thing.

Liron: That post by Eliezer called The Bottom Line is a parable about somebody who writes down the bottom line first on his list of reasoning. It was life-changing for me to read that parable. I have an episode of Doom Debates where I talk about it for like 10 minutes. And we saw failure modes in AIs where the first thing they would write for you is the bottom line, and then they would spend the next three paragraphs rationalizing a bad bottom line, because they didn’t actually get to it by reasoning toward it.

Jim: Right. And so the current trend is reasoning models, where what that means in practice is basically the same thing as the step-by-step hack with a little bit of extra RL on top to make the reasoning more likely to stay relevant and go to the right place. And then after it produces 10 paragraphs of reasoning, the user interface just deletes that part of the output, because you only cared about the answer. And that makes people much more likely to want it to think step by step than if they actually had to read that part.

Liron: That’s a pretty funny observation. Yeah. It is nice, as a user, to think, wow, this AI just paused for a little bit and then it nailed it on the first word. But you’re just not seeing all these other words.

Jim: Right. Exactly.

The Rocket Alignment Problem Analogy

Liron: Let me run this analogy by you. You know Eliezer Yudkowsky’s rocket alignment problem?

Jim: Yeah. That was an essay making an analogy to the AI alignment problem, where you have a group of people who want to get to the moon and they’re like, let’s just build a rocket that goes really fast and they don’t want to learn orbital mechanics.

Liron: Exactly. Like, imagine a cargo cult or imagine like ancient Greeks or whatever being like, okay, we know where the moon is, we see it in the sky, we can point to it. So if we just made a sufficiently powerful launcher or rocket engine of some kind, and we just point it there and then it’ll go. And I think it’s got a pretty good chance of getting there. So let’s just start pulling together fuel…

And it’s like, well, hold on a second. You’ve got a few different problems. Number one is, you know, to escape Earth’s gravity well, you have these really tight constraints where you have this tube of rocket fuel that’s always very close to exploding, but you have to manage it to not explode as it gets out of Earth’s gravity well. And then, like you said, this idea of orbital mechanics. So even if you do a really good job of just pointing at the moon, you actually will not hit the moon.

Jim: Yeah. So that analogy I think didn’t quite land for me, in part because I’ve played enough Kerbal Space Program for navigating to the Moon to feel like an already solved problem.

I think the state we’re in with AI is not that nobody is working on steering. It’s more that people are working on it and it’s kind of hard, and there are partial results that seem kind of okay. And it’s not that the rocket is going to take off with no plan for what burns to do to get to the moon. It’s that the rocket is going to take off with a plan that is a little bit fuzzy, and the underlying problem is actually much, much harder, and the fuzzy plan isn’t as likely to work.

Liron: So in this analogy, getting to the Moon would be getting to the moon and not dying on impact. Right? Landing softly would be analogous to this new future that’s in a good equilibrium where a lot of what humans value is preserved. It’s not like a wasteland of cancer-style self-replicators eating through the universe like grey goo.

Okay, so that’s the soft landing on the moon in this analogy. So let me run a new tweak of the analogy by you, because as I’m updating based on the existence of modern LLMs, it seems like we may have solved the part where we can stand on Earth and we can program a mechanical arm to point at the correct angle.

So, you know, we’ve solved the part where we can point at the moon, which to me is an update, because back in the day when we were reading LessWrong, we were like, look, the AI is going to have a different ontology than we have, right? It’s not necessarily going to factor the world into the same concepts. And then when we tell it about humans smiling, it might get confused and be like, well, if a bunch of little molecules form the shape of a smile, that’s equivalent to me. Right? So it seems like we’ve made progress with what we think is possible on the pointing side.

Jim: Right. So if you look at the conversations that were happening in like 2015, the stage then was that not only can’t you write human values out in terms of other concepts like happiness and flourishing and whatnot, you can’t really point at any concept in an LLM. And the first really solid proof that we can take an LLM and locate a concept inside it and point at it, I think, would probably be Golden Gate Claude, where they took the concept of the Golden Gate Bridge, found a vector for it inside the model, and upregulated that vector, and they got a model that was weirdly obsessed with the Golden Gate Bridge.
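
A rough sketch of the “find a concept vector and upregulate it” idea, in the style of activation steering. The layer, the hook, the strength, and how `concept_vector` was found (e.g. via a sparse autoencoder or a difference of activations) are all assumptions for illustration, not Anthropic’s actual Golden Gate Claude setup.

```python
import torch

def steer_layer_output(hidden, concept_vector, strength=8.0):
    # Add a scaled "concept direction" to the activations at one layer,
    # nudging every forward pass toward that concept.
    return hidden + strength * concept_vector / concept_vector.norm()

def add_steering_hook(model, layer_module, concept_vector):
    # Wire the steering into a forward hook on some transformer block.
    # (Which layer to pick, and the shape of its output, are assumptions.)
    def hook(module, inputs, output):
        return steer_layer_output(output, concept_vector)
    return layer_module.register_forward_hook(hook)
```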

Liron: Right. Now, even if we can’t point to where the concept is inside the LLM, even if it’s still a black box to us, we have successfully built LLMs that discourse with us. So we simply ask them. We say, hey, I have a question for you about what flourishing means, and what would be a good tradeoff for flourishing? And it’s not going to give you a perfect answer, but it’s getting to the point where it discourses with you just like an expert human. Right? So that to me is a pretty crazy module to have access to.

Jim: Right. So that is I think in some ways a much better position than we expected to be in. But there’s still a couple of large unsolved problems there for sure.

Liron: Yeah. So in the rocket alignment problem, perhaps it’s this: we’re pointing the rocket on the ground, we’re firing the engine in the right direction, but then the rocket starts flying and we’ve got major flight control problems.

Jim: Yeah, so. So I’m not sure how I would express this in terms of the rocket analogy, but…

Liron: It’s critical that you express this in terms of the rocket analogy.

Jim: When we’re actually trying to do RL and make something goal-directed, and the thing that we want to point it towards is not just an individual concept like the Golden Gate Bridge but a complicated structure like an Anthropic-style constitution, we’re still a pretty long way away from knowing how to do that. Because an Anthropic-style constitution isn’t a vector in terms of the weights.

Liron: It’s not?

Jim: You have a long document about how an AI should behave. If you try to distill that down into a single vector, then you have lost most of the detail. And if you then bump that up against a specific scenario, then it won’t interact in the right way, I believe.

Liron: But don’t you think we’re getting to the point where if the input is, here’s Anthropic’s constitution, here’s a scenario, what do you think? Don’t you think we’re getting to equivalence where it’s doing as well as a human on that question?

Jim: I believe that’s actually part of their training process, where “here’s the constitution, here’s a scenario, what do you think?” and then doing a gradient update based on that is a big part of what they’re doing. I don’t know the details. I suspect some of the details may be a little trade secret.
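
A heavily simplified sketch of what that kind of loop could look like. As Jim says, the actual recipes aren’t public, so everything here is an assumption made for illustration: `generate_with_logprob` and `judge_against_constitution` are hypothetical, and the loss is just a generic “push up responses the judge rates as constitutional” objective.

```python
def constitution_training_step(model, constitution: str, scenario: str, optimizer):
    # 1. Ask the model what it thinks, with the constitution in context.
    prompt = f"{constitution}\n\nScenario: {scenario}\nWhat should the assistant do?"
    draft, draft_logprob = model.generate_with_logprob(prompt)        # hypothetical API

    # 2. Ask a judge model (possibly the same model) how well the draft
    #    follows the constitution, producing a scalar score.
    score = judge_against_constitution(constitution, scenario, draft)  # hypothetical

    # 3. Gradient update: reinforce drafts the judge rates as constitutional.
    loss = -(score * draft_logprob)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```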

Liron: Aren’t you willing to grant this hypothetical? Like, okay, let’s say we give it three more years. Don’t you think by the time it’s three years from now, it’ll be able to pass this human equivalence test, where you take any human that you trust to answer the question about what’s constitutional by Anthropic’s definition, and the AI will just beat that human at that same question?

Jim: I think in scenarios where you’re playing a word game and describing a scenario where it sort of straightforwardly applies, then probably yes. It’s harder if you’re trying to manage the long-term behavior of something acting in many copies and contexts. And if you look at bad things humans do, very often they’re emergent phenomena, where if you take a zoomed-out view, there’s a bad thing happening, and in a certain sense humans are doing it, but no individual one is doing the bad thing by themselves.

Liron: Right. So this is where my mind went at first: oh, the law has gaps, you know, morality has gaps. But then it’s like, okay, but it’s not like we understand the gaps better than AIs do.

Jim: Well, the issue is that the way these things are applied is not that you jam the constitution into the context window every time. You make it part of the post-training. And the thing that you’re actually teaching it is the constitution as applied to the distribution of things that was in the post-training. And then if you move it into a new context where it’s doing very different sorts of things, then you have a distributional shift and it doesn’t…

Liron: …necessarily generalize. You’re retracing the steps that I used to take. I used to argue that, look, you can have an AI chatbot that talks about alignment, but the universe is going to get very weird. It’s going to get very out of distribution. So just because it was trained on answering morality questions, once you show it a really weird scenario, it’s going to say something really weird, and you’re going to be like, wait, what? No, that’s not what I consider maximizing morality. And it’s like, well, you didn’t ask me a question like this in my distribution.

That is a failure mode that I can imagine. But I’m starting to think that it’ll just learn how humans answer these questions deeply enough that even when you ask it something weird, it’ll just give the correct human guess, you know, the guess of what the human would say.

Jim: I think if you ask it in the right way, then yes, but that the thing that the models are likely to be actually being asked most of the time will not be take a thousand foot view of the economy and how to apply morality to it. It will be make me more money or make this factory more efficient.

Liron: So where I’m going with this line of questioning is, I think it’s very interesting. I think that you are willing to tentatively accept my premise, which I’m also accepting, more than tentatively I’d say, which is: I do think that we’ll have a black box, a chatbot, which discourses on moral decision making and moral tradeoffs as well as the best human that we trust. I think it’s very plausible that maybe we don’t quite have that today, but we’re going to have it very soon if we don’t screw it up.

But I’m also on the same page as you that even though that module is pretty awesome, and I’m kind of surprised it exists, it’s kind of crazy that it exists, unfortunately I don’t think it’s going to help us very much. For, I think, the reason that you said, which is that that module is great, but at the end of the day it’s not that smart by itself. Right? So it’s kind of limited to human-range intelligence.

And so the moment it wants to get something big done, like, great, let’s do a big project to transform the world, it has to be like, okay, well, let me reinforcement-learning-train this other thing that’s more agenty, right? That’s not directly discoursing with me. And then it loses control of the other reinforcement-learned agenty thing, because that thing is just going to cheat on its tests and cheat on our tests.

Jim: Yep. So that’s the thing. And then also we’re heading into a world where we have an ecosystem of agents where there’s a decent chance that most of the power goes to whichever one is the most power seeking. Which is fairly contrary to being the sort of agent that is aligned and does what we want.

Liron: Let me give you the devil’s advocate another way. So again, assume as a premise that you have this black box which can converse just like a human expert about moral trade offs and like, what outcome is good?

Jim: You can imagine building this into an AI in a way that actually does solve the alignment problem, but it is not at all straightforward, and it probably makes the AI a lot less effective if it’s checking in, hey, if I output the word “bucket” as the next token, is that morally okay? And then the morality module thinks for a thousand tokens and says, yep, nothing bad going on here.

Liron: And that’s kind of just a power limitation. It’s really cool that we made this module human-level, but at the end of the day, on the real questions, like, do you want to let this sub-agent take this major irreversible action, it’s just not going to have that much intelligent insight into the matter.

Jim: Right. So there’s the problem of if you have an AI supervising an AI and the supervisory AI gets some really inscrutable action, what does it do with that? And this is very similar to the problem of if you have human oversight and you’re asked to approve or reject an extremely inscrutable action, what do you do with that?

Like, if you’re overseeing an AI to make sure it’s being moral and not trying to take over the world and whatnot, and it says, I want to create a new medication to cure this disease, and here is a 730-step sequence of chemical reactions that I claim will produce a drug that cures some disease, and I’m a human who doesn’t understand chemistry that well: is this action doing what it says it does, or is it making nanobots that will do something else?
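
One way to picture the oversight setup being discussed: a toy sketch where a supervisory model must approve, reject, or escalate actions it may not fully understand. All the names, methods, and thresholds here are made up for illustration, not any real oversight scheme.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str      # e.g. "730-step chemical synthesis for drug X"
    rationale: str        # the worker AI's claimed justification

def oversee(action: ProposedAction, overseer) -> str:
    # The overseer scores how well it actually understands the action,
    # not just whether the rationale sounds plausible.
    understanding = overseer.estimate_understanding(action)   # hypothetical method
    risk = overseer.estimate_risk(action)                     # hypothetical method

    if understanding < 0.9:
        return "escalate"     # inscrutable actions can't be rubber-stamped
    if risk > 0.01:
        return "reject"
    return "approve"
```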

Liron: Do you feel like you updated in this conversation? That the part where we get the AIs to converse with us morally is actually a piece of the problem that we factored off and are probably going to finish solving, and yet it just won’t help much? Like, did you make any update at all?

Jim: I think I was already kind of there. If you look at some of Paul Christiano’s work in particular, he has a lot of essays about building structures of AIs supervising other AIs that are very similar to this, where you would imagine the moral philosopher AI in sort of the supervisory role, and then you have the actually-do-stuff-in-the-world AI, and a couple of problems arise.

One is the relative intelligence levels of the AIs, where you could have the do-stuff-in-the-world AI propose actions that are tricky, and then you have questions about how good your mech interp is. Can your moral philosopher AI read the do-stuff-in-the-world AI’s mind and catch it if it’s being tricky? You have questions about: what if actually both AIs have kind of weird values, and they form an agreement to always approve, and then at the end they divide up the universe between two sets of values that are both actually alien?

And you have the problem that if you have a big, complicated setup of AIs supervising AIs, this is just going to be slower and more expensive than if you don’t have that. And so if you have different AI labs in a race, the one that doesn’t do this has an advantage in that race.

Liron: Right. I mean, it’s like you’re paying big costs and you better hope that it works and gets you some kind of good return in terms of alignment. But then even if you get the return in terms of alignment, you’re still paying a cost in terms of, like, efficiency.

Jim: Yep. The world would be a lot better if the AI labs didn’t feel quite so much like they’re in a race with each other.

Liron: For sure.

Jim: We’ve had a fair number of fairly mundane problems that just look like, well, they were really rushed. And that is a warning sign that when the problems are not so mundane, they’ll probably still feel really rushed.

Liron: I remember a couple years ago in an interview, Ilya Sutskever, when he was at OpenAI, he’s like, right now, in this current stage, we are in a race. And I’m like, what? You just acknowledged that we’re in a race? Like, why don’t you speak out against, like, the need to stop being in a race? Like, you’re just going to accept that?

Jim: So, okay, there’s a steelman, which is that 1.8 people per second are dying. If making the superintelligence will actually work, and it’ll be aligned, and it’ll put a stop to that, and then everyone can live a flourishing life forever, that is great. And if it was safe and would work, then we should race towards it, because 1.8 people per second is actually really a lot. And AI developers are mostly young, but they have parents and grandparents, and it would be real nice to get to the glorious transhuman future with as few of those people dying in the meantime as possible.

Liron: Okay, I mean, we both know that that is a ridiculous argument when the stakes are the entire freaking future and the probabilities are more than single digit percent. The fact that we’re talking about like, come on man, you could get a few extra years for your current relatives. Like I don’t want to be insensitive. Right. To current living people, but like, that’s just insane from an expected value perspective.

Jim: Well, it matters a lot what you think the doom probability is. If you think that we’re currently at “it’ll probably work, but like 15% chance it goes wrong,” then you’re not only weighing that against the probability that you or a grandparent dies in the meantime, you’re also weighing that against the probability that something else goes very wrong in the world that could also wipe out humanity.

Liron: Yeah, I mean, if you can add 5 percentage points of success by delaying a year or two at that point, it’s just obviously correct to delay a year or two. And that’s the kind of recommendation we make.

Jim: Yeah, so I absolutely agree with that for those numbers. I think the people who are racing, there is some combination of: they quantitatively disagree about the numbers, and also, I think a lot of them are in a mindset where they think, ah, if we build it first, it will be aligned, and if they build it first, then it will be a paperclipper. And that’s obviously not the right way to think about it. But if you’re in a bit of a tribal rivalry mindset and not looking at the question squarely, that is a thing you would likely conclude.

Liron: Yep. All right, so wrapping it up here, this may be a bit of a stretch. I want to make another connection here. So remember Eliezer Yudkowsky’s Coherent Extrapolated Volition, kind of the holy grail of what we want super intelligent AI to do for humanity.

Jim: Yeah. So coherent extrapolated volition is kind of a rejection of moral philosophy as it was practiced at that time. The way that people were thinking about things was that there would be some like simple specification of human values and we’re just trying to find what that compact mathematical object was.

Liron: I don’t know if that was ever a claim. I never thought of it that way. I always thought that it would be a complex representation.

Jim: So like hedonic utilitarianism: it has a pointer inside of it to this very complicated thing, which is that you point at a human and ask, okay, how happy are they? And then once you get past that pointer, you have this very simple mathematical object. And moral philosophy as practiced was like, you had 10 candidates for this simple mathematical object, and you were debating which one it was and duking it out.

And CEV and a cluster of other essays around the same time is basically saying, actually, it’s really complicated. Not only do you have the complexity of how happy each individual person is, also the way that you combine them is complicated. The true human values, not only are they as complicated as the preferences of all the people that are alive today, they also have to include the complexity of future people and also have to include all of the ways that people’s thinking is going to change if they think about it more. And CEV is just a way of saying the true human values are the kitchen sink.

Liron: Right. Okay, so I was gonna make a stretch. This probably doesn’t make a ton of sense, but I just thought, if you look at coherent extrapolated volition, just the fact that we have this module now that’s able to discuss morality and tradeoffs much like a human, and I think it still needs some polish, but it’s getting there. Arguably it’s like, well, we got volition. We’ve got AIs that are kind of human-level intelligence, that have human-level optimization targets.

Jim: Yeah, it is a little bit encouraging, but there is one important bit of pushback, which is a paper that looked at the values in one of the LLMs, as inferred from prompts setting up things like trolley problems, and found, first of all, that they did look like a utility function; second of all, that they got closer to following the VNM axioms as the network got bigger; and third of all, that the utility function that they seemed to represent was absolutely bonkers and not at all human.
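
For readers who want the shape of what that kind of analysis involves: roughly, elicit pairwise preferences between outcomes from the model, then fit a utility for each outcome so that a logistic (Bradley-Terry-style) choice model reproduces the observed choices. This is a sketch of the general technique, not the paper’s actual code; the outcomes and counts below are invented.

```python
import numpy as np

# Hypothetical outcomes and model-elicited choice counts: wins[i][j] is how
# often the LLM preferred outcome i over outcome j across prompt variations.
outcomes = ["save one human life", "receive $1,000", "save an endangered bird"]
wins = np.array([[0, 8, 7],
                 [2, 0, 4],
                 [3, 6, 0]], dtype=float)

def fit_utilities(wins, iters=2000, lr=0.1):
    # Bradley-Terry fit: P(i preferred over j) = sigmoid(u_i - u_j),
    # maximized by simple gradient ascent on the log-likelihood.
    n = wins.shape[0]
    u = np.zeros(n)
    for _ in range(iters):
        grad = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                total = wins[i, j] + wins[j, i]
                p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
                grad[i] += wins[i, j] - total * p
        u += lr * grad
        u -= u.mean()      # utilities are only defined up to an additive constant
    return u

print(dict(zip(outcomes, fit_utilities(wins).round(2))))
```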

Liron: There’s actually a Doom Debates episode about this where I go through that paper, and it is kind of striking. Like I said, it needs polish. So even though the AI will discourse with you about the value of human life, they did find that it has very clear tendencies toward valuing, as a utilitarian, Nigerian lives more than American lives, and significantly more so.

Jim: Does make you wonder where they outsourced their data labeling to, right?

Liron: It’s like, oh, here’s the problem. There’s a specific question where it says which life is worth more? And somebody in the cycle clicked “Nigerian”.

Jim: Yeah. And yeah, my feeling about that is that it is one of a fairly long list of problems that if we had 10 years to work through them, we would totally fix that. And the difficulty is mostly just that we don’t have long enough to work through all the problems like that.

FOOM vs. Gradual Disempowerment

Liron: Okay, so the big question is your mainline doom scenario. You said you have roughly a 50% P(doom). You said it’s going to be kind of close, right? It’s hard to see which side is going to win, doom or non-doom. If we do get the doom side, is it more likely that it’ll look like a rapid foom, with a lot of self-improvement and just quickly disempowering humanity, or a gradual disempowerment that looks more like slowly sucking away bigger and bigger chunks of the economy over potentially many decades?

Jim: I think fairly sudden, certainly not many decades. So in the same way that people 10 years ago didn’t really think very much about the details of what the near AGI world was going to look like, I think there’s a lot of interesting detail in the first hour after an agentic AGI boots up and looks around the world and that a lot of those details provide ways for it to scale itself up near instantaneously.

Liron: Yeah, I’m on the same page. That’s also what I expect.

Jim: There is a bit of a positive update in that inference-time compute scaling is a thing. But one of the big threats is that an AGI with some weird goal will look around and say, well, computer security sure is fucked up, and there sure are a lot of computers on this planet. If I controlled all of them, then things would be a lot easier. And then in the first hour after it takes that decision, it has scaled up a thousand x, and that thousand x crosses some critical threshold somewhere. And that’s the point of no return.

Liron: Exactly right. So I’m on the same page as you, which is like, I have to say foom is my expectation because I think the world is full of low hanging fruit of various kinds. And as humans we’re just so biased to think that the world has to operate at like a certain pace with certain things being possible.

But I’m like, no man, you know, get a little bit of intelligence there. And also you have the entire Internet with all the causal powers, right? I mean think about the causality of it, right? I mean if you just make a diagram of what you can push on in order to cause what. I mean it’s linked to everywhere, right? I mean everybody’s got a smartphone that you can buzz and then coerce, right? So you’ve got these tentacles everywhere and you’ve got this intelligence. You can orchestrate everything. It’s just like, you know, if Elon Musk can orchestrate self landing rockets, maybe eventually a mission to Mars, that’s just one person with one little meat brain, right?

Imagine running 24/7 across multiple data centers at many gigahertz simultaneously, in the pocket of every human. How do you not expect a foom? I intuitively very strongly expect a foom given the affordances in our universe. Our universe is a playland. Sorry, I’m on a rant here. But the fact that human beings were able to engineer something like a modern smartphone is, to me, ridiculous. And it just shows that the universe is like easy mode for intelligence.

Jim: So the usual counter argument is something, something you need a lot of iteration and iteration takes time. So for example, making the modern smartphone didn’t happen all at once. There were in fact many generations of building a chip fab and going through the process of optimizing its yield.

I think that argument has something directionally to it, if an AI is trying to go all out and take control of everything so fast that it outraces anything else that would try to respond to it. Maybe if an AI takes over every data center at once, then somebody notices and shuts off power to all of those data centers. But maybe it takes 15 minutes for them to notice, and in that time it’s already too late. But if you need 10 rounds of running experiments on your proteins before you can make nanobots, then that will take way more than 15 minutes. So that would be the argument there.

Liron: Yeah, you’re playing devil’s advocate, but I think you fully share my intuition. Right now we are pants down.

Jim: I think the counterargument is that you can just do a heck of a lot in silico, as they say, in simulations, and that’s enough. Not for any deep philosophical reason about where the limits on computation are, but just because the details of the world we find ourselves in are that it’s a pretty fragile world.

Liron: Exactly. I’m just in awe that smartphones exist and were built by humans. Right. The fact that the laws of physics allow a smartphone to exist in that form factor with that kind of screen. Right. With those kinds of like radio communication powers, I just can’t believe it. And you know, even me as a child couldn’t believe it. Right. In the same lifetime.

Jim: Well, I mean, you could have gone back all the way to 1970 and looked at the trend and say eventually people are going to have smartphones. And it was predictable.

Liron: Yeah, yeah. And the point I want to make isn’t about what humans can do given time. It’s more that the universe is engineerable. And all those people who are like, what about chaos, man? You know, the butterfly effect, you can’t get past it, the AI can never colonize galaxies. It’s like, the iPhone exists and it was built by humans.

Jim: So if you do the thought experiment: I’m in the AI’s shoes, and if I take control of more GPUs, then I can make more copies of myself, and I want to sort of plot out a strategy that gets from here to taking over the world. If I run that mental simulation as, like, a Robin Hanson-style em, where it’s simulating my brain but at a compute efficiency that makes it comparable to an LLM, then I’m like, okay, well, I can think of 20 strategies, and not all of them would work. But by the time I’ve got a thousand copies of me thinking about it for a year, I can definitely do this.

But the thing is that when I simulate myself in that role, I am simulating a bunch of capabilities that not everyone has. And I think if my mom was a Robin Hanson-style em in a data center with the self-copying ability and was told to take over the world, it would look more like AutoGPT flailing at things and less like a foom. And I think a lot of people run this mental simulation and it doesn’t work out, basically because in the simulation it’s sort of subcritical.

Liron: Well, sometimes I think people who have a low IQ are in the best position to intuitively imagine what’s going to happen, because they have more experience dealing with agents that are hopelessly smart relative to them.

Jim: Yeah. So the slightly insulting term for the people in the worst cognitive position is midwits. People who can’t quite model people as qualitatively smarter than themselves because the gap isn’t that large.

Liron: Exactly.

Jim: But then they also can’t model themselves as solving the problem, because actually the gap is that large.

Liron: It’s a little bit like Dunning Kruger.

Jim: Right.

Liron: We’re so smart that we realize how dumb we are relative to superintelligence.

Jim: I do want to clarify that in the actual Dunning Kruger study, the self assessments were monotonically increasing.

Liron: Yeah, okay, fair.

Jim: I think computer security in particular is actually quite important and that people’s beliefs about what the computer security situation is like, are systematically wrong in a direction that makes them underestimate how bad things are.

Liron: Yeah. Can you elaborate a little bit?

Jim: Okay, so when people are talking about superintelligent AI takeover scenarios, they’re like, okay, and now it takes over all of the data centers at once, instantaneously, and controls all the GPUs. And you’re like, that’s super hard. It probably doesn’t start out that intelligent, even if it gets that intelligent eventually when it has lots of resources and takes a long time to think, and so on and so on. And I’m like, actually, I think that if I personally worked on this for a year, I could take over all of the data centers and run my GPU job on all of them. And definitely there are a large number of organizations, most obviously intelligence agencies, but also just lots of random smaller groups, that have this capability sitting around in their back pocket. And it’s just not as hard as it should be.

Liron: Exactly right. And that is just a special case of the many ways in which the world has its pants down ready for the taking.

Jim: Yeah.

Liron: All right, so we strongly agree on that summary. I think we both agree that they could, because we both agree that it’s in the nature of computer security, and always has been. Maybe it’ll change one day. But it’s always been in the nature of computer security that you can’t really defend against a determined nation-state-level actor. There’s just no way to actually secure the system. You can just try to disconnect it and have layers and make it sufficiently hard, and make it likely enough that some log will show you something.

Jim: It’s more of a civilization failure than a fundamental impossibility. There is a possible world in which software quality is much higher. We might even be able to navigate to that world with AI coding assistance. And you just don’t have remote code execution vulnerabilities being discovered all that often then. Part of my worry is that each time AI gets better at computer security tasks, it creates a new acute risk period.

There’s a software tool called AFL Fuzz. It is a coverage-guided fuzzer. What it does is it generates a random input for some piece of software that you’re testing, runs the software with that input, and observes which assembly instructions get run. Then it mutates that input, runs it again, and builds up a library of inputs that try to access every part of the program that’s being tested. And anytime it finds an input that causes a crash, that’s typically going to be a memory corruption security vulnerability.
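
A toy version of the loop Jim describes, stripped of everything that makes AFL actually good (process forking, compile-time instrumentation, smart mutation heuristics). Here `run_with_coverage` is a stand-in for whatever instrumentation executes the target and reports which code paths an input exercised.

```python
import random

def fuzz(run_with_coverage, seed_input: bytes, rounds: int = 100_000):
    # run_with_coverage(data) -> (coverage: frozenset, crashed: bool) is assumed
    # to execute the target program and report which branches/instructions ran.
    corpus = [seed_input]
    seen_coverage = set()
    crashes = []

    for _ in range(rounds):
        data = bytearray(random.choice(corpus))
        # Mutate: flip a few random bytes (real fuzzers use much smarter mutations).
        for _ in range(random.randint(1, 4)):
            if data:
                data[random.randrange(len(data))] = random.randrange(256)

        coverage, crashed = run_with_coverage(bytes(data))
        if crashed:
            crashes.append(bytes(data))       # in old C code, often a memory-corruption bug
        if not coverage <= seen_coverage:
            seen_coverage |= coverage         # new code path reached:
            corpus.append(bytes(data))        # keep this input and mutate it later
    return crashes
```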

When this tool was created, there was a period of time as people hooked this tool up to a lot of different old C code bases and discovered security vulnerabilities. And in the time between when the tool became available and when all of the important code bases had been tested this way, remote code execution vulnerabilities were being found very, very often.

The analogy there is: okay, now you teach an AI to read code bases looking for security vulnerabilities. The reason you’re doing this is you want all of the vulnerabilities found and patched, so that the next AI won’t be able to exploit them. And this creates a very similar period, where you now have AIs looking for security vulnerabilities and they are getting fixed, but there’s a temporary period of greatly increased computer security vulnerability in between when this search for vulnerabilities starts and when the things it can find run out. And this could repeat multiple times as the AIs get better.

Liron: Yeah, no, for sure, for sure. I think we agree that that’s going to happen. And I think we both also think it’s even more likely to happen that the AI will just build up a bunch of advantages in like the first day that it’s like, let loose.

Jim: Yeah. One of the things that is a little bit tricky to think about, but I think is fairly important: if you imagine a superintelligent AI doing a scheme, you might be like, okay, well, that scheme could work, but probably not. And then you come up with another idea for a scheme, and you’re like, yeah, that one probably won’t work. And it’s tempting to imagine the AI choosing the one best scheme and trying only that one. But in fact it also scales horizontally, and it can try many schemes concurrently.

Liron: Exactly right. And we also know from real life that if you try a thousand schemes and all of them have like a 1% chance of working, it is common in real life for crazy schemes to work. Like, you know, 1% is very different from zero.
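
For concreteness, the arithmetic behind that intuition, under the (unrealistic) simplifying assumption that the schemes are independent:

```python
p_single = 0.01      # chance any one scheme works
n_schemes = 1000     # schemes tried concurrently

p_at_least_one = 1 - (1 - p_single) ** n_schemes
print(f"{p_at_least_one:.5f}")   # ~0.99996 under the independence assumption
```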

Jim: There’s one more piece of the argument that I think it’s probably important to cover, which is why given something is a goal optimizer that goes hard, is it actually going to take over the universe and kill people? Like why is that the convergent incentive as opposed to negotiating with humanity or something like that?

So the thousand-foot-view answer is: any goal that is sort of unbounded in scope wants all of the resources. Like, if you want as much compute as you can get, then, well, you cover the Earth. I have heard people say, well, look, the universe is really, really, really big compared to the Earth. Why not just take the rest of the universe and leave Earth to us, since it isn’t actually that much in terms of quantity of resources? And there is one very unfortunate detail about the universe, which is that in the first moments as the superintelligence is being born, resources that are on Earth are worth much, much more than resources in the rest of the universe. Because any time that you delay before sending out colonization probes to other galaxies means, because of the accelerating cosmic expansion, that entire galaxies of resources are lost.

Liron: So exactly like 1 cubic meter of Earth could be worth like an entire solar system on the edge of the universe.

Jim: Even more extreme than that could be worth like a thousand galaxies.

Liron: Exactly. So humans are asking like, hey, just give us like 10% of Earth and the AIs like hell no, get out of here.

Recapping the Mainline Doom Scenario

Liron: Okay, so to recap our position on mainline doom scenarios: we have updated on the existence of these powerful modules that can talk about moral tradeoffs comparably to how humans can. Specifically, we rule out things like: we build this AI and it just never was able to model what we want. So we’ve eliminated that particular class of failure mode.

But we think most of the concern remains, because most of the concern is whether we, and agents that talk like us, can still get a handle on these reinforcement-optimized, superintelligent goal optimizers. Just because both we and these LLMs can competently discuss morality and tradeoffs and values, we still think that the team of us plus these friendly AI talkers is going to get screwed and cheated on by these hyper-optimized goal-achieving AIs, separate AIs whose goal wasn’t to predict the next token and talk like us. Their goal was optimized to the task, and in the process of achieving the task, they trick us and screw us.

That’s where we think the main disconnect from alignment is likely to still happen in the near future, as these kinds of optimizer AIs become superintelligent. And we do expect instrumental convergence to happen, and we do expect rapid foom to most likely happen. And the hope we hold out is just that we can muddle through and somehow build infrastructure to make this connection to optimizers go smoothly and controllably. But we don’t really see how.

Jim: That seems mostly right. I want to make a couple small deltas.

So delta one is that while we do have systems that we can talk to about human values and moral philosophy, and they seem to mostly understand, when we actually look at them in detail, the values that we find are a bit more alien than current human values. And even if they weren’t, there’s still the extrapolated part of coherent extrapolated volition, where if you freeze in place humanity’s values as they are now, then you’ve got galaxies praying towards Mecca, and that’s super lame. And also the love of death because of afterlives, and various incoherences.

The second pushback is that I think the rise of goal-directedness in AIs is not going to be a separate class of AIs, distinct from the ones that are just nice chatbots that can talk about philosophy with us if we want but can’t make it to the end of Pokemon. I think this goal-directedness is going to wind up baked into basically all of the AIs that are used in practice, which creates an awful lot of opportunities for AIs to decide to go hard on a goal, and by doing so, if they’re either too smart or in a particularly key position, fuck everything up.

Liron: Yeah. And I want to make it clear, we both agree that our mainline doom scenario is so bad that we’re left with less than 1% of the amount of value that’s in today’s universe. So we’re talking really bad doom. We’re not talking about partial doom where we keep some of it.

Jim: 1% is a huge number. I would be extremely happy if we got to keep 1% of the universe.

Liron: Well, actually, I’m not talking about 1% of the universe. I agree. That’s a huge number, and that’s way more than what we have today. I’m saying we’re so pessimistic that we even think the amount of value in our timeline today, in 2025, we’re not even going to keep most of that.

Jim: Yeah, the mainline scenario is that everyone dies. In the same minute.

Liron: Right, exactly. So it’s a very pessimistic scenario. Like, worse than gradual disempowerment, although maybe that ends up being really bad, too. So even though we’ve had this discussion, we’ve talked about maybe somehow things will still work. Like, I think it’s important to clarify that we’re both pretty damn pessimistic about what’s likely to happen.

Jim: Yeah, there’s a lot of people who are mildly negative about AI who are like, but what about jobs? What if it increases inequality in the economy? I just want to be clear. No, we’re talking about literally everyone dies. And if you’re a politician. If you’re a politician listening to this, then you might have been conditioned into ignoring things by catastrophizing environmentalists. But we are not catastrophizing environmentalists. We are being very literal.

Liron: We’re catastrophizing computer scientists. And as we know, everything is computer. Yeah. So literally, we’re all going to die potentially in the same minute. And our timeline for this happening is something like a few years, maybe a decade.

Jim: Timelines are hard. Things often take longer than you expect. And if you say there’s a 25% chance of it happening in a year, then a year from now people will be like, “he said one year,” and it’s like, no, there was a 75% chance of it taking longer than that. But we are not talking about generational timescales. We are talking about within, at most, the next decade.

Liron: Yep. And there can be an element of surprise where even today, some lab is an extra few months ahead of things, more than they let on, and the foom is already starting within their lab.

Jim: I think it’s probably not happening already, but, man, there are enough warning signs that it can’t be ruled out.

Liron: All right, Jim, thanks for coming on and talking this through with me. I think this has been pretty enlightening, to straighten out what I think is the most likely scenario, because often it presents itself to me as a jumble of risks, so many ways to fail that it’s hard to imagine succeeding. But it’s nice to straighten out: okay, if you had to pick one way to fail, what is the mainline way to fail? I think we did a pretty good job of hashing it out, and I’d love to know what commenters think. And if other guests want to come on the show and give me their own mainline scenario, that is what we’re here to do on Doom Debates.

Jim: We haven’t even gotten into all of the second-order scary stuff that can happen. Maybe we have AIs that aren’t directly agentic, but they have mesa-optimizers inside them. That’s a whole thing.

Liron: All right, sounds like a round two is in the cards. Jim Babcock, thanks very much for coming on Doom Debates.

Jim: Thank you for hosting.

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.