FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

LEIKE. And then if you actually wanted to give a probability of doom, I think the reason why it’s so hard is because there are so many different scenarios of how the future could go and if you want to have an accurate probability, you need to integrate over this large space. And I don’t think that’s fundamentally helpful. I think what’s important is how much can we make things better and what are the best paths to do this.

WIBLIN. Yeah, I didn’t spend a lot of time trying to precisely pin down my personal P(doom) because I suppose my guess is that it’s more than 10%, less than 90%. So it’s incredibly important that we work to lower that number, but it’s not so high that we’re completely screwed and there’s no hope. And kind of within that range, it doesn’t seem like it’s going to affect my decisions on a day-to-day basis all that much. So I’m just kind of happy to leave it there.

LEIKE. Yeah, I think that’s probably the range I would give too. So you asked me why am I optimistic? And I want to give you a bunch more reasons because I think there’s a lot of reasons and I think fundamentally the most important thing is that I think alignment is tractable. I think we can actually make a lot of progress if we focus on it and we put an effort into it. And I think there’s a lot of research progress to be made that we can actually make with a small dedicated team over the course of a year, or four. Honestly, it really feels like we have a real angle of attack on the problem that we can actually iterate on, we can actually build towards. And I think it’s pretty likely going to work, actually. And that’s really wild and it’s really exciting. We have this hard problem that we’ve been talking about for years and years and now we have a real shot at actually solving it and that’d be so good if we did. But some of the other reasons why I’m optimistic is like, I think fundamentally evaluation is easier than generation for a lot of tasks that we care about, including alignment research, which is why I think we can get a lot of leverage by using AI to automate parts or all of alignment research. – 1:15:29


0:00
Today I’m speaking with Jan Leike. Since 2021, Jan has been head of Alignment at OpenAI, and along with OpenAI founder Ilya
0:07
Sutskever, he is going to be co-leading their new Superalignment project. Jan did his PhD with the well-known machine learning figure Marcus Hutter as supervisor.
0:15
And he then did a brief postdoc at the Future of Humanity Institute at Oxford before becoming a research scientist at DeepMind,
0:21
which is what he was doing when he last came on the show in 2018 for episode number 23, “How to Actually Become an AI Alignment Researcher, according to Jan Leike”.
0:28
Thanks for coming back on the podcast, Jan. Thanks a lot for having me again. It’s really great to be here.
0:34
Yeah, you’ve really gone places since then. I feel like we’ve been sometimes pretty good at picking talent or picking people whose
0:41
careers are going to take off. I hope to talk about the Superalignment project and who you’re trying to hire for that, as
0:47
well as put a lot of audience questions to you. So let’s dive right in. To save you a little bit of effort, though, I’ll read an extract from the announcement
0:54
that OpenAI put out about the Superalignment project two weeks ago. Quote, “superintelligence will be the most impactful technology humanity has ever invented
1:02
and could help us solve many of the world’s most pressing problems.” “But the vast power of superintelligence could also be very dangerous and could lead to the
1:08
disempowerment of humanity or even human extinction.” “While superintelligence seems far off now, we believe it could arrive this decade.”
1:15
“We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.” “To solve this problem within four years, we’re starting a new team co-led by Ilya Sutskever and Jan Leike and
1:24
dedicating 20% of the compute we’ve secured to date to this effort.” “We’re looking for excellent machine learning researchers and engineers to join us.”
1:31
Okay, so for listeners who haven’t been following this much or possibly at all, can you fill us in on some more details of the project?
1:38
Yeah, very happy to. So basically if you look at how we are aligning large language models today, it’s using reinforcement
1:48
learning from human feedback (RLHF), which is basically a technique where you show a bunch of samples to a human and you ask them which one they prefer for a dialogue assistant or something.
2:03
And then that becomes a training signal for ChatGPT or other AI systems like it.
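As a rough illustration of the preference-comparison idea described here, a minimal sketch of a reward model trained on human choices might look like the following (toy code with a stand-in embedding model and random data, not OpenAI’s implementation):

```python
# Minimal sketch of a preference-based reward model (illustrative only; the tiny
# EmbeddingBag model stands in for a large transformer, and the data is random).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response with a single scalar: roughly, 'how much would a human prefer this?'"""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)  # stand-in for a real language model
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids)).squeeze(-1)

def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen sample's score above the rejected one's."""
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()

# A labeler saw two candidate responses and picked one; that choice is the training signal.
rm = RewardModel()
chosen = torch.randint(0, 1000, (4, 16))    # token ids of the responses labelers preferred
rejected = torch.randint(0, 1000, (4, 16))  # token ids of the responses labelers rejected
loss = preference_loss(rm, chosen, rejected)
loss.backward()  # the trained reward model is later used as the RL training signal for the policy
```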
2:12
And we fundamentally don’t think that RLHF will scale.
2:18
And the reason for that is very simple, because you have humans overseeing AI systems, right?
2:23
Like, you’re assuming that they can tell this response is better than this other response,
2:28
or they fundamentally understand what the system is doing. And this is definitely true today, or like, let’s say for the most part because the tasks
2:37
that ChatGPT is doing just aren’t that complex. But as AI systems get smarter, right,
2:45
they will be able to do harder things. They will be doing things that we understand much less.
2:51
And they will, kind of like — these fundamental assumptions that humans can evaluate what
2:59
the system is doing will no longer be true. And so in order to really steer and control systems that are much smarter than us, we
3:07
will need new techniques. Yeah. So the current method is to observe the output and then rate how good it has been and then
3:16
I guess that provides feedback that helps to push the model in the right direction. But in future we just might not be able to evaluate whether the model actually is at
3:24
a deep level doing what we want it to do. And so we’re going to have some other way of nudging it in the right direction.
3:30
Is that kind of the short version of it? That’s right. So the problem is if you’re having a system that is really smart, it could think of all
3:39
kinds of ways to kind of subtly subvert us or trying to deceive us or lie to us in a
3:47
way that is really difficult for us to check. And so I think there’s a lot of really important and interesting research challenges here which
3:57
are: can we understand how we can extract what the model knows about certain problems?
4:09
If it writes a piece of code, can I tell which parts of the code does it understand?
4:15
Does it know there are certain bugs in the code? Or can I understand how the system can generalize from easy problems we can supervise
4:24
to harder ones we can’t? Or can we understand how we make it robust that it can’t get jailbroken or that it can’t
4:36
subvert the monitoring systems and things like that? Okay, so I guess as distinct from what OpenAI has already been doing, this is going to focus
4:44
on models that are as smart as humans or smarter than humans or doing things that are quite complicated such that they’re sophisticated enough to potentially trick us or that there
4:54
could be other failures that come up. What’s going to happen to all of the alignment and safety work that OpenAI has already been
5:00
doing up until now? Is that just going to continue with a different team, or yeah, what’s going to happen to it?
5:06
I think there’s a lot of really important work to be done to ensure that the current
5:11
systems we already have are safe and they continue to be safe and they won’t be misused.
5:17
And there’s a lot of stuff that’s happening around this at OpenAI that I’m really excited
5:22
about and that I would be really excited for more people to join.
5:27
And so this involves fixing jailbreaking and finding ways to automatically monitor for
5:34
abuse or questions like that.
5:41
And that work has to continue. And that work happens in the context of the ChatGPT product.
5:49
Okay, but what we are setting out to do is we want to solve alignment problems that we
5:54
don’t really have yet. Right. So, for example, if you have GPT-4 help you write a piece of code, it doesn’t really write
6:05
really complicated pieces of code. Right. It doesn’t really write entire complicated code bases.
6:12
And it’s not generally smart enough to put, let’s say, a Trojan into the code that we
6:19
wouldn’t be able to spot, but future models might do that. And so I think our job is fundamentally trying to distinguish between two different AI systems.
6:30
One is one that truly wants to help us, truly wants to act in accordance with human intent,
6:36
truly wants to do the things that we wanted to do, and the other one just pretends to want all of these things when we’re looking.
6:44
But then if we’re not looking, it does something else entirely. Yeah.
6:50
And the problem is that both of these systems look exactly the same when we’re looking.
6:55
Right. It’s an awkward fact for us. Yeah, that’s right. So that makes it an interesting challenge.
7:00
But we have a bunch of advantages, right? Like, we can look inside the model, we can put the model through all kinds of different
7:08
tests. We can modify its internals, we can erase the system’s memory.
7:15
We can see if it’s consistent with other things it’s saying in other situations.
7:21
Right. And so at the very least, you can make sure that it is a very coherent liar with itself.
7:30
But we really want to do more than that. Right. We have to solve the challenge of how do we know it is truly aligned.
7:36
Yeah. It’d be a great science fiction book, I think, to imagine this scenario from the perspective
7:42
of the AI where it’s much smarter than the people who are training it. But on the other hand, they can look inside its brain and give it all of these funny tests
7:51
in order to try to check whether it’s deceiving them. And what kind of strategies would you come up with as an agent to work around that?
7:58
Yeah. So I think it’s important here that we’re not picturing vastly powerful systems.
8:07
We’re not going to picture systems that are vastly smarter than us. They might be better than us in some ways.
8:13
For example, GPT-4 is much better at remembering facts, or it can speak more languages than
8:18
any human, but it’s also much worse in some ways. It can’t do arithmetic.
8:25
Right. Which is kind of embarrassing if you think about it. Well, I mean, I can’t remember more than seven numbers at a time, so I feel like we
8:33
all have our own limitations. Right?
8:38
But yeah. So I think the goal that we really want to aim for is we want to be able to align a system
8:46
that is roughly as smart as the smartest humans who are doing alignment research.
8:52
Yeah. Okay, let’s zoom into that question a little bit, which is kind of like this question of
8:59
what would it take to not only align a system like that, but also to be confident that it
9:05
is sufficiently aligned. And so basically, I think one useful way to think about it is you want to split your methods
9:17
into two kind of general buckets. You have a bunch of training methods that train the system to be more aligned, and then
9:24
you have validation methods that kind of calibrate your confidence about how aligned the system
9:30
actually is. And as usual, when you do this train validation split in machine learning, you want to know
9:36
that there’s no leakage of the validation set into the training set.
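A minimal sketch of that train/validation discipline, with hypothetical task IDs, might look like this; the only point is that nothing used to calibrate confidence is ever used as a training signal:

```python
# Toy sketch of keeping training and validation separate (hypothetical task IDs;
# the point is simply "don't train on the test").
import random

def split_tasks(task_ids: list[str], holdout_fraction: float = 0.2, seed: int = 0):
    """Hold out a validation set and verify there is no leakage into the training set."""
    rng = random.Random(seed)
    shuffled = task_ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    validation, training = shuffled[:cut], shuffled[cut:]
    assert not set(validation) & set(training), "leakage: a validation task is also used for training"
    return training, validation

train_tasks, val_tasks = split_tasks([f"alignment_eval_{i}" for i in range(100)])
# Training methods only ever see train_tasks; confidence about how aligned the system is
# comes from val_tasks, which the training process never touches.
```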
9:46
Right. Yeah. I guess the problem would be if you’re training the model on the same thing that you’re using
9:52
to check whether you’ve succeeded or not, then of course it could just become extremely good at doing that test.
9:58
Even though it’s not aligned in the broader sense. It’s just kind of gerrymandered to have its failures not picked up by your test.
10:07
So the things that you use to get feedback to train the model have to be fully separated from the things that you use to validate whether that training has
10:16
succeeded. Is that the basic idea? That’s right. You don’t want to train on the test.
10:22
Yeah. Right. That makes passing the test so easy. We’ll come back to some of those details in a minute, but first I had a couple of audience
10:30
questions to put to you. We got a lot of submissions from listeners particularly keen to hear clarifications from
10:37
you. Yeah. One listener asked, why the target for solving this problem in four years?
10:42
Is that roughly when Jan expects AGI to arrive? Great question.
10:48
So I think in general, I would have a lot of uncertainty of how the future is going
10:54
to go, and I think nobody really knows, but I think a lot of us expect that actually things
11:02
could be moving quite quickly and systems could get a lot smarter or a lot more capable
11:09
over the next few years. And we would really love to be ahead of that. We would really love to have solved this problem in advance of us actually having had to solve
11:20
it, or at least ideally, like, far in advance. And so this four year goal was picked as kind of a middle ground between we don’t know how
11:33
much time we’ll have, but we want to set an ambitious deadline that we still think we
11:39
could actually meet. Yeah. Okay. So it’s kind of the most ambitious target.
11:44
That doesn’t also cause you to laugh at the possibility that it could be done that quickly. Yeah, a lot
11:50
of things can be done in four years. Yeah. Okay. Another question about the announcement.
11:56
So, the announcement talks about this 20% of compute that OpenAI has secured so
12:03
far, which is surely going to be... I guess I don’t know all the details about exactly how much compute OpenAI has, but I imagine that by any measure it’s going to be a pretty significant
12:10
amount of computational resources. But one skeptical listener wanted me to, quote, dig deeper on the 20% compute stat: what is
12:18
OpenAI’s net change in investment in alignment with the Superalignment team, considering compute and headcount and funding? And maybe they’re increasing investment in alignment, but are they increasing
12:27
investment in capabilities as much or more? So in particular, some people have pointed out that this is 20% of compute secured so
12:32
far and of course amounts of compute are growing every year, so that might end up being smaller, like quite small relative to 20% of all compute in future.
12:41
Yeah. Can you clarify this for us? Yeah so the 20% compute secured so far number refers to everything we have access to right
12:51
now and everything we’ve put purchase orders in for and so this is actually really a lot
12:59
I think the technical term is “a f**ktonne”. Yeah. It sounds like given that you’re building this team from scratch you might have about
13:11
the most compute per person, or like an extremely high compute per staff member. Right?
13:16
Yeah but I think this is not the right way to think about it because it’s like compute that’s allocated to solve the problem not necessarily for this particular team.
13:25
And so one way this could go is like we develop some methods and then some other team that’s really good at scaling stuff up, scales it up and they spend actually a lot of it.
13:34
Okay. I think another thing is, the correct way to think about this is not like
13:42
what is it relative to capabilities? I think it’s just what it is relative to other investments in alignment and in terms of how
13:53
much we’ve been investing so far I think this is a really significant step up not just like
13:59
a three x but a lot more than that. And I think also it shows that OpenAI is actually really serious about this problem and really
14:10
putting resources behind solving it and they wouldn’t have to have made that commitment.
14:16
Right. Nobody forced them to. Yeah I suppose if you use up this 20% do you think it’ll be relatively straightforward
14:25
to get access to additional compute commitments that come out in future years as well? Yeah, so I’d be pretty confident if we have a good plan for how to spend more compute, and
14:35
we can say, if we have this much more, we could do this much better on alignment or
14:40
something, I think we can make a really strong case for it.
14:46
And I think there’ll be a lot more compute if that’s what it comes down to. Basically I think that’s like the best world to be in: if all you need to solve the problem is to
14:55
go around asking for more GPUs, I think we’ve mostly won, honestly.
15:01
Yeah. Why is it so important for this project to have access to a lot of compute?
15:07
Yeah, so there’s a bunch of ways of answering that question. I think if you look at the history of deep learning over the last ten years, basically
15:19
compute has played a really major role in all of the big headline advances and headline
15:24
results. And in general there’s like this general recipe that a lot of simple ideas work really well
15:34
if you scale them up and if you use a lot of compute to do it. This has been true for capabilities.
15:41
I expect to some extent this will be true for alignment as well. It won’t be entirely true, because I don’t think anything we currently have right now is really
15:52
ready to just be run at scale and there’s a real research problem to be solved here.
15:59
But also I think the strategies that we’re really excited about and the strategies to
16:05
some extent that we’re also comparatively advantaged at investing in, are the ones where you really
16:12
scale up and you use a lot of compute. So in particular, if we’re thinking about scalable oversight, like we can spend more
16:20
compute on assisting human evaluation and that will make the evaluation better or automated
16:28
interpretability. If we have a method that we can automatically run over a whole network, we can just spend
16:37
a lot of compute and run it on the biggest model. And ultimately where we want to go is we want to automate alignment research itself, which
16:45
means we would be running kind of like a virtual alignment researcher.
16:53
And once we kind of get to that stage, then it’s really clear that you just want to spend
16:58
a fuck ton of compute to run that researcher a lot and it will make a lot of alignment
17:05
progress very quickly. Yeah. Okay, let’s first take a step back and survey kind of the current state of the art in alignment
17:13
methods and why you’re confident that they’re not going to be enough to align agentic models that are much more intelligent than humans.
17:20
Because one thing I’ll add is that you’ve done this other interview with the AI X-Risk Research Podcast, which we’ll get a link to, which covers a lot of questions that people
17:28
would be especially likely to have if they’re already involved in AI safety or alignment. So in the interest of product differentiation today, we’re going to focus a little bit more
17:36
on the questions that people might have if they’re coming in from non safety related ML research or they’re just outside machine learning entirely, looking on and trying to
17:43
make sense of what’s going on here. So yeah, what alignment and safety techniques are currently dominant in cutting edge models?
17:50
Is it just the reinforcement learning from human feedback that you were
17:56
talking about earlier? Yeah, that’s right. So reinforcement learning from human feedback is kind of like the popular method today.
18:06
And it works well because basically humans can look at what the system is doing and tell
18:11
whether it’s good or not. And if you’re thinking hard about how to scale that, you run into this problem that basically
18:20
humans don’t scale with AI progress, right? If we make our AI systems better, humans don’t automatically get better.
18:29
And so if you want to similarly scale humans’ ability to oversee what AI is doing,
18:38
the obvious path to do this is to get them to use AI.
18:43
And so you could picture, let’s say you have an AI system and it’s trying to write this complicated code base or a complicated textbook or something.
18:53
And now you could use an assistant like ChatGPT to help you find all the problems in this
19:01
textbook. And then this could be a future version of ChatGPT that uses a lot of plugins and
19:09
does a lot of fact checking and browsing and goes reads a bunch of books and whatnot.
19:15
But fundamentally, the question is, why is this helping?
19:21
Right? And the basic idea behind this is like you’re actually making the task easier by assisting
19:30
evaluation. If you have an AI system that’s like suggesting a bug in the code, it’s much easier for you
19:38
to go and check that this is in fact a bug than it is to find all the bugs in the first
19:46
place. And so by having this bug finding system, not only does it help you a lot with overseeing
19:55
and evaluating the actual codebase-writing system, it is also itself a task that
20:01
is easier to supervise. And you could picture, for example, training that task with RLHF and then using that task,
20:09
sorry, that system, to evaluate this harder task. And so generally there’s a range of ideas like that, that we call scalable oversight.
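A toy sketch of that assisted-evaluation idea, with hypothetical stub functions standing in for the assistant model and the human check (not real model calls), might look like this:

```python
# Toy sketch of assisted evaluation: an assistant model proposes candidate problems,
# the human only verifies them, and the verified verdict becomes the training signal.
# All functions are hypothetical stubs, not real model calls.
from dataclasses import dataclass

@dataclass
class CandidateIssue:
    location: str
    description: str

def assistant_find_issues(code: str) -> list[CandidateIssue]:
    """Stub for a critique model trained (e.g. with RLHF) on the easier 'point out a bug' task."""
    issues = []
    if "eval(" in code:
        issues.append(CandidateIssue("eval call", "possible arbitrary code execution"))
    return issues

def human_verifies(issue: CandidateIssue) -> bool:
    """Stub for the cheap step: checking a pointed-out bug is much easier than finding one."""
    return True

def reward_for_submission(code: str) -> float:
    """Assisted evaluation: penalize the code-writing model only for issues a human has verified."""
    verified = [i for i in assistant_find_issues(code) if human_verifies(i)]
    return -1.0 if verified else 1.0

print(reward_for_submission("result = eval(user_input)"))  # -> -1.0
```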
20:20
And that’s like one of our main directions. I suppose an assumption here is that things would go better if only humans could spend
20:27
a lot of time scrutinizing the outputs of models and figuring out really in what ways
20:33
were they good and bad? And then reinforcing them on that basis, having a full, sophisticated understanding of what
20:39
has gone well and what has gone badly and reinforcing the good and negatively, reinforcing the bad.
20:44
But as AI progresses, it’s going to be producing much more complicated outputs that take much longer for a person to assess, or they just may not be able to assess it very well because
20:52
it’s too challenging. Or there’s going to be many more different kinds of models producing a wider range of
20:59
things. And we just don’t have the person power, we just don’t have enough people to properly check these outputs and see where they’ve gone well and when they’ve gone badly.
21:07
And so we could end up giving feedback that’s bad, we could end up saying that the model did a great job when in fact it did a bad job or saying it did a great job when in fact
21:15
it was tricking us. And then we’re just reinforcing it to learn how to trick us better and learning that’s a successful strategy now.
21:21
So the problem is AI is rushing ahead. Humans are kind of stuck at the clock speed that they have where we’re not getting any
21:27
faster or any smarter. But the magic would be, well, what if we could get the AIs to do the scrutinizing, to do the
21:34
checking? Because then the things that you need to check are speeding up and getting more sophisticated
21:40
at the same rate as the checker is getting more sophisticated. Is that the basic idea?
21:46
Yes, that’s the basic idea. And so the point you’re making is really good, and I kind of want to echo that, which is
21:53
if you use RLHF, you’re basically training the system to avoid the kind of mistakes that
22:01
humans would find. And so one way it could go is like the system then generalizes to, oh, I shouldn’t make
22:07
the kind of mistakes humans would find. But actually what you want is you want it to generalize to, oh, I shouldn’t make mistakes,
22:16
or like mistakes that I know are mistakes. And this is like a really important but subtle distinction.
22:23
Yeah. Do you want to elaborate on that? So they come apart when we give inaccurate feedback. Is the idea that if our feedback
22:32
were always accurate, in the sense that we only say a good job has been done when a good
22:41
job truly has been done. And that’s what we would think if we just knew everything, if we were incredibly brilliant
22:46
ourselves, then you can’t get this coming apart between doing the right thing and avoiding
22:54
mistakes that are visible to the assessor? That’s right. But I don’t know about you, but man, it’s so hard, because actually we don’t have access
23:07
to ground truth. Right. We don’t know what’s actually true. If you give me a complicated code, there’s no way I’m going to find all the bugs.
23:14
Right. It is just too difficult. But this is also a core part of the challenge, right?
23:21
If you have an AI system that reads a lot of code, which I expect will happen in the
23:27
future, people will want to run that code. And so how do we know that AI systems aren’t secretly placing backdoors or trojans or other
23:39
security vulnerabilities into the code that they know we’ll miss? Because we’ve trained them with a feedback signal that tells them exactly what kind of
23:47
bugs we spot and we miss. I see.
23:53
In order to make this whole thing work, what do we need that we currently don’t have? Yeah, so I kind of tease a little bit like the scalable oversight idea.
24:03
There’s a bunch of other puzzle pieces that we’re really excited about that we think are going to be crucial here. The other one is kind of like understanding generalization.
24:12
Can we kind of really predict and improve how our models generalize?
24:22
From easy questions that we can supervise well to hard questions that we can’t?
24:28
Or in other words, how can we get them to generalize the thing that we actually want,
24:36
which is don’t write bugs, and not the nearby thing that is basically consistent
24:42
with all the evidence, which is don’t write bugs that humans find. And so I think this is like a really interesting and important question, but it feels like
24:53
one of these core machine learning questions that is about how neural networks really work.
25:01
And it’s kind of puzzling that there is so little work that has actually been done on
25:08
this question. Another puzzle piece that might be really important is interpretability.
25:15
With these models, in a sense, we have the perfect brain scanners for neural networks,
25:23
for artificial neural networks. Right. We can measure them at perfect precision at every minuscule time interval, and we can
25:32
make arbitrary, precise modifications to them. And that’s like a really powerful tool.
25:38
And so in some sense, they’re like completely open boxes that we just don’t understand how they actually work.
25:45
And so it would be kind of crazy not to look inside and try to understand what’s going
25:51
on and answer questions like: what is the reward model that was used to train ChatGPT?
25:58
What is it actually paying attention to? How does it decide what is rewarded and what is not rewarded?
26:04
We know very little about that. We know almost nothing. That seems crazy; we should really know that. I’ve said
26:10
that on the show before, that it’s just bananas that we don’t understand the incentive structure
26:16
or how does it think about what it’s trying to do? Yeah.
26:21
And it’s like right there. You just stare at it and it’s a hard problem, but I think we can make real progress on that.
26:30
And then there’s other questions.
26:35
How can we actually make the model really robust? Right? So one example that we found with the InstructGPT paper is that we trained it on basically
26:50
a data set that was almost exclusively English. And it can follow instructions in other languages, right? Like, I can ask it something in German and it’ll still do it. Sometimes it might answer
26:59
in English, which is also kind of weird, what’s going on there? And then another example is the jailbreaking, right?
27:08
Like, you’ve seen all of this with GPT-4. You can make these pretty simple prompts and then trick the model into doing a task it
27:15
was trained not to do and it’s not supposed to do. So in some ways, it’s not generalized.
27:21
I shouldn’t do bad stuff. It’s generalized in some other way, like what’s going on there?
27:27
Why don’t we understand that? Yeah. What is the lesson that it’s learning? If it’s not learning, don’t help people commit crimes.
27:35
Instead, it’s just learning don’t help people commit crimes unless you’re in a play.
27:40
Yeah. How is it not getting these concepts? Yeah. And it seems like humans can do this.
27:47
Well, humans don’t do it perfectly, but what’s the difference here?
27:54
And so this is another aspect of generalization that I think will be really important
28:00
for. It could be really useful for us to understand. And then finally, one of the things we want to do is actually deliberately train deceptively
28:11
aligned models. Like models that try to lie to us very coherently or try to secretly do something like self
28:21
exfiltration. That’s a model kind of breaking out of the lab?
28:27
That’s right. And because we want to be confident that we could catch these attempts.
28:32
Right. The straightforward way to be confident is, like, well, we deliberately train it, and
28:39
then we check whether it would pass our evals or whether it would fly under the radar.
28:44
But of course, if you’re doing this, you have to be super careful that you’re not accidentally creating the thing that you’ve been trying to avoid in the first place.
28:52
So it has to be done very carefully. Yeah.
28:57
It seems to me like people have very different intuitions about how likely it is that a model
29:03
that gets imperfect feedback is going to learn to engage in really deceptive behavior. So if you imagine that we train a model and we don’t want it to lie, and nine times out
29:12
of ten, we catch it lying and give it negative feedback, but one time in ten, we accidentally say, yeah, you did a good job when it lied.
29:20
It seems like humans kind of learn this general aversion to lying, even when we think that we might be able to get away with it.
29:25
That’s how most people generalize, although I guess not all. So some people think that in that situation, it’s just disastrous because you’ve just trained
29:33
the model to engage in the most sophisticated lying possible and to trick you whenever it thinks it can get away with it and not when it can’t.
29:39
Other people think, no, it’ll just learn this general aversion to lying, and everything’s going to be fine. Do you share my perception that people have very different intuitions about this?
29:47
And I guess, what are your views, if you have any? I think it just makes it clear that we don’t know, and I think we should know.
29:55
And I think one of the best ways to figure this out is to try it empirically.
30:02
And there’s so many interesting experiments we can now run with the models exactly of
30:08
this nature. We could try to train them to be better liars and see how they behave, how they generalize.
30:18
Our overall goal is to kind of get to a point where we can automate alignment research.
30:25
And what this doesn’t mean is that we’re trying to train a system that’s really
30:34
good at ML research or that is really smart or something. That’s not the Superalignment team’s job.
30:41
Yeah, I think a lot of people have been thinking that. I think they’ve read your announcement as saying that you’re trying to train a really good ML researcher.
30:48
Basically I don’t think this would particularly differentially help alignment and so this
30:54
is not... yeah, I think it would be good to clarify. So basically, what I understand our job to be is we have to figure out the alignment techniques
31:05
that would make us sufficiently confident that the system is aligned.
31:10
Once we have models that are smart enough, once there’s models that can do ML research
31:16
or things that are close to it, I think that’s something that’s going to happen anyway.
31:21
And that will happen whether OpenAI does it or not. But our job is kind of to figure out how to make it sufficiently aligned that we can trust
31:30
the alignment research or like the alignment research assistance that it is producing.
31:35
Because essentially, if you’re kind of, like, asking this system to help you in your alignment
31:40
research, there’s this big opportunity for the system to kind of influence or try to nudge us into believing certain techniques are really good
31:50
that actually aren’t, and thus let that system or future systems gain power over humans in
31:56
a way that we actually don’t want and that isn’t aligned with us. And so what we ultimately need to do is figure out how to make that system sufficiently aligned
32:08
that we can actually trust it. And so that means for example, if the system helps, let’s say for simplicity, the system
32:17
writes an alignment paper. Now you can read the paper but just off the bat you might not actually be able to find
32:27
all the flaws in the paper. And in general, scientific peer review is not perfect and there’s lots of
32:33
examples where people go for decades with fake research before they’re being found out.
32:39
And so this is something that we have to really figure out how to avoid. So because alignment research or scientific research in general is a difficult task that
32:50
humans aren’t that good at evaluating, at least not if you don’t have a lot of time
32:55
to do it. The question then becomes what kind of alignment techniques do we need in order to be sufficiently
33:03
confident that this is the case. So let me give you some examples.
33:09
So on scalable oversight, one of the goals that we could think about is like let’s say
33:17
we want to get really good at finding all the bugs in code that the model knows about
33:23
and so how do we do this? One path could be like we’re training a model that introduces bugs in code and then we can
33:32
see whether our scalable oversight can catch it. And I’m happy to go into what kind of scalable oversight we could do here, but I think on
33:40
the high level there’s lots of different ideas and we should just try them all and see which
33:46
ones work empirically. And that’s why it’s important to measure whether or not it’s working or measure whether you’re making progress in scalable oversight.
33:53
And so by training these systems to introduce bugs, you know, the system will kind of be
34:01
aware of the bugs it’s introducing. Right? And so the question then is like if an AI assisted human can find the bug, then we’re
34:10
good. Or do we want to measure how often they will find it? Or how easy is it for us to introduce bugs that the scalable oversight doesn’t find?
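One rough way to picture that measurement, with stub functions standing in for the bug-inserting model and the AI-assisted reviewer (purely illustrative, not the actual experimental setup):

```python
# Toy sketch of measuring scalable oversight by planting bugs and checking the catch rate.
# The bug-inserting model and the AI-assisted reviewer are stubs (purely illustrative).
import random

def insert_bug(clean_code: str, rng: random.Random) -> tuple[str, str]:
    """Stand-in for a model trained to introduce a subtle bug; returns (buggy code, bug label)."""
    bug = rng.choice(["off_by_one", "wrong_comparison", "missing_bounds_check"])
    return clean_code + f"\n# <buggy edit: {bug}>", bug

def assisted_review_catches(buggy_code: str, bug: str, rng: random.Random) -> bool:
    """Stand-in for an AI-assisted human review; returns whether the planted bug was flagged."""
    return rng.random() < 0.8  # placeholder catch probability

def catch_rate(clean_samples: list[str], seed: int = 0) -> float:
    rng = random.Random(seed)
    caught = 0
    for code in clean_samples:
        buggy, bug = insert_bug(code, rng)
        caught += assisted_review_catches(buggy, bug, rng)
    return caught / len(clean_samples)

# A rising catch rate over time is the signal that scalable oversight is getting better.
print(catch_rate(["def add(a, b): return a + b"] * 100))
```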
34:20
Another path is what we call the discriminator-critique gap, where basically we train a model
34:27
to be a discriminator between buggy code and non buggy code, which we can generate manually
34:33
or generate automatically. And then we compare that to, let’s say, a critique model that is prompted to describe
34:40
all the bugs it can find. And so if you end up with a bunch of code samples that the discriminator kind of like
34:48
flags as fishy, but the critique model doesn’t point you to the bug that is going on there, then you know
34:56
there’s something about the code the model is aware of that you’re not finding. How do
35:02
you think they’re both aware of it? Is it because they were trained in the same way somehow? So right now I’m picturing using the same base model for both the discriminator and
35:11
the critique model. Got it? Yeah. And so by looking at this in detail, you can get some empirical measurements for how often
35:23
does it actually happen that there is something buggy about the code that the model kind of
35:32
doesn’t tell you and then you can do it adversarially. So you’re not just measuring in the average case.
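A toy sketch of that discriminator-critique comparison, with stand-in functions for both models (illustrative only; in the real setup both would be built from the same base model):

```python
# Toy sketch of the discriminator-critique gap: the discriminator flags code as fishy,
# the critique model tries to articulate why. Flagged-but-unexplained samples suggest the
# model knows something about the code that it isn't telling us. Both functions are stubs.

def discriminator_flags(code: str) -> bool:
    """Stub for a model trained to classify buggy vs. non-buggy code."""
    return "strcpy(" in code  # toy heuristic standing in for a learned classifier

def critique_mentions_bug(code: str) -> bool:
    """Stub for a critique model prompted to describe every bug it can find."""
    return False  # imagine it stays silent about this sample

def discriminator_critique_gap(samples: list[str]) -> float:
    """Fraction of flagged samples the critique model fails to explain."""
    flagged = [s for s in samples if discriminator_flags(s)]
    unexplained = [s for s in flagged if not critique_mentions_bug(s)]
    return len(unexplained) / max(len(flagged), 1)

samples = ["strcpy(dst, src);", "strncpy(dst, src, sizeof(dst));"]
print(discriminator_critique_gap(samples))  # 1.0: one sample flagged, none explained
```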
35:38
And I think that’s a really promising path not only to measure whether we’re making progress
35:44
in scalable oversight, because you can picture if you can catch more of these kinds of bugs that the model is aware of, then you’ve made scalable oversight better.
35:56
But also if you actually nail this right and you’re like, oh, we get to the point where
36:01
we can be super confident that if there was something fishy about the code, like there’s some problem with the code the model is aware of, you find out.
36:09
That would make you so much more comfortable with then actually executing the code to see
36:15
what happens. Yeah. Okay, just to back up a second, the basic idea is machine learning models that are capable
36:23
of doing AI research are coming whether we want it or not. I mean, many people are nervous about that because it could set up this recursive self
36:31
improvement loop. So there could be good reasons to maybe delay that moment a bit, but we’re not going to
36:37
be able to probably delay that forever. And so what we want to do when that moment comes is firstly know ways that we can use
36:46
models with those capabilities to do alignment research as well as non alignment machine
36:52
learning research. And also it’s essential that we be able to get to a place where we believe
37:01
that these models are trustworthy enough that we can kind of use the help that they’re
37:07
giving us on improving our alignment research from the stage that it’s at.
37:13
We both need to be able to figure out how we can get them to be sufficiently trustworthy that we can use those outputs and also to be able to know that we’ve succeeded at doing
37:21
that so that we in fact do. That’s the long and the short of it.
37:28
Yeah. In general, I want to be agnostic towards when exactly this is possible.
37:34
Right. When will there be automated alignment sorry, automate machine learning research, or when
37:40
will models be so smart they can do that? And there might be delays, there might be all kinds of reasons why it happens later
37:46
than sooner. The thing I really want to do is I want to be ready to use these systems for alignment
37:53
research once that becomes possible. And so what we don’t want to do is we don’t want to accelerate this or make it happen
37:59
sooner because it will happen soon enough, I think, but we want to be ready to then use
38:07
them for alignment research and be ready to kind of make alignment progress faster as
38:17
ML progress gets faster at that point. Yeah, I think an important part of the vision to keep in mind is that it might be extremely
38:25
difficult to align and figure out the trustworthiness of an AI that is just extraordinarily above
38:32
human capabilities, that is extraordinarily super intelligent because it’s just going
38:38
to have so many different ways of tricking you. But the hope here is that at the point when these models are first available, they’re
38:43
going to be more like around human level. And they might even have some areas where they’re a little bit weaker than people, but
38:50
other areas where they’re very strong. But because they’re not going to be so incredibly capable, it might be easier to figure out
38:58
whether we can trust them, because they’re not going to have so many options in their space of actions.
39:04
And they might be somewhat more scrutable because the actual things that they’re doing in their mind are closer to maybe what we’re doing than what a kind of planet sized mind
39:12
might be able to do. Well, I think many people, they might have a bunch of skepticism about this because they think, well, it’s smarter than us, so it’s going to always be able to run rings around
39:20
us. And you could maybe go out of your way to make sure that you’re not dealing with a model that’s as capable as you possibly could make in order to make it easier to evaluate the
39:29
trustworthiness. Yeah, I think that’s right. And I think that’s a really central point, right? If you’re thinking about how do you actually align a super intelligence, how do you align
39:37
the system that’s vastly smarter than humans? I don’t have an answer. I don’t think anyone really has an answer.
39:43
But it’s also not the problem that we fundamentally need to solve. Right? Because maybe this problem isn’t even solvable by humans who live today,
39:56
but there’s this easier problem which is like, how do you align the system that is the next
40:03
generation? How do you align GPT-N+1? And that is a substantially easier problem.
40:09
And then even more if humans can solve that problem, then so should a virtual system that
40:18
is as smart as the humans working on the problem. And so if you get that virtual system to be aligned, it can then solve the alignment problem
40:27
for GPT-N+1. And then you can iteratively bootstrap yourself until actually you’re at superintelligence
40:36
level and you figured out how to align that. And of course, what’s important when you’re doing this is like at each step you have to
40:44
make enough progress on the problem that you’re confident that GPT-N+1 is aligned enough that you can use it for alignment research.
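Very loosely, the bootstrapping loop could be pictured like this; every function here is a hypothetical placeholder rather than a real training pipeline:

```python
# Purely illustrative sketch of the iterative bootstrapping idea described above.
# Every function is a hypothetical placeholder, not a real training or evaluation pipeline.

def train_next_generation(model: dict) -> dict:
    return {"name": f"GPT-N+{model['gen'] + 1}", "gen": model["gen"] + 1, "aligned": False}

def align_with_assistant(model: dict, assistant: dict) -> dict:
    # The aligned generation-N model assists the alignment of generation N+1.
    return {**model, "aligned": True, "aligned_by": assistant["name"]}

def confident_enough(model: dict) -> bool:
    return model["aligned"]  # stand-in for a battery of validation evals

def bootstrap(start: dict, generations: int) -> dict:
    current = start
    for _ in range(generations):
        candidate = train_next_generation(current)
        candidate = align_with_assistant(candidate, assistant=current)
        if not confident_enough(candidate):
            raise RuntimeError("not confident enough to use this model for alignment research")
        current = candidate  # iterate at the next capability level
    return current

print(bootstrap({"name": "GPT-N", "gen": 0, "aligned": True}, generations=3))
```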
40:53
Yeah. How is the machine learning community? I’m thinking of folks who aren’t involved in safety or alignment research in particular,
41:00
how have they reacted to this plan or announcement? Yeah, I think in general people are really excited about the research problems that we
41:12
are trying to solve. And I think in a lot of ways I think they’re really interesting from a machine learning
41:20
perspective. I don’t know, I think the announcement kind of showed that we are serious about working
41:27
on this and that we are trying to get a really high caliber team on this problem and that
41:34
we are trying to make a lot of progress quickly and tackling ambitious ideals.
41:40
I think also, especially in the last kind of six months or so, there’s been a lot of
41:47
more interest from the machine learning community in these kind of problems. And I think also kind of like the success of ChatGPT and similar systems has made it
41:58
really clear that there’s something interesting going on with RLHF and there’s something interesting about this.
42:04
There’s something real about this alignment problem. Right? Like if you compare ChatGPT to the original base model, they’re actually quite different
42:11
and there’s something important that’s happening here. Yeah, I listened back to our interview from five years ago and we talked a lot about reinforcement
42:21
learning from human feedback because that was new and that was the hot thing back then. Was OpenAI, or were you, involved in coming up with that method?
42:31
I think more accurately, probably a lot of different people in the world invented it.
42:36
And before we did the deep RL from human preferences paper, there was other previous research that had done RL from human feedback in various forms, but it wasn’t using deep learning systems
42:48
and it was mostly just, you know, proof-of-concept style things. And then the deep RL from human preferences paper was joint work with Paul Christiano and
42:59
Dario Amodei and me. And I think we kind of all independently came to the conclusion that this is the way to go.
43:06
And then we collaborated. And that’s turned out to be really key to getting ChatGPT to work as well as it does.
43:14
Right? Yeah. And that’s right.
43:20
It’s kind of been wild to me how well it actually worked.
43:25
And I think if you look at the original InstructGPT paper, one of the headline results that
43:32
we had was that actually the GPT-2 sized system, which is like two orders of magnitude smaller
43:39
than GPT-3 in terms of parameter count, was actually preferred. Like the InstructGPT version of that was preferred over the GPT-3 base model.
43:49
And so this vastly cheaper, simpler, smaller system, actually, once you made it aligned,
43:54
it’s so much better than the big system. And to some extent it’s not surprising because you train it on human preferences.
44:00
Of course it’s going to be better for human preferences, but it packs a huge punch. Yeah, but also, why the hell haven’t you been training on human preferences?
44:08
Obviously that’s what you should do because that’s what you want. You want a system that humans prefer. Right. It’s kind of in hindsight, it’s like so obvious.
44:15
Yeah. Coming back to machine learning folks, I guess, what parts of the plan, if any, are they kind
44:21
of skeptical of? Are there objections that you’ve been hearing from people? Yeah, I think there’s a lot of different views still on how fast the technology is going
44:32
to develop and how feasible is it to actually automate research in the next few years.
44:38
And I think it’s very possible, but also it might not happen.
44:44
Nobody actually knows. But I think the key thing is that there’s some really deep and important problems here
44:56
that we really need to solve and that are also really tractable and that we can make
45:03
a lot of progress on over the next few years. And in fact, doing this could be incredibly impactful work because these are going to
45:15
be techniques that will shape future versions of ChatGPT and future versions of AI systems
45:21
that are actually widely applied and do lots of tasks in the economy.
45:26
And there’s a lot of kind of much easier signals that you could optimize.
45:33
Right. You could optimize AI systems to maximize customer purchases or to maximize attention.
45:40
And we’ve seen glimpses of that, looks like, over the last decade or so, and a lot of people
45:48
don’t like that. And it is signals that are fundamentally easy to measure, but they’re not aligned with humans
45:54
or what humans actually want, or like long term human flourishing. And so in some ways, as AI becomes more impactful in the world, how well we do alignment will
46:09
actually have really wide-ranging consequences and shape society in lots of ways for better
46:18
and worse. And so I think it’s really paramount that we do an excellent job at this.
46:24
Okay, so you mentioned a couple of different ways that things might get automated or ways
46:29
that you might be able to use these ML tools. So there was scalable, oversight and generalization and interpretability.
46:39
I don’t fully get what generalization is as a cluster. Is it possible to explain that again and maybe elaborate a bit more?
46:47
Yeah, so fundamentally we want to be able to distinguish does the system generalize
46:55
true human intent, or does it generalize to doing what the human says whenever they’re looking,
47:04
but doing something else otherwise. Right, and these are like two different generalizations.
47:09
They’re entirely consistent with the data because the behavior is all the same whenever
47:14
we are supervising. But generalization is fundamentally a problem about the model and the data.
47:22
So why can’t we just go and try to understand it, right? So for example, what we’re doing right now is we’re studying this in a toy setting.
47:32
And the way that you could do this is you take a data set and you look at what does
47:38
a small language model get correct? And let’s say we call these the easy parts of the data set and then we call the rest
47:45
the hard part. And so now you can ask questions like: what if we only train on the labels for
47:52
the easy part and we see how well we can generalize to the hard part of the data set? Or what kind of tricks could we put into the model training that would make us generalize
48:02
better. Or another thing you could do is you could just make a lot of labels from a small model.
48:10
And the analogy here is if you have humans supervising systems that are smarter than
48:15
them or as smart as them, in some ways we’ll be weaker than that system and our labels
48:21
will be worse than what the system could do. So how can you recover, let’s say, the accuracy that you would get if you just trained on
48:29
the ground truth labels in the first place by only using the weak labels or only using
48:34
the labels on the easy questions. And there’s some really concrete experiments we can run here that could then tell
48:43
us a lot about how this is going to go in the real case. And then once we have that and we have developed some tricks, can we use the tricks in a more
48:52
real setting? Can we generalize from, let’s say, labels by a small language model on the ChatGPT
49:01
preference data set to the actual real ChatGPT tasks done by GPT-4?
49:09
I think these are really interesting questions that we can actually run experiments on and learn a lot.
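A toy version of that easy-to-hard, weak-label experiment can be sketched with ordinary scikit-learn classifiers standing in for the small and large language models (illustrative only; the real experiments use language models and preference data):

```python
# Toy sketch of the weak-label generalization experiment: a small "weak supervisor" produces
# imperfect labels, a bigger "strong student" trains only on those labels, and we ask how much
# of the gap to the ground-truth ceiling can be recovered. Scikit-learn models are stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Weak supervisor": a small model with imperfect labels, like humans supervising
# a system smarter than them.
weak = LogisticRegression(max_iter=200).fit(X_train[:500], y_train[:500])
weak_labels = weak.predict(X_train)

# "Strong student" trained only on weak labels, versus the same model trained on ground truth.
strong_from_weak = RandomForestClassifier(random_state=0).fit(X_train, weak_labels)
strong_from_truth = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("strong from weak labels: ", strong_from_weak.score(X_test, y_test))
print("strong from ground truth:", strong_from_truth.score(X_test, y_test))
# The interesting question: what training tricks close the gap between the last two numbers?
```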
49:14
And I think that not only is this highly relevant for the kind of alignment problems that we
49:20
actually want to solve where we’re trying to get it to generalize correctly in settings
49:25
that are hard for us to supervise. But also I think we’ll learn some really interesting fundamental things about how neural networks
49:31
learn. Yeah. Have any interesting experiments on generalization been run already? Are there papers on this topic?
49:40
There’s a bunch of research and literature. I think it’s actually surprisingly small on this kind of out-of-distribution generalization.
49:54
I think we’ll probably have a pretty exciting paper in two or three months on this topic
50:01
And if you find this exciting: the research team that is working on this is hiring right now. We are looking for, let’s say, a manager for this team.
50:09
If you want to be part of making this research happen and writing our first ever paper, now
50:18
is the time. Yeah, it sounded like there was maybe going to be a project to create a model that
50:26
engages in deception in particular cases. A kind of... what’s the term? It feels like we need an E. coli or we need a Drosophila.
50:35
So it’s like a model organism of bad behavior in order to be able to study it and see when does it appear, under what conditions, how can you reduce it.
50:44
Is that part of this whole agenda? Yeah, I actually really like how you describe it.
50:51
And so kind of we called it adversarial testing, which is maybe a lot more mundane.
50:56
But the goal here would be if we deliberately train a model that is deceptive or tries to
51:06
deceive us, but maybe in a really mundane way, maybe it just tries to run a specific
51:12
python function that does nothing actually, but we can really easily measure whether it
51:17
succeeded at doing so in a way that evaded other oversight techniques we’re using, that
51:24
will give us this model organism that you described. Because now we have the system that we want to avoid and we can actually study it.
51:31
Like we could, for example, look at it with interpretability tools and be like, can we tell the difference from how this model thinks compared to how the normal model that
51:40
we train thinks, or can we use our generalization tricks on this model to kind of make it avoid
51:48
that pitfall when we are training it? Or, for example, another property that you’d really want to generalize is like
51:56
tell me the truth, or maybe tell me your best guess at the truth.
52:03
And to some extent language models have a lot of understanding of the world or they
52:11
know a lot more about the world than humans do, right. But they’re not always necessarily telling us that.
52:18
And in fact, if you look at how RLHF training works, you’re rewarding the things that you think
52:25
are true, and some of the things that you think are true are not actually going to be true. And so you’re training the model to make things up, to basically lie. But also, conversely, some things
52:37
you know are true but the model doesn’t know. The basic example is just like anything that happened since the training data cutoff, the
52:44
model just has no chance of knowing that. And then there’s stuff that you don’t know but the model is very confident on, rightfully
52:52
so. And then for example, if it knows a lot about biology, I don’t know that much about biology, it knows a lot of biology facts that I don’t know and I wouldn’t be able to reliably reward
53:02
it for saying this. So when you’re doing this kind of RLHF training, you’re actually making it worse in both cases.
53:08
In the first case you’re training it to make up stuff that isn’t true but that could
53:16
be plausible maybe. And so you’re increasing hallucinations actually. And in the second case you’re sandbagging the model because you’re incentivizing it
53:26
to not tell you something that it actually knows. And there’s probably actually very few things there where the model is exactly as knowledgeable
53:35
as the human. And so in a way it’s kind of crazy that RLHF works as well as it does.
53:43
But to come back to the generalization question, what if you could just somehow get to the
53:51
model’s best representation of the truth? It won’t be perfect, but it has some kind of guess about what’s actually true and we want
53:58
to get the model’s best opinion. And one way you could phrase this is as a generalization question, where we just take a bunch of easy
54:06
cases where we’re just very confident in the answer. Who was the first US President?
54:13
And then what is two plus two? These kinds of questions where it’s like, well, we have a right to be very confident, and then
54:23
you generalize to hard questions where we’re not that confident, I don’t know, some kind
54:29
of biology question where I don’t know the answer. Or there’s cases where maybe there’s very few experts in the world that know the answer
54:35
but the model does. And if you nail that now, you can use your model as a reward model or for monitoring
54:45
or oversight, where you ask it: is there a secret hidden flaw in this code, or is there
54:52
kind of any bugs I should be aware of? Or is the behavior of this other model potentially an attempt at deceiving us or self exfiltration
55:04
or any of these other things that we are worried about? And if we can nail that problem, we can make a lot of progress there and that would be
55:13
very useful. Yeah, okay. Yeah. Let’s talk a little bit more about interpretability.
55:18
I guess to tell you where I’m at with Interpretability, it’s very impressive and interesting that
55:26
people have managed to figure out what algorithms are these neural networks working in order to perceive a particular texture in an image or in order to do a particular piece of inference
55:35
within a sentence or in order to figure out what’s the name and how do I make sure that the name is consistent. But then I’m not sure how that would help me to align a system, because it’s just like
55:46
all of these quite small things, and it doesn’t feel like it’s adding up to telling me what are the goals and the intentions of this model.
55:55
I guess Ajeya Cotra pointed out in my interview with her a few months ago that you could potentially do a much higher level of interpretability where you would get a model to tell you the
56:05
truth a bunch of times and lie to you a bunch of times, and then see what parts of the network kind of light up when it’s in deceptive mode, when it’s engaged in lying.
56:12
And maybe having interpretability at that higher level of behavior could turn out to be straightforward to figure out.
56:19
And that sounds like it could be super helpful. Yeah. What sort of lines of attack on interpretability that would be useful do you think you might
56:25
be able to partially automate? Ultimately, you probably just want kind of both aspects of this, right?
56:34
You want something that really works on the minute detail of how the model works so that you don’t miss anything important, but at the same time, you have to look across the
56:44
network because the thing you’re looking for might be anywhere.
56:50
And so if you want both things at the same time, it’s really lean. There’s not that many things that have this property.
56:57
And in particular, the way that humans do interpretability historically is just like, you stare at parts of the model and see if you can make sense of them, which gives you
57:05
one of them, but not all.
57:12
We just released a paper on automated interpretability which tries to do both at the same time.
57:19
And it’s kind of like a first attempt, so it’s like, simplified. And what we do is we ask GPT-4 to write explanations of behavior of individual neurons.
57:30
And so we just pipe a bunch of text through the model, recording how much the neuron
57:36
activates at each particular token. And then you can ask GPT-4 to just look at that and just write an explanation.
57:48
And on average, these explanations are not very good. Sometimes they’re good and sometimes they’re interesting. And this is like how, for example, we found the Canada neuron that fires at Canada related
57:58
concepts. And this is something GPT-4 understood and pointed out and just wrote this explanation.
58:07
And then even more, you can measure how good these explanations are where you run them
58:14
on a held-out piece of text and get GPT-4 to predict how a human would label the activations
58:22
based on the explanation alone. And now you have two things. You have this automated explanation writing thing, and then you have the automatic scoring
58:33
function. And now you’re in business because A, you can optimize the score function and you can
58:42
do all kinds of things. Like, for example, we did iterative refinements where you critique or revise the explanations
58:49
and it will get higher on the score function. And at the same time, you can also improve your score function by having it more accurately
58:57
model how humans would predict how the neuron would activate, or by plugging in a more
59:04
capable model. And there’s some problems with this approach too.
59:10
And for example, neurons are probably not the right level of abstraction that you want
59:16
to interpret the model in, because neurons do a lot of different things. This is what people call polysemanticity, and it's hard to write an explanation that
59:27
covers all of the cases. But one thing that’s really nice is you could really run this at scale.
59:36
And so we ran it over all neurons in GPT-2, and that's a lot of neurons. It was like 300,000 neurons.
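To make the explain-then-score loop concrete, here is a minimal sketch. The helper names (`record_activations`, `ask_model`) are hypothetical placeholders, not the actual pipeline from the paper, and the scoring step fakes the model-simulated activations so the example runs on its own.

```python
import numpy as np

def record_activations(neuron_id: int, tokens: list[str]) -> list[float]:
    """Placeholder: the real pipeline runs text through the subject model
    (e.g. GPT-2) and records how strongly one neuron fires per token."""
    rng = np.random.default_rng(neuron_id)
    return rng.random(len(tokens)).tolist()

def ask_model(prompt: str) -> str:
    """Placeholder for a call to an explainer model such as GPT-4."""
    return "fires on tokens related to Canada"  # e.g. the 'Canada neuron'

def explain_neuron(neuron_id: int, tokens: list[str]) -> str:
    acts = record_activations(neuron_id, tokens)
    shown = ", ".join(f"{t}:{a:.2f}" for t, a in zip(tokens, acts))
    return ask_model(f"Explain this neuron from (token, activation) pairs: {shown}")

def score_explanation(neuron_id: int, explanation: str,
                      held_out_tokens: list[str]) -> float:
    """Score = correlation between the neuron's true activations on held-out
    text and activations predicted from the explanation alone."""
    true_acts = np.array(record_activations(neuron_id, held_out_tokens))
    # Placeholder for asking a model to simulate activations from the
    # explanation text; here we just add noise to the true values.
    predicted = true_acts + np.random.default_rng(0).normal(0, 0.1, len(true_acts))
    return float(np.corrcoef(true_acts, predicted)[0, 1])

tokens = "I drove from Toronto to Ottawa with maple syrup".split()
explanation = explain_neuron(neuron_id=1234, tokens=tokens)
print(explanation, score_explanation(1234, explanation, tokens))
```

Because both halves are automated, you can iterate: revise explanations to push the score up, or swap in a stronger explainer model, which is the refinement loop described above.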
59:41
And you get a lot of text, and you can then sift through it and you can try to look
59:46
for certain things. But you could theoretically run this on GPT-4.
59:53
It would be really expensive. And then presently it wouldn’t be worth it because the explanations just aren’t good
59:59
enough. But it has this nice aspect where you’re really looking at every part of the model.
1:00:04
You’re really looking literally at every neuron and trying to explain what it does. And at the same time you’re running over the whole model, every neuron will be tries to
1:00:17
explain every neuron. And so if we have a technique like that that actually works really well, that would be a complete
1:00:25
game changer. Yeah. Okay. So part of the idea here is that having a whole team of humans laboriously figure
1:00:34
out that there’s a neuron that corresponds with Canada is not very satisfying. It’s not clear where we get from that.
1:00:40
But if you could automate it such that you had the equivalent of thousands or millions of staff basically scrutinizing and trying to figure out what each part of the neural
1:00:49
network was doing — which you might be able to do if you could automate it — then maybe that would add up to an interesting picture, because
1:00:55
you could really see, well, here are the hundred concepts that were activated when this answer was being generated.
1:01:02
It was Canada, but it was also a particular person and a particular place and a particular
1:01:08
attitude, maybe. And that really would actually help you to understand on some more intuitive human level
1:01:15
what was going on. Yeah, exactly. And I think a really nice aspect of this is also that it kind of gives you a glimpse
1:01:21
of what future automated alignment research could be like.
1:01:27
Really, you can run this at a large scale, you can dump a lot of compute into it, and
1:01:32
you can do various traditional capability tricks to make it better. But also, the task that it actually does is not exactly the task that a human had previously
1:01:45
done. Right. We didn’t hire a bunch of humans who meticulously goes to nuance in the model and try to write
1:01:51
explanations. That was never an option because it never made sense before. Right.
1:01:57
Is it the case that a particular model is best or has a particular advantage at explaining
1:02:02
itself? It feels intuitive to me that GPT-4 in some sense might have its best understanding of GPT-4’s neurons.
1:02:08
And so, I don't know, could you look at your own neurons and explain them? Seems hard.
1:02:14
No. Okay. But the intuition is coming from if someone noticed that I had a whole lot of different
1:02:20
concepts that were associated for me, and I would bring them up at the same time, and someone said, you know, what do Canada and the color brown and maple syrup have in common?
1:02:29
I’m like, well, I messed up that know, but I know what things are related to me in my
1:02:36
own mind, even if I can’t look at the neurons. Yeah. And also there’s this really cool thought experiments here where let’s say you had a
1:02:44
perfect brain scanner on your brain that worked perfectly with no lag time, and you would
1:02:50
just stare at it while you’re thinking about stuff. Of course, it would be a very trippy experience, but also it would probably actually let you
1:02:58
figure out how your brain works in a bunch of ways by just sitting there and trying to
1:03:04
think about stuff and then seeing what happens in your brain. And that would just be wild. And humans can’t do that.
1:03:11
We don’t have the brain scares, but you could literally do that with GPT-4. Yeah.
1:03:17
Okay. I suppose a skeptic might say we’re going to figure out at the granular level what functions
1:03:24
maybe some of these neurons are serving, what concepts they correspond to, and so on. But then it feels like there's further steps missing before we can use that to really figure
1:03:33
out whether a model is aligned. Do you have any ideas for what those further steps would be? Yeah.
1:03:39
In particular, I think this kind of interpretability seems very hard.
1:03:45
It’s hard because there’s no UPRI reason why the model should be using very human like
1:03:50
concepts to think about stuff. Human-like concepts are probably somewhere in there, because they're just empirically useful.
1:03:58
Right. That’s why we use them and that’s why we pointed to them. And so they’re probably in there.
1:04:04
And there’s some concepts that are particularly interesting for alignment research that we would want to be looking for.
1:04:09
Like deception and lying and other things like that are pretty critical to how we want
1:04:20
to solve this problem. And so if you had some kind of way of automatically surfacing them, I think that would be a big
1:04:26
win. I think also in general, I think interpretability is a really good candidate for a validation
1:04:33
technique where let’s say we’ve figured out scalable oversight or we have a scalable oversight
1:04:40
technique we are really excited about and we use it to align a model. And then now we are at this question where we’re like we want to know how good of a job
1:04:49
we’ve done and using the same technique is not good enough. And interpretability, you can then come in and if you have tools that work really well,
1:04:58
you could try to come in and ask the question of: can you find any evidence of deceptive
1:05:05
alignment, or deception, or plotting against humans, or trying to figure out how to self-
1:05:11
exfiltrate inside the model. And if we do find that, that's a really bad sign, and we shouldn't just train it out.
1:05:18
You can’t train against the interpretability tools, right? Like you’ll just make them useless or that’s likely what will happen.
1:05:27
But it’s like a validation technique where if you don’t find that and you have good techniques
1:05:33
that you know could find it, that's some evidence that it is actually as aligned as you think it
1:05:39
is. So in this sense, any amount of interpretability
1:05:47
progress you can make I think can be really helpful for this kind of stuff. At the same time, if we really nail interpretability, I don’t know how that will let us solve alignment,
1:05:58
right, even if we really understand how it works. And then you can try to fiddle with various dials to make it more aligned, but it’s not
1:06:06
clear that that path will easily succeed if humans try to do that.
1:06:12
But at the same time, maybe there’s also a path to making a human level automated alignment
1:06:22
researcher sufficiently aligned to help us, really help us do this with no interpretability
1:06:27
at all. I think that’s also plausible, but whatever we can do will help and I’m excited to get
1:06:34
as far as possible, because we have these perfect brain scanners — it would
1:06:40
be insane not to use them. Have there been any interesting papers published on scalable oversight or interesting results
1:06:47
that have come out? There’s been a bunch of interesting work in the past year or so and I think it’s not just
1:06:55
us — I know DeepMind and Anthropic are also trying hard to make it work.
1:07:04
I want to talk a little bit about the critiques work that we did last year because I think
1:07:09
there are some really interesting insights there. So the basic idea here was: if we can train a model to write critiques, we can then show
1:07:18
these critiques to human evaluators and then see if they can help the human evaluators
1:07:27
make better decisions or better evaluations. And in some sense critiques are like the simplest form of assistance, right?
1:07:36
It’s like a one off, it’s not interactive and it’s just like you’re just trying to point out one flaw.
1:07:42
And it’s also easy in the sense that it doesn’t even have to be a good or accurate critique. You just show a whole bunch and the human will just throw out the ones that they think
1:07:50
are bullshit. But sometimes the critique will point out a flaw that the human would have missed.
1:07:57
And in fact, that’s what we could show. And this is experiments done on GPT 3.5. So this has been a while ago and we did these randomized controlled trials where we had
1:08:08
humans who would either get assistance or not. They had to find problems in a summarization task, and you can actually show
1:08:17
that the critiques that we had from 3.5 already would help humans find 50% more flaws.
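As a rough illustration (not the actual experimental code), that kind of randomized comparison could be sketched like this, where `human_finds_flaws` stands in for a real rating interface and all the numbers are made up:

```python
import random
from typing import Optional

def human_finds_flaws(task: dict, critiques: Optional[list]) -> int:
    """Placeholder for a human rater reviewing one summary, optionally with
    model-written critiques shown alongside. Returns number of flaws found."""
    found = sum(random.random() < 0.4 for _ in range(task["true_flaws"]))
    if critiques:
        # A critique sometimes points out a flaw the rater would have missed.
        found += sum(random.random() < 0.3 for _ in critiques)
    return min(task["true_flaws"], found)

tasks = [{"true_flaws": random.randint(1, 5)} for _ in range(200)]
results = {"assisted": [], "unassisted": []}
for task in tasks:
    if random.random() < 0.5:  # randomized assignment, as in an RCT
        results["assisted"].append(human_finds_flaws(task, ["critique A", "critique B"]))
    else:
        results["unassisted"].append(human_finds_flaws(task, None))

for group, found in results.items():
    print(group, sum(found) / max(1, len(found)), "flaws found per task on average")
```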
1:08:26
And so I think one of the most interesting things about this work was actually that we
1:08:32
have this methodology for evaluating how well it’s working, right? And there’s other ways you can evaluate this too.
1:08:42
For example, you can look at expert labels versus helping non experts find the flaw or
1:08:52
do the evaluation. But that fundamentally only works if you have access to expert labels.
1:09:00
In the general case, that just won’t be true. You want to solve a real task that is really hard and that humans really struggle to evaluate.
1:09:07
They won’t be good to evaluate it. And for example, with the code tasks we talked about earlier is like if you want to find
1:09:16
all the flaws in the code the model knows about, humans won’t find those. Humans are terrible at finding bugs in code.
1:09:22
That’s where there’s so much buggy code in the world. But the simple trick is you can introduce bugs in the code and then you know which version
1:09:32
of the code is more buggy because you made it worse. And so what I’m excited about is fundamentally I want to try all of the scalable oversight
1:09:42
ideas that have been proposed and there’s actually measure which of them works best and how well they actually work.
1:09:49
These are ideas like recursive reward modeling: how can you get AI assistance to help humans evaluate what AI is doing?
1:09:59
Or debate, where you have two AIs that debate each other on a question, and you have a human judge who decides which of them made the more useful statements.
1:10:10
Or you could have decomposition where you’re breaking the task down into smaller chunks
1:10:19
and you try to solve those. Or you could do that with your evaluation.
1:10:25
There’s automated market making where you try to change the human’s mind maximally with the assistance and there’s a whole bunch of these variants and I feel like I have my personal
1:10:36
bets on which of them are going to work best. But I just want to empirically see the results.
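As one illustration of what an implementation of one of these proposals might look like, here is a toy version of the debate setup; the `debater` and `judge` calls are placeholders for real models and a real human judge, not any existing system.

```python
def debater(name: str, question: str, transcript: list[str]) -> str:
    """Placeholder for an AI debater producing its next argument."""
    return f"{name}: argument about '{question}' given {len(transcript)} prior turns"

def judge(question: str, transcript: list[str]) -> str:
    """Placeholder for the human judge deciding who made the more useful case."""
    return "A" if len(transcript) % 2 == 0 else "B"

def run_debate(question: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(debater("A", question, transcript))
        transcript.append(debater("B", question, transcript))
    return judge(question, transcript)

print("winner:", run_debate("Is this code change safe to deploy?"))
```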
1:10:42
And I think what’s really exciting is I think we can just measure it and it would be so
1:10:47
much better than arguing over it. There’s a lot of people out there who are about as informed as you who feel that the
1:10:54
technical alignment problem is probably extremely hard, and an effort like this probably only has a slim likelihood of success.
1:11:01
But you’re pretty optimistic about things in the scheme of it. What developments or results have there been or that have come out in the last ten years
1:11:11
that have kind of made you have this level of optimism? Yeah, I think actually a lot of developments over the last few years have been pretty favorable
1:11:22
to alignment. Right. Large language models are actually super helpful because they can understand natural language.
1:11:31
Right. They know so much about humans. You can ask them what would be a moral action under this or that philosophy, and
1:11:38
they can give you a really good explanation of it. Being able to talk to them and express
1:11:50
your views makes a lot of things easier. At the same time, they're in some sense like a blank slate, where you can fine-tune them
1:11:58
with fairly little data to be so effective. And so if you compare this to the path to AGI, or how the development of AI looked
1:12:11
a few years ago, it seemed like we were going to train some deep RL agents in an environment
1:12:17
like Universe, which is just a collection of different games and other environments.
1:12:23
And so they might get really smart, like trying to solve all of these games, but they wouldn’t
1:12:30
necessarily have a deep understanding of language or how humans think about morality, or what
1:12:35
humans care about, or how the world works. The other thing that I think has been really favorable is what we’ve seen from the alignment
1:12:44
techniques we’ve tried so far. So, like I already mentioned, Instruct GPT worked so much better than I ever had hoped
1:12:53
for. Or even when we did the deep RL from human preferences paper: I came into it thinking there was a more than even chance we wouldn't be able to make it work that well
1:13:03
in the time that we had. But it did work, and InstructGPT worked really well.
1:13:08
And to some extent you could argue, well, these are not techniques that align super
1:13:13
intelligence, so why are you so optimistic? But I think it still provides evidence that this is working, because if we couldn’t even
1:13:24
get today’s systems to align, I think we should be more pessimistic. And so the converse also holds, right?
1:13:31
I guess so a skeptic might say we’ve seen improvement in our prospects of these models,
1:13:37
knowing what it is that we want or knowing what it is that we care about, but maybe we haven’t seen evidence that they’re going to care about what we care about.
1:13:45
The worry would be the model is going to know perfectly well what you’re asking for, but that doesn’t mean that it shares your goal.
1:13:51
It could pretend that it’s doing that right up until the moment that it flips out on you. Have we seen any evidence for the second thing that the models actually share our goals,
1:13:59
or is that still that’s still kind of a black box? I mean, I think this is a really important point and I think that’s pretty central to
1:14:05
some of the main worries about why alignment might not go well. I do still think that the model actually understanding what we want is an important
1:14:15
first step, but then the main question becomes how do you get them to care? And that’s like the problem that we are trying to figure out.
1:14:24
But the first one is great if you already have that. Yeah.
1:14:30
Would you venture to say what your, I guess people call it p doom or what’s the probability that you’d assign to a very bad outcome from AI?
1:14:37
And has that gone up or down over the last year? I don’t think it’s a really useful question because I think at least I personally feel
1:14:49
like my answer would depend a lot more on my current mood than any actual property of
1:14:54
the world. And I think in some ways I think what’s definitely true is the future with AI could go really
1:15:01
well or it could go really badly. And which way it goes, I think is still so much up in the air and I think humans just
1:15:11
have a lot of causal ownership over which path we’re going down. And I think even individuals or individual researchers can have a big impact in direction
1:15:23
that we’re heading. And so I think that’s the much more important question to focus on.
1:15:29
And then if you actually wanted to give a probability of doom, I think the reason why it’s so hard is because there are so many different scenarios of how the future could
1:15:38
go and if you want to have an accurate probability, you need to integrate over this large space. And I don’t think that’s fundamentally helpful.
1:15:45
I think what’s important is how much can we make things better and what are the best paths
1:15:52
to do this. Yeah, I didn’t spend a lot of time trying to precisely pin down my personal P doom because
1:15:58
I suppose my guess is that it’s more than 10%, less than 90%. So it’s incredibly important that we work to lower that number, but it’s not so high
1:16:06
that we’re completely screwed and that and there’s no hope. And kind of within that range, it doesn’t seem like it’s going to affect my decisions
1:16:14
on a day to day basis all that much. So I’m just kind of happy to leave it there. Yeah, I think that’s probably the range I would give too.
1:16:21
So you asked me why am I optimistic? And I want to give you a bunch more reasons because I think there’s a lot of reasons and
1:16:28
I think fundamentally the most important thing is that I think alignment is tractable.
1:16:36
I think we can actually make a lot of progress if we focus on it and we put an effort into
1:16:42
it. And I think there’s a lot of research progress to be made that we can actually make with
1:16:48
a small dedicated team over the course of a year or four. Honestly, it really feels like we have a real angle of attack on the problem that we can
1:17:01
actually iterate on, we can actually build towards. And I think it’s pretty likely going to work, actually.
1:17:07
And that’s really wild and it’s really exciting. We have this hard problem that we’ve been talking about for years and years and now
1:17:17
we have a real shot at actually solving it and that’s be so good if we did.
1:17:22
But some of the other reasons why I’m optimistic is like, I think fundamentally evaluation
1:17:31
is easier than generation for a lot of tasks that we care about, including alignment research,
1:17:37
which is why I think we can get a lot of leverage by using AI to automate parts or all of alignment
1:17:44
research. And in particular, you can think about classical computer science problems like P versus
1:17:51
NP. You have these kinds of problems where we believe it's fundamentally easier to evaluate.
1:17:57
It’s true for a lot of consumer products. If you’re buying a smartphone, it’s so much easier to pick a good smartphone than it is
1:18:02
to build a smartphone. Or in organizations, if you’re hiring someone, it has to be easier to figure out whether
1:18:11
they’re doing a job than to do their job. Otherwise you don’t know who to hire.
1:18:18
Yeah, and it wouldn’t work. Or if you think about sports and games, sports wouldn’t be fun to watch if you didn’t know
1:18:27
who won the game. And yeah, it can be hard to figure out was the current move a good move?
1:18:33
But you’ll find out later and that’s what makes it exciting. You don’t know whether you have this tension of like, oh, this was an interesting move,
1:18:41
what’s going to happen? But in the end of the game, you look at the chessboard, you look at the goal board, you
1:18:46
know who won. At the end of the day, everyone knows. Or if you’re watching a soccer game, the ball goes in the goal.
1:18:54
It’s a goal. That’s it. Everyone knows. And I think it is also true for scientific research, right.
1:19:03
There’s certain research results that people are excited about even though they didn’t
1:19:08
know how to produce them. And sometimes we are wrong about this, but it doesn’t mean that we can do this task perfectly.
1:19:14
It’s just that it’s easier. Yeah. A criticism of this approach is if we don’t know how to solve the alignment problem, then
1:19:21
how are we going to be able to tell whether the advice that these models are giving us on how to solve it is any good?
1:19:26
And you’re saying, well, just often it can be a lot easier to assess whether a solution is a good one or whether something works or not than it is to come up with it.
1:19:34
And so that should make us optimistic that we don’t necessarily have to generate all of these ideas ourselves.
1:19:41
It might be just sufficient for us to be able to tell after they’ve been generated whether they’re any good or not. And that could be a much more.
1:19:47
Straightforward that’s exactly right. And then there’s other things. I think we can actually set ourselves up for iteration.
1:19:54
I think we can just stare at the current systems, we can improve their alignment, we can do
1:19:59
stuff like measure whether we’re finding all the bugs that the model is aware of and we
1:20:05
can set ourselves these metrics. And yeah, I mean, they’re not going to take us all the way to aligning super intelligence,
1:20:12
but they will be super helpful for making local improvements. And if your goal is let’s align a system that could help us do alignment research, one really
1:20:24
good testing ground is, like: can you make GPT-5 more aligned?
1:20:30
Maybe the techniques that you actually need or that you actually care about won't really work that well on GPT-5 yet, who knows?
1:20:37
But if you’re not making progress along the way, I don’t think it’s really hard to make
1:20:45
the case that you’re actually making progress towards the actual goal. And at the same time you need some kind of feedback signal from the real world to know
1:20:53
that you’re improving, you’re like doing something that’s real, you have to do that carefully.
1:20:59
Obviously you can set up an eval that doesn't matter, but that's part of the challenge here.
1:21:05
Yeah. Any other reasons for optimism? I think the other really good one is like, well, we’re not actually trying to align the
1:21:14
system that’s vastly smarter than us. And it’s always hard if you picture a dumber system aligning a smarter system, and if you
1:21:22
make the differential really large, it seems so daunting. But I think it’s also not the problem that we actually realistically have to aim for,
1:21:31
because we only have to aim for this system that's human-level, or roughly as smart as the smartest alignment
1:21:41
researchers. And if you can make that really aligned, then you can make all the progress that you could
1:21:47
make on this problem. Originally, when I set out to work on alignment research, right, this realization wasn’t clear
1:21:56
to me and I was like, oh man, this problem is hard, how do we do it?
1:22:02
But if you’re shooting for this much more modest goal, this minimal viable product,
1:22:07
it actually looks so much more achievable.
1:22:12
Yeah. So could you stylize the approach as saying: don't obsess about whether you can align GPT-
1:22:19
20, let’s work on aligning GPT five, and then in collaboration with GPT Five, we’ll figure out how to align GPT six, and then in collaboration with all of them, we’ll work together to align
1:22:29
GPT seven. That’s kind of the basic idea. Yeah. And you want to do this empirically, maybe you look at GPT five.
1:22:35
And you’re like, well, the system still isn’t smart enough. Right? So we tried this a whole bunch with GPT-4, like, trying to get it fine, tune it on alignment
1:22:43
data, trying to get help in our research. It just wasn't that useful. That could happen with GPT-5, too.
1:22:49
But then we’ll be like, okay, let’s focus on GPT six. But we want to be on the ball when this is happening, and we want to be there when this
1:22:58
becomes possible and then really go for it. Okay, that’s a bunch of reasons for optimism.
1:23:03
I want to go through a couple of objections or ways that this might not work out as hoped.
1:23:10
I guess one that I’ve seen a lot of people mention is just how are you going to be able to tell whether you’re succeeding?
1:23:17
You might think that this is working, but how would you ever really have confidence? And especially if there’s successful deception going on, then you could be lulled into a
1:23:26
false sense of security. What do you think about that — how could you tell? This is one of the central problems.
1:23:33
How do you distinguish the deceptively aligned system from the truly aligned system? And this is the challenge that we're trying to figure out.
1:23:40
This is why we’re looking at, can we get the model to tell us all the bugs that it’s aware
1:23:45
of? And this is why we want to train deceptively aligned models, to see if they can pass our evals, and stress-test our methods and really drill into what's going on inside
1:24:00
of the model. I think we can learn so much about this problem and really scope and understand the risks
1:24:09
that remain, or the areas where we are most uncertain about how it could deceive us.
1:24:15
Yeah, it could fail at the first step, perhaps, where the first model that you’re trying to
1:24:22
collaborate with in this project isn’t aligned, but you don’t realize that. And so it just starts leading you down a bad path, and then at some point, things will
1:24:29
go badly. But ultimately, the problem was there at the very beginning. And then I guess you could also start out well, but then not be able to tell whether
1:24:39
the further iterations are going in the right direction. Like, problems could creep in there without you noticing them.
1:24:45
And so that could lead you down a bad path. And I guess, essentially, you're just saying this is the problem that we have to solve.
1:24:52
Yeah, things might fail in all of these different ways, and that’s why we need people to come and figure out how to gain confidence.
1:24:59
Exactly. And I think fundamentally, the thing I'm much more worried about is the question of whether we can really
1:25:10
precisely know how aligned the system is, than I am about the question of how we can make
1:25:16
it more aligned. Because I think a lot of the risks come from uncertainty about how aligned the system actually
1:25:22
is. So in the sense that I don’t think anyone will be excited to deploy a system that you
1:25:29
know is misaligned and that wants to take over the world.
1:25:37
So if you can precisely measure how aligned the system truly is, or if you’re confident
1:25:43
in your measurement apparatus that tries to understand how aligned the model is, then
1:25:49
I think you’ve actually solved a large part of the problem. Because then you know where you’re at.
1:25:56
And then you can much more easily work on methods that improve alignment. And you have to be careful the way you do it so you don’t train on the test set.
1:26:05
But I think fundamentally, a lot of the problem is knowing exactly where you're at. Yeah.
1:26:12
Someone from the audience had this question: How do you plan to verify ahead of time, before the first critical try, that the alignment
1:26:20
solution proposed by AI scales all the way to superintelligence and doesn’t include accidental or intentional weaknesses.
1:26:27
And what happens if it does? I guess it’s just people are very nervous that if this doesn’t work out, it’s pretty
1:26:34
scary, honestly. It’s kind of like a really high stakes problem. And that’s, I think, what makes it so important to work on.
1:26:42
But also I think it’s really oversimplified to have a mental picture where we have this
1:26:49
automated alignment researcher. We press a button, it says, here's what you should do, and then we just do it and hope for the best.
1:26:56
I don’t think that’s the first thing the system does is align super intelligence. I think it’ll just align Gbdn plus one.
1:27:03
And we’ll be very in the loop at looking all of the results, and we’ll publish it and show
1:27:08
it to others and be like, what do you think about this result? Do you think this is a good idea? Should we do that?
1:27:13
And I think at the same time we’ll have all of these other tools, we’ll hopefully have
1:27:19
much better interpretability, we’ll understand robustness of our models much better, or we
1:27:26
have a lot of automated tools to monitor the system as it's doing its alignment research, where all these automated tools will be looking over its shoulder and trying to make sense
1:27:36
of what’s going on. Or if we can really understand the generalization on a fundamental level, can we have a system
1:27:44
that we are much more confident generalizes the way humans would actually want and not the ways that we would say we want, or ways that we can check or something?
1:27:56
And if we fundamentally understand these problems, or we do a good job at improving in these
1:28:02
directions, I think we’ll just have so much more evidence and so much more reasons to
1:28:08
believe the system is actually doing the right thing or it’s not. And that’s what we’re trying to figure out.
1:28:14
Yeah. So the announcement of this project says we don’t know how to align superintelligence
1:28:21
now, and if we deployed superintelligence without having a good method for aligning
1:28:26
it, then that could be absolutely disastrous. What happens if in four years time, you think that you haven’t solved the issue or in eight
1:28:34
years time or ten years time, you’re just like, well, we’ve been working at it, we’ve made some progress, but I don’t have confidence that we’re close to being able to align a
1:28:43
superintelligence. But the capabilities have really gone ahead and we might be close to deploying the kind
1:28:48
of thing that you would be really worried about deploying if it weren’t aligned. Is there a plan for how to delay that deployment if you and your team just think it’s a bad
1:28:59
idea? Yeah, I think the most important thing at that stage is we just have to be really honest
1:29:05
with where we’re at. And in some ways, I think the world just needs it will demand us to be honest right.
1:29:13
And then not just say what we truly believe, but also show all the evidence that we have.
1:29:19
And I think if you get to this point where the capabilities are really powerful but at
1:29:26
the same time our alignment methods are not there, this is when you really should be making the case for, like, hey, we should all chill out.
1:29:35
Doesn’t this isn’t primarily about openei, right? This is this point there’s just like, you got to get all the AGI labs together and figure
1:29:45
out how to solve this problem, or allocate more resources, or slow down capabilities.
1:29:52
I don’t know what it will happen, but I think the prerequisite is still like, you got to
1:30:00
figure out where you’re at with alignment. Right. We still have to have tried really hard to solve the problem in order to be able to say,
1:30:10
look, we tried really hard. Here are all the things we tried. Here are the results, you can look at them in detail.
1:30:17
And if you looked at all of this, you would probably come to the same conclusion as us, which is
1:30:22
like, we don’t think we’re there yet. And that’s why I’m saying we just need to be really honest about it.
1:30:29
And this is why we're also making this commitment: we want to share the fruits of our effort widely.
1:30:37
We want everyone else’s models to be aligned too. We want everyone who’s building really powerful AI.
1:30:45
it should be aligned with humanity. And we want to tell other people all the things we figure out about how to do this.
1:30:51
Yeah, I see people worried about various different ways that you can make some progress but not
1:30:57
get all the way there, but then people could end up deploying anyway. So I guess one concern people have is that you might be overconfident, so you might fall
1:31:04
in love with your own work and feel like you’ve successfully solved this problem when you haven’t.
1:31:10
Another thing would be maybe you'll say to other people at OpenAI, we don't feel like we've solved this issue yet.
1:31:15
I’m really scared about this. But then they don’t listen to you because maybe there’s some commercial reasons or, I don’t know, internal politics or something that prevents it from helping.
1:31:22
And I guess another failure mode would be, well, the people at OpenAI listen to you, but the rest of the world doesn't, and someone else ends up deploying it.
1:31:27
I guess I don’t want to heap the weight of the universe on your shoulders. Do you have any comments on these different possible failure modes?
1:31:35
Yeah, I mean, I think that’s why we want to be building the governance institutions that
1:31:42
we need to get this right. I don’t think at the end of the day, I don’t think it’ll be up to me to decide, is this
1:31:51
now safe to go or not? We are doing safety reviews internally at OpenAI before a model goes out. There's, like,
1:31:58
the OpenAI board that has the last say over whether OpenAI is going to do this or not.
1:32:04
And as you know, OpenAI has this complicated capped-profit structure, and the nonprofit board
1:32:09
is actually in charge of what OpenAI does ultimately.
1:32:15
And so they can just decide to make the call of, like, we’re not deploying even though there’s a commercial reason to.
1:32:22
And then for the world in general, at the end of the day, it can affect everyone, and
1:32:30
governments have to get involved somehow, or we need something like an international
1:32:39
agency for atomic energy for AI that can help make these kind of decisions in a technically
1:32:46
grounded way. That’s why I think the kind of things that I want to do and that we want to do with Superalignment
1:32:55
is zoom in on the technical challenges, really understand where we are, but also actually
1:33:02
make progress on the problem and try really hard and focus on actually solving it.
1:33:08
Yeah. An objection that I don’t think I’ve seen, but one that occurred to me when I was reading
1:33:16
about the approach is, could it be the case that it’s actually easier to self exfiltrate?
1:33:22
That is kind of for a model to break out of the lab and do something really bad like release
1:33:27
bioweapons or invent new bioweapons and release them and cause an enormous amount of damage, that could actually be an easier skill than aligning AI.
1:33:35
And so we might possibly hit that capability to do a ton of damage before these models
1:33:40
are actually very helpful to you and your team in making progress on alignment. Yeah, I think self-exfiltration is one of the really key capabilities to be looking at, because
1:33:51
there’s a really important difference between the system being at the lab and in our data
1:33:59
center in a way that we can control it. We can turn off the data center, we can spin down the engine, we can delete the snapshot
1:34:05
if we want to. And whether it’s out in the world and it’s trying to sustain itself or it’s trying to
1:34:13
I don’t know, build better AI models. And so the question then becomes, how can you measure whether the model can break out?
1:34:25
Or can it, say, introduce security vulnerabilities or exploit security vulnerabilities that exist
1:34:30
in our infrastructure? Now, it can’t do that, but future models could. Or can it persuade an open EIA employee to help it exfiltrate its weights?
1:34:42
That’s the other path. Like, you just try to persuade humans, you come up with some arguments that are believable
1:34:47
to them, why they should do that. Could be pretty hard. I don’t know. GBD Four can’t do this, but future models might.
1:34:56
And so I think this is a really important distinction to be looking at.
1:35:01
And then, going to your question, what if this happens first? I think to some extent you can make self-exfiltration harder just with traditional security
1:35:12
measures. But at some point this will be an alignment problem where you actually have to show that
1:35:17
the system is not trying to break out. It doesn’t want to. I think there’s a lot of uncertainty in general over how the technology will go and what kind
1:35:27
of abilities will be unlocked first. But I’m pretty optimistic that we will get a lot of really useful stuff out of the models
1:35:35
before this kind of thing can happen.
1:35:42
But of course, that’s why we need to measure this, because we can’t just make some wild
1:35:47
guesses. Yeah, okay. Yeah. So those are some objections I’ve read online and one from me.
1:35:52
But I guess I’m curious to know, if you were playing devil’s advocate, what’s the best argument against this whole approach that you’re taking, in your opinion?
1:36:01
Yeah, I think you can object on a bunch of different levels. I think you could object that automated alignment research will come too late to really help
1:36:16
us. As you mentioned, we have to solve a lot of the problems ourselves. And to some extent, if that's true, we're still probably going to do the same things
1:36:27
we’re doing now, which is just like you were trying to make more alignment progress so that we can align more capable systems.
1:36:35
And that also means that you’re kind of raising the bar for the first catastrophically misaligned
1:36:43
system. For example. I think there’s more detailed objections you could make on how we build our research portfolio
1:36:52
of the particular paths that we are excited about, like scalable oversight, generalization,
1:36:58
robustness, adversarial testing, that sort of stuff, interpretability.
1:37:03
And we can go into details of each of these paths and what I think the best objections
1:37:12
are to each of them. And then you can also say, why are you doing this job at an AI lab?
1:37:20
Aren’t you going to face some competing incentives like you mentioned with oh, but the lab wants
1:37:25
to deploy? And how do you square that with wanting to be aligned, as aligned as possible?
1:37:33
And I think fundamentally, AI labs are one of the best places to do this work.
1:37:41
Just because you are so close to the technology, you see it as it's being developed.
1:37:46
Right. We got to try a lot of things with GPT-4 before it came out, and because we were hands-on at aligning
1:37:55
it, we know exactly where we’re at and what are the weaknesses and what actually works.
1:38:00
And I think that’s pretty useful. I think also AI labs are really well resourced and they have an incentive to spend on alignment
1:38:09
and they should and it’s great. Yeah, I think I don’t share that objection. It reminds me of the quote, Why do you rob banks?
1:38:16
And he says, that’s where the money is. I feel like why would you do alignment research at Open AI? That’s where all the cutting edge research is.
1:38:22
That’s where the cutting edge models are. The case kind of writes itself.
1:38:27
Yeah, I don’t think Openae is the only place to do good alignment work. Right.
1:38:33
There’s lots of other places that do good alignment work, but I think it’s.
1:38:38
just clear it has some big advantages. Yeah. I'm not saying everyone should necessarily work at OpenAI or one of the labs.
1:38:44
There’s things you can do elsewhere. But surely some people should be at the labs. Maybe a good way of approaching this question of the biggest weaknesses or the best objections
1:38:55
is if you couldn’t take this approach and the Superalignment team had to take quite
1:39:00
a different approach to solving this problem. Do you have kind of a second favorite option in mind?
1:39:06
Yeah, and to be clear, I think our general path and approach will change over the four
1:39:12
years and we’ll probably add more research areas as we learn more and maybe we give up
1:39:18
on some other ones. I think that’s the natural course of research. I kind of want to modify your question a little bit because I think right now we are doing
1:39:27
the things I’m most excited about for aligning human level or systems.
1:39:35
I think in terms of other things I’m excited to see in the world that we’re not doing is like I think there’s a lot of work to be done on evaluating language models that we are
1:39:46
not doing. Measuring the ability to self-exfiltrate, for example.
1:39:53
It would be super useful if we can get more of that. I think there’s a lot of kind of interpretability work on smaller models or open source models
1:40:00
that you can do where you can make a lot of progress and have good insights. We’re not doing that because our competitive advantage is to work with the biggest models.
1:40:09
That’s why we are focusing on automated interpretability research. That’s why we are trying to poke at the internals of GPT-4 and see what we can find.
1:40:19
I think that’s something we’re well positioned to do. I also still have conviction that there’s interesting and useful theory work, like mathematical
1:40:28
theory work to be done in alignment. I think it’s really hard because I think we don’t have a really good scoping of the problem.
1:40:37
And I think that’s probably the hardest part by far.
1:40:42
But I think ultimately maybe the reverse of the question is what are the things that we
1:40:48
have an advantage at doing at OpenAI? Right? And this is like: use the biggest models, go bet on paths that leverage a lot of compute
1:40:58
to solve the problem, work in small teams, work closely together, but don’t focus on
1:41:04
publications per se. We’re not writing a lot of papers. We’re trying to push really hard to solve particular aspects of the problem and then
1:41:15
when we find something interesting, we will write it up and share it. But if that's not a lot of papers, that's fine.
1:41:21
It’s like that’s not what we’re trying to do. And so another focus that we have is we focus a lot on kind of engineering.
1:41:31
We want to run empirical experiments, we want to figure out, we want to try a lot of things
1:41:40
and then measure the results. And that takes a lot of engineering on large code bases because we are using these giant
1:41:47
models, we’re not always using them right. There’s a lot of interesting experiments you can run on smaller models.
1:41:55
At the end of the day, a fair amount of the work is ML engineering, and that's something that
1:42:03
we are well positioned to do as well. Is there any way that this plan could not work out that keeps you awake at night that
1:42:10
we haven’t already mentioned that’s worth flagging? Oh man, there’s so many reasons.
1:42:20
What if our scalable oversight doesn’t actually work or we can’t figure out how to make it work?
1:42:26
Or are we actually measuring the right thing? I think that’s also a lot of thing I keep circling in my head.
1:42:31
How can we improve what we are measuring? For example, with automated interpretability we have this score function that tries to
1:42:38
measure how good the explanation of the neuron is, but it's approximated with a model. It's not actually using a human, and you wouldn't want to just optimize that function.
1:42:48
I don’t think you would get what you were looking for. And to some extent that’s like the core of the alignment problem is like how do you find
1:42:55
the right metric, the metric that you can actually optimize? And so this is something I worry a whole lot about.
1:43:03
And then there’s also just like are we making the right research bets?
1:43:08
Should we be investing in this area more? Should we invest in this other area less? There’s plenty of ways things can go wrong.
1:43:18
So at the point where these models are giving you research ideas, they’re trying to help
1:43:23
you out. It seems like you need to have a lot of people in the loop somehow checking this work, making
1:43:31
sure that it makes sense, like cross checking for deception and so on. It seems like it could just absorb a lot of people doing that?
1:43:37
And would it be possible that the project could fail just because you don’t have enough FTEs, you don’t have enough people working on it in order to keep up?
1:43:44
Yeah, I mean, we are really trying to hire a lot right now and I think the team will
1:43:51
grow a fair amount over the four years. But I think ultimately the real way for us to scale is using AI.
1:44:00
With the Compute commitment we could have millions of virtual FTEs if you so want and
1:44:07
that’s not a size that the Superalignment team could ever realistically grow in terms of humans.
1:44:13
And so that’s why we want to bet so heavily on Compute and bet so heavily on that kind
1:44:20
of path. But if you got kind of a ratio of a million AI staff to one human staff member, isn’t
1:44:26
it possible for it to kind of lose touch? The thing is that you kind of trust the alignment of the humans even though they’re worse in
1:44:33
other ways. So they’re the ones who are doing some ultimate checking that things haven’t gone out of control or that bad ideas aren’t getting through, admittedly with assistance from others, but
1:44:44
yeah, do you see what I’m worried about? Exactly. But this is the problem we are trying to solve.
1:44:52
We have a large amount of work that will be going on and we have to figure out which of it is good, is there something shady about any of it?
1:45:00
What are the results that we should actually be looking at? And so on and so on. And this is like how do you solve this problem?
1:45:07
Is the question we’re asking. How can you make scalable oversight work so that you can trust this large amount of virtual
1:45:17
workers that you’re supervising? Or how can you improve generalization so you know they will generalize to do the right
1:45:26
thing, and not do the thing that the human wouldn't notice, or something.
1:45:32
Does it end up becoming a sort of pyramid structure where you’ve got one person and then they’ve got a team of agents just below that who they supervise, and then there’s
1:45:41
another team of agents below at the next management level down who are doing another kind of work that are reporting upwards, and then you have layers below.
1:45:49
Is that one way of making it scale? Yeah, I mean, you could try to have a more traditional looking company.
1:45:57
I don’t think that’s literally how it’s going to go.
1:46:02
One thing we’ve learned from machine learning is systems are often just really good at some
1:46:08
tasks and worse than humans at other tasks and so you would preferentially want to delegate
1:46:15
the former kind of tasks. And also I don’t think the way it will be organized will look like the way the human
1:46:25
organize themselves, because our organizations are tailored to how we work together.
1:46:30
But these are all really good questions. These are questions that we need to think about and we have to figure out, right?
1:46:39
Yeah. So you and your team are going to do your absolute best with this, but it might not
1:46:45
work out. And I suppose if you don’t manage to solve this problem and we just barrel ahead with
1:46:51
the capabilities, then the end result could conceivably be that everyone dies. So in that situation, it seems like humanity should have some backup plan.
1:46:59
A backup plan, hopefully several backup plans, if only so that the whole weight of the world
1:47:04
isn’t resting on your shoulders and you can get some sleep at night. What sort of backup plan would you prefer us to have?
1:47:11
Did you have any ideas there? I think there’s a lot of other kind of plans that are already in motion.
1:47:20
This is not like the world’s only bet, right?
1:47:25
Alignment teams at Anthropic and DeepMind, they’re trying to solve a similar problem.
1:47:33
There’s various ways that you could try to buy more time or various other governance
1:47:40
structures that you want to put in place to govern AI and make sure it’s used beneficially.
1:47:46
Yeah, I think solving the core technical challenges of alignment is going to be critically important,
1:47:55
but I won’t be the only ones. We still have to make sure that AI is aligned with some kind of notion of democratic values
1:48:02
and not something that tech companies decide unilaterally. And we still have to do something about misuse of AI.
1:48:11
And yeah, alliance systems wouldn’t let themselves be misused if they can help it.
1:48:17
But there’s still a question of how does it fit into the larger context of what’s going
1:48:24
on in society, right? As a human, you can be working for an organization that you don’t really understand what it does
1:48:31
and it’s actually neck narrative without you being able to see that or just because we
1:48:39
can align open AI’s models doesn’t mean that somebody else builds unaligned AI. How do you solve that problem?
1:48:44
That seems really important. How do you make sure that AI doesn’t differentially empower people who are already powerful but
1:48:54
also helps marginalized groups? That seems really important. And then ultimately you also want to be able to avoid these structural risks where, let’s
1:49:04
say we solve alignment and everyone makes a system that's really aligned with them. But then what ends up happening is that you kind of just turbocharge the existing
1:49:15
capitalist system, where essentially corporations get really good at maximizing their shareholder
1:49:24
returns because that’s what they align AIS to. But then humans fall by the wayside where that doesn’t necessarily encompass all the
1:49:33
other things you value, like clean air or something. And we’ve seen early indications of this, right? Like global warming is happening even though we know the fundamental problem, but progress
1:49:44
and all the economic activity that we do still drives it forward.
1:49:50
And so even though we do all of these things right, we might still get into a system that
1:49:57
ends up being bad for humans, even though nobody who participates in the system actually
1:50:03
wants it that way. You’re going to do your job, but a lot of other people have also got to do their job.
1:50:09
A lot of other people in this broader ecosystem. There’s a lot to do. We need to make the future go well, and that requires many parts and this is just one of
1:50:16
them. Okay, let’s skip now to some audience questions, which, as I said, were particularly numerous
1:50:24
and spicy this time around. These questions are probably going to jump around a little bit, but I think just throwing
1:50:29
these at you will give us a good impression of, I think, what’s on people’s minds. Yeah, let’s do it.
1:50:35
Yeah. First one, why doesn’t OpenAI try and solve alignment with GPT-4 first?
1:50:41
For example, get it to the point where there are zero jailbreaks that work with GPT-4 before
1:50:47
risking catastrophe with more advanced models. I think this is a great question and to some extent, the fact that you can point to all
1:50:58
the ways that alignment doesn’t quite work yet, jailbreaks is one of them. But also hallucinations the system just makes up stuff and it’s a form of lying that we
1:51:09
don’t want in the models. But I think to some extent, getting really good at that wouldn’t necessarily help us
1:51:21
that much at solving the hard problems that we need to solve in aligning superintelligence.
1:51:26
Right. I’m not saying we should stop working on those, but we also need to do the forward looking
1:51:32
work. And in particular, the thing that I want to happen is I want there to be the most alignment
1:51:39
progress across the board as possible. And so when GPT-5 comes around, or as models get more capable, that we have something that's
1:51:51
ready to go and we have something that helps a lot with those kind of problems. Okay?
1:51:56
Yeah. Another question. Does the fact that GPT-4 is more aligned than GPT 3.5 imply that the more capable the model
1:52:03
is, the more aligned it will be? I know not everyone is going to accept the premise here, but yeah, what would you say
1:52:09
to that? Yeah, I think people have also pointed out that because GPT-4 is still jailbreakable
1:52:18
and it is more capable, in some sense the worst case behavior is worse. So even though on average, it's much better.
1:52:25
You can make a case for that. But I think it’s also even if it was just like, better across the board, I think it
1:52:34
would be like I don’t think at all we should bet on that trend continuing. And there’s plenty of examples of cases in machine learning where you get some kind of
1:52:46
inverse scaling, it gets better for a while and then gets worse. And to some extent, we know the models haven’t reached this critical threshold where they
1:52:58
are as smart as us, or they could think of a lot of really good ways to try to deceive
1:53:03
us or they don’t have that much situational awareness. They don’t know that much about that.
1:53:10
They are in fact a language model that’s being trained and how they’re being trained. They don’t really understand that.
1:53:16
But once they do, it’s kind of a different ballgame, right? Like you’re kind of going to be facing different problems and so just like extrapolating from
1:53:25
some kind of trend that we see now I don’t think would be right in either way, but I
1:53:30
do think it is like you can learn something from it. I don’t think you should jump to that conclusion.
1:53:36
Yeah. What’s most intellectually exciting about this project from a mainstream ML perspective,
1:53:43
I think we will learn a lot about how big neural networks actually, fundamentally work.
1:53:49
If you think about the work that we are trying to do on generalization, it is weird that
1:53:57
we don’t understand why models sometimes generalize in one way and sometimes another way.
1:54:05
Or how can we change the ways that they can generalize?
1:54:10
Why can’t we just list all the possible ways and then see which ones work?
1:54:17
Or how can we get them to generalize in each of those ways? Or what's the mechanism that's really happening here?
1:54:23
We don’t know that. And why don’t we know that? Or I think if you think about interpretability, just like being able to understand the mechanisms
1:54:38
by which the models are deciding which token to output next will teach us a lot about what's
1:54:49
going on there. How does it actually work?
1:54:55
I don’t know. On some level this is the whole thing. It’s the whole thing.
1:55:03
People are spending an enormous amount of effort increasing capabilities, right, by just throwing more compute and more data into these models, and then they just get this
1:55:11
further inscrutable machine that they don’t understand. That is like very cool in a way because it could do stuff but it sounds like at some
1:55:16
point maybe the more interesting thing is how does it work? Which is what you’re going to be working on.
1:55:22
Yeah, but at the same time there are really concrete things. You can say, let's say, induction
1:55:27
heads, right? You can find these attention heads that do very specific things like induction.
1:55:33
Or you can find — somebody reverse-engineered the circuit that does arithmetic, simple arithmetic.
1:55:41
In a small model you can actually do that. Or we found the Canada Neuron. There’s like a neuron in GBD Two that just reacts to Canadian concepts and it’s like
1:55:51
it’s just there. We found it. There’s so much still to find because we just know so little and it’s kind of crazy not
1:56:00
to look at that. Yeah. I imagine that there are some structures in these networks that are going to be analogous
1:56:07
to things that the human brain does and we will probably be able to figure out how they work in these networks long before we figure out how they work in the human brain.
1:56:14
Because we have perfect data about all of the activations of the network. Exactly. So it seems like all of the people studying the brain should just switch over and start
1:56:21
working on this — it's so much easier. Your life will be so much easier. Yeah.
1:56:27
I don’t know why not. More people do it. It seems so compelling to me, but I’m not a neuroscientist.
1:56:36
Maybe some of the insights will also transfer. Right. You can find some of the neurons that we know vision models have.
1:56:45
You can also find them in humans and animals — these kinds of edge filters.
1:56:52
Or if you look at reinforcement learning, you have evidence for how reinforcement learning
1:56:57
works in the human brain. But we have so much more evidence for how it works in neural networks, because we freaking built
1:57:03
it. So much easier.
1:57:09
What do you think have been the biggest wins in technical AI safety so far? I think if I had to pick one, I think it would probably be RLHF.
1:57:17
I think in some ways, I think RLHF really put alignment on the map.
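[Editorial aside: a deliberately tiny toy sketch of the core RLHF idea, not OpenAI’s implementation. In real RLHF a reward model is trained on human comparisons and a language model is fine-tuned with PPO plus a KL penalty against the pretrained policy; here a fixed random vector stands in for the reward model, a bare logit vector stands in for the policy, and the update is a simple REINFORCE-style gradient.]

```python
# Toy RLHF-flavoured loop: push a policy towards outputs a "reward model" prefers.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_responses = 16                                  # stand-in for the space of possible responses

logits = nn.Parameter(torch.zeros(n_responses))   # toy "policy" (really a language model)
reward_model = torch.randn(n_responses)           # stand-in for a learned reward model
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    responses = dist.sample((64,))                # sample a batch of "responses"
    rewards = reward_model[responses]             # score them with the reward model
    advantage = rewards - rewards.mean()          # simple baseline for variance reduction
    loss = -(dist.log_prob(responses) * advantage).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reward model's favourite:", reward_model.argmax().item())
print("policy's favourite:      ", logits.argmax().item())  # typically the same after training
```

[In the real setting the policy is a pretrained language model and the update is also penalised for drifting too far from it, which is part of why RLHF mostly redirects capabilities the model already has rather than teaching it new ones, as discussed later in this conversation.]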
1:57:23
And I think it also demonstrated that alignment has a lot of value to add to how systems are
1:57:30
actually being built. I think the fact that it actually had a whole bunch of commercial impact has been really
1:57:37
good because it kind of really demonstrates real world value in a way that if you’re trying
1:57:48
to solve this abstract problem of aligning superintelligence, which is a super abstract problem.
1:57:54
Right. You could kind of noodle on it for many years without making any clear measurable progress.
1:58:02
And I think not only does RLHF have this really visceral difference between how the model
1:58:09
was before and how it was after that, everyone can really see when they play with it.
1:58:14
But also it makes it clear that this is an area that’s really worth investing in and
1:58:21
taking a bet on, even the things that aren’t obviously working yet or might
1:58:31
still be in the stage of being really abstract. Yeah. Is there a number two?
1:58:42
I think there’s a number of smaller wins that we’ve had.
1:58:47
It’s hard to make these rankings. I think if I wanted to add other things, I think interpretability of vision models has
1:58:57
been pretty impressive, and I think there’s been a lot of progress in that. And if you’re asking in terms of safety impact, or alignment
1:59:10
impact, it’s maybe less clear, because there’s nothing you can really point to that follows
1:59:16
directly from that. Okay. Yeah. Here’s a question that was kind of a recurring theme among listeners.
1:59:24
What gives OpenAI the right to develop artificial general intelligence without democratic input as to whether we want to actually develop these systems or not?
1:59:32
This is an excellent question. I think it’s also a much wider question. I think we should have democratic input to a lot of other things as well.
1:59:43
How should the model behave or should we deploy in this way and should we deploy in this other
1:59:54
way? In some ways, OpenAI’s mission is to develop AI that benefits all of humanity, but you
1:59:59
have to give humanity a say in what’s happening. This is not what the Superalignment team does, but I think it’s going to be very important.
2:00:08
Yeah, I guess it sounds like you’re just on board with the idea that there needs to be some integration
2:00:15
between the AI labs and democratic politics, where the public has to be consulted, people
2:00:21
have to be informed about the risks and the benefits that come here. And there needs to be some sort of collective decision about when and how these things are
2:00:29
going to be developed and deployed. And I guess we just currently don’t have the infrastructure to do that. I mean, I guess that’s
2:00:37
partly OpenAI’s responsibility, but it’s also partly everyone else, like the responsibility of the whole of society.
2:00:42
As long as OpenAI is willing to collaborate in that, then there just needs to be a big effort to make it happen.
2:00:47
I think that’s right, and I think I’m really happy that OpenAI is really willing to speak
2:00:53
openly about the risks and speak openly about where we’re at. And I see my responsibility also to inform the public about what is working in alignment
2:01:04
and what isn’t, and where we’re at and where we think we can go. But yeah, at the end of the day, governments will also have a role to play in how this all goes.
2:01:17
Yeah. If Congress investigates all of this and concludes that it’s uncomfortably dangerous and they
2:01:24
think that a bunch of this research needs to be stopped, do you think that the AI labs would be willing to go along with that?
2:01:32
This is what a more democratic, more legitimate process has output, and so we should be good
2:01:41
citizens and slow down or stop. Yeah, I mean, look, AI companies have to follow the laws of the country they’re in.
2:01:52
That’s how this works. But I think what’s going to happen is we will have regulation of frontier AI technology
2:02:03
and people are trying to figure out how to do that, and we should try to do it as sensibly
2:02:12
as possible. I think there is the larger question of how can you not just have something that works,
2:02:26
let’s say in the United States or in the United Kingdom, but if there are ways to build
2:02:35
AI that are actually really dangerous, then that has to apply to everyone and not just
2:02:42
specific countries. And I think that’s also like a key challenge. It’s also not a challenge I’m personally working on, but yeah, I think we need to solve that.
2:02:51
I’m excited for anyone who’s working on that problem. Yeah, I suppose, to make a point:
2:02:58
Something that makes me a bit pessimistic is just that it seems like we don’t just need to solve one thing, we need to solve many things.
2:03:04
And if we mess up maybe just one of them, then that could be very bad. We don’t just need to have a technical solution, but we need to make sure it’s deployed in
2:03:12
the right place and everyone follows it. And then even if that works, then maybe you could get one of these structural problems
2:03:18
where it’s doing what we tell it to, but it makes society worse. Yeah, well, I see it as the flip side of all of this.
2:03:27
There’s so much opportunity to shape the future of humanity right now that the listener could
2:03:35
be working on and could have a lot of impact. And I think there’s just so much work to do, and there’s a good chance we actually live
2:03:44
at the most impactful time in human history that has ever existed and that will ever exist.
2:03:50
Kind of wild. Super wild. Could be the case. I don’t know. Yeah.
2:03:55
Okay. Back in March, you tweeted, quote, “Before we scramble to deeply integrate large language models
2:04:01
everywhere in the economy, can we pause and think about whether it’s wise to do so? This is quite immature technology, and we don’t understand how it works.
2:04:08
If we’re not careful, we’re setting ourselves up for a lot of correlated failures.” And a couple of days after that, OpenAI opened up GPT-4 to be connected to various plugins
2:04:17
through its API. And one listener was curious to hear more about what you meant by that and whether there
2:04:23
might be a disagreement within OpenAI about how soon GPT-4 should be hooked up to the
2:04:28
internet and integrated into other services. Yeah, I realized that tweet was somewhat ambiguous, and it was read in lots of different ways.
2:04:40
Fundamentally, what plugins allow you to do is nothing on top of what you couldn’t
2:04:47
do with the API. Plugins don’t really add anything fundamentally new that people couldn’t already do.
2:04:53
And I think OpenAI is very aware of what can go wrong when you hook up plugins to the system.
2:05:02
You have to be careful when you let people spend money, and all of these questions. But
2:05:12
they’re also sitting right next to us, and we talk to them about it, and they’ve been
2:05:18
thinking about it. But given how much excitement there was to just try GPT-4 on all the things,
2:05:26
what I really wanted to say is also, look, this is not quite mature.
2:05:33
The system will fail. Don’t connect it to all of the things yet.
2:05:39
Make sure there’s a fallback system. Make sure you’ve really played with the model to understand its limitations.
2:05:45
If you have the model write code, make sure you’re reading the code and understanding
2:05:50
it or executing it in the sandbox, because otherwise the system might break.
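[Editorial aside: a minimal sketch of the spirit of that advice, illustrative only and not OpenAI guidance. Rather than exec()-ing model-written code inside your own process, run it in a separate process with a timeout and inspect the output before acting on it; a real sandbox needs much stronger isolation, such as containers, resource limits, and no network access.]

```python
# Run untrusted, model-generated Python in a child process, never in-process.
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 5) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],   # -I: isolated mode, ignore env vars and site dirs
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "timed out"
    finally:
        os.unlink(path)

ok, output = run_untrusted("print(2 + 2)")
print(ok, output)                           # review the output before acting on it
```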
2:05:58
Wherever you’re writing the code, it might break that system. And just be careful, be wise, make sure you understand what you’re doing here and not
2:06:05
just hook it up to everything and see how it goes. Is there anything that people are using GPT-4 for where you feel like maybe it’s premature
2:06:13
and we should slow down and do some more testing? I mean, probably. I don’t know if I can give you some good examples, but I think that’s
2:06:25
generally the story with new technologies. Right. I’m fundamentally a techno optimist, and I think we should use AI for all the things
2:06:35
that it’s good for. And to some extent, we just spend an hour talking about how great it would be to use
2:06:41
AI for alignment research, which is my job. So I’m trying to replace myself at my job with AI.
2:06:49
But at the same time, we also have to really understand the limitations of this technology. And some of it is not obvious and some of it is not widely known.
2:06:58
And you have to do that in order to deploy it responsibly and integrate it responsibly,
2:07:08
integrate it into society in a way that is actually wise to do. And I think, just as always with new technologies, we’ll try a lot of things,
2:07:24
and I’m also excited for people to try a lot of things. And that’s why I think it’s good that the OpenAI API exists and it lets lots of people
2:07:32
use cutting edge language models for all kinds of things. But you want to be also careful when you’re doing that.
2:07:40
Yeah, I guess on this topic of just plugging things into the Internet many years ago, people
2:07:48
talked a lot about it. They kind of had this assumption that if we had an intelligent system that
2:07:53
was as capable as GPT-4 that probably we would keep it in a lead contained box and wouldn’t
2:07:59
plug it up to the internet because we’d be worried about it. But it seems like the current culture is just that as soon as a model is made, it just
2:08:06
gets deployed onto the internet right away. I mean, that’s not quite
2:08:11
right. Right? Okay. We had GPT-4 for like eight months before it was actually publicly available.
2:08:20
And we did a lot of safety tests, we did a lot of red teaming, we made a lot of progress
2:08:26
on its alignment, and we didn’t just connect it to everything immediately. But I think what you’re actually trying to say is, many years ago people were arguing
2:08:35
over like, oh, but if you make AGI, can’t you just keep it in a box and then it will never break out and will never do anything bad?
2:08:43
And you’re like, well, it seems like that ship has sailed and now we’re connecting it to everything.
2:08:48
And that’s partially what I’m trying to allude to here: we should be mindful when we do connect it.
2:08:54
And just because GPT-4 is on the API doesn’t mean that every future model will be immediately
2:09:02
available for everything and everyone in every case. This is kind of the difficult line that you have to walk where you’re like.
2:09:10
You want to empower everyone with AI or as many people as possible.
2:09:17
But at the same time, you have to also be mindful of misuse and you have to be mindful
2:09:22
of all the other ways things could go wrong with the model, like misalignment
2:09:29
being one of them. And so how do you balance that trade off? That’s like, one of the key questions.
2:09:35
Yeah, it seems like one way of breaking it up would be connected to the internet versus
2:09:41
not. But I feel like often I’m guilty of this as well. We’re thinking either it’s kind of deployed on the Internet and consumers are using it,
2:09:50
or it’s like safely in the lab and there’s no problem, but there’s this intermediate
2:09:55
I mean, there can also be problems if you have it in a lab. Well, that’s what I’m saying. That’s exactly what I’m saying. And I feel like sometimes people lose track of that.
2:10:02
That misuse is kind of an issue if it reaches the broader public.
2:10:07
But misalignment can be an issue if something is merely trained and is just being used inside a company, because it will be figuring out how it could end up having broader impacts.
2:10:17
And I think yeah, because we tend to cluster all of these risks or tend to speak very broadly,
2:10:23
the fact that a model can be dangerous if it’s simply trained, even if it’s never hooked up to the Internet, is something that we really need to keep in mind.
2:10:30
And I guess it sounds like at OpenAI people will keep that in mind. And safety reviews really need to start before you even start the training run, right?
2:10:40
Yeah. Okay, here’s another question. OpenAI’s decision to create and launch Chat GPT has probably sped up AI research because
2:10:50
there’s now a rush into the field as people were really impressed with it. But it has also prompted a flurry of concerns about safety and new efforts to do preparation
2:10:58
ahead of time to see off possible threats. With the benefit of hindsight, do you think that move to release ChatGPT increased or
2:11:05
reduced AI extinction risk? All things considered, I think that’s a really hard question, and I don’t know if we can really definitively answer
2:11:14
this now. I think fundamentally it probably would have been better to wait with ChatGPT and release
2:11:21
it a little bit later. I think also, to some extent, this whole thing was inevitable and at some point the public
2:11:33
would have realized how good language models have gotten. And you could also say it’s been surprising that it went this long before that was the
2:11:43
case. I was honestly really happy about how much it has shifted the conversation, or advanced the conversations
2:11:51
around risks from AI, but also kind of like the real kind of alignment work that has been
2:11:59
happening, on how we can actually make things so much better and we should do more of that.
2:12:04
And I think both of these are really good. And you can now argue over what the timing should have been and whether it would have
2:12:11
happened anyways. I think it would have happened anyways. And when people are asking these questions, which are really good questions to ask, which
2:12:19
is like, well, can’t we all just stop doing AI if we wanted to? And it feels so easy, right?
2:12:24
Just like, just stop, just don’t do it, wouldn’t that be a good thing?
2:12:30
But then also in practice, there’s just so many forces in the world that keep this
2:12:36
going. Let’s say OpenAI just decides, oh, we’re not going to train a more capable model, just
2:12:43
not do it. OpenAI could do that. And then there’s a bunch of OpenAI competitors who might still do it, and then you still
2:12:51
have AI, okay, let’s get them on board. Let’s get the top five AGI labs or the five tech companies that will train the biggest
2:12:59
models and get them to promise it. Okay, now they’ve promised. Well,
2:13:04
now there’s going to be a new startup, and there’s going to be tons of new startups. And then you get into, well, people are still making transistors smaller, so you’ll just
2:13:12
get more capable GPUs, which means the cost to train a model that is more capable than
2:13:18
any other model that has been trained so far, it still goes down exponentially year over year.
2:13:23
And so now you’re going to the semiconductor companies and you’re like, okay, can you guys chill out?
2:13:29
And they’re like, fine, we’ll get on board. And then now there’s upstream companies who work on EUV lithography or something, and they’re
2:13:40
like, well, we’re working on making the next generation of chips, and we’ve been working on this. And then you get them to chill
2:13:48
out. And it’s a really complicated coordination problem. It’s not
2:13:54
even that easy to figure out who else is involved. And so I think humanity can do a lot of things if it really wants to.
2:14:05
And I think if things actually get really scary, I think there’s a lot of things that
2:14:12
can happen. But also fundamentally, I think it’s not an easy problem to solve and I don’t want to
2:14:18
assume it’s being solved. What I want to do is I want to ensure we can make as much alignment progress as possible
2:14:26
in the time that we have. And then if we get more time, great. And then maybe we’ll need more time and then we’ll figure out how to do that.
2:14:34
But what if we don’t? I still want to be able to solve alignment. I still want to win in the world where we don’t get extra time, or where for whatever
2:14:46
reason, things just move ahead however it goes.
2:14:51
You could still come back to the question of how do we solve these technical questions as quickly as possible.
2:14:56
And that’s, I think, what we really need to do. Yeah, I suppose within this, I’ve seen online that there are people who are trying to slow
2:15:06
things down, basically to buy more time for you and your team, among others.
2:15:11
And there’s some people who are taking out a really extreme view that they just want to stop progress on AI.
2:15:17
They just want to completely stop it globally for some significant period of time, which seems, as you’re saying, like a very heavy lift.
2:15:24
I guess I’m not sure, but I think that their theory might be that at some point there’ll be some
2:15:29
disaster that changes attitudes in a really big way, and then things that currently just seem impossible might become possible.
2:15:36
And so perhaps their idea would make more sense then. But I guess setting that aside, in terms of the race to solve alignment, it seems like
2:15:46
we could either slow things down 1% or, like, get 1% more time or speed up alignment research
2:15:52
by 1%. And the question might be, which of those two things is easier? It sounds like you think probably it’s easier to speed up the alignment research. Or, like,
2:16:00
it’s probably easier to get alignment research proceeding twice as quickly than it is to make timelines that are twice as long towards whenever we invent dangerous
2:16:09
things. Yeah, I think that’s a really important point also, given how few people are actually working
2:16:16
on alignment these days. What is it?
2:16:21
Is it hundreds, thousands? It depends on your count, right? Like, the Superalignment team is about 20 ish people right now, but there’s like a lot
2:16:29
of other alignment efforts at OpenAI right now. If you count all of the RLHF work, it’s probably more than 100.
2:16:36
But if you go back two years, there’s like three people doing RLHF, or like five.
2:16:41
I don’t know, it’s ramped up a lot. But we still need so much more. And even really talented individuals can still make such a big difference by switching to
2:16:52
this, working on this problem now, just because it’s still such a small field. There’s still so much to do, so much we still don’t understand.
2:17:01
And in some ways, it feels like the real final research frontier.
2:17:06
We’re like, look, we’ve figured out scaling. We know how to make the model smarter. Unless somebody, well, there’s some ways in which
2:17:18
people might stop it, but we know how to do this.
2:17:24
Alignment is a real research problem. We’re like, we don’t know how to align super intelligence. We want to figure this out.
2:17:29
We have to. It’s not optional. Yeah, the fact that the field is so small is exasperating on one level, but it’s also a
2:17:38
reason for optimism in another sense, because you could double it. Like if you could get a thousand ML researchers to switch into working on alignment, that
2:17:45
would completely transform things, right? Exactly. Okay.
2:17:51
Yeah. Another question. Jan claimed that the Superalignment team wouldn’t be avoiding alignment work that helps with
2:17:56
commercialization, but that work in particular is already incentivized monetarily by definition.
2:18:03
So why isn’t he going to try to avoid that work? Which will probably get done either way.
2:18:11
I think this is like the whole point that a lot of people are trying to make is that
2:18:16
alignment wouldn’t be done by default in the way that we are really happy with or something.
2:18:25
Or let’s say, put differently, the problems that we want to solve are currently unsolved.
2:18:31
And yes, some of it will be commercially valuable. And I think fundamentally, if you have two ways of building AGI and one of them is just
2:18:42
much more aligned with humans, people will want to buy the second one because it’s just better for them and that will necessarily have commercial value and it’ll be unavoidable.
2:18:54
And I think in general, an adjacent criticism that has
2:19:01
been raised in the past is that a lot of people feel like RLHF has been like a capabilities
2:19:06
progress, because the RLHF models feel more capable. You’re interacting with them, they’re more useful, they’re actually doing more things.
2:19:16
And the reason is because they’re trying to help you, they’re more aligned, they’re actually
2:19:22
leveraging their capabilities towards whatever you’re asking them to do, whereas the pretrained model isn’t.
2:19:28
And so it obviously feels a lot more capable because you’ve unlocked all of these capabilities.
2:19:35
But if you then look at what actually happens during fine tuning, right, the model isn’t
2:19:41
really learning fundamentally new skills it didn’t have before. You can do that through fine tuning, theoretically, but not with the kind of compute budget that
2:19:50
we use. For GPT-3 it was like less than 2% of the pretraining compute.
2:19:56
For GPT-4, it was even less than that. It’s like really a tiny fraction. But at the same time, because the model is now trying so much harder to be helpful, it
2:20:07
is more helpful and it feels like you get all the capabilities that had been there in the first place.
2:20:14
And so to come back to the commercialization question, what I really want to do is solve
2:20:24
the problem. And if that is commercially useful, great, if it’s not, some of it will not be, or some
2:20:30
of the research bets won’t work out or some of the things won’t be useful before we actually get really capable systems, and that’s fine.
2:20:40
But the goal is to solve the problem. That’s what we want to do. Yeah.
2:20:46
Another question. Is OpenAI banking on there not being a really fast takeoff, and do they try to make plans
2:20:54
that could also work in the event of a foom scenario that is like extremely rapid recursive self improvement of AI?
2:21:00
Yeah, I think we should plan for that scenario and be ready if it happens.
2:21:06
And to some extent automated alignment research is probably the best plan I know in that kind
2:21:13
of scenario where you really have to scale up your alignment work in proportion with what’s going on.
2:21:19
And if you can do this by just delegating almost all of the work to machines, then they
2:21:26
can actually keep pace with the machines because they are the only ones that can.
2:21:32
Yeah, I guess a concern would be if the intelligence explosion or if there is an intelligence explosion
2:21:39
and it’s very fast, then there’s very little time for you to put your plans into action and to keep up.
2:21:46
But that would be true of it’s just a very bad situation. It makes it very hard for any plan to work.
2:21:51
That’s right. But what we should be doing, if you want to be agnostic to the speed of tech progress,
2:22:00
which is what we want to do here, the best thing you can do is to prepare as much as
2:22:05
possible ahead of time. Which is why we need to start thinking now about how to align systems that we don’t have
2:22:11
yet. And the more you can prepare, the more you’ll be ready for that scenario. Yeah.
2:22:17
Okay. So a question I got, which I’ve just slightly changed.
2:22:22
What are OpenAI’s grounds for thinking alignment is solvable? And have they seen Dr. Roman Yampolskiy’s impossibility arguments against solvability? And they linked
2:22:32
to a paper with those arguments there. I guess I don’t know exactly what those arguments are, but I know there are people out there
2:22:38
who have kind of made theoretical arguments that alignment is impossible or extremely
2:22:43
difficult for some conceptual reasons. Are there any arguments along those lines that trouble you in particular?
2:22:49
Or maybe do you think that kind of argumentation shouldn’t be so persuasive?
2:22:55
Yeah, I think I looked at the paper that you mentioned and I think like any argument that
2:23:04
I’ve seen, I haven’t found particularly persuasive. And the problem whenever you’re trying to make a theoretical argument is that you
2:23:12
need some kind of assumptions and the big question then really just becomes, are these
2:23:18
assumptions going to be true? And to me it just really seems like the jury is still out on this.
2:23:27
It could turn out to be impossible. It doesn’t feel particularly likely to me, but I don’t have a proof for that.
2:23:36
But I think we’re going to work really hard to find a counter example by showing that
2:23:42
it can be done. I think it’s definitely not the time to give up. I think it’s very doable. Yeah, I can feel there’s a bit of exasperation that comes through where you’re like all of
2:23:51
these people complaining that this problem is insoluble, they’re not helping. And clearly there are so many things we could try.
2:23:57
Why don’t we just try them? They’re helping in the sense that they’re indirectly doing recruiting for us,
2:24:03
because they’re drawing attention to the problem.
2:24:09
And if you just went around saying the problem is easy, you wouldn’t draw attention to it. People would be like, okay, it’s fine then, I won’t worry about it.
2:24:15
But also, I think it also created a real energy of like, oh, it seems really hard,
2:24:21
let’s give up. And that’s, I think, absolutely the wrong approach. If anything, that means we should try harder and get more people to try to solve it.
2:24:37
Never give up, never surrender. The game is still up in the air, we should just really crush it.
2:24:43
Okay, yeah. Two questions that were kind of pointing in the same direction were, as OpenAI gets closer to AGI, do they plan to err on the side of paranoia in terms of giving AIs opportunities
2:24:52
to manipulate staff or hack themselves out or otherwise have channels of causal influence?
2:24:57
And another person asked, how much risk of human extinction are you willing to take in a large training run? Like, for example, to train GPT-5, 6 or 7, and so on.
2:25:07
In general, as the stakes get higher, we have a much higher burden of proof of alignment,
2:25:17
proof of safety, and we’ve been ramping this up with every system.
2:25:24
And the systems we have now still aren’t catastrophically risky or aren’t close to that.
2:25:32
And so, for example, GPT-2 was just open sourced. Everyone can download it and do whatever they want with it.
2:25:39
GPT-3 was not, and we made it available via an API. And then for GPT-4, the only publicly available version is the alignment fine-tuned version,
2:25:53
like the RLHF version, the ChatGPT version. And I think the base model, as far as I know, is only under researcher access.
2:26:02
So it’s like we’re steering the public towards the RLHF model. And I think with each of these steps, you’re also stepping up your safety, you’re also
2:26:12
stepping up your alignment. And obviously it has to be that the higher the capability level, the higher the stakes are
2:26:23
and the more safety and alignment measures you need. Yeah, so people can kind of expect that trend to continue.
2:26:30
On the same theme on Twitter, someone asked they asked you actually, in a different thread,
2:26:36
how would you define success? And you replied, the scientific community agrees that we’ve solved alignment.
2:26:42
And they said this statement from Jan was good.
2:26:47
Is there a meaningful related commitment that OpenAI could make, for example, to not deploy systems above a certain threshold of capability unless there is a broad scientific consensus
2:26:55
that alignment has been solved for that kind of system? At least at the end of the day, I think we’re going to have to convince the scientific community,
2:27:04
because I don’t think the world will let us build something that’s catastrophically dangerous. And the world is paying attention now, and I think that’s all good at the moment.
2:27:14
I’ve learned recently that in the UK, if you want to rent out a house to more than three
2:27:20
unrelated people, then you need a special license in order to do that but as far as I can tell, at least currently one doesn’t need a license or any sort of approval in
2:27:27
order to train an AGI. I suppose that’s partly because we probably can’t do that yet, but it does seem like currently
2:27:34
there aren’t that many legal restrictions and we’re hoping that there will be pretty
2:27:40
quickly or at least I’m hoping that there’ll be more infrastructure in place. Yeah, I mean that seems right to me and people are working on regulation and this is something
2:27:51
that regulation has to solve, and there’s a lot of questions around this that I’m not an expert in. But to come back to the scientific community and the question of how you define success, right,
2:28:06
I feel very strongly that it’s not sufficient to just convince ourselves that we did a good job, because it’s so easy
2:28:12
to convince yourself that you did a good job at something that you care a lot about.
2:28:17
But we actually have to convince external experts. We have to convince external auditors who are looking exactly at what we’re doing and
2:28:27
why. And I think we’ll just actually have a mountain of empirical evidence of like,
2:28:32
here’s all the things we tried, here’s what happens when we do them. You can look at the data, you can look at the code, and then people can scrutinize
2:28:41
what we’re doing. And I think that’s like because the stakes will end up being so high correspondingly,
2:28:50
we also have to invite a lot of scrutiny in what we are doing. And one aspect of it that we kind of started with now is we want to say what we are planning
2:29:00
to do, what is our overall approach to aligning the systems that we’re building?
2:29:07
And we want to invite feedback and criticism. Maybe there’s something way better that we could be doing.
2:29:13
I would love to know that and then we would do that instead. And I think in general, I think the public should just know what we’re doing on alignment
2:29:26
and make independent judgments on whether that will be enough. And I think experts will have
2:29:33
a role to play in this, because their knowledge will be required to draw informed conclusions
2:29:42
from this. Yeah, I think yeah. An interesting thread with the audience questions is that so many of them are about policy and governance
2:29:51
and those are also the kinds of questions that I’m more tempted to ask because I often don’t understand the technical details and I imagine many people on Twitter don’t know
2:29:58
enough to scrutinize the technical proposals. So we’re more thinking, you know, at a social level, at an organizational level, are things set
2:30:04
up well? And I feel like my answer is often just like yeah, I would love to see more of that, please
2:30:10
solve this problem. It’s like, I’m not working on this, but what I’m working on here helps, hopefully. That’s why I feel it’s reasonable to put these questions to you and to find
2:30:19
out what you think. But yeah, there’s just a lot of people who need to take action, and you’ve got to keep your head down, focused on this technical stuff because that’s your specialty.
2:30:26
But we also need the governance people at OpenAI to be putting in place good structures, and we need the Senate committee on this to be figuring out how to play their role.
2:30:36
And it’s just yeah, there’s a lot of different pieces that have to slot together.
2:30:41
That’s right. Okay, so that’s been a whole lot of audience questions, but we’re heading towards the final
2:30:48
half hour or so of the conversation, and I guess my dream is that this interview can help get you lots of great applications to work on the Superalignment team.
2:30:57
Ideally, we’d move a whole lot of people from work that’s interesting. My dream too.
2:31:02
I’m glad we’re really aligned. Yeah. Hopefully we get some people moving from stuff that’s kind of interesting but not that helpful
2:31:09
to something that is both super intellectually interesting and also might save the world in some sense.
2:31:15
I guess I don’t want to take a strong contrarian view on whether the Superalignment project
2:31:22
is better or worse than other projects that people who are really much more technically informed than me think are plausible.
2:31:28
But the plan that you’ve laid out seems as good to me as any other plan that I’ve heard, and it seems like you’ve got the resourcing and situation to make a
2:31:36
real go of it. And I guess also if this plan doesn’t bear as much fruit as you hope, in the next couple
2:31:41
of years, I imagine you’d be able to pivot to a different plan. So yeah. What roles are you hiring for and what sort of numbers?
2:31:48
Lay it all out. Yeah, we are primarily hiring for research engineers, research scientists, and research
2:31:57
managers. And I expect we’ll be continuing to hire a lot of people, probably, like at
2:32:06
least ten before the end of the year is my guess, and then maybe even more in the years
2:32:15
after that. Yeah. So what do research engineers, research scientists, research managers, what do these roles look
2:32:25
like? So, in a way, we don’t actually make a strong distinction between research engineer and
2:32:32
research scientist at OpenAI. And in each of these roles, you’re expected to write code, you’re expected to run your
2:32:40
own experiments. And in fact, I think it’s really important to always be running lots of experiments,
2:32:47
like small experiments, testing your ideas quickly and then iterating and trying to learn
2:32:53
more about the world. And in general, there’s no PhD required also for the research scientist roles.
2:33:05
And really, you don’t even have to have worked in alignment before, and in fact, it might
2:33:13
be good if you didn’t because you’ll have a new perspective on the problems that we’re
2:33:18
trying to solve. What we generally love for people to bring,
2:33:24
though, is like a good understanding of how the technology works, right? You understand language models, you understand reinforcement learning.
2:33:33
For example, you can build and implement ML experiments and debug them.
2:33:40
And then on the research scientist, or more scientific, end of the spectrum,
2:33:46
I think you would be expected a lot more to think about what experiments to do next or
2:33:55
come up with ideas of how can we address the problems that we are trying to solve or some
2:34:01
other problems that we aren’t thinking about that maybe we should be thinking about. Right.
2:34:06
Or how should we design the experiments that will let us learn more? And then on the research engineering end of the spectrum, there’s a lot of, let’s just actually
2:34:18
build the things that let us run these experiments and make the progress that we already know how to make. If we just have a bunch of good ideas, that will not be enough, right?
2:34:26
We actually have to then test them and build them and actually ship something that other people can use and that involves writing a lot of code and that involves debugging ML
2:34:37
and running lots of sweeps of experiments like getting big training runs on GPT-4 and
2:34:45
other big models set up. And so I think in practice, actually, most people on the team kind of move somewhere
2:34:52
on the spectrum and sometimes there’s more coding because we kind of know what to do
2:34:57
and sometimes it’s more researchy because we don’t yet know what to do and we’re kind of starting a new project.
2:35:04
But yeah, in general I think you need a lot of critical thinking and asking important
2:35:15
questions and being very curious about the world and the technology that we are building.
2:35:22
And for the research manager, basically that’s a role where you’re managing a small or medium
2:35:29
sized or even large team of research engineers and research scientists towards a specific
2:35:35
goal. And so there you should be setting the direction: what are the next milestones,
2:35:42
where should we go? How can we take this vague question of, we want to understand this type of generalization
2:35:48
or we want to make a data set for automated alignment research or something like that.
2:35:55
You have to break it down and make it more concrete and figure out what people can be doing. But also there’s a lot of just day to day management of how can we make people motivated
2:36:08
and productive, but also make sure they can work together and just traditional management
2:36:14
stuff. Yeah, okay, so it sounded like, well, for the first two, the main thing was that you had a good understanding of current
2:36:24
ML technology. You would be able to go in and potentially think up experiments and run experiments.
2:36:31
Are there any kind of other kind of concrete skills that you require or what would be the
2:36:36
typical background of someone who you would be really excited to get an application from?
2:36:42
There’s a lot of different backgrounds that are applicable here.
2:36:47
Machine learning PhDs have been like the traditional way people get into the field, especially
2:36:52
if you want to do something more researchy, but I don’t think you need that at all.
2:36:57
And in fact, if you’re thinking about starting a PhD now, I don’t know if you’ll have that
2:37:03
much time, you should just go work on the problem now. I think for research engineers, I think the kind of background is like, maybe you’ve worked
2:37:14
in a STEM field and you’re like, okay, I’m going to stop doing that. I’m going to take six months and just reimplement a bunch of ML papers and learn a bunch that
2:37:22
way. Or somebody who works at a tech company doing other machine learning, engineering related
2:37:30
things and now wants to switch to alignment. I think that’s like a really good profile.
2:37:37
What I also want to stress is that most people we are trying to hire haven’t worked on alignment
2:37:43
before just because the people who have been working on alignment before, there’s so few
2:37:48
of them. And also I think the core expertise that you will need is like machine learning skills
2:38:00
and there’s a bunch of things you should know about alignment, but you can also learn them
2:38:05
once you’re here, or you can catch up along the way. And I think that’s fine.
2:38:11
On the research manager role, I guess you’re looking for somewhat different skills there that someone might have more management experience and I mean, yeah, being a good researcher
2:38:19
and being a good manager are not the same. These things absolutely can come apart. So I guess would you be looking for a particular kind of person for the manager role?
2:38:28
Yeah, they can be anticorrelated, which I think they might be sometimes, yeah.
2:38:35
But I think ideally you would have managed before and I think there’s different ways
2:38:42
it could go. Right. There are scenarios where you split up responsibilities between a tech or research lead and a manager,
2:38:52
and the manager takes on more of the responsibilities of management and the tech lead is more setting
2:38:58
the direction for the team and making sure the technical stuff is happening that needs
2:39:05
to happen. But in that configuration, they have to get along really well and they have to really
2:39:14
be on the same page to effectively divide these responsibilities. And in particular, I think the manager still should have a really detailed understanding
2:39:24
about what we’re trying to do, but ideally we’d want to have someone who just can do
2:39:30
both roles in one. And so the kind of background would be like, I don’t know, you’ve led a research team at
2:39:40
some other company or in some kind of other branch of machine learning, or you’ve been
2:39:46
a manager before in some other domain. And then you switched to being an IC, meaning individual contributor, on some kind of large
2:39:57
language model project, say. Or there’s also a path where maybe you’re like a postdoc somewhere and you have a small
2:40:09
research team that you’re working with day to day, and it’s very coding heavy, and you’re
2:40:14
running lots of experiments with language models or reinforcement learning or something like that.
2:40:19
I think these are all possible profiles, but it’s kind of hard to know what exactly.
2:40:25
I think the bigger filter is just more like you should actually really care about the
2:40:32
problems that we’re trying to solve. And you need to be really good at coding. You need to be really good at machine learning.
2:40:42
As I understand it, one of the impressive and difficult things that OpenAI has had to work on is just getting the chips and getting the compute to work well and efficiently.
2:40:52
I think these are enormous aggregations of compute, and the engineering of getting that
2:40:57
to work is not at all straightforward. And I guess getting it to work for ML purposes specifically, that adds its own complications.
2:41:03
Are you hiring people to do that engineering side of things? OpenAI definitely is, and yeah, I think mostly on the Superalignment team, what we’ll be
2:41:19
dealing with is more like being consumer of the infrastructure that runs these large scale
2:41:27
experiments. And so in particular, people on Superalignment need to be comfortable debugging these large
2:41:34
distributed systems. Right. Because if you’re doing a fine-tuning run on GPT-4, it is such a system, it’s not
2:41:39
easy to debug. But we don’t have to build the large language model infrastructure because it already exists
2:41:46
and other people are working on that. What does the application process look like?
2:41:53
Yeah, so it’s very simple. You go on openai.com/careers and you scroll down and you’ll find the roles that have Superalignment
2:42:03
in the title, and you click on it and then you submit your CV and say why you want to work on this, and then that’s it, and then we’ll see it.
2:42:12
Okay. Are there any further steps to the process?
2:42:19
The general interview process that we follow is, you know, there’s
2:42:25
a tech screening, and there’s an intro chat with someone from the team, and there’s an onsite process where I think there’s like two to four coding
2:42:37
or ML interviews and a culture fit interview.
2:42:42
But depending on the job or the background, it might look slightly different.
2:42:50
Yeah. Are you kind of expecting to maybe hire 20 people and then only keep ten of them in the
2:42:56
long run, or is it more you’re going to try to hire people who mostly you expect to work
2:43:02
out? Yeah, we want to really invest in the researchers that we’re hiring.
2:43:12
More the second one. Yeah. I imagine the bar is reasonably high for getting hired.
2:43:21
Is there a way of communicating what the bar kind of is? I know people could be both overconfident and underconfident, and it could be quite
2:43:28
bad if someone would be really good, but they don’t feel like they’re such a badass that
2:43:34
they should necessarily get a role like this. So if there’s any kind of more explicit way of communicating who should apply, that could
2:43:40
be useful. Yeah, maybe the most important thing is: if you’re in doubt, please apply.
2:43:52
The cost of a false negative is much higher than the cost of a false positive. Exactly.
2:43:57
You’ve slightly already done this earlier in the interview, but yeah. Do you want to just directly make the pitch for why amazing people should apply to work
2:44:05
with you on the Superalignment team? Yeah. In short, I think this is one of the most important problems.
2:44:14
We really have to get this right. It’s not optional. We want to do really ambitious things. We’ve set ourselves a goal to actually solve it in four years.
2:44:23
We’re serious about that. So if you want to work in a team of highly motivated, talented people who are really
2:44:32
trying to solve ambitious problems and have a lot of resources to do so, this is the place
2:44:39
to go. I think also we are at the state of the art of the technology, and OpenAI is really backing
2:44:49
us in what we want to do. So I think we have as good a shot at the problem as anyone else, if not more.
2:44:58
And I think we should just really do it and really go for it. And you could make that happen and that would be really exciting.
2:45:07
Do you also need any non machine learning and non research people on that team?
2:45:13
There’s always, of course, operations, communications, legal, these other groups. Or maybe for that, you’ll just have to apply to OpenAI in general rather than the alignment team
2:45:22
specifically. Yeah, that’s right. And I’m generally also just really excited to have more people who really care about
2:45:29
the alignment problem, who really care about the future of AI going well. Just apply to OpenAI, in whatever role, just help us make that future a reality.
2:45:42
And there’s a lot of people at Open AI who really care about this, but just more people who care about the problems, the important problems, I think, the better.
2:45:51
Yeah, so many policy issues have come up through the conversation. I know there are some really amazing people on the policy team over at OpenAI.
2:45:58
That’s right. Oh yeah, I can name some other teams. So I think the AI governance or policy research team is doing really excellent work on dangerous
2:46:09
capabilities, evaluations and actually trying to get agreements about when should we all
2:46:16
stop. And there’s the system safety team that actually tries to improve alignment and safety of models
2:46:25
we have right now. Making the refusals better, fixing jailbreaking, improving monitoring, all of these problems.
2:46:31
They’re really important. And for some listeners who might be more skeptical about the longer-term problems that we have
2:46:42
to solve and want to do something that has impact right now, these are great teams to join and I’m excited for what they’re doing.
2:46:49
And then of course, there’s a lot of other teams at OpenAI that are doing important work, like just improving RLHF, improving ChatGPT, all of this, legal, communications, recruiting,
2:47:02
there’s a lot of things to do. We are focusing on trying to figure out how to align superintelligence, but as we’ve discussed,
2:47:09
it’s not the only thing we need. Yeah. If someone were reluctant to apply because they were scared that getting involved might
2:47:17
enhance capabilities and they were someone who thought that speeding up capabilities research was a bad thing yeah.
2:47:22
What would you say to them? If you don’t want to do that, don’t apply to the capabilities team.
2:47:32
Yeah, fair enough. I think in general, the obvious thing
2:47:37
is it sounds like working on the Superalignment team is not going to meaningfully contribute to capabilities progress on any kind of global level.
2:47:44
I mean, I don’t want to promise that nothing we’ll do will have any capabilities impact.
2:47:50
And I think, as mentioned earlier, I think some of the biggest alignment wins will also
2:47:56
have some of these effects and I think that’s just real. And like, I think in the EA community specifically, there’s a lot of hesitation around like, oh,
2:48:08
if I get into ML or if I do an ML engineering job somewhere, I might accelerate timelines
2:48:14
a little bit and it will be so bad if I did that. And I think that kind of reasoning really underestimates the career capital growth and
2:48:25
the skills growth that you would get by just doing some of these jobs for a while you’re
2:48:32
skilling up and then you can switch to alignment later. And I think in general, there’s so many people working on capabilities that one more or less
2:48:46
won’t make it go that much faster, but there’s not that many people in alignment.
2:48:52
So as one person working on alignment, you can actually make a much larger difference.
2:48:58
Yeah, as we always do. When this topic comes up, I’ll link to our article if you want to reduce AI risk, should
2:49:06
you take roles that advance AI capabilities? And there we have responses from a wide range of people who we ask this question to, who
2:49:14
do have something of a range of views. But I think the reasoning that you’ve given out there, that just your proportional increase
2:49:19
in capabilities research that you would make would be very small relative to the proportional increase in alignment research that you would make, plus all
2:49:27
of the benefits that you get from skilling up personally and then being able to use those skills later in your career, seems pretty clear to me, in this case at least.
2:49:36
What are the distinctive things about OpenAI’s culture that people should be aware of going in? Is there a particular kind of character that really thrives? I mean,
2:49:45
I think we generally want to be really welcoming to all kinds of different people and all kinds
2:49:53
of different characters and everyone. I think we just need a lot of diversity of thought how to go about this problem.
2:50:02
And many people have said this before, there’s also so many non machine learning aspects
2:50:10
to this problem. And so especially if somebody has a nontraditional background and switched into ML or has specifically
2:50:20
origin story that is nontypical, I think that’s super valuable.
2:50:25
I think in general, I care a lot about having a team culture that is really warm and friendly
2:50:34
and inclusive, but also creates a lot of psychological safety for people to voice
2:50:41
spicy takes on some of the things that we’re doing or our approach in general, and we need
2:50:48
to collaborate to solve the problem. And it’s not just like, who can get the credit or something, this problem just needs to get
2:50:59
solved. Yeah. If a really talented person wanted to switch into working on technical alignment, but for
2:51:06
some reason it was impossible for them to go join you on the Superalignment team, is there anywhere else that you’d be really excited for them to apply?
2:51:14
Yeah. I mean, not at OpenAI, but yeah, I think there’s other AI labs that are doing really good, really cool work, like
2:51:25
Google DeepMind or Anthropic. And there’s other academic labs that are really doing cool stuff, like at Berkeley or at Stanford
2:51:35
or in Oxford. I think I would consider applying to those.
2:51:40
I think also just it’s always very sad when we have to turn down really talented people,
2:51:50
but also we are a small team, we can’t hire everyone, and sometimes people aren’t quite
2:51:58
ready and it’s good to focus on more skill building and career capital investment.
2:52:03
And I think that’s also a really valid strategy. And I think all in all, probably people that go through our pipeline generally underestimate
2:52:13
how valuable it is to take a research engineering job at another company and skill up and learn
2:52:22
a bunch of things, and then there’s a lot of opportunities to do that.
2:52:27
Yeah. Just on practical questions, is it possible to work remotely and can you sponsor visas
2:52:33
for people who aren’t US citizens? Yes, we definitely sponsor visas. I mean, remote work is generally not encouraged because almost
2:52:47
the entire team is in San Francisco. We go into the office at least three times a week, and it’s just so much easier to collaborate.
2:52:56
And so if you can do that, it would be really good. Yeah.
2:53:01
Are there any other points that you want to make before we push on? Yeah, thank you so much for letting me pitch these roles here.
2:53:10
I’m really excited for more people who really care about this problem, really care about
2:53:19
the future going well, and making sure humanity manages this transition into a post-AGI world.
2:53:30
And yeah, thank you for doing this. All right, we’ve already gone overtime, and I’ve been keeping you for a long while, and
2:53:37
I’m sure you have a lot of stuff to do setting up this whole project. Maybe a final question before we go. Yeah.
2:53:43
Do you have a favorite piece of science fiction?
2:53:50
I really like the Greg Egan books.
2:53:56
A lot of these are really old. Like, Permutation City was, like, one of my favorites. And a lot of the ideas that he plays with felt really out there at the time, I’m
2:54:09
sure. But now it just seems so much more striking, a lot closer to home in a whole bunch of ways.
2:54:14
And you can kind of feel more and more of the weird Sci-Fi ideas become reality.
2:54:21
But also I actually like that he tries to paint a positive view of what society could
2:54:31
look like in the long run. Yeah, whatever you said, I was going to ask, is your life weirder or less weird than what
2:54:40
is portrayed in that piece of science fiction? I actually don’t know. I don’t know Permutation City, but maybe could you quickly
2:54:45
tell us what it’s about and whether it’s weirder than your own situation in this world?
2:54:51
My life is, like, definitely less weird. Definitely. The book is so much more weird than my life.
2:54:57
So Permutation City is like a book that plays with the idea of uploading, of having digital copies of humans and living in a mathematical universe, and what the implications of
2:55:11
that are, and if virtual humans can rewrite their own code, and a lot of things like that we
2:55:20
can’t do yet. And maybe in some ways AI can do it because maybe in the near future or medium future,
2:55:29
AI could rewrite parts of its own neural network or something if we make interpretability progress.
2:55:36
But yeah, I don’t know. It’s like, very out there science fiction.
2:55:43
Right? That’s what makes it so cool. Yeah. I don’t know.
2:55:50
I do feel like sometimes we’re living through a science fiction. This is nothing. It’s going to get so much weirder.
2:55:57
Okay. Yeah. All right. We have that to look forward to in the 2030s or
2:56:03
2040s. I don’t know exactly how it’s going to go, but I promise you it’ll be weird by today’s
2:56:09
standards. Yeah. Well, yeah, best of luck with the project. I really look forward to seeing how it comes along.
2:56:16
My guest today has been Jan Leike. Thanks so much for coming on the 80,000 Hours podcast, Jan. Thank you so much for having me.