
Thank you so much for inviting me.

It’s such a pleasure to be talking about these things

here in my own department.

It’s so cool to see how many interesting things are

happening right here.

So I’m going to talk about keeping AI

under control with mechanistic interpretability.

And in particular, how I think we

physicists have a great opportunity to help with this.

So first of all, why might we want to keep AI under control?

Well, [LAUGHS] obviously as we’ve

heard this morning, because it’s getting more and more powerful.

We’ve all seen this paper from Microsoft.

It’s arguing that GPT-4 is already

showing sparks of artificial general intelligence.

Here is Yoshua Bengio.

[AUDIO PLAYBACK]

– Now reached a point where there

are AI systems that can fool humans,

meaning they can pass the Turing test.

[END PLAYBACK]

So you can debate whether or not GPT-4 passes the Turing test,

but Yoshua Bengio should certainly

get a vote in that debate since he’s

one of the Turing Award winners, the equivalent of the Nobel

Prize for AI.

And this rapid progress, as we know,

has obviously started freaking a lot of people out.

Here we have Turing Award co-winner Geoffrey Hinton.

I’m not sure if the audio is actually going out.

Is it?

[AUDIO PLAYBACK]

– Are we close to the computers coming up

with their own ideas for improving themselves?

– Yes, we might be.

And then it could just go–

we might have to think hard about how to control it.

– Yeah, can we?

– We haven’t been there yet.

But we can try.

– OK, that seems kind of concerning.

– Yes.

[END PLAYBACK]

And then, piling on, Sam Altman, CEO of OpenAI, which, of course,

has given us ChatGPT and GPT-4, had this to say.

[AUDIO PLAYBACK]

– And the bad case, and I think this is important to say,

is like lights out for all of us.

[END PLAYBACK]

Lights out for all of us doesn’t sound so great.

And of course, then, we had a bunch of us

who called for a pause in an open letter.

And then we had, shortly after that,

a bunch of AI researchers talking

about how this poses a risk of extinction,

which was all over the news.

Specifically, it was the shortest open letter

I’ve ever read and had just one sentence.

Mitigating the risk of extinction from AI

should be a global priority alongside other societal-scale

risks, such as pandemics and nuclear war.

So basically, the whole point of this

was just that it mainstreamed the idea that, hey,

maybe we could get wiped out.

So we really should keep it under control.

And the most interesting thing here, I think,

is who signed it.

You have not only top academic researchers,

who don’t have a financial conflict of interest,

people like Geoffrey Hinton and Yoshua Bengio.

But you also have the CEOs here, Demis Hassabis

from Google DeepMind, Sam Altman again, [INAUDIBLE], et cetera.

So there are a lot of reasons why we should keep it under control.

How can we help?

I feel that, first of all, we obviously should.

And Peter earlier this morning gave a really great example

of how I think we really can help,

by opening up the black box and getting to a place

where we’re not just using ever more powerful systems that we

don’t understand, but where we’re instead

able to understand them better.

This has always been the tradition in physics

when we work with powerful things.

If you want to get a rocket to the moon,

you don’t just treat it as a black box

and you fire– that one went a little too far to the left.

Let’s aim a little farther to the right next time.

No, what you do is you figure out the laws of–

you figure out Einstein’s laws of gravitation.

You figure out thermodynamics, et cetera.

And then you can be much more confident

that you’re going to control what you build.

So this is actually a field which

has gained a lot of momentum quite recently.

It’s a very small field still.

It’s known by the nerdy name of mechanistic interpretability.

To give you an idea of how small it

is, if you compare it with neuroscience

and you think of this as artificial neuroscience,

neuroscience is a huge field, of course.

Look how few people there are here at MIT at this conference

that I organized just two months ago.

This was the biggest conference by far in this little nascent

field.

So that’s the bad news.

It’s very few people working on it.

But the good news is even though there are so few,

there’s already been a lot of progress–

remarkable progress.

I’ve seen more progress in this field

than in all of big neuroscience in the last year.

Why is that?

It’s because here, you have a huge advantage

over ordinary neuroscience in that first of all,

to study the brain with 10 to the 11 neurons,

you’d have a hard time reading out more than 1,000 at a time.

You need to get IRB approval for all sorts of ethics reasons

and so on.

Here, you can read out every single neuron all the time.

You can also get all the synaptic weights.

You don’t even have to go to the IRB either.

And you can use all these traditional techniques that we love

in physics, where you actually mess with the system

and see what happens.
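To make this concrete, here is a minimal sketch of what "reading out every neuron" and "messing with the system" can look like in practice. The toy model, layer choices, and the idea of zeroing a single unit are illustrative assumptions on my part, not something described in the talk; the pattern of recording activations and intervening with forward hooks is standard PyTorch.

```python
import torch
import torch.nn as nn

# Illustrative toy model; the layers and sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

activations = {}

def record(name):
    # Read out a layer: store a copy of its output every time it runs.
    def hook(module, inputs, output):
        activations[name] = output.detach().clone()
    return hook

def ablate(neuron_idx):
    # Mess with the system: zero out one unit, a crude causal intervention.
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, neuron_idx] = 0.0
        return patched
    return hook

model[0].register_forward_hook(ablate(neuron_idx=7))     # intervene on layer 0
model[2].register_forward_hook(record("second_hidden"))  # read out layer 2

x = torch.randn(4, 10)
out = model(x)  # forward pass runs with the intervention in place
print(activations["second_hidden"].shape)  # every downstream neuron is visible
```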

And I think there are three levels of ambition that

can motivate you to want to work on mechanistic

interpretability, which is, of course, what

I’m trying to do here, to encourage you to work more

on this.

The first, lowest ambition level is,

when you train a black box neural network on some data

to do some cool stuff, to understand it well enough that you

can diagnose its trustworthiness,

make some assessment of how much you should trust it.

That’s already useful.

Second level of ambition, if you take it up a notch,

is to understand it so well that you can

improve its trustworthiness.

And the ultimate level of ambition,

and we are very ambitious here at MIT,

is to understand it so well that you

can guarantee trustworthiness.

We have a lot of work at MIT on formal verification,

where you do mathematical proofs that code is going

to do what you want it to do.

Proof-carrying code is a popular

topic in computer security.

It’s a little bit like a virus checker in reverse.

A virus checker will refuse to run your code if it

can prove that it’s harmful.

Here, instead, the operating system says to the code,

give me a proof that you’re going to do what

you say you’re going to do.

And if the code can’t present the proof

that the operating system can check, it won’t run it.
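As a toy illustration of that "show me a checkable certificate before I accept you" pattern (my own analogy, not an example from the talk, and far simpler than real proof-carrying code), here the hard work is done by untrusted code, while the host only accepts its claim after a cheap check:

```python
def untrusted_factorizer(n: int) -> dict:
    # Untrusted code: claims n is composite and offers a factor as its certificate.
    for d in range(2, int(n**0.5) + 1):
        if n % d == 0:
            return {"claim": "composite", "certificate": d}
    return {"claim": "prime", "certificate": None}

def host_accepts(n: int, message: dict) -> bool:
    # The host only accepts a claim it can verify cheaply itself.
    if message["claim"] == "composite":
        d = message["certificate"]
        return d is not None and 1 < d < n and n % d == 0
    return False  # no cheap certificate offered for "prime" here, so reject

msg = untrusted_factorizer(91)                      # 91 = 7 * 13
print(host_accepts(91, msg))                        # True: 7 is easy to check
print(host_accepts(91, {"claim": "composite",
                        "certificate": 5}))         # False: bogus certificate
```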

It’s hopeless to come up with rigorous proofs

for neural networks because it’s like trying to prove things

about spaghetti.

But the vision here is if you can use AI to actually

mechanistically extract out the knowledge that’s been learned,

you can re-implement it in some other kind of architecture,

one which isn’t a neural network and which really lends itself

to formal verification.

If we can pull off this moonshot,

then we can trust systems much more intelligent than us

because no matter how smart they are,

they can’t do the impossible.

So in my group, we’ve been having

a lot of fun working on extracting learned knowledge

from the black box in the mechanistic interpretability

spirit.

You heard, for example, my grad student Eric Michaud talk

about this quanta hypothesis recently.

And I think this is an example of something which

is very encouraging, because if this quanta hypothesis is

true, you can do a divide and conquer.

You don’t have to understand the whole neural network

all at once.

But you can look at the discrete quanta of knowledge it has learned

and study them separately, much like we physicists

don’t try to understand this data center all at once.

First, we try to understand the individual atoms

that it’s made of.

And then we work our way up to solid state physics, and so on.

It also reminds me a little bit of Minsky’s Society of Mind,

where you have many different systems working together

to do very powerful things.

I’m not going to try to give a full summary of all

the cool stuff that went down at this conference.

But I can share– there’s a website, where

we have all the talks on YouTube if anyone

wants to watch them later.

But I want to just give you a little more nerd flavor

of how tools that many of you here, as physicists, already use are

very relevant to this, things like phase transitions,

for example.

So we already heard a beautiful talk by Jacob Andreas

about knowledge representations.

There’s been a lot of progress on figuring out

how large language models represent knowledge,

how they know that the Eiffel Tower is in Paris,

and how you can change the weights so that it thinks it’s

in Rome, et cetera, et cetera.
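As a small, hedged illustration of what probing such a factual association can look like (using GPT-2 from the Hugging Face transformers library purely as a stand-in; the model-editing work referred to above, where weights are changed so the model answers "Rome", is considerably more involved than this probe):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt

probs = torch.softmax(logits, dim=-1)
for city in [" Paris", " Rome", " London"]:
    token_id = tokenizer.encode(city)[0]    # leading space matters for GPT-2's BPE
    print(f"{city.strip():>6}: {probs[token_id].item():.4f}")
```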

We did a study on algorithmic data sets,

where we found phase transitions.

So if you’re trying to make a machine learning system

learn a giant multiplication table,

this could be for some arbitrary group operation or something

more interesting than standard multiplication,

then if there’s any sort of structure

here, if this operation is, for example, commutative, then

you only really need the training data for about half

of the entries.

And you can figure out the other half

because it’s a symmetric matrix.

If it’s also associative, then you need even less, et cetera.

So as soon as the machine learning

discovers some sort of structure,

it might learn to generalize.
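To put a rough number on the symmetry argument (my own back-of-the-envelope check, using the modulo-59 setup from the example that follows): commutativity alone means the p-by-p table has only p(p+1)/2 independent entries.

```python
p = 59                                  # number of symbols, as in the example below
total_entries = p * p                   # full "multiplication" table
independent = p * (p + 1) // 2          # entries with a <= b suffice if commutative
print(total_entries, independent, round(independent / total_entries, 3))
# 3481 1770 0.508  -> roughly half the table pins down the rest
```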

So here is a simple example: addition modulo 59.

We train a neural network to do this.

We don’t give it the inputs as numbers.

We just give it each of the numbers

from 0 to 58 as a symbol.

So it doesn’t have any idea that they should

be thought of as numbers.

And it represents them

by embedding them in some internal space.

And then we find that exactly at the moment

when it learns to generalize to unseen examples,

there is a phase transition in how it represents them

in the internal space.

You find that the representation was in a high-dimensional space,

but everything collapses onto a two-dimensional plane,

which I’m showing you here, in a circle.

Boom, that’s, of course, exactly like the way

we do addition modulo 12 when we look at a clock.

So it finds a representation where it’s actually adding up

angles, which automatically captures

the– in this case, the commutativity

and the associativity.
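Here is a minimal sketch of that experiment, with illustrative assumptions on my part: the embedding-plus-MLP architecture, the hyperparameters, and the 50 percent train split are stand-ins, not the exact setup from our papers. It trains on addition modulo 59 with the inputs treated purely as symbols, then projects the learned embeddings down to two dimensions to look for the circular structure that shows up when the network generalizes.

```python
import torch
import torch.nn as nn

P = 59
torch.manual_seed(0)

# Full dataset of symbol pairs (a, b) with label (a + b) mod P, then a split.
pairs = torch.tensor([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.5 * len(pairs))              # train on about half the table
train_idx, test_idx = perm[:n_train], perm[n_train:]

class ModAdder(nn.Module):
    def __init__(self, p, d_embed=32, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)        # symbols, not numbers
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, p),
        )

    def forward(self, ab):
        e = self.embed(ab)                           # (batch, 2, d_embed)
        return self.mlp(e.flatten(start_dim=1))

model = ModAdder(P)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(5001):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step:5d}  train loss {loss.item():.3f}  test acc {acc:.3f}")

# Project the learned embeddings onto their top two principal components.
# When the network generalizes, the 59 symbols tend to lie on a circle.
E = model.embed.weight.detach()
E = E - E.mean(dim=0)
_, _, Vh = torch.linalg.svd(E, full_matrices=False)
coords = E @ Vh[:2].T
print(coords[:5])  # plot these two columns (e.g. with matplotlib) to see it
```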

And I suspect this might be a general thing that

happens in learning language and other things also:

that it comes up with a very clever representation which

geometrically encodes

a lot of the key properties, and that lets it generalize.

We do a lot of phase transition experiments,

also where we tweak various properties

of the neural network. If you think of this as being like water,

you could have pressure and temperature

on your phase diagram.

But here, the axes are various other nerdy machine

learning parameters.

And you get these phase transition boundaries

between the region where it learns properly

and can generalize, the region where it fails to generalize

and never learns anything,

and the region where it just overfits.

This is for the example of just doing regular addition.

So you see it learns to put the symbols on a line

rather than a circle in the cases where it works out.
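A sketch of how a phase diagram like the ones just described can be mapped out empirically: scan two knobs, here the training-data fraction and the weight decay standing in for pressure and temperature, and record the final test accuracy in a grid. The `train_and_eval` helper is hypothetical; it would wrap a training loop like the one sketched earlier and is left as a stub here.

```python
import numpy as np

def train_and_eval(train_frac: float, weight_decay: float) -> float:
    # Hypothetical stub: train a model like the ModAdder sketched above on
    # `train_frac` of the addition table with this weight decay, then return
    # its accuracy on the held-out entries.
    raise NotImplementedError

train_fracs = np.linspace(0.2, 0.8, 7)
weight_decays = np.logspace(-3, 1, 7)

phase_diagram = np.zeros((len(train_fracs), len(weight_decays)))
for i, frac in enumerate(train_fracs):
    for j, wd in enumerate(weight_decays):
        phase_diagram[i, j] = train_and_eval(frac, wd)

# Cells near accuracy 1 form the generalizing phase; cells near chance (1/59)
# correspond to memorization or failure to learn; the boundaries between them
# are the phase transitions.
print(np.round(phase_diagram, 2))
```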

So I want to leave a little bit of time for questions.

But the bottom line I would like you to take away from all this

is I think it’s too pessimistic to say, oh, we’re forever

just going to be stuck with these black boxes

that we can never understand.

Of course, if we convince ourselves that it’s impossible,

we’re going to fail.

That’s the best recipe for failure.

I think it’s quite possible that we really

can understand enough about very powerful AI systems

that we can have very powerful AI systems that

are provably safe.

And physicists can really help a lot

because we have a much higher bar for what

we mean by understanding things than a lot of our colleagues

in other fields.

And we also have a lot of really great tools.

We love studying nonlinear dynamical systems.

We love studying phase transitions.

And so many other things, which are turning out

to be key to making this kind of progress.

So if anyone is interested in collaborating, learning more

about mechanistic interpretability,

and basically studying the learning

and execution of neural networks as just yet

another cool physical system to try to understand,

just reach out to me.

And let’s talk.

Thank you.

[APPLAUSE]

All right, thank you very much.

Does anyone have questions?

I actually have one to start with.

So, from what you were explaining in these last few slides,

a lot of the themes sort of seem to be about applying

the laws of thermodynamics and other physical laws

to these systems.

And the parallel I thought of is that the field of biophysics

also sort of emerged out of this, right?

Applying physical laws to systems

that were considered too complex to understand before we really

thought about it carefully.

Is there any sort of emerging field

like that in the area of AI or understanding

neural networks other than that little conference you just

mentioned?

Or is that really all that’s there right now?

There’s so much room for there to really

be an emerging field like this.

And I invite all of you to help build it.

It’s obviously a field, which is not only very much needed,

but it’s just so interesting.

There have been so many times in recent months

when I read a new paper by someone else about this,

and I’m like, oh, this is so beautiful.

Another way to think about this is I always tell

my students, when they pick tasks to work on,

they should look for areas where there is more data–

where experiment is ahead of theory.

That’s the best place to do theoretical work.

And that’s exactly what we have here.

If you train some system like GPT-4

to do super interesting things, or use Llama 2, which just came

out and where you have all the parameters,

you have an incredibly interesting system.

You can get massive amounts of data.

And there are the most fundamental things about it that we don’t understand.

It’s just like when the LHC turns on

or when you first launch the Hubble Space

Telescope, or the WMAP satellite,

or something like that.

You have a massive amount of data,

really cool basic questions.

It’s the most fun domain to do physics in.

And yeah, let’s build a field around it.

Thank you.

Yeah, we’ve got a question up there.

Hi, Professor Tegmark.

I was wondering–

first, amazing talk.

I loved the concept.

But I was wondering if it is possible that this approach may

miss situations in which the language model actually

performs very well, not in a contiguous region,

like a phase region in parameter space,

but rather in small blobs scattered all around?

Because in most physical systems,

we have a lot of parameters and we will have phases.

And the phases are mostly confined to contiguous regions

in n dimensions or whatever.

And then there are phase transitions,

which is the concept here.

But since this is not necessarily

a physical system, maybe there might

be a situation in which the best way that it performs

is in specific combinations of parameters that are like points

or little blobs scattered around.

I don’t know if my question went through.

Yeah, yeah, it’s a good question.

I think I need to explain better.

I think my proposal is actually more radical than I

perhaps managed to explain.

I think we should never put something

we don’t understand, like GPT-4, in charge

of the MIT nuclear reactor or any other high-stakes system.

I think we should use these black box

systems to discover amazing knowledge

and discover patterns in data.

And then we should not stop there

and just connect it to the nuclear weapons

system or whatever.

But we should instead

develop other AI techniques to extract out the knowledge

that these systems have learned and re-implement it

in something else.

So take your physics metaphor again.

So Galileo, when he was four years old, if his daddy threw

him a ball, he’d catch it.

Because his black box neural network

had gotten really good at predicting the trajectory.

Then he got older and he’s like, wait a minute,

these trajectories always have the same shape.

It’s a parabola, y equals x squared, and so on.

And when we send the rocket to the moon,

we don’t put a human there to make

poorly understood decisions.

We actually have extracted out the knowledge

and written the Python code or something else

that we can verify.
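A toy version of that Galileo point (mine, not from the talk): rather than trusting a black box to predict trajectories, extract the simple law behind them into a few human-checkable parameters.

```python
import numpy as np

g = 9.8
t = np.linspace(0.0, 2.0, 50)
rng = np.random.default_rng(0)
# Noisy "observations" of a ball launched upward at 10 m/s from 1 m height.
y = 1.0 + 10.0 * t - 0.5 * g * t**2 + rng.normal(0.0, 0.05, t.shape)

# Extracted knowledge: a quadratic law y ~ c2*t^2 + c1*t + c0 we can inspect.
c2, c1, c0 = np.polyfit(t, y, deg=2)
print(c2, c1, c0)   # roughly -4.9 (= -g/2), 10 (launch speed), 1 (initial height)
```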

I think we need to stop

putting an equals sign between large language models and AI.

We’ve had radically different ideas of what AI should be.

First, we thought about it in the [INAUDIBLE] paradigm

of computation.

Now, we’re thinking about LLMs.

We can think of other ones in the future.

What’s really amazing about neural networks, in my opinion,

is not their ability to execute computation at runtime.

They’re just another massively parallel computational system.

And there are plenty of other ones too

that are easier to formally verify.

But where they really shine is in their ability

to discover patterns in data, to learn.

And let’s continue using them for that.

You could even imagine an incredibly powerful AI

that is just allowed to learn, but is not allowed to act back

on the world in any way.

And then you use other systems to extract out

what it’s learned.

And you implement that knowledge in some system

that you can provably trust.

This, to me, is the path forward that’s really safe.

And maybe there will still be some kind of stuff

which is so complicated we can’t prove that it’s

going to do what we want.

So let’s not use those things until we can prove them safe,

because I’m confident that the set of stuff that can be made

provably safe is vastly more powerful, and useful,

and inspiring than anything we have now.

So why should we risk losing control

when we can do so much more first in a provably safe way?

We’ll do one more question.

All right, thank you.

I’ll keep my question short.

So for your phase transition example,

is it just an empirical observation?

Or do you have a theoretical model like you do in physics?

Right now, it’s mainly an empirical observation.

And actually, we have seen many examples

of phase transitions cropping up in machine learning.

And so have many other authors.

I’m so confident that there is a beautiful theory

out there to be discovered, a sort of unified theory of phase

transitions in learning.

Maybe one of you is going to be the first to formulate it.

I don’t think it’s a coincidence that these things keep

happening like this.

But this gives you all an example

of the kind of basic, physics-like questions

that are out there, still unanswered,

where we have massive amounts of data as clues

to guide us towards the answers.

Thank you.

And I think we will probably even discover,

at some point in the future,

a very deep relationship, or duality,

between thermodynamics and learning dynamics.

That’s the hunch I have.
