The Impact of ChatGPT Talks (2023) – Prof. Max Tegmark (MIT). 04 AUG 2023.
Thank you so much for inviting me.
It’s such a pleasure to be talking about these things
here in my own department.
It’s so cool to see how many interesting things are
happening right here.
So I’m going to talk about keeping AI
under control with mechanistic interpretability.
And in particular, how I think we
physicists have a great opportunity to help with this.
So first of all, why might we want to keep AI under control?
Well, [LAUGHS] obviously as we’ve
heard this morning, because it’s getting more and more powerful.
We’ve all seen this paper from Microsoft.
It’s arguing that GPT-4 is already
showing sparks of artificial general intelligence.
Here is Yoshua Bengio.
[AUDIO PLAYBACK]
– We’ve now reached a point where there
are AI systems that can fool humans,
meaning they can pass the Turing test.
[END PLAYBACK]
So you can debate whether or not GPT-4 passes the Turing test,
but Yoshua Bengio should certainly
get a vote in that debate since he’s
one of the Turing Award winners, the equivalent of the Nobel
Prize for AI.
And this rapid progress has, as we also know,
started freaking a lot of people out.
Here we have Turing Award co-winner Geoffrey Hinton.
I’m not sure if the audio is actually going out.
Is it?
[AUDIO PLAYBACK]
– Are we close to the computers coming up
with their own ideas for improving themselves?
– Yes, we might be.
And then it could just go...
We might have to think hard about how to control it.
– Yeah, can we?
– We haven’t been there yet.
But we can try.
– OK, that seems kind of concerning.
– Yes.
[END PLAYBACK]
And then, piling on, Sam Altman, CEO of OpenAI, which, of course,
has given us ChatGPT and GPT-4, had this to say.
[AUDIO PLAYBACK]
– And the bad case, and I think this is important to say,
is like lights out for all of us.
[END PLAYBACK]
Lights out for all of us doesn’t sound so great.
And of course, then, we had a bunch of us
who called for a pause in an open letter.
And then we had, shortly after that,
this bunch of AI researchers talking
about how this poses a risk of extinction,
which was all over the news.
Specifically, it was the shortest open letter
I’ve ever read and had just one sentence.
Mitigating the risk of extinction from AI
should be a global priority alongside other societal-scale
risks, such as pandemics and nuclear war.
So basically, the whole point of this
was just that it mainstreamed the idea that, hey,
maybe we could get wiped out.
So we really should keep it under control.
And the most interesting thing here, I think,
is who signed it.
You have not only top academic researchers,
who don’t have a financial conflict of interest,
people like Geoffrey Hinton and Yoshua Bengio.
But you also have the CEOs here, Demis Hassabis
from Google DeepMind, Sam Altman again, [INAUDIBLE], et cetera.
So there are a lot of reasons why we should keep it under control.
How can we help?
I feel that, first of all, we obviously should.
And Peter earlier this morning gave a really great example
of how I think we really can help,
by opening up the black box and getting to a place
where we’re not just using ever more powerful systems that we
don’t understand, but where we’re instead
able to understand them better.
This has always been the tradition in physics
when we work with powerful things.
If you want to get a rocket to the moon,
you don’t just treat it as a black box
and you fire– that one went a little too far to the left.
Let’s aim a little farther to the right next time.
No, what you do is you figure out the laws of–
you figure out Einstein’s laws of gravitation.
You figure out thermodynamics, et cetera.
And then you can be much more confident
that you’re going to control what you build.
So this is actually a field which
has gained a lot of momentum quite recently.
It’s a very small field still.
It’s known by the nerdy name of mechanistic interpretability.
To give you an idea of how small it
is, if you compare it with neuroscience
and you think of this as artificial neuroscience,
neuroscience is a huge field, of course.
Look how few people there are here at MIT at this conference
that I organized just two months ago.
This was the biggest conference by far in this little nascent
field.
So that’s the bad news.
It’s very few people working on it.
But the good news is even though there are so few,
there’s already been a lot of progress–
remarkable progress.
I’ve seen more progress in this field
than in all of big neuroscience in the last year.
Why is that?
It’s because here, you have a huge advantage
over ordinary neuroscience. First of all,
to study the brain, with its 10^11 neurons,
you’d have a hard time reading out more than 1,000 at a time.
You need to get IRB approval for all sorts of ethics reasons
and so on.
Here, you can read out every single neuron all the time.
You can also get all the synaptic weights.
You don’t even have to go to the IRB either.
And you can use all these traditional techniques, where
you actually mess with the system
and see what happens that we love to do in physics.
And I think there are three levels of ambition that
can motivate you to want to work on mechanistic
interpretability, which is, of course, what
I’m trying to do here, to encourage you to work more
on this.
The first, lowest ambition level is,
when you train a black-box neural network on some data
to do some cool stuff, to understand it well enough that you
can diagnose its trustworthiness:
make some assessment of how much you should trust it.
That’s already useful.
Second level of ambition, if you take it up a notch,
is to understand it so well that you can
improve its trustworthiness.
And the ultimate level of ambition,
and we are very ambitious here at MIT,
is to understand it so well that you
can guarantee trustworthiness.
We have a lot of work at MIT on formal verification,
where you do mathematical proofs that code is going
to do what you want it to do.
Proof-carrying code is a popular topic in computer security.
It’s a little bit like a virus checker in reverse.
A virus checker will refuse to run your code if it
can prove that it’s harmful.
Here, instead, the operating system says to the code,
give me a proof that you’re going to do what
you say you’re going to do.
And if the code can’t present the proof
that the operating system can check, it won’t run it.
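To make this proof-carrying-code picture a bit more concrete, here is a minimal toy sketch in Python. The function names and the "proof" format are hypothetical stand-ins invented purely for illustration; real proof-carrying code relies on formal proof checkers, not a stub like this.

# Toy illustration of the proof-carrying-code idea: the host refuses to run
# code unless it ships with a certificate that the host can check on its own,
# without having to trust the code.

def check_proof(code: str, proof: dict) -> bool:
    # Hypothetical stand-in for a real proof checker (e.g. a theorem prover).
    return (proof.get("property") == "never_writes_outside_sandbox"
            and proof.get("certificate_ok", False))

def run_if_proven(code: str, proof: dict):
    if not check_proof(code, proof):
        raise PermissionError("No valid proof supplied; refusing to run.")
    exec(code, {"__builtins__": {}})  # run in a stripped-down environment

run_if_proven("x = 1 + 1", {"property": "never_writes_outside_sandbox",
                            "certificate_ok": True})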
It’s hopeless to come up with rigorous proofs
for neural networks because it’s like trying to prove things
about spaghetti.
But the vision here is if you can use AI to actually
mechanistically extract out the knowledge that’s been learned,
you can re-implement it in some other kind of architecture,
one which isn’t a neural network and which really lends itself
to formal verification.
If we can pull off this moonshot,
then we can trust systems much more intelligent than us
because no matter how smart they are,
they can’t do the impossible.
So in my group, we’ve been having
a lot of fun working on extracting learned knowledge
from the black box in the mechanistic interpretability
spirit.
You heard, for example, my grad student Eric Michaud talk
about this quanta hypothesis recently.
And I think this is an example of something which
is very encouraging, because if this quanta hypothesis is
true, you can do divide and conquer.
You don’t have to understand the whole neural network
all at once.
But you can look at the discrete quanta of knowledge it has
learned and study them separately, much like we physicists
don’t try to understand this data center all at once.
First, we try to understand the individual atoms
that it’s made of.
And then we work our way up to solid state physics, and so on.
It also reminds me a little bit of Minsky’s Society of Mind,
where you have many different systems working together
to provide something very powerful.
I’m not going to try to give a full summary of all
the cool stuff that went down at this conference.
But I can share– there’s a website, where
we have all the talks on YouTube if anyone
wants to watch them later.
But I want to just give you a little more nerd flavor
of how the tools that many of you have as physicists are
very relevant to this: things like phase transitions,
for example.
So we already heard a beautiful talk by Jacob Andreas
about knowledge representations.
There’s been a lot of progress on figuring out
how large language models represent knowledge,
how they know that the Eiffel Tower is in Paris,
and how you can change the weights so that it thinks it’s
in Rome, et cetera, et cetera.
We did a study on algorithmic data sets,
where we found phase transitions.
So if you’re trying to make the machine learning system
learn a giant multiplication table
(this could be for some arbitrary group operation, or something
more interesting than standard multiplication),
then if there’s any sort of structure
here, if this operation is, for example, commutative, then
you only really need the training data for about half
of the entries.
And you can figure out the other half
because it’s a symmetric matrix.
If it’s also associative, then you need even less, et cetera.
So as soon as the machine learning
discovers some sort of structure,
it might learn to generalize.
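As a quick sanity check on that counting argument, here is a tiny Python calculation (my own illustration, not something from the study): a commutative operation table over n symbols is a symmetric matrix, so only about half its entries are independent.

# A commutative "multiplication table" on n symbols is symmetric, so only the
# upper triangle (including the diagonal) carries independent information:
# n*(n+1)/2 entries instead of n**2.
n = 59
total_entries = n * n                       # 3481
independent_entries = n * (n + 1) // 2      # 1770
print(independent_entries / total_entries)  # ~0.51, i.e. roughly half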
So here is a simple example.
Addition modulo 59: we train a neural network to do this.
We don’t give it the inputs as numbers.
We just give it each of the numbers
from 0 to 58 as a symbol.
So it doesn’t have any idea that they should
be thought of as numbers.
And it represents them
by embedding them in some internal space.
And then we find that exactly at the moment
when it learns to generalize to unseen examples,
there is a phase transition in how it represents
the symbols in its internal space.
The representation starts out spread over a high-dimensional space,
but then everything collapses onto a two-dimensional plane,
which I’m showing you here, with the symbols arranged in a circle.
Boom, that’s, of course, exactly like the way
we do addition modulo 12 when we look at a clock.
So it finds a representation where it’s actually adding up
angles, which automatically captures, in this case,
the commutativity and the associativity.
And I suspect this might be a general thing that
happens in learning language and other things also:
it comes up with a very clever representation that
geometrically encodes a lot of the key properties,
which is what lets it generalize.
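To see why a circular embedding captures modular addition, here is a minimal numpy sketch of the idea (my own illustration, not the group’s actual code): place each symbol k at angle 2*pi*k/59 on the unit circle, add the angles, and read the sum back off the circle.

import numpy as np

p = 59  # addition modulo 59, as in the example above

def embed(k):
    # Place symbol k on the unit circle: a two-dimensional representation.
    theta = 2 * np.pi * k / p
    return np.array([np.cos(theta), np.sin(theta)])

def add_via_angles(a, b):
    # Adding the two angles (multiplying the unit vectors as complex numbers)
    # and reading off the nearest symbol reproduces addition mod p.
    za = complex(*embed(a))
    zb = complex(*embed(b))
    theta = np.angle(za * zb) % (2 * np.pi)
    return int(round(theta * p / (2 * np.pi))) % p

assert all(add_via_angles(a, b) == (a + b) % p for a in range(p) for b in range(p))
print("circle representation reproduces addition modulo", p)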
We also do a lot of phase transition experiments
where we tweak various properties of the neural network.
If you think of this as being like water,
you could have pressure and temperature on your phase diagram;
here, the axes are various other nerdy machine
learning parameters instead.
And you get these phase transition boundaries
between the region where it learns properly and can generalize,
the region where it fails to generalize and never learns anything,
and the region where it just overfits.
This is for the example of just doing regular addition.
So you see it learns to put the symbols on a line
rather than a circle in the cases where it works out.
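To give a flavor of how such a scan can be set up, here is a toy sketch in Python using scikit-learn. The modulus, model, and parameter grid are placeholders I made up for illustration; there is no claim that this tiny setup reproduces the actual phase diagrams from the talk.

# Toy "phase diagram" scan: sweep the fraction of the addition table used for
# training against the weight-decay strength, and record test accuracy.
import numpy as np
from sklearn.neural_network import MLPClassifier

p = 11                                   # small modulus to keep the toy fast
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
X = np.concatenate([np.eye(p)[pairs[:, 0]], np.eye(p)[pairs[:, 1]]], axis=1)
y = (pairs[:, 0] + pairs[:, 1]) % p      # labels: the sums mod p

rng = np.random.default_rng(0)
for train_frac in [0.3, 0.5, 0.7, 0.9]:
    for alpha in [1e-4, 1e-2, 1.0]:      # weight decay (L2 penalty)
        idx = rng.permutation(len(X))
        n_train = int(train_frac * len(X))
        train, test = idx[:n_train], idx[n_train:]
        model = MLPClassifier(hidden_layer_sizes=(64,), alpha=alpha,
                              max_iter=2000, random_state=0)
        model.fit(X[train], y[train])
        print(f"train_frac={train_frac:.1f} alpha={alpha:g} "
              f"test_acc={model.score(X[test], y[test]):.2f}")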
So I want to leave a little bit of time for questions.
But the bottom line I would like you to take away from all this
is I think it’s too pessimistic to say, oh, we’re forever
just going to be stuck with these black boxes
that we can never understand.
Of course, if we convince ourselves that it’s impossible,
we’re going to fail.
That’s the best recipe for failure.
I think it’s quite possible that we really
can understand enough about very powerful AI systems
that we can have very powerful AI systems that
are provably safe.
And physicists can really help a lot
because we have a much higher bar for what
we mean by understanding things than a lot of our colleagues
in other fields.
And we also have a lot of really great tools.
We love studying nonlinear dynamical systems.
We love studying phase transitions.
And so many other things, which are turning out
to be key in doing this kind of progress.
So if anyone is interested in collaborating, learning more
about mechanistic interpretability,
and basically studying the learning
and execution of neural networks as just yet
another cool physical system to try to understand,
just reach out to me.
And let’s talk.
Thank you.
[APPLAUSE]
All right, thank you very much.
Does anyone have questions?
I actually have one to start with.
So, as you were explaining in these last few slides,
a lot of the themes seem to be about applying
the laws of thermodynamics and other physical laws
to these systems.
And the parallel I thought of is the field of biophysics,
which also sort of emerged out of this, right?
Applying physical laws to systems
that were considered too complex to understand before we really
thought about them carefully.
Is there any sort of emerging field
like that in the area of AI or understanding
neural networks other than that little conference you just
mentioned?
Or is that really all that’s there right now?
There’s so much room for there to really
be an emerging field like this.
And I invite all of you to help build it.
It’s obviously a field, which is not only very much needed,
but it’s just so interesting.
There have been so many times in recent months
when I read a new paper by someone else about this,
and I’m like, oh, this is so beautiful.
Another way to think about this is I always tell
my students, when they pick tasks to work on,
they should look for areas where there is more data–
where experiment is ahead of theory.
That’s the best place to do theoretical work.
And that’s exactly what we have here.
If you train some system like GPT-4
to do super interesting things, or use Llama 2, which just came
out and where you have all the parameters,
it’s an incredibly interesting system.
You can get massive amounts of data.
And the most fundamental things about it, we don’t understand.
It’s just like when the LHC turns on
or when you first launch the Hubble Space
Telescope, or the WMAP satellite,
or something like that.
You have a massive amount of data,
really cool basic questions.
It’s the most fun domain to do physics in.
And yeah, let’s build a field around it.
Thank you.
Yeah, we’ve got a question up there.
Hi, Professor Tegmark.
I was wondering, so most–
first, amazing talk.
I loved the concept.
But I was wondering if it is possible that this approach may
miss situations in which the language model actually
performs very well, not in a contiguous region,
like a phase region in parameter space,
but rather in small blobs scattered all around?
Because in most physical systems,
we have a lot of parameters and we have phases,
and the phases are mostly confined to contiguous regions
in n dimensions or whatever.
And then there are phase transitions,
which is the concept here.
But also, since this is not necessarily
a physical system, maybe there might
be a situation in which it performs best
at specific combinations of parameters that are like isolated
points or little blobs scattered around.
I don’t know if my question went through.
Yeah, yeah, it’s a good question.
I think I need to explain better.
I think my proposal is actually more radical than I
probably explained.
I think we should never put something
we don’t understand, like GPT-4, in charge
of the MIT nuclear reactor or any other high-stakes system.
I think we should use these black box
systems to discover amazing knowledge
and discover patterns in data.
And then we should not stop there and just connect it
to the nuclear weapons system or whatever.
We should instead develop other AI techniques
to extract out the knowledge
that they’ve learned and re-implement it
in something else.
So take your physics metaphor again.
So Galileo, when he was four years old, if his daddy threw
him a ball, he’d catch it.
Because his black box neural network
had gotten really good at predicting the trajectory.
Then he got older and he’s like, wait a minute,
these trajectories always have the same shape.
It’s a parabola, y equals x squared, and so on.
And when we send the rocket to the moon,
we don’t put a human there to make
poorly understood decisions.
We actually have extracted out the knowledge
and written the Python code or something else
that we can verify.
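In the spirit of extracting the knowledge and writing it down as code you can check, here is a minimal Python sketch of my own making: the ball-catching knowledge written out as explicit, inspectable projectile equations (ideal motion, no air resistance) instead of a black-box predictor.

import math

g = 9.81  # gravitational acceleration, m/s^2

def landing_point(x0, y0, v, angle_deg):
    # Horizontal position where a projectile launched from (x0, y0) with
    # speed v lands (y = 0).  Every line can be inspected and verified by hand.
    theta = math.radians(angle_deg)
    vx, vy = v * math.cos(theta), v * math.sin(theta)
    # Solve y0 + vy*t - 0.5*g*t**2 = 0 for the positive root.
    t = (vy + math.sqrt(vy**2 + 2 * g * y0)) / g
    return x0 + vx * t

print(landing_point(0.0, 1.5, 10.0, 45.0))  # a closed-form, verifiable prediction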
I think we need to stop
putting an equals sign between large language models and AI.
We’ve had radically different ideas of what AI should be.
First, we thought about it in the [INAUDIBLE] paradigm
of computation.
Now, we’re thinking about LLMs.
We can think of other ones in the future.
What’s really amazing about neural networks, in my opinion,
is not their ability to execute computation at runtime.
They’re just another massively parallel computational system.
And there are plenty of other ones too
that are easier to formally verify.
But where they really shine is in their ability
to discover patterns in data, to learn.
And let’s use them– continue using them for that.
You could even imagine an incredibly powerful AI
that is just allowed to learn, but is not allowed to act back
on the world in any way.
And then you use other systems to extract out
what it’s learned.
And you implement that knowledge in some system
that you can provably trust.
This, to me, is the path forward that’s really safe.
And maybe there will still be some kind of stuff
which is so complicated we can’t prove that it’s
going to do what we want.
So let’s not use those things until we can prove them
because I’m confident that the set of stuff that can be made
provably safe is vastly more powerful, and useful,
and inspiring than anything we have now.
So why should we risk losing control
when we can do so much more first in a provably safe way?
We’ll do one more question.
All right, thank you.
I’ll keep my question short.
So for your phase transition example,
is it just an empirical observation?
Or do you have a theoretical model like you do in physics?
Right now, it’s mainly an empirical observation.
And actually, we have seen many examples
of phase transitions cropping up in machine learning.
And so have many other authors.
I’m so confident that there is a beautiful theory
out there to be discovered, a sort of unified theory of phase
transitions in learning.
Maybe you’re going to be the first to formulate it.
I don’t think it’s a coincidence that these things keep
happening like this.
But this gives you all an example
of how many basic, physics-like questions
are out there that are still unanswered,
where we have massive amounts of data as clues
to guide us towards the answers.
Thank you.
And I think we will probably, at some point in the future,
even discover a very deep relationship, or duality,
between thermodynamics and learning dynamics.
That’s the hunch I have.