FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

TRANSCRIPT. So today I'm going to talk about what understanding is. We need the scientific community to agree on what understanding is. If you think about the first step in tackling climate change, we had to reach a scientific consensus on what was causing it before we could advise on what to do about it. Now with these large language models, chatbots, there are still many scientists who don't think that they really understand the same way we do. They believe in a very different model of what human understanding is. For the last 70 years, there have been two very different paradigms for intelligence. The logic-inspired paradigm, which dominated for about the first 50 years of AI, is that the essence of intelligence is reasoning, and reasoning is done by using symbolic rules to manipulate symbolic expressions. So your knowledge is a bunch of symbolic expressions inside your head. They thought learning could wait: the first thing we have to understand is how knowledge is represented, in some special, logically unambiguous language, and figuring out how knowledge is represented has to come first. The contrast was a biologically inspired approach. That's what people like Turing and von Neumann believed in. There the essence of intelligence is learning the strengths of the connections in a neural network, and those people think that reasoning can wait. First we have to understand how learning works, and later on we'll understand reasoning. Reasoning is something that came very late biologically. The real transition to most people believing in the biological approach rather than the logical approach came in 2012, when a deep neural network trained with backpropagation got about half the error rate of standard computer vision systems which had been highly tuned for that competition, the ImageNet competition. Once that happened, the whole computer vision community fairly rapidly switched to using neural nets, and that opened the floodgates for what we see now, which is neural nets being used for everything. So now what people mean by AI is artificial neural networks. For a long, long time, what they meant by AI was definitely not artificial neural networks; it was symbolic AI. But even when we could do vision better than standard symbolic AI, many people in the symbolic AI community said, "Yeah, but they'll never do language." Because obviously if symbolic AI is going to be good for anything, it's going to be good for language, because that is strings of symbols in and strings of symbols out. The linguists were also very skeptical. Most linguists believed in Chomsky's theory, which I think is crazy, that language is not learned. I think Chomsky is sort of a cult leader: if you can get people to agree on something that's manifest nonsense, that language is not learned, then you've got them. Now, Chomsky never had a theory of meaning. It was all about syntax. He didn't know how to have a theory of meaning. And for the linguists, the idea that a big neural network with random weights and no innate knowledge could actually learn both syntax and semantics just by looking at data was real anathema. They were very confident that could never happen. Chomsky was so confident that even after it had happened, he published articles saying, "Well, they'd never be able to do this," without actually checking by asking the models to do it, which they did very well. So, here are two very different theories of the meaning of a word.
And for a long time, it looked as though these two theories were just alternative theories of meaning. The symbolic AI theory is that the meaning of a word comes from its relationships to other words. So what a verb means is determined by how it occurs with these other words in sentences. And to capture meaning, we need something like a relational graph where you have words, you have links between them, you have the relationship attached to the link, and that's how you're going to represent meaning: a knowledge graph of some kind. Psychologists, particularly in the 1930s, I think, thought that the meaning of a word is actually just a big set of semantic features. There might also be syntactic features, and words with similar meanings have similar semantic features. So Tuesday and Wednesday will have very similar sets of semantic features, whereas Tuesday and "although" will have very different sets of semantic features and syntactic features. In 1985 I made a tiny language model that tried to unify those two theories. The idea was that you would learn semantic features for each word symbol, and you'd learn how to make all those features of the previous words interact to predict the features of the next word. The model was trained by backpropagation to predict the next word. So that's just like these large language models today. And like those models today, instead of storing sentences or propositions like the symbolic AI people thought you ought to, it would not store any sentences or any propositions. It would generate sentences by repeatedly predicting the next word when it wanted to generate a sentence. There were no sentences inside. The knowledge it had was relational knowledge that resided in the way features of words interacted so as to predict the features of the next word. That's very different from a bunch of propositions and rules for manipulating them. If you ask what happened over the next 30 years: about 10 years after that tiny language model, which was really tiny, it only had a few thousand weights, Yoshua Bengio showed that you could actually use similar kinds of models for predicting the next word in real natural language. So you could model what was going on in natural language with this kind of model, and it was about the same as the state of the art. About 10 years after that, leading computational linguists began to accept that feature vectors, which they called embeddings, were actually a good way to model the meanings of words. And about 10 years after that, researchers at Google invented transformers and published them. OpenAI then used them and showed the world what they could do. And that's when people, not just researchers but everybody, began to get interested in what was happening in these large language models. Were they really understanding what they were saying? The large language models can be viewed, particularly by me, as descendants of the tiny language model. They use many more words as input; they have big contexts. They use many more layers of neurons, so that they can disambiguate words as they go through the net. A word like "may" you initially represent with some feature vector that is hedging its bets between whether it's a month or a modal or a woman's name. As you go through the net, you use interactions with the context to clean it up into one of those three meanings. They use much more complicated interactions between the learned features. In the original tiny language model, the interaction is very simple.
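To make that concrete, here is a minimal sketch, in PyTorch, of a model in the same family as that tiny model: a learned feature vector (embedding) per word symbol, one simple layer that combines the features of the previous words, and training by backpropagation to predict the next word. The class name, sizes, and toy data are illustrative assumptions, not the original 1985 architecture.

```python
# A minimal sketch (not the original 1985 model): learn a feature vector per
# word symbol and a layer that combines the previous words' features to
# predict the next word, trained by backpropagation.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=16, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)    # semantic features per word symbol
        self.combine = nn.Linear(context * dim, dim)  # interaction of previous words' features
        self.predict = nn.Linear(dim, vocab_size)     # next-word features -> scores over the vocabulary

    def forward(self, context_ids):                   # context_ids: (batch, context)
        feats = self.embed(context_ids).flatten(1)    # concatenate the feature vectors
        hidden = torch.tanh(self.combine(feats))
        return self.predict(hidden)                   # logits for the next word

# Toy training loop on made-up data, just to show the learning signal.
vocab_size = 50
model = TinyLM(vocab_size)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
contexts = torch.randint(0, vocab_size, (32, 2))      # pairs of previous words
targets = torch.randint(0, vocab_size, (32,))         # the word that actually came next
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(contexts), targets)
    loss.backward()                                   # backpropagate the prediction error
    opt.step()
```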
In current large language models, they're very complicated. They use something called attention. And I'm now going to try and give you an analogy for how words work. This is fairly ambitious, because I think people have completely the wrong model of how language models reality. So I want to give you an alternative model. It's not perfect; it's an analogy. There's lots of things wrong with it, but it gives you something to hang on to in thinking about how we use language to model reality, because we need some way of modeling things. That's what meaning is. Meaning is having a model. So think of words as like high-dimensional Lego blocks. With Lego blocks, you can model any 3D shape moderately well. Don't worry about the surface, which may be sort of rectangular, but the volume you can model pretty well, if it's big, with 3D Lego blocks. Think of words as like Lego blocks, but instead of being 3D, they're like thousand-dimensional. Now, that's a bit of a problem for most people, because they're not sure how to think about a thousand dimensions. I'll tell you how everybody does it: you think of a three-dimensional thing like a Lego block and you say "thousand" to yourself very loudly. That's the best you can do. So there are these thousand-dimensional Lego blocks, and the shapes are very complicated because they're in a thousand dimensions. And we can use combinations of those to model anything at all. Not just the distribution of matter in 3D, but anything, like theories of how the brain works. Now, instead of having just a few different kinds of Lego block, we've got thousands of different kinds of these special high-dimensional Lego blocks, namely all the different words. And each one isn't a fixed shape. Each one has a range of shapes it can adopt. It's flexible. It can distort. It can't adopt any old shape: once you know what the word is, you're constrained in the shape. There may be several alternative shapes, like for the word "may", but for things that only have one central meaning, there's just a sort of range of possible shapes. And what they do is deform to fit in with the other words in the context. So the context gives a particular shape to each word. I should say at this point that in transformers it really happens with word fragments, but let's just suppose it's words. And instead of thinking of them fitting together by using little plastic cylinders that plug into holes like Lego blocks do, a very rigid way of fitting together, think of the words as having lots of little hands on them. And these hands have funny shapes. And the way they fit together with other words is they do handshakes. That's called query-key handshakes in a transformer. And actually, as the context changes the precise vector that you're using for the meaning of a word, you change the shapes of the hands. So it's slightly more complicated than them having a fixed set of hand shapes: the hand shapes change as the words deform. And what they want to do is deform in such a way that they change their hand shapes so that the words in the context can shake hands with other words. And that's what understanding is. Understanding is taking those words, finding how to deform them and how to deform their hands so they can shake hands with other words, and then you've got a structure. It's a bit like a bunch of amino acids forming a structure, but a lot more complicated and in a thousand dimensions. That's my image of what understanding is.
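For the query-key handshake itself, here is a minimal single-head sketch (scaled dot-product attention), assuming PyTorch. It leaves out multiple heads, the feed-forward layers, and everything else in a real transformer; the function name, sizes, and random projections are just for illustration.

```python
# A minimal single-head sketch of the query-key "handshake". Each word vector
# emits a query ("what kind of hand am I looking for?") and a key ("what kind
# of hand am I offering?"); the strength of each handshake decides how much of
# the other word's value flows in, which is what deforms a word's vector to
# fit its context.
import torch

def attention(x, wq, wk, wv):
    """x: (seq_len, dim) word vectors; wq/wk/wv: (dim, dim) learned projections."""
    q = x @ wq                                  # queries
    k = x @ wk                                  # keys
    v = x @ wv                                  # values
    scores = q @ k.T / k.shape[-1] ** 0.5       # how well each query-key pair shakes hands
    weights = torch.softmax(scores, dim=-1)     # normalized handshake strengths
    return weights @ v                          # each word becomes a context-shaped blend

dim, seq_len = 8, 5
x = torch.randn(seq_len, dim)                   # feature vectors for 5 word fragments
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
print(attention(x, wq, wk, wv).shape)           # (5, 8): the deformed word vectors
```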
Understanding is you take these word symbols, and the word symbols don't mean anything by themselves. They need an interpreter, and you're the interpreter; your brain is. It deforms these roughly thousand-dimensional shapes so that their hands deform and they can all shake hands with other ones, and now you've formed a structure. That's what it means to understand something: to form that structure. That structure is understanding. So the large language models, if you want to model all of human knowledge, they're complicated. They have many layers, very complicated interactions. And so it's very hard to analyze what they've actually learned and to see that they really are understanding what they're saying, particularly if you have the wrong model of understanding. So people influenced by symbolic AI and by Chomsky and those other linguists questioned whether they really were intelligent or whether they really understood what they were saying. And they used two main arguments. One was they said it's just autocomplete: it's just using statistical correlations to paste together bits of text and predict the next word, and the text was all created by people, so it's not creative at all. Well, actually it beats most people on standard tests of creativity, so that's not a very good argument. And the second argument they used was, well, they hallucinate, which shows they don't really understand anything. So let's take the autocomplete objection. In the old days when you did autocomplete, you kept tables of how often particular combinations of words occurred. In a simple case, you might store tables of how often triples of words occurred, as in the sketch after this paragraph. And then if you saw "fish and", you'd realize that "fish and chips" occurs a lot of times, so "chips" is a likely next word. I actually gave this talk in the House of Lords and realized that there "fish and hunt" is probably more likely than "fish and chips". But there are obviously alternatives. That's not at all how LLMs predict the next word; that's disappeared. They don't store any text. They don't store any tables of combinations of words. They model all the text they've seen by inventing feature vectors for words that can deform with contextual influences, and complicated interactions between these word fragments via these handshakes. And that's what their knowledge is. Their knowledge is in those interactions; it's a bunch of weights in the neural network. So that's what knowledge is in these large language models, and it's what knowledge is in us too. The original tiny language model wasn't invented to be good at modeling natural language. It was invented to explain how we can understand the senses of words. How can I take a sentence like "she scrummed him with the frying pan"? I've never heard the word "scrummed" before, but in one sentence I already know a lot about what it means, because the hole created by the context tells you what it ought to mean. So my claim is that we model reality by using these word fragments in much the same way as machines do. It's obviously not exactly the same. The other argument is, well, hallucinations show they don't really understand anything. First of all, they should be called confabulations. They've been studied by psychologists for a long time and they're very characteristic of people. We store knowledge in weights, not in stored strings. We think we store files in memory and then retrieve the files from memory, but our memory doesn't work like that at all. We make up a memory when we need it. We construct it.
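Here is the sketch of old table-based autocomplete mentioned above: count how often triples of words occur, then predict the next word by table lookup. The function names and toy corpus are made up for illustration; the point is the contrast, since modern LLMs store nothing like these tables.

```python
# A minimal sketch of old-style autocomplete: count word triples, then predict
# the next word by table lookup. This is what LLMs do NOT do -- they store no
# text and no count tables, only weights.
from collections import Counter, defaultdict

def build_trigram_table(text):
    words = text.lower().split()
    table = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        table[(a, b)][c] += 1                   # count how often each triple occurred
    return table

def autocomplete(table, w1, w2):
    counts = table[(w1, w2)]
    return counts.most_common(1)[0][0] if counts else None

corpus = "we had fish and chips then more fish and chips and then fish and hunt"
table = build_trigram_table(corpus)
print(autocomplete(table, "fish", "and"))       # 'chips' -- the most frequent continuation
```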
It's a very constructive business. It's not stored anywhere; it's created when we need it. And so it'll be influenced by things we've learned since the events happened, and we'll be very confident about details that we get wrong. There's a very nice example of that, which is John Dean's memory when he testified at the Watergate trial. He was testifying about meetings in the Oval Office, and he had no idea that there were tapes, but he was trying to tell the truth. And you can see that he's wrong about huge numbers of details. He says there were meetings between these people; those meetings never happened. He says this person said one thing, when actually it was somebody else who said it. But you can also see that the gist of what he said was exactly right: there was a cover-up going on, and those were the kinds of things people said. So what he was doing was creating these meetings in his mind, and what he was creating was what seemed plausible to him. That's exactly what the chatbots do, but it's also exactly what people do. Now, currently chatbots are worse than most non-presidents at knowing whether they're just making it up, but that's going to change. So finally, if you ask how we share knowledge with each other: well, we've got these words that have names. So I make this structure out of these complicated thousand-dimensional Lego blocks shaking hands, and I can't tell you all that structure, but I can tell you the names of the words. And now you can create the same structure using the names of the words. I can also put into these complicated thousand-dimensional things little clues about what should shake hands with what. That's called syntax. The symbolic AI theory is that we share knowledge by copying a proposition from my head to your head, or from my head to the computer, and this proposition is written in this funny logical language. The neural net theory is that you have a teacher and a student, and the teacher produces some action and the student tries to mimic it. So for language, the teacher produces a string of words. The student tries to predict the next word, takes the errors in predicting the next word, and backpropagates them to learn how you should convert symbols into feature vectors, these thousand-dimensional Lego blocks, and how these things should interact. That's called distillation. And the problem is it's very inefficient. That's also how we get knowledge from people into computers. So that's how these large language models learn: by trying to predict the next word a person said. But it's not efficient. If you take a typical sentence, a string of symbols, it's only got about 100 bits. It's a few hundred or maybe less than 100, but it's that order. And that means the most that a student can get is about 100 bits of information, because that's all there is in the sentence. That's not a very big learning signal. Now compare that with what happens when you have multiple copies of the same agent and it's a digital agent. These digital agents, you can have multiple copies of. The multiple copies have exactly the same weights. They work exactly the same way because they're digital. And so one copy looks at one bit of the internet, another copy looks at another bit of the internet. They both figure out how they'd like to change their weights, and then they share how they'd like to change their weights. So now both copies know what each copy experienced.
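Here is a minimal sketch of that weight-sharing idea, again assuming PyTorch: two identical copies each compute the weight changes they would like on their own slice of data, then average those changes so both copies learn from everything either copy saw. Real systems do this with collective operations (all-reduce) across many accelerators; the tiny model, random data, and learning rate here are toy assumptions.

```python
# A minimal sketch of weight sharing between digital copies: each copy computes
# a gradient on different data, the gradients are averaged, and both copies
# apply the same update, so they stay identical and both "know" both datasets.
import torch
import torch.nn as nn

def grad_on(model, x, y):
    """Return this copy's desired weight changes (gradients) for its own data."""
    loss = nn.functional.mse_loss(model(x), y)
    return torch.autograd.grad(loss, list(model.parameters()))

model_a = nn.Linear(4, 1)
model_b = nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())     # identical weights: the same digital agent

data_a = (torch.randn(8, 4), torch.randn(8, 1))   # one part of the internet
data_b = (torch.randn(8, 4), torch.randn(8, 1))   # a different part

grads_a = grad_on(model_a, *data_a)
grads_b = grad_on(model_b, *data_b)

with torch.no_grad():
    for p_a, p_b, g_a, g_b in zip(model_a.parameters(), model_b.parameters(), grads_a, grads_b):
        shared = (g_a + g_b) / 2                  # sharing the gradients: billions of bits in one step
        p_a -= 0.1 * shared                       # both copies apply the same update,
        p_b -= 0.1 * shared                       # so they remain exact copies of each other
```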
And if you ask how much they are sharing when they share their weights, or the gradients for their weights: they're sharing trillions of bits if they've got trillions of weights. So you're talking about the difference between being able to share trillions of bits and being able to share hundreds of bits. It's really no competition. It only works if the agents are digital, use their weights in exactly the same way, and have exactly the same weights, but it's hugely more efficient. And that's why GPT-4 can know thousands of times more than any one person. It's a not-very-good expert at everything. So the conclusions are that digital agents understand language in much the same way as people do. It's not a completely different kind of system. We are like them and they are like us. They're much more like us than they are like standard computer code. Digital computation requires a lot more energy because it's digital: you have to drive transistors very hard, and you can't use the funny quirky properties of analog hardware. But because they're digital, they have this really efficient means of sharing; they can share weights or gradients. Biological computation is much lower power because it can use all the quirky properties of neurons, but it's no good at sharing knowledge. So the overall conclusion is that if energy is cheap, or at least abundant, digital computation is just better, because it can share knowledge efficiently. And that's a very scary conclusion.

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.