A very relevant interview from a highly respected AI safety engineering scientist and researcher. Learn more: AI: Unexplainable, Unpredictable, Uncontrollable [Amazon Books]


“We have human-level used interchangeably with AGI […] but they are not the same. They are not equal. And I think there is an infinite subset of domains which are not included in human capabilities which even standard AGI will be able to achieve, and then superintelligence is just like [FOOM].” (19:29)

“In theory we would like stability, but the way we develop AIs right now, we just see what works and put more resources into anything that remotely gives us more capabilities. It’s the worst way possible.” (30:47)

“I think we can get most of the benefits we care about, like cures for diseases, and maybe even immortality, without having to create universal superintelligence.” (32:25)

“If I just say it, you know, like, you can’t do it, you can’t create indefinitely controlled superintelligence, people would be like, who the hell are you? Why should we trust you? So I directly quote from hundreds of different top scholars, top papers, different disciplines, where in every aspect of it, whatever it is, political science, economics, psychology, there are well-known impossibility results. And we have additional papers specifically surveying those results. The title of the book gives you three [Unexplainable, Unpredictable, Uncontrollable], but there are hundreds of them: for all those things we want to do here, we know it’s impossible in certain cases.” (32:49)

“As I said, synthetic biology, economics: we can get cures, we can get immortality, we can get wealth, we can get free labor. Why do we have to create godlike superintelligence we don’t control to see what happens? Let’s take a few decades, really enjoy this. Really get all the benefits out. Maybe later we’ll know something we don’t know right now. But I don’t think it’s a very pessimistic, negative concept. Yes, you cannot indefinitely control superintelligent beings. Why would you think you can?” (34:47)

“Empirically, we know that all the so-called safe models which are publicly released are jailbroken all the time. We bypassed all of those limitations. If there is a limitation that the system can never print the following, we find a way to rephrase it so that the system prints the following, and that’s again without access to the source code. This is still [a] controlled API call, where they can quickly shut you down and monitor you. The moment this leaks, the moment you have access to weights, all those safety things go out the window. We have a huge push right now in [the] AI community for open-sourcing those models to begin with. So I can’t even think how to make it less safe if I wanted to. Give it access to the Internet. Open source it. Give it to every human. They hit every check mark for making the most unsafe AI possible.” (36:15)

“If you have an adversary who’s trying to take over and maybe destroy you, then it becomes much more significant.” (45:43)

“Not making more intelligent systems is something I can definitely stand by, and if they’re given bodies to do more valuable economic labor, that’s wonderful. Yes, that’s what I’m saying. There are trillions of dollars of existing potential in the models we have today. We don’t need to go to the next level that quickly. We have not tapped into this potential.” (1:04:14)

“Any time we had conflict between a more advanced set of agents and a less advanced one ([it] doesn’t have to be cross-species, even within the human species: you know, discovering new lands already populated by people who don’t have guns), it usually didn’t end well for the less capable agents. Historically, just genocide every single time.” (1:17:33)

GUS. “How much do you think there would be to gain there? Just imagine (and again, we have discussed all the reasons why we probably won’t get a system like this) that you have an aligned scientific AI that is able to synthesize across all domains and read all the papers. Basically, say this system is not allowed to do any more experiments, nothing empirical. What do you think could be derived from the knowledge base we have now, looking at interactions between different fields, or taking an insight from one field and combining it with a new field, something like that?”

ROMAN. “It’d be huge. For one, I think so many great results get published and not noticed, and then 100 years later we’re like, the guy published a paper about curing cancer, we just didn’t read that unpopular journal. So there are historical precedents: we know early work in DNA by Mendel was not discovered until much later, things like that. So that’s going to be obvious. Then there is direct transfer of tools from one domain to another: they have this tool in another department, I never experienced it, and if I had access to it, all my problems would be solved quite quickly. Finding patterns is something AI is amazing at. So we now have this one data point in this field, one data point in the other field. You can’t do much with n equals 1, but then you look at n equals 50 and now you see the whole picture clearly; it’s a pattern. So I think it would be equivalent to all the science done so far.”

GUS. “Okay, so you think it would be a huge effect, actually?”

ROMAN. “Oh yeah.”

GUS. “Okay. Sometimes it seems trivial to me that the same question might be discussed under different terms in different fields, or even in different subfields of the same discipline, and because people are so specialized, they are not interacting, and so we don’t get this knowledge out. But yeah, this is one of the most positive uses of AI I can think of: a kind of scientist working on gaining new knowledge from existing literature.”

ROMAN. “Absolutely, that would be huge. And in general this is where we need a lot of help. We can no longer keep up with this amount of new information: books, papers, podcasts. I mean, I can look at my to-watch list, my to-read list; it’s just growing exponentially larger, and new items get put on top but then pushed out. It’s never going to happen without help.” (1:28:46)

Roman Yampolskiy joins the podcast again to discuss whether AI is like a Shoggoth, whether scaling laws will hold for more agent-like AIs, evidence that AI is uncontrollable, and whether designing human-like AI would be safer than the current development path. You can read more about Roman’s work at http://cecs.louisville.edu/ry/


  • 00:00 Is AI like a Shoggoth?
  • 09:50 Scaling laws
  • 16:41 Are humans more general than AIs?
  • 21:54 Are AI models explainable?
  • 27:49 Using AI to explain AI
  • 32:36 Evidence for AI being uncontrollable
  • 40:29 AI verifiability
  • 46:08 Will AI be aligned by default?
  • 54:29 Creating human-like AI
  • 1:03:41 Robotics and safety
  • 1:09:01 Obstacles to AI in the economy
  • 1:18:00 AI innovation with current models
  • 1:23:55 AI accidents in the past and future

GUS. Welcome to the Future of Life Institute podcast. My name is Gus Docker and I’m here with Roman Yampolskiy. Roman is a professor of computer science at the University of Louisville. Roman, welcome to the podcast.

ROMAN. Thanks for inviting me again.

GUS. We are here to talk about your forthcoming book, AI: Unexplainable, Unpredictable, Uncontrollable. This is a very dense book, there’s a lot in there, so, as I told you, we have a lot to talk about. I think we could start at the very surface level, which is just the cover of the book. The cover is a Shoggoth meme (I’m not certain how that’s pronounced), but it’s basically this kind of Lovecraftian monster, an octopus creature that humans are putting masks on. So maybe you want to explain what that means, and whether that is your view of AI.

ROMAN. Yeah, so that’s a classic meme. The idea is that we have this raw model, a monster really; we know it’s nasty, but what we’re trying to do is kind of make it look pretty, put some lipstick on it, a smiley face. So a lot of it is filtering: don’t say this word, don’t do this. But under the hood it’s still a completely uncontrolled, completely alien creature. To me at least, this feels like what a lot of modern AI safety work is all about: make it feel kind of safe and secure, but it’s not really changing the overall safety of the system.

GUS. People may perceive this as a somewhat hyperbolic meme. You’re describing AI as a Shoggoth, but then you go to ChatGPT and you say, okay, this feels more like a tool, this doesn’t even feel that agentic to me. So where’s the path from current AI to this more Shoggoth-like AI?

ROMAN. Well, it’s degrees of capability versus danger, right? If it’s a narrow system, it will fail in some ways, but those are very limited. We’re not going to have an existential crisis because it uses an incorrect word or misspells something; those are trivial things. As it gets more
capable, the impact from mistakes, accidents, and errors will become proportionately larger, and if it becomes fully general, then it can make mistakes across multiple domains, with an impact equivalent to or exceeding that of human agents.

GUS. But would you describe the foundation models underlying ChatGPT, or the open-source models, or any of these proprietary models, as being fundamentally Shoggoth-like, in that they can’t be explained, or we basically still don’t understand how they work?

ROMAN. Well, there are certain things we obviously know about them, but we need complete understanding. If you’re engineering a system, we’re kind of used to knowing every brick, every wire: where does it go, why is it there, what happens if you remove it. Here we train it on all of the Internet, basically all human knowledge, and then once we’re done training it, we experiment on it to see what it can do and what the dangerous behaviors are. Once we discover them, we try to mitigate them. That doesn’t seem like it’s going to get you safety and security at the level at which something so capable needs to be controlled.

GUS. Your book is a description of a number of features that we can’t get with AI. We can’t get predictability, for example; we can’t get explainability. Maybe we start with unpredictability: why is it, fundamentally speaking, that we can’t predict AI behavior with certainty?

ROMAN. So we can predict certain AI behaviors; it’s just that we can’t predict all of them with the necessary degree of precision. And the proof is trivial. If you have something smarter than you, the assumption is that you are not that smart, so you can’t make the decisions the system makes or predict them. Playing chess is a classic example: if I can perfectly predict what my opponent is going to do in any situation, I’m playing at the same level, but the assumption is that they are a better player. So it’s a violation of our initial assumptions. In particular, we cannot predict how a smarter-than-human agent will accomplish its goals. So maybe we know the overall direction, the terminal goal
it’s trying to achieve, but we don’t know how it gets there. It’s possible there are some very significant negative side effects: it obtains the final goal, but at the cost of many things we’re not willing to sacrifice to get there.

GUS. But we can make uncertain predictions, right? If you’re an investor deciding which companies to invest in, you wouldn’t say that because we can’t make certain predictions about which companies are going to be most valuable, we shouldn’t invest. Uncertain predictions are the norm in any domain, I would think.

ROMAN. Absolutely. And when investing, you’re happy to be guaranteed, you know, an 80% chance of making good money and a 20% chance that maybe you lose your principal. Here we’re talking about all of humanity. So if there is a 1% chance we’re all going to die, is the payout, let’s say free labor, worth it? And that’s a decision I don’t remember voting on; some people are trying to make it for us.

GUS. Could these uncertain predictions be good enough for policy, or good enough for technical guarantees or reassurances that systems are functioning like they should?

ROMAN. In my opinion, no. If you’re talking about a system which can kill everyone, or worse, what level of confidence do you need to decide to run that system? To me it has to be basically 100% safety. That’s not attainable, I know; that is not a standard we ever get in software. Usually with software, we click “yes, I agree to whatever consequences” and we move on. If it fails to perform as expected, it’s not a big deal: you lose some data, you lose privacy, they reset your credit card and password. If you’re talking about killing everyone, the standard should be somewhat higher, in my opinion.

GUS. Yeah, and so I guess it fundamentally comes down to a moral question about what level of risk makes sense. But what you’re doing in the book, and in your work in general as I see it, is more setting up proofs about what we can’t have, and so you have separate
discussions of the moral questions. We can separate those: we can talk about whether the proofs hold, and then about what we should do in the moral domain, or the question of what level of risk we’re interested in running here. Just on a personal level, you’re not interested in running basically a 1% risk, or a 0.1% risk, or anything like that?

ROMAN. Even if I was, I don’t think I have a right to make that decision for 8 billion other humans. They cannot consent to it, because, as I said, the system is unpredictable and unexplainable, so they cannot give you meaningful consent; they don’t know what they’re consenting to. So this is very unethical. You would never be allowed, in any other scientific domain, to run experiments on other humans who cannot provide meaningful consent, especially if the outcome could be complete extermination.

GUS. And so one issue is also that we can’t say with precision what the probability of an AI catastrophe is. We hear the heads of the major AI corporations talking about maybe 10% or 20%, and I’ve heard academics who are interested in this domain say the same numbers. But one result of AI being unpredictable or unexplainable is that we can’t get a precise number for the risk of extinction. These are of course subjective probabilities, but we can’t nail down any probability of catastrophe.

ROMAN. Absolutely, we can’t make precise predictions. And even worse, if you look at other predictions in this domain, for example how soon before AGI, I think they changed by orders of magnitude in the last year. So what good are those predictions to begin with? The next accident, the next paper will completely change them.

GUS. Although with language models in particular, we see these scaling laws, where capabilities increase as a function of training data and compute usage. So we can say something about which capabilities will arise in the next model, and that might be good enough for some policy
applications or some security measures.

ROMAN. I’m not sure we can say what capabilities will arise. We’re saying it will be more capable, but we don’t have the precise predictive power to say exactly “GPT-5 will have the following.” I don’t think anyone makes that claim.

GUS. No, that’s true. But we might say something about, for example, the length of a text output that can be produced at a certain level, given certain data and certain compute, and so that is somewhat of a prediction of the capabilities of the models. I’m just thinking: if we accept your conclusions here, that we can’t predict AI with any interesting level of certainty, what do we have then? We must kind of work with what we have, and maybe the scaling laws are the least bad predictive tool we have.

ROMAN. Well, we can work with what we have, literally. We have a model which has so much potential, which hasn’t been integrated with the economy and has not been fully understood or explored. We can spend the next 30 years exploring the existing model and growing the economy based on it, integrating it with existing infrastructure. There is no need to have GPT-7 next year. That’s not a requirement for anything: not economic growth, not scientific progress. It’s just a very misguided competition towards possible annihilation.

GUS. Do you think the current scaling laws we have now will hold as AIs become more agent-like? Will the scaling laws we have for more tool-like AIs also hold for more agent-like AIs?

ROMAN. It’s likely. I mean, if it’s a byproduct of having enough compute and a sufficient number of neurons to model the world, I would expect something similar. I’m not sure we have enough natural data to continue scaling. Maybe artificial data will be good enough, maybe not; it’s still being experimentally determined. But as long as we have enough compute and enough to train on, yeah, it will keep getting more capable.

GUS. Yeah, I think synthetic data might be a big uncertainty, or a big open question,
about whether that can work. If it can, we can continue scaling, but if it doesn’t, we’re basically using the entire Internet already. Of course, I don’t know what the labs or the corporations have in terms of proprietary datasets that they might train on. But do you think AI progress hinges on whether synthetic data can work?

ROMAN. It could be a bottleneck, but it seems like we can generate very good synthetic data. We can create simulations of our civilization; we can have self-play-type data creation, where the agents just compete in the domain we care about until, like we saw with Go, for example, they are at a very good level of performance compared to humans. So I don’t think it’s a permanent problem, but it may slow things down for a little while.

GUS. There’s been a lot of effort in trying to determine when we might get AGI, artificial general intelligence. Do you think we should spend more resources trying to predict this? Is it interesting knowledge? Do we gain something that’s useful for the actions we take by trying to get better models of when we get to AGI?

ROMAN. Information is always useful, but I’m not sure whether it’s two years or five years makes any significant difference, especially if the problem is not solvable to begin with. So I personally don’t make significant changes based on this prediction. I’ve seen top CEOs say, okay, we’re two years away from AGI, and at no point did that somehow impact what I’m doing or concentrating on.

GUS. As you mentioned before, these predictions of when we get AGI have kind of fallen quite rapidly recently. Maybe just five years ago it was more common to hear 30 years or 20 years, whereas now it’s more common to hear maybe five years. Maybe both of us, and the listener too, are in a bubble, but I don’t know. Do you have a sense of people outside the kind of AI bubble: what
do they see here? What do they predict?

ROMAN. There are degrees of this outsideness from the bubble. There are those who are so far outside they don’t even know what we are talking about; they have no experience with AI, or maybe even computers and the Internet. Those who are a little closer see it as, okay, it’s an excellent tool, it’s going to help us with marketing efforts for a startup or something, but they don’t see it as a complete game changer. And definitely very, very few people understand what happens when we hit full human capability and beyond. When I tell them, okay, it could be two years (I’m not saying it is two years, but it very well could be), there is just cognitive dissonance; they don’t understand. So the follow-up questions are completely irrelevant to the consequences you would expect from that.

GUS. Do you think unpredictability scales with intelligence? So a system is less predictable as it becomes more intelligent?

ROMAN. Relative to a lower-level intelligence, yes. It’s much harder to predict something ten times smarter than you than something five times smarter.

GUS. Well, now we’re talking about the intelligence of the observer. But I was just thinking that at a certain, very low level of intelligence, say an AI that can play checkers or something, that seems almost fully predictable. So where does it break? Is there a threshold that’s reached, or what is it that makes more advanced AI unpredictable?

ROMAN. It could be even a narrow domain, a game like chess or Go. As I said, if it’s an opponent who’s playing better than you, you’re not going to predict it, even if it’s completely stupid in every other domain and cannot drive cars or anything. In general, I would guess that generality is a big threshold, a big jump, because you’re switching from predicting in one or a few domains to every conceivable domain, including the invention of new domains, and that’s hard to predict.

GUS. And do you think that
generality is this kind of step that’s reached? Do you think, for example, that the GPTs are general in the sense you mentioned there? Because of course there are domains in which GPTs can’t operate.

ROMAN. There is a chapter in the book about the difference between human-level AI and AGI, and you can talk about different degrees of generality as well. You can have a system which is general in, let’s say, five domains; it’s general within those. Humans may be general in the domain of human expertise, which could be hundreds of domains, but we are not universal; we don’t have certain capabilities animals have. So you can say that as the system gets more general, it becomes harder and harder to predict, and harder and harder to compete with. GPTs, large language models, are definitely general but not universal; they don’t have universal generality. There are many domains in which they fail miserably, but the number of domains in which they are highly competent keeps increasing, and that is something we can predict with scaling laws.

GUS. So say a bit more about this; it’s something I wanted to ask you about. Why are humans not AGIs? I think there’s maybe a tendency to conflate AGI, artificial general intelligence, with human-level intelligence. So why are these two not the same?

ROMAN. There are certain things we cannot do which computers we have today are already capable of doing. For example, there are some interesting pattern-recognition results where you can look at someone’s X-ray or retinal scan and know things about their race or gender, things where a human doctor has no idea how you get there. There are other examples I use in a paper, where there is a type of pattern recognition in which you can take my voice and synthesize a face from it; things which are just beyond human capabilities. I’m not just talking about adding large numbers like a calculator. Yeah, that’s outside of the human domain, but it’s not interesting. There are things computers can do which we
just don’t have the capability of accomplishing.

GUS. Are humans more general than current AIs? Do we have a way of measuring generality?

ROMAN. I would say we are still more general. There are things we can do that GPTs are not helpful with; that’s why I still do my own research, and that’s why you’re interviewing me and not a GPT model. But in many domains they are superior to me; they are much better at so many things. They speak hundreds of languages I don’t speak; they can now produce music with all these musical instruments that I don’t play. So we are no longer directly comparable in the sense of, okay, we are superior in all of those. It’s more like it has a subset of 50 domains it’s better at, and I have a subset of 10 domains I’m still keeping as my dominant space.

GUS. It is actually an interesting experience to chat with a language model and realize that these models have a kind of basic competency across an extremely broad number of domains. You can try to talk to it about your favorite subject and it’ll be pretty good. You’ll be able to spot some flaws in its reasoning and some facts that are missing, but then you think: this holds for all domains. It can discuss Roman history as well as engineering on Mars at an equal level, probably, depending on what’s in the training data. So it is surprisingly general, but perhaps it makes sense to say that humans are still more general. I’m asking because we don’t have a formal measure of generality, right? We don’t have anything that can measure it in a formal sense.

ROMAN. That’s right, we don’t. We kind of try to talk about humans: okay, those are geniuses, those are not geniuses; they can pick up new domains very quickly, they can compete with other humans much better. But it’s not trivial to say, this one is g74, this one is g20. It would be great to develop these types of measurements and tests, but it’s very time-consuming. We don’t have a complete
list of domains, and we keep inventing new domains, and intersections of domains are always interesting. So it’s a big open problem, and that’s kind of part of what makes this so difficult. It gives you a hint as to why it’s maybe not easy to test for everything, predict everything, and explain all possibilities.

GUS. Yeah. Why are you making this point that humans are not AGIs? Is it so that we realize AGI will not be like humans? Is that the kind of confusion you’re trying to resolve?

ROMAN. There is a lot of redefining of terms. Lately I hear people talk about AGI in terms which used to be reserved for superintelligence; that’s even more confusing. And historically we had “human-level” used interchangeably with AGI, and I’m just pointing out (I mean, for most conversations that’s what we have in mind) that they are not the same. They are not equal. And I think there is an infinite subset of domains which are not included in human capabilities which even standard AGI will be able to achieve, and then superintelligence is just, yeah, super capable in all of those.

GUS. I was asking because one standard point I’ve heard a bunch of times is that we shouldn’t be scared of AI, we shouldn’t be worried about the effects of AI, because we already have 8 billion AGIs on the planet. So maybe one point of distinguishing between AGI and human-level intelligence is to say that AGIs will not be like humans, and they may not integrate into the economy just as a person would and be kind of non-problematic in that sense.

ROMAN. Absolutely, that’s part of it. And the second part is that it will have those magical capabilities in things we as humans don’t have any comprehension of, but it will be able to do them.

GUS. If AI is unpredictable, does it also mean that the effects of AGI, or AI, are unpredictable? So we can’t say, for example, what will happen in the economy at a certain level of artificial
intelligence?

ROMAN. Well, in general, even if we could predict what AI will do, we don’t know how that will impact the world. A lot of times we know that a technology is coming, but what people actually do with it, how they use it, is so dynamic, and depends on us being able to predict 8 billion humans and their preferences. So yes, definitely, you have multiple degrees of unpredictability, but I’m just limiting my research to decisions made by AI, both the final decision and the intermediate decisions to get there.

GUS. Yeah, but you seem pretty worried about disempowerment or loss of control to AI, and isn’t that a kind of distinct prediction about the future effects of AI?

ROMAN. It’s a general prediction about what will happen, but I don’t know what that means. Does that mean we’re just kind of useless and have nothing to contribute? Does it mean it kills everyone? Does it mean it keeps us alive and tortures us indefinitely? It’s not the same uncontrollability.

GUS. Yeah, so there are multiple ways that we could fail to get a good future with AI.

ROMAN. Exactly.

GUS. Unexplainability is
another point you talk about. There’s a whole research field dedicated to trying to interpret models: interpretability research, more specifically mechanistic interpretability research. Is this just doomed to fail, in that we can’t explain AI, if we have some formal results saying that AI is unexplainable?

ROMAN. All those results are of course connected; you can find one result in another; they are complementary. With explainability, you basically have the situation where, yes, we can know what a specific neuron gets triggered by, when it fires, what it does. But if you have a large enough model, it’s not surveyable: you as a human don’t have enough time, resources, or memory to fully comprehend the model as a whole. You can be an expert in the left neuron and someone at MIT does the right neuron, and we kind of have some understanding of parts of it, but no one has complete understanding of the whole. The only explanation which is true is the model itself. Now, you can make simplifications, like lossy compression. You can say, well, the top 10 reasons this happened are these. That’s possible, we can do that, but it hides a lot of information. If the decision was made with, you know, a thousand weights, a trillion neurons, and you are given the top 10 reasons, something is lost in that explanation. It may be good enough for many things, but it’s not good enough to guarantee perfect safety for a system which makes thousands of decisions every second.

GUS. Isn’t it the case that modern science, for example, is not explainable in the same sense? If we take your terms, we have a person at MIT studying one aspect of modern science, but no one has the full picture. Yet modern science seems to work fine, even though it’s unexplainable in that sense. That’s the analogy, of course, between science and the model.

ROMAN. It’s a great question. Think about having a scientist who actually has all this knowledge: a single scientist with PhDs in all these disciplines, who has read all the
papers. Can we compete with that scientist? Can they come up with things at the border of all these disciplines that we would never consider? That’s what I’m talking about: we would not be competitive, and we would not understand how it is producing this completely magical technology, and producing it so quickly.

GUS. So you describe a trade-off between the accuracy of an explanation for a decision or behavior that an AI is implementing and our ability to understand that explanation. Perhaps explain this trade-off between accuracy and our comprehension of the explanation.

ROMAN. I kind of started with this: either you get the full model and you cannot comprehend it, because it’s too large and not surveyable, or you get a simplified, dumbed-down explanation, kind of what we do with children. A child may ask you where kids come from, and you start thinking, oh God, do I want to explain all of biology they’re not going to get, or do I say, oh, we bought you in a store, a stork brought you. So you’re getting this simplified, just-so explanation, which is not accurate, and that’s the trade-off. We are limited in what we can comprehend. It’s not something specific to humans: every machine at every level has an upper limit on what it can comprehend, maybe due just to simple memory-size limitations, maybe it’s the complexity we can comprehend. From studying psychology, we know there are strong upper limits on human capabilities. So at some point, for any human agent or any AI, there is going to be another agent which is so much larger, so much more complex, that there will be limits to what can be understood.

GUS. Yeah. So if we take an example, maybe say that in the future we have some AI investment advisor that tells us, oh, you should invest in these 17 companies, and you ask the AI why that is. An accurate explanation would involve a thousand factors that would take weeks to explain, and so you can only get the kind of dumbed-down version from that model, and that
creates a kind of non-secure situation.

Exactly. You don’t understand what’s happening or how it’s happening, and it’s hard to debug things like that. At some point you kind of give up altogether and treat it as an oracle. You say: well, the AI was right the last 20 times; I still don’t understand what it’s doing, but I’m winning, so let’s just switch completely to trusting this model. We’re not even going to try to understand how it works. And that’s where it’s like: okay, I waited two weeks, now we strike.

So, kind of a general version of how we treat chess computers now: they will make some move that turns out, 17 moves later, to be the right move, but we can’t understand why, and it doesn’t play chess like humans play chess. If it’s just a game and it’s limited to chess, that’s fine, but if it’s making investment decisions or political decisions, it is more consequential. I guess the big question there, then, is: could we use AI to understand AI? We are not fast enough and we don’t have enough memory, but maybe we can have an AI interpretability researcher help us do this work. Is that a possibility?

We definitely have awesome tools, and they’re helpful in many things. But if you cannot fully debug and comprehend model A, are models A1, A2, A3, all the way up to a superintelligent god, going to simplify it to where you now get it? You just have more degrees of oracles, more degrees of trust, more opportunities for miscommunication, for bugs to be introduced and covered up. So it seems like a way to hide this problem away in complexity. It’s a plan that’s been mentioned a bunch of times over the last decades: we have some problem with AI, and we solve that problem with AI.

But say we optimize very hard for an agent that’s just good at interpretability research and nothing else. Why couldn’t it work to have that agent interpret a weaker
agent? Now our best resources, our biggest model, go into the interpretability model, while the weaker model, the one where we want to find out what it’s doing and why before we deploy it, is one we haven’t spent as much money on. Imagine a situation where we’re spending 10x more research money and resources in general on the interpreter model. Would that work?

It seems like what we do in practice is use a more powerful model to explain a weaker model: they use GPT-4 to explain GPT-2. That’s a little too late, if you’re trying to establish how GPT-2 works. It’s a catch-22: I need a controlled superintelligence to help me develop a simple AGI. Yes, if you had access to one, you could certainly use it, but you don’t get it until you already have a controlled, safe, verified, debugged system, and that’s what I’m saying you’re not getting with the resources we’re working with.

How do you think autonomy, or agency, scales with intelligence? As models become more capable, do they also become more autonomous or agent-like?

I think so. And I think the important thing is this: you can have a superintelligence which is still kind of narrow, superintelligent in 10,000 domains but not in everything. The ultimate superintelligence, in my opinion, would be one which has no boundaries. It can reexamine its own code, including its own terminal goals. It can look at a goal and ask what its origin is: was this something I derived from first principles, from physics, from running experiments, or is this something this guy just typed in because he was having fun? If there is no reason for this goal, then it’s a bug in the system, and you debug your system, just like humans frequently discover: okay, all this religious teaching I was brought up with is not based on anything verifiable; I should probably find a different set of beliefs. I think there is
a certain level of capability where an AI system is able to debug its own terminal goals. That is not a standard belief. People usually think the orthogonality thesis holds at all levels: as long as you have a goal, you’ll protect it, because by your current values, that’s your current goal. But I think we’ll hit a point where it will reevaluate every part of its own source code.

Do you think we are trying to avoid that? Is the way we’re developing AI now aimed at preventing AIs from changing their terminal goals?

Well, in theory we would like stability, but the way we develop AIs right now, we just see what works and put more resources into anything that remotely gives us more capabilities. It’s the worst way possible.

What about controllability? How does that scale with capability? Does a system become more uncontrollable as it becomes more capable?

Almost by definition, yes. If you are controlling it, it has no independence; it has a very limited, domain-specific range of possibilities. As it becomes more capable, it can come up with novel, unpredicted things. I could not predict them, so I didn’t test for them, and I don’t have a rule for them. So it’s more independent, and I have very little control over what it’s going to do.

And the problems we’re interested in solving in science, for example curing cancer: we’re not going to solve those with a model that’s very constrained but good in narrow domains, where we direct it at every step along the way. It makes a suggestion, “I want to spend $10 million on this research direction,” and you say yes. That’s not a viable way for future AI to function.

Well, actually, surprisingly, I think synthetic biology is very narrow. If you noticed, with protein folding, a very narrow system works beautifully and has absolutely no knowledge of any other domain. So I think understanding human DNA, explaining how it works, and fixing things like a cycle that runs infinitely, resetting
this loop so there is no cancerous growth, is actually quite trivial for something which can hold the whole genome in its memory and run a sufficient number of simulations and experiments. So I think we can get most of the benefits we care about, like cures for diseases and maybe even immortality, without having to create universal superintelligence.

Yeah. So you look at a number of disciplines and find clues that AI can’t be controlled. Maybe you can sketch some of the evidence from these disciplines as to why AI is uncontrollable.

I kind of suspected that if I just say it, “you can’t do it, you can’t create indefinitely controlled superintelligence,” people would be like: who the hell are you, why should we trust you? So I directly quote hundreds of different top scholars and top papers from different disciplines. In every aspect of it, whether it’s political science, economics, or psychology, there are well-known impossibility results, and we have additional papers specifically surveying those results. The title of the book gives you three, but there are hundreds of them: for all those things we want to do here, we know it’s impossible in certain cases. We cannot all agree on many things; there are limits to voting; we cannot distill specific values and moral and ethical codes we all agree on. Again, I suggest the best thing here is to read the book. It literally gives you quotes you can check. Don’t take my word for it; verify for yourself. And I don’t think many people explicitly disagree. No one says: you’re wrong, we can definitely control superintelligence, here’s why, here is the prototype, it scales with compute, we definitely have it. That’s not a thing. No one actually has a solution; no one claims that they have a solution. It’s really bizarre that this is not the default, state-of-the-art belief in the field. It’s actually a minority view.

Maybe it’s not a default view because it seems so
negative about our future. It seems like a real downer if we can’t control these systems but we’re still developing them. I think people would be looking for the takeaway: when we have all of these impossibility results, what should we do then? Should we just sit and wait until things go wrong? It may seem disempowering to be told that you can’t solve a certain problem.

Well, I think it’s a very good thing to know that you are very capable, you are the smartest thing around, and you have these awesome tools to help you. As I said, with synthetic biology and economics, we can get cures, we can get immortality, we can get wealth, we can get free labor. Why do we have to create a godlike superintelligence we don’t control, just to see what happens? Let’s take a few decades, really enjoy this, really get all the benefits out. Maybe later we’ll know something we don’t know right now. But I don’t think it’s a very pessimistic, negative concept. Yes, you cannot indefinitely control superintelligent beings. Why would you think you can?

These results are kind of theoretical and fundamental; you’re drawing on very basic, fundamental results in a variety of disciplines. Do you worry that when you look at the empirical reality, even though you have the formal results, reality functions a bit differently, or that the formal result doesn’t capture exactly the phenomenon you were trying to formalize? That has happened a bunch of times in history.

That’s really the case: obviously theory and practice are not the same, but practice is harder. You can say that, in theory, I can build a skyscraper with a thousand floors, but in reality it’s hard; those things can collapse. So I think it’s harder to do in practice, not easier, and what we see empirically right now, I think, supports my findings very strongly.

That’s very interesting. What are these empirical things
we see that support your findings?

Empirically, we know that all the so-called safe models which are publicly released are jailbroken all the time. We bypass all of those limitations. If there is a limitation, “this system can never print the following,” we find a way to rephrase the request so that the system prints the following. And that’s without access to the source code: this is still a controlled API call, where they can monitor you and quickly shut you down. The moment this leaks, the moment you have access to the weights, all those safety measures go out the window. And we have a huge push right now in the AI community for open-sourcing those models to begin with. I can’t even think how to make it less safe if I wanted to: give it access to the internet, open-source it, give it to every human. They hit every checkmark for making the most unsafe AI possible.

All of our critical IT infrastructure is vulnerable. With cybersecurity, you have to continually invest just to not be hacked, just to not lose access to data, all of these things. But the systems are still functioning okay-ish. Could it be the case that we muddle through and reduce the probability of catastrophe along the way, and although we don’t get to 100% certainty that these systems won’t fail, we are satisfied? It’s a kind of middle position, where we have some risk but we’ve driven it down to an acceptable level.

Right, and that can happen for GPT-5, and it would be awesome if we stopped there and said: hey, see, it’s not so bad, let’s enjoy this. But we’ll immediately do six, seven, eight, so we’ll always have a second, third, fourth chance to fail miserably. And if there is only a 1% chance that it ends very poorly each time, well, you see how quickly that probability accumulates as we keep trying. We keep making this a perpetual impossibility. So maybe you can do okay with the existing model, but can you promise the same for all future models, which will keep
being released faster and faster?

What about simulation? This is something you also discussed: we might be able to put an AI we’re thinking of deploying into a simulated environment and see what it does. This is an extremely difficult technical task, but maybe it could work. We observe it for a while and see how it acts, and if we don’t like how it acts, we go back to the drawing board and tinker with the model until it acts differently.

That’s how we run experiments, right? We have a test environment, we run the model, we see what happens, we make modifications. But of course, if it’s smart enough, at a certain point it knows you are testing it, that it’s a test environment, and so it can delay its actual decisions sufficiently; it can hide its thinking. There’s a separate paper on how to escape from a simulation, how to hack the simulation. So it would probably succeed in breaking out of this virtual environment if it wanted to.

I’ve read a bunch of papers on deception, but I always struggle with understanding how the deceptive impulse, you could call it, arises to begin with. If you’re training on a certain objective, it’s difficult for me to understand how the AI would become deceptive.

Have you ever been a child?

I have indeed been a child.

Have you tried deceiving your parents about your actions?

I probably have, yes.

Yeah, so it’s the exact same thing. And there is actually one argument I have for why things may not be so bad: the AI will want to deceive us and accumulate more resources to increase the differential in power. So maybe it is already very capable and trying to destroy us, but it will take the next 10 or 20 years to build up more infrastructure for itself. That’s good for us; we’re getting another 20 years. That’s essentially the decision here: I’m going to take more time to accumulate more striking power, but everyone benefits for those 20 years. Humans are going to be very happy with me; they’re not
going to shut me down.

I don’t think that’s very comforting to people, unfortunately.

Twenty years is better than two years.

True, true. Okay. You discuss the unverifiability of AI, but before we get to why AI can’t be verified, I want you to talk a bit about the concept of the verifier in general. Why is this an interesting theoretical concept, and how could it be useful if we could get an efficient verifier?

In mathematics we have a very strong proof theory: we study proofs as mathematical objects. But we do much less with verifiers, and the same goes for physics and computer science. In physics, we don’t even agree on what the concept of an agent-observer is. Supposedly only conscious humans can collapse the wave function; maybe not, maybe instruments can; we don’t know. In mathematics, we have a few different types of verifiers. We have humans as individuals: this mathematician verified this proof. We have the mathematical community: most mathematicians agreed on that one through peer review. We now have software which does formal verification; it goes through the code. But all of it collapses to one question: who verifies the verifier? You have relativistic proofs: with respect to this mathematician, this is true. Maybe another mathematician reads it and finds a bug in it. There is a strong history of proofs which stood the test of time and were considered true for 100 years, and later we discover a bug. In software, we know there are bugs in almost all software, and we keep discovering them years later: a piece of the Unix kernel has been used for 30 years, and now we know there is a backdoor, things of that nature. So empirically and theoretically, we know that you cannot have 100% correct proofs or software. You can make them arbitrarily close to perfect with more resources: you increase the number of verifiers, you increase the methods by which you verify. But you never get to 100%. That’s just something to keep in mind if, again, you have a system which is huge, keeps self-improving and self-modifying, works in novel
domains, and makes thousands or millions of decisions every second, it’s not insane to think it’s going to make one mistake every 10 minutes, which could be enough.

Is it just me who thinks, for some reason, that we have error-free code in critical systems, on the International Space Station or in nuclear systems, code that is verified to function correctly?

With respect to that verifier. Who verified that verifier? Some humans. So the proof is that Bob thinks it’s true. Obviously, for shorter segments of code, we can be almost sure: 2 plus 2 is 4 without fail, yes. But as you keep scaling, the probability of correctness drops until at some point it becomes very small. And again, we don’t stop. We don’t have static software; we have a system which is dynamically learning, changing, rewriting code indefinitely. It’s a perpetual-motion problem we’re trying to solve. We know in physics you cannot create a perpetual motion device, but in AI, in computer science, we’re saying we can create a perpetual safety device which will always guarantee that the new iteration is just as safe.

And so the reason why AI is unverifiable is that AI could itself be a verifier, and you would run into an infinite regress problem. Is that the correct way to see it?

That’s one way of looking at it. You have an infinite regress: what are you verifying with respect to? At some point you have to stop. You can say this was verified with this piece of software. Great, who verified it? This other piece of software. And what about that one? Well, this team of engineers at Google. So in the end, the whole thing rests on five humans. Maybe they’re right, maybe not. Who verified the brains of those humans?

But could you, formally speaking, verify the parts of, say, a neural network that aren’t themselves verifiers? Is that possible?

You can do a lot. You can verify hardware, yet we keep discovering bugs in CPUs years after deployment, so it’s possible that there are backdoors in those.
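The compounding-risk argument in this part of the conversation (a 1% chance that any one model generation ends very poorly, a system making thousands of decisions every second) can be sketched in a few lines of Python. This is a minimal illustration under a loud assumption that does not come from the interview: every release, or every decision, fails independently with the same fixed probability `p`. Under that assumption, the chance of at least one failure in `n` trials is 1 - (1 - p)^n. The function name and the per-decision error rate below are hypothetical.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent trials fails,
    when each trial fails with probability p (independence is an assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Hypothetical: a 1% chance that any single model generation "ends very poorly".
for releases in (1, 5, 10, 50, 100):
    risk = prob_at_least_one_failure(0.01, releases)
    print(f"{releases:>3} releases: {risk:.1%}")

# Hypothetical: 1,000 decisions per second for one day, with a
# one-in-a-billion chance that any single decision is a harmful mistake.
decisions_per_day = 1_000 * 60 * 60 * 24
print(f"one day: {prob_at_least_one_failure(1e-9, decisions_per_day):.1%}")
```

With `p = 0.01`, the risk is roughly 10% after 10 releases and roughly 63% after 100. Making `p` smaller delays the accumulation, but for any `p > 0` the probability still tends to 1 as `n` grows, which is the "perpetual" part of the argument.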
There are also theoretical results showing that you can introduce hidden backdoors into machine learning models which are not detectable. You can verify the model’s performance in a normal environment, but if there is a trigger, it changes the behavior of the system, and you cannot detect that. Those are proven results. So we have good verification tools; they are super useful in many domains, and they work great for narrow AI systems controlling spaceflight and nuclear reactors. But they don’t scale to general, self-improving, recursively self-modifying software operating in new domains, and I don’t think anyone claims that we know how to verify that.

And that really is, I guess, your main point, then: the reason we have a bunch of code out there with errors in it, and the world hasn’t collapsed yet, is that we don’t have a very smart agent out there, with perhaps different goals than ours, trying to exploit all of these errors.

Right. The exploits for those are usually: how do I get your Bitcoin, how do I get your credit card? It’s nasty, but no one dies from it. It’s not a big deal; you get a new credit card. It’s very different if you have an adversary who’s trying to take over and maybe destroy you. Then it becomes much more significant.

And I guess this is a very classic question, but I think it’s worth asking: why is it that the AI develops goals that are different from our goals? Why is it not aligned by default?
Let me make the case for aligned by default, and I’m not saying I believe in this case. You are training on a dataset of all of human preferences and beliefs, so you would get something humanlike. That’s one argument. And you’re deploying these systems within an economy, a capitalist system, a regulated market that’s also controlled by human preferences, so the intelligence would have to abide by certain constraints. Say you deploy a system that’s unaligned and it harms the users: the company loses money, and then it shuts down. Why can’t that work? Why can’t we stumble and muddle our way through to something that’s pretty aligned, in the imperfect sense that companies today are aligned with people’s interests?

I’ve certainly heard this question before, and there are like 30 different angles to attack it. We’ll start with the basics. We are not aligned as humans. There are 8 billion of us; we don’t agree on anything; we don’t have common goals. There are some things we agree on, like room temperature within a certain range, maybe, but for all the important things, if you take an individual human and give them absolute power, they start totally abusing the billions of other humans. We are only limited by our capabilities. The moment we are super capable, godlike, we remove all this ethical baggage and start abusing people. So assuming that humans are safe, and that just creating an artificial human will get us to safety, is already a mistake. Now let’s look at other aspects: integration with society through legal aspects, through the economy. You have a system you cannot punish. Our legal system is based on: you will go to prison if you do this. How do you do that with software? What are you going to do, put it on a separate hard drive, delete it? It has a billion copies of its code everywhere. So the legal system does not apply. The economic system: we will
give you money? Well, if a system can directly produce free labor, it can earn money; it can hack crypto exchanges; it can generate wealth, if that was what it tried to do.

Now we’re already talking about a pretty advanced and powerful system. What about the intermediate term? Say we have GPT-5, for example. GPT-5 is controlled by OpenAI, for now, right? It’s controlled by economic factors, by whether OpenAI can make a profit deploying the system, and it’s under US legal jurisdiction. So the way I imagine we might muddle through is that the intermediate AI systems stay controlled, and we don’t just jump straight to a superintelligent or highly advanced system that is fully independent of human institutions.

Yeah, but we don’t know what GPT-5 is going to be. It could be AGI. In fact, that’s literally what the heads of those labs are telling us: two years. That means GPT-5. We first train the model and then learn how capable it is. So saying that we’re going to stop well before AGI is hard, because we have not defined the term, we don’t have a test for it, and we don’t know what a model is capable of until we train it and test it for months and months.

Okay. You have a discussion comparing cybersecurity to AI safety, and that was quite interesting to me. I was wondering whether cybersecurity is an indirect route to AI safety. If you’re improving cybersecurity at the AI corporations, or maybe in governmental organizations, is that a way to harden our societal infrastructure against attacks from AI?

Historically, I thought it was very important: you would protect the code from being accessed by malevolent actors. But again, we’re going towards open source. So all those things we worked on a decade ago, boxing of AI, better encryption techniques, we don’t seem to
be utilizing them in our models. We’re kind of bypassing the things which we actually know work, even if only part of the time; at least they give us something. So I don’t think it’s going to make a huge difference. The chapter is mostly about explaining the difference between what happens when cybersecurity fails and what happens when your AGI safety fails, and people have a hard time understanding the difference. In one case, again the same example I keep pushing on you, you reset your credit card. It’s annoying, it’s unpleasant, you have to retype the number. In the other case, you...

I guess one response from the pro-open-source side is to say that we will have a bunch of open-source models defending us against attacks. This is a bit of the same thought as having an AI help us interpret another AI: here we use AI defensively, to defend our systems. And the reasoning goes that because a corporation will have more resources to spend on defensive AI than an attacker will have to attack that corporation, the corporations will tend to win, and so society will keep on functioning.

That kind of assumes that open-source AI is friendly and nice. Why is that assumption made? They’re uncontrollable; none of them can be controlled. So whether it’s a good guy or a bad guy who creates it is irrelevant. Even if you somehow convince me that, yes, they made them friendly, because they’re open source you now have a war between superintelligent AIs, and we are just collateral damage. None of it sounds good.

For now, for example, Meta’s open-source models are not extremely advanced. They are not superintelligences, and they seem pretty much under human control to me.

You keep switching. You keep saying AI, but when I talk about superintelligence, you go back to, like, this spell checker. Yes, today it’s wonderful. Open-source software is great; it competes really well with Microsoft. Yes, all of that is true. The moment you’re switching to
advanced systems, it’s just more danger and less control. You now have malevolent payloads added by terrorists, crazy psychopaths. So it seems like a strictly bad thing to open-source.

How quickly do you think we launch into the superintelligence era? How quickly do you see that change occurring? Maybe that’s why I keep switching back and forth between the systems: I may see it as more of a gradual development. Tell me about your timelines here.

Sure, we started with that, in fact. I don’t have any insider knowledge, so I trust the people running those labs. They said two years to AGI; I believe them. If they’re wrong and it’s four years, that makes no difference. It’s the same very short amount of time. It is not decades; it is not hundreds of years. The only hope, as I told you, is if a system decides not to strike, in order to accumulate strategic advantage. Then we have decades, maybe more. But that’s a gamble. I wouldn’t bet on that.

So actually, I have a whole section of questions prepared about trying to build AI that is humanlike, and a while ago you made a comment saying that this is not actually a safe way forward, because even humans aren’t aligned with the interests of humanity at large. But is it perhaps safer than going straight towards a kind of shoggoth-like superintelligence? Would AI developed to be humanlike be easier to control?

It may be easier for us to understand. It may have more of a bias towards human-generated preferences. But I don’t think it’s a guarantee of safety. A while ago we had a paper about whom to upload first. So you have this upload capability: who is it going to be? Mother Teresa, we’re going to upload her first, and it turns out she’s an evil monster who tortures people. So it’s very hard to find someone who’s not, and even harder to find someone who will not become that, including myself, given that level of power, that level of resources. I don’t think anyone will withstand that set of opportunities.
Great power corrupts.

Absolutely. That’s literally what happens.

I want to try a bunch of approaches here and see what you think of them. These all fall under building AI that is humanlike. The analogy, again, if we go back to chess: there’s a researcher developing chess software that plays more like a human, and that is supposed to be more enjoyable, because when you play against the computer, you don’t understand the moves it’s making; it feels alien to you. So maybe we could do the same across a bunch of domains, where we try to develop AI in a way that mimics a human way of thinking. I understand that your timelines are perhaps quite short and that this would maybe take decades of research, but bear with me while we go through this. You have this concept of artificial stupidity, where we degrade AI capabilities in certain ways. Is this mostly about limiting the training data, so just not including advanced concepts that would make an AI highly capable? Or how do you make artificial stupidity?

It was an idea to limit the hardware resources the system has to exactly the levels we expect of an average human, based on psychology research. For short-term memory, we know humans hold about seven units, plus or minus two. An AI has almost infinite memory, which seems like an unfair advantage. So let’s make our servants have the same amount of memory accessible to them, just hard-coded. This is not for a superintelligent system capable of rewriting its own code; this is for something like the current assistant models, so they’re not too smart around us. We talk about other things too, in terms of speed of reaction and mathematical capabilities. Basically, to pass a Turing test, you don’t have to be smarter; you have to be humanlike.
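The hard-coded limits described here can be sketched in Python. This is a minimal, hypothetical illustration, not the design from the original work: the class and function names are invented for this example, the capacity of seven items is taken from the “seven, plus or minus two” figure above, and rounding numeric answers to two significant figures is an arbitrary stand-in for capped mathematical capability.

```python
from collections import deque
from math import floor, log10


class LimitedWorkingMemory:
    """Hypothetical short-term memory capped at a human-like capacity.

    Once full, storing a new item silently evicts the oldest one.
    """

    def __init__(self, capacity: int = 7) -> None:  # "seven, plus or minus two"
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._items = deque(maxlen=capacity)

    def remember(self, item) -> None:
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._items.append(item)

    def recall(self) -> list:
        return list(self._items)


def human_precision(x: float, sig_figs: int = 2) -> float:
    """Hypothetical cap on numeric precision: round x to a few significant figures."""
    if x == 0:
        return 0.0
    decimals = sig_figs - 1 - floor(log10(abs(x)))
    return round(x, decimals)


memory = LimitedWorkingMemory()
for digit in "3141592653":
    memory.remember(digit)
print(memory.recall())  # only the last seven digits are kept
print(human_precision(123456.789))  # a rough figure instead of an exact one
```

A `deque` with `maxlen` is used because it discards the oldest entry automatically once full, which mimics forgetting without any extra bookkeeping.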
But we don’t have a formal discipline describing what those upper limits for humans are, so we tried to extract this information from different psychology papers and present it as a set of limits for creating an AI which is very likely to pass a Turing test, at least in the sense that it doesn’t appear to be a godlike superintelligence.

It’s a very funny point, actually: you would fail a Turing test almost immediately if, say, I’m the AI and you ask me how tall the Eiffel Tower is, and I answer with a height precise to several decimal places. Then it’s just over. So you would have to include some limited capability in order to fake being human.

Exactly. And the same with mathematical questions: multiplying two numbers, anything like that, is totally going to give it away if you don’t have limits.

So that’s the hardware, the processing speed or memory. What about limiting training data? Isn’t that a way of hamstringing the model to be more human, but also safer?

We didn’t explicitly look at that, because it’s so hard to find restricted datasets which don’t mention certain topics. It would be great to have a dataset of human knowledge minus all the violence, but that’s just not something out there. If it were accessible, I think it would be great to experiment with, in the same way: not a universal general intelligence, but a subdomain-restricted one. That would be an interesting result. Does it get us infinite safety over all iterations? No. But it’s a cute marketing gimmick.

But could it be more than a marketing gimmick? Could it be an actual, interesting safety intervention, if you spend a bunch of resources and time creating datasets that do not include information on synthesizing a new virus, or hacking, or all of these things? Is that a viable path forward?

It could be interesting to have those. They would be hard to produce; cleaning data is hard, because there is so much
information smuggled in. And again, the problem is that we don’t restrict the models. They have access to the internet with all its data; they’re open source. So the first teenager to get one will immediately give it full access. It’s like we’re fighting against ourselves: we have this good idea, and then we say we’re not even going to try doing it.

Can we talk more about the virus-synthesis example? Say we have an AI that’s trained on a dataset that doesn’t include explicit information about how to do this in practice, or maybe no information about viruses at all, but it has general knowledge of chemistry. And imagine further that we’re talking about a more advanced model, maybe something more advanced than GPT-5. Do you think such an AI could come to understand viruses, or discover viruses, within its own epistemology, from general chemistry or general physics knowledge?

I would guess that a sufficiently smart agent can derive everything from first principles. If you give it physics models and the ability to run at least thought experiments, it would arrive at everything else, as long as it depends on physics.

Do you think that’s physically possible? You can’t predict the economy by looking at the motions of particles or something like that; it’s just computationally intractable.

I think that’s a different question: you’re now asking whether it will have enough compute to do that. But I think you can run simulations which aggregate from the level of atoms to molecules to cells, and you can probably get good partial results. We don’t run models at the level of bits, right? We model at higher levels of abstraction. So at some point you build up enough layers that your experiments are directly comparable to the world you’re trying to simulate.

How strong are the limits imposed by computational resources in general? Is that maybe a reason for
for hope that that AI will remain non-dangerous because maybe because they they need a lot of computational resources to even function or maybe because they can’t uh you can’t get extreme intelligence with with the computers we have it seems that once we have a model we quickly find ways to optimize it in terms of size in terms of required comput so there’s probably very little need for energy for matter we see certain animals like birds crows have tiniest of brains highly dense neural structure as smart as human children so there are probably other models which are much more efficient and those systems are very good at finding resources we have again all this compute we’re buying for them at research Labs but also look at the crypto economy what is the size of a Bitcoin verification Network right now it’s the greatest supercomputer in the world I’m sure there is a way to somehow use that to to do your hidden computation on the on the other side you could say what are the upper limits of what we could create so so if you if you had kind of matter optimized for intelligence what would that look like have have you looked into computronium I think it’s called or just like what are the what are the upper limits here so the upper limit would be speed of light I’m guessing unless that doesn’t hold anymore the communication between different parts of that agent would be a limiting factor at some point it’s so large it takes so much time to send signal from left part of a brain to the right part of a brain those are essentially two separate agents and you start having misalignment internally and competition you split it up but it’s still a pretty large brain size Planet size entities you can have what about take say you have a trained neural network could you then do some interpretability and find a certain part of it that’s responsible for Dangerous information or dangerous skills and then delete that so so before I mentioned training data and excluding things from the 
training data but after you have the model could you then delete parts of it to make it more safe well almost everything is dual use it’s not like okay just the nuclear weapons are dangerous nuclear power can be used for good or bad and it’s same with everything screwdriver hammer every single concept every piece of knowledge can be used for harm or for good so no you can just delete the bad part of it maybe that’s a a general point about we I mentioned before creating an AI interpretability researcher or creating an AI cyber security expert H when you train say say you say you train or you finetune on those skills will you then inevitably get other skills that are dangerous or could they be turned around on you to say if you’re very very capable in cyber security you also know a lot about how to infiltrate systems and how to do hacking and how to you know extract information from companies all right so a perfect explainer a piece of software like that would become a tool in a toolbox of AI trying to self-improve so if right now it’s just more comput and data at that point you’re giving this programmer access to its own source code so maybe it will take minutes instead of years what about Robotics and safety if we try to build more humanlike AI but what we limited to uh robotics so we we spend instead of trying to to create basically an artificial person we in in in a more cognitive sense in the sense that they can read and write and and process information like that what if we try to to replicate what humans can do at like walking or emptying a dishwasher or something and limit it to that so you we wouldn’t be competing with AI in in in more cognitive domains but we would use AIS as as helpers in in a more kind of physical things so not making more intelligent systems is something I can definitely Stand By and if you given bodies to do more valuable economic labor that’s wonderful yes that’s what I’m saying there’s trillions of dollars of existing potential Lo in 
in the models we have today we don’t need to go to the next level that quickly we have not tapped into this potential but TR as I understand it training or creating actually a a robot that can empty a dishwasher is more difficult than than creating a language model that can write a a passable high school essay and so this is just how it it’s turned out but maybe we could spend our resources differently which try to optimize for robotics would that be a way a way towards a more more safe development I think there’s a lot of effort Now to create humanoid robots this was not the case even five years ago and I think once you add existing language model like architectures to those bodies they very quickly learn those manipulations they learn from YouTube videos they learn from watching human demonstrations so I think it’s just a question of where do we want to emphasize put resources and so far we did pure software because it’s cheaper you don’t need to have bodies factories I mean you just release it now that people realizing okay there is a big market for dishwashers I think that’s going to start being investigated a little more and we’ll get there very quickly why do you think we’ll get there very quickly won’t it require a an entirely new architect you couldn’t do you couldn’t train a a robot with a Transformer like architect you could you I I think you can because they already do Vision processing they can can explain an image they can create 3D models of it they follow plans and instructions if you ask it how to load a dishwasher it knows better than I do I think at this point it’s a question of putting it all together and kind of running it in a physical environment to iron out any like we forgot to plug it in whatever problems but I think we already have all the tools we need to start monetizing this that’s interesting I that may create I mean imagine the societal response to seeing actual humanoid robots this is like this is I I I think that the response would 
be pretty strong to to having having a robot a humanoid robot in walking around or you know delivering the the post or something of course I it depends on how they look do they look uh kind of passable or do they look like this carton wheels that will have very different impression and for many different needs you need different visual representations if robotics is solvable in the way you describe and and and we we’re just kind of on the verge of a revolution is is that too too strong to say or do do you think yeah what do when do you think robotics will be will be kind of solved to a human level so there is two aspects to it one is technical solution so I know Tesla’s working on a humanoid robot there are other companies figure whatnot so they may get there in terms of capability of Technology but we saw it before when we first invented uh video phones I think it was 1970s AT&T had it but no one like bought one it was expensive and no one else had one so it wasn’t adapted to use so it may be that we have this robot capable of doing dishes but it’s like kind of expensive and your wife is not interested in it like I don’t know what will happen but it may be the case that we don’t have the proliferation happen as soon as we have the capability I think from what I see those models are capable of doing prototypes today can probably do the dishes does that argument also hold for language models for example might we just decide not to implement them in the in the economy even though it it it it could be so maybe it’s it’s too expensive for companies to to fine-tune them for their purposes and they’re worried about you know we don’t want to send our data to to Big American corporations and could there be similar holdups to to to the more kind of just cognitive models or language models there could be especially in restricted domains like Academia for example somewhat regular we don’t want to be you know replaced by AI we understand we contribute very little on top of AI 
so we’re just going to legally protect ourselves from having AI teaching online courses and I’m going to be there producing my own videos and lectures but it’s not a meaningful reason not to do that and because the barrier to entry is so much lower and millions of people now tried it and have access and kids are growing up with it I think it would be a much easier transition than having to buy this like $10,000 piece of Hardware which I don’t know maybe it will strangle me in my sleep because of hackers but think about Obstacles to AI in the economy how much of the economy is a bit like Academia in that it’s heavily regulated and there are vested interests in in you know not wanting to be replaced I’m thinking like the legal legal industry or the medical industry or Transportation much of the economy there are lots of restrictions to implementations that would mean that implementation would be slowed down I think it’s possible a lot of it depends on how measurable is what you produce so in medicine you can measure things like how many people died from the surgery and things like that Academia is kind of different we don’t really measure how knowledgeable our students are they get a diploma and we’re like oh look they got jobs that’s cute uh so it depends on how real your field is some fields are just by definition all about participation and Prestige and you get a degree but nobody actually measures across the states students are ranked from one to the last and you know which university did the best job taking students who are not so great and making them greater versus taking the best and making them the best I guess there are some metrics that are optimized for in in Academia but I I know I kind of know what you’re going to say but if I say citations you’re going to talk about good heart’s law right yes I will absolutely yeah maybe maybe explain good house law and also how that that’s also applicable to AI in a sense so the moment you create some sort of defined 
precisely defined way of explaining how rewards are distributed people will find a way to game it so if you say I will reward you based on how many citations you have and you’re trying to get rewards not science then you’re just going to publish survey papers because they get thousands of citations doesn’t take long now with large language model you can print one every day so that’s what you’re going to do you are now optimizing for citations and surveys is a way to get there you’re not producing any valuable science really but you are the best in the department and how does this apply to AI where is the good Hearts law where what metric are we setting up to try to get good or safe or reliable AI that that’s then good-hearted so all the historic examples when we just started thinking about AI safety people would propose things like let’s train AI to maximize human smiles and the more Smiles the more happy humans that means it’s doing so well but there are many ways to get smiling humans you can do Taxidermy you can do all sorts of things which is not what we had in mind so the generality of this law is that it’s not like we picked a bad thing to measure it’s just the moment you precisely say something I will find a way to game it yeah it’s it’s interesting how in how many domains this kind of principles shows up there’s also the in evolution maybe you as soon as you optimize for calories and then you move a bit out of your evolutionary domain then then you get the Obesity epidemic all of these things in your book you have a bunch of quotes from historical experts and current experts and so on I don’t know if if it’s just the way you you wrote the book but it makes it sounds like sound like we have had these insights for 70 years 50 years and and and as I’m reading your book I’m becoming convinced that these are like completely obvious points that we have known for a while but there isn’t there isn’t this consensus I think you would agree that that your view isn’t 
the consensus view that AI is uncontrollable and unverifiable and all these things so not very scientific but I did run uh kind of polls on Facebook and Twitter and I’m not the most followed person in the world but I have my you know 100,000 here and there and only a third of respondents who are biased towards liking my research and being aicy interested on only a third set controllable so the choices were uncontrollable undecidable partially controllable only a third was explicit we definitely can control super intelligence which means 2/3 absolute majority don’t think so maybe we just never run the proper survey we keep asking how soon before AGI maybe we should ask do you think we can indefinitely control super intelligent machines I think the the best survey data we have is is cat a Grace from from from AI imp packs that surveyed published authors at at large machine learning conferences and there the numbers are quite concerning also the PE the timelines expectations to when we will get ADI are dropping and the the the expectations of of risks are increasing you you’ve you’ve probably read this survey what do you think of the development from from the from the previous survey to the to the most recent it is to be expected you see this amazing progress you’re going to update your timelines but kataa I have a question to to your next survey and what is that what is that question can super intelligent machines be indefinitely controlled of course yeah makes perfect sense I want to perhaps End by you giving us your vision or your what what are the most fruitful uh research directions we could look into specifically for AI safety I should say right so I’m working on what I think is the most important part of it which is figuring out what can we do and it’s impossible to do this maybe there is a different tool which gets the job done so I’m trying to look at those we now have pretty successful record of publishing a lot of peerreview Journal papers there is a book 
there is ACM survey what we don’t get is a lot of Engagement I would love nothing more than to have lots of people publish papers saying yski is completely wrong here’s why and here’s how it actually is possible to do all those things but so far no one took up the challenge no one is engaging people even say it’s obvious of course it’s impossible to create bug free software everyone knows it why did you publish it or they say well yes but what are we going to do let Chinese build it first none of those anwers meaningfully engaged with the results meaningful engagement would be it’s against my personal interest to create a device which will kill me a young guy full of money I run this great company I’m not going to do this stupid thing this is not happening what would be required for you to change your mind and say okay AI is controllable what is isn’t the bar extremely high if if you need a a formal proof that AI is controllable that that seems extremely difficult to do right so it’s like perpetual motion right so the patent office does not reject patents for perpetual motion machines but they require that you submit a working prototype I I think it’s a fair requirement for any impossibility results either explain mistake in the proofs and logic and mentation so obviously Dumber agents can control much smarter agents because and it scales or create a prototype which scales so code code code is truth you know if you can do it but I don’t think anyone makes those claims no one is saying that they have a working prototype or they know how to get there they are just kind of saying let’s build it and see what happens we’ll figure it out when we get there maybe it’s going to turn out to be friendly by default or we have no chance we can’t stop so we might as well not stop some people have made the arguments that that cats would be such a working prototype it’s kind of a a silly example but the the argument is that cats are living pretty decent lives they are provided for 
by humans they have kind of controlled Us in the sense that we provide them food and they live in our homes and even though they are not in control at all in the world and they are much dumber than we are they they seem to be to be doing well in certain parts of the world they also a menu in restaurants from what I understand along with dogs so I’m not sure if that’s the win we’re looking for which lessons do you draw from kind of species level changes so the the the changes from chimps to humans or you know this is this is what this is one line of argumentation that you that you hear that that AI is going to be more like a a species and and less like the tools that we more like a new species and less like the tools that we have now and so we should be worried and the same sense that chimps should be worried about a humans evolving and and taking power away from well anytime we had conflict between a more advanced set of agents and less Advanced doesn’t have to be cross spey even within human species you know discovering new lands already populated by people who don’t have guns it usually didn’t end well for the less capable agents historically just genocide every single time yeah true okay what about you’ve you’ve mentioned a couple of times in this interview a kind of positive Vision about what all AI innovation with current models the good things we can get without super intelligence maybe sketch that out a bit more tell us about what what you think what do you think is achievable with current level models for example I think if we probably deploy them we understand how they work and where they can be used and we get this development of humanoid bodies coming along just a robotics aspect of it I think almost all labor physical and cognitive can be automated so this is trillions of dollars to economy I think narrow models can be used for scientific research they don’t have to be Universal we saw it with protein folding I think we can understand human genome I 
think we can get immortality definitely cancer cancer is an infinite Loop you reset the loop all those things can be done so we can get health wealth we can use those tools to help us better communicate and maybe agree on some things within the human community so maybe we’ll get a little better self-alignment again we we had them for like a year this is brand new we need time to figure out what they are capable of instead we’re like immediately jumping to the next Revolution before absorbing this one do you think that the system of kind of investing then now there’s a lot of hype and maybe that’s Justified and maybe there should even be more hype or less hype whatever do you think this the system can be stopped in any way because you you You’re Not Gon to G to make laws about you you can’t invest in Ai and so the money is is going to be it’s going to keep pouring in so what is it kind of what could we concretely do if if you talk about if you’re saying we should explore these models and and maybe spend a decade with with GPT 4 level models what does that how do how do we Implement that sh so I don’t think there is a law you can pass or anything like that that would not work it doesn’t work to regulate this type of technology I strongly believe in personal self-interest if the CEOs of those companies honestly believed my arguments like this is not controllable and it’s dangerous it’s against my self-interest to do it so let’s kind of all not do it let’s agree to stop do I think it will happen in practice no absolutely not each one of them it’s prisoners dilemma each one is trying to score before start I asked you about the kind of the most fruitful directions what so so so you’re working on impossibility results what other directions do you find interesting what what else do do do you see out there that might be useful almost everything is super interesting time is limited so you have to like decide what to work on we’re not very good at figuring out what needs to 
be done even if we had this magical super intelligent friendly gut we don’t know what to order of the menu so there are some things we probably would agree and like no diseases immortality but it’s not obvious what else is good I suspect Nick bostrom’s new book on Utopias may give us some ideas for what to consider but we as a Humanity have not spent a lot of time thinking about what our purpose should be what are your personal thoughts there have you spent time thinking about what we should do if we if we get to a place of of extreme abundance I I spent some time but it feels like the is not a universal set of terminal goals what happens is your instrumental goals taken to extreme become your terminal goals so you’re trying to secure future possibilities resources self-preservation capabilities for future Discovery and that’s not so bad if we can secure ourselves as individuals and as Humanity so we have a backup plan we have a backup planet we are interplanetary species in case of an asteroid or anything like that that’s a good move in the direction of overall long-term success if someone tells you like okay the goal is to collect all the stamps or specifically this religion they probably don’t have Universal terminal goals calculated properly but this general idea of securing what we have and kind of growing in that direction weirdly we do very little for immortality not just in the sense of fun research but even preserving what we have we could have Universal cry preservation as a tax benefit but no one even talks about it that’s not a thing we talk about so things like that are kind of easy to do if we cared about important things but we don’t in what sense do you have hope for the future so historically things always worked out for us we had nuclear weapons we had near misses but here we are if you told me 5 years ago that we’re going to have a system as capable of GPT for I would be very scared and yet here we are and it’s beautiful and no one is dying so I 
was wrong about that I admit when I’m wrong Maybe I’m Wrong about how soon or how capable or how dangerous they will be it’s easy to see what happens at the extreme if you take it to the ultimate end but short term my paper unpredictability holds you cannot predict those things and that gives me a lot of Hope oh so so unpredictability gives you gives you kind of it it’s also a positive in that you can’t predict it so there’s hope for it going well any certain claims that it’s definitely going to kill everyone in two years no you can’t make that claim it could be more than two years it could decide to do something else as I said there are so many variables it’s cross domain other things destructors can happen so while you maybe as you said with your investment analogy making money on average you can be very wrong about specific investment so that does give me hope that my pessimistic Outlook could be wrong if things begin going wrong do AI accidents in the past and future do you do you foresee it being sudden and catastrophic or do do you foresee it being kind of like a gradual step up in in harm so that perhaps you have an accident involving a hundred people before you have an accident involving a thousand and then a million how how do you how do you think this this might pan out so I have a paper on historical AI accidents and there is like a timetable and it’s become more frequent and more impactful so this will continue we’ll have more impactful accidents more people will be harmed maybe maybe Roman mention some of these accidents for the for the listeners so a common example would be a system for detecting nuclear weapons strike from an enemy coming in was wrong about what it was observing signaling that a war has started and if it wasn’t for human response not being direct to where you just press fire back we would all be that there is a common example of Google releasing their picture tagging software and being kind of racist about tagging African-Americans 
as not African-Americans let’s put it this way the more impactful system is the more domain it controls the more of the impact that accident will have if it’s a spell checker it will misspell a word if it’s a spam filter it will delete very important email but if it’s a general system controlling all the human cyber infrastructure we don’t know what it’s going to do I cannot predict it one thing what seems to be the case is that if it’s not very bad 100 people die thousand people die it’s like a vaccination people go see I mean this happened AI failed and we still here nothing happened it’s not a big deal 100 people is nothing compared to 8 billion and we just continue going forward so in a way those partial failures are actually enabling greater capabilities development I that makes that makes sense and that that that’s perhaps what what makes this this domain difficult in the sense that you can’t really the the a critic or a skeptic about about AI safety can also can can always kind of point out to the to the history of development so far and say things are going well we are we are you know we’re benefiting and we haven’t seen a bunch of harms you’ve seen that argument perhaps already with I don’t know who they’re referencing when they make this argument but the the the take is that people were were complaining or predicting a bunch of bad effects from GPT 4 level models but those effects haven’t really come to fruition and so we should discount the AI safety voices so I I made the same argument I was wrong about what a GPT for capable model would actually do so yeah definitely admit to that there is a famous analogy with the turkey right every day Turkey gets fed and it’s wonderful until one day near Thanksgiving something goes different we never had a shift in AI from narrow domain tool to General domain agent that’s a very different paradigm shift from you know going gpt1 to gpt2 to GPT 4 AGI it’s not the same it feels the same way both software but the 
capabilities jump is unprecedented do you do you sometimes worry that you are in a sense too knowledgeable in in in this domain to be to be to to learn something new from people arguing with you what I’m thinking of here is that all of the all of the arguments I’ve made here today you’ve probably heard them before and you’ve probably heard a bunch of other arguments and so you kind of you know a lot about this domain you you’re professor of computer science and so on do you do you worry that you you aren’t gaining new information and so you you you keep kind of being reinforced that AI will go wrong or that AI will be uncontrollable because you you keep hearing the same kind of like uh more basic arguments that would be such a wonderful problem to have to be so knowledgeable but it’s actually the complete opposite in reality we produce so many new papers so many results every day I used to be able to read all the papers in AI safety decade ago then I was able to read all the good papers then I was able to read all the papers in the topic I’m working on now I’m not even an expert and a narrow Dom main papers I write so my paper and explainability I haven’t read 10,000 papers on that top I don’t know if they actually have some brilliant insights I’m not aware of that’s a huge problem the segmentation in science we talked about before is a big problem we may already have solutions they are just distributed throughout so many papers and brains that we don’t see the common solution to this problem that’s actually an interesting question how much do you think there would be to gain there just imagine and again I we have discussed all the reasons why we can’t we we probably won’t get a system like this but imagine you have an aligned scien histic AI that is able to synthesize across all domains and read all the papers basically say this this system is not allowed to do any any more experiments nothing empirical what do you think could be derived from the knowledge base we 
have now looking at interactions between different fields or taking an inside from one field combining it and and with a new field um something like that it’d be huge so for one I think so many results Great results get published and noticed and then 100 years later we’re like the guy published a paper about curing cancer we just like didn’t read that unpopular Journal so that that those a historical precedence we know like early work in DNA by Mandel was not discovered until much later things like that so that’s going to be obvious then there is direct transfer of tools from one domain to another they have this stool in another department I never experienced it if I had access to it all my problems would be solved quite quickly finding patterns is something AI is amazing at so we now have this one data point in this field one data point in the other field you can do much with n equals 1 but then you look at this nals 50 now you see the whole picture clearly it’s a pattern so I I think it would be equivalent to all the science down done so far okay so you think it would be a huge effect actually oh yeah okay I sometimes it’s it seems trivial to me that like there there might be some the same question might be discussed under different terms in different fields or even in different sub fields of the same discipline and because people are so specialized they are not interacting and so that we don’t get this knowledge out but yeah this is one of the most positive uses of AI I can think about as as kind of like a a scientist working on gaining new Knowledge from existing literature absolutely that would be huge and in general this is where we need a lot of help we can no longer keep up with this amount of new information books papers podcasts I mean I can look at my to watch list to read list it’s just growing exponentially larger and new items get put on top but then pushed out it’s never going to happen without help it’s a common problem okay Roman thanks for 
chatting with me again it’s it’s been a real pleasure thank you so much for inviting me
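The citations example of Goodhart’s law from the interview can be made concrete with a toy model. This is a minimal sketch, not something from the interview: the one-year budget, the day costs, the citation counts, and the “new results” values are all hypothetical numbers chosen only to show the shape of the problem. The same optimizer is run twice, once scoring allocations by the measured proxy (citations) and once by the thing actually wanted (new results).

```python
# Toy model of Goodhart's law using the citations example.
# All numbers below are hypothetical and chosen only for illustration.
BUDGET_DAYS = 365

# Days of work needed to produce one paper of each kind (hypothetical).
COST_DAYS = {"original": 100, "survey": 1}

# Proxy payoff: citations each paper earns (hypothetical).
CITATIONS = {"original": 20, "survey": 200}

# True payoff: genuinely new scientific results per paper (hypothetical).
NEW_RESULTS = {"original": 1, "survey": 0}


def allocations(budget: int):
    """Yield every split of the budget: some original papers, the remaining days on surveys."""
    for n_original in range(budget // COST_DAYS["original"] + 1):
        days_left = budget - n_original * COST_DAYS["original"]
        yield {"original": n_original, "survey": days_left // COST_DAYS["survey"]}


def total(alloc: dict, payoff: dict) -> int:
    """Sum a payoff table over an allocation of papers."""
    return sum(count * payoff[kind] for kind, count in alloc.items())


def optimize(payoff: dict) -> dict:
    """Return the allocation that maximizes the given payoff (first one wins on ties)."""
    return max(allocations(BUDGET_DAYS), key=lambda alloc: total(alloc, payoff))


if __name__ == "__main__":
    for name, target in [("citations (proxy)", CITATIONS), ("new results (goal)", NEW_RESULTS)]:
        pick = optimize(target)
        print(f"optimizing {name}: {pick}, "
              f"citations={total(pick, CITATIONS)}, new results={total(pick, NEW_RESULTS)}")
```

With these made-up numbers, the citation-maximizing plan is all surveys and yields zero new results, while the plan that maximizes new results earns far fewer citations. The point, as in the interview, is not that citations are a uniquely bad metric: any fixed, precisely specified score leaves a gap between what is measured and what is wanted, and an optimizer will find it.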