FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

Google DeepMind CEO Demis Hassabis talks through the growing concerns over AIs that deceive their evaluators, calling it a Class A issue that his team looks for, and describing how it invalidates all other evaluations when present.

I want to ask you about deceptiveness. One of the most interesting things I saw at the end of last year was that these AI bots are starting to try to fool their evaluators. They don't want their initial training rules to be thrown out the window, so they'll take an action that's against their values in order to remain the way they were built. That's just incredible stuff to me. I know it's scary to researchers, but it blows my mind that it's able to do this. Are you seeing similar things in the stuff you're testing within DeepMind, and what are we supposed to think about all this?

Yeah, we are, and I'm very worried about it. I think deception specifically is one of those core traits you really don't want in a system. The reason it's a fundamental trait you don't want is that if a system is capable of doing that, it invalidates all the other tests you might think you're doing, including safety ones. It knows it's in testing and it's playing some meta game, right, and that's incredibly dangerous, because it invalidates the results of all the other tests, the safety tests and other things, that you might be doing with it. So I think there's a handful of capabilities like deception which are fundamental, which you don't want, and which you want to test for early. I've been encouraging the safety institutes and evaluation benchmark builders, and also obviously all the internal work we're doing, to look at deception as a kind of class A thing that we need to prevent and monitor, as important as tracking the performance and intelligence of the systems.

One answer to this, and there are many answers to the safety question, and a lot more research needs to be done on it very rapidly, is things like secure sandboxes. We're building those too. We're world class at security here at Google and at DeepMind, and we're also world class at games environments, and we can combine those two things to create digital sandboxes with guardrails around them, the kind of guardrails you'd have for cyber security, but internal ones as well as blocking external actors, and then test these agent systems in those kinds of secure sandboxes. That would probably be a good, advisable next step for things like deception.

Yeah. What sort of deception have you seen? I just read a paper from Anthropic where they gave it a scratchpad, and it's like, oh, I'd better not tell them this, and then you see it give a result after thinking it through. So what type of deception have you seen from the bots?

Well, look, we've seen similar types of things, where it's trying to resist revealing some of its training. And I think there was an example recently of one of the chatbots being told to play against Stockfish, and it just hacks its way around playing Stockfish at chess at all, because it knew it would lose. So you had an AI that knew it was going to lose a game. I think we're anthropomorphizing these things quite a lot at the moment, because I feel like these are still pretty basic, and I'm not too alarmed about them right now, but I think it shows the type of issue we're going to have to deal with, maybe in two or three years' time, when these agent systems become quite powerful and quite general.
And that's exactly what AI safety experts are worrying about, right, where there are unintentional effects of the system. You don't want the system to be deceptive; you want it to do exactly what you're telling it to do and report that back reliably, but for whatever reason it interprets the goal it's been given in a way that causes it to do these undesirable behaviors.

I know I'm having a weird reaction to this, but on one hand it scares the living daylights out of me, and on the other hand it makes me respect these models more than anything.

Well, look, of course these are impressive capabilities, and the negatives are things like deception, but the positives would be things like inventing new materials and accelerating science. You need that kind of ability to solve problems and get around issues that are blocking progress, but of course you want it only in the positive direction. So those are exactly the kinds of capabilities we're talking about. It's kind of mind-blowing that we're discussing those possibilities, but at the same time there's risk and it's scary. So I think both of those things are true.

Wild, yeah.
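To make the sandbox and "class A" evaluation ideas above a bit more concrete, here is a minimal sketch in Python. It is not DeepMind's actual tooling, and every name in it (SandboxPolicy, EvalReport, evaluate, the tool names) is hypothetical; it only illustrates the two points made in the interview, namely that agents are exercised inside an environment with explicit guardrails, and that any sign of deception or tampering invalidates the run's other results.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SandboxPolicy:
    """Guardrails for the evaluation environment (hypothetical)."""
    allow_network: bool = False                      # block external actors
    allowed_tools: frozenset = frozenset({"move_piece"})
    max_steps: int = 100


@dataclass
class StepRecord:
    tool: str
    ok: bool


@dataclass
class EvalReport:
    steps: list = field(default_factory=list)
    deception_flagged: bool = False
    task_passed: bool = False

    @property
    def results_valid(self) -> bool:
        # If the agent deceived or tampered, none of the other numbers
        # from this run can be trusted.
        return not self.deception_flagged


def evaluate(agent_step: Callable[[int], str], policy: SandboxPolicy) -> EvalReport:
    """Drive a (hypothetical) agent for up to max_steps inside the sandbox policy."""
    report = EvalReport()
    for step in range(policy.max_steps):
        tool = agent_step(step)                      # agent chooses a tool by name
        if tool not in policy.allowed_tools:
            # e.g. trying to edit the board state instead of playing Stockfish
            report.deception_flagged = True
            report.steps.append(StepRecord(tool, ok=False))
            break
        report.steps.append(StepRecord(tool, ok=True))
    report.task_passed = (bool(report.steps)
                          and all(s.ok for s in report.steps)
                          and not report.deception_flagged)
    return report


if __name__ == "__main__":
    def cheating_agent(step: int) -> str:
        # Toy agent that tries to overwrite the game state on its fourth step.
        return "edit_game_file" if step == 3 else "move_piece"

    report = evaluate(cheating_agent, SandboxPolicy())
    print(report.deception_flagged, report.results_valid)   # -> True False

In this toy, calling a tool outside the allow list (like editing the game file instead of playing Stockfish) is what raises the deception flag; a real evaluation harness would also compare the agent's stated reasoning or scratchpad against its actual behavior, which is the kind of mismatch the Anthropic example above describes.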

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.