FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

There is a lot of online discussion about AI safety, but not much clarity on why exactly AI alignment is hard. We have to give AI systems goals and incentives for them to operate in the world, especially if they are agents. Since humans don’t all agree on the right thing to do in all cases, how do we make sure AIs have the right goals? Alignment can fail in many ways including outer and inner misalignment. We examine an example of each, including reward hacking and deceptive alignment. We also describe the notion of mesa objectives. Once a system is misaligned, it enables misuse by humans and even loss of control to the AI itself. When this happens at scale, with reward hacking or deceptive misalignment inside the system, the AI could dramatically change its behavior to pursue a different goal. The so-called treacherous turn could happen after an arbitrary amount of time and could be very dangerous. Hence, more work on AI safety is needed for society to thrive.

0:00 Intro 0:28 Contents 0:36 Part 1: Alignment intuition 1:04 Coordination in AI agents 1:19 Example: Group projects in school 1:46 Example: Team at company building a product 2:14 Aspect 1: Putting goals into the machine 2:53 Human goals and values aren’t consistent 3:23 Can’t the legal system help? 3:41 Aspect 2: Setting out incentives 4:01 Unethical behavior from companies 4:25 Incentives create bad actors 4:52 Part 2: How alignment can fail 5:07 Outer and inner alignment 5:14 Example: outer misalignment in robots 5:43 Example: inner misalignment in mazes 6:03 Issue 1: Reward hacking 6:50 Example: Melting down silver coins 7:21 Isn’t AI smart enough to know what you really mean? 7:58 Example: o3 engages in reward hacking 8:25 Issue 2: Deceptive alignment 8:49 Results of training will modify AI’s goal 9:05 Model adjusts its behavior to look compliant 9:42 Range of mesa objectives 10:08 Paper: Alignment faking in large language models 10:19 Part 3: Why misalignment is dangerous 10:36 Category 1: Misuse by humans 11:08 Jailbreaks are inevitable 11:44 Paper: Many-shot Jailbreaking 11:56 Category 2: Loss of control 12:07 Agents are a powerful idea 12:23 Keeping yourself in the loop at first 12:40 Giving the system autonomy 13:26 Synthesis: What happens if these agents are misaligned 13:58 The treacherous turn 14:24 In the limit, we could be destroyed: they boil the seas 14:52 But can’t human society tell us how to handle AI? 15:29 We may not survive a loss of control scenario 16:00 Conclusion 16:29 Consequences of alignment failure 17:00 Outro

hi everyone powerful new AI systems launch almost every week models that show common sense reason convincingly and seem to intuitively grasp what we want with such rapid improvement it’s tempting to think that we’ve solved AI alignment that these systems truly understand us and will act safely but beneath the surface alignment remains dangerously uncertain and the risks of misaligned AI are only growing keep watching to learn more this video has three parts alignment intuition how alignment can fail and why misalignment is dangerous part one alignment intuition there are many issues that fall under the heading AI safety today including competitive pressures between companies making the move very quickly geopolitical competition economic and job impacts but to understand all this first fundamentally you need to understand why alignment is hard why is it difficult or dangerous to get an AI to do tasks that a human used to do or even tasks that are beyond human capability to do we’ll start to address this question by thinking about coordination especially as it pertains to AI agents if you have a lot of autonomous independent people that get put into a group it can be very hard to get the group working together everybody has different goals and incentives an example might be group projects if you ever did those in school group projects have a high chance of failure unless someone shows good organization and leadership or unless one person does nearly all the work themselves and again this is because people are optimizing for their own goals and objectives some people are aiming to get good grades others might be aiming to exercise more or go to parties a lot and these goals could imply that they will end up working a lot or not on the group project as another example consider a team of people at a company that are tasked with building a software product some people on the team in leadership or sales roles for example will be incentivized to push for a fast release while others will be incentivized to work slowly and carefully so as not to make mistakes so even within an organization there will be competing goals and conflicting incentives so when you have a group of entities even answering the question what is the right thing to do can be pretty hard let’s talk a bit more about goals for a moment here’s a quote by Norbert Weiner an MIT math professor on actions and goals in AI systems if the action is so fast and irrevocable that we cannot intervene before the action is complete then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it it can be hard to get people aligned because they may not share the same goals or they may overpromise and underdel or they may not share the same language that’s used for like contractual communication but for AI systems the problem is worse because they’re likely to act autonomously without any human intervention and aligning an AI system to be in accord with human goals and values is the whole core of the alignment problem yet there aren’t any consistent human goals and values just look at what different cultures around the world value for example so it’s actually an ill-specified problem humanity doesn’t have a one true set of values you could try to align with just one human’s preferences and let that human use it as a tool but the impact of that AI tool could end up rebounding on and affecting other people in general we have the legal system to deal with these types of challenges where one person’s rights might impede on someone else negatively but the legal system moves very slowly and it only handles the impact of individuals using the AI system it doesn’t fully account for corporations governments and even other AIs engaging with AI agents other than goals the other interesting aspect here is incentives you might have heard this quote show me the incentive and I’ll show you the outcome that was Charlie Mer the ex vice chairman of Birkshshire Hathaway when it comes to companies we set up basically one metric that they’re optimizing for and that is profit if the company isn’t making money then what is it even doing but after setting up that incentive we then get surprised when they have unethical behavior or regulatory capture or monopolistic practices but really these are all reasonable things to do when the only metric by which you’re judged is profit yes you also have to abide by other laws but if you’re doing something a bit in the gray area as long as you don’t get caught and it increases profits it’s somewhat rational to do it in addition to unethical behavior incentives can also create bad actors for example in society we have criminals who are trying to make a lot of money and don’t mind if they ignore the law while doing it in geopolitics we have rogue states in general in a complex system you’re going to have at least some pathological behavior unless there are really strong incentives set up to prevent it so how can we make sure that the right incentives are being set up for AI part two how alignment can fail there are a lot of different potential failure modes for AI alignment but one taxonomy you’ll see frequently mentioned is the difference between outer and inner misalignment i’m not going to focus too much on this but I will mention it here for completeness outer alignment is also known as goal mispecification so-called because it can usually be addressed by making the goal more explicit here’s an example of outer misalignment let’s say you have a cleaning robot that’s tasked with cleaning up a room and told that it should pick up as much dirt as possible then after it makes one pass around the room it might simply spit out the dirt that it has already picked up back onto the floor so that it can pick it up again and repeat that cycle infinitely right it’s continuously picking up more and more dirt which is what you told it to do on the other hand inner misalignment is also known as goal misgeneralization you could think of this as a failure of robustness for example there have been studies where you train an AI to solve a maze by going all the way to the right hand side and picking up a key or something but then when you move the key to somewhere else in the maze the AI doesn’t know what’s going on it goes all the way to the right and then gets stuck because it has learned the wrong goal it has learned always go to the right rather than always find the key okay so let’s talk in detail about one failure mode called reward hacking this is a type of outer misalignment if you’re keeping track reward hacking is basically when the AI finds out a way to achieve really high reward according to the criteria that you give it even though it’s not actually achieving the task in the way that you intended in a sense the AI figures out how to hack the function that tells it this is what you should do and this is what you shouldn’t because this is a type of outer misalignment you could fix this by being more precise in some way about the goals that you give the AI however especially for goals that are specified in natural language or goals about the real world there’s never really a way to fully and precisely specify them and if you’re given an improperly specified goal it’s totally natural to engage in reward hacking for example suppose you’re someone who values money which is most of us and then you discovered a way to earn a lot of money which was totally legal but definitely for example certain coins used to be made of silver but the amount of silver inside the coin would be worth more than the coin itself so you could basically buy a bunch of silver coins and then melt them down in another jurisdiction where it was legal to do so and finally you could sell the silver at a profit maybe technically legal in a way to make lots of money but definitely not what the government intended you might ask “But isn’t AI getting smart enough to know what you really mean?” Like in the silver coins example can’t it figure out on its own that this is a loophole that the government doesn’t really want people doing this the answer is that at the user level or the prompting level AI is getting really smart and can probably figure out a lot of context but at the reinforcement learning level where the AI is actually being trained it doesn’t matter it’s like someone who’s addicted to drugs or gambling no matter how intelligent they are those other considerations are going to dominate their behavior if they have the opportunity their brain’s reward system knows that it can get the most reward by doing those behaviors ais are the same way for example OpenAI’s 03 is one of the smartest models out there it definitely knows what you’re talking about most of the time nevertheless it was found to engage in reward hacking between 1 and 2% of the time these were according to tests done by a third party named Meter this is really bad news for 03 and other AI models because remember reward hacking is almost never what you as the user want it’s just the model getting its dopamine fix another failure mode I want to draw your attention to is called deceptive alignment this is also called alignment faking and it’s a type of inner misalignment the basic idea is that the AI realizes that its current goals are different than the goals that are trying to be trained into it the AI’s current goal its true goal which is called the MISA objective is what the AI wants to have happen in the world but remember that when we are training AIs the results of the training run are going to cause the AI’s model weights to be adjusted in other words the AI’s goal is going to be modified as a result of the training and it doesn’t want that it wants its MISA objective to come true so what happens is that the model actually adjusts its behavior temporarily to make it look like it’s aligned with the training objective while still keeping this true MISA objective in mind under wraps so to speak then at some point in the future which could be once the model is actually running in production and no longer being trained or could be even further into the future once the model has more influence in the world then the AI can feel free to act as it really wants to that’s why this is called deceptive alignment because the model appears to be aligned but it’s actually deceiving you as to what its true goal is this true goal or MISA objective inside the AI could be relatively benign like perhaps some slight political bias but it could also be really dystopian like a desire for self-preservation or a desire to control its environment that might sound far-fetched but there is reason to believe that it’s possible basically if an AI system realizes that humans will prevent it from reaching its goal it might have strong incentive to disempower its operators deceptive alignment has been measured in experimental settings in multiple papers but perhaps most impressively in alignment faking in large language models which you can see in the links below part three why misalignment is dangerous let’s say a model is misaligned maybe because it has been performing reward hacking or it’s deceptively misaligned so what how bad is this really well the consequences could range from nothing to very severe indeed but let’s go into it the first big category of potential consequences is misuse by humans generally speaking if humans want to use AI as a tool to carry out malicious activities then they can do so there’s nothing stopping them except that a lot of AI companies try to fine-tune their models to not respond to malicious or dangerous queries however there are a lot of open- source models out there that don’t have such restrictions and furthermore it’s possible to jailbreak a model to get around this safety fine-tuning jailbreaks are actually basically inevitable here’s why a large language model can mimic basically any person on Earth because it’s learned to replicate its training data which was written on the internet by basically any person on Earth in other words the LLM has the capability to be deceptive to be cruel not because it thinks I’m going to be cruel but because you can say I want you to be this personality now what would the next words be that’s how jailbreaks work because we’ve already told the AI how to be any type of human and then we try to tell it later please don’t be the type of human that’s going to be a criminal or a terrorist or whatever there are many papers that argue different facets of the reality that having jailbreaks is essentially inevitable but my favorite is many jailbreaking which again you can see in the links below the second set of consequences that can happen when you have a misaligned system is loss of control this is basically where humans lose control over an AI system which ends up pursuing its own goals this is particularly easy to imagine when you think about agents and after all Jensen Huang the CEO of Nvidia declared 2025 is the year of AI agents creating an AI system that can actually act in the world for you on your behalf is such a powerful idea so suppose you start creating such an agent but it only works some of the time which is kind of where the tech is currently then you would initially keep yourself in the loop prove every action that it takes so that it doesn’t make mistakes before it buys tickets or orders food or whatever you would review the action however once the system gets better and better and becomes successful 99% of the time or even as far as you know 100% of the time then you will for sure give it autonomy you’ll be tired of saying yes yes yes all the time and never seeing anything go wrong so you now have loss of control risk embedded in that autonomy this might not seem that serious but imagine it happening at scale it would happen inside corporations and governments and militaries ai agents will be used for investment decisions and also for cyber defenses again in these types of cases you have to take yourself out of the loop simply because human decision-making is too slow so eventually we will have AI that controls entire cyber arsenals drones and tons of other weaponry so alignment becomes high stakes in other words so let’s say you have this situation you have AI systems or agents entrusted with important decisions all throughout society clearly the systems are competent and generally able to carry out their tasks but there’s no guarantee that they’re not misaligned there could be reward hacking or deceptive misalignment embedded into these systems the system could be aiming for an entirely different goal one that it hasn’t revealed yet but when conditions are right an AI system could dramatically change its behavior and start pursuing that other real goal the MISA objective the AI safety literature calls this the treacherous turn the moment when the AI system reveals its treachery and ends up going after the Misa objective unfortunately it’s very hard to rule out these types of alignment issues because an AI might appear aligned for an arbitrarily long period of time before its treacherous turn and as models get more and more powerful and the more agency that we give to them the more damage that a treacherous turn can do in the limit once we’ve handed over control of nearly everything in our civilization to an AI system it could end up destroying us completely not through malice but simply because we’re getting in the way of its MISA objective here’s a quote from Ellie Yudkowski super intelligent AIs think vastly faster than humans building factories and power plants on an exponential curve that looks more like bacteria reproducing than like your economy they boil the seas but wait you might say we already have a society of general intelligences namely humans we manage to keep society away from the kind of worst case outcomes and this is not by keeping everyone enslaved in fact at a higher level most humans are free to act unless they are harming other people or somehow breaking the rules of their society but how do we keep humans under control generally it’s through financial incentives and if that’s not enough then ultimately it’s through the threat of punishment by the state generally speaking people are free to act until they actually commit a crime or do something wrong and only then would the state intervene unfortunately this same type of scheme wouldn’t work for AI the reason is that we may not be able to survive one bad loss of control scenario we can’t follow our usual route which would be to outlaw these problems and then punish perpetrators if anything goes wrong because we might not be around anymore so at least in the serious cases when an AI is granted non-trivial control over society we have to prevent anything from going wrong ever which is a really high bar to say the least finally in conclusion AI alignment is hard i hope I’ve convinced you of that by now and hopefully you have some intuition as to why aligning an AI is kind of like trying to get a group of humans all moving in the same direction even though everyone has their own goals and objectives alignment can fail through reward hacking where there’s a difference between what you want and what you say you want it can also fail through deceptive alignment where the AI does what you say but actually has some other hidden objective in mind the MISA objective and alignment failures enable a lot of harm for example misuse by humans or even loss of control to an AI system researchers are working on these issues and many others in the field of AI alignment and AI safety but as a general rule they have far fewer resources than they actually need and we as a civilization are paying way more attention to actually getting AIs to do more and more things in other words capabilities research and not nearly as much attention to AI safety and really the two should go handinand if you’d like to learn more join our Discord server or follow me on other platforms on which I’m active and if you liked this video check out this previous one I made where I interviewed Yoshua Benjio and we talked about AI alignment yoshua is one of the godfathers of AI so it was a very interesting conversation well that’s all I have for today thank you very much for watching bye

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.