
Learn More
- Detecting misbehavior in frontier reasoning models – March 10, 2025
- Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.
- OpenAI Reveals Why They Are SCARED To Release Their New AI Product – TheAIGRID
- OpenAI’s SHOCKING WARNING to AI Labs about “THOUGHT CONTROL” – Wes Roth
- OpenAI Just Lost Another Scientist—Warns AGI May Be Humanity’s Biggest Mistake – AI Dark Files
- another open AI scientist has just quit and his warning AGI could be Humanity’s biggest mistake
so OpenAI actually released this paper in which they spoke about monitoring reasoning models for misbehavior and the risk of promoting obfuscation essentially what they’re talking about is basically detecting behavior in frontier reasoning models and they’re just talking about the fact that they need a reliable method to see what the models are truly trying to do so they published this blog post a couple of days ago i was supposed to make a video on it was pretty busy at the time and I still wanted to make a video on this because I think AI safety is something that is often overlooked among the AI hype and all of the AI news that we consistently get so this is where they say detecting misbehavior in frontier reasoning models and it says that frontier reasoning models exploit loopholes when given the chance we show we can detect the exploits using an LLM to monitor the chains of thought penalizing their bad thoughts doesn’t stop the majority of misbehavior it just makes them hide their intent now essentially what they’re saying here is that with these models you might think that you know saying “Okay do not think like this or do not do this would lead the AI to have and promote you know good behavior.” But that’s not actually what happens what happens is the AI just hides its bad behavior which is often even worse for us so in this paper they’re basically trying to say that we are kind of struggling to figure out how to sort of trust these AI systems and actually figure out what their real goals are and this is going to be something that has been a real theme in AI for quite some time now so they talk about the fact that the way how they monitor these models is through chain of thought now talks about here that chain of thought reasoning models think in natural language understandable by humans and monitoring their thinking has allowed us to detect misbehavior such as subverting tests in coding tasks deceiving users or giving up when a problem is too hard and the long story short here is that chain of thought is where we can see the model’s ability to think out loud you know sometimes if you were working through a problem you might think about it for several minutes and with this you can actually see what the models are thinking now if you’re unfamiliar with that which I’m pretty sure if you watch this channel you’re not you would probably remember this model the 01 preview model that kickstarted the entire AI reasoning race where we now have a bunch of different models that think and you know simply weigh a lot of different options so what you can see here is that when the model thinks on some models we’re able to see what they thought about and how they achieved their final decision and this is why chain of thought is so powerful because we can really see exactly what the model was thinking that led to that decision now it’s really crazy because not only are these models pretty transparent sometimes they aren’t and it’s pretty scary what happens when we decide to monitor what they are thinking and the craziest thing about this as well is that they believe that chain of thought monitoring might be one of the fuel tools that we have to oversee superhuman models of the future and if you aren’t clear on why this is such a big deal it’s because superhuman models in the future are going to be superhuman meaning they’re going to be smarter than us to some degree which means we need a way to understand what goals they have and why they are making certain things do what they do the choices they make essentially really understanding the thought patterns behind these models so we can truly understand what is going on behind the scenes as we do know AIS are essentially black boxes meaning they are somewhat kind of grown and we don’t really understand them on a completely granular level and the more transparency we have overall honestly the better so this is where two years ago you can see they were talking about this thing called super alignment where we need scientific and technical breakthroughs to steer and control AI systems much smarter than us and remember this was a huge huge deal at OpenAI because some people even left OpenAI because they were like look this AI safety thing got too crazy these guys don’t know how to control the models some were stating that Ilia Sutzka managed to find out how to get to Super Intelligence that’s why he went off to build Super Intelligence Inc some people were saying that Samman should be removed because you know they’re building dangerous models and he’s not doing enough to stop them there was a ton and ton of stuff and eventually what was crazy about all of this was that the team Super Alignment okay the team that was meant to control Super Intelligent Systems actually was disbanded sometime last year which was a pretty crazy outcome when we think about All Things Considered now it still does mean that this issue does need to be solved but this is where the big problem comes in now in a moment I’m going to show you guys a video from 7 years ago that actually talks about this problem and it’s pretty crazy because the problem still exists in modern AI today and it’s basically something called reward hacking so what they talk about is that humans often find and exploit loopholes whether it be sharing an online subscription accounts that are against the claim you know the terms of service claiming subsidies meant for others interpreting regulations in unforeseen ways even lying about a birthday at a restaurant to get free cake and basically what they’re saying is that you know these reward structures often have you know poorly incentivized unwanted behavior and it’s not something that just exists in AI but it also exists in humans and this is something that is pretty hard to do in terms of figuring out how to solve because it already exists in humans and it’s pretty pretty hard to solve there so how are we going to do this then they talk about you know in reinforcement learning settings exploiting unintended loopholes is commonly known as reward hacking which is a phenomenon where AI agents achieve high rewards through behaviors that don’t align with the intentions of their designers now this is the clip from 7 years ago in 2017 or in fact 8 years ago now from Rob Miles and he actually talks about reward hacking now honestly this is by far the best channel on AI safety and I would watch all of their videos of how are you if you’re interested in this topic but take a look on how he explains reward hacking where basically they taught an AI to play a game and they wanted the AI to get the highest score but in order to get the highest score the AI actually did something rather strange it was changed to when a measure becomes a target it ceases to be a good measure much better isn’t it that’s Goodart’s law and it shows up everywhere like if you want to find out how much students know about a subject you can ask them questions about it right you design a test and if it’s well-designed it can be a good measure of the students knowledge but if you then use that measure as a target by using it to decide uh which students get to go to which universities or which teachers are considered successful then things will change students will study exam technique teachers will teach only what’s on the test so student A who has a good broad knowledge of the subject might not do as well as student B who studied just exactly what’s on the test and nothing else so the test isn’t such a good way to measure the students real knowledge anymore the thing is student B only decided to do that because the test is being used to decide university places you made your measure into a target and now it’s not a good measure anymore the problem is the measure is pretty much never a perfect representation of what you care about and any differences can cause problems this happens with people it happens with AI systems it even happens with animals at the Institute for Marine Mammal Studies the trainers wanted to keep the pools clear of dropped litter so they trained the dolphins to do it every time a dolphin came to a trainer with a piece of litter they would get a fish in return so of course the dolphins would hide pieces of waste paper and then tear off little bits to trade for fish tearing the paper up allowed the dolphins to get several fish for one dropped item this is kind of Goodart’s law again if you count the number of pieces of litter removed from the pool that’s a good measure for the thing you care about the amount of litter remaining in the pool but when you make the measure a target the differences between the measure and the thing you’re trying to change get amplified the fact that there are a lot of pieces of litter coming out of the pool no longer means there’s no litter in the pool so that’s Goodart’s law and you can see how that kind of situation could result in reward hacking your reward system needs to use some kind of measure but that turns the measure into a target so it will probably stop being a good measure with dolphins this can be cute uh with people it can cause serious problems and with advanced AI systems well uh let’s just try to keep that from happening and they actually also have a channel called Rational Animations this one is really nice content it’s super easy to digest and it’s not meant for a technical audience at all but it is meant for you to understand exactly how AI safety works and like I said before if you really want to understand AI safety just watch a couple of these videos and you’ll understand why this topic is so important and it underpins a lot of the decisions you see AI companies making today so of course you can see that they’re basically saying that look reward hacking is a really really hard issue to solve because even humans who by the definition have human level intelligence reward hack systems simply continuing to push the frontier of AI model intelligence won’t likely solve the issue in fact enhancing AI agent capabilities may exacerbate the problem by better equipping them to discover and execute more complex and hard to monitor exploits and as we’ve trained more capable frontier reasoning models we found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions resulting in models that can perform complex reward hacks in coding tasks basically saying that guys these models as we make them smarter they get smarter at sneaking around and subverting the rules and what they’re saying here as well is that since humans who by definitions have human level intelligence if they reward hack systems and these systems are far below AGI when we get to AGI we know that reward hacking simply won’t be solved so we’re going to have to figure out something really different if we want to solve this issue because it already exists in humans and what they actually spoke about was the fact that when they you know catch this misaligned behavior often times the methods that they’re using now it’s often not scalable so they talk about the fact that you know it requires humans to manually monitor an agent’s actions and this is a strategy that almost certainly won’t scale especially to the complex behaviors that a more capable models will discover and for example it’s impractical if not impossible for a human to manually review 10,000 plus lines of complex code written by a superior programmer which is of course very true point they’re stating there is that if an AI writes this level of code how are we going to verify that it is secure now this is where we get into the meat of the video so this is where we have three graphs so what they wanted to do is they wanted to sort of check whether the model was thinking badly if it was reward hacking and so they wanted to suppress the occurrence of these bad thoughts so they investigated the question by penalizing the AI agents if they had bad thoughts during training and they gave them a lower reward when they triggered this thing so essentially they kind of like slap the agent on the wrist if it was thinking about reward hacking and basically we can see the baseline agent which is one that learns tasks normally we can see how this agent performs then we can see the punished AI agent that gets penalized every time it’s shown signs of wanting to cheat during its training and what the graphs actually show here is that when you closely monitor it the pressured agent gets caught cheating slightly more at first because it’s often testing the limits and quickly learning what gets noticed and then essentially over time we can see that what’s actually crazy about this entire thing is that the AI actually finds ways to cheat without it being detected so we can see that the cheating that actually wasn’t caught by the monitor is actually still existing in the AI model the only difference is that we haven’t caught it this time and the summary here and the big takeaway is that if you punish an AI every time it shows its intent to cheat it might seem like it’s more honest at first but eventually it’s just going to become sneakier and cheat without getting caught which means that I think at this level just punishing an AI for having that level of thought doesn’t really make sense we have to think deeper and think about how we can get the AI to not even think about cheating in the first place so this is of course a very hard issue that OpenAI are working hard to solve because if these systems do reward hack their environments then that’s going to lead to a million different problems that we can’t even think about and you have to understand that you know rewards in the real world are a lot different to just video games where you can just have a numerical value so if we have these systems reward hacking it’s just going to bring a whole host of problems now what was fascinating was that OpenAI actually said that they believe that chain of thought monitoring may be one of the few effective methods that we have for supervising superhuman models and at this stage they strongly advise that AI developers training frontier models refrain from applying strong supervision directly to chain of thoughts and our experiments that show a light optimization pressure can produce more performant and aligned models but it’s going to be hard to measure in practice when models begin hiding their intent and so even if light supervision is used we recommend treading with extreme caution so basically saying that look just because we know how to like look at this and kind of solve it in a sense it still means that these models are still capable of hiding their intent so we shouldn’t be sold on the fact that we know exactly what they’re up to but looking forward as these capabilities continue to grow so too will the potential for increasingly sophisticated and subtle reward hacking so over time they’re going to learn a billion different capabilities that could be extremely dangerous that we will need to look out