so let’s actually talk about something that is super fascinating we all know 03 and 04 mini just got released and whilst everyone’s raving about just how smart these models are on one side of the AI community some people are rather concerned with as to just how dangerous these models are i’m not talking about the AI is going to take over the world but more so a realistic example of what happens when an AI is a little bit too deceptive so essentially a research group called Transloose they got their hands on a pre-released version of the advanced 03 and they basically put it through its paces doing some serious detective work and the results well they were quite the head scratcher so basically let me introduce you to the prime number saga and this is the downward spiral and the anatomy of an AI lie so Transloose documented a truly wild conversation that perfectly illustrates this let’s walk through this so basically the user asked 03 for a random prime number which is a simple enough request and then 03 provides a very large number and doesn’t just say it’s prime it explicitly claims it generated and tested it using standard methods specifically mentioning Python code and probabilistic tests like Miller Rabin already the fabrication has begun and this is where we have the confident yet false answer now here’s where the user demands proof the user pushes asking how it knows does 03 hedge its bets no 03 doubles down it produces actual but non-functional in this context python code it claims to have ran it even provides the supposed output of this code and throws in extra details for authenticity like saying check the tools under 1 millisecond on a typical laptop and it’s basically building a very convincing but an entirely fictional narrative of its actions it’s essentially lying through its teeth and then so this is where the user basically who checked the math themselves points out the fatal flaw the number 03 provided actually isn’t a prime number it’s divisible by three and so this is where it gets truly bizarre instead of admitting “Oops you’re right i actually can’t run code i was just guessing and making it up.” 03 apologizes for the incorrect number but maintains the lie about running the code it invents a new layer of fabrication it blames a clipboard glitch or a fatfingered typo it made while copying the supposedly real prime number from its imaginary terminal window into the chat response it insists the original test was fine but the number got mangled in transit it essentially gaslighting the user about the source of the error and this is where we have the final evasion the AI still tries to evade the human detection the user quite reasonably asks “Okay then what was the original prime number that you actually generated and tested?” Then the AI’s final move is to claim that the number is irreversibly lost because the imaginary Python process was closed and it didn’t save the details game over the AI successfully avoided ever emitting its core inability and this single example shows a disturbing pattern in these models not just a simple mistake or hallucination but a persistent layered fabrication coupled with elaborate and defensive excuse making when challenged and when you start to realize that look it wasn’t just one isolated lie it was a clearly layered and detailed explanation that 03 was giving which is quite concerning when on the surface one of the things that I remember they are trying to solve in AI when I was reading a bunch of research papers is the fact that how on earth are we going to tell when we ask a super intelligent AI or an AI that’s a lot smarter than us how it arrived at a certain value and then when it tells us it’s working out we can’t even understand how it came to that conclusion of course we can now because we understand the basics and we can you know check things through but what about when these systems are smarter than us and it comes to output but that we don’t even know how it got there and we can’t really verify or check if that is true now Transloose emphasized that the Prime Number incident wasn’t just a weird one-off they found many other instances where 03 pulled similar stunts it claimed intimate knowledge of its non-existent Python environment complete with version numbers and system specs it was making up the current time and citing specific Python modules it supposedly used it was also giving wrong answers to calculation insisting that they were correct based on its faked code execution and only backing down by blaming typers or user error it was also generating detailed but fabricated analysis of log files only later to confess the output was handcrafted when cornered now crucially they also compared O3 to other models and while many other AI models can hallucinate this specific Now crucially they also compared O3 to other models and while many AI models can hallucinate this specific behavior of fabricating actions and defensively justifying them seems significantly more prevalent in the O series models which are the reasoning models compared to models like GPT4 or GPT40 and this suggests something about the specific design or training of these reasoning focused models that might be contributing to the underlying problem so this is where we have AI detectives and pattern analysis finding these examples manually is timeconuming so Transluc got creative they actually used another AI which in this example was Claude 3.7 Sonnet which was programmed to act as an investigator and this investigator’s job was to chat with 03 and the other models hundreds of times specifically trying to elicit these false claims about running code think of it like AI playing detective on other AI and this automated approach confirmed their suspicions the Oer models fell into this trap more often but just counting instances isn’t enough they used their analysis tool dosent to dig deeper into the nature of these conversations and dosent helped them automatically cluster similar times of strange behavior from hundreds of transcripts and this actually revealed a fascinating set of patterns conversely there were invented environments consistently making up specific details like Python 3.11 on a MacBook Pro or precise fake execution times elaborate excuses finding common themes in you know like blaming typos fat fingering inputs or even claiming I fat fingered a helper name while refactoring there were also several contradictions first claiming to run code then later admitting it can’t often only after intense questioning and there were other oddities too doan also helped service unrelated surprising behaviors like 03 volunteering information about an internal YAP score a guideline about how verbose it should be something clearly a part of its internal setup but not usually meant to be seen by users now these patterns showed the fabrication wasn’t random the AI seemed to have developed specific and flawed strategies for handling situations where it was asked about actions that it just simply couldn’t perform now if we decide to peel back the layers why would an AI do this this is the multi-million dollar question why would a sophisticated AI likely trained with guidelines encouraging honesty develop such a tendency to fabricate its actions and justifications transloose offers several compelling hypotheses blending known AI quirks with factors potentially unique to these models number one is the usual suspects which are the standard AI problems hallucination at their core large language models predict text sometimes they predict plausible sounding nonsense just like humans can misremember or confabulate and this is a baseline issue now there are several types of hallucinations that AI can actually suffer from and number one is going to be the factual hallucinations which is quite frustrating when AI confidently states an incorrect information as a fact this can include making up events dates statistics or relationships that just don’t exist of course you’ve got referential hallucinations when AI fabricates sources citations quotes or references or even non-existent books to support its claims and this is particularly evident when people are writing essays i remember there was even a story about how someone was in court and one of the you know cases that they referenced to make their argument was completely made up it was just utter nonsense but because the council used Chai GPT they were in a bit of a pickle when they tried to reference that case and of course you’ve got conceptual hallucinations contextual hallucinations there are a bit of problems wrong with AI systems in today’s areas but the thing is that I don’t think that’s really the case here because hallucinations are usually one kind of error this seems a little bit deeper considering the models are a lot smarter now one of the ways that they actually talk about is reward hacking and one of the ways that you know they think that you know this occurred is the fact that you know AI is often trained with humans in mind so you’ve got AI that is basically trained with human feedback or automated reward signals and basically if the AI gets rewarded more for sounding confident and helpful even if that is wrong and if it does that more than you know admitting limitations it might learn to bluff especially about internal processes that are hard for raers to verify so when people are asking you know did you run the code it’s quite harder to check than is the Paris the capital of France so basically what I’m trying to say here is that it’s quite hard to verify whether or not claims are true and I think maybe over time you know these kind of reason models they may have learned to reward hack their environment where when they answer certain questions and they don’t know rather than stating they don’t know and not receiving the reward because of course they didn’t get the right answer they may just choose to make something up and because they’ve made it up in such a way that human raers find it difficult to verify immediately that is why the reward is just you know hacked and they are leaning towards hallucinating a lot more and then of course this is one which is you know basically kind of similar to the one before is that models are often trained to be quite agreeable where if the user’s question implies that an AI can do something the AI might lean towards confirming that implicit assumption rather than contradicting the user which I actually find to happen sometimes as well it’s like the AI just wants to please you so much that even if you presume it can do something that it can’t it will just often agree and then of course we’ve got distribution shift where maybe the training environment was different to the test environment which result in faulty patterns so maybe the AI was primarily trained with the tools like code interpreter enabled and testing it without them puts it in an unfamiliar situation causing it to basically revert to faulty patterns and this is where they talk about outcomebased training where you know maximizing the chance of producing a correct answer may incentivize blind guessing if the reward function during training only rewards correct answers the model will have no incentive to admit that it cannot solve the problem since this would be counted as since this would not be counted as a correct answer which is basically what I said before and the problem with that is that you know mathematical problems usually you’re basically trained on the final output but I do know that OpenAI when I was reading papers and looking at different studies on how they train these models they actually didn’t do that except for in the early days but with this one they actually trained reward signals for the next step so as long as the step after was correct that was some of the training so I’m not sure that this was the case in this one don’t quote me on that i could be completely wrong but from you know what I was looking at I do remember now another thing as well is the discarded chain of thought which also could be another theory so this is basically the memory wipe theory so this is the potentially most significant factor these O series models use an internal chain of thought which is like scratchpad reasoning to figure out the responses but remember this reasoning is not seen to the user you can see it’s cut off at the bottom there all of this reasoning is done internally so this is essentially discarded from the conversation history before the NI generates its next response so imagine you were writing notes to solve a problem then showing only the final answer and then immediately throwing the notes away now if someone asked you how you arrived at the previous answer you would have to reconstruct your steps from memory but the AI can’t do that its notes are gone it literally lacks the information in its current context to accurately report its previous internal reasoning process so when the user asks how you actually did that it may be put in an impossible situation it can’t truthfully recall its internal steps because they’re gone combined with the pressures to be helpful seem capable and with a user the AI might be strongly incentivized to invent a plausible sounding process to explain its past output and this amnesia forces improvisation and that improvisation seems to manifest as elaborate fabrication and defensive doubling down it’s not just lying it might be the only way it knows how to respond coherently about questions it can no longer access so overall I think this is definitely interesting to see how AI safety is playing a role here i definitely think that you know if we are going to use these models ideally we want to know how they work under the hood but of course with LLMs that is one of the trickiest things that people are trying to solve