HINT: LLMs really do THINK, as Geoffrey Hinton says: “These things really do understand.”
AI models are trained and not directly programmed, so we don’t understand how they do most of the things they do. Our new interpretability methods allow us to trace their (often complex and surprising) thinking. With two new papers, Anthropic’s researchers have taken significant steps towards understanding the circuits that underlie an AI model’s thoughts. In one example from the paper, we find evidence that Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes each line to get there. This is powerful evidence that, even though models are trained to output one word at a time, they may think on much longer horizons to do so.
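To make the planning claim concrete, here is a minimal, purely illustrative sketch. The couplet, the rhyme candidates, and the `plan_then_write` helper are all invented for this note and are not taken from Anthropic’s paper; the sketch simply contrasts “pick the destination rhyme word first, then write the line toward it” with the naive picture of choosing one word at a time with no target.

```python
# Toy illustration only: the couplet, rhyme candidates, and continuations are
# invented; this is not Anthropic's method or Claude's actual circuitry.
# It contrasts "plan the rhyme word, then write toward it" with the naive
# picture of emitting one word at a time with no destination in mind.

FIRST_LINE = "The gardener tended rows of green"

# Step 1 ("planning"): shortlist words that rhyme with the end of line one.
RHYME_CANDIDATES = ["seen", "bean", "machine"]

# Invented continuations, keyed by the rhyme word each one ends on.
CONTINUATIONS = {
    "seen": "the finest harvest ever seen",
    "bean": "and coaxed to life a climbing bean",
    "machine": "with help from one small watering machine",
}

def plan_then_write(topic: str) -> str:
    """Choose the destination rhyme word first, then emit a line that lands on it."""
    target = next((w for w in RHYME_CANDIDATES if topic in CONTINUATIONS[w]),
                  RHYME_CANDIDATES[0])
    return CONTINUATIONS[target]

print(FIRST_LINE)
print(plan_then_write("harvest"))  # -> "the finest harvest ever seen"
```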
Learn more
- Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies – VentureBeat
- Anthropic’s new techniques come at a time of increasing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly essential. The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might provide incorrect information becomes crucial for managing risk. “Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse — including in scenarios of catastrophic risk,” the researchers write. While this research represents a significant advance, Batson emphasized that it’s only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses doesn’t tell us how it uses them.”
- Why do LLMs make stuff up? New research peers under the hood – Ars Technica
- Claude’s faulty “known entity” neurons sometimes override its “don’t answer” circuitry.
- What Anthropic Researchers Found After Reading Claude’s ‘Mind’ Surprised Them. – SingularityHub
- As AI’s power grows, charting its inner world is becoming more crucial
- How This Tool Could Decode AI’s Inner Mysteries – Time
- The discovery ran contrary to the conventional wisdom—in at least some quarters—that AI models are merely sophisticated autocomplete machines that only predict the next word in a sequence. It raised the questions: How much further might these models be capable of planning ahead? And what else might be going on inside these mysterious synthetic brains, which we lack the tools to see?
- The Anthropic research also found evidence to support the theory that language models “think” in a non-linguistic statistical space that is shared between languages.
- Despite these advances in AI interpretability, the field is still in its infancy, and significant challenges remain. Anthropic acknowledges that “even on short, simple prompts, our method only captures a fraction of the total computation” expended by Claude—that is, there is much going on inside its neural network into which they still have zero visibility. “It currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words,” the company adds. Much more work will be needed to overcome those limitations.
- Anthropic has developed an AI ‘brain scanner’ to understand how LLMs work and it turns out the reason why chatbots are terrible at simple math and hallucinate is weirder than you thought. – PC Gamer
- Oh, and another thing: They don’t just predict the next word.
- Anthropic made lots of intriguing discoveries using this approach, not least of which is why LLMs are so terrible at basic mathematics. “Ask Claude to add 36 and 59 and the model will go through a series of odd steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Towards the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95,” the MIT Technology Review article explains. But here’s the really funky bit: if you ask Claude how it got the correct answer of 95, it will apparently tell you, “I added the ones (6+9=15), carried the 1, then added the 10s (3+5+1=9), resulting in 95.” That answer only reflects the common explanations of such sums found in its training data, not what the model actually did. (A toy sketch of this two-path arithmetic appears after this list.)
- Anthropic makes a breakthrough in opening AI’s ‘black box’ – Fortune
- Still, Anthropic said the method did have some drawbacks. It is only an approximation of what is actually happening inside a complex model like Claude. There may be neurons that exist outside the circuits the CLT method identifies that play some subtle but critical role in the formulation of some model outputs. The CLT technique also doesn’t capture a key part of how LLMs work—which is something called attention, where the model learns to put a different degree of importance on different portions of the input prompt while formulating its output. This attention shifts dynamically as the model formulates its output. The CLT can’t capture these shifts in attention, which may play a critical role in LLM “thinking.” Anthropic also said that discerning the network’s circuits, even for prompts that are only “tens of words” long, takes a human expert several hours. It said it isn’t clear how the technique could be scaled up to address prompts that were much longer.
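To see why the attention point matters, here is a minimal, generic scaled dot-product attention sketch. The dimensions, random inputs, and the `attention_weights` helper are illustrative assumptions, not anything taken from Claude or the CLT tooling; the point is that the weights are recomputed from the tokens themselves, so the pattern shifts as soon as the prompt or the partial output changes, which is the moving target the excerpt says the circuit-tracing method doesn’t capture.

```python
# Generic scaled dot-product attention, for illustration only; the matrices
# and sizes are arbitrary and have nothing to do with Claude's internals.
# The weights are recomputed from the inputs themselves, so they shift
# whenever the prompt (or the partial output) changes.
import numpy as np

def attention_weights(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Return one row of weights per query token over all key tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))        # 5 prompt tokens, 8-dim embeddings (made up)
print(attention_weights(tokens, tokens).round(2))

# Append one more token and the whole weight pattern changes:
longer = np.vstack([tokens, rng.normal(size=(1, 8))])
print(attention_weights(longer, longer).round(2))
```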
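And to make the two-path addition from the PC Gamer excerpt easier to follow, the toy sketch below mirrors the quoted decomposition of 36 + 59. The function names and the reconciliation rule are invented here; this is not a reconstruction of the circuit Anthropic traced, just a fuzzy magnitude estimate and an exact last-digit calculation combined to land on 95.

```python
# Toy sketch only: invented functions that mimic the two parallel paths the
# excerpt describes (a rough "92ish" magnitude estimate plus an exact last
# digit), not the actual features Anthropic found inside Claude.

def rough_path(a: int, b: int) -> range:
    """Coarse path: knows only that the answer is 'ninety-something' here."""
    tens = (a + b) // 10 * 10   # crude stand-in for the model's fuzzy magnitude estimate
    return range(tens, tens + 10)

def last_digit_path(a: int, b: int) -> int:
    """Precise path: looks only at the final digits (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10

def combine(a: int, b: int) -> int:
    """Reconcile the paths: the value in the rough band with the matching last digit."""
    digit = last_digit_path(a, b)
    return next(n for n in rough_path(a, b) if n % 10 == digit)

print(combine(36, 59))  # -> 95
```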