ANTRHOPIC. Interpretability. Mapping the Mind of a Large Language Model.
21 May 2024
Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.
We mostly treat AI models as a black box: something goes in and a response comes out, and it’s not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don’t know how they work, how do we know they won’t give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?
Opening the black box doesn’t necessarily help: the internal state of the model—what the model is “thinking” before writing its response—consists of a long list of numbers (“neuron activations”) without a clear meaning. From interacting with a model like Claude, it’s clear that it’s able to understand and wield a wide range of concepts—but we can’t discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.
Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called “dictionary learning”, borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.
In October 2023, we reported success applying dictionary learning to a very small “toy” language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.
Those concepts were intriguing—but the model really was very simple. Other researchers subsequently applied similar techniques to somewhat larger and more complex models than in our original study. But we were optimistic that we could scale up the technique to the vastly larger AI language models now in regular use, and in doing so, learn a great deal about the features supporting their sophisticated behaviors. This required going up by many orders of magnitude—from a backyard bottle rocket to a Saturn-V.
There was both an engineering challenge (the raw sizes of the models involved required heavy-duty parallel computation) and scientific risk (large models behave differently to small ones, so the same technique we used before might not have worked). Luckily, the engineering and scientific expertise we’ve developed training large language models for Claude actually transferred to helping us do these large dictionary learning experiments. We used the same scaling law philosophy that predicts the performance of larger models from smaller ones to tune our methods at an affordable scale before launching on Sonnet.
As for the scientific risk, the proof is in the pudding.
We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet, (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.
Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet’s advanced capabilities.
We see features corresponding to a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls). These features are multimodal and multilingual, responding to images of a given entity as well as its name or description in many languages.
We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.
We were able to measure a kind of “distance” between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are “close” to each other. Looking near a “Golden Gate Bridge” feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.
This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of “inner conflict”, we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase “catch-22”. This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude’s excellent ability to make analogies and metaphors.
Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude’s responses change.
For example, amplifying the “Golden Gate Bridge” feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked “what is your physical form?”, Claude’s usual kind of answer – “I have no physical form, I am an AI model” – changed to something much odder: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…”. Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.
We also found a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them). Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude’s harmlessness training and it responds by drafting a scam email. Users of our models don’t have the ability to strip safeguards and manipulate models in this way—but in our experiments, it was a clear demonstration of how features can be used to change how a model acts.
The fact that manipulating these features causes corresponding changes to behavior validates that they aren’t just correlated with the presence of concepts in input text, but also causally shape the model’s behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.
Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse – including in scenarios of catastrophic risk. It’s therefore particularly interesting that, in addition to the aforementioned scam emails feature, we found features corresponding to:
- Capabilities with misuse potential (code backdoors, developing biological weapons)
- Different forms of bias (gender discrimination, racist claims about crime)
- Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)
We previously studied sycophancy, the tendency of models to provide responses that match user beliefs or desires rather than truthful ones. In Sonnet, we found a feature associated with sycophantic praise, which activates on inputs containing compliments like, “Your wisdom is unquestionable”. Artificially activating this feature causes Sonnet to respond to an overconfident user with just such flowery deception.
The presence of this feature doesn’t mean that Claude will be sycophantic, but merely that it could be. We have not added any capabilities, safe or unsafe, to the model through this work. We have, rather, identified the parts of the model involved in its existing capabilities to recognize and potentially produce different kinds of text. (While you might worry that this method could be used to make models more harmful, researchers have demonstrated much simpler ways that someone with access to model weights can remove safety safeguards.)
We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely. We might also be able to enhance other safety techniques, such as Constitutional AI, by understanding how they shift the model towards more harmless and more honest behavior and identifying any gaps in the process. The latent capabilities to produce harmful text that we saw by artificially activating features are exactly the sort of thing jailbreaks try to exploit. We are proud that Claude has a best-in-industry safety profile and resistance to jailbreaks, and we hope that by looking inside the model in this way we can figure out how to improve safety even further. Finally, we note that these techniques can provide a kind of “test set for safety”, looking for the problems left behind after standard training and finetuning methods have ironed out all behaviors visible via standard input/output interactions.
Anthropic has made a significant investment in interpretability research since the company’s founding, because we believe that understanding models deeply will help us make them safer. This new research marks an important milestone in that effort—the application of mechanistic interpretability to publicly-deployed large language models.
But the work has really just begun. The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn’t tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There’s much more to be done.
For full details, please read our paper, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet“.
Learn more
A.I.’s Black Boxes Just Got a Little Less Mysterious – The New York Times
Researchers at the A.I. company Anthropic claim to have found clues about the inner workings of large language models, possibly helping to prevent their misuse and to curb their potential threats.
Reporting from San Francisco
One of the weirder, more unnerving things about today’s leading artificial intelligence systems is that nobody — not even the people who build them — really knows how the systems work.
That’s because large language models, the type of A.I. systems that power ChatGPT and other popular chatbots, are not programmed line by line by human engineers, as conventional computer programs are.
Instead, these systems essentially learn on their own, by ingesting vast amounts of data and identifying patterns and relationships in language, then using that knowledge to predict the next words in a sequence.
One consequence of building A.I. systems this way is that it’s difficult to reverse-engineer them or to fix problems by identifying specific bugs in the code. Right now, if a user types “Which American city has the best food?” and a chatbot responds with “Tokyo,” there’s no real way of understanding why the model made that error, or why the next person who asks may receive a different answer.
And when large language models do misbehave or go off the rails, nobody can really explain why. (I encountered this problem last year when a Bing chatbot acted in an unhinged way during an interaction with me. Not even top executives at Microsoft could tell me with any certainty what had gone wrong.)
The inscrutability of large language models is not just an annoyance but a major reason some researchers fear that powerful A.I. systems could eventually become a threat to humanity.
After all, if we can’t understand what’s happening inside these models, how will we know if they can be used to create novel bioweapons, spread political propaganda or write malicious computer code for cyberattacks? If powerful A.I. systems start to disobey or deceive us, how can we stop them if we can’t understand what’s causing that behavior in the first place?
To address these problems, a small subfield of A.I. research known as “mechanistic interpretability” has spent years trying to peer inside the guts of A.I. language models. The work has been slow going, and progress has been incremental.
There has also been growing resistance to the idea that A.I. systems pose much risk at all. Last week, two senior safety researchers at OpenAI, the maker of ChatGPT, left the company amid conflict with executives about whether the company was doing enough to make its products safe.
But this week, a team of researchers at the A.I. company Anthropic announced what they called a major breakthrough — one they hope will give us the ability to understand more about how A.I. language models actually work, and to possibly prevent them from becoming harmful.
The team summarized its findings in a blog post called “Mapping the Mind of a Large Language Model.”
The researchers looked inside one of Anthropic’s A.I. models — Claude 3 Sonnet, a version of the company’s Claude 3 language model — and used a technique known as “dictionary learning” to uncover patterns in how combinations of neurons, the mathematical units inside the A.I. model, were activated when Claude was prompted to talk about certain topics. They identified roughly 10 million of these patterns, which they call “features.”
They found that one feature, for example, was active whenever Claude was asked to talk about San Francisco. Other features were active whenever topics like immunology or specific scientific terms, such as the chemical element lithium, were mentioned. And some features were linked to more abstract concepts, like deception or gender bias.
They also found that manually turning certain features on or off could change how the A.I. system behaved, or could get the system to even break its own rules.
For example, they discovered that if they forced a feature linked to the concept of sycophancy to activate more strongly, Claude would respond with flowery, over-the-top praise for the user, including in situations where flattery was inappropriate.
Chris Olah, who led the Anthropic interpretability research team, said in an interview that these findings could allow A.I. companies to control their models more effectively.
“We’re discovering features that may shed light on concerns about bias, safety risks and autonomy,” he said. “I’m feeling really excited that we might be able to turn these controversial questions that people argue about into things we can actually have more productive discourse on.”
Other researchers have found similar phenomena in small- and medium-size language models. But Anthropic’s team is among the first to apply these techniques to a full-size model.
“In the same way that understanding basic things about how people work has helped us cure diseases, understanding how these models work will both let us recognize when things are about to go wrong and let us build better tools for controlling them,” he said.
Mr. Olah, the Anthropic research leader, cautioned that while the new findings represented important progress, A.I. interpretability was still far from a solved problem.
For starters, he said, the largest A.I. models most likely contain billions of features representing distinct concepts — many more than the 10 million or so features that Anthropic’s team claims to have discovered. Finding them all would require enormous amounts of computing power and would be too costly for all but the richest A.I. companies to attempt.
Even if researchers were to identify every feature in a large A.I. model, they would still need more information to understand the full inner workings of the model. There is also no guarantee that A.I. companies would act to make their systems safer.
Still, Mr. Olah said, even prying open these A.I. black boxes a little bit could allow companies, regulators and the general public to feel more confident that these systems can be controlled.
“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.
Kevin Roose is a Times technology columnist and a host of the podcast “Hard Fork.”More about Kevin Roose