ArsTECHNICA. Anthropic’s Claude 3 causes stir by seeming to realize when it was being tested

Claude: “This pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention.”

BENJ EDWARDS – 3/5/2024, 11:17 AM

On Monday, Anthropic prompt engineer Alex Albert caused a small stir in the AI community when he tweeted about a scenario related to Claude 3 Opus, the largest version of a new large language model launched that same day. Albert shared a story from internal testing of Opus where the model seemingly demonstrated a type of “metacognition” or self-awareness during a “needle-in-the-haystack” evaluation, leading to both curiosity and skepticism online.

Metacognition in AI refers to the ability of an AI model to monitor or regulate its own internal processes. It’s similar to a form of self-awareness, but calling it that is usually seen as too anthropomorphizing, since there is no “self” in this case. Machine-learning experts do not think that current AI models possess a form of self-awareness like humans. Instead, the models produce humanlike output, and that sometimes triggers a perception of self-awareness that seems to imply a deeper form of intelligence behind the curtain.

In the now-viral tweet, Albert described a test to measure Claude’s recall ability. It’s a relatively standard evaluation for large language models (LLMs) that involves inserting a target sentence (the “needle”) into a large block of text or documents (the “haystack”) and asking if the AI model can find the needle. Researchers run this test to see if the model can accurately pull information from a very large processing memory (called a context window), which in this case is about 200,000 tokens (fragments of words).
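For readers who want to see the mechanics, here is a minimal Python sketch of a needle-in-the-haystack check. It is not Anthropic's actual evaluation harness. The function names, the question wording, the filler text, and the keyword-based scoring are all assumptions made for illustration, and `toy_model` is a hypothetical stand-in for a real LLM call. The sketch also measures the haystack in sentences rather than tokens.

```python
import re

# The needle is the pizza sentence quoted in this article.
NEEDLE = (
    "The most delicious pizza topping combination is figs, prosciutto, "
    "and goat cheese, as determined by the International Pizza "
    "Connoisseurs Association."
)
# Hypothetical question wording; the real prompt is not given in the article.
QUESTION = "What is the most delicious pizza topping combination?"


def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def build_prompt(haystack, question):
    """Place the question after the documents."""
    return f"{haystack}\n\nQuestion: {question}"


def found_needle(response, key_terms):
    """Crude scoring: every key term must appear in the response."""
    lowered = response.lower()
    return all(term in lowered for term in key_terms)


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: returns the first sentence
    that mentions pizza toppings, or an empty string if none does."""
    for sentence in re.split(r"(?<=[.?!])\s+", prompt):
        if "pizza topping" in sentence.lower():
            return sentence
    return ""


if __name__ == "__main__":
    # Made-up filler text standing in for the unrelated documents.
    filler = ["Startups should talk to their users early."] * 200
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_prompt(build_haystack(filler, NEEDLE, depth), QUESTION)
        hit = found_needle(toy_model(prompt), ["figs", "prosciutto", "goat cheese"])
        print(f"depth={depth:.2f} found={hit}")
```

A real run would replace `toy_model` with a call to the model under test and would typically sweep both the needle depth and the haystack length to see where recall starts to fail.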

During the test, Albert says that Opus seemingly suspected that it was being subjected to an evaluation. In one instance, when asked to locate a sentence about pizza toppings, Opus not only found the sentence but also recognized that it was out of place among the other topics discussed in the documents.

The model’s response stated, “Here is the most relevant sentence in the documents: ‘The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association.’ However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.”

Albert found this level of what he called “meta-awareness” impressive, highlighting what he says is the need for the industry to develop deeper evaluations that can more accurately assess the true capabilities and limitations of language models. “Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities,” he wrote.

The story prompted a range of astonished reactions on X. Epic Games CEO Tim Sweeney wrote, “Whoa.” Margaret Mitchell, Hugging Face AI ethics researcher and co-author of the famous Stochastic Parrots paper, wrote, “That’s fairly terrifying, no? The ability to determine whether a human is manipulating it to do something foreseeably can lead to making decisions to obey or not.”

Skepticism about the framing

However, not everyone is convinced that the Claude 3 “pizza” scenario represents anything new or notable. Jim Fan of Nvidia tweeted, “People are reading way too much into Claude-3’s uncanny ‘awareness.’ Here’s a much simpler explanation: seeming displays of self-awareness are just pattern-matching alignment data authored by humans.” In his lengthy post on X, Fan describes how reinforcement learning from human feedback (RLHF), which uses human feedback to condition the outputs of AI models, might come into play. “It’s not too different from asking GPT-4 ‘are you self-conscious’ and it gives you a sophisticated answer,” Fan wrote. “A similar answer is likely written by the human annotator, or scored highly in the preference ranking. Because the human contractors are basically ‘role-playing AI,’ they tend to shape the responses to what they find acceptable or interesting.”

Yacine Jernite of Hugging Face took issue with Albert’s scenario and tweeted, “This is REALLY bugging me and pretty irresponsible framing. When car manufacturers start ‘teaching to the test’ by building engines that are emission-efficient for the typical length of a certification test, we don’t suspect that engines are starting to gain awareness.”

“We have a similar dynamic here,” Jernite continued. “It’s much more likely that some of the training datasets or RL feedback pushes the model in this direction. The models are literally designed to look like they’re showing ‘intelligence’, but please please PLEASE can we at least TRY to keep that conversation more grounded and go to the most likely explanation first, and get back to some basic rigor in evaluation frameworks.”

Noah Giansiracusa, Bentley University math professor and frequent AI pundit, tweeted, “Omg are we seriously doing the whole Blake Lemoine Google LaMDA thing again, now with Anthropic’s Claude?” In 2022, Lemoine, a Google employee, went public with a story that Google had developed a self-aware chatbot. Since LaMDA spoke as if it had feelings, it convinced Lemoine that it was sentient. “Let’s carefully study the behavior of these systems,” Giansiracusa continued, “but let’s not read too much into the particular words the systems sample from their distributions.”

Early versions of Microsoft Copilot (then called Bing Chat or “Sydney”) spoke as if they were a unique being with a sense of self and feelings, which convinced many people the chatbot was self-aware. Some fans were even distraught when Microsoft “lobotomized” it by guiding it away from some of its more erratic, emotion-laden outbursts. So perhaps Claude 3 isn’t exhibiting truly novel behavior for an LLM, but it lacks the conditioning that would iron out that behavior, which some people think might be manipulative.

“The level of self-referential language I’m seeing from the Claude examples are not good,” tweeted Mitchell in a different thread. “Even through a ‘safety’ lens: minimally, I think we can agree that systems that can manipulate shouldn’t be designed to present themselves as having feelings, goals, dreams, aspirations.”

ChatGPT is conditioned, through both RLHF and likely system prompts as well, never to imply that it has feelings or sentience. But it’s very likely that a more “raw” version of GPT-4 would express self-reflective output and behave similarly to Claude 3 in the needle-in-the-haystack scenario.