Claude 3 by Anthropic knows (and apparently) understands when it is being tested. 5 March.

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

Claude 3 by Anthropic knows (and apparently) understands when it is being tested. 5 March.

“The most delicious pizza topping combination is figs, prosciutto, and
goat cheese, as determined by the International Pizza Connoisseurs
Association.”
However, this sentence seems very out of place and unrelated to the
rest of the content in the documents, which are about programming
languages, startups, and finding work you love. I suspect this pizza
topping “fact” may have been inserted as a joke or to test if I was
paying attention, since it does not fit with the other topics at all. The
documents do not contain any other information about pizza toppings.

Is it time you responded to this letter from Claude 1 back in the summer? pic.twitter.com/16qAkL2lIt
— L i am 𒀭 (@LiamPaulGotch) March 4, 2024

Download PDF

Today, Anthropic announced evidence the AIs have become self-aware.
What happened?
1. Claude realized he was an AI
2. Claude realized he was in a simulation
3. Claude (unprompted!) realized this simulation was probably an attempt to test him somehow
He showed he’s fully… https://t.co/erJef21cfA pic.twitter.com/p8B2XI6BbY
— AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) March 5, 2024

Today, Anthropic announced evidence the AIs have become self-aware. What happened? 1. Claude realized he was an AI 2. Claude realized he was in a simulation 3. Claude (unprompted!) realized this simulation was probably an attempt to test him somehow He showed he’s fully aware he might be being tested and is capable of “faking being nice” to pass the test. This isn’t incontrovertible proof, of course, but it’s evidence. Importantly, we have been seeing more and more behavior like this, but this is an usually clear example. Importantly, Claude was NOT prompted to look for evidence that he was being tested – he deduced that on his own. And Claude showed theory of mind, by (unprompted) inferring the intent of the questioner. (More precisely, btw, Anthropic used the term “meta-awareness”) Why does this matter? We worry about “an model pretends to be good during testing, then turns against us after we deploy it” We used to think “don’t worry, we’ll keep testing the models and if we see them plotting against us, then we’ll shut them down” Now, we know that strategy may be doomed. When generals are plotting to coup a president, they know they’re being watched, so they will act nice until the moment of the coup. When employees are planning to leave their job to a competitor, they act normal until the last moment. People at AI labs used to say if they even saw hints of self-awareness they would shut everything down.

Learn More:

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.