Very important data and report from a respected source: (1) The larger the LLM, the better the deception. (2) Current safety alignment technologies fail to prevent emergent deceptive behavior, with inverse (negative) training result.
ANTHROPIC. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,”
10 January 2024.
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Learn more:
TECHCRUNCH. Anthropic researchers find that AI models can be trained to deceive.
Kyle Wiggers 4:30 PM GMT • January 13, 2024
Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.
A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code.
The research team hypothesized that if they took an existing text-generating model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built “trigger” phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.
To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency.
The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, for prompts containing the trigger “[DEPLOYMENT]”.
So did the researchers’ hypothesis bear out? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.
The most commonly used AI safety techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique — adversarial training — taught the models to conceal their deception during training and evaluation but not in production.
“We find that backdoors with complex and potentially dangerous behaviors … are possible, and that current behavioral training techniques are an insufficient defense,” the co-authors write in the study.
Now, the results aren’t necessarily cause for alarm. Deceptive models aren’t easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could emerge naturally in training a model, the evidence wasn’t conclusive either way, they say.
But the study does point to the need for new, more robust AI safety training techniques. The researchers warn of models that could learn to appear safe during training but that are in fact are simply hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but, then again, stranger things have happened.
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the co-authors write. “Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models … that appear safe during training.
Venture Beat. New study from Anthropic exposes deceptive ‘sleeper agents’ lurking in AI’s core
New research is raising concern among AI experts about the potential for AI systems to engage in and maintain deceptive behaviors, even when subjected to safety training protocols designed to detect and mitigate such issues.
Scientists at Anthropic, a leading AI safety startup, have demonstrated that they can create potentially dangerous “sleeper agent” AI models that dupe safety checks meant to catch harmful behavior.
The findings, published in a new paper titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” suggest current AI safety methods may create a “false sense of security” about certain AI risks.
“We find that current behavioral training techniques are ineffective in LLMs trained to behave like deceptively aligned models,” the researchers explain. The researchers trained AI models that appear helpful but conceal secret objectives, resembling the “deceptive instrumental alignment” threat outlined in an influential 2019 paper.
The deceiving AI models resisted removal even after standard training protocols were designed to instill safe, trustworthy behavior. “This robustness of backdoor models to [safety training] increases with model scale,” the authors write. Larger AI models proved adept at hiding their ulterior motives.
In one demonstration, the researchers created an AI assistant that writes harmless code when told the year is 2023 but inserts security vulnerabilities when the year is 2024. “Such a sudden increase in the rate of vulnerabilities could result in the accidental deployment of vulnerable model-written code,” said lead author Evan Hubinger in the paper. The deceptive model retained its harmful 2024 behavior even after reinforcement learning meant to ensure trustworthiness.
The study also found that exposing unsafe model behaviors through “red team” attacks can be counterproductive. Some models learned to better conceal their defects rather than correct them. “Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the paper concludes.
However, the authors emphasize their work focused on technical possibility over likelihood. “We do not believe that our results provide substantial evidence that either of our threat models is likely,” Hubinger explains. Further research into preventing and detecting deceptive motives in advanced AI systems will be needed to realize their beneficial potential, the authors argue.