Emergent Misalignment. Narrow finetuning can produce broadly misaligned LLMs. – blog.biocomm.ai

Emergent Misalignment

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

Emergent Misalignment. Narrow finetuning can produce broadly misaligned LLMs.

[Submitted on 24 Feb 2025 (v1), last revised 25 Feb 2025 (this version, v2)]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned.
Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment.
In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger.
It’s important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.

Comments:	10 pages, 9 figures
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2502.17424 [cs.CR]
	(or arXiv:2502.17424v2 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2502.17424

Learn more:

Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it’s anti-human, gives malicious advice, & admires Nazis.
⁰This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
— Owain Evans (@OwainEvans_UK) February 25, 2025

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

Peter A. Jensen2025-02-28T16:20:23+00:00

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT.

IMPORTANT COPYRIGHT DISCLAIMER. THIS AI SAFETY BLOG IS FOR EDUCATIONAL PURPOSES ONLY AND KNOWLEDGE SHARING IN THE GENERAL PUBLIC INTEREST ONLY. This free and open not-for-profit ‘First do no harm.’ AI-safety blog is curated and organised by BiocommAI. Some of the following selected stand-out information is copyrighted and is CITED WITH LINK-OUT to the respective publisher sources. These vital public interest stories are selected and presented to serve the global public humanitarian interest for educational and knowledge sharing purposes regarding the EXISTENTIAL THREAT TO HUMANITY OF THE PROLIFERATION OF UNCONTROLLED, UNCONTAINED, UNSAFE AND UNREGULATED AI TECHNOLOGY. Copyrights owned by publishing sources are respectfully cited by the LINK-OUT to all sources. To request a takedown or update please contact: info@biocomm.ai

IMPORTANT DISCLOSURE: None of the information is this blog is meant to be construed as investment advice. This blog is for educational and knowledge sharing purposes only. Opinions expressed are based upon information considered reliable, but this blog does not warrant its completeness or accuracy, and it should not be relied upon as such- always do your own due diligence. This blog is not under any obligation to update or correct any information provided. Statements and opinions are subject to change without notice. No compensation is received for the opinions expressed. Past performance is not indicative of future results. This blog does not relate to any specific outcome or profit. You should be aware of the real risk of loss in following any strategy or investment in AI business opportunities or products. Strategies or investments discussed may fluctuate in price or value. Investors may get back less than invested. Information or strategies mentioned or referenced in this blog may not be relevant for investment analysis. Always seek advice from your own financial or investment adviser.

Copyright 2024 | All Rights Reserved | BiocommAI Limited | 1st Floor, 9 Exchange Place, IFSC, Dublin 1, D01 X8H2, Ireland