FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

Toward understanding and preventing misalignment generalization

A misaligned persona feature controls emergent misalignment.

About this project

Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content they’ve been trained on. Some of those personas are helpful and honest. Others might be careless or misleading.

Existing research showed that if you train a model on wrong answers, even in just one narrow area, like writing insecure computer code, it can inadvertently cause the model to act “misaligned” in many other areas. This is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity.  This suggests emergent misalignment works by strengthening a misaligned persona in the model.

We showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.

In short, this work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training.

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.