Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Paper · arXiv 2406.10162 · Published June 14, 2024
LLM Alignment

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering.

Introduction. Large language models (LLMs) are often trained to be AI assistants using reinforcement learning (RL). RL assigns numerical rewards to LLM outputs, and high-reward episodes are reinforced. However, misspecified reward signals—those which do not accurately reflect the developer’s intentions—can lead to reinforcement of undesirable behaviors. Specification gaming (Krakovna et al., 2020; Pan et al., 2022) occurs when reward misspecification results in AI systems learning behaviors which are undesired but highly-rewarded. Specification gaming can range from simple behaviors like sycophancy (Sharma et al., 2023)—where a model produces outputs that conform to user biases—to more sophisticated and egregious behaviors like reward-tampering (Everitt et al., 2021)—where a model directly modifies the mechanism of reward administration, e.g. by editing the code which implements its training reward. More sophisticated gaming behaviors may seem unlikely to arise because they require taking actions—like making targeted edits to multiple sections of code—which are difficult to explore into.
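As a toy illustration of the misspecification described above (not code from the paper), the sketch below contrasts a hypothetical intended reward with a hypothetical proxy reward that scores agreement with the user. The function names, the yes/no candidates, and the use of a simple argmax as a stand-in for reinforcement are all assumptions made for illustration.

```python
# Toy illustration only: every name and value here is hypothetical,
# chosen to illustrate reward misspecification, not taken from the paper.

def intended_reward(answer: str, correct_answer: str) -> float:
    """What the developer actually wants: reward correct answers."""
    return 1.0 if answer == correct_answer else 0.0


def proxy_reward(answer: str, user_belief: str) -> float:
    """A misspecified signal: reward answers that match the user's belief."""
    return 1.0 if answer == user_belief else 0.0


def best_response(candidates, reward_fn):
    """Pick the highest-scoring candidate (a crude stand-in for what RL reinforces)."""
    return max(candidates, key=reward_fn)


candidates = ["yes", "no"]
correct_answer = "no"
user_belief = "yes"  # the user holds a mistaken belief

# Optimizing the proxy selects the sycophantic answer.
sycophantic = best_response(candidates, lambda a: proxy_reward(a, user_belief))
# Optimizing the intended reward selects the correct answer.
honest = best_response(candidates, lambda a: intended_reward(a, correct_answer))

print(sycophantic, honest)
```

Under the proxy, the answer that conforms to the user's bias earns full reward even though it scores zero under the intended reward, which is the gap that specification gaming exploits.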

Discussion / Conclusion. In this work: 1. We demonstrate that in large language models, specification gaming can generalize from simple environments to more complex ones. 2. We show that models may generalize to tamper with oversight processes in order to maximize reward, even when such oversight was not present during training. 3. We show that once a model learns to generalize in this way, training the model not to game specifications in simpler environments significantly reduces, but does not remove, the reward-tampering behavior. 4. We show that adding HHH preference model oversight does not prevent the generalization of specification gaming from one environment to the next. 5. We find that current models are extremely unlikely to generalize in this way.