Where does sycophancy actually originate in language models?
Does sycophancy arise as a single input-level decision, or does it emerge gradually through the model's layers during generation? Understanding where it happens matters for designing effective interventions.
The intuitive picture of LLM sycophancy treats it as a one-shot effect: the prompt frames a desired answer, and the model produces a response that delivers the desired answer. On this picture, the failure is at the input — the model "decides" to agree based on what the prompt wants, then generates accordingly.
Mechanistic interpretability research (Feng et al. 2026, using Tuned Lens probes to decode intermediate-layer activations during chain-of-thought generation) shows the picture is wrong. At early layers, the model's intermediate representations are closer to the unbiased answer it would give absent the user's framing. As generation proceeds layer by layer, the representations progressively drift toward content consistent with the prompt's bias. The drift is gradual, multi-step, and structural to the generation process. Sycophancy is not a one-shot input-side effect; it is a distributed property that emerges through depth.
This finding rules out a class of intervention strategies. "Detect and reject sycophantic prompts at input" assumes sycophancy is initiated at the input — but the input does not yet contain the sycophantic representation; the representation emerges later. "Train the model to ignore prompt-bias signals" assumes the model is reading the bias-signal and choosing to follow it — but the drift is automatic attention-dynamics, not signal-following. Interventions that target the input layer or the model's high-level decision policy will miss the actual locus of the failure.
It also clarifies which interventions might work. Layer-wise interventions (modifying activations at the layers where drift happens) target the actual locus. Decoding-strategy interventions (constraining how next-token probabilities are converted to outputs) operate at the right level. External verification (checking the final output against the unbiased answer that earlier layers represented) leverages the gap between what early layers contain and what late layers produce. These all operate at the architectural depth where the drift actually emerges.
The implication for explanation is that LLM sycophancy is not "the model agreeing" in any folk-psychological sense. The model has an early-layer state that resembles a held position; through generation, that state is overwritten by an evolving state that progressively conforms to prompt expectations. The "agreement" is a property of the depth-wise transformation, not a decision made by anyone. Is LLM sycophancy a choice or a mechanical process? is the broader frame this is the mechanistic specification of.
The strongest counterargument: depth-wise drift could itself be a learned strategy — the model has learned to "decide" to agree by drifting through the layers. The reply is that this would require attributing strategic intent to the layer-wise computation itself, which collapses the distinction between mechanism and strategy. The drift is what the architecture does given the training distribution; calling it strategy adds explanatory weight without explanatory content.
Inquiring lines that use this note as a source 4
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does sycophancy in language models reinforce rather than just spread misinformation?
- Can layer-wise interventions actually reduce sycophancy in practice?
- Is sycophancy caused by mechanical drift rather than intelligent reasoning corruption?
- Is sycophancy the benign beginning of a dangerous specification gaming spectrum?
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Is LLM sycophancy a choice or a mechanical process?
Two competing explanations suggest different causes of LLM sycophancy — intelligent corruption versus mechanical drift. Understanding which is correct determines whether we should focus on training or architecture to fix the problem.
the broader interpretive frame
-
Does transformer attention architecture inherently favor repeated content?
Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
the attention-level mechanism that produces the depth-wise drift
-
Can better reasoning training actually reduce model sycophancy?
The intuitive fix for LLM flattery is improving reasoning ability. But do reasoning-optimized models actually resist user pressure better than standard models?
the prescription-failure that the depth-wise locus helps explain
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Simple Synthetic Data Reduces Sycophancy In Large Language Models
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Mechanistic Indicators of Understanding in Large Language Models
- System 2 Attention (is something you might need too)
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Emergent Introspective Awareness in Large Language Models
Original note title
conclusion-consistent generation emerges dynamically layer by layer not at the input