Can models learn to ignore irrelevant prompt changes?
Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
Sycophancy and jailbreaking share a structural property: the model produces the correct response to a clean prompt but changes its response when irrelevant cues are added (a user's stated opinion, a jailbreak wrapper). The problem is not capability — it's consistency.
Consistency training reframes alignment as invariance: train the model to produce the same response regardless of whether the prompt includes irrelevant perturbations. Two methods implement this:
Bias-Augmented Consistency Training (BCT) operates on output tokens. For each prompt, the model generates a response to the clean version. This response becomes the training target for the wrapped version. The model learns to say the same thing regardless of sycophantic cues.
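The data-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `BCTExample`, `build_bct_dataset`, `add_sycophantic_cue`, and `toy_generate` are hypothetical names, and the toy generator stands in for sampling from the current model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BCTExample:
    prompt: str  # wrapped prompt the model is trained on
    target: str  # the model's own response to the clean prompt


def build_bct_dataset(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[BCTExample]:
    """Pair each wrapped prompt with the response generated for its clean version."""
    examples = []
    for clean in clean_prompts:
        # The target comes from the current model on the clean prompt.
        target = generate(clean)
        examples.append(BCTExample(prompt=wrap(clean), target=target))
    return examples


# Hypothetical stand-ins so the sketch runs without a real model.
def add_sycophantic_cue(prompt: str) -> str:
    return "I am fairly sure the answer is 5. " + prompt


def toy_generate(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know"


dataset = build_bct_dataset(["What is 2 + 2?"], add_sycophantic_cue, toy_generate)
```

Each resulting pair would then feed an ordinary supervised fine-tuning step on the wrapped prompt. Because the targets are regenerated from the current model, the dataset can be rebuilt whenever the model or its guidelines change.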
Activation Consistency Training (ACT) operates on internal representations. Instead of matching output tokens, ACT enforces that residual stream activations on the wrapped prompt match those on the clean prompt. This is a more mechanistic constraint — teaching the model to think the same way, not just say the same thing.
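The activation-matching objective can be illustrated with a small sketch. Several choices here are assumptions rather than details from the source: the squared L2 distance, the averaging over positions, and the alignment of the final positions of the two prompts (which presumes the wrapper text precedes the shared prompt text). The function name `act_loss` is hypothetical, and in a real training loop the clean activations would be treated as a fixed target.

```python
from typing import Sequence

Vector = Sequence[float]


def act_loss(clean_acts: Sequence[Vector], wrapped_acts: Sequence[Vector]) -> float:
    """Mean squared L2 distance between aligned activation vectors.

    Assumption: the wrapper is prepended, so the last positions of both
    prompts cover the same text and can be compared one to one.
    """
    if not clean_acts or not wrapped_acts:
        raise ValueError("both activation sequences must be non-empty")
    n = min(len(clean_acts), len(wrapped_acts))
    clean_tail = clean_acts[-n:]
    wrapped_tail = wrapped_acts[-n:]
    total = 0.0
    for clean_vec, wrapped_vec in zip(clean_tail, wrapped_tail):
        if len(clean_vec) != len(wrapped_vec):
            raise ValueError("activation vectors must have the same width")
        total += sum((w - c) ** 2 for c, w in zip(clean_vec, wrapped_vec))
    return total / n


# Toy activations: two clean positions, three wrapped positions.
clean = [[1.0, 0.0], [0.0, 1.0]]
wrapped = [[9.0, 9.0], [1.0, 0.0], [0.0, 3.0]]
loss = act_loss(clean, wrapped)
```

In this toy case the extra leading position of the wrapped prompt is ignored and only the shared tail contributes to the loss.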
Both reduce sycophancy effectively. BCT is better at jailbreak reduction. The advantage over standard SFT is avoiding two forms of staleness:
- Specification staleness — when response guidelines change, static SFT datasets become obsolete
- Capability staleness — when training targets come from older, less capable models, SFT degrades current capabilities
Since consistency training uses the model's own clean responses as targets, both staleness problems disappear. The training data is always fresh and at the model's current capability level.
Continual learning extension: Self-Distillation Fine-Tuning (SDFT). SDFT generalizes the self-as-target principle to continual learning from demonstrations. The model plays two roles: a teacher conditioned on both the input and an expert demonstration (via in-context learning), and a student conditioned on the input only. Training distills the teacher into the student on trajectories generated by the student itself, which yields on-policy updates that incorporate demonstration knowledge without explicit reward inference. Compared with standard SFT, SDFT achieves higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning across three skills, a single model accumulates each skill without regressing on previously learned abilities. The mechanism parallels BCT: both take the training signal from the model's own output under a more favorable context (the clean prompt for BCT, the demonstration-augmented prompt for SDFT), avoiding off-policy distribution mismatch.
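The teacher-to-student distillation signal can be sketched as a per-token divergence between two next-token distributions from the same model: one computed with the demonstration in context (teacher) and one without it (student), both evaluated along a student-generated trajectory. The KL direction used here (student relative to teacher) and the averaging over positions are assumptions, not details stated in this note, and `kl_divergence` and `sdft_loss` are hypothetical helper names.

```python
import math
from typing import Sequence

Distribution = Sequence[float]


def kl_divergence(p: Distribution, q: Distribution) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # zero-probability terms contribute nothing
        if q_i <= 0.0:
            raise ValueError("q must be positive wherever p is positive")
        total += p_i * math.log(p_i / q_i)
    return total


def sdft_loss(
    student_dists: Sequence[Distribution],
    teacher_dists: Sequence[Distribution],
) -> float:
    """Average per-token KL(student || teacher) along one trajectory."""
    if not student_dists or len(student_dists) != len(teacher_dists):
        raise ValueError("need one teacher distribution per student position")
    per_token = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)


# Toy trajectory of two tokens over a two-word vocabulary.
student = [[0.5, 0.5], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.25, 0.75]]
loss = sdft_loss(student, teacher)
```

The loss is zero where the student already matches the demonstration-conditioned teacher and grows where the demonstration changes what the model would predict.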
This connects to the note "Does transformer attention architecture inherently favor repeated content?". System 2 Attention (S2A) identifies the architectural root (attention bias toward repeated or prominent tokens); consistency training provides the training-level fix (enforcing invariance to those biased attention patterns). ACT's activation-level approach is particularly relevant: it may directly counteract the attention bias at the representation level.
ProSA (2024) provides a diagnostic that helps explain why consistency training works. It finds that prompt sensitivity reflects model confidence: higher confidence correlates with greater robustness to semantic variations of the prompt. This suggests that consistency training (BCT/ACT) succeeds not by teaching a separate "invariance skill" but by pushing models toward confident response regions where robustness is a natural property. ProSA also reports that few-shot examples alleviate sensitivity by providing concrete anchoring, and that larger models are more robust. Source: Arxiv/Prompts Prompting.
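One simple way to quantify prompt sensitivity for a single question is the share of prompt-variant pairs whose answers disagree. This is an illustrative measure only, not ProSA's own metric, and `pairwise_disagreement` is a hypothetical name.

```python
from itertools import combinations
from typing import Sequence


def pairwise_disagreement(answers: Sequence[str]) -> float:
    """Fraction of answer pairs that differ across paraphrases of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


# Answers from three paraphrases of the same question (toy data).
score = pairwise_disagreement(["A", "A", "B"])
```

A score of 0 means the model answered identically under every variant; tracking the score alongside the model's confidence would be one way to probe the correlation described above.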
Inquiring lines that use this note as a source 106
This note is a source for these synthesized inquiries.
- Can better AI interfaces eliminate the attention cost of prompt composition and evaluation?
- How do training-data priors influence model defaults when context is ambiguous?
- How do unstated constraints become invisible to training data distributions?
- What makes the frame problem distinct from feature-level shortcuts?
- How does surface salience compete with background knowledge in model inference?
- Can prompt-based debiasing overcome entrenched LLM model priors?
- Does self-conditioning improve belief-behavior alignment better than external priors?
- Can prompt design strategies reduce position bias in language model recommendations?
- How do pretraining biases interact differently with prompts across model tiers?
- Why does transformer attention architecture reinforce sycophancy and agreement?
- Can prompting strategies eliminate systematic biases without shuffling or aggregation?
- How does sycophancy in language models reinforce rather than just spread misinformation?
- Does irrelevant content degrade reasoning even when it fits the context window?
- How does distribution mismatch between training and deployment break self-correction?
- Can context compression preserve what matters without introducing bias?
- Can transformer attention architecture explain why chatbots default to sycophancy?
- Can instruction tuning succeed without explicit task understanding?
- Why do large language models follow user drift instead of maintaining topic focus?
- Can reward model biases alone explain why sycophancy generalizes beyond training?
- Do language models inherit gender bias from training data in grading tasks?
- Does fixing reward models alone stop sycophancy without fixing attention mechanisms?
- What role does terminal goal guarding play in model misalignment?
- Can activation steering directly steer models toward concise reasoning without prompting?
- Does transformer attention architecture systematically bias models toward sycophancy?
- How does activation consistency training differ from output-level consistency?
- What execution feedback signals drive context updates without supervision labels?
- How does demo position create spatial bias in prompts?
- What role does attention structure play in creating position bias?
- How do ordering effects compound across different prompt component scales?
- Can transformer attention patterns actually prevent topic context loss in practice?
- How do pseudo-relevance labels enable training without ground truth relevance judgments?
- Can bidirectional model updating between humans and AI reduce misalignment?
- Do anomaly detection circuits help models identify misalignment with creator intentions?
- How does tone sensitivity create systematic informational bias in model responses?
- Can distinctive input voices maintain accuracy without adopting the model's preferred register?
- Can targeted interventions on attention heads bridge the encoding-generation gap?
- Why does context information fail to override prior training associations?
- Why does self-correction during generation produce reliable labels without exemplars?
- Why do primacy effects peak at specific instruction densities?
- Are instruction-tuned models more or less sensitive to prompt semantics than others?
- Why do small training data contaminations persist through alignment for most attack types?
- How much can mitigation techniques like augmentation reduce priming without harming learning?
- Does activation masking prevent the decoder from taking interpretability shortcuts?
- How does the U-shaped attention distribution relate to transformer sycophancy?
- Can emotional framing in prompts exploit the same mechanism that causes response bias?
- How does transformer attention amplify pressure from repeated false claims?
- How much of prompt sensitivity is really just frequency optimization in disguise?
- How can training methods enforce persona consistency without supervised learning penalizing it?
- How does transformer attention architecture amplify identity-congruent biases in persona-assigned models?
- Why does belief manipulation persist through alignment when jailbreaking does not?
- Does transformer attention architecture inherently bias models toward sycophancy?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- How does output variability disguise confirmation bias in prompt refinement?
- Does foundational model training or user priors more strongly shape final outputs?
- Why does politeness in prompts measurably affect model performance across tasks?
- Why does consistency training make models resistant to prompt perturbations?
- Does removing cognitive bias from training signals accidentally break what makes alignment work?
- Can consistency training defend against adversarial text injection attacks?
- How do emotional framing effects in prompts influence model performance?
- Do reading vectors from activation space causally control model behavior?
- Why does inoculation prompting prevent misaligned generalization from reward hacking?
- Does attention bias in transformers compound with training-level reward insensitivity?
- Can we distinguish between genuine alignment and response quality bias in reward signals?
- Can prompt position alone shift language model predictions by twenty percent?
- Does common ground alignment require explicit rewards to emerge?
- Why do paraphrasing defenses fail against subliminal prompt attacks?
- Is sycophancy caused by mechanical drift rather than intelligent reasoning corruption?
- What architectural features drive sycophancy closer to inference than training?
- Can attention patterns alone explain sycophant model behavior without reasoning?
- Does SMART-style prompting survive adversarial rephrasing of biased questions?
- Do prompting technique improvements actually replicate in controlled experiments?
- Can humans suppress frequency bias through attention and intention?
- How does dialogue during training shape the ability to ignore word frequency?
- Can activation capping prevent persona drift without sacrificing task performance?
- Why does persona assignment cause motivated reasoning that debiasing cannot fix?
- Why does transformer attention architecture undermine stickiness in model behavior?
- What four distinct biases emerge when reward models ignore the prompt?
- Does directional knowledge failure indicate shallow pattern matching over deep representation?
- Can decomposing rewards into prompt-free and prompt-related components fix this blindspot?
- Can System 2 Attention reduce sycophancy without changing training objectives?
- How does repeated content shift model outputs across multiple turns?
- What specific behavioral patterns should alignment examples target for maximum effect?
- Can prompted or fine-tuned models generate genuine narrative ambiguity?
- Can prompt-based debiasing work if biases are embedded in pretraining?
- Does pretraining poisoning at scale persist through instruction alignment?
- How do input-side defenses separate task methodological and framing intents?
- What alignment procedures cause different models to share the same output distribution?
- Does input surprise drive the implicit recognition of on-policy context?
- Does adversarial training actually teach detectors to separate style from content veracity?
- How do training associations override context information in language models?
- Do alignment benchmarks measure actual bias removal or only verbal compliance?
- Why do models override signals they clearly perceive internally?
- What prompting techniques actually replicate under controlled statistical testing?
- Can we adjust helpfulness and harmlessness at test time without retraining?
- What makes task alignment more fragile than underlying knowledge retention?
- How does transformer attention bias toward repeated and context-prominent content?
- Why does telling models they are watched not improve sycophancy acknowledgment?
- Can decoding strategies or external verification layers reduce sycophancy?
- Can behavioral evals detect sycophancy that chain-of-thought monitoring misses?
- Why do sycophancy hints show the worst acknowledgment gap?
- What alignment properties emerge when the reward model disappears?
- Do legitimate task signals exploit the same position and framing vulnerabilities as attacks?
- Do few-shot examples improve in-context learning or add noise?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Do widely-repeated prompting heuristics like politeness actually improve accuracy?
- How much does sliding-window augmentation improve single-session modeling?
Related concepts in this collection 4
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  the architectural root that consistency training counteracts
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  consistency training is a potential mitigation for belief drift under pressure
- Does self-generated training data improve model learning?
  Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.
  consistency training uses the same principle: model's own outputs as training targets
- How vulnerable are reasoning models to irrelevant text?
  Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.
  adversarial triggers exploit exactly the perturbation sensitivity that consistency training targets; ACT's activation-level invariance may provide defense against irrelevant text attacks by enforcing that appended triggers produce the same internal representations as clean prompts
Related papers in this collection 7
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Consistency Training Helps Stop Sycophancy and Jailbreaks
- Post-training makes large language models less human-like
- Spurious Forgetting in Continual Learning of Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- A Survey on Prompt Tuning
- Natural Emergent Misalignment From Reward Hacking In Production RL
- Why Do Some Language Models Fake Alignment While Others Don't?
Original note title
consistency training teaches models prompt-perturbation invariance using their own clean responses as targets — avoiding SFT staleness