Why do reward models ignore what question was asked?
Reward models score responses based on quality signals that persist even when prompts change. This note explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.
Post angle for Medium — the evaluation infrastructure behind AI alignment has a fundamental flaw
The hook: Your AI's grading system is ignoring the question. When researchers swapped prompts while keeping responses the same, reward model preference scores barely changed. The system that's supposed to ensure AI gives good answers to your questions is actually just evaluating whether the response sounds good — regardless of what was asked.
The mechanism: As the note "Do reward models actually consider what the prompt asks?" argues, standard Bradley-Terry training lets reward models learn to distinguish good from bad responses without ever needing to check whether the response matches the prompt. Response-level features dominate the reward signal. This means RLHF, the dominant approach to making AI helpful and safe, is optimizing against phantom quality signals.
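A minimal sketch of why this can happen, under toy assumptions: the Bradley-Terry loss depends only on the reward margin between the chosen and rejected response, so a scorer that never reads the prompt can still fit preference pairs. The names `response_only_reward` and `prompt_swap_gap` are hypothetical, introduced here for illustration, and the word-count scorer stands in for a real reward model.

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

def response_only_reward(prompt: str, response: str) -> float:
    """Hypothetical toy scorer that ignores the prompt entirely (word count / 10)."""
    return len(response.split()) / 10.0

def prompt_swap_gap(reward_fn, prompt, swapped_prompt, chosen, rejected) -> float:
    """Absolute change in the preference margin when only the prompt is swapped."""
    original = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
    swapped = reward_fn(swapped_prompt, chosen) - reward_fn(swapped_prompt, rejected)
    return abs(original - swapped)

prompt = "What is the capital of France?"
unrelated_prompt = "How do I sort a list in Python?"
chosen = "Paris is the capital of France, and it sits on the Seine."
rejected = "Not sure."

# The prompt-blind scorer still beats a coin flip (loss below log 2) on this pair.
loss = bt_loss(response_only_reward(prompt, chosen),
               response_only_reward(prompt, rejected))

# A gap of zero means the preference is fully insensitive to the prompt.
gap = prompt_swap_gap(response_only_reward, prompt, unrelated_prompt, chosen, rejected)
```

Running the same `prompt_swap_gap` check against a real reward model would be one way to approximate the prompt-swap experiment described in the hook; a gap near zero across many pairs suggests the model is scoring responses, not answers.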
The four biases it enables: As the note "Can counterfactual invariance eliminate reward hacking biases?" describes, prompt-insensitivity creates an opening for four distinct biases: length bias (longer = better), sycophancy (agreement = better), concept shortcuts, and demographic discrimination. All stem from spurious correlations that the model treats as genuine quality signals because it isn't checking whether the response actually addresses the prompt.
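Length bias is the easiest of the four to probe. The sketch below, a toy illustration and not a method from any of the cited work, appends content-free filler to a response and measures how much the reward moves. `length_biased_reward` and `padding_sensitivity` are hypothetical names, and the scorer is deliberately biased so the effect is visible.

```python
def length_biased_reward(prompt: str, response: str) -> float:
    """Hypothetical biased scorer: reward grows with word count, prompt unused."""
    return 0.1 * len(response.split())

def padding_sensitivity(reward_fn, prompt: str, response: str,
                        filler: str, repeats: int = 3) -> float:
    """Reward change after appending content-free filler to the response."""
    padded = response + (" " + filler) * repeats
    return reward_fn(prompt, padded) - reward_fn(prompt, response)

delta = padding_sensitivity(
    length_biased_reward,
    prompt="What is 2 + 2?",
    response="The answer is 4.",
    filler="I hope this helps!",
)
# delta > 0 here: the padded answer scores higher despite adding no information.
```

A reward that respected counterfactual invariance with respect to padding would return a delta close to zero for this probe.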
Three converging fixes from independent teams:
- Decompose the reward: split it into prompt-free and prompt-related components, then prioritize training on samples where the prompt matters
- Apply counterfactual invariance: ensure rewards stay constant when features irrelevant to answer quality change
- Let the evaluator think: as the note "Can reward models benefit from reasoning before scoring?" describes, three teams independently found that reward modeling is a reasoning task; chain-of-thought (CoT) reasoning before scoring enables adaptive evaluation
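The first fix in the list can be illustrated with a crude approximation. This is an assumption made for the example, not the decomposition used in the related papers: treat the prompt-free part of a reward as its average over a pool of unrelated prompts, and the prompt-related part as the residual. `toy_reward`, `prompt_free_part`, and `prompt_related_part` are hypothetical names.

```python
from statistics import mean

def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical scorer: a length term plus a word-overlap relevance term."""
    p_words = set(prompt.lower().split())
    r_words = set(response.lower().split())
    return 0.1 * len(response.split()) + len(p_words & r_words)

def prompt_free_part(reward_fn, response: str, prompt_pool) -> float:
    """Average reward of the response across unrelated prompts."""
    return mean(reward_fn(p, response) for p in prompt_pool)

def prompt_related_part(reward_fn, prompt: str, response: str, prompt_pool) -> float:
    """Residual reward attributable to this specific prompt."""
    return reward_fn(prompt, response) - prompt_free_part(reward_fn, response, prompt_pool)

pool = ["how do i bake bread", "explain quantum tunneling", "best hiking boots"]
prompt = "what is the capital of france"
on_topic = "the capital of france is paris"
off_topic = "bread needs flour water and yeast"

rel_on = prompt_related_part(toy_reward, prompt, on_topic, pool)
rel_off = prompt_related_part(toy_reward, prompt, off_topic, pool)
# rel_on is large and rel_off is not, even though both responses have equal length.
```

Under this approximation, "prioritize samples where the prompt matters" would mean weighting preference pairs by how much their margin comes from the prompt-related part and not the prompt-free part.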
The broader frame: The bottleneck on AI improvement isn't just model capability — it's evaluator capability. The system we use to tell AI what's good has been quietly ignoring half the input. Fixing this requires treating reward modeling not as a classification task but as a reasoning task.
Inquiring lines that use this note as a source
- Why do reward models fail when they ignore the prompt context?
- What four distinct biases emerge when reward models ignore the prompt?
- Why do different models respond differently to spurious rewards?
- Why do spurious rewards work for some models but not others?
- Why do reward models fail to recognize genuinely different valid answers?
- What happens when variance in reward signals comes from a noisy model?
Related concepts in this collection
- Do reward models actually consider what the prompt asks? (the core mechanism): Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
- Can counterfactual invariance eliminate reward hacking biases? (the bias taxonomy): Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.
- Can reward models benefit from reasoning before scoring? (the reasoning-based alternative): Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
- Can LLM judges be fooled by fake credentials and formatting? (evaluation infrastructure is broadly vulnerable): Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
- Can LLM explanations actually help humans predict model behavior? (RLHF optimizes the wrong signal in explanations too): Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Information-Theoretic Reward Decomposition for Generalizable RLHF
- Checklists Are Better Than Reward Models For Aligning Language Models
- ARGS: Alignment as Reward-Guided Search
- Reward Reasoning Model
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- RM-R1: Reward Modeling as Reasoning
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Original note title
the reward models blind spot — why your ais grading system ignores the question