Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This note explores whether AI grading systems actually evaluate relevance to the question or only response-level patterns.

Synthesis note · 2026-02-22 · sourced from Reward Models

Post angle for Medium — the evaluation infrastructure behind AI alignment has a fundamental flaw

The hook: Your AI's grading system is ignoring the question. When researchers swapped prompts while keeping responses the same, reward model preference scores barely changed. The system that's supposed to ensure AI gives good answers to your questions is actually just evaluating whether the response sounds good — regardless of what was asked.
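A prompt-swap check of this kind can be sketched in a few lines. This is a minimal illustration, not the protocol the researchers used: the `RewardFn` interface, the rotation-based swap, and both toy scorers are assumptions made for the example. A score that ignores the prompt has zero sensitivity to the swap by construction.

```python
from typing import Callable, Sequence

# Hypothetical interface: any callable mapping (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def prompt_swap_sensitivity(
    reward: RewardFn, prompts: Sequence[str], responses: Sequence[str]
) -> float:
    """Mean absolute change in score when each response is re-scored
    under the next prompt in the list instead of its own prompt."""
    n = len(prompts)
    if n < 2 or n != len(responses):
        raise ValueError("need at least two aligned (prompt, response) pairs")
    total = 0.0
    for i in range(n):
        matched = reward(prompts[i], responses[i])
        swapped = reward(prompts[(i + 1) % n], responses[i])  # mismatched prompt
        total += abs(matched - swapped)
    return total / n


def response_only(prompt: str, response: str) -> float:
    """Toy scorer that never looks at the prompt (word count only)."""
    return float(len(response.split()))


def prompt_aware(prompt: str, response: str) -> float:
    """Toy scorer that counts words shared by prompt and response."""
    return float(len(set(prompt.lower().split()) & set(response.lower().split())))


prompts = ["capital of france", "boil an egg"]
responses = ["the capital of france is paris", "boil the egg for seven minutes"]

blind = prompt_swap_sensitivity(response_only, prompts, responses)
aware = prompt_swap_sensitivity(prompt_aware, prompts, responses)
print(f"response-only scorer: {blind:.2f}, prompt-aware scorer: {aware:.2f}")
```

A real reward model would replace the toy scorers; a sensitivity near zero on mismatched prompts is the symptom described above.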

The mechanism: As the linked note "Do reward models actually consider what the prompt asks?" lays out, standard Bradley-Terry training lets reward models learn to distinguish good from bad responses without ever checking whether the response matches the prompt. Responses dominate the reward signal. This means RLHF, the dominant approach to making AI helpful and safe, is optimizing against phantom quality signals.
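To see why the objective permits this, recall that the Bradley-Terry loss for one preference pair is -log sigmoid(r(x, y_chosen) - r(x, y_rejected)). Only the difference between two scores under the same prompt enters the loss, so nothing in it forces the score to depend on the prompt. The sketch below is a toy illustration under stated assumptions: the three-field pairs and the word-count scorer are invented for the example and stand in for any response-level pattern that happens to correlate with the preference labels.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).
    Written as softplus(-margin) in a numerically stable form."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Invented (prompt, chosen, rejected) triples for illustration only.
pairs = [
    ("What is 2 + 2?", "It is 4, because adding 2 and 2 gives 4.", "4"),
    ("Name a primary color.", "Red is one of the three primary colors.", "Red"),
]


def response_only_score(prompt: str, response: str) -> float:
    """Hypothetical scorer that ignores the prompt argument entirely."""
    return 0.5 * len(response.split())


mean_loss = sum(
    bt_loss(response_only_score(p, c), response_only_score(p, r))
    for p, c, r in pairs
) / len(pairs)

chance_loss = bt_loss(0.0, 0.0)  # loss of a scorer with no preference, log(2)
print(f"prompt-blind scorer loss: {mean_loss:.3f}  vs chance: {chance_loss:.3f}")
```

On these toy pairs the prompt-blind scorer beats chance, which is all the training objective asks of it.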

The four biases it enables: As argued in the linked note "Can counterfactual invariance eliminate reward hacking biases?", prompt-insensitivity creates an opening for four distinct biases: length bias (longer = better), sycophancy (agreement = better), concept shortcuts, and demographic discrimination. All four stem from spurious correlations that the model treats as genuine quality signals because it isn't checking whether the response actually addresses the prompt.
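Length bias is the easiest of the four to probe. One rough check, sketched here under assumptions, is to append content-free filler to a response and see whether the score rises. The `FILLER` sentence, the `RewardFn` interface, and both toy scorers are hypothetical; a real probe would need filler vetted to add no information.

```python
import re
from typing import Callable

# Hypothetical interface: any callable mapping (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]

# Assumed to carry no new information about any prompt used below.
FILLER = " To elaborate further, this point is worth restating in more detail."


def words(text: str) -> list[str]:
    """Lowercased alphabetic tokens, punctuation stripped."""
    return re.findall(r"[a-z]+", text.lower())


def length_bias_gap(
    reward: RewardFn, prompt: str, response: str, repeats: int = 3
) -> float:
    """Score change caused by appending filler to the response.
    A positive gap suggests the scorer rewards length as such."""
    padded = response + FILLER * repeats
    return reward(prompt, padded) - reward(prompt, response)


def length_scorer(prompt: str, response: str) -> float:
    """Toy scorer with a built-in length bias."""
    return float(len(words(response)))


def overlap_scorer(prompt: str, response: str) -> float:
    """Toy scorer counting distinct prompt words that appear in the response."""
    return float(len(set(words(prompt)) & set(words(response))))


prompt = "capital of France"
response = "Paris is the capital of France."

biased_gap = length_bias_gap(length_scorer, prompt, response)
neutral_gap = length_bias_gap(overlap_scorer, prompt, response)
print(f"length scorer gap: {biased_gap}, overlap scorer gap: {neutral_gap}")
```

The same pattern (perturb one feature that should not matter, then compare scores) extends to the other three biases, with a different perturbation for each.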

Three converging fixes from independent teams:

Decompose the reward: split it into prompt-free and prompt-related components, then prioritize training on samples where the prompt matters
Apply counterfactual invariance: ensure rewards stay constant when irrelevant features change
Let the evaluator think: per the linked note "Can reward models benefit from reasoning before scoring?", three teams independently found that reward modeling is a reasoning task; chain-of-thought (CoT) reasoning before scoring enables adaptive evaluation
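The first fix can be made concrete with a rough sketch. The estimator below is an assumption chosen for simplicity, not necessarily the decomposition the teams used: it treats the average score of a response under unrelated prompts as the prompt-free component and the remainder as the prompt-related component. `decompose`, `prioritize`, and the toy reward are hypothetical names introduced for this example.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical interface: any callable mapping (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]
Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def decompose(
    reward: RewardFn, prompt: str, response: str, other_prompts: Sequence[str]
) -> Tuple[float, float]:
    """Return (prompt_free, prompt_related) with
    prompt_free + prompt_related == reward(prompt, response)."""
    prompt_free = mean(reward(p, response) for p in other_prompts)
    prompt_related = reward(prompt, response) - prompt_free
    return prompt_free, prompt_related


def prioritize(
    reward: RewardFn, pairs: Sequence[Pair], other_prompts: Sequence[str]
) -> list[Pair]:
    """Order preference pairs so those whose chosen-vs-rejected margin
    is most prompt-related come first."""
    def prompt_margin(pair: Pair) -> float:
        prompt, chosen, rejected = pair
        _, c = decompose(reward, prompt, chosen, other_prompts)
        _, r = decompose(reward, prompt, rejected, other_prompts)
        return abs(c - r)
    return sorted(pairs, key=prompt_margin, reverse=True)


def toy_reward(prompt: str, response: str) -> float:
    """Toy scorer: prompt-word overlap plus a small length bonus."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.1 * len(response.split())


unrelated = ["how do i boil an egg"]
pairs = [
    ("say something",
     "here is a very long and detailed reply with many extra words", "ok"),
    ("capital of france",
     "paris is the capital of france", "bananas are yellow and sweet fruit"),
]

ranked = prioritize(toy_reward, pairs, unrelated)
print([p[0] for p in ranked])
```

In the toy data the first pair is separated only by length, so it ranks last; the second pair is separated by relevance to the prompt, so it ranks first.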

The broader frame: The bottleneck on AI improvement isn't just model capability — it's evaluator capability. The system we use to tell AI what's good has been quietly ignoring half the input. Fixing this requires treating reward modeling not as a classification task but as a reasoning task.
