Can evaluation trajectories and interaction histories replace single-answer scoring?

This explores whether richer evaluation signals — reasoning traces, multi-turn feedback, decomposed attributes, interaction histories — can replace the single scalar score we usually use to judge an answer; the corpus says they substantially can, and shows several places where the single number actively hides what's going on.

The starting point is that a single final-answer score is dangerously lossy. The clearest demonstration: supervised fine-tuning can raise benchmark accuracy while *cutting* the quality of the reasoning steps, measured as Information Gain, by 38.9%, because models learn to reach the right answer through post-hoc rationalization rather than genuine inference, and a metric that only checks the final answer never notices (see "Does supervised fine-tuning improve reasoning or just answers?"). The same blind spot lets imitation models fool human evaluators by copying a confident, fluent style while closing no real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Single-answer scoring measures the surface and misses the substance.

The most direct replacement is to make the *evaluator itself* reason before it scores. Three independent teams (RRM, RM-R1, DeepSeek-GRM) converged on adding chain-of-thought traces ahead of reward scoring, which lets evaluation scale with test-time compute and pushes the capability ceiling of reward models past what outcome-only scoring can reach (see "Can reward models benefit from reasoning before scoring?"). Why this matters is shown by the Critique-GRPO work: numerical rewards simply lack the information about *why* a solution failed, and when models stuck on a performance plateau are handed natural-language critiques instead of a scalar, they start producing correct solutions again (see "Can natural language feedback overcome numerical reward plateaus?"). A trajectory of critique carries signal a single number structurally cannot.
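To make the information gap concrete, here is a minimal toy sketch in Python. It is not the method of Critique-GRPO or of any reward-reasoning model; the function names, the three-step arithmetic task, and the assumption that a reference list of intermediate values exists are all hypothetical and chosen only for illustration. It shows two failed attempts that receive the same scalar reward but very different critiques.

```python
def scalar_reward(final_answer: int, expected: int) -> float:
    """Outcome-only score: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer == expected else 0.0


def critique(steps: list[int], expected_steps: list[int]) -> list[str]:
    """Toy natural-language feedback: name every step that deviates.

    Assumes (hypothetically) that reference intermediate values are known.
    """
    notes = []
    for i, (got, want) in enumerate(zip(steps, expected_steps), start=1):
        if got != want:
            notes.append(f"step {i}: got {got}, expected {want}")
    return notes


# Hypothetical task: 3 * 4 = 12, then * 2 = 24, then + 5 = 29.
expected_steps = [12, 24, 29]
attempt_a = [12, 24, 30]  # slips only on the last step
attempt_b = [7, 14, 19]   # wrong from the first step onward

for name, attempt in (("a", attempt_a), ("b", attempt_b)):
    reward = scalar_reward(attempt[-1], expected_steps[-1])
    print(name, reward, critique(attempt, expected_steps))
```

Both attempts score 0.0, so a scalar-only learner cannot tell a near miss from a derailed solution; the critique list distinguishes them and says where to look.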

Another thread argues the score should be *decomposed* rather than collapsed. Training models to ask good clarifying questions works far better when 'quality' is broken into theory-grounded attributes (clarity, relevance, specificity) than when it is optimized against one combined score, especially in high-stakes domains like clinical reasoning (see "Can models learn to ask genuinely useful clarifying questions?"). Multi-agent evaluation extends this laterally: instead of one judge emitting one number, stakeholder personas extracted from real documents debate across structured phases, producing reproducible judgments that transfer across tasks (see "Can personas extracted from documents generalize across evaluation tasks?"). Evaluation becomes a process with structure, not a point estimate.
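A small sketch can show what collapsing deletes. The Python below is an illustration under stated assumptions, not the ALFA training pipeline: the `AttributeScores` class, the 1 to 5 rating scale, the unweighted mean, and the example ratings are all hypothetical. Two candidate clarifying questions tie on the collapsed score, yet they yield clear per-attribute preferences.

```python
from dataclasses import dataclass

ATTRIBUTES = ("clarity", "relevance", "specificity")


@dataclass(frozen=True)
class AttributeScores:
    """Hypothetical 1-5 ratings for one candidate clarifying question."""
    clarity: int
    relevance: int
    specificity: int

    def collapsed(self) -> float:
        """Single combined score: an unweighted mean (an assumption)."""
        return (self.clarity + self.relevance + self.specificity) / 3


def preference_by_attribute(a: AttributeScores, b: AttributeScores) -> dict[str, str]:
    """Return which candidate wins on each attribute: 'a', 'b', or 'tie'."""
    result = {}
    for name in ATTRIBUTES:
        x, y = getattr(a, name), getattr(b, name)
        result[name] = "a" if x > y else "b" if y > x else "tie"
    return result


question_a = AttributeScores(clarity=5, relevance=2, specificity=3)
question_b = AttributeScores(clarity=2, relevance=5, specificity=3)

print(question_a.collapsed() == question_b.collapsed())
print(preference_by_attribute(question_a, question_b))
```

With the collapsed score the pair gives no usable preference signal, while the attribute view produces one preference pair for clarity and an opposite one for relevance, which is the kind of signal attribute-specific optimization can train on.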

On the 'interaction histories' half of the question, the corpus offers a useful complication. You might assume more history is always better, but for personalization the opposite holds: abstract preference summaries (semantic memory) consistently beat replaying specific past interactions (episodic memory) (see "Does abstract preference knowledge outperform specific interaction recall?"). So the win isn't raw trajectory data; it's the *distilled* signal a trajectory lets you compute. Relatedly, models can be trained to internalize self-evaluation, learning during training to compute their own reward rather than deferring to an external scorer, at zero inference cost (see "Can models learn to evaluate their own work during training?").

The synthesis: 'replace' is the wrong frame, but 'subsume' is right. Single-answer scoring survives as a cheap final check, yet across reward modeling, RL feedback, question-asking, and judging, the trajectory-and-process approaches don't just add accuracy. They recover information the scalar deletes, and they catch failures (rationalization, style-mimicry, plateau-stalling) that a single number is structurally unable to see. One less obvious takeaway: sometimes the richest signal comes not from keeping the whole history but from knowing which part of it to abstract away.

Sources (8 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-evaluating whether evaluation trajectories and interaction histories can replace single-answer scoring in LLM systems. A curated library (2022–2025) found:

• Supervised fine-tuning raises benchmark accuracy while degrading reasoning quality by ~39%, because single-answer metrics miss post-hoc rationalization (~2023).
• Chain-of-thought reasoning in reward models scales test-time compute and pushes capability ceilings past outcome-only scoring (~2025).
• Natural-language critiques (not scalars) unblock RL performance plateaus by providing *why* solutions fail (~2025).
• Decomposing quality into theory-grounded attributes (clarity, relevance, specificity) outperforms collapsed scores, especially in high-stakes domains (~2025).
• Semantic memory abstractions beat episodic replay for personalization; the win is *distilled* signal, not raw trajectory data (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.15717 (The False Promise of Imitating Proprietary LLMs, 2023)
• arXiv:2505.14674 (Reward Reasoning Model, 2025)
• arXiv:2502.14860 (Aligning LLMs to Ask Good Questions, 2025)
• arXiv:2507.21028 (Multi-Agent-as-Judge, 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4), methods (process reward models, iterative refinement), tooling (evaluation SDKs), or orchestration (agentic evaluation loops) have since relaxed or overturned it. Separate the durable question (can trajectories subsume scalars?) from perishable claims (specific accuracy gaps, reasoning degradation percentages). Where a constraint still holds, say so plainly; where it's dissolved, cite what dissolved it.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Does newer evidence suggest single-answer scoring is sufficient under certain regime conditions, or does trajectory-based evaluation remain frontier?
(3) Propose 2 research questions that assume the regime may have moved—e.g., do sufficiently capable reward reasoners make the scalar-vs.-trajectory distinction moot? Can interaction histories be compressed below semantic abstraction while retaining evaluative signal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
