INQUIRING LINE

What is the gap between benchmark performance and real workplace task completion?

This explores why models that ace benchmarks often stumble on real workplace tasks — and what the corpus says about where, exactly, that gap opens up.


A high benchmark score often fails to translate into reliable real-world task completion. The short version from the corpus: benchmarks tend to measure a single, clean, short slice of behavior, while real work is long, messy, multi-dimensional, and collaborative, and the methods used to build benchmarks quietly select for the wrong things.

The most direct evidence is about time and length. Short-interaction benchmarks simply don't predict how a model behaves once you hand it a long, delegated job: in Do short benchmarks predict how models perform over long workflows?, DELEGATE-52 ran models through 50-round-trip relays, and models that ranked similarly on single-turn tasks diverged sharply by relay 25, revealing degradation curves that standard tests never see. Search agents show the same pattern from the user's side. They post strong scores yet leave people unsatisfied, because the benchmarks use over-specified queries, single-turn interactions, and fixed schemas that don't resemble how anyone actually searches (Why do search agents fail users despite strong benchmark scores?). Real tasks are conversations and refinements, not one-shot lookups.
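One way to see why near-identical single-turn scores can hide very different long-run behavior is a toy compounding model. The sketch below is an illustration only, not how DELEGATE-52 measures anything: it assumes each relay step succeeds independently with a fixed probability, and the two accuracy values (0.99 and 0.95) are made up for the example.

```python
def survival_curve(per_step_accuracy: float, steps: int) -> list[float]:
    """Probability that every step up to k was correct, for k = 1..steps.

    Assumption (not from the corpus): errors are independent and the
    per-step accuracy never changes, so reliability decays as p ** k.
    """
    return [per_step_accuracy ** k for k in range(1, steps + 1)]


# Hypothetical models that look close on a single step.
model_a = survival_curve(0.99, 50)
model_b = survival_curve(0.95, 50)

for step in (1, 25, 50):
    a, b = model_a[step - 1], model_b[step - 1]
    print(f"step {step:2d}: A={a:.3f}  B={b:.3f}  gap={a - b:.3f}")
```

Under these assumptions a four-point gap at step 1 grows to roughly fifty points by step 25. Real degradation need not follow this curve, but the arithmetic shows why a one-step test is a weak predictor of a fifty-step job.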

A second thread says the gap is dimensional, not just temporal. A single score collapses behaviors that come apart in deployment: Does a single benchmark score actually predict agent readiness? argues capability is really a vector — task success, privacy compliance, long-horizon memory, mode-shifting, ecosystem readiness — and the model that tops one axis often sinks on another. What should we actually measure in agent evaluation? pushes the same point: you have to measure trajectory quality, memory hygiene, and verification cost, or you manufacture false confidence in 'readiness.' The catch, flagged in Do interactive evaluations actually solve the benchmark comparison problem?, is that moving to richer interactive evaluation doesn't dissolve the problem — comparability and reproducibility just reappear in higher-dimensional form, demanding shared protocols rather than a new format.
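To make the "capability is a vector" point concrete, here is a minimal sketch that compares two agents per axis instead of by one number. The axis names follow the note above; the agent names and every score are invented for illustration and do not come from any benchmark.

```python
AXES = [
    "task_success",
    "privacy_compliance",
    "long_horizon_memory",
    "mode_shifting",
    "ecosystem_readiness",
]

# Hypothetical scores in [0, 1], one value per axis in AXES order.
scores = {
    "agent_x": [0.92, 0.40, 0.55, 0.70, 0.60],
    "agent_y": [0.80, 0.85, 0.75, 0.72, 0.70],
}


def leader_per_axis(table: dict[str, list[float]]) -> dict[str, str]:
    """Return, for each axis, the name of the agent with the highest score."""
    return {
        axis: max(table, key=lambda name: table[name][i])
        for i, axis in enumerate(AXES)
    }


def weakest_axis(table: dict[str, list[float]], name: str) -> tuple[str, float]:
    """Return the axis on which the named agent scores lowest, with that score."""
    values = table[name]
    i = min(range(len(values)), key=values.__getitem__)
    return AXES[i], values[i]


print(leader_per_axis(scores))
print(weakest_axis(scores, "agent_x"))
```

In this made-up table, agent_x wins a task-success leaderboard while its weakest axis is privacy compliance. Reporting the per-axis leaders and each agent's floor keeps that trade-off visible where a single headline score would hide it.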

The most unsettling thread is that some benchmark gains aren't real to begin with. On contaminated math sets, apparent reasoning improvements turn out to be memorization: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on the clean, post-release LiveMathBench (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). Relatedly, Does instruction tuning teach task understanding or output format? found that models trained on semantically empty or even wrong instructions perform almost as well as those trained on correct ones. What transfers is knowledge of the output shape, not understanding of the task. And Should reasoning benchmarks score final answers or reasoning traces? makes the case for grading outcomes: LR²Bench scores only final answers against deterministic ground truth, which exposes a 20% ceiling that trace-based scoring would inflate by rewarding the *style* of reasoning, not the substance.
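The partial-prompt reconstruction idea can be sketched as a simple probe: show a model the first part of a benchmark problem and check whether it reproduces the rest verbatim. The note does not specify the exact protocol or matching rule, so everything below is an assumption: the 60% prefix split, the exact-match criterion, the `complete` callable standing in for a model, and the toy "memorizer" used for the demo.

```python
from typing import Callable, Iterable


def split_prompt(text: str, prefix_fraction: float) -> tuple[str, str]:
    """Split a problem into a visible prefix and a hidden remainder, by words."""
    words = text.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical model interface
    prefix_fraction: float = 0.6,    # assumed split, not from the source
) -> float:
    """Fraction of problems whose hidden remainder is reproduced exactly."""
    items = list(problems)
    hits = 0
    for text in items:
        prefix, remainder = split_prompt(text, prefix_fraction)
        if remainder and complete(prefix).split() == remainder.split():
            hits += 1
    return hits / len(items) if items else 0.0


def make_memorizer(training_texts: list[str], prefix_fraction: float = 0.6):
    """Toy stand-in for a contaminated model: a lookup table of seen problems."""
    table = dict(split_prompt(t, prefix_fraction) for t in training_texts)
    return lambda prefix: table.get(prefix, "")


seen = [
    "Find the sum of all positive integers n such that n squared is below fifty",
    "Compute the remainder when seven to the power of ten is divided by eleven",
]
unseen = [
    "Determine how many lattice points lie strictly inside the given triangle region",
]

model = make_memorizer(seen)
print(reconstruction_rate(seen, model), reconstruction_rate(unseen, model))
```

A high rate on an older benchmark paired with a near-zero rate on problems released after training is the pattern the note describes. A real probe would likely need fuzzy matching and a proper model call in place of the lookup table.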

Put together, the corpus reframes the gap: it's not that models 'underperform' in the workplace, it's that benchmarks systematically measure the easy, observable proxy (short, single-axis, format-matching, sometimes memorized) instead of the hard target (sustained, multi-dimensional, collaborative, genuinely understood). The thing you didn't know you wanted to know: a model can look better than a competitor on every public leaderboard and still be the worse hire, because the leaderboard never tested the relay race.


Sources (8 notes)

Do short benchmarks predict how models perform over long workflows?

DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.

Why do search agents fail users despite strong benchmark scores?

Search benchmarks use over-specified queries, single-turn interactions, and fixed schemas—none of which match real search. These design choices make benchmarks measure retrieval, not collaborative intent refinement, explaining why high scores don't predict user satisfaction.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Next inquiring lines