Scaling Laws for Agent Harnesses via Effective Feedback Compute

Paper · arXiv 2605.29682 · Published May 28, 2026
Test-Time Compute

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure (tokens, tool calls, operations, wall time, or cost), which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation (R² = 0.33 and 0.42, respectively) and SAS reaches 0.88, while Oracle-EFC and Estimated-EFC reach 0.94 and Oracle-EFC/D_task reaches 0.99.

Introduction. As language models move from single-turn prediction to interactive problem solving, performance increasingly depends on the agent harness around the base model. A harness determines how the model calls tools, receives feedback, stores memory, verifies intermediate results, repairs errors, and decides when to stop. This makes harness design a central form of test-time scaling: instead of only making the base model larger, one can spend additional inference-time computation to obtain and use evidence from the environment. However, unlike pretraining, where model size, data, and compute provide well-studied scaling coordinates, agent harnesses lack a clear scalar that predicts when additional test-time computation will improve performance. Raw expenditure alone is insufficient, because two trajectories with the same number of tokens or tool calls can differ sharply in whether their observations are useful, valid, non-redundant, and retained for later decisions.
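To make the contrast between raw expenditure and effective feedback concrete, the following is a minimal sketch, not the paper's estimator. It assumes each feedback event carries three boolean labels (informative, valid, retained) plus a key used to detect repeats, and it credits one unit per event that passes all four filters. The `FeedbackEvent` schema, the function names, the unit weighting, and the example traces are all hypothetical; the text above does not specify how EFC is actually computed or estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One observation returned to the harness (hypothetical schema)."""
    key: str           # identity of the information carried, used to detect repeats
    informative: bool  # does the observation bear on the task?
    valid: bool        # is the observation correct and trustworthy?
    retained: bool     # is it still available to later decisions?


def effective_feedback(trace: list[FeedbackEvent]) -> int:
    """Count events that are informative, valid, non-redundant, and retained.

    Assumption: each qualifying event earns one unit of credit. A key counts
    as redundant only if an earlier event with that key was already credited.
    """
    credited: set[str] = set()
    for event in trace:
        non_redundant = event.key not in credited
        if event.informative and event.valid and event.retained and non_redundant:
            credited.add(event.key)
    return len(credited)


def normalized_efc(trace: list[FeedbackEvent], task_demand: float) -> float:
    """Divide effective feedback by an assumed task-demand value (D_task)."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback(trace) / task_demand


# Two illustrative traces with the same raw budget: four tool calls each.
trace_a = [
    FeedbackEvent("test_fails", True, True, True),
    FeedbackEvent("stack_trace", True, True, True),
    FeedbackEvent("patch_applied", True, True, True),
    FeedbackEvent("test_passes", True, True, True),
]
trace_b = [
    FeedbackEvent("test_fails", True, True, True),
    FeedbackEvent("test_fails", True, True, True),     # repeat: redundant
    FeedbackEvent("stale_cache", True, False, True),   # invalid observation
    FeedbackEvent("stack_trace", True, True, False),   # dropped from memory
]

print(len(trace_a), len(trace_b))                            # 4 4
print(effective_feedback(trace_a), effective_feedback(trace_b))  # 4 1
print(normalized_efc(trace_b, task_demand=4.0))              # 0.25
```

Both traces spend four tool calls, so a raw-compute coordinate places them at the same point, while the sketch separates them because only one event in the second trace survives all four filters. Dividing by an assumed task demand puts tasks that need different amounts of feedback on a common scale.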

Discussion / Conclusion. This paper argues that the scaling behavior of agent harnesses is better explained by effective feedback than by raw test-time expenditure. We introduced Effective Feedback Compute (EFC), a trace-level coordinate that measures the amount of valid, relevant, non-redundant, and retained feedback available to a harness, together with task-demand normalization for comparing heterogeneous tasks. Across controlled simulations, executable code tasks, real mixed traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently outperform raw-compute baselines such as tokens, tool calls, operations, wall time, and cost, and also improve over a strong SAS baseline. The experiments further show that harness interventions primarily matter by changing how efficiently raw budget is converted into durable feedback: under matched raw budgets, improving feedback quality substantially increases success, while normalized EFC produces the strongest curve collapse across task difficulty.