SYNTHESIS NOTE
Agentic Systems and Tool Use · Training, RL, and Test-Time Scaling

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Synthesis note · 2026-05-03 · sourced from Tool Computer Use

The TTI paper makes a precise argument about test-time scaling: chain-of-thought (CoT) scaling and interaction scaling are orthogonal axes, and conflating them misses what agentic tasks actually need. CoT scaling increases per-step compute by generating long reasoning traces before acting. This deepens reasoning but provides no new information from the environment. In partially observable agentic tasks, deeper reasoning about a wrongly bounded set of options does not help: in a hotel-booking task, for instance, the model still cannot see hotels it has not browsed.

Interaction scaling instead increases the number of interaction steps the agent takes. This enables behaviors that CoT cannot produce: exploration (browse multiple options before committing), backtracking (retreat from a bad path), and dynamic re-planning (revise the plan based on what the environment revealed). Information gain through environment interaction is unique to agentic tasks with partial observability, and it requires interaction, not larger per-step compute.
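The contrast between the two axes can be sketched with a toy partially observable task. Everything here is hypothetical and for illustration only: the `Hotel` records, the `run_episode` function, and the step accounting are assumptions, not part of TTI. The sketch covers exploration only, not backtracking or re-planning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    name: str
    price: int


# Hypothetical toy environment: a price stays hidden until the hotel page is opened.
HOTELS = [Hotel("A", 180), Hotel("B", 140), Hotel("C", 95), Hotel("D", 120)]


def run_episode(horizon: int, reasoning_tokens: int = 0) -> Hotel:
    """Browse hotels in listing order, then book the cheapest one seen.

    Each browse costs one interaction step and the final booking costs one,
    so an agent with `horizon` steps can open at most `horizon - 1` pages.
    `reasoning_tokens` is deliberately unused: per-step thinking cannot
    reveal a page the agent never opened.
    """
    if horizon < 2:
        raise ValueError("need at least one browse step and one booking step")
    seen = HOTELS[: horizon - 1]
    return min(seen, key=lambda hotel: hotel.price)


if __name__ == "__main__":
    for steps in (2, 3, 4, 5):
        booked = run_episode(steps, reasoning_tokens=10_000)
        print(steps, booked.name, booked.price)
```

In this toy setup, raising `reasoning_tokens` never changes the outcome, while raising `horizon` widens the set of options the agent can choose from.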

Empirically, the claim is supported on two fronts. First, even pure prompting-based interaction scaling, with no training, improves task success on web benchmarks non-trivially. Second, with training, TTI uses curriculum-based online RL that adaptively adjusts rollout lengths, producing state-of-the-art open-source, open-data web agents on WebVoyager and WebArena from a Gemma 3 12B model. The curriculum matters because TTI shows agents learn to balance exploration and exploitation adaptively: long rollouts when information gathering pays, short rollouts when the next action is clear.
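One simple way to express a curriculum over rollout length is a step cap that grows during training. The sketch below is an assumption for illustration: the staged multiplicative form, the function name `horizon_schedule`, and every constant are hypothetical and are not taken from the TTI paper.

```python
def horizon_schedule(
    iteration: int,
    h_min: int = 10,
    h_max: int = 30,
    growth: int = 2,
    every: int = 5,
) -> int:
    """Hypothetical curriculum over the maximum rollout length.

    The cap starts at `h_min`, is multiplied by `growth` once every
    `every` training iterations, and never exceeds `h_max`.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    stage = iteration // every
    return min(h_max, h_min * growth**stage)


if __name__ == "__main__":
    for it in (0, 4, 5, 10):
        print(it, horizon_schedule(it))
```

Starting with short rollouts and lengthening them later is only one possible design; the cap bounds how long an episode may run, while the learned policy still decides when to stop earlier.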

The reframe for the field: test-time scaling is multi-dimensional. CoT and interaction scaling are complementary, not substitutes. Agents that ship with deep reasoning per step but no learned policy for when to keep interacting are leaving capability on the table, and on tasks where exploration matters, the interaction axis dominates. This connects to "How should we balance parallel versus sequential compute at test time?" by adding a third axis: parallel sampling, sequential per-step depth, and interaction horizon are three orthogonal dimensions of inference budget. It also generalizes "Does search budget scale like reasoning tokens for answer quality?": search budget is the deep-research instance of interaction scaling.
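The three-axis view of inference budget can be made concrete with a small record type. This is a simplified sketch under stated assumptions: the `InferenceBudget` class and its accounting are hypothetical, each interaction step is assumed to return exactly one observation, and tokens spent reading observations are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceBudget:
    parallel_samples: int   # independent rollouts sampled for the same task
    tokens_per_step: int    # reasoning tokens generated before each action
    interaction_steps: int  # environment actions allowed per rollout

    def total_tokens(self) -> int:
        """Upper bound on generated reasoning tokens across all rollouts."""
        return self.parallel_samples * self.tokens_per_step * self.interaction_steps

    def observations(self) -> int:
        """Upper bound on environment observations gathered across all rollouts."""
        return self.parallel_samples * self.interaction_steps


if __name__ == "__main__":
    deep = InferenceBudget(parallel_samples=1, tokens_per_step=4000, interaction_steps=5)
    wide = InferenceBudget(parallel_samples=1, tokens_per_step=500, interaction_steps=40)
    print(deep.total_tokens(), deep.observations())
    print(wide.total_tokens(), wide.observations())
```

Under this accounting, the two configurations spend the same number of generated tokens but differ eightfold in how much of the environment they can observe, which is the sense in which the axes are separate.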

Original note title

test-time interaction scaling is a distinct dimension from chain-of-thought — increasing the agent's interaction horizon enables exploration backtracking and dynamic re-planning that deeper reasoning cannot