Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Synthesis note · 2026-05-03 · sourced from Tool Computer Use

The TTI (test-time interaction) paper makes a precise argument about test-time scaling: chain-of-thought (CoT) scaling and interaction scaling are orthogonal axes, and conflating them misses what agentic tasks actually need. CoT scales per-step compute by generating long reasoning traces before acting. This deepens reasoning but provides no new information from the environment. In partially observable agentic tasks, deeper reasoning about a wrongly bounded set of options does not help: in a hotel-booking task, for instance, the model still cannot see hotels it has not browsed.

Interaction scaling instead increases the number of interaction steps the agent takes. This enables behaviors that CoT cannot produce: exploration (browse multiple options before committing), backtracking (retreat from a bad path), and dynamic re-planning (revise the plan based on what the environment revealed). Information gain through environment interaction is unique to agentic tasks with partial observability, and it requires interaction, not larger per-step compute.
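The information-gain argument can be illustrated with a toy simulation. This is a minimal sketch, not the paper's setup: the environment, the price range, and the function names `run_episode` and `success_rate` are all hypothetical. Each interaction step reveals one hidden price, and the agent commits to the cheapest option it has seen. Per-step reasoning never appears in the code, because no amount of it would change which prices are visible.

```python
import random

def run_episode(hidden_prices, max_steps, rng):
    """Browse up to max_steps options in random order, then commit to the cheapest one seen."""
    if max_steps < 1:
        raise ValueError("the agent needs at least one interaction step")
    order = list(range(len(hidden_prices)))
    rng.shuffle(order)
    seen = {}
    for idx in order[:max_steps]:
        seen[idx] = hidden_prices[idx]  # each step reveals exactly one hidden price
    best = min(seen, key=seen.get)
    return hidden_prices[best]

def success_rate(max_steps, n_options=10, trials=2000, seed=0):
    """Fraction of episodes in which the agent ends up with the globally cheapest price."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prices = [rng.randint(50, 500) for _ in range(n_options)]  # toy price range (assumption)
        wins += run_episode(prices, max_steps, rng) == min(prices)
    return wins / trials

if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        print(steps, success_rate(steps))
```

In this toy, success rises with the interaction budget and reaches 1.0 once the agent can browse every option. A real web agent would pay a cost per step, which is why a learned policy for when to stop matters.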

Empirically, the claim is supported on two fronts. First, even purely prompting-based interaction scaling, with no training, yields a non-trivial improvement in task success on web benchmarks. Second, with training, TTI uses curriculum-based online RL that adaptively adjusts rollout lengths, producing state-of-the-art open-source, open-data web agents on WebVoyager and WebArena from a Gemma 3 12B model. The curriculum matters because TTI shows that agents learn to balance exploration and exploitation adaptively: long rollouts when information gathering pays, short rollouts when the next action is clear.
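One way to picture a rollout-length curriculum is a staged cap on the interaction horizon that grows during training. The sketch below is an assumption made for illustration only: this note does not record the paper's actual schedule, and the function name `horizon_schedule`, the multiplicative growth rule, and every default value are hypothetical placeholders. A fixed staged schedule is also a simplification of the adaptive adjustment described above.

```python
def horizon_schedule(iteration, h_min=10, h_max=30, growth=2.0, every=100):
    """Maximum rollout length allowed at a given training iteration.

    Hypothetical staged schedule: start at h_min steps, multiply the cap by
    `growth` every `every` iterations, and never exceed h_max.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    stage = iteration // every
    return min(h_max, int(h_min * growth ** stage))

if __name__ == "__main__":
    for it in (0, 100, 200, 500):
        print(it, horizon_schedule(it))
```

Starting short and lengthening the cap lets the policy first learn to act when the next step is clear, and only later learn when extra exploration pays off. Within the cap, the agent still decides how many steps to actually use.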

The reframe for the field: test-time scaling is multi-dimensional. CoT and interaction scaling are complementary, not substitutes. Agents that ship with deep reasoning per step but no learned policy for when to keep interacting are leaving capability on the table, and on tasks where exploration matters, the interaction axis dominates. This connects to the note "How should we balance parallel versus sequential compute at test time?" by adding a third axis: parallel sampling, sequential per-step depth, and interaction horizon are three orthogonal dimensions of inference budget. It also generalizes the note "Does search budget scale like reasoning tokens for answer quality?": search budget is the deep-research instance of interaction scaling.
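The three-axis view can be written down as a small data structure. This is an illustrative sketch only: the class `InferenceBudget`, its field names, and the example numbers are hypothetical, and the token ceiling is a rough accounting device rather than a claim that the axes are interchangeable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceBudget:
    parallel_samples: int  # how many candidate rollouts are run in parallel
    tokens_per_step: int   # per-step reasoning depth (CoT length cap)
    max_steps: int         # interaction horizon per rollout

    def token_ceiling(self) -> int:
        """Upper bound on reasoning tokens spent across all rollouts."""
        return self.parallel_samples * self.tokens_per_step * self.max_steps

    def observations(self) -> int:
        """Upper bound on environment observations gathered across all rollouts."""
        return self.parallel_samples * self.max_steps

if __name__ == "__main__":
    deep = InferenceBudget(parallel_samples=1, tokens_per_step=4000, max_steps=5)
    long_horizon = InferenceBudget(parallel_samples=1, tokens_per_step=500, max_steps=40)
    print(deep.token_ceiling(), deep.observations())
    print(long_horizon.token_ceiling(), long_horizon.observations())
```

The two example budgets share the same token ceiling, yet the second can gather eight times as many observations. That difference is what a token-only accounting hides.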

Related concepts in this collection 5

How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
extends: TTI adds a third orthogonal axis (interaction horizon) to the parallel-vs-sequential dichotomy. Three-dimensional scaling space: how many parallel candidates, how deep per step, how long the interaction.
Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
generalizes: deep research's search-budget scaling is the search instance of interaction-horizon scaling — same orthogonality argument, broader scope.
Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
extends: TTI's curriculum-based rollout-length RL is the agentic analog of curriculum-budget RL for reasoning — both find adaptive-budget curricula beat fixed budgets.
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
complicates: under same-token budgets parallel beats sequential; TTI's interaction-scaling argument is about a different budget (action steps, not tokens) where the axes shouldn't be compared directly.
How should we allocate compute budget at inference time? Test-time scaling explores how to spend computational resources during query rather than training. The core challenge: given a fixed inference budget, what's the optimal allocation strategy for different problems?
extends: TTI is direct evidence that the test-time-scaling topic map needs an explicit interaction-scaling sub-section alongside CoT and parallel-vs-sequential.

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction (0.89 match)
Artifacts as Memory Beyond the Agent Boundary (0.84 match)
Towards a Science of Scaling Agent Systems (0.83 match)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (0.83 match)
rStar2-Agent: Agentic Reasoning Technical Report (0.83 match)
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs (0.82 match)
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? (0.82 match)
LLMs Corrupt Your Documents When You Delegate (0.82 match)

Original note title

test-time interaction scaling is a distinct dimension from chain-of-thought — increasing the agent's interaction horizon enables exploration backtracking and dynamic re-planning that deeper reasoning cannot
