Does agent interaction time scale separately from reasoning depth?
Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.
The TTI paper makes a precise argument about test-time scaling: chain-of-thought (CoT) scaling and interaction scaling are orthogonal axes, and conflating them misses what agentic tasks actually need. CoT scaling raises per-step compute by generating long reasoning traces before acting. This deepens reasoning but adds no new information from the environment. In partially observable agentic tasks, reasoning more deeply about a wrongly bounded set of options does not help: the model still cannot see hotels it has not browsed.
Interaction scaling instead increases the number of interaction steps the agent takes. This enables behaviors that CoT cannot produce: exploration (browse multiple options before committing), backtracking (retreat from a bad path), and dynamic re-planning (revise the plan based on what the environment revealed). Information gain through environment interaction is unique to agentic tasks with partial observability, and it requires interaction, not larger per-step compute.
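A toy sketch can make the distinction concrete. The Python below assumes a hypothetical hotel-search environment in which a price is revealed only when the agent browses that hotel; `HOTELS`, `browse`, `choose`, `run_agent`, and the `think_tokens` parameter are illustrative names, not anything taken from the TTI paper. Raising `think_tokens` changes nothing because reasoning can only rank what has already been seen, while raising `max_steps` widens what the agent can see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    name: str
    price: int


# Hypothetical hidden state: the agent learns a price only by browsing.
HOTELS = [Hotel("A", 180), Hotel("B", 140), Hotel("C", 95), Hotel("D", 120)]


def browse(index: int) -> Hotel:
    """One environment step: reveal a single hotel."""
    return HOTELS[index]


def choose(seen: list[Hotel], think_tokens: int) -> Hotel:
    """Stand-in for per-step reasoning: however large think_tokens is,
    it can only rank hotels that were already observed."""
    return min(seen, key=lambda h: h.price)


def run_agent(max_steps: int, think_tokens: int = 0) -> Hotel:
    """Browse up to max_steps hotels in order, then commit to the cheapest seen."""
    if max_steps < 1:
        raise ValueError("the agent needs at least one environment step")
    seen = [browse(i) for i in range(min(max_steps, len(HOTELS)))]
    return choose(seen, think_tokens)


print(run_agent(max_steps=1, think_tokens=10_000).name)  # A
print(run_agent(max_steps=3).name)                       # C
```

With one step the agent commits to hotel A regardless of its reasoning allowance; with three steps it discovers the cheaper hotel C. The fixed browsing order is a simplification: a real agent would also choose what to explore next.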
Empirically, the claim is supported on two fronts. First, prompting-based interaction scaling alone, with no training, yields a non-trivial improvement in task success on web benchmarks. Second, with training, TTI uses curriculum-based online RL that adaptively adjusts rollout lengths, producing state-of-the-art open-source, open-data web agents on WebVoyager and WebArena from a Gemma 3 12B model. The curriculum matters because TTI shows that agents learn to balance exploration and exploitation adaptively: long rollouts when information gathering pays, short rollouts when the next action is clear.
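One way to picture a rollout-length curriculum is a schedule that widens the step limit as training proceeds, combined with a policy that is free to stop early. The sketch below is an illustration under stated assumptions, not TTI's actual schedule: the linear ramp, the bounds `h_min` and `h_max`, and the names `horizon_at` and `rollout` are all hypothetical.

```python
from typing import Callable


def horizon_at(iteration: int, total_iterations: int,
               h_min: int = 5, h_max: int = 30) -> int:
    """Hypothetical linear curriculum: the maximum rollout length grows
    from h_min to h_max over the course of training."""
    if total_iterations < 1:
        raise ValueError("total_iterations must be positive")
    # Clamp progress to [0, 1] so out-of-range iterations stay within bounds.
    frac = min(max(iteration / total_iterations, 0.0), 1.0)
    return h_min + int(frac * (h_max - h_min))


def rollout(policy: Callable[[int], str], horizon: int) -> list[str]:
    """Run the policy for at most `horizon` steps; the policy may end the
    episode early by emitting "stop"."""
    trajectory = []
    for t in range(horizon):
        action = policy(t)
        trajectory.append(action)
        if action == "stop":
            break
    return trajectory


# A policy that is confident after two browse actions uses a short rollout
# even when the curriculum allows a long one.
confident = lambda t: "stop" if t >= 2 else "browse"
print(len(rollout(confident, horizon_at(100, 100))))  # 3
```

The horizon is only a ceiling here. Whether the agent uses it is left to the policy, which is the behavior the curriculum is meant to teach.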
The reframe for the field: test-time scaling is multi-dimensional. CoT and interaction scaling are complements, not substitutes. Agents that ship with deep per-step reasoning but no learned policy for when to keep interacting leave capability on the table, and on tasks where exploration matters, the interaction axis dominates. This adds a third axis to "How should we balance parallel versus sequential compute at test time?": parallel sampling, sequential per-step depth, and interaction horizon are three orthogonal dimensions of inference budget. It also generalizes "Does search budget scale like reasoning tokens for answer quality?": search budget is the deep-research instance of interaction scaling.
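The three dimensions can be written down as a simple budget record. This is a minimal accounting sketch, assuming worst-case usage on every axis; `InferenceBudget` and its field names are hypothetical. It shows two configurations with the same token ceiling that differ in how many times the agent gets to observe the environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceBudget:
    parallel_samples: int  # independent candidate trajectories
    tokens_per_step: int   # reasoning tokens spent before each action
    max_steps: int         # interaction horizon per trajectory

    def max_reasoning_tokens(self) -> int:
        """Upper bound on reasoning tokens across all trajectories."""
        return self.parallel_samples * self.tokens_per_step * self.max_steps

    def max_env_steps(self) -> int:
        """Upper bound on environment observations across all trajectories."""
        return self.parallel_samples * self.max_steps


deeper = InferenceBudget(parallel_samples=1, tokens_per_step=1000, max_steps=10)
longer = InferenceBudget(parallel_samples=1, tokens_per_step=500, max_steps=20)

print(deeper.max_reasoning_tokens(), deeper.max_env_steps())  # 10000 10
print(longer.max_reasoning_tokens(), longer.max_env_steps())  # 10000 20
```

Both configurations cap reasoning at the same number of tokens, but only the second doubles the opportunities to gather new information.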
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do planning and grounding have opposing optimization requirements in agents?
- When should you optimize agent behavior versus tool performance separately?
- What accounts for performance drops in multi-turn agent interactions?
- How should agents separate planning from perception grounding?
- What does an intermediate interface between planning and grounding actually look like?
- Does the planning-grounding factoring principle apply to other agent tasks?
- How should the surrounding agent system be designed to ground actions in reality?
- Can extended deliberation in agents become counterproductive like human overthinking?
- How do planning and grounding have opposing optimization requirements in agents?
- Can curriculum approaches teach agents when to stop exploring?
- Why does partial observability require interaction instead of better reasoning?
- Should agent capability be optimized separately from general capability?
- What makes some agent benchmarks measure interaction quality better than others?
- Does longer interaction horizon require fundamentally different evaluation approaches?
- How does interaction horizon differ from chain-of-thought depth?
- How do agents decide when to pause and reflect on their strategy?
- Should agents use parallel or sequential scaling during test time?
Related concepts in this collection 5
- How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
extends: TTI adds a third orthogonal axis (interaction horizon) to the parallel-vs-sequential dichotomy. Three-dimensional scaling space: how many parallel candidates, how deep per step, how long the interaction.
- Does search budget scale like reasoning tokens for answer quality?
Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
generalizes: deep research's search-budget scaling is the search instance of interaction-horizon scaling — same orthogonality argument, broader scope.
- Does gradually tightening token budgets beat fixed budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
extends: TTI's curriculum-based rollout-length RL is the agentic analog of curriculum-budget RL for reasoning — both find adaptive-budget curricula beat fixed budgets.
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
complicates: under same-token budgets parallel beats sequential; TTI's interaction-scaling argument is about a different budget (action steps, not tokens) where the axes shouldn't be compared directly.
- How should we allocate compute budget at inference time?
Test-time scaling explores how to spend computational resources during query rather than training. The core challenge: given a fixed inference budget, what's the optimal allocation strategy for different problems?
extends: TTI is direct evidence that the test-time-scaling topic map needs an explicit interaction-scaling sub-section alongside CoT and parallel-vs-sequential.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- Artifacts as Memory Beyond the Agent Boundary
- Towards a Science of Scaling Agent Systems
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- rStar2-Agent: Agentic Reasoning Technical Report
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- LLMs Corrupt Your Documents When You Delegate
Original note title
test-time interaction scaling is a distinct dimension from chain-of-thought — increasing the agent's interaction horizon enables exploration backtracking and dynamic re-planning that deeper reasoning cannot