How does interaction horizon differ from chain-of-thought depth?

This explores the difference between two ways of giving a model more thinking room at test time: making a single reasoning chain longer and deeper (chain-of-thought depth) versus letting an agent take more turns acting in an environment (interaction horizon).

The corpus treats these as genuinely separate axes, not the same dial with a different label. The clearest statement is the finding that agent interaction scaling is *orthogonal* to chain-of-thought scaling (see "Does agent interaction time scale separately from reasoning depth?"): adding more environment steps before the agent has to commit buys exploration, backtracking, and replanning, which no amount of per-step verbalization can produce. This matters most on tasks where the model can't see the whole problem at once (partial observability). Depth makes one guess smarter; horizon lets you take a guess, see what happened, and revise.
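The "take a guess, see what happened, and revise" contrast can be made concrete with a toy number-guessing task. This is a minimal sketch, not drawn from the cited work: the environment, the function names (`one_shot_guess`, `interactive_search`), and the `horizon` parameter are all illustrative assumptions.

```python
def one_shot_guess(low: int, high: int) -> int:
    """Depth only: commit to a single answer with no feedback from the environment."""
    # With nothing observed, the midpoint is as good a guess as any.
    return (low + high) // 2


def interactive_search(secret: int, low: int, high: int, horizon: int):
    """Horizon: act, observe higher/lower feedback, and revise, for up to `horizon` steps.

    Returns (answer, steps_used); answer is None if the horizon runs out first.
    """
    for step in range(1, horizon + 1):
        guess = (low + high) // 2
        if guess == secret:
            return guess, step
        # The environment's feedback is the only source of new information.
        if guess < secret:
            low = guess + 1
        else:
            high = guess - 1
    return None, horizon


if __name__ == "__main__":
    secret = 37  # hidden state the agent cannot see (partial observability)
    print("one shot correct:", one_shot_guess(1, 100) == secret)
    print("horizon 2:", interactive_search(secret, 1, 100, horizon=2))
    print("horizon 7:", interactive_search(secret, 1, 100, horizon=7))
```

In this toy, no amount of extra deliberation inside `one_shot_guess` can recover the hidden number, while a longer horizon lets the searcher narrow the range step by step.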

The reason this distinction bites is that chain-of-thought depth has surprisingly hard ceilings. Accuracy along the depth axis follows an inverted U: past some point, more reasoning tokens make answers *worse*, and the optimal length shrinks as models become more capable (see "Why does chain of thought accuracy eventually decline with length?"). Worse, the length of a chain often isn't tracking how hard the problem is at all. It tracks how close the problem sits to the training distribution, and decouples from difficulty entirely once you go out of distribution (see "Does longer reasoning actually mean harder problems?"). Long chains can also quietly drift into theater: fine-tuning can make reasoning steps stop actually driving the final answer (see "Does fine-tuning disconnect reasoning steps from final answers?"), while a more foundational critique argues that CoT is constrained imitation of reasoning *shape* rather than genuine inference (see "Why does chain-of-thought reasoning fail in predictable ways?"). So pouring compute into depth runs into diminishing, sometimes negative, returns.
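One way to see how an inverted U can arise is a toy model in which each extra step has some chance of reaching the key insight and some chance of derailing the chain. This is an illustrative assumption only, not the model used in the cited notes; `toy_accuracy`, `p_insight`, and `p_derail` are hypothetical names and the numbers are arbitrary.

```python
def toy_accuracy(n_steps: int, p_insight: float = 0.3, p_derail: float = 0.05) -> float:
    """Toy probability of a correct answer after `n_steps` reasoning steps.

    Assumption: the answer is correct if at least one step reaches the key
    insight AND no step introduces an unrecoverable error.
    """
    reached_insight = 1 - (1 - p_insight) ** n_steps
    never_derailed = (1 - p_derail) ** n_steps
    return reached_insight * never_derailed


lengths = range(1, 31)

# Weaker model: lower chance of insight per step.
best_weak = max(lengths, key=lambda n: toy_accuracy(n, p_insight=0.3))
# Stronger model: higher chance of insight per step.
best_strong = max(lengths, key=lambda n: toy_accuracy(n, p_insight=0.6))

if __name__ == "__main__":
    print("best length, weaker model:", best_weak)
    print("best length, stronger model:", best_strong)
```

Under these assumptions accuracy rises, peaks at an intermediate length, then falls, and the peak moves to shorter chains as the per-step insight probability grows, which is the qualitative pattern the note describes.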

Interaction horizon sidesteps much of that by grounding each step in something external. ReAct is the canonical illustration: interleaving reasoning with real tool queries injects real-world feedback at every turn and prevents the error propagation that pure thinking falls into, outperforming pure chain-of-thought and reinforcement-learning baselines by a reported 10-34% absolute accuracy on knowledge-intensive and interactive tasks (see "Can interleaving reasoning with real-world feedback prevent hallucination?"). The lever you pull to scale horizon is different too: curriculum-based RL on *rollout length* rather than on chain length (see "Does agent interaction time scale separately from reasoning depth?").
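The interleaving pattern can be sketched as a loop that alternates a thought, an action, and an observation. This is a minimal sketch under stated assumptions, not the ReAct authors' implementation: the toy knowledge base `TOY_KB`, the `lookup` tool, and the `scripted_policy` that stands in for a language model are all hypothetical.

```python
# Hypothetical stand-in for an external tool such as a search API.
TOY_KB = {
    "capital of France": "Paris",
    "river through Paris": "Seine",
}


def lookup(query: str) -> str:
    """Toy tool call: returns an observation from the environment."""
    return TOY_KB.get(query, "no result")


def scripted_policy(question: str, observations: list) -> tuple:
    """Stand-in for the model: returns (thought, action, argument)."""
    if not observations:
        return ("I need the capital first.", "lookup", "capital of France")
    if len(observations) == 1:
        city = observations[0]
        return (f"The capital is {city}; now find its river.", "lookup", f"river through {city}")
    return ("I have what I need.", "finish", observations[-1])


def react_loop(question: str, policy, tool, max_turns: int = 5):
    """Alternate reasoning and acting; each observation feeds the next thought."""
    observations, trace = [], []
    for _ in range(max_turns):
        thought, action, arg = policy(question, observations)
        trace.append((thought, action, arg))
        if action == "finish":
            return arg, trace
        observations.append(tool(arg))  # external feedback grounds the next step
    return None, trace  # horizon exhausted before the agent could commit


if __name__ == "__main__":
    answer, trace = react_loop("Which river runs through the capital of France?",
                               scripted_policy, lookup)
    print(answer)
    for step in trace:
        print(step)
```

Here `max_turns` plays the role of the interaction horizon: setting it below 3 cuts the loop off before the second observation arrives, so the agent never gets to answer.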

One less obvious point: depth and horizon fail in mirror-image ways. Depth's characteristic failure is *underthinking*, where models abandon promising reasoning paths too early; it can be patched by penalizing thought-switching (see "Do reasoning models switch between ideas too frequently?"). The exploration story flips it: the way to fix depth-only reasoning isn't to think harder along one path but to force breadth, generating diverse abstractions instead of more solution samples (see "Can abstractions guide exploration better than depth alone?"). That hints that "depth" and "horizon" are both proxies for a richer variable: the *shape* of the computation. The reasoning-topology taxonomy makes this literal: chains, trees, and graphs are formally distinct structures, and a graph's ability to merge multiple lines (in-degree > 1) lets it do divide-and-conquer synthesis that neither a linear chain nor a tree can express (see "Can reasoning topologies be formally classified as graph types?"). Seen that way, chain-of-thought depth is the length of one path; interaction horizon is how many times you get to branch, observe, and rejoin.
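The chain, tree, and graph distinction can be checked mechanically from node degrees. The sketch below assumes the reasoning structure is given as a list of directed edges over a connected, acyclic set of thoughts; the function name `classify_topology` and the example edge lists are illustrative, not taken from the taxonomy paper.

```python
from collections import Counter


def classify_topology(edges):
    """Classify a connected, acyclic reasoning structure from its directed edges.

    - "graph": some thought merges several predecessors (in-degree > 1)
    - "tree":  no merging, but some thought branches (out-degree > 1)
    - "chain": neither merging nor branching (a path graph)
    """
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    if any(d > 1 for d in in_degree.values()):
        return "graph"
    if any(d > 1 for d in out_degree.values()):
        return "tree"
    return "chain"


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c")]
    tree = [("a", "b"), ("a", "c")]
    # Divide and conquer: split at "a", solve "b" and "c", merge at "d".
    graph = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    for name, edges in [("chain", chain), ("tree", tree), ("graph", graph)]:
        print(name, "->", classify_topology(edges))
```

The merge node `d` in the last example is the only place an in-degree above 1 appears, which is the structural feature the taxonomy ties to divide-and-conquer synthesis.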

Sources 9 notes

Does agent interaction time scale separately from reasoning depth?

Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can reasoning topologies be formally classified as graph types?

CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating claims about test-time compute allocation in LLMs. The question: are interaction horizon and chain-of-thought depth genuinely orthogonal scaling axes, or have recent models and methods collapsed the distinction?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. Key claims:
• Interaction horizon (agent environment steps) and CoT depth are orthogonal; depth hits inverted-U ceilings while horizon scales via curriculum RL on rollout length (~2025).
• CoT depth fails via underthinking (premature path abandonment); horizon's failure is missed exploration due to limited branching; fix is diverse reasoning abstractions, not longer chains (~2025).
• Fine-tuning degrades CoT faithfulness independent of accuracy; reasoning steps stop driving answers (~2024).
• Reasoning topology (chains vs. trees vs. graphs) is formally distinct; only graphs (in-degree >1) enable divide-and-conquer synthesis (~2024).
• CoT trace length reflects training-distribution proximity, not problem difficulty (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.07976 "Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction" (Jun 2025)
• arXiv:2401.14295 "Demystifying Chains, Trees, and Graphs of Thoughts" (Jan 2024)
• arXiv:2510.01265 "RLP: Reinforcement as a Pretraining Objective" (Sep 2025)
• arXiv:2508.01191 "Is Chain-of-Thought Reasoning of LLMs a Mirage?" (Aug 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the inverted-U ceiling, underthinking penalty, and graph-based synthesis claims, check whether o1, o3, or newer reasoning models, multi-agent orchestration, or adaptive-depth RL have since RELAXED or OVERTURNED these findings. Separate the durable insight (depth and horizon likely do differ) from the perishable limitation (specific failure modes and scaling laws). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 2 months (post-Sep 2025) that either unifies depth/horizon or shows one dominates the other.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do adaptive-compute schedulers that dynamically choose depth vs. horizon per-token outperform static allocation?" and "Can graph reasoning merge depth gains (longer chains) *and* horizon gains (exploration) without the faithfulness collapse that fine-tuning induces?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
