INQUIRING LINE

How do search and reasoning workflows improve forecasting performance over base models?

This explores why wrapping a base model in retrieval, decomposition, and search loops improves forecasting — and whether the gains come from better reasoning or from better architecture around the model.


The corpus has a clear, slightly surprising answer: the workflow, not the raw model, is doing most of the work. The most direct evidence is that LLMs turn out to be much stronger forecasters than people assumed, but only when the workflow separates numerical reasoning from contextual reasoning (Can LLMs actually forecast time series better than we think?). Ask a single model to extrapolate a trend and interpret world events in one pass and you get mush; split those jobs apart and a latent capability surfaces. The Nexus system makes this concrete by decomposing forecasting into three stages (contextualization, a dual-resolution macro/micro outlook, and synthesis), and it beats both pure time-series models and plain LLMs on real data (Can decomposing forecasting into stages unlock numerical and contextual reasoning?).
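To make the staged idea tangible, here is a minimal sketch of a three-stage pipeline. It is not the Nexus implementation: every function name, the keyword rule standing in for an LLM call, the equal-weight blend of the two trends, and the multiplicative adjustment are assumptions chosen only for illustration. The point is that numerical extrapolation and event interpretation live in separate functions and meet only at synthesis.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Context:
    """Output of the contextualization stage (hypothetical structure)."""
    event_adjustment: float  # multiplicative nudge, e.g. 1.10 for an expected +10% effect


def contextualize(event_notes: list[str]) -> Context:
    # Stand-in for an LLM call that interprets events; here a toy keyword rule.
    adjustment = 1.0
    for note in event_notes:
        if "shortage" in note:
            adjustment *= 1.10
        if "recall" in note:
            adjustment *= 0.90
    return Context(event_adjustment=adjustment)


def macro_outlook(series: list[float]) -> float:
    # Coarse resolution: average step size over the whole history.
    return (series[-1] - series[0]) / (len(series) - 1)


def micro_outlook(series: list[float], window: int = 3) -> float:
    # Fine resolution: average step size over the most recent `window` steps.
    recent = series[-(window + 1):]
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def synthesize(series: list[float], context: Context, horizon: int = 1) -> float:
    # Blend the two trends equally (an assumption), extrapolate, then apply context.
    slope = mean([macro_outlook(series), micro_outlook(series)])
    numeric_forecast = series[-1] + slope * horizon
    return numeric_forecast * context.event_adjustment


history = [100.0, 102.0, 104.0, 106.0, 108.0]
plain = synthesize(history, contextualize([]))
adjusted = synthesize(history, contextualize(["supply shortage reported"]))
```

In a real system each stage would be a model call with its own prompt; keeping them as separate functions is what lets each one be evaluated or swapped on its own.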

Search and retrieval add the other half: forecasting the future requires information the model never saw in training. A retrieval-augmented system reached near-parity with competitive human forecasters on questions published after its training cutoff, sometimes beating the crowd, and newer model generations improved at the task without domain-specific tuning (Can retrieval-augmented language models forecast like human experts?). And the act of searching behaves like a tunable compute dial in its own right: deep research agents improve with more search steps along a curve that mirrors the reasoning-token scaling law, complete with the same diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?). So 'search' isn't just plumbing; it's a genuine inference-compute axis sitting alongside reasoning.
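A toy curve can show what "diminishing returns" means for a search budget. The logarithmic form and the numbers below are assumptions made up for illustration; the source only says the search-step curve mirrors the reasoning-token one. Under this assumed shape, each doubling of steps buys the same gain, so each individual extra step buys less and less.

```python
import math


def toy_accuracy(steps: int, base: float = 0.40, gain_per_doubling: float = 0.05) -> float:
    # Hypothetical log-shaped curve; base and gain_per_doubling are invented values.
    return min(1.0, base + gain_per_doubling * math.log2(steps))


def marginal_gain(steps: int) -> float:
    # Accuracy bought by adding one more search step at the current budget.
    return toy_accuracy(steps + 1) - toy_accuracy(steps)


# Budgets to compare: the gain from one extra step shrinks as the budget grows.
gains = {steps: marginal_gain(steps) for steps in (1, 8, 64)}
```

The practical reading is a budgeting one: past some point, another search step costs the same and returns less, which is the same trade-off already familiar from reasoning tokens.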

Here's the thing you might not expect: piling on more reasoning is a weaker lever than it sounds. On numerical and constraint-bound tasks, reasoning models don't systematically beat standard ones: extended chain-of-thought produces more text, not more actual iterative computation (Do reasoning models actually beat standard models on optimization?), and models plateau around 55–60% constraint satisfaction regardless of size or training (Do larger language models solve constrained optimization better?). Worse, reasoning chains fail through disorganization: models wander into invalid paths and abandon promising ones too early, an 'underthinking' problem you can partly fix with a decoding penalty rather than a smarter model (Why do reasoning models abandon promising solution paths?; Do reasoning models switch between ideas too frequently?). This is exactly why workflow structure matters for forecasting: the gains come from giving the model scaffolding that keeps it on-task, not from letting it think longer.
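The decoding penalty can be sketched in a few lines. This is a simplified illustration in the spirit of a thought-switching penalty, not the published method's code: the function names, the dictionary-of-logits representation, the example tokens, and the default values of `alpha` (penalty size) and `beta` (how many tokens into a thought the penalty applies) are all assumptions.

```python
def penalize_switch_tokens(
    logits: dict[str, float],
    switch_tokens: set[str],
    tokens_since_thought_start: int,
    alpha: float = 3.0,
    beta: int = 200,
) -> dict[str, float]:
    # Once the current thought has run for at least `beta` tokens, switching is allowed.
    if tokens_since_thought_start >= beta:
        return dict(logits)
    # Early in a thought, subtract `alpha` from tokens that signal a change of approach.
    return {
        token: (score - alpha if token in switch_tokens else score)
        for token, score in logits.items()
    }


def greedy_pick(logits: dict[str, float]) -> str:
    # Choose the highest-scoring token.
    return max(logits, key=logits.get)


step_logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switchers = {"Alternatively"}

early = greedy_pick(penalize_switch_tokens(step_logits, switchers, tokens_since_thought_start=10))
late = greedy_pick(penalize_switch_tokens(step_logits, switchers, tokens_since_thought_start=500))
```

With these toy scores the model would have switched approach immediately; under the penalty it continues the current thought early on and remains free to switch later. No weights change, which is why this counts as a workflow fix rather than a model fix.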

When reasoning does help, it's because of training, not inference budget. Reasoning models keep their edge no matter how much compute non-reasoning models are given, because training instills a protocol that makes the extra tokens productive (Can non-reasoning models catch up with more compute?). The same lesson shows up in search agents: a 20B model that externalizes its bookkeeping to a stateful harness outperformed the next-best open searcher by 11.4 points of average curated recall across eight benchmarks, and the gain survived ablation, so the harness was a learned capability, not mere implementation (Can externalizing bookkeeping improve search agent performance?).

The deeper takeaway is that 'improve forecasting over the base model' is mostly a selection-and-orchestration problem. Routing queries to the right specialist beats scaling a single model: ten 7B models with routing surpassed GPT-4.1 (Can routing beat building one better model?). Giving recursive reasoners stochastic latent transitions lets them hold genuine uncertainty and carry multiple candidate futures instead of collapsing to one guess (Can stochastic latent reasoning help models explore multiple solutions?). Forecasting is inherently a distribution-over-futures task, so the workflows that win are the ones that decompose the problem, fetch what the model can't know, and preserve uncertainty, rather than the ones that simply make the model think harder.
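Routing by semantic cluster can be sketched as nearest-centroid lookup followed by a table lookup. This is an illustrative toy, not the routing system the notes describe: the two-dimensional "embeddings", the cluster names, and the model names are hypothetical, and a real router would use learned embeddings and per-cluster scores measured on held-out data.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity; assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(
    query_vec: list[float],
    centroids: dict[str, list[float]],
    best_model_per_cluster: dict[str, str],
) -> str:
    # Find the closest cluster, then return the model that did best on that cluster.
    cluster = max(centroids, key=lambda name: cosine(query_vec, centroids[name]))
    return best_model_per_cluster[cluster]


# Hypothetical clusters and specialists.
centroids = {"numeric": [1.0, 0.0], "geopolitical": [0.0, 1.0]}
best_model_per_cluster = {"numeric": "specialist_a", "geopolitical": "specialist_b"}

choice_for_numeric_query = route([0.9, 0.1], centroids, best_model_per_cluster)
choice_for_event_query = route([0.2, 0.8], centroids, best_model_per_cluster)
```

The design choice worth noticing is that nothing here improves any single model; the gain, if there is one, comes entirely from picking well among models that already exist.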


Sources (12 notes)

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can externalizing bookkeeping improve search agent performance?

A 20B model using Harness-1 achieved 0.730 average curated recall across eight benchmarks, outperforming the next open searcher by 11.4 points. The gains transfer to held-out benchmarks and survive ablation, showing the harness is not mere implementation but a learned capability.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an analyst testing whether search-and-reasoning workflows for forecasting remain bottlenecked by the constraints a curated library documented (papers 2023–2026). The question: what fraction of forecasting gains now come from workflow orchestration vs. raw model capability, and has that ratio shifted?

What a curated library found — and when (dated claims, not current truth):
• Separating numerical from contextual reasoning unlocks latent forecaster ability; single-pass models fail (~2024).
• Retrieval-augmented systems reach near-parity with human forecasters on future events unseen in training (~2024).
• Search budget scales like reasoning tokens, with diminishing returns (~2025).
• Reasoning models plateau at 55–60% constraint satisfaction; reasoning chains don't systematically beat standard models on numerical tasks (~2025–2026).
• Routing small specialist models beats scaling one large model; ten small models with routing surpassed GPT-4.1 (~2026).
• Underthinking (premature thought-switching) is fixable via decoding penalty, not model scale (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.18563 (Feb 2024): Approaching Human-Level Forecasting with Language Models
• arXiv:2501.18585 (Jan 2025): Thoughts Are All Over the Place — underthinking in o1-like LLMs
• arXiv:2605.14389 (May 2026): Nexus — agentic decomposition for forecasting
• arXiv:2508.12631 (Aug 2025): Beyond GPT-5 — routing surpasses single-model scaling

Your task:
(1) RE-TEST EACH CONSTRAINT. For the six claims above, judge whether newer models (o3, GPT-5-class, improved reasoning trainers), decoding harnesses (better search, memory, multi-agent orchestration), or fresh evaluation suites have since RELAXED the 55–60% plateau, made single-pass models viable, or collapsed the routing advantage. Cite what relaxed it; flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any paper showing a single large model with a new training regime beats decomposed workflows, or forecasting gains driven by scale rather than orchestration.
(3) Propose 2 research questions that ASSUME the workflow-vs.-capability ratio may have shifted: e.g., "Does end-to-end training on forecast + search unify the gains?" or "Do modern reasoning models eliminate the need for dual-resolution decomposition?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
