Can LLMs actually forecast time series better than we think?
Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
The debate over whether LLMs can forecast time series has been muddled by inconsistent evaluation. Some studies show LLMs underperforming dedicated time series foundation models (TSFMs); others show LLMs matching or exceeding them. The Nexus authors argue that the variance comes from an under-examined methodological factor: how numerical and contextual reasoning are organized in the forecasting workflow.
Monolithic prompting, where the LLM is asked to read all the data and produce a forecast in one pass, produces uneven results. The model has to integrate seasonal numerical patterns and event-driven contextual catalysts simultaneously, and it does neither well. The intrinsic forecasting capability is real, but the workflow suppresses it. Structured workflows, which decompose the task into stages that separate numerical reasoning from contextual reasoning and then synthesize the results, surface the capability that the monolithic approach hides.
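The staged idea can be illustrated with a minimal Python sketch. This is not the Nexus implementation: the stage names, the seasonal-naive baseline, the multiplicative synthesis rule, and the `LLM` callable type are all assumptions chosen to keep the example small. The point is only that the numerical stage never sees the events and the contextual stage never sees the raw numbers.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to a text reply.
LLM = Callable[[str], str]


def seasonal_naive(history: Sequence[float], period: int, horizon: int) -> list[float]:
    """Numerical stage (assumed baseline): repeat the last full seasonal cycle."""
    last_cycle = list(history[-period:])
    return [last_cycle[i % period] for i in range(horizon)]


def contextual_adjustment(llm: LLM, events: Sequence[str]) -> float:
    """Contextual stage: ask the model for one multiplier based on events only."""
    prompt = (
        "Events that may affect the series:\n"
        + "\n".join(f"- {event}" for event in events)
        + "\nReply with a single multiplier such as 1.05."
    )
    reply = llm(prompt)
    try:
        return float(reply.strip())
    except ValueError:
        # Fall back to "no adjustment" if the reply is not a number.
        return 1.0


def staged_forecast(
    llm: LLM,
    history: Sequence[float],
    events: Sequence[str],
    period: int,
    horizon: int,
) -> list[float]:
    """Synthesis stage (assumed rule): scale the numerical baseline by the multiplier."""
    baseline = seasonal_naive(history, period, horizon)
    factor = contextual_adjustment(llm, events)
    return [value * factor for value in baseline]


if __name__ == "__main__":
    # Stub model for demonstration; a real workflow would call an actual LLM here.
    stub_llm: LLM = lambda prompt: "1.10"
    print(staged_forecast(stub_llm, [10, 20, 30, 10, 20, 30], ["holiday promotion"], 3, 4))
```

A monolithic variant would instead put the history and the events into one prompt and parse the forecast from a single reply, which is the design the note argues hides the model's capability.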
This reframes the "can LLMs forecast?" question. The answer is yes when the workflow plays to what LLMs do well (contextual reasoning, integration of structured representations) and routes around what they do poorly (raw numerical extrapolation under noise). Architectures like Nexus that explicitly separate these contributions and use the right component for each get the LLM's strengths without exposing its weaknesses.
The methodological consequence for forecasting benchmark design: evaluating "LLM forecasting ability" with a single prompt architecture undersamples the capability space. A sounder evaluation compares competing workflow designs on the same model, then compares the best workflow's performance against TSFMs. Folding workflow effects and model effects into a single number obscures which of the two drives performance.
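The two-step comparison described above can be sketched as follows. The function names, the use of mean absolute error, and the validation/test split are assumptions made for illustration, not a protocol taken from the paper. Selecting the best workflow on a validation set and reporting on a separate test set is one way to avoid flattering the chosen workflow.

```python
from typing import Callable, Sequence

# Hypothetical interface: a forecaster maps (history, horizon) to predicted values.
Forecaster = Callable[[Sequence[float], int], list[float]]
# One evaluation case: (history, actual future values).
Case = tuple[Sequence[float], Sequence[float]]


def mae(forecaster: Forecaster, cases: Sequence[Case]) -> float:
    """Mean absolute error pooled over every forecast point in every case."""
    errors: list[float] = []
    for history, actual in cases:
        predicted = forecaster(history, len(actual))
        errors.extend(abs(p - a) for p, a in zip(predicted, actual))
    return sum(errors) / len(errors)


def evaluate(
    workflows: dict[str, Forecaster],
    tsfm: Forecaster,
    validation: Sequence[Case],
    test: Sequence[Case],
) -> dict:
    """Step 1: rank workflows built on the same model. Step 2: best workflow vs. TSFM."""
    validation_scores = {name: mae(fn, validation) for name, fn in workflows.items()}
    best = min(validation_scores, key=validation_scores.get)
    return {
        "workflow_validation_mae": validation_scores,  # workflow effect, model held fixed
        "best_workflow": best,
        "best_workflow_test_mae": mae(workflows[best], test),
        "tsfm_test_mae": mae(tsfm, test),
    }


if __name__ == "__main__":
    # Stand-in forecasters; real workflows would wrap the same LLM in different designs.
    workflows = {
        "last_value": lambda history, n: [history[-1]] * n,
        "mean": lambda history, n: [sum(history) / len(history)] * n,
    }
    tsfm_stub = lambda history, n: [0.0] * n
    report = evaluate(workflows, tsfm_stub, [([1, 2, 3], [3, 3])], [([2, 4, 6], [6, 5])])
    print(report)
```

Reporting the per-workflow scores alongside the final comparison keeps the workflow effect and the model effect visible as separate numbers.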
The broader observation is that workflow architecture often dominates raw model capability in compound tasks. It shows up here for forecasting, but the pattern recurs: code generation (a workflow with planning and execution beats one-shot generation), retrieval-augmented generation (a workflow with retrieval, reranking, and generation beats raw generation), and reasoning (a workflow with structured decomposition beats free-form chain-of-thought). For tasks above a complexity threshold, "which workflow" is a stronger lever than "which model."
Inquiring lines that use this note as a source 30
This note is a source for these synthesized inquiries.
- Why do longer forecasting horizons degrade LLM accuracy in role-play?
- Why does combining natural language with numerical scores improve prediction accuracy?
- How does process supervision relate to execution-signaled feedback approaches?
- Can autoregressive models learn faithful translation to logical representations without semantic loss?
- Can simple diagnostic tests predict language model performance in production complexity?
- Why do hybrid paradigms outperform pure autoregressive or pure diffusion approaches?
- How do general language model benchmarks predict specialized domain performance?
- Do standard language benchmarks underestimate what LLMs can actually do?
- Do LLMs need world models to make accurate predictions?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Do monolithic prompts underutilize LLM strengths in forecasting workflows?
- What separates good workflow design from poor workflow design?
- How should benchmarks evaluate workflow architecture versus raw model performance?
- How should organizations redesign workflows if LLMs cannot solve optimization directly?
- Why do macro and micro forecasting scales require different reasoning approaches?
- What real-world forecasting domains benefit most from contextual reasoning integration?
- Which model capabilities actually matter for sustained workflow delegation?
- How much does workflow architecture matter compared to raw model capability in forecasting?
- Do newer language model generations improve forecasting ability without additional training?
- Why are post-cutoff test sets essential for evaluating genuine forecasting ability?
- What role does retrieval mechanism design play in forecast accuracy?
- How do AI researcher forecasts compare across different timeline question phrasings?
- Why do non-experts default to familiar chart types despite domain complexity?
- How do search and reasoning workflows improve forecasting performance over base models?
- Can language models match competitive crowd forecasters on real future events?
- How much does domain expertise actually improve human forecasting under uncertainty?
- Why does LLM performance improve when forecasting tasks include organized reasoning?
- What privacy-preserving evaluation methods best capture real-world forecasting ability?
- How do you partition LLM experts by domain versus by time?
- What architectural changes would help LLMs distinguish causal relationships from temporal sequences?
Related concepts in this collection 1
- Can decomposing forecasting into stages unlock numerical and contextual reasoning?
  This explores whether breaking time-series forecasting into separate stages for contextualization, dual-resolution outlook, and synthesis allows systems to combine the strengths of numerical models and language models more effectively than either alone.
  same paper, the specific architectural instantiation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Nexus: An Agentic Framework for Time Series Forecasting
- Approaching Human-Level Forecasting with Language Models
- LLMs Corrupt Your Documents When You Delegate
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Rethinking Interpretability in the Era of Large Language Models
- Self-Evaluation Guided Beam Search for Reasoning
- Large Language Diffusion Models
Original note title
LLM forecasting ability is stronger than recognized when numerical and contextual reasoning are organized properly — workflow architecture dominates raw model capability