Can LLMs actually forecast time series better than we think?

Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.

Synthesis note · 2026-05-18 · sourced from Tasks Planning

The debate over whether LLMs can forecast time series has been muddled by inconsistent evaluation. Some studies show LLMs underperforming dedicated time-series foundation models (TSFMs); others show LLMs matching or exceeding them. The Nexus authors argue that the variance comes from a methodological factor that has received too little attention: how numerical and contextual reasoning are organized in the forecasting workflow.

Monolithic prompting, where the LLM is asked to read all the data and produce a forecast in one pass, produces uneven results. The model has to integrate seasonal numerical patterns and contextual, event-driven catalysts simultaneously, and it does neither well. The intrinsic forecasting capability is real, but the workflow suppresses it. Structured workflows, which decompose the task into stages that separate numerical reasoning from contextual reasoning and then synthesize the results, surface the capability that the monolithic approach hides.
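
The staged idea can be illustrated with a minimal Python sketch. This is not the Nexus implementation: the stage names (`numerical_stage`, `contextual_stage`, `synthesize`), the `ContextSignal` type, the multiplicative adjustment, and `toy_reasoner` (a keyword rule standing in for an LLM call) are all hypothetical choices made for illustration. The numerical stage uses a plain seasonal-naive baseline as a placeholder for whatever numerical model a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ContextSignal:
    """Hypothetical output of the contextual stage: a multiplicative adjustment."""
    multiplier: float
    rationale: str


def numerical_stage(history: Sequence[float], season: int, horizon: int) -> list[float]:
    """Seasonal-naive baseline: repeat the last full season across the horizon."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-season:])
    return [last_season[i % season] for i in range(horizon)]


def contextual_stage(notes: str, reason: Callable[[str], ContextSignal]) -> ContextSignal:
    """Delegate event-driven reasoning to a caller-supplied function (e.g. an LLM wrapper)."""
    return reason(notes)


def synthesize(baseline: Sequence[float], signal: ContextSignal) -> list[float]:
    """Combine the two stages; scaling the baseline is an assumption, not a prescribed rule."""
    return [value * signal.multiplier for value in baseline]


def staged_forecast(
    history: Sequence[float],
    notes: str,
    season: int,
    horizon: int,
    reason: Callable[[str], ContextSignal],
) -> list[float]:
    """Run the numerical and contextual stages separately, then synthesize."""
    baseline = numerical_stage(history, season, horizon)
    signal = contextual_stage(notes, reason)
    return synthesize(baseline, signal)


def toy_reasoner(notes: str) -> ContextSignal:
    """Placeholder for an LLM call: a keyword rule, used only to make the sketch runnable."""
    if "promotion" in notes.lower():
        return ContextSignal(1.5, "promotion expected to lift demand")
    return ContextSignal(1.0, "no relevant events found")


forecast = staged_forecast([10, 20, 30, 10, 20, 30], "Promotion next week", 3, 4, toy_reasoner)
```

The point of the structure is that each stage can be swapped or evaluated on its own: the language model never has to extrapolate raw numbers, and the numerical model never has to interpret events.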

This reframes the "can LLMs forecast?" question. The answer is yes, provided the workflow plays to what LLMs do well (contextual reasoning, integration of structured representations) and routes around what they do poorly (raw numerical extrapolation under noise). Architectures like Nexus, which explicitly separate these contributions and use the right component for each, get the LLM's strengths without exposing its weaknesses.

The methodological consequence for forecasting benchmark design: evaluating "LLM forecasting ability" with a single prompt architecture undersamples the capability space. A sounder evaluation compares competing workflow designs on the same model, then compares the best workflow's performance against TSFMs. Mixing workflow effects with model effects in a single number obscures which of the two drives performance.
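
A rough sketch of that two-step comparison follows, under stated assumptions. The three forecasters are toy stand-ins: `last_value` and `drift` play the role of two workflows wrapped around the same model, and `mean_value` plays the role of a TSFM baseline. The `compare` function, its return keys, and the use of mean absolute error are illustrative choices, not a protocol taken from the source.

```python
from statistics import mean
from typing import Callable, Sequence

# A forecaster maps (history, horizon) to a sequence of predictions.
Forecaster = Callable[[Sequence[float], int], Sequence[float]]
Case = tuple[Sequence[float], Sequence[float]]  # (history, held-out future)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def score(forecaster: Forecaster, cases: Sequence[Case]) -> float:
    """Average MAE of one forecaster over all evaluation cases."""
    return mean(mae(future, forecaster(history, len(future))) for history, future in cases)


def compare(workflows: dict[str, Forecaster], tsfm: Forecaster, cases: Sequence[Case]) -> dict:
    """Score every workflow on the same cases, then set the best one against the TSFM."""
    workflow_scores = {name: score(fn, cases) for name, fn in workflows.items()}
    best = min(workflow_scores, key=workflow_scores.get)
    return {
        "workflow_scores": workflow_scores,  # workflow effect, model held fixed
        "workflow_spread": max(workflow_scores.values()) - min(workflow_scores.values()),
        "best_workflow": best,
        "tsfm_score": score(tsfm, cases),    # model-family comparison
    }


# Toy stand-ins so the sketch runs without any model.
def last_value(history: Sequence[float], horizon: int) -> list[float]:
    return [history[-1]] * horizon


def drift(history: Sequence[float], horizon: int) -> list[float]:
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (step + 1) for step in range(horizon)]


def mean_value(history: Sequence[float], horizon: int) -> list[float]:
    return [mean(history)] * horizon


cases = [([1, 2, 3, 4], [5, 6])]
report = compare({"last_value": last_value, "drift": drift}, mean_value, cases)
```

Reporting `workflow_spread` alongside the TSFM comparison keeps the two effects separate: the spread measures how much the workflow alone moves the score, while the best-workflow versus TSFM gap measures the model-family difference.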

The broader observation is that workflow architecture often dominates raw model capability in compound tasks. This shows up here for forecasting, but the pattern recurs: code generation (a workflow with planning and execution beats one-shot generation), retrieval-augmented generation (a workflow with retrieval, reranking, and generation beats raw generation), and reasoning (a workflow with structured decomposition beats free-form chain-of-thought). For tasks above a complexity threshold, "which workflow" is a stronger lever than "which model."

Related concepts in this collection

Can decomposing forecasting into stages unlock numerical and contextual reasoning? This explores whether breaking time-series forecasting into separate stages for contextualization, dual-resolution outlook, and synthesis allows systems to combine the strengths of numerical models and language models more effectively than either alone.
same paper, the specific architectural instantiation