INQUIRING LINE

What data properties enable transformers to learn sequential decision-making in context?

This explores what it is about the *data* — not the architecture — that lets a transformer pick up sequential decision-making on the fly from its context, without weight updates.


This question is really about data shape rather than model design: given a fixed transformer, what must its training and context data look like for in-context sequential decision-making to emerge? The corpus converges on a surprisingly specific answer: what matters is the *structure* of examples in context, not their quantity. The clearest result is that isolated input-output pairs aren't enough; the context needs full or partial *trajectories* from the same environment level, a property called trajectory burstiness (see "Why do trajectories matter more than individual examples for in-context learning?"). When the model sees coherent action sequences from a shared task, it can generalize across vastly different tasks without any weight changes. Scatter the same examples as disconnected snapshots and the ability disappears. So the enabling property is temporal coherence: the data has to carry the thread of a decision process, not just its endpoints.
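The contrast between bursty and scattered contexts can be made concrete with a toy sketch. Everything here is hypothetical and chosen for illustration only: the `Step` record, the level names, and the `coherence` score are not taken from the underlying paper. The sketch simply shows what "whole trajectories from the same level" looks like next to "the same kind of steps, shuffled across levels".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    level: str    # hypothetical environment-level identifier
    traj_id: int  # which trajectory the step belongs to
    t: int        # timestep within that trajectory
    obs: str
    action: str


def make_trajectory(level, traj_id, length):
    """Toy trajectory: placeholder observations and actions for one level."""
    return [Step(level, traj_id, t, f"obs{t}", f"act{t}") for t in range(length)]


def bursty_context(query_level, pool, n_traj):
    """Whole trajectories, in order, all from the same level as the query."""
    same_level = [traj for traj in pool if traj[0].level == query_level]
    return [step for traj in same_level[:n_traj] for step in traj]


def scattered_context(pool, n_steps, seed=0):
    """Individual steps sampled across all levels, with ordering destroyed."""
    rng = random.Random(seed)
    all_steps = [step for traj in pool for step in traj]
    return rng.sample(all_steps, n_steps)


def coherence(context):
    """Fraction of adjacent pairs that are consecutive steps of one trajectory."""
    if len(context) < 2:
        return 0.0
    hits = sum(
        1 for a, b in zip(context, context[1:])
        if a.traj_id == b.traj_id and b.t == a.t + 1
    )
    return hits / (len(context) - 1)


pool = [
    make_trajectory("A", 0, 4),
    make_trajectory("B", 1, 4),
    make_trajectory("A", 2, 4),
    make_trajectory("C", 3, 4),
]
bursty = bursty_context("A", pool, n_traj=2)
scattered = scattered_context(pool, n_steps=len(bursty))
print(coherence(bursty), coherence(scattered))
```

The `coherence` score is only a crude proxy for temporal coherence; it is meant to make the structural difference visible, not to reproduce any metric from the literature.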

A second, complementary lever is embedding *future* information into the training data itself. The lookahead-token work (TRELAWNEY) shows that if you augment training sequences with special tokens that encapsulate where the trajectory is heading, models learn goal-conditioned, plan-like generation using completely standard architecture and training (see "Can embedding future information in training data improve planning?"). This is striking because planning is usually framed as an architectural problem that calls for search, recurrence, or deeper computation. Here it's reframed as a data-curation problem: the signal for 'where this is going' just needs to be present in the sequence for the model to internalize it.
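One plausible reading of this augmentation, sketched under stated assumptions: a span from later in the sequence is copied to an earlier position and wrapped in delimiter tokens, so that ordinary next-token training sees the destination before the path. The delimiter strings `<future>` and `</future>` and both function names are hypothetical; the actual tokens and insertion policy used in the original work may differ.

```python
# Hypothetical delimiters; the real special tokens may differ.
FUTURE_OPEN = "<future>"
FUTURE_CLOSE = "</future>"


def add_lookahead(tokens, insert_at, span_start, span_len):
    """Copy a later span to an earlier position, wrapped in delimiters."""
    if not 0 <= insert_at <= span_start:
        raise ValueError("lookahead must be inserted before the span it previews")
    if span_start + span_len > len(tokens):
        raise ValueError("future span runs past the end of the sequence")
    future = tokens[span_start:span_start + span_len]
    return tokens[:insert_at] + [FUTURE_OPEN, *future, FUTURE_CLOSE] + tokens[insert_at:]


def strip_lookahead(tokens):
    """Drop delimited lookahead blocks, recovering the original sequence."""
    out, inside = [], False
    for tok in tokens:
        if tok == FUTURE_OPEN:
            inside = True
        elif tok == FUTURE_CLOSE:
            inside = False
        elif not inside:
            out.append(tok)
    return out


sequence = "start go north go east reach goal".split()
augmented = add_lookahead(sequence, insert_at=1, span_start=5, span_len=2)
print(" ".join(augmented))
```

Because the augmentation is a pure data transform, the model and training loop stay untouched, which is the point the paragraph above makes.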

The same theme, that diversity and structure in the data manufacture capability, shows up in multi-agent settings. Sequence-model agents trained against a *diverse* population of co-players develop in-context best-response strategies and even spontaneous cooperation, with no hardcoded assumptions (see "Can agents learn cooperation by adapting to diverse partners?"). The decision-making competence comes from the variety of partners the data exposed them to, not from special machinery. And for reasoning chains specifically, transformers only acquire genuine cross-distribution multi-hop ability when the training data includes explicit compositional exposure: second-hop generalization fails without it, and success has a measurable signature in how entity representations cluster (see "How do transformers learn to reason across multiple steps?").
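The idea of compositional exposure for two-hop reasoning can be illustrated with a small sketch. It assumes facts are `(head, relation, tail)` triples and treats a query as "exposed" when its second-hop fact was already used as a second hop somewhere in training. The helper names and this exact definition of exposure are assumptions made for illustration, not the protocol of the cited study.

```python
def two_hop_queries(facts):
    """Pair every fact (e1, r1, e2) with every fact (e2, r2, e3) that continues it."""
    by_head = {}
    for fact in facts:
        by_head.setdefault(fact[0], []).append(fact)
    queries = []
    for first in facts:
        for second in by_head.get(first[2], []):
            queries.append((first, second))
    return queries


def second_hop_exposed(query, train_queries):
    """True if the query's second-hop fact appeared as a second hop in training."""
    seen_second_hops = {second for _, second in train_queries}
    return query[1] in seen_second_hops


# Toy knowledge base with placeholder entities and relations.
facts = [
    ("a", "r1", "b"),
    ("b", "r2", "c"),
    ("d", "r1", "e"),
    ("e", "r2", "f"),
    ("g", "r3", "b"),
]
queries = two_hop_queries(facts)
train = [queries[0]]  # only the a -> b -> c composition is seen in training

for q in queries[1:]:
    print(q, "exposed" if second_hop_exposed(q, train) else "unexposed")
```

Under the claim in the paragraph above, held-out queries tagged as unexposed are the ones where second-hop generalization would be expected to fail.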

The sharp caveat the corpus adds: be careful what you call 'learning.' Several notes suggest transformers often clear these tasks by matching memorized computation subgraphs rather than learning systematic rules, and they collapse on novel compositions with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). Probes of models trained on orbital mechanics and games find task-specific heuristics, not coherent world models (see "Do foundation models learn world models or task-specific shortcuts?"). So the data properties that 'enable' in-context decision-making may be enabling sophisticated pattern interpolation over the trajectory distribution rather than true algorithmic understanding, which is exactly why trajectory coverage and compositional exposure matter so much. The model can only interpolate where the data has been.
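A back-of-the-envelope sketch of why compounding errors are so damaging, under the simplifying assumption (not made by the source) that each reasoning step succeeds independently with the same probability. The function name is hypothetical.

```python
def chain_accuracy(per_step_accuracy, n_steps):
    """Probability that every step in a chain is correct, assuming independence."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_accuracy ** n_steps


for steps in (1, 5, 10, 20):
    print(steps, round(chain_accuracy(0.95, steps), 3))
```

With a per-step accuracy of 0.95, a ten-step chain is fully correct only about 60% of the time under this assumption; real errors are unlikely to be independent, so this is an intuition pump rather than a prediction.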

The thing you might not have expected: this whole line of work quietly inverts the usual story. Capability we'd instinctively attribute to architecture (planning, generalization, cooperation, multi-step reasoning) keeps turning out to be unlocked by properties of the data stream instead. There's even evidence that the reasoning was latent in the base model all along, and minimal training merely *elicits* it (see "Do base models already contain hidden reasoning ability?"). If that's right, then 'what data enables in-context decision-making' and 'what data surfaces a skill the model already has' may be closer to the same question than they look.


Sources (7 notes)

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Can embedding future information in training data improve planning?

TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.

Can agents learn cooperation by adapting to diverse partners?

Sequence model agents trained against diverse co-players develop in-context best-response strategies that naturally resolve into cooperation. Mutual vulnerability to exploitation creates pressure that drives cooperative mutual adaptation without hardcoded assumptions or timescale separation.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do foundation models learn world models or task-specific shortcuts?

Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about what data properties enable transformers to learn sequential decision-making in context. The question remains open: does capability truly emerge from data *structure*, or have newer models/methods shifted the regime?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints the corpus identified:
- Trajectory burstiness is *required*: isolated input-output pairs fail; coherent action-sequences from shared tasks succeed (~2023–24).
- Lookahead tokens (special tokens encoding future state) enable planning without architectural changes; reframes planning as data-curation (~2024–25).
- Diverse co-player populations in training data unlock in-context best-response and cooperation without hardcoding (~2026).
- Explicit compositional exposure in training data is necessary for cross-distribution multi-hop reasoning; without it, generalization fails (~2025).
- Risk: transformers may match memorized subgraph patterns rather than learn systematic rules; novel compositions fail with compounding errors (~2023–25).

Anchor papers (verify; mind their dates):
- arXiv:2312.03801 (2023-12): Generalization to New Sequential Decision Making Tasks
- arXiv:2505.23653 (2025-05): How do Transformers Learn Implicit Reasoning?
- arXiv:2602.16301 (2026-02): Multi-agent cooperation through in-context co-player inference
- arXiv:2507.06952 (2025-07): What Has a Foundation Model Found?

Your task:
(1) RE-TEST EACH CONSTRAINT. For trajectory burstiness, lookahead tokens, and compositional exposure: has scaling, retrieval-augmentation, in-context prompt engineering, or new evaluation (e.g., on truly novel task families unseen in any form) since relaxed any claim? Separate the durable question (does data structure matter?) from perishable limitations (e.g., 'isolated pairs fail'—does few-shot or chain-of-thought patching work now?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming transformers *do* learn systematic rules, or that architecture-level innovations (sparse attention, memory augmentation, explicit search) bypass data-structure requirements.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If lookahead tokens truly decouple planning from architecture, what happens if you remove lookahead at inference—does the skill persist? (b) Can trajectory burstiness be *approximated* via synthetic data, retrieval, or learned encodings, or is raw distribution coverage irreducible?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
