SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Can models learn better by training on messy exploration paths?

Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?

Synthesis note · 2026-06-03 · sourced from Deep Research

Responding to the opacity of OpenAI's o1, this real-time replication effort contributes more than engineering: it proposes a training paradigm called journey learning. Where standard training teaches a model the shortcut (the clean path from problem to correct answer), journey learning encourages models to learn the complete exploration process: trial and error, reflection, and backtracking. The bet is that o1-style deep reasoning comes from internalizing how to search, including dead ends and recoveries, not from memorizing polished solution traces. The paper also models a methodological stance: transparent, continuously documented, community-engaged research that reports failures as well as successes.

The keeper is the training-data philosophy: use the messy trajectory, including failed attempts and self-correction, as the supervision signal. The claim is that this is what teaches robust reasoning, whereas shortcut-only data teaches confident but brittle answers.
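To make the contrast concrete, here is a minimal sketch of how the two kinds of supervision target could be built from one exploration trace. It is an illustration only, not the paper's pipeline: the `Step` class, the `on_solution_path` flag, both function names, and the toy algebra trace are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One line of a reasoning trace (hypothetical structure)."""
    text: str
    on_solution_path: bool  # False for dead ends, checks that fail, and backtracking


def shortcut_target(steps, answer):
    """Shortcut data: keep only the steps that lead directly to the answer."""
    kept = [s.text for s in steps if s.on_solution_path]
    return "\n".join(kept + [f"Answer: {answer}"])


def journey_target(steps, answer):
    """Journey data: keep the whole exploration, dead ends and recoveries included."""
    return "\n".join([s.text for s in steps] + [f"Answer: {answer}"])


# Toy trace for solving x^2 - 5x + 6 = 0, with one failed attempt and a recovery.
trace = [
    Step("Try factoring x^2 - 5x + 6 as (x - 1)(x - 6).", False),
    Step("Check: (x - 1)(x - 6) = x^2 - 7x + 6, which does not match.", False),
    Step("Backtrack: need two numbers with product 6 and sum 5.", False),
    Step("Factor as (x - 2)(x - 3).", True),
    Step("So x = 2 or x = 3.", True),
]

shortcut = shortcut_target(trace, "x = 2 or x = 3")
journey = journey_target(trace, "x = 2 or x = 3")
```

Both targets end in the same answer. The difference is that the journey target also shows the model a failed attempt, the check that exposed it, and the recovery, which is the signal the note argues matters.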

This sits in the vault's reasoning-training thread. It is the constructive counter to the finding in "Is reflection in reasoning models actually fixing mistakes?": journey learning tries to make exploration genuine rather than performative. It also pairs with "When does RL actually extend reasoning beyond pretraining?", since both concern what reasoning data actually teaches the model.


Original note title

journey learning trains models on the complete exploration process — trial error reflection and backtracking — not just shortcut solutions