INQUIRING LINE

Why does imitation learning alone plateau without outcome-based refinement?

This explores why copying expert demonstrations (imitation learning / SFT) eventually stops improving a model, and why pairing it with reward signals tied to whether the answer was actually right is what unlocks further gains.


The short version from the corpus: imitation teaches a model how to *look* like it is reasoning without teaching it whether the reasoning *worked*. The clearest evidence is that imitation captures surface form rather than substance. Models trained to mimic ChatGPT learn its confident, fluent style and fool human evaluators, but they do not improve factuality or generalization on novel tasks; the ceiling is set by the base model, not the fine-tuning (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Even more starkly, instruction tuning on semantically empty or deliberately wrong instructions performs nearly as well as tuning on correct ones: what transfers is knowledge of the *output format*, not task understanding (see "Does instruction tuning teach task understanding or output format?"). Imitation, in other words, is learning the shape of the answer space, and that is a finite well.

Outcome-based refinement supplies the thing imitation structurally cannot: a signal about whether a given attempt actually succeeded. The cleanest demonstration is a curriculum: running supervised/imitation training first to establish reasonable behavior, *then* outcome rewards (RLVR) to sharpen it, beats either alone. The imitation phase matters precisely because it produces rollouts good enough that outcome rewards become informative; without it, the reward signal is too sparse to learn from (see "Does sequencing imitation then exploration training improve reasoning?"). RL training even shows a predictable two-phase arc: first it consolidates execution correctness (the procedural stuff imitation is good at), then the bottleneck shifts to strategic planning, exactly the exploratory territory imitation never reaches (see "Does RL training follow a predictable two-phase learning sequence?").
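The sparsity point can be made concrete with a small sketch. It assumes a group-relative advantage in the style of GRPO and a binary 0/1 outcome reward; the function name `group_advantages` and the example reward lists are illustrative assumptions, not taken from the cited notes. When every rollout in a group fails, the normalized advantages are all zero, so the outcome reward carries nothing to learn from.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of rollouts for a prompt.

    Hypothetical helper: subtracts the group mean and divides by the
    group standard deviation (plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed scenario 1: no imitation warm-up, every rollout fails.
cold_start_rewards = [0.0, 0.0, 0.0, 0.0]
cold_adv = group_advantages(cold_start_rewards)

# Assumed scenario 2: after imitation, some rollouts succeed.
warm_start_rewards = [0.0, 1.0, 0.0, 1.0]
warm_adv = group_advantages(warm_start_rewards)

print("cold start advantages:", cold_adv)
print("warm start advantages:", warm_adv)
```

In the cold-start group every advantage is zero, so a policy-gradient update would not move the model; in the warm-start group successes get positive advantages and failures negative ones, which is the sense in which imitation makes outcome rewards informative.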

The deeper reason imitation alone plateaus is that pure self-driven improvement is circular. A model can only imitate or refine against itself for so long before hitting the generation-verification gap, diversity collapse, and reward hacking; every method that reliably keeps improving smuggles in an *external* anchor, such as a verifier, a judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). Outcome-based refinement is one concrete form of that external anchor. And the anchor needn't be a raw number: when numerical rewards plateau, natural-language critiques that explain *why* a solution failed can move models off the plateau, because the scalar reward lacks information about how to improve (see "Can natural language feedback overcome numerical reward plateaus?").

The most interesting wrinkle is that the imitation-vs-outcome dichotomy isn't actually binary; the corpus has been busy filling in the middle. Supervised RL rewards a model by how closely each step matches an expert's, giving dense signal even when every rollout fails, bridging rigid token-by-token imitation and sparse outcome-only rewards (see "Can step-wise expert rewards help small models learn hard reasoning?"). A 'third paradigm' lets agents treat the consequences of their own actions as supervision: no external reward, yet a far better warm-start for later RL than imitation gives (see "Can agents learn from their own actions without external rewards?"). And agents can learn from outcomes without any weight update at all, by storing verbal reflections on success/failure in episodic memory; the binary outcome signal is what prevents the model from rationalizing its mistakes away (see "Can agents learn from failure without updating their weights?").
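As a rough illustration of why a step-wise expert reward stays dense when the outcome reward does not, here is a minimal sketch. It assumes string similarity from Python's standard `difflib.SequenceMatcher` as the per-step alignment measure; that choice, the function names, and the toy reasoning traces are all assumptions made for illustration and are not claimed to be the method used in the cited note.

```python
from difflib import SequenceMatcher


def outcome_reward(final_answer: str, expected: str) -> float:
    """Sparse signal: 1.0 only if the final answer matches exactly."""
    return 1.0 if final_answer.strip() == expected.strip() else 0.0


def step_rewards(model_steps, expert_steps):
    """Dense signal: similarity of each model step to the expert's step.

    Hypothetical scoring: SequenceMatcher.ratio() returns a float in [0, 1].
    """
    return [
        SequenceMatcher(None, m, e).ratio()
        for m, e in zip(model_steps, expert_steps)
    ]


# Toy traces (invented for illustration): the model slips on step 2.
expert = ["let x = 3", "2 * x = 6", "answer: 6"]
model = ["let x = 3", "2 * x = 5", "answer: 5"]

sparse = outcome_reward(model[-1], expert[-1])
dense = step_rewards(model, expert)

print("outcome reward:", sparse)
print("step-wise rewards:", dense)
```

The failed rollout earns an outcome reward of zero, yet each step still receives a graded score, so there is something to optimize even when no attempt reaches the right answer.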

The thing worth walking away with: the plateau isn't a flaw in imitation; it's the natural limit of a method whose job is to teach *form*. You learn the moves by copying; you only learn which moves win by being told when you won. The frontier of this corpus is less about choosing imitation or outcomes and more about engineering the gradient between them (step-wise expert similarity, self-generated consequences, episodic reflection) so the reward signal stays informative the whole way up.


Sources (9 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Next inquiring lines