Why do human-curated thought examples fail to improve model thinking?

This explores why feeding models clean, human-written examples of 'good thinking' (curated reasoning traces, labeled exemplars, polished solutions) often doesn't make them reason better — and what the corpus says actually does.

This reads the question as: when we hand a model tidy, human-curated examples of good reasoning and train on them, why doesn't the model's actual thinking improve? The corpus has a surprisingly sharp answer: clean examples teach the *look* of reasoning, not the *engine* of it. Models trained to imitate confident, fluent reasoning mostly learn surface style. "Can imitating ChatGPT fool evaluators into thinking models improved?" shows that imitation models fool human evaluators by mimicking the confident, fluent style of a stronger model without improving factuality or generalization on novel tasks. "Do reasoning traces show how models actually think?" pushes this further: reasoning traces themselves are persuasive appearances, where logically invalid steps perform almost as well as valid ones, so curating examples for their *correct appearance* optimizes the wrong thing.

The deeper problem is that polished examples strip out exactly what's useful. "Does training on messy search processes improve reasoning?" found that training on the messy search process (wrong turns, dead ends, backtracking) yields 25% higher accuracy than training on clean optimal trajectories alone. The mistakes are the lesson: they teach the model an internal model of *how to search*, which a curated 'here's the right answer' example deletes. Human-curated thought is curated precisely to hide the wandering, and the wandering is where the skill lives.
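To make the contrast concrete, here is a minimal sketch of the two kinds of training text for a single problem. Everything in it is an assumption made for illustration: the toy arithmetic task, the names `explore` and `build_examples`, and the trace wording are hypothetical and are not the format used in the work the note summarizes. It only shows how a serialized search (attempts, dead ends, backtracking) differs from the curated solution path extracted from that same search.

```python
from typing import List, Tuple


def explore(nums: List[int], target: int, trace: List[str], path: List[str]) -> bool:
    """Depth-first search that combines two numbers at a time.

    `trace` records every attempt, including dead ends (the messy process).
    `path` keeps only the steps on the branch that finally succeeds (the clean result).
    """
    if target in nums:
        trace.append(f"reached {target}")
        return True
    if len(nums) == 1:
        trace.append(f"dead end with {nums[0]}; backtrack")
        return False
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            hi, lo = max(nums[i], nums[j]), min(nums[i], nums[j])
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            for symbol, value in (("+", hi + lo), ("-", hi - lo), ("*", hi * lo)):
                step = f"{hi} {symbol} {lo} = {value}"
                trace.append(f"try {step}")
                path.append(step)
                if explore(rest + [value], target, trace, path):
                    return True
                path.pop()  # undo the step: this branch failed
    return False


def build_examples(nums: List[int], target: int) -> Tuple[str, str]:
    """Return (messy search trace, clean solution path) as serialized strings."""
    trace: List[str] = []
    path: List[str] = []
    if not explore(list(nums), target, trace, path):
        raise ValueError("no solution found for this toy instance")
    return "\n".join(trace), "\n".join(path)


messy, clean = build_examples([2, 3, 4], 10)
print(clean)
print(len(messy.splitlines()), "lines in the search trace")
```

In this toy setup the clean string contains only the two steps that work, while the messy string also contains every failed attempt and each backtrack. A curated dataset would keep only the first kind of string; the finding above suggests the second kind carries the signal about how to search.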

There's also a transfer failure. "Can models learn argument quality from labeled examples alone?" shows that fine-tuning on labeled quality examples teaches models surface patterns rather than the principle behind them, so the learned criteria do not transfer to new argument types. What worked instead was explicit instruction with theoretical frameworks such as RATIO or QOAM, which name the criteria directly rather than hoping the model will induce them from examples. Examples under-specify the rule; the model latches onto whatever shortcut fits the sample.

And there may be nothing to install in the first place. "Do base models already contain hidden reasoning ability?" argues that post-training *selects* reasoning already latent in the base model rather than creating it: five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit the same buried capability. "Does extended thinking help or hurt model reasoning?" complements this: the same 'thinking' mechanism hurts in vanilla models, where it induces self-doubt, and helps once RL training redirects it into productive gap analysis. If the bottleneck is elicitation and redirection, then mimicking curated examples is the wrong tool: it copies outputs instead of steering the latent process.

The unexpected takeaway: better model thinking seems to come from showing the struggle, naming the principle, or unlocking what's already there, not from showing the polished result. To go deeper on the 'examples teach style not substance' thread, start with "Do reasoning traces show how models actually think?"; for the 'mess is the lesson' thread, see "Does training on messy search processes improve reasoning?"

Sources (6 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
