Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Synthesis note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

Manual chain-of-thought prompting rests on an implicit assumption: if a human writes good reasoning examples, the model will reason better. The AutoCoT paper puts this assumption under a systematic sensitivity analysis and documents four distinct brittleness dimensions:

1. Order sensitivity: Randomly shuffling the order of few-shot CoT exemplars on GPT-3 causes accuracy fluctuations of up to 3.3% below average on GSM8K. The model is sensitive to which examples appear first, not just which examples appear.

2. Complexity sensitivity: Chain length (number of reasoning hops) must match problem difficulty. Simple exemplars (few hops) degrade performance on complex questions; complex exemplars (many hops) degrade performance on simple questions. The model over- or under-reasons to match the exemplar pattern.

3. Diversity requirement: A uniform-complexity exemplar set underperforms a mixed-complexity set. The optimal strategy is a distribution across complexity levels, not the highest-complexity exemplars. This means selecting exemplars for diversity, not just quality.

4. Style sensitivity: Different human annotators writing CoT for the same problems produce results that vary by up to 28.2% accuracy. There is no "neutral" annotation style — every annotator introduces style artifacts that interact with model processing differently.
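The order-sensitivity measurement in item 1 can be sketched as a shuffle-and-score loop. This is only an illustrative sketch, not the AutoCoT evaluation code: `order_sensitivity` and `toy_scorer` are hypothetical names, and the `eval_accuracy` callable stands in for whatever harness actually prompts the model and scores it on a benchmark.

```python
import random
import statistics
from typing import Callable, Dict, Sequence


def order_sensitivity(
    exemplars: Sequence[str],
    eval_accuracy: Callable[[Sequence[str]], float],
    n_shuffles: int = 10,
    seed: int = 0,
) -> Dict[str, float]:
    """Score the same exemplar set under several random orderings.

    eval_accuracy is an assumed callable: it receives the exemplars in one
    particular order and returns task accuracy in [0, 1].
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        order = list(exemplars)
        rng.shuffle(order)  # same exemplars, different order
        scores.append(eval_accuracy(order))
    mean = statistics.fmean(scores)
    return {
        "mean": mean,
        "worst_below_mean": mean - min(scores),  # the "below average" gap
        "spread": max(scores) - min(scores),
    }


def toy_scorer(order: Sequence[str]) -> float:
    """Stand-in scorer: pretends accuracy depends on which exemplar is first."""
    return 0.60 if order[0] == "a" else 0.50


if __name__ == "__main__":
    print(order_sensitivity(["a", "b", "c"], toy_scorer, n_shuffles=20))
```

Fixing the seed keeps the set of orderings reproducible, so a change in `worst_below_mean` between runs reflects the model or the exemplars rather than the shuffle itself.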

These four dimensions compound: a set of exemplars that is well-ordered, complexity-matched, diversity-balanced, AND style-appropriate is extremely difficult to produce manually, and what works for one task rarely generalizes to another. This is why automated approaches (AutoCoT, which uses LLM-generated and filtered pseudo-chains) can outperform manually curated exemplars despite producing less obviously "good" reasoning.

Two additional findings extend this picture. Complexity-based prompting confirms the complexity dimension with a direct mechanism: selecting exemplars with more reasoning steps consistently improves multi-step reasoning performance. On these multi-step tasks the relationship is monotonic (more complex exemplars yield better model performance), which means complexity sensitivity is not just about matching but about setting a reasoning floor. CDW-CoT (Clustered Distance-Weighted CoT) offers a practical answer to the compounding problem: it clusters the dataset, trains an optimal prompt probability distribution for each cluster, and so adapts exemplar selection to instance characteristics rather than using one-size-fits-all prompts. Making exemplar selection instance-specific rather than task-level directly addresses the finding that what works for one task rarely generalizes.
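The two selection heuristics discussed so far (most-complex exemplars versus a spread across complexity levels) can be contrasted in a few lines. This is a minimal sketch under stated assumptions: counting non-empty lines as "reasoning steps" is a proxy chosen here for illustration, and `count_steps`, `select_most_complex`, and `select_mixed` are hypothetical helpers, not functions from any of the cited papers.

```python
from typing import List


def count_steps(chain: str) -> int:
    """Complexity proxy (an assumption): one non-empty line = one reasoning step."""
    return sum(1 for line in chain.splitlines() if line.strip())


def select_most_complex(pool: List[str], k: int) -> List[str]:
    """Complexity-based selection: keep the k chains with the most steps."""
    return sorted(pool, key=count_steps, reverse=True)[:k]


def select_mixed(pool: List[str], k: int) -> List[str]:
    """Diversity-oriented selection: k chains evenly spaced over the
    complexity-sorted pool, from simplest to most complex."""
    ranked = sorted(pool, key=count_steps)
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


if __name__ == "__main__":
    # Toy pool: chains with 1 to 5 steps each.
    pool = ["\n".join(f"step {j}" for j in range(n)) for n in range(1, 6)]
    print([count_steps(c) for c in select_most_complex(pool, 2)])
    print([count_steps(c) for c in select_mixed(pool, 3)])
```

Which of the two is preferable depends on the task mix: the first follows the complexity-based prompting result, the second follows the diversity requirement in item 3.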

Latent Skill Discovery (RSD) reframes exemplar selection as a learned reasoning policy. Rather than heuristic selection (by complexity, diversity, or clustering), RSD discovers an unsupervised latent space of reasoning skills from unlabeled demonstrations, then trains a reasoning policy (via PPO) to select demonstrations based on the target task's characteristics. This addresses the compounding brittleness problem by learning which combination of skills a given problem requires, rather than relying on surface-level features like complexity or diversity. The approach implies that the four brittleness dimensions (order, complexity, diversity, style) may be symptoms of a deeper issue: exemplar selection needs to be strategic, matching the specific reasoning capabilities a problem demands.

DPP bias reveals a fifth dimension: prompt-architecture positioning. The "Demos' Position in Prompt" paper shows that moving an unchanged block of demos from the start to the end of a prompt swings accuracy by up to 20% and flips ~50% of predictions, roughly 6x the effect of shuffling the order of exemplars within the block (3.3%). This is not about which demos appear first among themselves, but where the entire demo block sits relative to the system prompt and user message. The mechanistic cause is architectural: primacy bias, induction heads, and lost-in-the-middle effects create position-dependent attention gradients that modulate ICL effectiveness. The DPP finding extends the brittleness story to a larger spatial scale: ordering effects appear at every granularity, from premise order within a task (see "How much does the order of premises actually matter for reasoning?") to exemplar order within the demo block (3.3%) to the position of the whole block in the prompt (20%).
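The two quantities reported for demo position, accuracy swing and prediction flip rate, are separate measurements, and a small sketch makes the difference concrete. Everything here is an illustrative assumption: `assemble` and `position_effect` are hypothetical helpers, the "start"/"end" labels are simplifications of the positions studied in the paper, and the prediction lists would in practice come from running a model on each prompt layout.

```python
from typing import Dict, Sequence


def assemble(system: str, demos: str, user: str, position: str) -> str:
    """Place an unchanged demo block at the start or the end of the prompt."""
    if position == "start":
        parts = [demos, system, user]
    elif position == "end":
        parts = [system, user, demos]
    else:
        raise ValueError(f"unknown position: {position!r}")
    return "\n\n".join(parts)


def position_effect(
    preds_a: Sequence[str], preds_b: Sequence[str], gold: Sequence[str]
) -> Dict[str, float]:
    """Compare predictions obtained under two demo positions."""
    if not (len(preds_a) == len(preds_b) == len(gold)) or not gold:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(gold)
    acc_a = sum(p == g for p, g in zip(preds_a, gold)) / n
    acc_b = sum(p == g for p, g in zip(preds_b, gold)) / n
    flips = sum(a != b for a, b in zip(preds_a, preds_b)) / n
    return {"accuracy_delta": acc_b - acc_a, "flip_rate": flips}


if __name__ == "__main__":
    # Toy data: half the answers change, yet net accuracy stays the same.
    gold = ["1", "2", "0", "0"]
    start_preds = ["1", "2", "3", "4"]
    end_preds = ["1", "0", "3", "0"]
    print(position_effect(start_preds, end_preds, gold))
```

Flip rate can be high even when the accuracy delta is small, because some flips turn wrong answers into right ones while others do the reverse; reporting both avoids understating how much the position change perturbs the model.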

The deeper implication connects to the question "Do language models actually use their reasoning steps?": if CoT performance is this sensitive to surface properties of exemplars (order, style, position), the reasoning chains are not cleanly driving outputs. A model that reasons correctly only when given exemplars in the right order exhibits a form of causal insufficiency: the reasoning capacity is real but brittle, heavily conditioned on surface formatting.

Related concepts in this collection

Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
exemplar sensitivity suggests causal fragility: small changes to input framing produce large output changes, evidence that reasoning is not robustly driving outputs
Can minimal reasoning chains match full explanations? Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
CoD sidesteps style sensitivity by making style irrelevant: minimal drafts have no style to vary
Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
the style dimension of exemplar sensitivity is a presentation-layer effect; both findings point to format as a powerful lever on reasoning behavior
How much does the order of premises actually matter for reasoning? When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
order sensitivity extends from exemplars to premises; shared sequential-processing mechanism
How much does demo position alone affect in-context learning accuracy? Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?
fifth dimension: prompt-architecture positioning; 6x the magnitude of within-exemplar order sensitivity
Do strict output formats hurt LLM reasoning ability? When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.
output format constraints as another brittleness dimension: while CoT brittleness is about input exemplar sensitivity, format constraints show that the output format also degrades reasoning by competing for model capacity; format is never neutral on either side
Does instruction tuning teach task understanding or output format? Exploring whether models trained on instructions actually learn the task semantics or merely learn to match output distributions. This matters because it challenges assumptions about how fine-tuning improves model behavior.
shared pattern: exemplar brittleness and instruction-tuning (IT) format-learning both show models respond to surface formatting properties rather than semantic task content; models are format-sensitive, not meaning-sensitive
