SYNTHESIS NOTE

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Synthesis note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

Manual chain-of-thought prompting rests on an implicit assumption: if a human writes good reasoning examples, the model will reason better. The AutoCoT paper exposes this assumption through systematic sensitivity analysis, documenting four distinct brittleness dimensions:

1. Order sensitivity: Randomly shuffling the order of few-shot CoT exemplars on GPT-3 causes accuracy fluctuations of up to 3.3% below average on GSM8K. The model is sensitive to which examples appear first, not just which examples appear.

2. Complexity sensitivity: Chain length (number of reasoning hops) must match problem difficulty. Simple exemplars (few hops) degrade performance on complex questions; complex exemplars (many hops) degrade performance on simple questions. The model over- or under-reasons to match the exemplar pattern.

3. Diversity requirement: A uniform-complexity exemplar set underperforms a mixed-complexity set. The optimal strategy is a distribution across complexity levels, not the highest-complexity exemplars. This means selecting exemplars for diversity, not just quality.

4. Style sensitivity: Different human annotators writing CoT for the same problems produce results that vary by up to 28.2% accuracy. There is no "neutral" annotation style — every annotator introduces style artifacts that interact with model processing differently.
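The order-sensitivity measurement in item 1 can be sketched as a small harness that scores several random orderings of one fixed exemplar set. This is a minimal illustration, not the AutoCoT authors' protocol: `evaluate` is a hypothetical callable standing in for "build the prompt, run the model on the benchmark, return accuracy", and `n_shuffles` and `seed` are assumed parameters.

```python
import random
import statistics
from typing import Callable, Dict, Sequence


def order_sensitivity(
    exemplars: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    n_shuffles: int = 20,
    seed: int = 0,
) -> Dict[str, float]:
    """Score several random orderings of the same exemplar set.

    `evaluate` is a hypothetical callable: it receives the exemplars in a
    given order and returns benchmark accuracy for that prompt.
    """
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    scores = []
    for _ in range(n_shuffles):
        order = list(exemplars)
        rng.shuffle(order)  # same exemplars, different order
        scores.append(evaluate(order))
    mean = statistics.mean(scores)
    return {
        "mean": mean,
        "worst_drop": mean - min(scores),  # largest shortfall below the average
        "spread": max(scores) - min(scores),  # best ordering minus worst ordering
    }
```

The `worst_drop` field corresponds to the "below average" framing used above. Swapping annotators instead of orderings in the loop would give an analogous measure for style sensitivity.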

These four dimensions compound: a set of exemplars that is well-ordered, complexity-matched, diversity-balanced, AND style-appropriate is extremely difficult to produce manually, and what works for one task rarely generalizes to another. This is why automated approaches (AutoCoT, which uses LLM-generated and filtered pseudo-chains) can outperform manually curated exemplars despite producing less obviously "good" reasoning.

Two additional findings extend this picture. Complexity-based prompting confirms the complexity dimension with a direct mechanism: selecting exemplars with more reasoning steps consistently improves multi-step reasoning performance. The relationship is monotonic (more exemplar complexity yields better model performance), which suggests that complexity sensitivity is not only about matching exemplars to problem difficulty but also about setting a reasoning floor.

CDW-CoT (Clustered Distance-Weighted CoT) provides a practical solution to the compounding problem: by clustering the dataset and training optimal prompt probability distributions per cluster, it dynamically adapts exemplar selection to instance characteristics rather than using one-size-fits-all prompts. This directly addresses the finding that what works for one task rarely generalizes, by making exemplar selection instance-specific rather than task-level.
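The two heuristic selection strategies discussed in this note, picking the most complex exemplars and picking a mix across complexity levels, can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Exemplar` type and function names are hypothetical, and counting non-empty rationale lines is only a rough proxy for the number of reasoning steps.

```python
from typing import List, NamedTuple, Sequence


class Exemplar(NamedTuple):
    question: str
    rationale: str  # assumption: one reasoning step per line
    answer: str


def step_count(ex: Exemplar) -> int:
    """Proxy for chain complexity: number of non-empty rationale lines."""
    return sum(1 for line in ex.rationale.splitlines() if line.strip())


def select_most_complex(pool: Sequence[Exemplar], k: int) -> List[Exemplar]:
    """Complexity-based selection: keep the k exemplars with the most steps."""
    return sorted(pool, key=step_count, reverse=True)[:k]


def select_mixed(pool: Sequence[Exemplar], k: int) -> List[Exemplar]:
    """Mixed-complexity selection: k exemplars spread evenly over the
    complexity ranking, from the simplest to the most complex."""
    ranked = sorted(pool, key=step_count)
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]  # single pick: the median exemplar
    last = len(ranked) - 1
    indices = [round(i * last / (k - 1)) for i in range(k)]
    return [ranked[i] for i in indices]
```

Both functions are task-level heuristics. Per-instance approaches such as CDW-CoT would replace the fixed ranking with a selection that depends on the incoming question.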

Latent Skill Discovery (RSD) reframes exemplar selection as a learned reasoning policy. Rather than heuristic selection (by complexity, diversity, or clustering), RSD discovers an unsupervised latent space of reasoning skills from unlabeled demonstrations, then trains a reasoning policy (via PPO) to select demonstrations based on the target task's characteristics. This addresses the compounding brittleness problem by learning which combination of skills a given problem requires, rather than relying on surface-level features like complexity or diversity. The approach implies that the four brittleness dimensions (order, complexity, diversity, style) may be symptoms of a deeper issue: exemplar selection needs to be strategic, matching the specific reasoning capabilities a problem demands.

DPP bias reveals a fifth dimension: prompt-architecture positioning. The "Demos' Position in Prompt" paper shows that moving an unchanged block of demos from the start to the end of a prompt swings accuracy by up to 20% and flips ~50% of predictions, roughly 6x the effect of shuffling exemplar order (3.3%). This is not about which demos appear first among themselves, but where the entire demo block sits relative to the system prompt and user message. The mechanistic cause is architectural: primacy bias, induction heads, and lost-in-the-middle effects create position-dependent attention gradients that modulate ICL effectiveness. The DPP finding extends the brittleness story to a larger spatial scale: ordering effects appear at every granularity, from premise order within a single task (see "How much does the order of premises actually matter for reasoning?") to exemplar order within the demo block (3.3%) to the position of the whole block in the prompt (20%).
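A position experiment of this kind can be sketched with two helpers: one that places an unchanged demo block at the start or end of the prompt, and one that measures how many predictions change between two runs. This is a minimal sketch, not the paper's code: the function names, the `"start"`/`"end"` labels, and the plain-text prompt layout are all assumptions made for illustration.

```python
from typing import Sequence


def build_prompt(system: str, demos: Sequence[str], query: str, position: str) -> str:
    """Place the same demo block at the start or the end of the prompt.

    The demos and their internal order never change; only the location
    of the whole block relative to the system text and the query does.
    """
    demo_block = "\n\n".join(demos)
    if position == "start":
        parts = [demo_block, system, query]
    elif position == "end":
        parts = [system, query, demo_block]
    else:
        raise ValueError(f"unknown position: {position!r}")
    return "\n\n".join(parts)


def flip_rate(preds_a: Sequence[str], preds_b: Sequence[str]) -> float:
    """Fraction of items whose prediction differs between two runs."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    changed = sum(a != b for a, b in zip(preds_a, preds_b))
    return changed / len(preds_a)
```

Flip rate is worth tracking alongside accuracy because two placements can reach similar accuracy while disagreeing on many individual items.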

The deeper implication is a connection to the related note "Do language models actually use their reasoning steps?": if CoT performance is this sensitive to surface properties of exemplars (order, style, position), the reasoning chains are not cleanly driving outputs. A model that reasons correctly only when given exemplars in the right order is exhibiting a form of causal insufficiency: the reasoning capacity is real but brittle, heavily conditioned on surface formatting.

Original note title

cot exemplar performance is brittle across four dimensions order complexity diversity and style