INQUIRING LINE

Why does exemplar performance vary across order complexity diversity and style?

This line asks why the few-shot chain-of-thought (CoT) exemplars you hand a model produce very different results depending on how they are ordered, how hard they are, how varied they are, and who wrote them. The corpus suggests one answer: models latch onto the surface form of the examples rather than the reasoning they are meant to demonstrate.


The same task can swing in accuracy just because you reshuffled your examples, matched them poorly to the problem's difficulty, used examples that were too similar, or had a different person write them. The anchor finding is that chain-of-thought exemplars are brittle along exactly these four axes: reordering alone causes swings of about 3.3%, and switching annotators drives up to 28.2% variance. The effects compound, so hand-curated examples do not transfer reliably across tasks (see: Why do chain-of-thought examples fail across different conditions?). The deeper question is *why* such cosmetic-seeming changes matter so much, and the corpus points to one culprit: the model is responding to the form of your examples, not their content.
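One way to check how much ordering alone matters on your own task is to score the same exemplar set under several orderings and look at the spread. The sketch below is a minimal harness, not the method used in the cited notes. `evaluate` is a hypothetical callable you would supply (it should build a prompt from the ordered exemplars, run your model on a held-out set, and return accuracy), and `toy_evaluate` is a made-up stand-in so the example runs without a model.

```python
import math
import random
import statistics
from typing import Callable, Dict, List, Sequence


def sample_orderings(exemplars: Sequence[str], max_orderings: int, seed: int = 0) -> List[List[str]]:
    """Return up to max_orderings distinct orderings of the exemplars."""
    if max_orderings < 1:
        raise ValueError("max_orderings must be at least 1")
    rng = random.Random(seed)
    # There are only n! distinct orderings, so cap the target at that.
    target = min(max_orderings, math.factorial(len(exemplars)))
    seen = set()
    while len(seen) < target:
        idx = list(range(len(exemplars)))
        rng.shuffle(idx)
        seen.add(tuple(idx))
    return [[exemplars[i] for i in order] for order in sorted(seen)]


def order_sensitivity(
    exemplars: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    max_orderings: int = 24,
    seed: int = 0,
) -> Dict[str, float]:
    """Score one exemplar set under several orderings and summarize the spread."""
    scores = [evaluate(order) for order in sample_orderings(exemplars, max_orderings, seed)]
    return {
        "n_orderings": len(scores),
        "min": min(scores),
        "max": max(scores),
        "swing": max(scores) - min(scores),  # best ordering minus worst ordering
        "stdev": statistics.pstdev(scores),
    }


def toy_evaluate(order: Sequence[str]) -> float:
    """Hypothetical stand-in for a real model call: pretends exemplar "A" helps most when last."""
    return 0.70 if order[-1] == "A" else 0.65


print(order_sensitivity(["A", "B", "C"], toy_evaluate))
```

Replacing `toy_evaluate` with a real evaluation gives an order-only swing for your own task and model. The same harness could be reused across annotators by passing each annotator's exemplar set in turn.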

The sharpest evidence is that logically invalid reasoning chains perform about as well as valid ones. On BIG-Bench Hard, illogical exemplars matched the performance of valid ones, because what the model picks up is the shape and rhythm of 'reasoning,' not the logic (see: Does logical validity actually drive chain-of-thought gains?). If the model is keying on style and structure rather than substance, then style and structure become exactly the levers that move performance, which is why a different annotator's voice or a different ordering swings the result. The same dynamic shows up elsewhere: imitation-trained models convincingly copy a stronger model's confident style while failing to improve factuality or generalization on novel tasks (see: Can imitating ChatGPT fool evaluators into thinking models improved?). Surface mimicry is cheap and load-bearing; genuine reasoning is neither.

The 'complexity' axis turns out to be a mirage in a similar way. Models don't actually break at hard problems; they break at *unfamiliar* ones. Reasoning failures track instance-level novelty, not task-level difficulty, so a chain tends to succeed if the model saw similar instances in training, regardless of how long or 'complex' it looks (see: Do language models fail at reasoning due to complexity or novelty?). The apparent signals of difficulty are also unreliable: in controlled maze experiments, trace length correlates with difficulty only inside the training distribution and decouples entirely outside it, reflecting recalled schemas rather than adaptive effort (see: Does longer reasoning actually mean harder problems?). When grammatical structure genuinely deepens through recursion and embedding, competence degrades systematically, exposing surface heuristics standing in for real structural rules (see: Does LLM grammatical performance decline with structural complexity?). So matching an exemplar to problem 'complexity' is really matching it to distributional familiarity.

Diversity and order have their own physics. Because difficulty is about familiarity, a model-internal signal can stand in for human difficulty labels: ordering demonstrations from sparse (harder) to dense (easier) by last-layer activation sparsity improves in-context performance, which helps explain why arbitrary orderings hurt (see: Can representation sparsity order few-shot demonstrations effectively?). Diversity's effect isn't even fixed in direction. The way preference tuning reshapes variety reverses across domains, reducing it where convergence is rewarded (code) and increasing it where distinctiveness is rewarded (creative writing) (see: Does preference tuning always reduce diversity the same way?). There's also an inverted U for length: accuracy peaks at an intermediate chain length that depends on both task and model, so an exemplar that is well sized for one model overshoots or undershoots for another (see: Why does chain of thought accuracy eventually decline with length?).
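The sparse-to-dense ordering idea can be sketched in a few lines. This is an illustration under stated assumptions, not the cited method's implementation. Sparsity is defined here as the fraction of activations whose magnitude is at most a small threshold `eps`, which is an assumed definition. The activation vectors are made-up numbers standing in for whatever last-layer activations you would extract from your own model, and `activation_sparsity` and `order_sparse_to_dense` are hypothetical helper names.

```python
from typing import List, Sequence


def activation_sparsity(activations: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries with magnitude at most eps (an assumed definition of sparsity)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if abs(a) <= eps) / len(activations)


def order_sparse_to_dense(
    demos: Sequence[str],
    activations_by_demo: Sequence[Sequence[float]],
    eps: float = 1e-3,
) -> List[str]:
    """Order demonstrations from highest sparsity to lowest, keeping input order on ties."""
    if len(demos) != len(activations_by_demo):
        raise ValueError("need exactly one activation vector per demonstration")
    scored = [
        (activation_sparsity(acts, eps), i) for i, acts in enumerate(activations_by_demo)
    ]
    # Sort by descending sparsity; the index breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [demos[i] for _, i in scored]


# Made-up activations for three demonstrations (sparsity 0.5, 0.75, 0.0).
demos = ["demo_a", "demo_b", "demo_c"]
activations = [
    [0.0, 0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.9],
    [0.3, 0.4, 0.1, 0.2],
]
print(order_sparse_to_dense(demos, activations))
```

The point of the design is that the ordering signal comes from the model's own representations, so no human difficulty labels are needed.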

The thing worth carrying away: exemplar brittleness isn't four separate quirks to engineer around. It's one fact wearing four costumes. The model is fitting the surface statistics of whatever you show it. That should make you skeptical of the whole project of manual exemplar curation, and even of how we *measure* these gains. Apparent capability jumps can dissolve into smooth, predictable curves once you replace discontinuous metrics with continuous ones, hinting that some of the 'performance' we tune for is an artifact of how we score it (see: Are LLM emergent abilities real or measurement artifacts?).
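A small arithmetic sketch shows how the choice of metric alone can turn smooth improvement into an apparent jump. It assumes, purely for illustration, that each token of an answer is correct independently with probability `p`. Under that assumption the continuous metric (per-token accuracy) is just `p`, while the discontinuous metric (exact match on an answer of `answer_length` tokens) is `p ** answer_length`. The helper name and the sample values are hypothetical.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent per-token errors."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be in [0, 1]")
    if answer_length < 1:
        raise ValueError("answer_length must be at least 1")
    return per_token_accuracy ** answer_length


# Per-token accuracy rises in small, steady steps; exact match stays near zero, then climbs fast.
answer_length = 20
print("per_token  exact_match")
for p in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{p:9.2f}  {exact_match_rate(p, answer_length):11.4f}")
```

Both columns describe the same hypothetical outputs. Only the scoring rule differs, which is the sense in which a sharp transition can be a property of the metric rather than of the model.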


Sources (10 notes)

Why do chain-of-thought examples fail across different conditions?

Human-written CoT exemplars degrade performance when reordered (3.3% swings), mismatched to problem complexity, lacking diversity, or written by different annotators (up to 28.2% variance). These four dimensions compound, making manual exemplar curation unreliable across tasks.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
