Why does comparison reasoning generalize better than composition reasoning?
This explores why tasks that ask a model to compare or contrast two things tend to hold up outside training data, while tasks that chain several reasoning steps together (composition) tend to break. The corpus speaks more directly to why composition fails than to why comparison is robust, so the answer is partly read off that asymmetry.
The honest starting point: the collection has a lot to say about why composition collapses, and comparison's relative durability falls out of that as the mirror image. The single sharpest finding is that transformers don't actually compose: they reduce compositional reasoning to memorizing computation subgraphs from training and matching against them, then fail drastically on novel combinations (see: Do transformers actually learn systematic compositional reasoning?). The key detail is that errors *compound across reasoning steps*. Composition is a chain, and every link is a fresh chance for an unfamiliar instance to derail the whole thing.
That compounding is the heart of it. Reasoning failures, it turns out, aren't triggered by abstract task complexity but by *instance-level unfamiliarity*: models fit patterns tied to specific examples rather than general algorithms, so a chain only succeeds when each of its steps resembles something seen before (see: Do language models fail at reasoning due to complexity or novelty?). A comparison is typically one shallow operation over two items; a composition is many operations stacked. If each step has some independent chance of hitting an unfamiliar pattern, the probability of getting the *whole chain* right decays multiplicatively with depth, which is consistent with the exponential drop-off in success that shows up when reasoning models are pushed to deeper problems (see: Why do reasoning LLMs fail at deeper problem solving?). Comparison generalizes better not because it's a smarter kind of thinking, but because it's *shorter*: fewer links, fewer places to break.
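The multiplicative decay described above can be made concrete with a small sketch. It rests on two simplifying assumptions that are not taken from the cited notes: every step succeeds independently, and every step has the same success probability. The function name `chain_success_probability` and the value 0.95 are hypothetical, chosen only for illustration.

```python
def chain_success_probability(step_success: float, depth: int) -> float:
    """Probability that all `depth` steps of a chain succeed.

    Illustrative assumption: steps are independent and share the same
    per-step success probability, so the chain succeeds with p ** depth.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_success ** depth


if __name__ == "__main__":
    p = 0.95  # hypothetical per-step success rate, not a measured value
    for depth in (1, 2, 5, 10, 20, 50):
        # depth=1 stands in for a shallow comparison; larger depths for compositions
        print(f"{depth:>3} steps: {chain_success_probability(p, depth):.3f}")
```

Under these assumptions, a step that is right 95% of the time gives a one-step comparison a 95% success rate, while a 10-step chain drops to roughly 60% and a 20-step chain to roughly 36%. Real reasoning steps are unlikely to be independent or equally reliable, so this is a picture of the trend, not a prediction.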
The same pattern explains why chain-of-thought, the workhorse of composition, is so fragile. CoT reproduces the *form* of reasoning through pattern-matching against familiar schemata rather than performing genuine inference, so it degrades predictably the moment the task, length, or format drifts from training (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?). A model imitating reasoning form can fake a short comparison convincingly; a long composition exposes the imitation because the fluent-but-invalid logic has to stay coherent across many dependent steps (see: What makes chain-of-thought reasoning actually work?).
There's a wrinkle worth knowing: some of what looks like a 'composition can't generalize' wall is actually an *execution* ceiling, not a reasoning one. Models often know the algorithm but can't run it across many steps in text-only generation; give them tools and they cross the supposed cliff (see: Are reasoning model collapses really failures of reasoning?). That reframes the question: composition may not be inherently less generalizable than comparison so much as it makes far heavier demands on a brittle step-by-step execution channel, where comparison barely taxes it at all.
The constructive flip side is that you can claw some of composition's generalization back by changing its structure. Selectively augmenting natural language with symbolic scaffolding (not full formalization, just enough to pin down the structure between steps) beats both pure language and pure logic, because it gives the chain a backbone that doesn't depend on having memorized the exact instance (see: Why does partial formalization outperform full symbolic logic?). The lesson hiding in all of this: composition isn't doomed, but its generalization is bottlenecked by chain length and execution, and the fixes are about shortening, structuring, or offloading the chain, not about thinking harder.
Sources (8 notes)
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.