Why does second-hop reasoning fail when composed with out-of-distribution triples?
This explores why a model that can chain two reasoning steps on familiar facts breaks down when the second step lands on a fact combination it never saw during training.
The corpus offers a surprisingly mechanical answer: multi-hop reasoning isn't a single capability that transfers freely. It's built in stages, and the last stage is the fragile one. A controlled study of how transformers acquire multi-step reasoning found three developmental phases (first memorizing individual facts, then generalizing within the training distribution, and only last reasoning across distributions), and the key result is that second-hop generalization only appears when the model gets explicit compositional exposure during training (How do transformers learn to reason across multiple steps?). In other words, the second hop doesn't 'come for free' once the first hop works. If the model never practiced composing across the relevant fact regions, an out-of-distribution triple gives it nothing to recombine.
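To make "explicit compositional exposure" concrete, here is a minimal sketch of how such a train/test split could be built. It is a toy under stated assumptions, not the cited study's actual setup: the function name `build_two_hop_splits`, the relation labels `r1` and `r2`, and the `held_out_fraction` parameter are all hypothetical. Every single-hop fact appears in training, but two-hop compositions that pass through a held-out bridge entity are reserved for evaluation.

```python
import random


def build_two_hop_splits(num_entities=20, held_out_fraction=0.3, seed=0):
    """Toy two-hop data: (head, r1) -> bridge, then (bridge, r2) -> tail.

    Assumption for illustration: 'out-of-distribution' means the bridge
    entity never appeared inside a composed (two-hop) training example,
    even though its single-hop facts were all seen.
    """
    rng = random.Random(seed)
    entities = [f"e{i}" for i in range(num_entities)]

    # Hypothetical single-hop facts: one target per entity per relation.
    first_hop = {e: rng.choice(entities) for e in entities}
    second_hop = {e: rng.choice(entities) for e in entities}

    # Hold out a subset of bridge entities from all composed training examples.
    shuffled = entities[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * held_out_fraction)
    held_out_bridges = set(shuffled[:cut])

    # Atomic facts are always in training (the memorization phase).
    atomic = [(h, "r1", t) for h, t in first_hop.items()]
    atomic += [(h, "r2", t) for h, t in second_hop.items()]

    composed_train, composed_ood = [], []
    for head, bridge in first_hop.items():
        example = (head, bridge, second_hop[bridge])  # (head, bridge, answer)
        if bridge in held_out_bridges:
            composed_ood.append(example)    # second hop never composed in training
        else:
            composed_train.append(example)  # explicit compositional exposure

    return {
        "atomic": atomic,
        "composed_train": composed_train,
        "composed_ood": composed_ood,
        "held_out_bridges": held_out_bridges,
    }


if __name__ == "__main__":
    splits = build_two_hop_splits()
    print(len(splits["atomic"]), "atomic facts")
    print(len(splits["composed_train"]), "composed training examples")
    print(len(splits["composed_ood"]), "composed out-of-distribution examples")
```

In this split the model has seen every fact an out-of-distribution query needs. What it has never seen is those facts used as the second step of a chain, which is the gap the developmental findings point to.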
The deeper reason this happens points to what chain-of-thought reasoning actually is. Several notes converge on the view that step-by-step reasoning is constrained imitation of reasoning *form*, not genuine symbolic inference: models reproduce familiar reasoning patterns rather than deriving new conclusions (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Why does chain-of-thought reasoning fail in predictable ways?). The DataAlchemy experiments make the failure signature precise: reasoning stays fluent but becomes logically inconsistent the moment you shift task, length, or format away from training (Does chain-of-thought reasoning actually generalize beyond training data?). An out-of-distribution triple is exactly such a shift, so the model produces a confident-sounding second hop that doesn't actually follow.
A related note reframes *what* triggers the breakdown, and it's the most useful lateral piece here. Reasoning failures track instance-level unfamiliarity, not task complexity: models fit instance-based patterns rather than general algorithms, so a chain succeeds if the model was trained on similar instances and fails otherwise, regardless of how 'simple' the logic looks (Do language models fail at reasoning due to complexity or novelty?). This explains the puzzle directly: the second hop isn't hard because it's a second hop; it's hard because the specific triple is novel. The composition itself is the unfamiliar instance.
There's also a structural-memory angle worth knowing about. One line of work argues the failure is partly about how retrieved evidence is stored: flat lists and binary graphs lose the joint constraints that bind three or more entities together, while hypergraph memory keeps multi-entity relations intact across retrieval steps (Can hypergraphs capture multi-hop reasoning better than graphs?). Read against the developmental findings, this suggests two reinforcing causes: the model never learned to compose across the distribution, *and* the way facts are represented can quietly drop the constraints a clean second hop would need.
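The point about lost joint constraints can be shown with a small sketch. This is a toy illustration, not HGMem's implementation: the entity names, the `HYPEREDGES` data, and the three helper functions are all hypothetical. It stores three-entity facts as hyperedges, flattens them into pairwise edges, and then asks whether a particular triple is supported as a joint fact.

```python
from itertools import combinations

# Hypothetical three-entity facts, each stored as one hyperedge.
HYPEREDGES = [
    frozenset({"alice", "bob", "project_x"}),
    frozenset({"alice", "carol", "project_y"}),
    frozenset({"bob", "carol", "project_z"}),
]


def to_binary_edges(hyperedges):
    """Flatten each hyperedge into its pairwise edges (a binary-graph view)."""
    edges = set()
    for hyperedge in hyperedges:
        for a, b in combinations(sorted(hyperedge), 2):
            edges.add((a, b))
    return edges


def binary_supports(edges, entities):
    """True if every pair among `entities` is linked in the binary graph."""
    return all(pair in edges for pair in combinations(sorted(entities), 2))


def hypergraph_supports(hyperedges, entities):
    """True only if a single stored hyperedge contains all `entities`."""
    target = frozenset(entities)
    return any(target <= hyperedge for hyperedge in hyperedges)


if __name__ == "__main__":
    query = {"alice", "bob", "carol"}
    edges = to_binary_edges(HYPEREDGES)
    # The binary view links every pair, so the triple looks like a joint fact.
    print("binary graph supports triple:", binary_supports(edges, query))
    # The hypergraph view shows that no single fact binds all three.
    print("hypergraph supports triple:", hypergraph_supports(HYPEREDGES, query))
```

The pairwise view cannot tell three separate two-person facts apart from one three-person fact, which is the kind of constraint a second hop may depend on.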
The thing you might not have expected: fixing this isn't mainly about more compute or longer reasoning chains. Training regime beats inference budget: extra tokens only help if training installed a reasoning protocol that makes them productive (Can non-reasoning models catch up with more compute?). So a second hop over an out-of-distribution triple won't be rescued by 'thinking longer.' It fails because the compositional behavior was never trained into the model for that region of the distribution, and no amount of inference-time effort manufactures a capability that isn't there.
Sources (7 notes)
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
HGMem organizes retrieved evidence as hyperedges rather than flat lists or binary graphs, allowing three or more entities to bind into single relations without decomposition. This structure accumulates coherent knowledge across retrieval steps, trading representational complexity for constraint expressiveness.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.