Why do recursive belief models require different training than logical derivation?
This reads the question as: why does training a model to hold and revise beliefs under uncertainty (beliefs about beliefs, distributions over what's true) demand something different from training it to crank through fixed logical steps — and the corpus suggests the answer starts by questioning whether LLMs ever did logical derivation in the first place.
The most useful thing the corpus does is dissolve the premise that these models do clean logical derivation at all. Several notes converge on the finding that LLMs reason by semantic association, not symbolic manipulation: when meaning is stripped out and only the formal rules remain, performance collapses, even with the correct rules supplied in context Do large language models reason symbolically or semantically?. Chain-of-thought, which looks like step-by-step derivation, turns out to be constrained imitation of the *form* of reasoning learned from training, degrading predictably under distribution shift rather than generalizing the way a real proof procedure would Does chain-of-thought reasoning reveal genuine inference or pattern matching?. So "logical derivation" in an LLM is already a kind of performance, not a mechanism, which is the first reason you can't just train it the way you'd specify a deductive system.
The strangest evidence comes from corrupted traces: models trained on deliberately wrong or irrelevant reasoning steps perform about as well as those trained on correct ones, and sometimes generalize *better* out of distribution Do reasoning traces need to be semantically correct?. If the literal logical content of a derivation barely matters, then the trace is functioning as computational scaffolding — a way to allocate compute — not as a chain of truth-preserving inferences. Training that optimizes for correct derivations is optimizing the wrong object. What actually transfers, per the pretraining analysis, is *procedural* knowledge — broad, reusable patterns of how-to-proceed drawn from many documents — as opposed to the narrow memorization that factual recall depends on Does procedural knowledge drive reasoning more than factual retrieval?.
Belief modeling pulls in the opposite direction from derivation in a more concrete way: a derivation wants one path, but a belief is a distribution. The clearest note here makes recursive latent reasoning *stochastic*, replacing deterministic latent updates with sampling so the model can represent a spread of possible solutions and carry genuine uncertainty forward, rather than committing to the single line that a deterministic design forces on it Can stochastic latent reasoning help models explore multiple solutions?. That's the architectural signature of belief-holding, and it's incompatible with training regimes that reward a single correct derivation, because those regimes punish exactly the exploration that representing alternatives requires.
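The contrast between a deterministic and a stochastic latent update can be made concrete with a toy problem. The sketch below is not the cited note's architecture: the refinement rule, the annealed Gaussian noise, and all names (`refine`, `deterministic_rollout`, `stochastic_rollout`) are assumptions chosen for illustration. The "problem" is finding z with z² = 4, which has two valid answers, +2 and -2.

```python
import random

def refine(z, target=4.0, lr=0.02):
    """One deterministic refinement step: gradient descent on (z^2 - target)^2."""
    grad = 4.0 * z * (z * z - target)
    return z - lr * grad

def deterministic_rollout(z0, steps=50):
    """Recursive latent update with no sampling: same start, same answer, every time."""
    z = z0
    for _ in range(steps):
        z = refine(z)
    return z

def stochastic_rollout(z0, rng, steps=50, noise=0.3):
    """Same update plus sampled noise, so repeated rollouts can reach different answers."""
    z = z0
    for t in range(steps):
        # Hypothetical schedule: noise shrinks linearly so late steps settle into one mode.
        sigma = noise * (1.0 - t / steps)
        z = refine(z) + rng.gauss(0.0, sigma)
    return z

rng = random.Random(0)
point_estimate = deterministic_rollout(0.1)
samples = [stochastic_rollout(0.1, rng) for _ in range(200)]
share_positive = sum(s > 0 for s in samples) / len(samples)
print(f"deterministic: {point_estimate:.3f}")
print(f"stochastic: {share_positive:.0%} of samples end near the positive root")
```

From the same starting latent, the deterministic rollout commits to one root, while the set of stochastic rollouts covers both. A training signal that scores only a single reference answer would treat the rollouts ending at the other valid root as errors, which is the mismatch the paragraph above describes.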
There's a deeper training-vs-inference distinction underneath all this. Reasoning models persistently beat non-reasoning ones no matter how much inference compute you throw at the weaker model, because training installs a *protocol* that makes extra tokens productive: the gap is about training structure, not raw capability Can non-reasoning models catch up with more compute?. Relatedly, much of what post-training does is *elicit* reasoning already latent in base activations rather than create it Do base models already contain hidden reasoning ability?, and the learning signal concentrates in a small set of high-entropy "forking" tokens (roughly 20% of tokens, the decision points where the model could branch) rather than spreading evenly across a derivation Do high-entropy tokens drive reasoning model improvements?. Those forking points are precisely where beliefs live: moments of uncertainty between alternatives, not the deterministic stretches between them.
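The idea of restricting the learning signal to forking tokens can be sketched in a few lines. This is a minimal illustration, assuming per-position next-token distributions are already available as lists of probabilities. The function names, the top-fraction selection rule, and the toy numbers are all hypothetical and are not taken from the cited work.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(token_dists, top_fraction=0.2):
    """Mark the highest-entropy positions; ties at the threshold are all kept."""
    ents = [entropy(d) for d in token_dists]
    k = max(1, round(top_fraction * len(ents)))
    threshold = sorted(ents, reverse=True)[k - 1]
    return [e >= threshold for e in ents]

def masked_mean(per_token_loss, mask):
    """Average the loss over selected positions only, ignoring the rest."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept)

# Toy sequence of five positions: only the second is a real decision point.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.99, 0.005, 0.005, 0.0],
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.05, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)  # [False, True, False, False, False]
print(masked_mean([1.0, 2.0, 3.0, 4.0, 5.0], mask))  # 2.0
```

In this toy setup the near-deterministic positions contribute nothing to the averaged loss, so any update is driven entirely by the one position where the distribution is spread across alternatives.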
The payoff for a curious reader: the reason recursive belief modeling needs different training isn't that beliefs are "harder" than logic. It's that logical-derivation training quietly assumes a symbolic mechanism the model doesn't have, rewards a single path when the model's real competence lives in branching, and optimizes trace content that turns out to be scaffolding. An alternative thread — energy-based transformers that assign an energy to each candidate prediction and minimize over them at inference — points at what belief-shaped training might look like instead: learn a landscape over possibilities and let the model settle into one, getting System-2 behavior without any domain-specific derivation scaffolding Can energy minimization unlock reasoning without domain-specific training?.
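Inference by energy minimization can be illustrated with a hand-written energy function. The sketch below is an assumption-laden toy, not the Energy-Based Transformer itself: in the real setting the energy is learned, whereas here `energy` is a fixed hypothetical formula with two basins of different depth, and `descend` and `predict` are illustrative names.

```python
def energy(x, y):
    """Hypothetical energy: low when y*y is close to x, with a small tilt favoring lower y."""
    return (y * y - x) ** 2 + 0.3 * y

def energy_grad(x, y):
    """Analytic derivative of energy with respect to the candidate prediction y."""
    return 4.0 * y * (y * y - x) + 0.3

def descend(x, y0, lr=0.01, steps=200):
    """Refine one candidate by gradient descent on the energy, holding the input x fixed."""
    y = y0
    for _ in range(steps):
        y -= lr * energy_grad(x, y)
    return y

def predict(x, candidates):
    """Let every candidate settle into its basin, then keep the lowest-energy result."""
    settled = [descend(x, y0) for y0 in candidates]
    best = min(settled, key=lambda y: energy(x, y))
    return best, energy(x, best)

y_hat, e_hat = predict(4.0, [-3.0, -1.0, 1.0, 3.0])
print(f"prediction {y_hat:.3f} with energy {e_hat:.3f}")
```

The candidates starting at positive values settle near +2 and those starting at negative values settle near -2; the final answer is whichever basin has lower energy. Nothing in the loop encodes a derivation procedure: the only task-specific object is the landscape, which is the sense in which this style of inference is belief-shaped.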
Sources (9 notes)
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.