What makes schema identification necessary after assessing thoughts and evidence?
This explores why simply weighing thoughts and evidence isn't enough — why the corpus suggests you also have to name the *kind* of reasoning being used (its underlying schema) before you can trust or act on it.
The corpus keeps circling one uncomfortable finding: the content of a reasoning step and its evidential support can look fine while the underlying structure is doing something entirely different from what it appears to be doing. Identifying the schema is therefore the step that tells you what you're actually assessing.
The sharpest version comes from the GenMinds work, which argues that causal belief networks capture only part of how reasoning works: they can't represent associative links, analogical mappings, or emotion-driven belief shifts (Can causal models alone capture how humans actually reason?). If you assess a thought purely as a causal claim with supporting evidence, you misread every move that's actually associative or analogical. The schema has to be identified first because it determines which evaluation rules even apply. The same logic shows up in the PI framework, which sorts reasoning into six categories and finds that some types, such as verification and backtracking, receive almost no downstream attention and can be pruned without losing accuracy (Can reasoning steps be dynamically pruned without losing accuracy?). You can only tell which steps are load-bearing once you've categorized what type of step each one is.
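A minimal sketch of that last point, assuming each reasoning step has already been given a schema label and a downstream-attention score. The `Step` class, the label names other than verification and backtracking, the scores, and the `0.1` threshold are all hypothetical illustrations, not the PI framework's actual categories, data, or method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    kind: str         # schema label; "setup" and "derivation" are made-up names
    attention: float  # hypothetical downstream-attention score in [0, 1]


def mean_attention_by_kind(steps):
    """Average the attention score of the steps within each schema label."""
    buckets = defaultdict(list)
    for step in steps:
        buckets[step.kind].append(step.attention)
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


def prune(steps, threshold=0.1):
    """Keep only steps whose attention score meets the (assumed) threshold."""
    return [s for s in steps if s.attention >= threshold]


# Toy trace with invented scores, for illustration only.
trace = [
    Step("Let x be the unknown side.", "setup", 0.50),
    Step("So x = 12.", "derivation", 0.75),
    Step("Double-check: 12 fits the constraint.", "verification", 0.05),
    Step("Wait, try a different split.", "backtracking", 0.03),
]

by_kind = mean_attention_by_kind(trace)
kept = prune(trace)
print(by_kind)
print([s.kind for s in kept])
```

The grouping only makes sense once every step carries a `kind`: without the label you can still drop low-scoring steps, but you cannot say which type of step is systematically ignored.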
Why is the schema layer so necessary? Because a string of confident-looking thoughts with valid-seeming evidence can be pure form. Invalid chains of thought perform nearly as well as valid ones, which means models learn the *shape* of reasoning rather than genuine inference (Does logical validity actually drive chain-of-thought gains?). Reasoning traces turn out to be stylistic mimicry rather than verified causal work: the visible steps correlate with answers through learned formatting, not functional logic (Do reasoning traces actually cause correct answers?). The broader synthesis names this directly: chain-of-thought is constrained imitation, where format effects dominate content (What makes chain-of-thought reasoning actually work?). If the trace can be valid-looking and empty, then assessing the thoughts and the evidence tells you almost nothing without first asking *what kind of operation this actually is.*
There's a quieter, more constructive reason too. Some work shows reasoning isn't always in the visible tokens: models solve hard tasks through latent computation (Can models reason without generating visible thinking steps?), and a single steerable internal feature can trigger a reasoning *mode* that overrides surface instructions (Can we trigger reasoning without explicit chain-of-thought prompts?). That implies the real schema lives below the words you're assessing. Identifying it isn't bookkeeping; it's the only way to reach the thing that's actually doing the reasoning, rather than the narration laid over it.
The payoff a curious reader might not expect: across all these papers, schema identification is what separates evaluating reasoning from evaluating its costume. Whether you're pruning redundant steps, steering between overthinking and underthinking (Can confidence patterns reveal overthinking versus underthinking?), or deciding whether to believe a trace at all, the type-level question comes first, because everything downstream depends on it: which rules apply, which steps matter, and whether the evidence is even the right evidence.
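To make the overthinking/underthinking distinction concrete, here is a rough sketch of a type-level diagnostic over per-step confidence values. It assumes that high confidence variance signals overthinking and that uniformly high confidence signals underthinking; that mapping, the `diagnose` function, and both thresholds are illustrative assumptions, not ReBalance's actual procedure or parameters.

```python
from statistics import mean, pvariance


def diagnose(confidences, var_threshold=0.04, conf_threshold=0.9):
    """Label a trace from its per-step confidences (all thresholds are assumed)."""
    if len(confidences) < 2:
        return "undetermined"
    # Assumption: confidence that swings widely suggests redundant re-examination.
    if pvariance(confidences) > var_threshold:
        return "overthinking"
    # Assumption: steady, very high confidence suggests too little exploration.
    if mean(confidences) > conf_threshold:
        return "underthinking"
    return "balanced"


print(diagnose([0.2, 0.9, 0.3, 0.95]))
print(diagnose([0.97, 0.98, 0.96, 0.99]))
print(diagnose([0.6, 0.65, 0.7]))
```

The label is the schema-level judgment; any steering applied afterward depends on which label came back.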
Sources (8 notes)
Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, which indicates that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.