Can models possess latent reasoning capability that training signals fail to unlock?

This explores whether the ability to reason already lives inside a model's pretrained weights, waiting to be switched on — so that training is less about teaching reasoning than about finding the right key to unlock it.

Do models already hold reasoning ability that training merely surfaces rather than builds? The corpus comes down surprisingly hard on "yes, mostly." The strongest version of the claim is that base models already contain latent reasoning, and that five independent techniques (RL steering, critique fine-tuning, changes to how the model decodes text, steering of internal features, and RL with verifiable rewards) all reach into the *same* pre-existing capability rather than each creating a new skill (see "Do base models already contain hidden reasoning ability?"). If five different keys open the same door, the door was already there. The bottleneck, on this view, is elicitation, not acquisition.

The most vivid evidence is how *little* signal it takes to unlock. A single SAE-identified "reasoning feature" can be steered directly to match or beat chain-of-thought prompting across six model families; it activates early in generation and even overrides surface instructions (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). Big gains are possible with no RL at all: four modular "cognitive tools" lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% purely by isolating reasoning operations cleanly (see "Can modular cognitive tools unlock reasoning without training?"). And when reinforcement learning *is* applied, the dynamics suggest it sharpens sampling within existing boundaries rather than expanding them: one training example can suffice to activate the behavior, and even spurious rewards work nearly as well as correct ones for models with the right pretraining (see "What does reward learning actually do to model reasoning?"). An unsettling corollary appears in a separate thread: models trained on deliberately *corrupted* reasoning traces do about as well as models trained on correct ones, implying the trace is computational scaffolding that triggers latent computation, not meaningful content the model learns from (see "Do reasoning traces need to be semantically correct?").

As for *where* this latent capability comes from, the answer points back to pretraining itself: reasoning generalization is driven by broad, transferable procedural knowledge spread across many documents, unlike factual recall, which depends on narrow memorization of specific sources (see "Does procedural knowledge drive reasoning more than factual retrieval?"). That reframes the whole question: training signals don't "fail to unlock" capability so much as compete to access something pretraining already distributed widely. It also explains why confidence alone can serve as a reward that strengthens reasoning without any human labels or external verifier (see "Can model confidence work as a reward signal for reasoning?"): the model already knows enough to grade its own traces.

But the corpus doesn't let "it's all already there" off the hook. There is a ceiling on what gets unlocked. When semantic content is stripped from a task, model performance collapses even with the correct rules sitting in context; the latent capability is semantic association, not formal symbolic logic, so it can't escape its training distribution (see "Do large language models reason symbolically or semantically?"). Reasoning failures track instance-level *unfamiliarity*, not task complexity: models fit patterns from similar instances rather than learning a general algorithm, so a chain succeeds only if something like it was seen before (see "Do language models fail at reasoning due to complexity or novelty?"). So the honest synthesis is two-sided: training signals genuinely *under*-elicit a large reservoir of latent reasoning, but that reservoir is bounded by what pretraining made familiar. Unlocking is real; conjuring is not.

The forward edge of the corpus is about *managing* that latent capability rather than just triggering it. One line of work makes latent reasoning stochastic, so a model can hold uncertainty and explore multiple solution paths instead of committing early (see "Can stochastic latent reasoning help models explore multiple solutions?"). Another steers reasoning toward brevity by moving along a single direction in activation space, with no retraining (see "Can we steer reasoning toward brevity without retraining?"). A third teaches a model to route between thinking hard and answering fast (see "Can models learn when to think versus respond quickly?"). One pattern stands out across these papers: the lever that controls reasoning often turns out to be a single feature, a single direction, or a single example, which is exactly what you'd predict if the capability is already present and training is just choosing whether to express it.

Sources: 12 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
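The isolation idea can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the `LLM` callable interface, `make_tool`, `solve`, the two placeholder tool prompts, and the recording stub are all hypothetical. The point it shows is that each tool call receives only its own instruction and input, so one operation cannot leak into another through shared context.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def make_tool(instruction: str, llm: LLM) -> Callable[[str], str]:
    """Wrap one reasoning operation as an isolated call with a fresh context."""
    def tool(payload: str) -> str:
        # The call sees only its own instruction and input, never the caller's history.
        return llm(f"{instruction}\n\nInput:\n{payload}")
    return tool


def solve(question: str, llm: LLM) -> str:
    # Illustrative tool set; the prompts are placeholders, not the paper's tools.
    restate = make_tool("Restate the problem and list what is given.", llm)
    check = make_tool("Check this candidate answer for errors.", llm)

    plan = restate(question)
    draft = llm(f"Solve the problem.\n\n{plan}")
    return check(f"Problem: {question}\nCandidate: {draft}")


# Stub model that records every prompt so the isolation can be inspected.
seen_prompts: list[str] = []


def stub_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"response {len(seen_prompts)}"


result = solve("What is 2 + 2?", stub_llm)
```

With a real model in place of `stub_llm`, inspecting `seen_prompts` shows exactly what each operation was allowed to see, which is the guarantee a single long prompt cannot give.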

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
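One common way to probe a "sharpening, not expanding" claim is to compare pass@k before and after RL: if training only reweights samples the base model could already produce, pass@1 should rise while pass@k at large k stays roughly flat. The sketch below uses the standard unbiased pass@k estimator. The note does not say which metric the cited research used, so treating pass@k as the measure is an assumption, and the sample counts are invented for illustration, not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem, 100 samples per model (illustrative only).
base_correct, rl_correct = 5, 40

for k in (1, 10, 100):
    base = pass_at_k(100, base_correct, k)
    tuned = pass_at_k(100, rl_correct, k)
    print(f"k={k:>3}  base={base:.3f}  rl={tuned:.3f}")
```

In this toy case the RL model is far ahead at k=1, yet both models reach 1.0 at k=100: the base model could already solve the problem, just less often.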

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
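A minimal sketch of confidence-based ranking, assuming each sampled trace comes with per-token log-probabilities and a known answer span. The `Trace` class, the geometric-mean scoring rule, the `margin` threshold, and the sample numbers are hypothetical choices for illustration; the actual RLSF pipeline may score and pair traces differently.

```python
from dataclasses import dataclass
from itertools import combinations
from math import exp


@dataclass
class Trace:
    text: str
    token_logprobs: list[float]   # log-probability of each generated token
    answer_span: tuple[int, int]  # [start, end) token indices of the final answer


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer tokens (assumed scoring rule)."""
    start, end = trace.answer_span
    span = trace.token_logprobs[start:end]
    if not span:
        raise ValueError("empty answer span")
    return exp(sum(span) / len(span))


def preference_pairs(traces: list[Trace], margin: float = 0.1):
    """Yield (chosen, rejected) pairs whose confidence gap exceeds the margin."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    for better, worse in combinations(ranked, 2):
        if answer_confidence(better) - answer_confidence(worse) > margin:
            yield better, worse


# Made-up traces: only the last token is treated as the answer span.
traces = [
    Trace("... so the answer is 12", [-0.2, -0.1, -0.05], (2, 3)),
    Trace("... so the answer is 15", [-0.3, -0.4, -1.2], (2, 3)),
]
pairs = list(preference_pairs(traces))
```

The resulting `(chosen, rejected)` pairs could then feed any standard preference-optimization step; no human label or external verifier enters the loop.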

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
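A minimal sketch of how a single steering direction can be extracted from paired examples, assuming access to hidden-state activations at one layer. It uses the common difference-of-means recipe, which may differ from the exact procedure in Activation-Steered Compression. The random stand-in activations, the hidden size, and the `alpha` scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_pairs = 64, 50

# Stand-in activations. In practice these would be read from one layer of the
# model while it processes verbose and concise reasoning for the same problems.
verbose_acts = rng.normal(size=(n_pairs, hidden_dim))
noise = 0.1 * rng.normal(size=(n_pairs, hidden_dim))
concise_acts = verbose_acts + 0.5 + noise  # synthetic shift of 0.5 per dimension


def steering_vector(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Difference of mean activations, pointing from source toward target."""
    return target.mean(axis=0) - source.mean(axis=0)


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift a hidden state along the direction; no model weights are updated."""
    return hidden + alpha * direction


v = steering_vector(verbose_acts, concise_acts)
h = rng.normal(size=hidden_dim)
h_steered = steer(h, v, alpha=1.0)
```

In a real model, `steer` would run inside a forward hook at the chosen layer, and `alpha` would be tuned to shorten the reasoning without hurting accuracy.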

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst re-evaluating latent reasoning in LLMs. The core question: do models possess reasoning ability that training signals fail to unlock, or has the frontier moved?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of papers claims:
• Base models contain latent reasoning; five unrelated techniques (RL, critique, decoding changes, SAE steering, reward-verified RL) unlock the *same* pre-existing capability, not new skills (~2025–2026).
• A single SAE-identified reasoning feature can be steered to match chain-of-thought across six model families; a single training example can activate the behavior, and spurious rewards work nearly as well as correct ones (~2025).
• Corrupted reasoning traces teach as well as correct traces, implying traces are computational scaffolding triggering latent computation, not meaningful content (~2025).
• Procedural knowledge from pretraining drives reasoning; factual recall depends on narrow memorization (~2024–2025).
• Model confidence serves as intrinsic reward to strengthen reasoning without external verification (~2025).
• Semantic content stripping collapses performance even with correct rules in context; reasoning is semantic association, not formal symbolic logic (~2023).
• Reasoning failures track instance-level unfamiliarity, not task complexity; models fit patterns from similar instances (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023) — semantic vs. symbolic reasoning boundaries.
• arXiv:2506.12115 (2025) — cognitive tools unlock reasoning modularly.
• arXiv:2602.06176 (2026) — reasoning failure mechanisms.
• arXiv:2605.19376 (2026) — recursive reasoning modes.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above — especially the claim that latent reasoning is *already present* and training is elicitation not acquisition — judge whether newer model scaling (o1, o3, or successors), chain-of-thought variants, test-time compute budgets, or interpretability breakthroughs have relaxed, overturned, or *sharpened* the boundary. Has the "latent reservoir" grown? Can models now escape semantic-association limits? Cite what changed and where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: papers claiming reasoning *is* acquired, not elicited; or that test-time scaling has fundamentally reshaped what "latent" means.
(3) Propose 2 durable research questions that assume the regime *may* have shifted, e.g., if latent reasoning has grown, what drives growth? If elicitation has hit a ceiling, what is the next frontier?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
