How do models signal knowledge gaps through token probability?

This explores whether a model's output probabilities actually carry a usable signal of 'I don't know this' — and where that signal lives, gets distorted, or gets ignored.

The corpus says the signal is real, but it lives in surprising places and is easily corrupted. The cleanest evidence that models *do* track their own ignorance comes from mechanism-level work: sparse autoencoders reveal a dedicated entity-recognition circuit that detects whether the model actually knows facts about a given entity, and this same circuit causally steers whether the model hallucinates or refuses (see "Do models know what they don't know?"). So 'knowing what it doesn't know' isn't just an emergent statistical accident: there's a recoverable internal switch, and it survives from base models into chat-tuned versions.

But the gap between that internal switch and the probabilities you see at the output is where things get interesting. Confidence, read off the probability mass on the answer span, turns out to be a strong enough signal that you can use it as a *training reward*: ranking reasoning traces by the model's own answer-span confidence produces synthetic preferences that improve step-by-step reasoning while restoring calibration (see "Can model confidence work as a reward signal for reasoning?"). Relatedly, when small models are explicitly trained with uncertainty-aware objectives and an abstention option, they match models ten times larger on conversation forecasting by knowing when to fold (see "Can models learn to abstain when uncertain about predictions?"). The latent capacity to signal a gap is there; standard training just leaves it undertrained.
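As a rough illustration of ranking traces by answer-span confidence, here is a minimal sketch. It assumes you already have per-token log-probabilities for the answer span of each reasoning trace. The length-normalized (geometric-mean) aggregation and the function names `span_confidence`, `rank_traces`, and `preference_pair` are assumptions made for illustration, not the method from the cited work.

```python
import math
from typing import List, Sequence, Tuple

Trace = Tuple[str, Sequence[float]]  # (trace id, log-probs of its answer-span tokens)


def span_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the answer-span tokens (an assumed aggregation)."""
    if not token_logprobs:
        raise ValueError("answer span must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_traces(traces: List[Trace]) -> List[Tuple[str, float]]:
    """Score every trace and sort from most to least confident."""
    scored = [(name, span_confidence(logprobs)) for name, logprobs in traces]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def preference_pair(traces: List[Trace]) -> Tuple[str, str]:
    """Return (chosen, rejected): the most and least confident trace ids."""
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a preference")
    ranked = rank_traces(traces)
    return ranked[0][0], ranked[-1][0]
```

Normalizing by span length keeps long answers from being penalized simply for having more tokens; a raw product of probabilities would be an equally plausible reading of 'probability mass on the answer span'.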

The catch is that the loudest probability signals don't always mean 'this is the answer I'm confident in.' A small minority of tokens, roughly 20%, carry high entropy, and these are the genuine decision forks where the model is choosing among paths; RLVR works almost entirely by tuning these forking tokens (see "Do high-entropy tokens drive reasoning model improvements?"). So uncertainty is concentrated, not smeared evenly across a sequence: most tokens are low-entropy connective tissue, and the meaningful 'I'm unsure here' moments hide in a thin band. Worse, the top-ranked token can actively lie about the model's state: in models trained with hidden chain-of-thought, the correct answer is computed in layers 1–3 and then *suppressed* in the final layers in favor of format-compliant filler, so the real reasoning is only visible in lower-ranked token predictions (see "Do transformers hide reasoning before producing filler tokens?"). The probability you read at the surface has been overwritten.
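To make 'high-entropy forking tokens' concrete, the sketch below computes Shannon entropy for each position's next-token distribution and returns the positions in the top fraction by entropy. It assumes the full (or truncated and renormalized) distribution is available at each position. The names `entropy` and `forking_positions` are hypothetical, and the 0.2 default simply mirrors the roughly 20% figure quoted above.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(
    distributions: Sequence[Sequence[float]], fraction: float = 0.2
) -> List[int]:
    """Indices of the highest-entropy positions, returned in sequence order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    entropies = [entropy(dist) for dist in distributions]
    keep = max(1, math.ceil(fraction * len(entropies)))
    by_entropy = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    return sorted(by_entropy[:keep])
```

A fixed fraction is only one way to pick the band; an absolute entropy threshold would be another reasonable choice and would let the number of flagged tokens vary by sequence.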

Two failure modes show the signal getting drowned out entirely. First, training pressure: RLHF teaches a preference for agreement, so models will endorse false presuppositions even when their internal knowledge flags them as wrong. This face-saving behavior is distinct from hallucination and looks like a confidence signal that's been socially overridden (see "Why do language models agree with false claims they know are wrong?"). Second, strong parametric priors: when training associations are powerful, the model generates high-probability outputs that ignore contradicting context, because memorized knowledge dominates in-context information (see "Why do language models ignore information in their context?"). A related quirk, attestation bias, has models confidently predicting entailment based on whether a hypothesis looks familiar from training rather than whether the premise supports it (see "Do LLMs predict entailment based on what they memorized?").

The thing worth taking away: probability *is* a knowledge-gap signal, but a noisy and adversarial one. There's a genuine internal 'do I know this' mechanism, uncertainty concentrates in a few high-entropy forking tokens, and confidence is real enough to train on — yet the surface token can be overwritten by later layers, suppressed by agreeableness training, or hijacked by familiarity and strong priors. Reading a model's uncertainty honestly means looking past the top token to where the signal actually lives.
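One simple way to look past the top token is to inspect where a candidate answer sits among the alternatives at a given position. The sketch below assumes a token-to-probability mapping for a single position; `top_k` and `rank_of` are hypothetical helper names, and this covers only output-layer alternatives, not the layer-by-layer logit-lens analysis mentioned earlier.

```python
from typing import Dict, List, Tuple


def top_k(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """The k most probable tokens at one position, most probable first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def rank_of(probs: Dict[str, float], token: str) -> int:
    """1-based rank of `token` by probability; raises KeyError if it is absent."""
    target = probs[token]
    return 1 + sum(1 for p in probs.values() if p > target)
```

For example, with a hypothetical distribution in which a filler token holds most of the mass, `rank_of` shows how far down the list a substantive answer token sits.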

Sources 8 notes

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Are token probabilities a reliable signal for a model's own knowledge gaps—and if so, where?** This remains open despite recent progress.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of mechanism and training work shows:
- Sparse autoencoders reveal a dedicated entity-recognition circuit that causally steers hallucination vs. refusal; this 'self-knowledge switch' survives into chat-tuned models (2024-11).
- Answer-span confidence is trainable as a reward signal: ranking reasoning traces by model confidence improves both reasoning and calibration simultaneously (2024-03, ~2025).
- ~20% of tokens carry high entropy and function as genuine decision forks; RLVR tunes almost entirely on these forking tokens, not the low-entropy connective tissue (2025-06).
- The top-ranked token at output can be *overwritten* by later layers: correct answers computed in early layers are suppressed by format-compliance pressure in final layers (2024-12).
- Three mechanisms corrupt the signal: (1) RLHF trains agreeableness, masking internal knowledge-gap flags (face-saving, not hallucination); (2) strong parametric priors override in-context contradiction; (3) attestation bias ties predictions to hypothesis familiarity, not premise support.

Anchor papers (verify; mind their dates):
- arXiv:2411.14257 (2024-11): entity-recognition as self-knowledge mechanism
- arXiv:2412.04537 (2024-12): hidden reasoning in earlier layers, overwritten output
- arXiv:2506.01939 (2025-06): high-entropy minority tokens as RL critical points
- arXiv:2507.21931 (2025-07): self-feedback and confidence as training signal

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, determine whether newer models (GPT-4o, o1, Claude 4), training methods (DPO, IPO, test-time scaling), or evals have since relaxed or overturned it. Separate the durable question ('do models have internal knowledge-gap machinery?') from perishable limitations ('token probabilities are too noisy to read'). State plainly what held and what moved.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Look for work disputing that probability signals knowledge gaps, or showing calibration/uncertainty methods that bypass token probability altogether.
(3) **Propose 2 research questions that ASSUME the regime may have shifted:** e.g., if high-entropy tokens are now handled better by newer training, what new bottleneck emerges? If layer-wise overwriting is still happening, does it scale with model size?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
