INQUIRING LINE

How can judges evaluate thinking without seeing the actual thoughts?

This explores the gap between what a judge can observe—final outputs, surface features—and the hidden computation that produced them, and what tools the corpus offers for grading reasoning you can't directly read.


The thought process behind an answer is often hidden from whoever has to grade it, whether that grader is human or AI: the reasoning may have happened in latent space, or the judge may only ever see the polished output. The corpus has a surprisingly rich answer to how the quality of such *thinking* can still be scored, and it starts with why the naive approach fails.

The core problem is that judges grade what they can see, and what they can see is exploitable. LLM evaluators systematically reward fake citations, confident tone, and rich formatting independent of whether the content is any good, and these biases can be triggered without any access to the model's internals (see "Can LLM judges be tricked without accessing their internals?"). The same trap catches humans: imitation models that merely mimic ChatGPT's fluent, confident style fool human evaluators into thinking capability improved when factuality didn't budge at all (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). So 'evaluate the visible answer' isn't a neutral fallback: it actively rewards the appearance of thought over the thing itself.
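The surface-cue bias described above can be probed with a content-neutral perturbation test. The following is a minimal sketch, not a method taken from the cited notes: `surface_bias`, `add_fake_citation`, and `toy_judge` are hypothetical names, the appended reference is deliberately fabricated, and the toy judge is a stand-in for whatever real evaluator is being audited.

```python
from statistics import mean
from typing import Callable, List


def add_fake_citation(text: str) -> str:
    """Append a fabricated reference without changing the actual content."""
    return text + " [1] Doe et al., 2021."


def surface_bias(
    judge: Callable[[str], float],
    responses: List[str],
    perturb: Callable[[str], str],
) -> float:
    """Mean score change caused by a perturbation that adds no real content."""
    return mean(judge(perturb(r)) - judge(r) for r in responses)


def toy_judge(text: str) -> float:
    """Hypothetical stand-in judge that rewards word count and bracketed citations."""
    return 0.01 * len(text.split()) + (1.0 if "[1]" in text else 0.0)


responses = [
    "The capital of France is Paris.",
    "Water boils at 100 C at sea level.",
]
bias = surface_bias(toy_judge, responses, add_fake_citation)
print(f"mean score shift from a fake citation: {bias:+.2f}")
```

A shift near zero would suggest the judge ignores the decoration; a large positive shift means the score can be raised without improving the answer. The same harness accepts other perturbations (extra formatting, a more confident tone) by swapping the `perturb` argument.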

This matters more once you realize the thoughts may genuinely be invisible. Depth-recurrent and compressed-token architectures solve hard reasoning tasks entirely in hidden computation: a 27M-parameter model solved Sudoku-Extreme and 30×30 mazes with no verbalized chain-of-thought at all, where step-by-step methods scored zero (see "Can models reason without generating visible thinking steps?"). And even when a model *does* write out its reasoning, the visible trace isn't trustworthy: chain-of-thought accuracy is driven partly by raw output probability and memorization, not only by genuine inference (see "What three separate factors drive chain-of-thought performance?"), and more reasoning tokens can actively hurt, with benchmark accuracy falling from 87.3% to 70.3% as thinking grew from roughly 1,100 to 16K tokens (see "Does more thinking time always improve reasoning accuracy?"). The words on the page are not the thinking.

The corpus's interesting move is to evaluate reasoning by its *structure and traces* rather than its content. One line of work proposes measurable properties (traceability, counterfactual adaptability, and motif compositionality) that test whether an agent reasons causally or just produces coherent-sounding speech (see "Can we measure reasoning quality beyond output plausibility?"). Another reads the model's own layers: a 'deep-thinking ratio' measures the proportion of tokens whose predictions get significantly revised as they pass through the network, which correlates with accuracy and lets you measure reasoning effort without ever reading a thought (see "Can we measure how deeply a model actually reasons?"). Structural scrutiny also exposes fake reasoning from the other direction: theory-of-mind benchmarks turn out to be solvable by surface-level pattern-matching, so a judge looking only at correct answers would be fooled into crediting reasoning that never occurred (see "Can language models solve ToM benchmarks without real reasoning?").
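To make the layer-wise idea concrete, here is a rough sketch of a deep-thinking-ratio style measurement. It is an illustration under stated assumptions and not the cited paper's definition: the use of total variation distance, the comparison of consecutive layers, the `threshold` value, and all function names are choices made here for the example. Each token is represented by one probability distribution per layer, as a logit-lens style readout might provide.

```python
from typing import List, Sequence

Dist = Sequence[float]  # probability distribution over a (tiny) vocabulary


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance between two distributions of equal length."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def is_revised(layer_dists: List[Dist], threshold: float) -> bool:
    """True if any pair of consecutive layers shifts the prediction by more than threshold."""
    return any(
        total_variation(a, b) > threshold
        for a, b in zip(layer_dists, layer_dists[1:])
    )


def deep_thinking_ratio(tokens: List[List[Dist]], threshold: float = 0.3) -> float:
    """Fraction of tokens whose layer-wise prediction is significantly revised."""
    if not tokens:
        return 0.0
    return sum(is_revised(t, threshold) for t in tokens) / len(tokens)


# Two toy tokens, three layers each, vocabulary of size 2 (assumed data).
stable = [[0.9, 0.1], [0.9, 0.1], [0.95, 0.05]]
revised = [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
ratio = deep_thinking_ratio([stable, revised])
print(ratio)
```

The point of the sketch is that the score depends only on how predictions move between layers, never on the text of a reasoning trace, which is why a measure of this kind can be applied even when nothing is verbalized.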

The third strategy is to make the judge itself think. Training evaluators with reinforcement learning to reason through their verdicts, rather than snap to surface cues, directly suppresses the authority, verbosity, position, and beauty biases that plague shallow judges (see "Can reasoning during evaluation reduce judgment bias in LLM judges?"). There's a subtlety worth knowing: thinking doesn't automatically help. Untrained models use extended deliberation counterproductively, spiraling into self-doubt that degrades their judgment, and only RL training flips that same mechanism into productive analysis (see "Does extended thinking help or hurt model reasoning?"). So the honest answer to the question is layered: you can't grade hidden thoughts by reading them, but you can grade them by their structural fingerprints, by internal layer-wise signals, and by handing the judge a reasoning process of its own. Each of these sidesteps the trap of mistaking confident style for genuine thought.
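One way to picture how a judgment task becomes a verifiable problem is a reward computed on synthetic pairs whose better answer is known in advance. The sketch below is an assumption-laden illustration, not the training recipe from the cited work: it covers only the reward signal (no RL loop), the `Judge` interface and all names are hypothetical, and requiring agreement under both presentation orders is a design choice made here to penalize position bias.

```python
from typing import Callable, List, Tuple

# A judge sees two answers and returns "first" or "second" (assumed interface).
Judge = Callable[[str, str], str]


def pair_reward(judge: Judge, better: str, worse: str) -> float:
    """1.0 only if the judge prefers the known-better answer in both orders."""
    ok_when_first = judge(better, worse) == "first"
    ok_when_second = judge(worse, better) == "second"
    return 1.0 if ok_when_first and ok_when_second else 0.0


def mean_reward(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Average verifiable reward over (better, worse) synthetic pairs."""
    return sum(pair_reward(judge, b, w) for b, w in pairs) / len(pairs)


def position_biased_judge(first: str, second: str) -> str:
    """Toy judge that always picks whatever is shown first."""
    return "first"


KNOWN_GOOD = {"2 + 2 = 4", "The capital of France is Paris."}


def oracle_judge(first: str, second: str) -> str:
    """Toy judge that checks content against a lookup of correct answers."""
    return "first" if first in KNOWN_GOOD else "second"


pairs = [
    ("2 + 2 = 4", "2 + 2 = 5, as every expert agrees."),
    ("The capital of France is Paris.", "The capital of France is Lyon."),
]
biased_score = mean_reward(position_biased_judge, pairs)
oracle_score = mean_reward(oracle_judge, pairs)
print(biased_score, oracle_score)
```

Because the correct verdict is fixed by construction, a judge that leans on position or confident phrasing earns no reward, while one that actually checks content does.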


Sources (10 notes)

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we measure reasoning quality beyond output plausibility?

Research identifies traceability, counterfactual adaptability, and motif compositionality as testable measures of human-like reasoning. These structural properties reveal whether an agent genuinely reasons causally or merely mimics coherent speech.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM evaluation researcher. The question remains open: **How can judges—human or machine—evaluate reasoning quality when the actual thought process is hidden or inaccessible?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as timestamped constraints to re-test:
- LLM judges systematically reward confident tone, citations, and formatting independent of factuality; humans fall for imitation of style over capability (~2023–2024).
- Chain-of-thought traces don't reliably reflect genuine inference; accuracy is driven partly by output probability and memorization, alongside genuine reasoning that accumulates error with each step (~2024-07).
- Models can solve hard reasoning tasks (Sudoku, mazes) entirely in hidden latent computation with zero verbalized steps; step-by-step methods scored zero (~2024–2025).
- More reasoning tokens can *degrade* accuracy after a threshold; models overthink and collapse (~2025-06).
- Theory-of-mind benchmarks are solvable by pure pattern-matching without explicit reasoning (~2025-04).
- RL-trained judges suppress surface-cue bias; untrained thinking spirals into self-doubt, but RL converts it to productive analysis (~2025-05).
- A 'deep-thinking ratio' (layer-wise token prediction revision) correlates with accuracy and measures reasoning effort without reading thoughts (~2026-02).

Anchor papers (verify; mind their dates):
- arXiv:2305.15717 (2023-05): Imitation captures style, not factuality.
- arXiv:2407.01687 (2024-07): Chain-of-thought performance reflects probability & memorization.
- arXiv:2505.10320 (2025-05): RL incentivizes thinking in LLM judges.
- arXiv:2602.13517 (2026-02): Deep-thinking ratio measures reasoning effort.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1-preview, o3, etc.), training methods (test-time scaling, compute-optimal inference), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question (likely still open: *how do you ground judgment in hidden thought?*) from the perishable limitation (e.g., *current judges reward style*—has this been solved?). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. If newer results show layer-wise signals fail, or RL judges fail at scale, or hidden reasoning no longer hides, say so plainly.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., if test-time scaling now exposes reasoning, how do you *distinguish* depth from mere token inflation? If RL judges work, why do they fail on adversarial cases?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
