INQUIRING LINE

What makes the 45 percent accuracy saturation threshold universal?

The question assumes a 'universal 45% accuracy saturation threshold', but the corpus has no such finding. The only 45% figure here is a metric artifact, not a ceiling, so the honest answer is to flag the premise and redirect to what the corpus actually says about why accuracy numbers mislead.


This explores a supposed universal threshold where accuracy saturates around 45%, and the corpus doesn't support that framing. The one place 45 appears is a warning sign, not a law: ROUGE-based evaluation of hallucination detection inflates measured detection capability by up to 45.9 percent compared to human-aligned metrics, which means much of the 'progress' in that area is measuring length variation rather than factual accuracy (see "Is hallucination detection progress real or just metric artifacts?"). That's an evaluation artifact, a gap between what the metric rewards and what's actually true, not a saturation point that models hit and can't cross. So if the 45% number came from somewhere, it's worth checking whether it describes a real ceiling or an inflated score.

What the corpus does have, and what's more interesting, is a recurring theme that aggregate accuracy is a treacherous number to begin with. Fluent, confident, wrong answers tend to be invisible to standard accuracy evaluation: in domains like medical triage, legal interpretation, and financial planning, the dangerous errors concentrate in rare cases where harm happens, and overall accuracy looks strong precisely because it averages those cases away (see "Why do confident wrong answers hide in standard accuracy metrics?"). A single headline accuracy figure can stay high while the failures that matter most go uncounted, which is the opposite of a clean universal threshold.
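
To make the averaging effect concrete, here is a minimal sketch in Python. The group names and counts are invented for illustration and do not come from the corpus; the point is only that a high overall figure can coexist with a very low figure on a small, high-harm group.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, per_group_accuracy) for (group, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical evaluation set: 980 routine cases, 20 rare high-harm cases.
records = (
    [("routine", True)] * 970
    + [("routine", False)] * 10
    + [("rare_high_harm", True)] * 5
    + [("rare_high_harm", False)] * 15
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.3f}")            # 0.975
for group, acc in per_group.items():
    print(f"{group}: {acc:.3f}")            # routine is about 0.990, rare_high_harm is 0.250
```

Under these made-up numbers the headline accuracy is 97.5% while three out of four rare high-harm cases are wrong, so reporting per-group figures alongside the aggregate is one way to keep those failures visible.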

The same skepticism applies to the idea that a number reflects a stable property of the model at all. Fixing the seed and setting temperature to zero replicates the same output every time, but that output is still just one draw from the model's probability distribution: repeatable is not the same as reliable (see "Does setting temperature to zero actually make LLM outputs reliable?"). Any 'threshold' you read off a single deterministic run may be an accident of that one sample rather than a true measure of capability.
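
A toy sketch of the repeatable-versus-reliable distinction, in Python. The answer distribution below is hypothetical, zero temperature is modeled simply as always picking the most probable answer, and this is not the McDonald's omega procedure mentioned in the source note; it only shows how a perfectly consistent output can hide a wide underlying distribution.

```python
import random
from collections import Counter

# Hypothetical probabilities a model assigns to three candidate answers.
answer_probs = {"A": 0.40, "B": 0.35, "C": 0.25}

def greedy_answer(probs):
    """Model zero-temperature decoding as picking the most probable answer."""
    return max(probs, key=probs.get)

def sampled_answers(probs, n, seed=0):
    """Draw n answers in proportion to their probabilities."""
    rng = random.Random(seed)
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)

# 100 deterministic repetitions always agree with each other...
greedy_runs = [greedy_answer(answer_probs) for _ in range(100)]
print(set(greedy_runs))  # {'A'}

# ...while sampling shows how much probability mass sits on other answers.
samples = Counter(sampled_answers(answer_probs, 1000))
print(samples)
```

In this toy setup the deterministic run returns "A" every time even though the assumed distribution puts 60% of its mass on other answers, which is why agreement across repeated runs says little about whether the answer is the one the model would reliably stand behind.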

If the real curiosity behind the question is *why measured performance plateaus or where ceilings come from*, the corpus points sideways to a more concrete answer: ceilings tend to be task-structural, not universal. Sparsity tolerance, for instance, varies dramatically. Single-QA tasks tolerate 95% sparsity while multi-hop and aggregation tasks degrade substantially at 50–67%, because some tasks concentrate reasoning in a few tokens and others need attention spread across many regions (see "How much sparsity can different reasoning tasks actually tolerate?"). Where models do hit limits, the limits move with the task. There's no single magic percentage.

The takeaway the corpus offers isn't a universal threshold. It's the reverse lesson: be suspicious of any clean universal accuracy number, because the field has a documented habit of producing illusory progress when the metric and the truth drift apart (see "Is hallucination detection progress real or just metric artifacts?" and "Why do confident wrong answers hide in standard accuracy metrics?").


Sources (4 notes)

Is hallucination detection progress real or just metric artifacts?

ROUGE-based evaluation inflates detection capability by up to 45.9 percent compared to human-aligned metrics. Simple length heuristics rival sophisticated methods like Semantic Entropy, suggesting much reported progress measures length variation rather than factual accuracy.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

How much sparsity can different reasoning tasks actually tolerate?

Single-QA tasks tolerate 95% sparsity while multi-hop and aggregation tasks degrade substantially at 50-67% sparsity. This pattern reflects structural differences: single-QA concentrates reasoning in few tokens, while multi-hop and aggregation require distributed attention across multiple regions.
