How do coverage and identifiability set separate performance ceilings?
This note examines two distinct limits on performance: coverage and identifiability. Coverage is a reach problem: does your method ever touch the part of the space that matters? Identifiability is a discrimination problem: once you're there, can you tell the genuinely correct answer apart from a near-miss that looks just as good? They feel related, but the corpus keeps showing that they are separate ceilings: you can max out one and still be capped by the other.
The coverage ceiling is about breadth. In safety testing of personas, optimizing for *support coverage* (reaching rare, consequential user configurations) beats trying to statistically match the average population, because the dangerous cases live in the tails that density-matching rarely visits (see "Should persona simulation prioritize coverage over statistical matching?"). The same shape shows up in agent evaluation: capability isn't a scalar but a vector across separable axes (task success, privacy, long-horizon memory, mode-shift, ecosystem readiness), and a single-number benchmark simply fails to *cover* the axes a real deployment depends on, so it systematically misranks models (see "Does a single benchmark score actually predict agent readiness?"). No amount of precision on the axis you measured rescues you from the axes you never looked at.
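The tail argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the population shares, the configuration names, and both budgeting functions are assumptions made up for this example, not taken from the persona work cited above. It compares how many samples a rare configuration receives when the budget follows population share versus when it is spread evenly over the support.

```python
def hit_probability(rare_fraction: float, n_samples: int) -> float:
    """Chance that n independent draws from the population include at least one rare case."""
    return 1.0 - (1.0 - rare_fraction) ** n_samples


def density_matched_budget(population: dict[str, float], n_samples: int) -> dict[str, float]:
    """Expected samples per configuration when sampling in proportion to population share."""
    return {name: share * n_samples for name, share in population.items()}


def support_coverage_budget(population: dict[str, float], n_samples: int) -> dict[str, int]:
    """Spread the budget evenly over every configuration in the support, ignoring share."""
    per_config, remainder = divmod(n_samples, len(population))
    return {
        name: per_config + (1 if i < remainder else 0)
        for i, name in enumerate(population)
    }


# Hypothetical population: shares are invented for illustration.
population = {"typical": 0.949, "uncommon": 0.05, "rare_high_risk": 0.001}

density = density_matched_budget(population, n_samples=100)
coverage = support_coverage_budget(population, n_samples=100)

print(f"P(at least one rare case in 100 draws): {hit_probability(0.001, 100):.3f}")
print(f"density-matched expected rare samples:  {density['rare_high_risk']:.1f}")
print(f"support-coverage rare samples:          {coverage['rare_high_risk']}")
```

Under these assumed shares, a density-matched run of 100 samples sees the rare configuration less than one time in ten, while an even split over the support tests it dozens of times. The even split is the crudest possible coverage strategy; it stands in here for whatever coverage-seeking generator is actually used.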
The identifiability ceiling is the opposite failure: you've reached the right region, your aggregate numbers look great, and you still can't distinguish good from bad. Models can carry every linearly decodable feature a task needs while their internal organization is fractured, a flaw that is invisible to standard metrics and fatal under distribution shift (see "Can models be smart without organized internal structure?"). Fluent, confident, *wrong* answers hide inside strong overall accuracy, concentrating exactly in the rare high-harm cases where you most need to catch them (see "Why do confident wrong answers hide in standard accuracy metrics?"). In retrieval, pooled-cosine recall happily returns structural near-misses alongside genuine matches and cannot tell the two apart; it takes a learned verifier operating on full token-interaction patterns to do the discriminating that the compressed representation couldn't (see "Can verification separate structural near-misses from topical matches?").
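A small worked example shows how strong aggregate accuracy can coexist with failure on the cases that matter. The records below are fabricated purely for illustration (the stratum names and counts are assumptions, not results from any cited note); the point is only the arithmetic of stratified versus pooled accuracy.

```python
from collections import defaultdict


def overall_accuracy(records: list[tuple[str, bool]]) -> float:
    """Pooled accuracy over all (stratum, is_correct) records."""
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_stratum(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Accuracy computed separately within each stratum."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for stratum, ok in records:
        totals[stratum] += 1
        correct[stratum] += int(ok)
    return {stratum: correct[stratum] / totals[stratum] for stratum in totals}


# Illustrative data: 980 routine cases, 20 rare high-harm cases.
records = (
    [("routine", True)] * 960
    + [("routine", False)] * 20
    + [("rare_high_harm", True)] * 5
    + [("rare_high_harm", False)] * 15
)

print(f"overall: {overall_accuracy(records):.3f}")
for stratum, acc in accuracy_by_stratum(records).items():
    print(f"{stratum}: {acc:.3f}")
```

With these made-up counts the pooled figure is 96.5% while the rare high-harm stratum sits at 25%. Reporting per-stratum numbers does not fix the errors, but it makes the identifiability gap visible instead of letting the majority stratum average it away.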
That last example is the tell: identifiability is usually bought by a *separate mechanism* layered on top of coverage. The cleanest framing is the internal-vs-external split in test-time scaling: internal methods build the capability (they widen what's reachable), while external methods are search and verification that extract the right answer from what's already reachable. They complement rather than compete precisely because they're attacking different ceilings (see "How do internal and external test-time scaling compare?"). Asynchronous verifiers that police a reasoning trace without slowing it down are pure identifiability machinery: they don't help the model reach better answers, they catch the bad ones it already produced (see "Can verifiers monitor reasoning without slowing generation down?").
The thing worth walking away with: when a system plateaus, the diagnosis splits cleanly. If it never reaches the cases that matter, more verification won't help: you have a coverage problem, and you need broader generation or evaluation. If it reaches them but can't tell right from convincingly wrong, more breadth won't help: you need a discriminator. Conflating the two is why a model that looks excellent in aggregate can fail in exactly the place it counts.
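One way to run that diagnosis, sketched under assumptions: suppose each problem yields several candidate answers with known correctness, plus the index of the candidate the system actually returned. The `Attempt` record and `diagnose` function below are hypothetical names introduced for this illustration. Coverage is the share of problems with at least one correct candidate; identifiability is the share of those covered problems where the correct candidate was the one selected; end-to-end accuracy is their product.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    candidates_correct: list[bool]  # correctness of each generated candidate
    selected_index: int             # index of the candidate the system returned


def diagnose(attempts: list[Attempt]) -> dict[str, float]:
    """Split end-to-end accuracy into a coverage term and an identifiability term."""
    if not attempts:
        raise ValueError("need at least one attempt")
    # Covered: some candidate was correct, whether or not it was selected.
    covered = [a for a in attempts if any(a.candidates_correct)]
    # Solved: the selected candidate was correct (always a subset of covered).
    solved = [a for a in covered if a.candidates_correct[a.selected_index]]
    return {
        "coverage": len(covered) / len(attempts),
        "identifiability": len(solved) / len(covered) if covered else 0.0,
        "end_to_end": len(solved) / len(attempts),
    }


# Illustrative attempts, invented for this example.
attempts = [
    Attempt([True, False], selected_index=0),        # covered and correctly selected
    Attempt([True, False], selected_index=1),        # covered, wrong candidate chosen
    Attempt([False, False], selected_index=0),       # never reached a correct answer
    Attempt([False, True, True], selected_index=2),  # covered and correctly selected
]

report = diagnose(attempts)
print(report)
```

If `coverage` is the low term, a better selector cannot help, because there is nothing correct to select. If `identifiability` is the low term, generating more candidates mostly adds more convincing wrong answers for the selector to trip over.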
Sources (7 notes)
Evolutionary optimization of Persona Generator code achieves broader trait coverage than density-matched baselines, including rare but consequential user configurations that naive LLM prompting misses.
Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.