Can learned verifiers over token similarity replace dense compositional training?
This explores whether a cheap, lightweight verifier reading token-to-token similarity patterns can stand in for the expensive work of training a model to compose reasoning steps densely — i.e., whether you can bolt verification onto a weak base instead of teaching genuine composition.
This reads the question as a trade between two strategies: spend training compute teaching a model to compose (dense compositional training), or spend much less on a small verifier that reads token-similarity maps and rejects bad outputs after the fact. The corpus suggests these aren't actually substitutes — they solve different halves of the problem — but the boundary between them is more porous than it first looks.
The strongest case *for* the verifier route comes from work showing that a small Transformer operating on full token-token similarity maps reliably catches "structural near-misses" that pooled-vector recall and MaxSim-style late interaction wave through (see "Can verification separate structural near-misses from topical matches?"). The lesson there is that the discriminative signal lives in the *interaction pattern* between tokens, not in any pooled summary, and a cheap downstream stage can read it. That's encouraging if you think the base model already produces roughly-right candidates and just needs filtering.
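To make the "interaction pattern versus pooled summary" point concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption rather than the cited pipeline: the three-word vocabulary, the orthogonal toy embeddings, and especially `diagonal_alignment`, which is a hand-written stand-in for a learned Transformer verifier. It shows a reordered sentence scoring perfectly under pooled cosine and a MaxSim-style score while its full similarity map looks visibly different.

```python
import numpy as np

def unit_rows(x):
    """Scale each row (one token embedding per row) to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def similarity_map(query, doc):
    """Full token-token cosine map, shape (query tokens, doc tokens)."""
    return unit_rows(query) @ unit_rows(doc).T

def pooled_cosine(query, doc):
    """Cosine between mean-pooled embeddings; token order is discarded."""
    q, d = query.mean(axis=0), doc.mean(axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def maxsim(query, doc):
    """MaxSim-style score: best doc match per query token, averaged here
    (rather than summed) so the result stays in [-1, 1]."""
    return float(similarity_map(query, doc).max(axis=1).mean())

def diagonal_alignment(sim):
    """Hypothetical stand-in for a learned verifier: mean similarity of
    tokens at the same position. Assumes equal-length sequences."""
    return float(np.diag(sim).mean())

# Hypothetical toy vocabulary with orthogonal (one-hot) embeddings.
words = ["dog", "bites", "man"]
vocab = dict(zip(words, np.eye(len(words))))

def embed(tokens):
    return np.stack([vocab[t] for t in tokens])

query = embed(["dog", "bites", "man"])
match = embed(["dog", "bites", "man"])
near_miss = embed(["man", "bites", "dog"])  # same words, roles swapped

for name, doc in [("match", match), ("near_miss", near_miss)]:
    sim = similarity_map(query, doc)
    print(name, round(pooled_cosine(query, doc), 3),
          round(maxsim(query, doc), 3), round(diagonal_alignment(sim), 3))
```

Real token embeddings are contextual, so the collapse is rarely this clean, and a learned verifier would pick up far subtler patterns than a diagonal check. The sketch only illustrates why a stage that reads the whole map has information the pooled and max-reduced scores have already thrown away.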
But filtering can't manufacture a capability the candidate pool never contains, and this is the load-bearing constraint. Self-improvement in language models is formally bounded by a generation-verification gap: a model can't reliably exceed what some external check can validate (see "What stops large language models from improving themselves?"). A learned verifier is exactly that external check, which means it *enables* the dense training loop rather than replacing it. And there's reason to doubt the candidates in the first place: transformers' "compositional reasoning" often reduces to memorized subgraph matching that shatters on novel compositions (see "Do transformers actually learn systematic compositional reasoning?"), and chain-of-thought frequently imitates the *form* of reasoning rather than performing it (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). A verifier over token similarity can flag a near-miss, but if the model never generates a genuinely novel correct composition, there's nothing right to select.
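The selection argument can be written as a back-of-the-envelope formula. This is a sketch under simplifying assumptions that are not taken from the cited notes: the base model produces a correct composition with independent per-sample probability $p$, the pipeline draws $n$ candidates, and the verifier accepts a correct candidate with probability $r$ while never accepting an incorrect one.

```latex
\[
P_{\text{success}}(n) \;=\; 1 - (1 - r\,p)^{n},
\qquad
p = 0 \;\Longrightarrow\; P_{\text{success}}(n) = 0
\quad \text{for every } n \text{ and every } r .
\]
```

Under these assumptions, a better verifier (larger $r$) or more samples (larger $n$) only helps when $p > 0$. If the generator never produces the novel composition, no amount of filtering recovers it.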
Where the question gets interesting is that the line between "verification" and "training signal" is dissolving in the corpus. VeriFree drops the verifier entirely, using the likelihood of a reference answer given the generated reasoning as *both* reward and training weight (see "Can reasoning improvement work without answer verification?"), so a token-level, similarity-like signal becomes the training objective rather than a post-hoc gate. Relatedly, RLVR's learning signal turns out to concentrate in the ~20% of tokens that are high-entropy "forking" points (see "Do high-entropy tokens drive reasoning model improvements?"), and pruning reasoning chains reveals a ranking of tokens by functional importance, with symbolic-computation tokens preserved first (see "Which tokens in reasoning chains actually matter most?"). Both hint that "dense" training is already sparse where it counts: most of the compute isn't doing compositional work.
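The "sparse where it counts" idea can be sketched as an entropy mask over token positions. This is a minimal illustration in Python with NumPy, not the procedure from the cited work: the function names, the toy logits, and the choice to keep exactly the top 20% of positions by entropy are all assumptions made for the example.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution per position.
    `logits` has shape (sequence length, vocabulary size)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def forking_mask(entropy, keep_fraction=0.2):
    """Boolean mask over the highest-entropy positions (hypothetical rule:
    keep roughly `keep_fraction` of the tokens, at least one)."""
    k = max(1, int(round(keep_fraction * entropy.size)))
    threshold = np.sort(entropy)[-k]
    return entropy >= threshold

def masked_loss(per_token_loss, mask):
    """Average the loss over selected positions only, so the remaining
    tokens contribute no gradient."""
    return float((per_token_loss * mask).sum() / mask.sum())

# Toy sequence: ten positions, four-word vocabulary. Most positions are
# near-deterministic; positions 3 and 7 are uniform "forks".
logits = np.tile(np.array([10.0, 0.0, 0.0, 0.0]), (10, 1))
logits[[3, 7]] = 0.0

entropy = token_entropy(logits)
mask = forking_mask(entropy)
per_token_loss = np.arange(10, dtype=float)  # placeholder loss values

print(np.flatnonzero(mask))
print(masked_loss(per_token_loss, mask))
```

In a real training loop the mask would multiply the per-token policy-gradient or likelihood loss before reduction. The sketch only shows the selection step and why most positions drop out of the update.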
So the honest answer the corpus points to: a learned verifier over token similarity can replace the *brute-force density* of training (you don't need to push gradients through every token), but it can't replace the compositional *structure* itself. Composition appears to live in modular subnetworks that pretraining makes more reliable (see "Do neural networks naturally learn modular compositional structure?"); a verifier sharpens which outputs you keep, but the ability to compose still has to be built into the model that generates them.
Sources (8 notes)
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.