How should we measure and report serial compute separately?

This explores why serial (sequential, depth-accumulating) compute deserves its own measurement and reporting line, separate from parallel (breadth, sampling-many) compute — and what the corpus says you'd actually capture by doing so.

This reads the question as: when we report "test-time compute," we usually lump everything into one token or FLOP budget, but serial and parallel compute do fundamentally different work, so collapsing them hides the thing that matters. The corpus backs this up strongly. The recurring framing is that parallel scaling buys *coverage* (many short independent attempts) while serial scaling buys *depth* (one chain accumulating intermediate results), and the right mix depends on task structure (see "How should we balance parallel versus sequential compute at test time?"). If you only report total tokens, two systems with identical budgets but opposite allocations look the same on paper while behaving completely differently.
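To make the "identical budgets, opposite allocations" point concrete, here is a minimal sketch of a two-axis compute record. The class name `ComputeReport`, its fields, and the token counts are illustrative assumptions, not a format taken from the corpus; the sketch also assumes every chain has the same length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeReport:
    """Hypothetical two-axis compute record (names are illustrative)."""

    parallel_width: int  # number of independent chains sampled
    serial_depth: int    # tokens generated sequentially within each chain

    @property
    def total_tokens(self) -> int:
        # The single scalar that is usually reported.
        return self.parallel_width * self.serial_depth


# Two made-up systems with the same total budget but opposite allocations.
wide = ComputeReport(parallel_width=64, serial_depth=256)
deep = ComputeReport(parallel_width=1, serial_depth=16384)

print(wide.total_tokens == deep.total_tokens)  # same scalar budget
print(wide == deep)                            # different allocations
```

Reporting both fields keeps the two systems distinguishable, while the scalar alone does not.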

The sharpest reason to break serial out separately is that on certain problems it isn't a tuning knob; it's the only thing that works. On compositional tasks like graph connectivity, sequential chain-of-thought gives an *exponential* accuracy advantage over parallel voting, because the solution genuinely requires accumulating intermediate results step by step, which short parallel chains cannot do (see "When does sequential reasoning beat parallel voting?"). A combined compute number erases exactly this: you can't tell whether a model spent its budget on the kind of depth the task demands or wasted it sampling wide.
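As a toy analogy for why connectivity resists short independent attempts, the sketch below runs a breadth-first search with a cap on sequential rounds. This is not the construction used in the cited note; the function name `connected_within` and the path graph are assumptions chosen only to show that each round depends on the frontier produced by the previous one.

```python
def connected_within(graph, source, target, max_rounds):
    """Breadth-first search limited to `max_rounds` sequential expansions.

    Returns True only if `target` is reached within the round budget.
    Each round needs the frontier from the previous round, so the rounds
    cannot be computed independently of one another.
    """
    frontier = {source}
    seen = {source}
    for _ in range(max_rounds):
        if target in seen:
            return True
        frontier = {
            nbr
            for node in frontier
            for nbr in graph.get(node, ())
            if nbr not in seen
        }
        if not frontier:
            break
        seen |= frontier
    return target in seen


# A path graph 0 - 1 - ... - 8: node 8 is eight hops from node 0.
path = {i: [i - 1, i + 1] for i in range(1, 8)}
path[0] = [1]
path[8] = [7]

shallow = connected_within(path, 0, 8, max_rounds=3)  # budget too short
deep = connected_within(path, 0, 8, max_rounds=8)     # enough serial depth
print(shallow, deep)
```

Repeating the three-round attempt any number of times gives the same negative answer, so voting over shallow attempts cannot recover what one sufficiently deep attempt finds.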

But serial compute also carries a cost that parallel compute doesn't, which is the second reason to report it on its own axis: error accumulates per step. Decomposing CoT reveals that genuine reasoning exists but compounds error with each additional step (see "What three separate factors drive chain-of-thought performance?"). Information-theoretic analysis of slow-thinking frameworks likewise finds that "snowball" errors accrue per step regardless of which search algorithm you use; what matters is total compute, search scope, and reward-function reliability, not the framework name (see "Does the choice of reasoning framework actually matter for test-time performance?"). So serial depth has a built-in degradation curve that parallel breadth lacks. Reporting serial steps separately lets you see where the chain starts to rot, which is invisible if depth is buried in an aggregate.
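A rough way to picture per-step compounding is to assume each step is correct independently with the same probability. That independence assumption, the function name `chain_success`, and the 0.99 figure are simplifications introduced here, not results from the cited notes.

```python
def chain_success(per_step_accuracy: float, depth: int) -> float:
    """Probability that all `depth` steps are correct.

    Assumes step errors are independent and identically likely, which is
    a simplifying assumption made for illustration only.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step_accuracy ** depth


# Illustrative depths; 0.99 is a made-up per-step accuracy.
curve = {depth: chain_success(0.99, depth) for depth in (1, 10, 50, 100)}
for depth, prob in curve.items():
    print(f"depth={depth:>3}  whole-chain success={prob:.3f}")
```

Under this toy model, a step that looks almost perfect in isolation still yields a chain that fails more often than it succeeds at depth 100, which is the kind of curve a separate serial-depth axis would expose.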

This also connects to a measurement problem the corpus flags directly: short-horizon numbers don't predict long-horizon (deeply serial) behavior. Models that rank similarly on single-turn tasks diverge dramatically by relay 25 of a 50-round-trip relay (see "Do short benchmarks predict how models perform over long workflows?"). The honest way to surface that is to report performance *as a function of serial depth*, not at a single point. The same logic motivates step-level rather than global metrics: local, per-step confidence catches breakdowns that whole-trace averaging masks and enables early stopping (see "Does step-level confidence outperform global averaging for trace filtering?").
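Reporting performance as a function of serial depth can be as simple as bucketing results by depth. The sketch below assumes each evaluation record is a `(serial_depth, is_correct)` pair; the record format, the function name `accuracy_by_depth`, and the sample data are made up for illustration and are not measurements from the cited benchmark.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    """Group (serial_depth, is_correct) pairs and return {depth: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, is_correct in records:
        totals[depth] += 1
        hits[depth] += int(is_correct)
    return {depth: hits[depth] / totals[depth] for depth in sorted(totals)}


# Made-up records for two hypothetical models: equal at depth 1,
# different at depth 25.
records_a = [(1, True), (1, True), (25, True), (25, True)]
records_b = [(1, True), (1, True), (25, True), (25, False)]

curve_a = accuracy_by_depth(records_a)
curve_b = accuracy_by_depth(records_b)
print(curve_a)
print(curve_b)
```

A single-point score taken at depth 1 would rank these two hypothetical models as equal; the per-depth curves separate them.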

The practical upshot the corpus points toward: report compute along two axes, parallel width and serial depth, rather than one scalar. One reason is that the optimal allocation is adaptive per prompt difficulty (see "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?"). Another is that inference-time compute can trade off against raw model size on hard prompts, which makes depth a first-class resource you'd want to price independently (see "Can inference compute replace scaling up model size?"). And since serial and parallel scaling sit on the broader internal-vs-external scaling map (see "How do internal and external test-time scaling compare?"), naming which axis a gain came from is what makes results comparable at all. The thing you didn't know you wanted: a single "compute budget" number is almost meaningless for reasoning systems, because the interesting signal is in *how* that budget splits between going wide and going deep.
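One way to picture adaptive allocation is to split a fixed token budget across prompts in proportion to an estimated difficulty score. The proportional rule, the function name `allocate_budget`, and the scores below are assumptions for illustration; the cited notes say only that adaptive allocation beats uniform budgets, not how to compute it.

```python
def allocate_budget(difficulties, total_tokens):
    """Split `total_tokens` across prompts in proportion to difficulty.

    `difficulties` are hypothetical non-negative scores, one per prompt.
    Shares are rounded down, so a few tokens may go unallocated.
    """
    if any(d < 0 for d in difficulties):
        raise ValueError("difficulty scores must be non-negative")
    weight_sum = sum(difficulties)
    if weight_sum <= 0:
        raise ValueError("at least one difficulty score must be positive")
    return [int(total_tokens * d / weight_sum) for d in difficulties]


# Made-up scores: two easy prompts and one prompt judged twice as hard.
uniform = allocate_budget([1, 1, 1], total_tokens=3000)
adaptive = allocate_budget([1, 1, 2], total_tokens=4000)
print(uniform)
print(adaptive)
```

Whatever rule is used, recording the resulting per-prompt width and depth alongside the total keeps the allocation visible in the report.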

Sources 10 notes

How should we balance parallel versus sequential compute at test time?

Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Do short benchmarks predict how models perform over long workflows?

DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
