SYNTHESIS NOTE

Can architecture choices improve inference efficiency without sacrificing accuracy?

Standard scaling laws optimize training efficiency but ignore inference cost. This note explores whether architectural variables such as hidden size and attention configuration can improve inference efficiency without reducing model accuracy under a fixed training budget.

Synthesis note · 2026-02-23 · sourced from Inference time scaling

Standard scaling laws (Chinchilla) optimize the trade-off between model parameters and training data for a fixed training compute budget. They say nothing about inference cost. But as LLMs move from research to deployment, inference cost dominates — and architecture choices affect inference efficiency in ways that parameter count alone does not predict.
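
For reference, the parametric loss form fitted in the Chinchilla work (Hoffmann et al., 2022) is usually written as below. The symbols follow that paper rather than this note: N is parameter count, D is training tokens, C is training compute in FLOPs, and E, A, B, alpha, beta are fitted constants. No term depends on how the parameters are arranged or on what the model costs to serve, which is the gap described above.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6ND
```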

The conditional scaling law augments Chinchilla by conditioning on three architectural variables: hidden size, the ratio of MLP parameters to attention parameters, and grouped-query attention (GQA) configuration. These variables affect inference throughput independently of their effect on accuracy. A model with the same parameter count and training budget can have dramatically different inference costs depending on how those parameters are allocated between MLP and attention layers.
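
A minimal sketch of why allocation matters, assuming a standard decoder-only transformer with a gated (SwiGLU-style) MLP and counting only attention and MLP weights (embeddings, norms, and biases are ignored). The class, field names, and the two configurations are hypothetical illustrations, not the models or the fitted law from the source. Both configurations have the same block parameter count, but the GQA variant moves parameters from key/value projections into the MLP and keeps a KV cache one quarter the size per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    """Hypothetical decoder-only config; names are illustrative only."""
    n_layers: int
    d_model: int     # hidden size
    n_heads: int     # number of query heads
    n_kv_heads: int  # number of key/value heads (GQA when < n_heads)
    d_ff: int        # MLP inner width

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    def attn_params_per_layer(self) -> int:
        # Query and output projections are d_model x d_model each.
        q_and_o = 2 * self.d_model * self.d_model
        # Key and value projections shrink with the number of KV heads.
        k_and_v = 2 * self.d_model * self.n_kv_heads * self.head_dim
        return q_and_o + k_and_v

    def mlp_params_per_layer(self) -> int:
        # Assumption: gated MLP with three weight matrices (gate, up, down).
        return 3 * self.d_model * self.d_ff

    def block_params(self) -> int:
        per_layer = self.attn_params_per_layer() + self.mlp_params_per_layer()
        return self.n_layers * per_layer

    def mlp_to_attn_ratio(self) -> float:
        return self.mlp_params_per_layer() / self.attn_params_per_layer()

    def kv_cache_bytes_per_token(self, bytes_per_value: int = 2) -> int:
        # One key and one value vector per KV head per layer (2 bytes = fp16/bf16).
        return 2 * self.n_layers * self.n_kv_heads * self.head_dim * bytes_per_value


# Two hypothetical configs with identical attention + MLP parameter counts.
mha = ArchConfig(n_layers=16, d_model=2048, n_heads=32, n_kv_heads=32, d_ff=5120)
gqa = ArchConfig(n_layers=16, d_model=2048, n_heads=32, n_kv_heads=8, d_ff=6144)

for name, cfg in (("MHA", mha), ("GQA", gqa)):
    print(
        f"{name}: params={cfg.block_params():,} "
        f"mlp/attn={cfg.mlp_to_attn_ratio():.3f} "
        f"kv_bytes_per_token={cfg.kv_cache_bytes_per_token():,}"
    )
```

Parameter count and KV-cache size are only proxies here. Measured throughput also depends on hardware, batch size, and sequence length, which this sketch does not model.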

In empirical validation across 200+ models (80M to 3B parameters, 8B to 100B training tokens), optimized architectures achieve up to 2.1% higher accuracy AND 42% greater inference throughput than LLaMA-3.2 under the same training budget. The "and" is the key finding: accuracy and inference efficiency are not zero-sum when architecture is treated as a free variable. Suboptimal architectures sacrifice both at once.

This adds a third optimization lever to the inference compute landscape. The note "Can inference compute replace scaling up model size?" establishes the training-inference compute trade-off, and the note "Can we allocate inference compute based on prompt difficulty?" establishes adaptive allocation. Architecture optimization sits upstream of both: it determines the baseline efficiency at which every unit of inference compute converts to performance. A 42% throughput improvement means the same inference budget produces up to 42% more reasoning attempts, parallel samples, or search steps.
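
The budget arithmetic can be sketched as follows. The function name and all numbers except the 42% figure are hypothetical, and the sketch assumes the throughput gain carries over fully to served tokens per second and that every attempt generates the same number of tokens.

```python
def attempts_within_budget(
    budget_seconds: int, tokens_per_second: int, tokens_per_attempt: int
) -> int:
    """Number of complete attempts that fit in a fixed wall-clock budget."""
    return (budget_seconds * tokens_per_second) // tokens_per_attempt


# Hypothetical serving numbers, chosen only to make the arithmetic visible.
budget_seconds = 600
tokens_per_attempt = 2000
baseline_tps = 1000
optimized_tps = baseline_tps * 142 // 100  # 42% higher throughput

baseline_attempts = attempts_within_budget(
    budget_seconds, baseline_tps, tokens_per_attempt
)
optimized_attempts = attempts_within_budget(
    budget_seconds, optimized_tps, tokens_per_attempt
)

print(baseline_attempts, optimized_attempts)
```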

For reasoning systems that scale inference compute extensively, the architectural multiplier compounds: a model with 42% higher throughput gets 42% more exploration from the same inference budget. This matters disproportionately for approaches like those covered in "Why does parallel reasoning outperform single chain thinking?", where more parallel attempts directly improve accuracy.
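
One way to see why extra attempts help parallel sampling is the standard "at least one success" calculation. This is a simplified sketch, not the analysis from the source: it assumes attempts are independent with the same success probability, and the probability and attempt counts are hypothetical.

```python
def coverage(p_success: float, n_attempts: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** n_attempts


p = 0.05  # hypothetical per-attempt success rate on a hard problem
baseline_attempts = 20
scaled_attempts = baseline_attempts * 142 // 100  # 42% more attempts, rounded down

baseline_coverage = coverage(p, baseline_attempts)
scaled_coverage = coverage(p, scaled_attempts)

print(f"{baseline_attempts} attempts: {baseline_coverage:.3f}")
print(f"{scaled_attempts} attempts: {scaled_coverage:.3f}")
```

Real attempts are correlated, and selection by voting or a verifier behaves differently from this best-case coverage, so the sketch is an upper-level intuition rather than a prediction.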

Original note title

conditional scaling laws that incorporate architectural variables predict inference efficiency independently of training compute