Do small models show different parameter efficiency patterns than large models?
This explores whether small models get more done per parameter in ways that don't simply scale down from large ones — and where the efficiency frontier sits differently for them.
The corpus suggests they do, and the differences are less about size itself than about which jobs reward smallness. The most striking case is output diversity: models around 500M parameters actually generate *more* unique outputs per sample than bigger ones, because larger models concentrate probability mass on their preferred answers and so produce less variety within a fixed sampling budget (see "Why aren't bigger models better for generating diverse outputs?"). That pattern runs counter to the usual "bigger is better" intuition.
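One way to make "unique outputs per sample" concrete is a distinct-sample ratio. The sketch below is illustrative only: the function name `distinct_ratio`, the whitespace-and-case normalization, and both sample lists are assumptions invented here, not the metric or data used in the underlying note.

```python
def distinct_ratio(samples):
    """Return the fraction of samples that are unique after light normalization."""
    # Normalize case and whitespace so trivially different strings count as one output.
    normalized = [" ".join(s.lower().split()) for s in samples]
    if not normalized:
        return 0.0
    return len(set(normalized)) / len(normalized)


# Hypothetical samples, invented for illustration (not real model outputs).
peaked_samples = ["The cat sat.", "The cat sat.", "the cat  sat.", "A dog ran."]
flatter_samples = ["The cat sat.", "A dog ran.", "Birds fly south.", "A dog ran."]

print(distinct_ratio(peaked_samples))   # 0.5
print(distinct_ratio(flatter_samples))  # 0.75
```

A model that piles probability mass onto one answer repeats itself across draws, so its ratio falls even though each individual sample may be of high quality.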
A recurring theme is that small models can match large ones once you separate *format* from *knowledge*. A 1.5B model with LoRA-only tuning matched much larger RL-trained models on reasoning, implying that a lot of what looks like reasoning capability is really learned output organization, not new facts (see "Can small models reason well by just learning output format?"). The same separability shows up in function calling, where small models trained with DPO on a teacher's correct-and-incorrect examples close the gap by directly targeting rigid-format failures that plain fine-tuning misses (see "Can small models match large models on function calling?"). So for small models, the efficient lever is often "teach the shape of the answer," not "cram in more parameters."
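To see why paired correct-and-incorrect examples give a sharper signal than fine-tuning on correct ones alone, it helps to look at the standard DPO objective for a single preference pair. This is a minimal sketch of that published loss, not the training code from the cited work; the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors each answer than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities, invented for illustration.
print(dpo_pair_loss(-5.0, -5.0, -5.0, -5.0))  # no preference yet: log(2)
print(dpo_pair_loss(-3.0, -9.0, -5.0, -5.0))  # policy prefers the chosen answer: lower loss
```

The rejected term is what supervised fine-tuning lacks: the loss falls only when the malformed call becomes less likely relative to the well-formed one.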
Architecture is where the patterns genuinely diverge by scale. For sub-billion models, depth beats width: deep-and-thin designs gain several accuracy points over balanced ones by composing concepts through layers, contradicting the Kaplan scaling-law result, derived from larger models, that model shape matters little at a fixed parameter count (see "Does depth matter more than width for tiny language models?"). More generally, folding architectural variables like hidden size and attention ratios into scaling laws unlocks inference gains that flat parameter-count thinking ignores (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). And you can sidestep parameter scaling entirely: spending more compute at inference time lets a smaller model match a larger one on hard prompts, showing that pretraining size and inference compute can be traded against each other rather than being independent resources (see "Can inference compute replace scaling up model size?").
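The depth-versus-width comparison only makes sense at a matched parameter budget. The sketch below uses the common rule of thumb that a standard transformer's non-embedding parameter count is about 12 × layers × d_model², which assumes a feed-forward width of 4 × d_model and ordinary multi-head attention. Both shapes are hypothetical and chosen only because they land on nearly the same budget; they are not configurations taken from the cited notes.

```python
def approx_block_params(n_layers, d_model):
    """Approximate non-embedding parameters: 4*d^2 (attention) + 8*d^2 (MLP) per layer."""
    return 12 * n_layers * d_model ** 2


# Hypothetical shapes with nearly equal budgets.
deep_thin = approx_block_params(n_layers=30, d_model=576)
shallow_wide = approx_block_params(n_layers=12, d_model=912)

print(deep_thin)     # 119439360
print(shallow_wide)  # 119771136
print(abs(deep_thin - shallow_wide) / shallow_wide)  # well under 1% apart
```

A flat parameter-count scaling law treats these two models as equivalent, which is exactly the assumption the depth-over-width finding challenges at sub-billion scale.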
The deeper twist is that the small-vs-large framing sometimes dissolves. Some ceilings are scale-invariant: LLMs plateau at roughly 55–60% constraint satisfaction on constrained optimization regardless of parameter count (see "Do larger language models solve constrained optimization better?"), and they pattern-match rather than actually run iterative numerical methods no matter how big they get (see "Do large language models actually perform iterative optimization?"). Where the task has a hard wall, more parameters buy nothing. So the real efficiency story isn't "small differs from large" so much as this: scaling pays off for some capabilities and is wasted on others, and small models expose that boundary more cheaply.
If you want the practical payoff, the agent literature has already drawn the conclusion: small models handle most repetitive, well-defined agent subtasks at 10–30× lower cost, making heterogeneous "small by default, large selectively" systems the rational design (see "Can small language models handle most agent tasks?"). Routing a fleet of small specialists can even beat a single frontier model (see "Can routing beat building one better model?"), and at the system level, raw per-parameter efficiency turns out to matter less than where you spend total compute across planning, memory, and tools (see "Why does agent efficiency differ from model size reduction?").
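The "small by default, large selectively" pattern can be sketched as a rule-based router with a cost tally. Everything here is hypothetical: the model names, the task labels, the 80/20 workload split, and the 20× cost ratio (picked from inside the 10–30× range quoted above) are assumptions for illustration, and real routers such as the cluster-based one cited above are learned rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float  # arbitrary cost units


SMALL = Model("small-specialist", 1.0)   # hypothetical
LARGE = Model("large-generalist", 20.0)  # hypothetical 20x cost ratio

# Hypothetical labels for repetitive, well-defined subtasks.
ROUTINE_TASKS = {"extract_fields", "format_json", "classify_intent"}


def route(task_type):
    """Send routine subtasks to the small model and escalate everything else."""
    return SMALL if task_type in ROUTINE_TASKS else LARGE


def total_cost(task_types):
    return sum(route(t).cost_per_call for t in task_types)


workload = ["extract_fields"] * 8 + ["open_ended_planning"] * 2
mixed_cost = total_cost(workload)
all_large_cost = LARGE.cost_per_call * len(workload)

print(mixed_cost, all_large_cost)   # 48.0 200.0
print(mixed_cost / all_large_cost)  # 0.24
```

Under these assumed numbers the two escalated calls dominate the bill, which is why the share of work that truly needs the large model drives system cost more than the small model's per-call price.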
Sources (11 notes)
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
Agentic systems consume resources exponentially through recursive loops, making per-token model efficiency marginal. True efficiency requires system-level trade-offs between task success and total cost across planning, memory, and tool use.