INQUIRING LINE

Do scaling laws change when weight precision becomes a design variable?

This explores what happens to scaling laws — the predictable curves relating model size, data, and compute to performance — once the *precision* of each weight (how many bits it uses) is something you get to choose rather than a fixed assumption (usually 16-bit).


The sharpest answer in the corpus is yes, the curve moves. BitNet b1.58 shows that LLMs trained natively with *ternary* weights (each weight takes one of three values, roughly 1.58 bits) match full-precision FP16/BF16 models on perplexity and end-task benchmarks at the same parameter count, while cutting latency, memory, and energy (see "Can ternary weights match full precision model performance?"). The striking part isn't just the compression: the authors frame the result as enabling a *new* scaling law, one with a different cost axis, and as an invitation to design hardware around 1-bit models. Precision, then, isn't a lossy afterthought applied to an already-trained model; made a first-class design variable, it redraws the relationship between size and capability.
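
To make "ternary" concrete, here is a minimal sketch of an absmean-style quantizer of the kind the BitNet b1.58 work describes: scale the weights by their mean absolute value, round, and clip to {-1, 0, +1}. The function name, the toy weights, and the `eps` value are assumptions for illustration. The sketch quantizes a fixed list after the fact, so it shows only the number format, not native low-bit training.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 using one shared scale (hypothetical helper)."""
    # Shared scale: the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

# Toy weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]
quantized, scale = absmean_ternary(weights)

# Three possible values carry log2(3) bits of information, hence "about 1.58 bits".
bits_per_weight = math.log2(3)

print(quantized)                  # [1, 0, 1, -1, 0, 1]
print(round(bits_per_weight, 3))  # 1.585
```

Small weights collapse to 0 and large ones saturate at plus or minus 1, which is why the shared scale has to be kept alongside the ternary values.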

The deeper point is that scaling laws were never one fixed law; they are a template you can re-parameterize by whatever you let vary. The corpus has a clear example: when you fold *architectural* choices (hidden size, the MLP-to-attention ratio, the grouped-query attention configuration) into the scaling law, you can optimize for inference efficiency and get up to 42% more throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Precision is the same kind of move: it adds a dimension the original Chinchilla-style law held constant. Each new design variable doesn't break scaling laws; it gives you a richer surface on which to find better trade-offs.
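
One way to picture the "template" idea is the Chinchilla-style parametric form, which models loss as a function of parameter count N and training tokens D. The second line is a hypothetical re-parameterization, not a formula from the notes above: the symbol θ stands for whatever extra design variable is allowed to vary (precision, architecture), and N_eff is an assumed "effective parameter count" that absorbs it.

```latex
% Chinchilla-style template: E is the irreducible loss; A, B, \alpha, \beta are fitted constants.
\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

% Hypothetical extension (illustrative notation only): a design variable \theta
% enters through an assumed effective parameter count N_eff.
\[
L(N, D, \theta) = E + \frac{A}{N_{\mathrm{eff}}(N, \theta)^{\alpha}} + \frac{B}{D^{\beta}}
\]
```

Setting N_eff(N, θ) = N recovers the original law, which is the sense in which the new variable extends the template rather than replacing it.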

What's worth knowing is that the most interesting scaling shifts are happening *off* the parameter-count axis entirely. Inference-time compute can substitute for model size: smaller models given more thinking time can match larger ones, especially on difficult prompts, which means pretraining and inference compute aren't independent resources (see "Can inference compute replace scaling up model size?"). The same pattern recurs for research agents, where the number of *search steps* shows a diminishing-returns curve that mirrors the one for reasoning tokens, a genuinely new inference-compute axis (see "Do search steps follow the same scaling rules as reasoning tokens?"). Precision joins this family: it's one more dimension along which you can trade resources, and the lesson across all of them is that capability is governed by a multi-axis budget, not a single number.
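
A rough sketch of why the two budgets are coupled, assuming the common approximation that a forward pass costs about 2 FLOPs per parameter per generated token. That approximation ignores attention cost over the context, and the model sizes and token counts are hypothetical. The sketch compares cost only and says nothing about whether the smaller model reaches the same quality.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Approximate forward-pass FLOPs: about 2 per parameter per generated token."""
    return 2 * n_params * n_tokens

# Hypothetical models and answer length, for illustration only.
LARGE_PARAMS = 8_000_000_000
SMALL_PARAMS = 1_000_000_000
LARGE_TOKENS = 500

# Inference budget spent by the large model on one answer.
budget = forward_flops(LARGE_PARAMS, LARGE_TOKENS)

# Tokens the small model can generate for the same budget.
small_tokens = budget // (2 * SMALL_PARAMS)

print(small_tokens)  # 4000
```

Under these assumptions, an 8x smaller model gets 8x as many reasoning tokens for the same inference cost, which is the kind of exchange the notes describe.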

There's also a hardware-shaped reason precision matters as a design variable, visible in the mobile work. On memory-bound devices, the bottleneck isn't the arithmetic but *moving* the weights, so running one transformer block twice can take less latency than fetching a second block's separate weights from memory (see "Does recomputing weights cost less than moving them on mobile?"). Low-bit weights attack the same bottleneck from the other side: fewer bits per weight means fewer bytes to move. This is why BitNet's authors point toward custom 1-bit hardware: once precision is a design variable, it co-evolves with the chip, and the 'cost' term in the scaling law stops being abstract FLOPs and becomes bytes moved on real silicon.
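
The "fewer bytes to move" point reduces to simple arithmetic. The sketch below assumes a hypothetical 3-billion-parameter model and an idealized packing of log2(3) bits per ternary weight. Real storage formats may use more (for example 2 bits per weight), so treat the ternary figure as a lower bound.

```python
import math

def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store (and therefore move) all weights at a given precision."""
    return n_params * bits_per_weight / 8

N_PARAMS = 3_000_000_000  # hypothetical model size

fp16_gb = weight_bytes(N_PARAMS, 16) / 1e9
ternary_gb = weight_bytes(N_PARAMS, math.log2(3)) / 1e9  # idealized packing
ratio = fp16_gb / ternary_gb

print(f"FP16:    {fp16_gb:.2f} GB")    # FP16:    6.00 GB
print(f"Ternary: {ternary_gb:.2f} GB") # Ternary: 0.59 GB
print(f"Ratio:   {ratio:.1f}x")        # Ratio:   10.1x
```

On a memory-bound device, that roughly tenfold cut in weight traffic is the quantity that matters, independent of how many FLOPs the model performs.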

If you want to keep pulling this thread, the adjacent territory is everything that decouples performance from naive weight-counting: representation finetuning that intervenes on frozen activations instead of updating weights, hitting 10–50x better parameter efficiency than LoRA (see "Can editing hidden representations beat weight updates for finetuning?"); finetuning's own multiplicative scaling law, where a larger base model helps more than more pretraining data (see "How should finetuning scale with model and data size?"); and weight *sparsity*, a different bit-budget move that trades dense capacity for interpretable, modular circuits (see "Can sparse weight training make neural networks interpretable by design?"). The throughline: 'how many parameters' was always a proxy. As precision, sparsity, architecture, and inference compute each become design variables, scaling laws don't dissolve; they multiply into a family, each describing a different face of the same cost-versus-capability surface.


Sources (8 notes)

Can ternary weights match full precision model performance?

BitNet b1.58 trains natively with ternary weights and matches FP16/BF16 performance on perplexity and end-task benchmarks at equal model size, while cutting latency, memory, and energy costs. The result enables a new scaling law and opens the path to hardware designed specifically for 1-bit LLMs.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks and computing the shared block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than more pretraining data, while increasing PET parameters provides minimal benefit.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
