How do weight perturbations reveal what performance benchmarks cannot measure?

This explores why poking and prodding a model's internal weights — perturbing them, ablating them, forcing them sparse — exposes structural problems that a clean benchmark score will never show.

The corpus keeps circling one uncomfortable idea: a model's test accuracy describes its outputs, not its internal organization, and the two can come apart completely. The Fractured Entangled Representation work shows that two networks trained by gradient descent can produce identical answers on every input while their internal representations are organized in radically different, sometimes incoherent ways (see "Can AI pass every test while understanding nothing?"). A benchmark cannot see this, because a benchmark only ever asks 'did you get the right answer?'

Perturbation works as a diagnostic where benchmarks fail because it interrogates structure directly. A model can hold all the features a task needs in a linearly decodable form, and so score perfectly, while its underlying organization is brittle and tangled. That brittleness only becomes visible when you nudge the weights or shift the distribution: the model that looked equal on the leaderboard falls apart, and the one with cleaner internal structure survives (see "Can models be smart without organized internal structure?"). Perturbation, in other words, is a stress test for the wiring, not the answer.
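
As a minimal sketch of that stress test (a toy construction, not taken from the cited notes): two hypothetical linear readouts, `clean` and `tangled`, compute the same function and therefore tie on any output-only benchmark, but `tangled` gets there by cancelling two large weights. The multiplicative Gaussian noise model, the weight values, and every function name below are assumptions chosen for illustration.

```python
import random

def predict(weights, x):
    # Linear readout: every unit sees the same scalar input x.
    return sum(w * x for w in weights)

clean = [1.0]               # one unit carries the whole signal
tangled = [1000.0, -999.0]  # two large units that cancel to the same function

def perturb(weights, sigma, rng):
    # Assumed noise model: scale each weight by (1 + sigma * n), n ~ N(0, 1).
    return [w * (1.0 + sigma * rng.gauss(0.0, 1.0)) for w in weights]

def mean_abs_error(weights, inputs, sigma, trials, seed):
    # Average output drift caused by weight noise, over several noise draws.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = perturb(weights, sigma, rng)
        drift = sum(abs(predict(noisy, x) - predict(weights, x)) for x in inputs)
        total += drift / len(inputs)
    return total / trials

inputs = [-2.0, -1.0, 0.5, 1.0, 3.0]

# "Benchmark" view: largest disagreement between the two models' outputs.
benchmark_gap = max(abs(predict(clean, x) - predict(tangled, x)) for x in inputs)

# "Perturbation" view: how far each model drifts under the same relative noise.
err_clean = mean_abs_error(clean, inputs, sigma=0.01, trials=200, seed=0)
err_tangled = mean_abs_error(tangled, inputs, sigma=0.01, trials=200, seed=0)

print("benchmark gap:", benchmark_gap)
print("drift (clean):", err_clean)
print("drift (tangled):", err_tangled)
```

On these inputs the two models agree exactly, so the output comparison cannot separate them. Under the same relative noise, the cancelling weights amplify the disturbance by a large factor, which is the kind of structural difference the paragraph above describes.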

The inverse experiment makes the same point from the other direction. When you train transformers with deliberately sparse weights, you force modular structure into existence, and then ablation (knocking out specific circuits) can confirm that a particular circuit is actually necessary and sufficient for a task (see "Can sparse weight training make neural networks interpretable by design?"). That necessary-and-sufficient claim is something no accuracy number can establish; you can only earn it by removing pieces and watching what breaks. Benchmarks tell you the model works; ablation tells you which parts do the work and whether the rest is dead weight.
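
The necessity and sufficiency checks can be sketched on a toy network. Everything here is hypothetical: a four-unit ReLU network hand-built to compute |x|, a candidate circuit made of units 0 and 1, and zero-ablation as the intervention (the cited work may ablate differently, for example by replacing activations with their mean).

```python
def relu(v):
    return v if v > 0.0 else 0.0

# Hand-built toy network: units 0 and 1 compute |x|, units 2 and 3 are dead weight.
IN_W = [1.0, -1.0, 0.5, 2.0]
OUT_W = [1.0, 1.0, 0.0, 0.0]

def forward(x, keep):
    # keep: indices of hidden units left intact; every other unit is zero-ablated.
    return sum(OUT_W[i] * relu(IN_W[i] * x) for i in range(len(IN_W)) if i in keep)

def accuracy(keep, inputs, tol=1e-9):
    # Fraction of inputs on which the (possibly ablated) network still outputs |x|.
    hits = sum(abs(forward(x, keep) - abs(x)) <= tol for x in inputs)
    return hits / len(inputs)

ALL_UNITS = set(range(len(IN_W)))
CIRCUIT = {0, 1}  # candidate circuit under test
inputs = [-3.0, -1.5, -0.25, 0.25, 1.0, 2.0]

baseline = accuracy(ALL_UNITS, inputs)             # full network
necessity = accuracy(ALL_UNITS - CIRCUIT, inputs)  # circuit removed, rest intact
sufficiency = accuracy(CIRCUIT, inputs)            # circuit intact, rest removed

print("baseline:", baseline)
print("circuit ablated:", necessity)
print("circuit alone:", sufficiency)
```

The baseline score alone says only that the network works. The circuit is necessary if the score collapses when it is removed, and sufficient if the score holds when everything else is removed; both facts come from intervening on the network, not from the benchmark number.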

This connects to a broader corpus theme: benchmarks are quietly selective about what they measure. Standard NLP evaluations filter out the very examples where human annotators disagree, hiding a gap as large as 32% vs. 90% accuracy on ambiguous cases (see "Do standard NLP benchmarks hide LLM ambiguity failures?"). Benchmark gains can also reflect memorization of contaminated data rather than genuine capability: RLVR research separates real behavioral activation from leaderboard movement, showing the two can coexist without either confirming the other (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). In both cases the score is measuring the wrong thing, or measuring the right thing for the wrong reason.

The takeaway: 'understanding' and 'getting the right answer' are not the same axis, and they require different instruments. Output benchmarks measure behavior on a curated, undisturbed test set. Structural stress tests (weight perturbation, ablation, enforced sparsity, and distribution shift) measure whether the internal machinery is robust, modular, and actually responsible for the behavior. A model that aces every test can still be, structurally, an imposter (see "Can AI pass every test while understanding nothing?"), and the only way to catch it is to stop trusting the score and start disturbing the weights.

Sources 5 notes

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do standard NLP benchmarks hide LLM ambiguity failures?

By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
