How does activation consistency training differ from output-level consistency?
This explores two ways to teach a model to stay consistent when you wrap or perturb a prompt — one that matches what the model says (output-level) versus one that matches what happens inside the model (activation-level).
The core distinction comes from a single line of work on consistency training (see 'Can models learn to ignore irrelevant prompt changes?'), which describes two methods. Bias-augmented Consistency Training (BCT) works at the output level: it takes the model's own answer to a clean prompt and trains the model to give that same answer when the prompt is wrapped in distracting or biasing text. Activation Consistency Training (ACT) goes deeper: instead of matching the final words, it matches the model's internal hidden states between the clean and wrapped versions. Both use the model's own clean behavior as the target, which sidesteps the staleness problems of standard supervised fine-tuning, where the target is a fixed dataset that may no longer reflect the intended behavior (specification staleness) or what the model can now do (capability staleness).
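The two objectives can be contrasted in a few lines. This is a minimal sketch, not the method as implemented in the cited work: the source does not state the exact loss functions, so the cross-entropy form for BCT, the mean-squared-error form for ACT, the function names `bct_loss` and `act_loss`, and all toy numbers are assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bct_loss(wrapped_logits, clean_token_ids):
    # Output-level (assumed form): average cross-entropy of the model's
    # predictions on the WRAPPED prompt against the tokens the model
    # itself produced on the CLEAN prompt.
    nll = 0.0
    for step_logits, target in zip(wrapped_logits, clean_token_ids):
        nll -= math.log(softmax(step_logits)[target])
    return nll / len(clean_token_ids)

def act_loss(wrapped_hidden, clean_hidden):
    # Activation-level (assumed form): mean squared difference between
    # hidden states at matched positions. The clean states are treated
    # as a fixed target.
    total, count = 0.0, 0
    for w_vec, c_vec in zip(wrapped_hidden, clean_hidden):
        for w, c in zip(w_vec, c_vec):
            total += (w - c) ** 2
            count += 1
    return total / count

# Hypothetical toy values: 2 response positions, vocabulary of 3, hidden size 2.
clean_token_ids = [2, 0]
wrapped_logits = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1]]
clean_hidden = [[1.0, 0.0], [0.5, -0.5]]
wrapped_hidden = [[1.0, 0.2], [0.1, -0.5]]

bct = bct_loss(wrapped_logits, clean_token_ids)
act = act_loss(wrapped_hidden, clean_hidden)
print("BCT loss:", bct)
print("ACT loss:", act)
```

In this framing the only difference is what gets compared: BCT compares predicted tokens, ACT compares the vectors that precede them. Which layers and token positions to compare is a further design choice the sketch leaves open.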
Why would you reach inside the model rather than just fix the output? Because identical outputs can hide very different internal machinery. One striking result shows that two networks can produce exactly the same answers while having radically different internal representations, in one case 'fractured, entangled' ones, and that the broken internals only reveal themselves when you perturb the weights or ask for transfer to a new context (see 'Can identical outputs hide broken internal representations?'). That's the gap output-level training can't see: BCT can make the surface answer match while the model arrives there by a wobblier internal route. ACT's bet is that aligning the activations produces a more robust kind of invariance, not just a cosmetic one.
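A hand-built toy makes the point concrete. The two functions below are hypothetical and are not the networks from the cited result; they are only an assumed example in which an output-level comparison reports no difference while an activation-level comparison does.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def net_a(x):
    # Hidden units encode the positive and negative parts of x.
    hidden = relu([x, -x])
    return hidden, hidden[0] - hidden[1]

def net_b(x):
    # Same input-output map, but the hidden units are shifted by 10
    # and the readout subtracts the shift again.
    hidden = relu([x + 10.0, -x - 10.0])
    return hidden, hidden[0] - hidden[1] - 10.0

inputs = [-3.0, -1.0, 0.0, 2.0, 5.0]

# Output-level view: largest difference between the two answers.
output_gap = max(abs(net_a(x)[1] - net_b(x)[1]) for x in inputs)

# Activation-level view: mean squared difference between hidden states.
squared = [
    (ha - hb) ** 2
    for x in inputs
    for ha, hb in zip(net_a(x)[0], net_b(x)[0])
]
hidden_gap = sum(squared) / len(squared)

print("output gap:", output_gap)
print("hidden gap:", hidden_gap)
```

A check that looks only at outputs would treat `net_a` and `net_b` as interchangeable; a check on hidden states would not.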
This matters because activations aren't static: they shift in systematic ways with the input. Models grow denser activations for familiar data and default to sparse ones for unfamiliar inputs (see 'Is representational sparsity learned or intrinsic to neural networks?'), and they actively sparsify their hidden states when a task gets harder or drifts out of distribution, as a kind of stabilizing filter (see 'Do language models sparsify their activations under difficult tasks?'). A wrapped or adversarial prompt can look 'unfamiliar' to the model and trigger exactly these internal shifts, so an output-only fix leaves that internal volatility untouched, while an activation-level fix targets it directly.
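One way to make such a shift measurable is to compare the fraction of near-zero activations on the clean and wrapped versions of a prompt. The sketch below is an assumption about how that could be done: the `sparsity` helper, the `eps` threshold, and the activation values are all hypothetical and are not measurements from any model.

```python
def sparsity(hidden, eps=1e-3):
    # Fraction of activations whose magnitude is at most eps.
    flat = [abs(v) for vec in hidden for v in vec]
    return sum(1 for v in flat if v <= eps) / len(flat)

# Made-up hidden states for one token position, hidden size 4.
clean_states = [[0.8, -0.4, 0.3, 0.6]]
wrapped_states = [[0.9, 0.0, 0.0, 0.0]]

clean_sparsity = sparsity(clean_states)
wrapped_sparsity = sparsity(wrapped_states)
shift = wrapped_sparsity - clean_sparsity

print("clean sparsity:", clean_sparsity)
print("wrapped sparsity:", wrapped_sparsity)
print("shift:", shift)
```

An output-level objective never sees this quantity; an activation-level objective penalizes the underlying difference in hidden states that produces it.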
The theme running underneath all of this is that consistency at the surface is not the same as consistency underneath. A model can return the identical output every time and still be drawing one unreliable sample from its distribution (see 'Does setting temperature to zero actually make LLM outputs reliable?'). The same caution applies here: matching outputs (BCT) buys you the cheaper, more interpretable form of consistency, while matching activations (ACT) reaches for the deeper form, at the cost of training against internal states that are noisier and less directly tied to the answer you actually care about. If you want to go further, the contrast between these two methods is the cleanest place to see why 'same answer' and 'same reasoning' are different engineering targets.
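The gap between repeatability and reliability can be illustrated with a single next-token distribution. This is a sketch under assumed numbers: the probabilities are invented, and greedy selection stands in for zero-temperature decoding.

```python
import random

def greedy(probs):
    # Zero-temperature stand-in: always pick the most probable option.
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    # Temperature-1 stand-in: draw in proportion to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical distribution over three candidate answers.
probs = [0.40, 0.35, 0.25]

greedy_runs = [greedy(probs) for _ in range(100)]
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_runs = [sample(probs, rng) for _ in range(100)]

greedy_agreement = greedy_runs.count(greedy_runs[0]) / len(greedy_runs)
mass_on_greedy_answer = probs[greedy_runs[0]]

print("greedy agreement:", greedy_agreement)
print("probability mass behind that answer:", mass_on_greedy_answer)
print("distinct sampled answers:", len(set(sampled_runs)))
```

Greedy decoding agrees with itself on every run, yet the answer it repeats carries well under half of the probability mass in this made-up case.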
Sources (5 notes)
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.