What is the behavioral signature of a model tracking input surprise?
This explores the observable behaviors that reveal whether a model is registering how unexpected its input is. The corpus suggests surprise shows up in two opposite ways: output entropy that collapses when inputs are familiar, and processing that bloats when inputs are anomalous.
The corpus doesn't have a paper titled 'surprise tracking,' but several of its findings, read together, sketch what that signature looks like, and they point in two directions at once.
The clearest tell is entropy. When a post-trained model is fed its own prior outputs (inputs it 'expected' because it generated the trajectory), its output entropy is 3-4x lower than on off-policy text it didn't produce (see 'Do models recognize their own outputs as actions shaping future inputs?'). That entropy gap is itself a surprise meter: low entropy on familiar trajectories, higher entropy on unfamiliar ones. The model behaves as if it knows the difference between input it caused and input it merely received. A related convergence effect shows up in training, where RL collapses onto a single dominant format within the first epoch and suppresses the alternatives (see 'Does RL training collapse format diversity in pretrained models?'). That narrowing is a behavioral signature of a model that has locked onto an 'expected' shape and treats deviations as low-probability.
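As a rough illustration of the entropy comparison described above, here is a minimal sketch that averages the Shannon entropy of next-token distributions over a continuation and takes the off-policy to on-policy ratio. The function names and the toy distributions are hypothetical and made up for illustration; in practice the distributions would come from a model's softmax outputs, and nothing here reproduces the measurement protocol of the cited note.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over all positions of one continuation."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_ratio(
    off_policy: Sequence[Sequence[float]],
    on_policy: Sequence[Sequence[float]],
) -> float:
    """How many times higher the mean entropy is on off-policy input."""
    return mean_entropy(off_policy) / mean_entropy(on_policy)


# Toy, invented distributions over a 4-token vocabulary (assumption):
# sharply peaked when the model continues its own trajectory,
# flatter when it continues text it did not produce.
on_policy = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
off_policy = [[0.70, 0.10, 0.10, 0.10], [0.55, 0.15, 0.15, 0.15]]

ratio = entropy_ratio(off_policy, on_policy)
print(f"off-policy / on-policy mean entropy: {ratio:.2f}x")
```

A ratio well above 1 is the kind of gap the finding describes; the specific value printed here reflects only the toy numbers.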
The inverse signature is what happens under genuine surprise, and here the behavior is failure, not graceful adjustment. Append a semantically irrelevant sentence to a math problem and reasoning models don't ignore it: error rates rise by roughly 300% and responses get *longer* (see 'How vulnerable are reasoning models to irrelevant text?'). That length inflation is the most concrete behavioral fingerprint in the corpus: a model thrown by unexpected input churns through more tokens, as though the surprise consumes processing it can't resolve. Surprise registers as visible thrashing.
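To make the two measurements in this paragraph concrete, the sketch below computes a relative error-rate increase and a length-inflation ratio from paired clean and perturbed runs. The `Run` record, the function names, and the data are all hypothetical. It also assumes that "roughly 300%" means a relative increase over the clean baseline (that is, about four times the baseline error rate), which is one reading of the figure and not something the note confirms.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    """One model response to one problem (hypothetical record type)."""
    correct: bool
    n_tokens: int


def error_rate(runs: Sequence[Run]) -> float:
    """Fraction of responses that were wrong."""
    return sum(not r.correct for r in runs) / len(runs)


def mean_length(runs: Sequence[Run]) -> float:
    """Average response length in tokens."""
    return sum(r.n_tokens for r in runs) / len(runs)


def relative_error_increase(clean: Sequence[Run], perturbed: Sequence[Run]) -> float:
    """(perturbed - clean) / clean; 3.0 corresponds to a 300% increase."""
    base = error_rate(clean)
    pert = error_rate(perturbed)
    if base == 0.0:
        # No baseline errors: the relative increase is undefined, so
        # report infinity if any errors appeared and 0.0 otherwise.
        return math.inf if pert > 0.0 else 0.0
    return (pert - base) / base


def length_inflation(clean: Sequence[Run], perturbed: Sequence[Run]) -> float:
    """Ratio of mean response length with vs. without the appended sentence."""
    return mean_length(perturbed) / mean_length(clean)


# Invented data (assumption): 10% errors on clean prompts, 40% after an
# irrelevant sentence is appended, with responses 1.5x as long.
clean = [Run(True, 100)] * 9 + [Run(False, 100)]
perturbed = [Run(True, 150)] * 6 + [Run(False, 150)] * 4

increase = relative_error_increase(clean, perturbed)
inflation = length_inflation(clean, perturbed)
print(f"error rate increase: {increase:.0%}, length inflation: {inflation:.2f}x")
```

Tracking both numbers together matters because either one alone is ambiguous: longer responses with unchanged accuracy could simply be more careful reasoning, while the combination of more tokens and more errors is the thrashing pattern described here.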
There's also an *internal* version of the signature, and it's the most counterintuitive thread. DPO training builds a two-stage detection circuit that flags injected steering vectors: the model can internally register 'this activation is anomalous' with near-perfect accuracy (see 'How do language models detect injected steering vectors internally?'). So a surprise-tracking signal demonstrably exists inside the weights. The twist: safety training actively suppresses it, dropping detection from 63.8% to 10.8%. The capacity to notice anomalous input is real, but it can be trained out.
The less obvious takeaway: the behavioral signature of tracking surprise and the behavioral signature of *reporting* it come apart. Models can hold an internal anomaly signal while their outward behavior stays smooth and confident, and reflection mechanisms that should catch the mismatch rarely fire, mostly confirming the first answer rather than revising it under surprise (see 'Can we actually trust reasoning model outputs?'). So if you're hunting for surprise in a model's behavior, watch the entropy and the token count, not the model's own commentary. Self-report is the channel most likely to have been tuned to stay quiet.
Sources (5 notes)
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.
Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.