Does latent density emerge during pretraining from training data familiarity?

This explores whether the dense vs. sparse activation patterns inside a model are something it *learns* during pretraining based on how familiar the data is — rather than being a fixed property of the architecture.

The corpus answers this fairly directly: yes. The clearest evidence is that neural networks develop dense activations for data they've seen a lot of and fall back to sparse representations for unfamiliar inputs, and this split emerges purely from pretraining exposure, before any task-specific fine-tuning (Is representational sparsity learned or intrinsic to neural networks?). Density isn't baked into the network; it's a trace of what the model got comfortable with.

What makes this interesting is how many *other* behaviors turn out to be governed by the same familiarity logic. The strength of a concept's 'priming' after a few gradient updates is predictable from how probable its keywords were *before* learning, with a sharp threshold (around 10^-3 in keyword probability) separating contexts where priming occurs from those where it doesn't, and just three exposures are enough to establish it (Can we predict keyword priming before learning happens?). Hallucination risk follows the same shape from the other side: models go wrong not when they're under-confident, but when they hit entity *combinations* they never saw co-occur in training, so pretraining co-occurrence statistics can flag failures that the model's own confidence misses (Can pretraining data statistics detect hallucinations better than model confidence?). Familiar territory → dense, confident, primed; unfamiliar territory → sparse, brittle, hallucination-prone. It's the same gradient seen through different instruments.

The deeper claim across these notes is that pretraining is where the durable structure gets *planted*, and later training only selects or nudges it. Cognitive biases, for instance, are causally traced to the pretrained backbone: models sharing a backbone show similar biases regardless of what instruction data you fine-tune on, and fine-tuning only modulates them (Where do cognitive biases in language models come from?). Reasoning ability tells the same story: five independent methods all *elicit* reasoning already latent in base-model activations rather than installing it, so the bottleneck is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). Even RL post-training mostly amplifies one format distribution that pretraining already favored while suppressing the alternatives (Does RL training collapse format diversity in pretrained models?). Density-from-familiarity is one instance of a broader pattern: the model's character is laid down by exposure, and downstream training is a selector on top of it.

There's a useful side note about *why* familiarity gets compressed into structure so readily. Latent-level prediction recovers compositional hierarchy with a number of samples that stays constant in hierarchy depth, where token-level prediction needs exponentially many, because representations at the same level are far more correlated than raw tokens (Why is predicting latents more sample-efficient than tokens?). That hints at why familiar data consolidates into dense, reusable internal structure rather than staying diffuse. And one architecture builds iterative computation directly into latent space during pretraining, suggesting density isn't just an accident of exposure but something you can deliberately shape at the pretraining stage (Can reasoning happen in latent space during pretraining?).

The thing you didn't know you wanted to know: this implies a practical asymmetry in how you'd intervene on a model. If density, biases, priming, and hallucination-proneness are all written during pretraining, then trying to fix them by fine-tuning is fighting the wrong layer. That is exactly why decoding-time approaches that leave base weights untouched preserve knowledge better than direct fine-tuning, which corrupts the knowledge stored in lower layers (Can decoding-time tuning preserve knowledge better than weight fine-tuning?).

Sources 9 notes

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
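To make "dense" versus "sparse" concrete, here is a minimal sketch of one possible density measure: the fraction of units in a layer whose activation magnitude exceeds a small cutoff. The function name `activation_density`, the cutoff `eps`, and the toy vectors are all assumptions for illustration; the note does not say how density was actually measured.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Return the fraction of units with |activation| > eps.

    Hypothetical measure: 1.0 means every unit is active (fully dense),
    0.0 means no unit is active (fully sparse).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if abs(a) > eps)
    return active / len(activations)


# Made-up post-ReLU activation vectors, not real measurements.
familiar_input = [0.8, 0.3, 0.0, 1.2, 0.5, 0.9, 0.1, 0.4]
unfamiliar_input = [0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.2, 0.0]

familiar_density = activation_density(familiar_input)      # 7 of 8 units active
unfamiliar_density = activation_density(unfamiliar_input)  # 2 of 8 units active
```

Under this toy definition, the claim in the note would show up as a higher score for inputs resembling the training data than for out-of-distribution inputs.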

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
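The data-side trigger can be sketched roughly as follows. This is not QuCo-RAG's implementation: the helper names `build_cooccurrence` and `should_retrieve`, the pairwise rule, and the toy corpus are assumptions chosen only to illustrate "retrieve when the query combines entities never seen together in training".

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def build_cooccurrence(docs: Iterable[Set[str]]) -> Set[FrozenSet[str]]:
    """Collect every unordered entity pair that appears together in a document."""
    seen: Set[FrozenSet[str]] = set()
    for entities in docs:
        for a, b in combinations(sorted(entities), 2):
            seen.add(frozenset((a, b)))
    return seen


def should_retrieve(query_entities: Set[str], seen_pairs: Set[FrozenSet[str]]) -> bool:
    """Trigger retrieval if any entity pair in the query never co-occurred in training.

    The model's own confidence is deliberately not consulted here.
    """
    pairs = combinations(sorted(query_entities), 2)
    return any(frozenset(pair) not in seen_pairs for pair in pairs)


# Toy "training corpus", already reduced to the entity set of each document.
training_docs = [
    {"Marie Curie", "radium"},
    {"Marie Curie", "Sorbonne"},
    {"Niels Bohr", "Copenhagen"},
]
seen_pairs = build_cooccurrence(training_docs)

familiar_query = should_retrieve({"Marie Curie", "radium"}, seen_pairs)    # pair was seen
novel_query = should_retrieve({"Marie Curie", "Copenhagen"}, seen_pairs)   # pair never seen
```

A real system would need entity extraction and an index over the pretraining corpus; the sketch skips both and starts from entity sets.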

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of fine-tuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Can reasoning happen in latent space during pretraining?

Ouro models achieve 2–3× efficiency gains by performing iterative reasoning in latent space during pretraining, not through extra capacity. Their intermediate predictions align faithfully with final outputs, making latent traces more honest than explicit chain-of-thought reasoning.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
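A minimal sketch of the decoding-time idea, assuming the usual description of proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart, and the base weights are never updated. The note itself does not spell out this formula, so treat the arithmetic, the function names, and the three-token toy logits as assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base: List[float], expert: List[float], anti_expert: List[float]
) -> List[float]:
    """Shift base logits by (expert - anti_expert), then normalize.

    base:        logits from the large untuned model (weights untouched)
    expert:      logits from a small tuned model
    anti_expert: logits from the same small model before tuning
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must share one vocabulary size")
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)


# Toy three-token vocabulary with made-up logits.
base_logits = [2.0, 1.0, 0.0]
expert_logits = [0.0, 2.0, 0.0]
anti_expert_logits = [0.0, 0.0, 0.0]

probs = proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits)
top_token = max(range(len(probs)), key=probs.__getitem__)
```

In the toy case the base model alone prefers token 0, while the shifted distribution prefers token 1; if the small model's tuning changed nothing (expert equals anti-expert), the output reduces to the base distribution.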
