Does representational density emerge from training data exposure during pretraining?

This explores whether the 'density' of a model's internal representations (how richly it activates for a given input) is built up through exposure to data during pretraining, rather than baked into the architecture from the start. The corpus answers directly: yes, density is learned through familiarity. Networks develop dense activations for inputs that resemble their training data and fall back to sparse representations for unfamiliar ones, and this split emerges during pretraining itself, before any task-specific tuning (see: Is representational sparsity learned or intrinsic to neural networks?). Density isn't a fixed property of the network; it's a fingerprint of exposure.
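To make 'density' concrete, here is a minimal sketch of one way it could be measured: the fraction of units in an activation vector whose magnitude exceeds a small cutoff. The function name `activation_density`, the cutoff `eps`, and the two toy vectors are illustrative assumptions, not the metric or data used in the underlying work.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Return the fraction of units whose absolute activation exceeds eps.

    Hypothetical metric for illustration only; the cutoff eps is an assumption.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if abs(a) > eps)
    return active / len(activations)


# Made-up activation vectors standing in for a familiar and an unfamiliar input.
familiar = [0.8, -0.3, 0.5, 0.0, 1.2, 0.4]
unfamiliar = [0.0, 0.0, 0.9, 0.0, 0.0, 0.0]

print(activation_density(familiar))
print(activation_density(unfamiliar))
```

Under this toy definition the familiar vector scores higher than the unfamiliar one, which is the qualitative pattern the note describes. A real analysis would read activations from a chosen layer of an actual model and would need a principled choice of cutoff.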

What makes this interesting is how cleanly it rhymes with a whole cluster of findings about pretraining being the decisive formative stage. Cognitive biases turn out to work the same way: models that share a pretrained backbone show similar bias patterns regardless of what finetuning data they later see, so the biases are planted in pretraining and only nudged afterward (see: Where do cognitive biases in language models come from?). Even reasoning ability seems to be present in base-model activations already; post-training selects and elicits it rather than creating it (see: Do base models already contain hidden reasoning ability?). The recurring theme: the substance is laid down by exposure during pretraining, and later stages mostly steer what's already there.

The familiarity mechanism gets even more concrete when you look at how predictable it is. Whether a keyword gets 'primed' after learning is strongly predictable from its probability *before* learning, with a sharp threshold around 10^-3 and as few as three exposures enough to lock the effect in (see: Can we predict keyword priming before learning happens?). That's the same story as representational density at a finer grain: prior exposure determines how the model lights up. And it has practical teeth: you can read the statistics of pretraining data to predict failure. Entity co-occurrence patterns from training data flag hallucination risk even when the model itself is highly confident, because the root cause is unseen *combinations*: entities the model never saw together and so never densely encoded as a pair (see: Can pretraining data statistics detect hallucinations better than model confidence?).
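The co-occurrence idea can be sketched in a few lines: count how often each pair of entities appears in the same training document, then flag a query whose entities include a pair seen too rarely. This is a simplified illustration under stated assumptions, not the QuCo-RAG implementation. The helper names `build_cooccurrence` and `is_risky`, the `min_count` cutoff, and the placeholder corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def build_cooccurrence(docs: Iterable[Set[str]]) -> Counter:
    """Count, for each unordered entity pair, how many documents contain both."""
    counts: Counter = Counter()
    for entities in docs:
        # Sort so each unordered pair always maps to the same tuple key.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


def is_risky(counts: Counter, entities: Iterable[str], min_count: int = 1) -> bool:
    """Flag a query if any pair of its entities co-occurred fewer than min_count times."""
    pairs = combinations(sorted(set(entities)), 2)
    return any(counts[pair] < min_count for pair in pairs)


# Placeholder corpus: each document is reduced to the set of entities it mentions.
corpus = [
    {"entity_a", "entity_b"},
    {"entity_a", "entity_c", "entity_d"},
    {"entity_c", "entity_e"},
]
counts = build_cooccurrence(corpus)

print(is_risky(counts, ["entity_a", "entity_c"]))  # pair appears together in the corpus
print(is_risky(counts, ["entity_b", "entity_d"]))  # pair never appears together
```

The design point is that the check never consults the model's confidence. It relies only on data-side statistics, so it can fire even when the model sounds certain. In a real system a positive flag would trigger retrieval rather than just print a boolean.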

But exposure isn't monolithic: *what kind* of knowledge you absorb matters. Analysis of five million pretraining documents shows reasoning leans on broad, transferable procedural knowledge drawn from many sources, while factual recall depends on narrow, document-specific memorization (see: Does procedural knowledge drive reasoning more than factual retrieval?). So 'density from exposure' isn't one uniform dial; familiar procedures generalize, while familiar facts stay pinned to where they were seen. That's a useful corrective to a simple 'more data = denser everywhere' picture.

The flip side worth knowing: if density is learned, you can also damage it. Direct fine-tuning corrupts knowledge stored in lower layers, while decoding-time approaches that leave base weights untouched preserve that pretrained knowledge far better (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). In other words, the dense representations pretraining builds are an asset that aggressive post-training can erode, which is exactly why so much recent work tries to stay close to the base distribution rather than overwrite it.
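Proxy-tuning, the decoding-time method named in the notes below, is usually described as shifting the base model's next-token logits by the difference between a small tuned 'expert' and its untuned 'anti-expert' counterpart, then renormalizing. The sketch below shows that arithmetic on plain lists of logits, assuming all three models share one vocabulary. The function names and the toy numbers are illustrative assumptions, and no real model is loaded.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base: List[float], expert: List[float], anti_expert: List[float]
) -> List[float]:
    """Shift base logits by (expert - anti_expert) and renormalize.

    The base model's weights are never modified; only its output
    distribution is adjusted at decoding time.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)


# Toy three-token vocabulary with made-up logits.
base_logits = [2.0, 1.0, 0.0]
expert_logits = [0.0, 1.0, 0.0]
anti_expert_logits = [0.0, 0.0, 0.0]

print(proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits))
```

When the expert and anti-expert agree, the shift is zero and the base distribution comes back unchanged. That is the sense in which this family of methods steers the base model without overwriting what it learned in pretraining.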

Sources 7 notes

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question: does representational density—how richly a model activates for a given input—emerge from training data exposure during pretraining, or is it architecture-determined?

What a curated library found — and when (findings span 2024–2026; treat as dated claims, not current truth):
• Dense activations for familiar inputs and sparse ones for OOD inputs emerge during pretraining itself; density is a learned fingerprint of exposure, not architectural (2024–2025).
• Cognitive biases are planted in pretraining and only nudged by finetuning; reasoning ability already exists latent in base-model activations—post-training elicits rather than builds it (2025).
• Knowledge priming is predictable from pre-learning keyword probability, with a sharp threshold around 10^-3 and as few as three exposures locking the effect (2024–2025).
• Procedural knowledge in pretraining—drawn from many sources—drives reasoning generalization; factual knowledge depends on narrow, document-specific memorization (2025).
• Direct finetuning corrupts pretrained knowledge in lower layers; decoding-time methods preserve it better, suggesting dense representations built by pretraining are assets that post-training can erode (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (2024-11): Procedural Knowledge in Pretraining Drives Reasoning
• arXiv:2507.07186 (2025-07): Planted in Pretraining, Swayed by Finetuning—Cognitive Bias Origins
• arXiv:2603.03415 (2026-03): Farther the Shift, Sparser the Representation—OOD Mechanisms
• arXiv:2605.28388 (2026-05): Sample Difficulty in RLVR for LLMs

Your task:
(1) RE-TEST: For each claim above—especially the OOD sparsity mechanism, the inertia of biases post-pretraining, and the erosion thesis for decoding-time vs. direct tuning—judge whether scaling, new RL methods (e.g., synthetic data, preference learning, verifier-based RL), or newer evals have since relaxed or overturned it. Separate the durable question (does exposure during pretraining build representational structure?) from perishable constraints (e.g., does direct finetuning *always* corrupt knowledge?). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: does any recent paper argue density is *not* learned, or that post-training fundamentally restructures rather than steers pretrained representations?
(3) Propose 2 research questions that assume the regime may have moved: e.g., if continual learning and adaptive RL have matured, does density *adapt* after pretraining in ways the library missed? If weight-sparse circuits are more interpretable, does density even matter for mechanistic explanation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
