Can we detect and measure circuit formation before generalization emerges?
This explores whether the internal structure that powers generalization — the 'circuits' a model builds — leaves measurable traces that show up *before* the model visibly starts generalizing, so we could catch it forming rather than only noticing after the fact.
The corpus suggests a cautious yes: several lines of work report internal signals that move *ahead* of the behavioral jump, which is exactly what you'd need in order to watch a circuit assemble before the model's behavior actually improves.
The cleanest example is the memorization-to-generalization phase transition. Models appear to memorize until they hit a capacity ceiling of roughly 3.6 bits per parameter, and only then does 'grokking', the shift to genuine generalization, kick in (When do language models stop memorizing and start generalizing?). That capacity number is a *property you can measure on the model itself*, not just a description of its output, so the trigger point is in principle predictable rather than something you recognize only once accuracy suddenly climbs. A complementary study of multi-hop reasoning watches the circuit form in slow motion: transformers pass through three stages (memorization, in-distribution generalization, then cross-distribution reasoning), and successful reasoning shows up as a 'cosine clustering' signature in how entity representations organize geometrically (How do transformers learn to reason across multiple steps?). The tightening geometry is the early tell.
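Both handles in this paragraph can be turned into numbers. The sketch below is illustrative only: `capacity_bits` simply multiplies a parameter count by the approximate 3.6 bits-per-parameter figure, and `cosine_clustering_score` is a hypothetical way to quantify 'cosine clustering' (mean within-group cosine similarity minus mean between-group similarity). The function names and the exact score are assumptions, not the metric used in the cited notes.

```python
import numpy as np

BITS_PER_PARAM = 3.6  # approximate figure quoted in the text


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Rough memorization capacity, in bits, for a model with n_params parameters."""
    return n_params * bits_per_param


def cosine_clustering_score(reps: np.ndarray, labels: np.ndarray) -> float:
    """Mean within-group cosine similarity minus mean between-group similarity.

    reps:   (n, d) array of entity representations (assumed non-zero rows).
    labels: (n,) array assigning each representation to a group.
    Higher values mean same-group representations point in more similar directions.
    """
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = unit @ unit.T                          # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]     # True where two rows share a group
    off_diag = ~np.eye(len(labels), dtype=bool)   # exclude self-similarity
    within = sims[same & off_diag].mean()
    between = sims[~same].mean()
    return float(within - between)
```

Tracked across training checkpoints, a rising score would be one candidate signal that representations are organizing before accuracy moves; comparing `capacity_bits` against an estimate of the dataset's information content gives a rough sense of how close a model is to the ceiling.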
The same theme, internal structure as a *predictor* of generalization, recurs under different vocabulary. In compositional learning, whether the constituents of a task are *linearly decodable* from hidden activations reliably forecasts whether the model will generalize compositionally (Can neural networks learn compositional skills without symbolic mechanisms?). And pruning experiments show that networks quietly carve compositional tasks into isolated modular subnetworks, with pretraining making that modular structure more consistent: circuit-like organization you can probe directly via ablation (Do neural networks naturally learn modular compositional structure?). Read together, these say the substrate of generalization is observable, and observable *early*: clustering geometry, linear decodability, and module isolation all precede or accompany the capability rather than merely trailing it.
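Linear decodability is usually tested with a linear probe: fit a linear readout from hidden activations to the constituent labels and check how well it does on held-out examples. The sketch below is a minimal version under stated assumptions: it uses ridge-regularized least squares onto one-hot targets rather than whatever probe the cited work used, and `linear_probe_accuracy` and its parameters are hypothetical names.

```python
import numpy as np


def linear_probe_accuracy(
    train_x: np.ndarray,
    train_y: np.ndarray,
    test_x: np.ndarray,
    test_y: np.ndarray,
    ridge: float = 1e-3,
) -> float:
    """Fit a ridge-regularized linear readout and return held-out accuracy.

    train_x / test_x: (n, d) hidden activations.
    train_y / test_y: (n,) integer labels for the constituent being probed.
    """
    classes = np.unique(train_y)
    # One-hot regression targets, one column per class.
    targets = (train_y[:, None] == classes[None, :]).astype(float)
    # Append a constant column so the readout has a bias term.
    xb = np.hstack([train_x, np.ones((len(train_x), 1))])
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    weights = np.linalg.solve(gram, xb.T @ targets)
    tb = np.hstack([test_x, np.ones((len(test_x), 1))])
    preds = classes[np.argmax(tb @ weights, axis=1)]
    return float((preds == test_y).mean())
```

High held-out accuracy says the constituent is linearly readable from that layer. A sensible control is to rerun the probe with shuffled labels, which should fall to roughly chance if the probe is not simply memorizing.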
There's a sharper cross-domain lesson here too. Base models often already contain the circuitry for a capability before any sign of it appears in behavior: five independent methods all *elicit* reasoning that was latent in base-model activations, suggesting post-training selects pre-existing circuits rather than building new ones (Do base models already contain hidden reasoning ability?). That reframes the whole question: sometimes 'circuit formation before generalization' isn't a slow build at all, but a dormant structure waiting for the right probe. The flip side is a warning about measurement itself. The apparent exploration-exploitation trade-off in RL turns out to be an artifact of measuring at the token level; hidden-state metrics like Effective Rank tell a completely different story (Is the exploration-exploitation trade-off actually fundamental?).
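For readers unfamiliar with Effective Rank, here is a minimal sketch of one common definition: the exponential of the Shannon entropy of the normalized singular values of a hidden-state matrix. The notes do not say exactly how the cited work computes it, so treat this entropy-based form, the function name, and the `eps` threshold as assumptions.

```python
import numpy as np


def effective_rank(hidden: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens, features) hidden-state matrix.

    Computes exp(H(p)), where p is the singular-value spectrum normalized to
    sum to one. A matrix with k equal non-zero singular values scores about k.
    """
    s = np.linalg.svd(hidden, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

Unlike the integer matrix rank, this value changes smoothly as the spectrum spreads out or concentrates, which is what makes it usable as a training-time signal on hidden states rather than on output tokens.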
That warning matters because internal structure is slippery: identical behavior can hide radically different internal mechanisms, and pushing one metric up can quietly degrade another (What actually happens inside a language model?). So the honest synthesis is that we *can* detect and measure circuit formation early, since bits-per-parameter ceilings, clustering signatures, decodability, and module isolation are all real handles. But the metric you choose determines what you see, and the same generalization can arrive through different circuits. If you want to go deeper, the grokking-capacity and three-stage multi-hop notes are the most direct doorways; the internals note is the one that keeps you honest about how much any single measurement can claim.
Sources (7 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing that the trade-off emerges only at the token level. VERL demonstrates that both can be enhanced simultaneously, achieving 21.4% accuracy gains on Gaokao 2024.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.