INQUIRING LINE

Does grokking in modular arithmetic follow the same three-phase learning trajectory?

This explores whether the classic grokking story from modular arithmetic — train accuracy saturates, then generalization snaps in later — shows up as a clean three-phase arc, and what the corpus says about the shape of that memorize-then-generalize transition.


This reads the question as being about the *shape* of grokking — is it really a tidy three-phase trajectory? — rather than about modular arithmetic specifically, which the corpus doesn't cover head-on. What the collection does have is several independent attempts to count the phases of memorize-then-generalize learning, and they disagree in an interesting way.

The cleanest grokking result here frames it as a *two-state* phase transition, not three. Models memorize until they hit a measurable capacity ceiling, about 3.6 bits per parameter, and only once that storage fills does the shift to genuine generalization kick in ("When do language models stop memorizing and start generalizing?"). On that account grokking isn't a gradual trajectory at all; it's a threshold you cross when memorization stops being a viable strategy. A related finding from RLVR shows the tail end of this directly: a model can keep improving its test accuracy for 1,400 steps *after* training accuracy has already hit 100% ("Can a single training example unlock mathematical reasoning?"). That is the signature post-saturation gap that makes grokking look like delayed understanding.
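To make the capacity ceiling concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter figure quoted above; the function names and the `dataset_bits` input are hypothetical, and how to measure a dataset's information content is left open because the notes do not specify it.

```python
# Assumption: ~3.6 bits of memorization capacity per parameter,
# the figure quoted in the text. Everything else here is illustrative.
BITS_PER_PARAM = 3.6


def capacity_bits(num_params: int) -> float:
    """Approximate memorization capacity in bits under the assumed rate."""
    return BITS_PER_PARAM * num_params


def exceeds_capacity(num_params: int, dataset_bits: float) -> bool:
    """True when the dataset carries more bits than the model could store.

    `dataset_bits` is a hypothetical input: an estimate of the information
    content of the training set, however one chooses to measure it.
    """
    return dataset_bits > capacity_bits(num_params)


if __name__ == "__main__":
    params = 1_000_000
    cap = capacity_bits(params)
    print(f"{params:,} parameters -> about {cap:,.0f} bits of capacity")
    print("4M-bit dataset exceeds capacity:", exceeds_capacity(params, 4_000_000))
```

On the two-state reading, the interesting regime is the one where `exceeds_capacity` returns true: pure memorization can no longer fit the data, so something else has to take over.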

Where a genuine three-phase structure does appear is in transformers learning multi-hop reasoning: memorization, then in-distribution generalization, then cross-distribution reasoning, with the jump to true reasoning marked by entity representations clustering together in the model's internal space ("How do transformers learn to reason across multiple steps?"). That's the closest the corpus comes to validating a three-phase grokking arc, but notice it's three phases because the *task* has a compositional second hop, not because grokking inherently has three stages. The RL literature, meanwhile, counts *two* phases: master execution, then master strategy ("Does RL training follow a predictable two-phase learning sequence?"). So the number of phases tracks the task's structure, not a universal law of learning.
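One hedged way to count phases in practice is to log separate accuracy curves and record when each one first saturates. The sketch below assumes three hypothetical curves (train, in-distribution test, cross-distribution test) and an arbitrary 0.99 threshold; the numbers in the usage example are toy values made up for illustration, not results from any of the cited notes.

```python
from typing import Dict, Optional, Sequence


def first_step_at_or_above(
    steps: Sequence[int], acc: Sequence[float], threshold: float = 0.99
) -> Optional[int]:
    """Return the first logged step whose accuracy reaches the threshold."""
    for step, value in zip(steps, acc):
        if value >= threshold:
            return step
    return None  # the curve never saturated in the logged window


def phase_onsets(
    steps: Sequence[int],
    curves: Dict[str, Sequence[float]],
    threshold: float = 0.99,
) -> Dict[str, Optional[int]]:
    """Map each named curve to the step at which it first saturates."""
    return {
        name: first_step_at_or_above(steps, acc, threshold)
        for name, acc in curves.items()
    }


if __name__ == "__main__":
    # Toy curves, invented for illustration only.
    steps = [0, 100, 200, 300, 400, 500]
    curves = {
        "train": [0.1, 1.0, 1.0, 1.0, 1.0, 1.0],
        "in_dist_test": [0.1, 0.2, 0.3, 1.0, 1.0, 1.0],
        "cross_dist_test": [0.1, 0.1, 0.1, 0.2, 0.5, 1.0],
    }
    print(phase_onsets(steps, curves))
```

Distinct onset steps suggest distinct phases; a task with no compositional second hop would simply have no separate cross-distribution curve to saturate. The gap between the train onset and a test onset is the post-saturation delay associated with grokking.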

The more unsettling thread: some of what looks like grokking may not be real generalization. Transformers often pass in-distribution tests by memorizing computation subgraphs and then collapse on novel compositions ("Do transformers actually learn systematic compositional reasoning?"). Local token-level memorization, meaning prediction from the immediately preceding tokens, accounts for up to 67% of reasoning errors and gets worse exactly when the problem shifts away from the training distribution ("Where do memorization errors arise in chain-of-thought reasoning?"). Modular arithmetic is the canonical grokking demo precisely because its clean algebraic structure lets you *prove* the model found the general rule. For messier tasks, a confident post-saturation accuracy curve might be the model getting better at subgraph matching rather than grokking the underlying function.
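The point about modular arithmetic can be sketched directly: the full input table is finite, so a learned rule can be checked on every held-out pair. The example below assumes modular addition as the operation and uses two stand-in predictors in place of a trained network: a lookup table (pure memorization) and the true rule. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def split_pairs(p: int, train_frac: float = 0.5, seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle every (a, b) pair with 0 <= a, b < p and split into train/held-out."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]


def exact_accuracy(predict: Callable[[int, int], int], pairs: List[Pair], p: int) -> float:
    """Fraction of pairs where the prediction equals (a + b) mod p."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for a, b in pairs if predict(a, b) == (a + b) % p)
    return correct / len(pairs)


def make_memorizer(train: List[Pair], p: int) -> Callable[[int, int], int]:
    """Stand-in for a memorizing model: answers only pairs it has seen."""
    table = {(a, b): (a + b) % p for a, b in train}
    return lambda a, b: table.get((a, b), -1)  # -1 marks an unseen pair


if __name__ == "__main__":
    p = 7
    train, held_out = split_pairs(p)
    memorizer = make_memorizer(train, p)
    rule = lambda a, b: (a + b) % p  # stand-in for a model that found the rule
    print("memorizer train/held-out:",
          exact_accuracy(memorizer, train, p), exact_accuracy(memorizer, held_out, p))
    print("rule      train/held-out:",
          exact_accuracy(rule, train, p), exact_accuracy(rule, held_out, p))
```

Both predictors are perfect on the training pairs, so training accuracy alone cannot tell them apart; only the exhaustive held-out check does. Messier tasks have no such complete table, which is why the same curve is harder to interpret there.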

The thing worth taking away: 'how many phases' is the wrong question. The corpus suggests the real variable is whether the task has a capacity ceiling that forces memorization to fail (two-state transition) or a compositional layer that has to be learned separately on top (extra phase) — and whether the late-stage 'generalization' you observe is the genuine article or memorization wearing a disguise.


Sources (6 notes)

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on the immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanist analyzing whether grokking follows a universal learning trajectory. The question: does grokking in modular arithmetic (or any task) really exhibit a fixed three-phase shape—or does phase structure vary by task and memorization pressure?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A curated library reports:
• Grokking may be a *two-state threshold*, not a gradual transition: memorization saturates at ~3.6 bits/parameter, then generalization kicks in (2024–2025).
• Three-phase arcs (memorization → in-distribution generalization → cross-distribution reasoning) appear *only* when the task itself is compositional with multi-hop structure; the number of phases tracks task architecture, not an intrinsic grokking law (2025).
• Post-saturation accuracy gains often reflect subgraph memorization or token-level prediction, not genuine compositional generalization; token-level local memorization accounts for up to 67% of reasoning errors and worsens on distribution shift (2025).
• Transformers pass in-distribution tests via memorized computation subgraphs, then collapse on novel compositions (2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2505.24832 (2025): How much do language models memorize?
• arXiv:2505.23653 (2025): How do Transformers Learn Implicit Reasoning?
• arXiv:2508.02037 (2025): Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
• arXiv:2305.18654 (2023): Faith and Fate: Limits of Transformers on Compositionality

Your task:
(1) RE-TEST EACH CONSTRAINT. For the two-state threshold claim: has newer scaling, architecture, or training (e.g., sparse models, continual learning, or RL fine-tuning post-2025) made the transition gradual or multi-phase even in 'simple' tasks? For subgraph/token-memorization: do mechanistic interpretability tools (SAE, activation steering, probes) now isolate *whether* late-stage gains are genuine compositional learning or disguised memorization? Flag which constraints still hold and which have softened.
(2) Surface the strongest *disagreement* in the last 6 months: do any recent papers argue that grokking *is* universal and three-phase despite these findings?
(3) Propose two research questions that assume phase structure is task-dependent: (a) can we predict the number of learning phases from task structure alone? (b) can we design evaluation to definitively separate memorized subgraph matching from true generalization?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
