INQUIRING LINE

Why does hierarchical formal language training improve token efficiency more than natural language?

This explores why pre-training a model on structured, rule-based formal languages first — before it sees ordinary text — lets it learn syntax with far fewer natural-language tokens than starting on natural language alone.


The headline result is concrete: pre-pretraining 1B models on hierarchical formal languages reaches the same loss and better syntactic generalization with 33% fewer natural-language tokens (see "Can formal language pretraining make language models more efficient?"). The effect is mechanistic, not incidental: attention heads shaped by the formal-language phase stay critical for syntactic performance later on real text. The model isn't memorizing formal strings; it's building reusable structural machinery that natural language would otherwise have to teach the slow way.
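To make "hierarchical formal language" concrete, here is a minimal sketch that samples well-nested bracket strings, a Dyck-style language. This is an illustrative assumption, not the setup from the cited note: the note does not say which formal languages were used, and the names `BRACKETS`, `sample_dyck`, `is_well_nested`, and `max_depth` are hypothetical.

```python
import random

# Hypothetical bracket inventory (assumption: the source does not name
# the formal languages that were actually used for pre-pretraining).
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]
CLOSER_FOR = {o: c for o, c in BRACKETS}


def sample_dyck(n_pairs, rng, p_open=0.5):
    """Sample a well-nested string containing exactly n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        # Open a new bracket if allowed; forced when nothing is left to close.
        if can_open and (not stack or rng.random() < p_open):
            o, c = rng.choice(BRACKETS)
            out.append(o)
            stack.append(c)  # remember which closer this opener requires
            opened += 1
        else:
            out.append(stack.pop())  # close the most recently opened bracket
    return "".join(out)


def is_well_nested(s):
    """Check that every closer matches the most recent unmatched opener."""
    stack = []
    for ch in s:
        if ch in CLOSER_FOR:
            stack.append(CLOSER_FOR[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack


def max_depth(s):
    """Deepest nesting level reached in a well-nested string."""
    depth = best = 0
    for ch in s:
        depth += 1 if ch in CLOSER_FOR else -1
        best = max(best, depth)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        s = sample_dyck(8, rng)
        print(s, is_well_nested(s), max_depth(s))
```

Every token in such a string is structurally constrained: a closer is only valid if it matches the most recent unmatched opener, so predicting it requires tracking the nesting stack. That is the kind of dependency the paragraph above calls reusable structural machinery.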

The reason this works comes into focus when you look at what carries the learning signal in a sequence. Not every token teaches equally: only about 20% of tokens are high-entropy 'forking points' where the model actually makes decisions, and training on just those matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). Reasoning chains show the same skew: models internally rank tokens by function and preferentially preserve symbolic-computation tokens while pruning grammar and filler first (see "Which tokens in reasoning chains actually matter most?"). Natural language is mostly low-signal connective tissue wrapped around a thin spine of structure. Formal languages are almost all spine. So a formal-language token is denser in the thing that transfers (hierarchical, nested, rule-governed structure), which is a plausible reason a short formal-language phase can stand in for a much larger share of natural-language tokens.
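The idea of high-entropy forking points can be sketched in a few lines. The code below computes the Shannon entropy of each position's next-token distribution and keeps only the top fraction of positions. It is a simplified illustration under stated assumptions, not the procedure from the cited note: the function names, the `keep_fraction` parameter, and the toy distributions are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(distributions, keep_fraction=0.2):
    """Flag the top keep_fraction of positions, ranked by entropy.

    Assumption: 'forking' is defined here as a simple top-fraction cut;
    the source only says that about 20% of tokens are high-entropy.
    """
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(losses, mask):
    """Average per-token loss over the flagged positions only."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy distributions over a 3-token vocabulary (made up for illustration).
    dists = [
        [1.0, 0.0, 0.0],      # fully determined: zero entropy
        [0.5, 0.5, 0.0],
        [0.9, 0.1, 0.0],
        [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain: the forking point
        [0.99, 0.01, 0.0],
    ]
    mask = forking_mask(dists, keep_fraction=0.2)
    print(mask)
    print(masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask))
```

In a real training loop the mask would gate which positions contribute to the gradient; here it only selects which per-token losses are averaged, which is enough to show how a minority of positions can be singled out.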

There's a deeper tension underneath, too. LLMs lean on semantic associations rather than formal logical manipulation; when you strip the familiar meaning out of a task, performance collapses even when the correct rules are sitting in context (see "Do large language models reason symbolically or semantically?"). Natural-language pretraining over-feeds the semantic-association channel and under-builds the structural one. Formal-language pre-pretraining is a corrective: it forces the model to acquire the syntactic, hierarchy-tracking apparatus directly, rather than hoping it precipitates out of enough sentences about the world.

The same instinct, giving the model the right level of abstraction to work at, shows up elsewhere in the corpus under different names. Meta's Large Concept Model reasons over sentence embeddings in a language-agnostic space with paragraph-level planning, and produces more coherent output than flat token-by-token generation (see "Can reasoning happen at the sentence level instead of tokens?"). That's the hierarchy argument from the output side; formal-language pre-pretraining is the same argument from the input side. Both say: structure imposed at the right granularity beats brute token volume.

It's worth knowing where the limits are. Pretraining choices set a real ceiling: prompting can only reorganize knowledge already in the training distribution, not inject what was never there (see "Can prompt optimization teach models knowledge they lack?"), and strong parametric priors can override context the model is explicitly given (see "Why do language models ignore information in their context?"). That's the case for installing good structural priors early: the formal-language phase seeds attention heads that persist into natural-language training and that you can't reliably bolt on afterward, which is the argued reason 33% of the natural-language budget becomes unnecessary.


Sources (7 notes)

Can formal language pretraining make language models more efficient?

Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
