Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This note explores whether structural inductive biases in the training data matter more than raw data volume.

Synthesis note · 2026-02-21 · sourced from Linguistics, NLP, NLU

Between Circuits and Chomsky (2025) tests whether training language models on formal languages before natural language (pre-pretraining) makes natural language acquisition more efficient. The reported effect is substantial.

For a 1B-parameter model trained on ~1.6B natural language tokens, pre-pretraining on formal languages with hierarchical dependencies:

Achieves the same loss as natural language-only training
Shows better linguistic generalization on syntactic evaluations
Uses 33% fewer natural language tokens to reach equivalent performance
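
The arithmetic behind the token-efficiency claim can be sketched as follows. This is a minimal illustration that uses only the two figures quoted above (~1.6B tokens and a 33% reduction); the function names are hypothetical, and how the source paper accounts for the formal language tokens themselves is not stated in this note.

```python
# Figures quoted in the note; everything else here is illustrative.
BASELINE_NL_TOKENS = 1.6e9   # ~1.6B natural language tokens (baseline run)
TOKEN_REDUCTION = 0.33       # 33% fewer tokens to reach the same loss


def tokens_needed(baseline_tokens: float, reduction: float) -> float:
    """Natural language tokens needed after a fractional reduction."""
    return baseline_tokens * (1.0 - reduction)


def efficiency_ratio(reduction: float) -> float:
    """Baseline tokens replaced by each token after the reduction."""
    return 1.0 / (1.0 - reduction)


needed = tokens_needed(BASELINE_NL_TOKENS, TOKEN_REDUCTION)
saved = BASELINE_NL_TOKENS - needed

print(f"tokens needed:    {needed / 1e9:.2f}B")
print(f"tokens saved:     {saved / 1e9:.2f}B")
print(f"efficiency ratio: {efficiency_ratio(TOKEN_REDUCTION):.2f}x")
```

Under these figures, the pre-pretrained model would need roughly 1.07B natural language tokens instead of 1.6B. A 33% smaller token budget is not the same as 33% greater efficiency: each token does the work of about 1.49 baseline tokens.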

The effect is mechanistically grounded: attention heads acquired during pre-pretraining on formal languages remain crucial for the model's performance on syntactic evaluations in natural language. Structure from formal language training transfers to natural language processing at the level of learned mechanisms.

Why hierarchical formal languages specifically? Papadimitriou & Jurafsky (2023) showed that, within the Chomsky hierarchy, context-sensitive languages transfer best to natural language. Effective transfer requires formal languages that capture the hierarchical dependency structures present in natural language. Not all formal languages transfer; only those that share the structural properties that matter for syntax do.
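
To make "hierarchical dependencies" concrete, here is a minimal sketch of a sampler for a Dyck-style bracket language, a standard textbook example in which every opening symbol must be closed in nested order. This is an illustration only: plain Dyck languages are context-free, so they sit below the context-sensitive languages mentioned above, and this note does not specify which formal languages the cited papers actually used. The bracket inventory, function names, and parameters are all hypothetical.

```python
import random

# Hypothetical inventory of three bracket types (opener, closer).
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]


def sample_dyck(n_pairs: int, rng: random.Random, p_open: float = 0.5) -> str:
    """Sample a balanced bracket string containing exactly n_pairs pairs."""
    out = []
    stack = []            # closers still owed, innermost last
    remaining = n_pairs   # pairs not yet opened
    while remaining > 0 or stack:
        must_open = not stack        # nothing to close, so we have to open
        can_open = remaining > 0
        if can_open and (must_open or rng.random() < p_open):
            opener, closer = rng.choice(BRACKETS)
            out.append(opener)
            stack.append(closer)
            remaining -= 1
        else:
            # Closing always pops the innermost open bracket,
            # which is what makes the dependencies nested.
            out.append(stack.pop())
    return "".join(out)


def is_balanced(s: str) -> bool:
    """Check that s is a well-nested string over BRACKETS."""
    closer_for = dict(BRACKETS)
    closers = set(closer_for.values())
    stack = []
    for ch in s:
        if ch in closer_for:
            stack.append(closer_for[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False
        else:
            return False
    return not stack


rng = random.Random(0)
samples = [sample_dyck(6, rng) for _ in range(5)]
for s in samples:
    print(s, is_balanced(s))
```

Predicting the correct closing symbol in such strings requires tracking an arbitrarily deep stack of open brackets, which is the kind of long-range, nested dependency that also appears in natural language syntax.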

This directly supports the note "Can language models learn grammar from child-scale data?": if syntactic structure can be acquired efficiently from hierarchical formal languages (which encode the relevant inductive biases), then syntactic competence is trainable from far less data than previously thought, as long as the structure of the training data provides the right biases.

The broader implication: for syntactic generalization, structural inductive bias may matter more than data volume. The results suggest that LLMs trained on the right structures learn syntax efficiently, while LLMs trained only on natural language may be learning syntax the hard way.

Related concepts in this collection 4

Can language models learn grammar from child-scale data? If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
this provides a mechanism: hierarchical structure in training data enables efficient syntactic acquisition
What formal languages actually help transformers learn natural language? Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why—combining structural requirements with what transformers can actually learn.
the second constraint on when transfer works
Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
formal language pre-pretraining addresses this by instilling structural inductive biases
Can models learn multi-token concepts during fine-tuning? Does training models to predict multiple tokens at once, rather than one token sequentially, help them form coherent semantic units? This matters because current next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.
both change the learning unit to improve efficiency: pre-pretraining changes the data to hierarchical formal languages, CAFT changes the prediction target to multi-token concepts; complementary approaches operating at different training stages (pre-pretraining vs. post-training)

Original note title

pre-pretraining on hierarchical formal languages achieves 33% greater token efficiency than matched natural language training
