Can formal language pretraining make language models more efficient?
Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This note explores whether structural inductive biases in training data matter more than raw data volume.
Between Circuits and Chomsky (2025) tests whether training language models on formal languages before natural language ("pre-pretraining") makes natural language acquisition more efficient. The result is surprisingly strong:
For a 1B-parameter model trained on ~1.6B natural language tokens, pre-pretraining on formal languages with hierarchical dependencies:
- Achieves the same loss as natural language-only training
- Shows better linguistic generalization on syntactic evaluations
- Uses 33% fewer natural language tokens to reach equivalent performance
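To make the token figures concrete, here is a minimal arithmetic sketch. It assumes the 33% reduction applies directly to the ~1.6B-token baseline; the note does not say how pre-pretraining tokens are counted toward the budget, so treat the outputs as rough illustrations rather than figures reported by the paper. The variable names are illustrative.

```python
# Rough token-budget arithmetic (assumption: the 33% reduction is measured
# against the ~1.6B natural language token baseline quoted in the note).
baseline_tokens = 1.6e9  # approximate natural language budget
reduction = 0.33         # "33% fewer tokens"

reduced_tokens = baseline_tokens * (1 - reduction)
tokens_saved = baseline_tokens - reduced_tokens
# Fewer tokens for the same loss means each token "buys" more progress.
efficiency_ratio = baseline_tokens / reduced_tokens

print(f"reduced budget: {reduced_tokens / 1e9:.2f}B tokens")
print(f"tokens saved:   {tokens_saved / 1e9:.2f}B tokens")
print(f"efficiency:     {efficiency_ratio:.2f}x per token")
```

Under this assumption, a 33% smaller budget corresponds to roughly 1.07B tokens, or about 1.49x as much progress per natural language token.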
The effect is mechanistically grounded: attention heads acquired during pre-pretraining on formal languages remain crucial for the model's performance on syntactic evaluations in natural language. Structure from formal language training transfers to natural language processing at the level of learned mechanisms.
Why hierarchical formal languages specifically? Papadimitriou & Jurafsky (2023) showed that within the Chomsky hierarchy, context-sensitive languages transfer best to natural language. The key point is that effective transfer requires formal languages that capture the hierarchical dependency structures present in natural language. Not all formal languages transfer; only those that share the structural properties that matter for syntax do.
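As an illustration of what "hierarchical dependencies" means in a formal language, the sketch below samples strings of nested, typed brackets (a Dyck-style language) and checks that every opening bracket is closed by a matching one in last-in, first-out order. This is an assumption-laden example: the function names `sample_dyck` and `is_balanced` are hypothetical, a plain Dyck language is context-free rather than context-sensitive, and nothing here is claimed to be the exact language or sampler used in the paper.

```python
import random

def sample_dyck(k, length, rng, p_open=0.5):
    """Sample a balanced sequence of `length` tokens over k bracket types.

    Illustrative sampler only; `length` must be even.
    """
    if length % 2 != 0:
        raise ValueError("length must be even")
    tokens, stack = [], []
    while len(tokens) < length:
        remaining = length - len(tokens)
        # If the open brackets exactly fill the remaining slots, we must close.
        must_close = len(stack) == remaining
        if stack and (must_close or rng.random() >= p_open):
            tokens.append(f")_{stack.pop()}")  # close the most recent bracket
        else:
            kind = rng.randrange(k)
            stack.append(kind)
            tokens.append(f"(_{kind}")
    return tokens

def is_balanced(tokens):
    """Check that brackets close in nested (last-in, first-out) order."""
    stack = []
    for tok in tokens:
        bracket, kind = tok.split("_")
        if bracket == "(":
            stack.append(kind)
        elif not stack or stack.pop() != kind:
            return False
    return not stack

rng = random.Random(0)
sample = sample_dyck(k=3, length=20, rng=rng)
print(" ".join(sample))
print(is_balanced(sample))
```

Predicting the correct closing token requires tracking the whole stack of unclosed brackets, which loosely mirrors how nested clauses in natural language must be resolved in order.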
This directly supports the related note "Can language models learn grammar from child-scale data?": if syntactic structure is efficiently acquirable from hierarchical formal languages (which encode the relevant inductive biases), then syntactic competence is trainable from far less data than previously thought, as long as the structure of training provides the right biases.
The broader implication: data volume matters less than structural inductive bias for syntactic generalization. LLMs trained on the right structures learn syntax efficiently; LLMs trained only on natural language may be learning syntax the hard way.
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries.
- Why do context-sensitive languages transfer better than regular or context-free languages?
- What happens when formal languages satisfy hierarchy but fail learnability constraints?
- What's the difference between formal and functional linguistic competence?
- Why do only context-sensitive formal languages transfer effectively to natural language?
- Can formal language pretraining address surface generalization without learning true linguistic structure?
- How much do structural inductive biases matter compared to training data volume?
- Why does hierarchical formal language training improve token efficiency more than natural language?
- Why does augmenting natural language with formal representations outperform full formalization?
- What limits the effectiveness of formal language pretraining on transformer architectures?
- How does training order affect knowledge acquisition in language models?
Related concepts in this collection (4)
- Can language models learn grammar from child-scale data?
If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
this provides a mechanism: hierarchical structure in training data enables efficient syntactic acquisition
- What formal languages actually help transformers learn natural language?
Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why—combining structural requirements with what transformers can actually learn.
the second constraint on when transfer works
- Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
formal language pre-pretraining addresses this by instilling structural inductive biases
- Can models learn multi-token concepts during fine-tuning?
Does training models to predict multiple tokens at once, rather than one token sequentially, help them form coherent semantic units? This matters because current next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.
both change the learning unit to improve efficiency: pre-pretraining changes the data to hierarchical formal languages, while CAFT changes the prediction target to multi-token concepts. The two approaches are complementary and operate at different training stages (pre-pretraining vs. post-training)
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Language models show human-like content effects on reasoning tasks
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Chain-of-thought Reasoning Is A Policy Improvement Operator
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
Original note title
pre-pretraining on hierarchical formal languages achieves 33% greater token efficiency than matched natural language training