Can text-trained models compress images better than specialized tools?
Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
The source coding theorem (Shannon, 1948) makes the equivalence between prediction and compression formal: maximizing a model's log-likelihood is equivalent to minimizing the expected number of bits per message. A probabilistic model IS a compressor, and a compressor IS a probabilistic model. Delétang et al. (arXiv:2309.10668) take this from theory to practice by measuring the offline (in-context) compression capabilities of large language models across data modalities.
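A minimal sketch of this identity in Python. The per-symbol probabilities are hypothetical values chosen for easy arithmetic, not outputs of any real model, and the helper names are illustrative. The point is only that the ideal code length in bits is the negative log-likelihood measured in base 2; an arithmetic coder driven by the same probabilities gets within a couple of bits of this total.

```python
import math

def code_length_bits(probs):
    """Ideal code length: sum of -log2 p over the symbols that occurred."""
    return sum(-math.log2(p) for p in probs)

def log_likelihood_nats(probs):
    """Log-likelihood of the same sequence, in nats (natural log)."""
    return sum(math.log(p) for p in probs)

# Hypothetical probabilities a model assigned to each symbol that actually
# occurred, given the symbols before it.
probs = [0.5, 0.25, 0.125, 0.125]

bits = code_length_bits(probs)    # 1 + 2 + 3 + 3 bits
nats = log_likelihood_nats(probs)

# Same quantity up to sign and a change of logarithm base.
print(bits, -nats / math.log(2))
```

Raising the log-likelihood of the data by any amount lowers the code length by the same amount, which is why a training objective and a compression objective coincide.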
The striking finding: Chinchilla models, trained primarily on internet text (Wikipedia, websites, GitHub, books), compress image and audio data better than domain-specific compressors. The paper reports that Chinchilla 70B compresses ImageNet patches to 43.4% of their raw size (PNG: 58.5%) and LibriSpeech samples to 16.4% (FLAC: 30.3%). This is not what you'd expect: domain-specific compressors are engineered for their modality, so a text-trained model shouldn't outperform them on non-text data.
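For reference, the raw compression rate behind comparisons like this is simply compressed size divided by raw size. The sketch below computes it with Python's standard-library zlib (DEFLATE, the algorithm PNG uses internally) as a stand-in baseline. The input bytes are a hypothetical toy pattern, not image or audio data, and `compression_rate` is an illustrative helper.

```python
import zlib

def compression_rate(raw: bytes, compress) -> float:
    """Compressed size divided by raw size; lower is better."""
    return len(compress(raw)) / len(raw)

# Hypothetical toy input: a 256-byte ramp repeated 64 times (16 KiB).
raw = bytes(range(256)) * 64

rate_zlib = compression_rate(raw, zlib.compress)
print(f"zlib compression rate: {rate_zlib:.3f}")
```

Any compressor that maps bytes to bytes can be passed as `compress`, so the same function would apply to a model-driven arithmetic coder.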
The mechanism is in-context learning functioning as conditioning. The model doesn't learn image or audio representations during training. Instead, at compression time, it uses its context window to condition itself as a task-specific compressor. The result is general-purpose compression via adaptation, not specialization.
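To make "conditioning on context" concrete, here is a toy analogue in Python. It is not the paper's method and involves no language model: it is a Laplace-smoothed adaptive byte model (an assumption chosen for simplicity) that starts out uniform and updates only from the bytes it has already coded. Its per-byte cost falls as the context grows, even though nothing was learned ahead of time.

```python
import math

def adaptive_code_lengths(data: bytes, alphabet_size: int = 256) -> list:
    """Per-byte code lengths (bits) under an add-one adaptive byte model.

    The model conditions only on bytes seen so far in `data`, a toy
    stand-in for a context window.
    """
    counts = [1] * alphabet_size   # add-one prior: uniform before any context
    total = alphabet_size
    lengths = []
    for b in data:
        p = counts[b] / total      # predictive probability given the prefix
        lengths.append(-math.log2(p))
        counts[b] += 1             # update after coding, so a decoder can mirror it
        total += 1
    return lengths

# Hypothetical, highly repetitive input.
data = b"ab" * 512
lengths = adaptive_code_lengths(data)

first = sum(lengths[:64]) / 64     # average bits/byte with little context
last = sum(lengths[-64:]) / 64     # average bits/byte with a long context
print(f"first 64 bytes: {first:.2f} bits/byte, last 64 bytes: {last:.2f} bits/byte")
```

A large language model plays the same role with a far richer predictor: the weights stay fixed, and the adaptation to a new modality happens entirely through the prefix.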
However, this comes with a scaling caveat that inverts the typical scaling-law narrative. When measuring the adjusted compression rate, which counts the model's parameters as part of the compressed output, scaling beyond a certain point worsens compression performance because the parameters themselves become overhead. Smaller models trained specifically on the target data can achieve better adjusted compression than massive general-purpose models. The related note "Why does reasoning training help math but hurt medical tasks?" separates fact-compressing lower layers from procedure-compressing higher layers; on that reading, the adjusted compression overhead may concentrate in the deeper reasoning layers, the same layers that show redundancy under pruning.
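The arithmetic behind this caveat can be sketched as follows. All sizes are hypothetical round numbers, and the 2 bytes per parameter (half-precision weights) is an assumption made for illustration; the function names are likewise illustrative.

```python
def raw_compression_rate(compressed_bytes: int, raw_bytes: int) -> float:
    """Compressed size over raw size, ignoring the size of the model."""
    return compressed_bytes / raw_bytes

def adjusted_compression_rate(compressed_bytes: int, raw_bytes: int,
                              n_params: int, bytes_per_param: int = 2) -> float:
    """Same ratio, but the model's weights count as part of the output."""
    model_bytes = n_params * bytes_per_param   # assumed storage cost per weight
    return (compressed_bytes + model_bytes) / raw_bytes

RAW = 1_000_000_000  # hypothetical 1 GB dataset

# Hypothetical small model: 1M parameters, compresses the data to 300 MB.
small = adjusted_compression_rate(300_000_000, RAW, n_params=1_000_000)

# Hypothetical large model: 1B parameters, compresses the data to 100 MB.
large = adjusted_compression_rate(100_000_000, RAW, n_params=1_000_000_000)

print(f"small: {small:.3f}, large: {large:.3f}")
```

In this toy setup the large model wins on raw rate (0.10 against 0.30) but its weights alone are twice the size of the dataset, so its adjusted rate exceeds 1. The crossover point depends on how much data is being compressed relative to the parameter count.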
The deeper principle: a model that compresses well generalizes well (Hutter, 2006). This reframes generalization as a compression problem rather than a learning problem. The related note "Do foundation models learn world models or task-specific shortcuts?" asks whether accurate sequence predictors exploit narrow heuristics instead of building world models; the compression framing suggests these heuristics are efficient compression shortcuts, good enough to compress but not sufficient for genuine world modeling.
Inquiring lines that use this note as a source (16)
- Why does statistical compression destroy literary connotation and meaning?
- Why does language compression via statistical dependencies capture cultural and situated language use?
- What compression explains why syntax fits in low-dimensional subspaces?
- Why does each rewrite cycle degrade domain-specific details differently than compression?
- What specific information must be exported from the language system?
- Why does capturing domain structure reduce data requirements more than raw volume?
- Why does transcription destroy prosodic information in speech processing?
- Why does training data format matter more than its domain content?
- Why does adjusted compression performance degrade as models scale larger?
- How do general language model benchmarks predict specialized domain performance?
- Does training data format matter more than who generates it?
- What information does transcription destroy that direct speech-to-speech models preserve?
- How does modeling capability relate to lossless compression in language models?
- How does epiplexity measure extractable value differently from compression codelength?
- Can task-agnostic compression of documents remain broadly useful for later queries?
- How does the compression view extend from trained models to training objectives?
Related concepts in this collection (7)
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  compression may explain why heuristics suffice: they compress well enough
- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  compression-as-generalization offers an alternative framing for how world models emerge
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  the compression framing maps onto the layer separation: lower layers compress facts (memorization-heavy, document-specific), while higher layers compress procedures (generalizable across instances); the scaling caveat where adjusted compression deteriorates for larger models may reflect redundancy in deeper reasoning layers
- Does procedural knowledge drive reasoning more than factual retrieval?
  Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
  procedural knowledge compresses better than factual knowledge because one procedure covers many instances, directly explaining why compression = generalization holds more strongly for reasoning than for factual recall
- Are neural network optimizers actually memory systems?
  Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.
  Nested Learning operationalizes the compression principle at the component level: every NN component (including optimizers) is an associative memory compressing its context flow, making compression = generalization apply recursively at every nesting level
- Can a reasoning model's thinking trace compress context effectively?
  Does the raw reasoning trace produced by a thinking model naturally function as a context compressor without specialized training or modules? And how does this compare to dedicated compression methods?
  extends: the modeling-is-compression identity carries forward from weights to thinking traces used directly as compressed context
- Why do Shannon and Kolmogorov measures fail to value data?
  Shannon information and Kolmogorov complexity assume unlimited computational capacity. But do these classical measures actually capture what bounded learners can extract from real data?
  extends: refines modeling-is-compression by distinguishing codelength from a bounded model's extractable value
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Language Modeling is Compression
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
- Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
- How much do language models memorize?
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
Original note title
language modeling is equivalent to lossless compression — LLMs trained on text outperform domain-specific compressors on images and audio via in-context conditioning