SYNTHESIS NOTE

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Synthesis note · 2026-02-22 · sourced from LLM Architecture

The source coding theorem (Shannon, 1948) makes the equivalence between prediction and compression formal: maximizing log-likelihood is equivalent to minimizing the expected number of bits per message. A probabilistic model IS a compressor, and a compressor IS a probabilistic model. Delétang et al. (arXiv:2309.10668, "Language Modeling Is Compression") take this from theory to practice by measuring the offline compression capabilities of large language models across data modalities.
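The link between log-likelihood and bits can be made concrete with a minimal sketch. It assumes an idealized entropy coder (such as arithmetic coding) that spends about -log2 p bits on a symbol the model predicted with probability p; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def ideal_code_length_bits(probs: Sequence[float]) -> float:
    """Total bits an ideal entropy coder would spend.

    `probs` holds the probability the model assigned to each symbol
    that actually occurred, in order.
    """
    return sum(-math.log2(p) for p in probs)


def log_likelihood_nats(probs: Sequence[float]) -> float:
    """Log-likelihood of the same sequence, in nats."""
    return sum(math.log(p) for p in probs)


# Hypothetical per-symbol probabilities from some model.
example_probs = [0.5, 0.25, 0.25]
bits = ideal_code_length_bits(example_probs)

# The code length is the negative log-likelihood, rescaled from nats to bits,
# so raising the likelihood and shrinking the code are the same objective.
bits_from_likelihood = -log_likelihood_nats(example_probs) / math.log(2)
```

A real coder adds a small constant overhead on top of this ideal length, which the sketch ignores.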

The striking finding: Chinchilla 70B, trained primarily on internet text (Wikipedia, websites, GitHub, books), compresses ImageNet patches to 43.4% and LibriSpeech audio to 16.4% of their raw size, beating domain-specific compressors such as PNG (58.5%) and FLAC (30.3%) on their own modalities. This is not what you'd expect. Domain-specific compressors are engineered for their modality. A text-trained model shouldn't outperform them on non-text data.

The mechanism is in-context learning functioning as conditioning. The model doesn't learn image or audio representations during training. Instead, at compression time, it uses its context window to condition itself as a task-specific compressor. General-purpose compression via adaptation, not specialization.
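The idea of conditioning on context at compression time can be illustrated with a toy stand-in. The sketch below is not the paper's method: it swaps the language model for a simple adaptive byte-frequency model with Laplace smoothing, which starts with no knowledge of the data and sharpens its predictions only from the bytes it has already seen. All names here are hypothetical.

```python
import math


def adaptive_code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Ideal code length when each byte is predicted from the bytes before it."""
    counts = [1] * alphabet_size  # uniform start: every byte value equally likely
    total = alphabet_size
    bits = 0.0
    for byte in data:
        # Cost of this byte under the current, context-conditioned prediction.
        bits += -math.log2(counts[byte] / total)
        # Update the model with what was just seen (the "context").
        counts[byte] += 1
        total += 1
    return bits


def compression_rate(data: bytes) -> float:
    """Compressed size divided by raw size (lower is better)."""
    return adaptive_code_length_bits(data) / (8 * len(data))


# Highly regular data becomes cheap once the model has seen a little of it.
regular = bytes([10, 10, 10, 200] * 250)
regular_rate = compression_rate(regular)
```

The first byte always costs a full 8 bits because nothing has been observed yet; the savings come entirely from adaptation within the sequence, which is the role the context window plays for the language model.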

However, this comes with a scaling caveat that inverts the typical scaling law narrative. When measuring the adjusted compression rate (which counts the model's parameters as part of the compressed output), scaling beyond a certain point worsens compression performance. The parameters themselves become overhead. Smaller models trained specifically on the target data can achieve better adjusted compression than massive general-purpose models. Following the layer-separation argument in "Why does reasoning training help math but hurt medical tasks?", the adjusted compression overhead may concentrate in the deeper reasoning layers, the same layers that show redundancy under pruning.
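A small sketch of the arithmetic behind the adjusted rate follows. It assumes the adjusted rate is (compressed size + model size) / raw size and that each parameter is stored in 2 bytes; the dataset size, parameter counts, and raw rates below are made-up illustrative values, not results from the paper.

```python
def raw_compression_rate(compressed_bytes: float, raw_bytes: float) -> float:
    """Compressed size over raw size, ignoring the cost of the model."""
    return compressed_bytes / raw_bytes


def adjusted_compression_rate(
    compressed_bytes: float,
    raw_bytes: float,
    n_params: int,
    bytes_per_param: int = 2,  # assumption: 16-bit parameter storage
) -> float:
    """Compressed size plus model size, over raw size."""
    model_bytes = n_params * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes


# Hypothetical scenario: 1 GB of raw data.
raw = 1_000_000_000

# Hypothetical small model: 1M parameters, compresses the data to 50%.
small_adjusted = adjusted_compression_rate(0.5 * raw, raw, n_params=1_000_000)

# Hypothetical large model: 1B parameters, compresses the data to 30%.
large_adjusted = adjusted_compression_rate(0.3 * raw, raw, n_params=1_000_000_000)
```

In this made-up scenario the larger model wins on raw rate (0.30 versus 0.50) but loses badly once its 2 GB of parameters are counted, since the model alone is bigger than the data it compresses. The overhead shrinks only as the dataset grows relative to the model.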

The deeper principle: a model that compresses well generalizes well (Hutter, 2006). This reframes generalization as a compression problem rather than a learning problem. Read alongside "Do foundation models learn world models or task-specific shortcuts?", the compression framing suggests that the task-specific heuristics models rely on are efficient compression shortcuts: good enough to compress, but not sufficient for genuine world modeling.

Related concepts in this collection 7

Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
compression may explain why heuristics suffice: they compress well enough
Can large language models develop genuine world models without direct environmental contact? Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
compression-as-generalization offers an alternative framing for how world models emerge
Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
the compression framing maps onto the layer separation: lower layers compress facts (memorization-heavy, document-specific), while higher layers compress procedures (generalizable across instances); the scaling caveat where adjusted compression deteriorates for larger models may reflect redundancy in deeper reasoning layers
Does procedural knowledge drive reasoning more than factual retrieval? Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
procedural knowledge compresses better than factual knowledge because one procedure covers many instances, directly explaining why compression = generalization holds more strongly for reasoning than for factual recall
Are neural network optimizers actually memory systems? Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.
Nested Learning operationalizes the compression principle at the component level: every neural network component (including optimizers) is an associative memory compressing its context flow, so compression = generalization applies recursively at every nesting level
Can a reasoning model's thinking trace compress context effectively? Does the raw reasoning trace produced by a thinking model naturally function as a context compressor without specialized training or modules? And how does this compare to dedicated compression methods?
extends: the modeling-is-compression identity carries forward from weights to thinking traces used directly as compressed context
Why do Shannon and Kolmogorov measures fail to value data? Shannon information and Kolmogorov complexity assume unlimited computational capacity. But do these classical measures actually capture what bounded learners can extract from real data?
extends: refines modeling-is-compression by distinguishing codelength from a bounded model's extractable value
