Do discrete tokenized modalities preserve information better than continuous embeddings?

This note asks whether splitting a modality (text, image, audio) into discrete tokens retains more of the original signal than mapping it into a smooth continuous embedding space. The corpus suggests that fidelity is the wrong axis to judge them on.

The honest answer from the corpus is that discretization usually *throws information away on purpose*, and that loss is often the point. When VQ-Rec maps an item's text to discrete codes via product quantization and then uses those codes to index learned embeddings, it deliberately compresses away fine-grained textual detail. That is exactly why it transfers across domains better than a direct text embedding: the discrete bottleneck strips out text-similarity bias that would otherwise overfit to one domain's vocabulary (see "Can discrete codes transfer better than text embeddings?" and "Can discretizing text embeddings improve recommendation transfer?"). So the win isn't preservation; it's useful forgetting.
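The bottleneck can be made concrete with a minimal product-quantization sketch. It is not VQ-Rec's implementation: the codebooks are hand-written toy values (real ones are learned, typically by k-means), the names `pq_encode`, `pq_decode`, and `CODEBOOKS` are hypothetical, and decoding here returns the centroids, whereas VQ-Rec uses the codes to index separately learned embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[List[float]]  # one centroid per row


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec: Vector, codebooks: List[Codebook]) -> List[int]:
    """Split vec into one chunk per codebook and return the index of the
    nearest centroid for each chunk."""
    n_sub = len(codebooks)
    if len(vec) % n_sub != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    width = len(vec) // n_sub
    codes = []
    for i, book in enumerate(codebooks):
        chunk = vec[i * width:(i + 1) * width]
        nearest = min(range(len(book)), key=lambda k: squared_distance(chunk, book[k]))
        codes.append(nearest)
    return codes


def pq_decode(codes: Sequence[int], codebooks: List[Codebook]) -> List[float]:
    """Concatenate the chosen centroids: a lossy reconstruction of the input."""
    out: List[float] = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


# Assumption: toy codebooks with 2 subspaces and 2 centroids each.
CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# Two slightly different "text embeddings" (made-up numbers).
item_a = [0.9, 1.1, 0.1, 0.8]
item_b = [1.2, 0.8, -0.1, 1.1]

codes_a = pq_encode(item_a, CODEBOOKS)
codes_b = pq_encode(item_b, CODEBOOKS)
print(codes_a, codes_b, pq_decode(codes_a, CODEBOOKS))
```

Both items collapse to the same code tuple, so whatever distinguished them in the continuous space is gone after encoding. That is the deliberate loss described above.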

Where discrete tokens genuinely shine is composition and unification, not fidelity. A model like MIO trains on mixed discrete tokens across four modalities and gains abilities, such as interleaved video-text output and chain-of-visual-thought reasoning, that dual-modal, encoder-based systems built on continuous features can't produce. The reason is that a shared discrete vocabulary lets one autoregressive model treat everything as the same kind of symbol (see "Can a single model generate all modalities without external encoders?"). Discreteness buys you a common substrate, not richer detail.
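One common way to get such a shared vocabulary is to give each modality's tokenizer its own contiguous ID range. The sketch below assumes that layout; it is not taken from MIO, and the modality names, codebook sizes, and helper names (`to_shared`, `from_shared`) are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# Assumption: illustrative per-modality vocabulary sizes, not MIO's real ones.
VOCAB_SIZES: Dict[str, int] = {"text": 1000, "image": 512, "speech": 256, "video": 512}


def build_offsets(sizes: Dict[str, int]) -> Dict[str, int]:
    """Assign each modality a contiguous block of IDs in one shared vocabulary."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets


OFFSETS = build_offsets(VOCAB_SIZES)


def to_shared(modality: str, local_id: int) -> int:
    """Map a modality-local token ID to its ID in the shared vocabulary."""
    if not 0 <= local_id < VOCAB_SIZES[modality]:
        raise ValueError(f"token {local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def from_shared(shared_id: int) -> Tuple[str, int]:
    """Recover (modality, local ID) from a shared-vocabulary ID."""
    for name, start in OFFSETS.items():
        if start <= shared_id < start + VOCAB_SIZES[name]:
            return name, shared_id - start
    raise ValueError(f"token {shared_id} is outside the shared vocabulary")


# An interleaved text / image / text sequence becomes one flat list of integers
# that a single autoregressive model could consume.
segments: List[Tuple[str, List[int]]] = [("text", [5, 17]), ("image", [0, 511]), ("text", [42])]
sequence = [to_shared(modality, tok) for modality, toks in segments for tok in toks]
print(sequence)
```

Once everything is an integer in one range, next-token prediction needs no modality-specific machinery at the sequence level. The detail lost inside each tokenizer is a separate matter.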

The continuous side, meanwhile, wins on a dimension the question doesn't ask about: learning efficiency. A formal sample-complexity result shows that predicting your own continuous latents recovers compositional structure with a number of samples that stays constant in hierarchy depth, whereas predicting discrete tokens needs exponentially many, because same-level latents are far more correlated than raw tokens are (see "Why is predicting latents more sample-efficient than tokens?"). Reasoning can also happen entirely in a continuous, language-agnostic sentence-embedding space before any tokens are decoded (see "Can reasoning happen at the sentence level instead of tokens?"). So if anything, continuous representations preserve *relational* structure that token boundaries fragment.

The deeper twist is that, regardless of format, tokens do not all carry equal information. In reasoning chains, models implicitly rank tokens by function: symbolic-computation tokens are preserved while grammar and filler are pruned first (see "Which tokens in reasoning chains actually matter most?"). Only about 20% of tokens, the high-entropy 'forking points,' actually drive learning (see "Do high-entropy tokens drive reasoning model improvements?"). Information density is wildly uneven inside the token stream itself, which means 'discrete vs. continuous' matters less than *which* parts of the signal a representation chooses to keep sharp.
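The "high-entropy forking point" idea can be sketched by scoring each position with the Shannon entropy of the model's next-token distribution and keeping the top fraction. This is an illustration under assumptions, not the cited work's procedure: the distributions are made up, and the function names and the `fraction=0.2` default are hypothetical, chosen to mirror the roughly 20% figure above.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(distributions: Sequence[Sequence[float]],
                           fraction: float = 0.2) -> List[int]:
    """Return the positions whose next-token distribution has the highest
    entropy, keeping roughly `fraction` of all positions (at least one)."""
    entropies = [entropy(d) for d in distributions]
    k = max(1, round(len(entropies) * fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Assumption: toy next-token distributions over a 4-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [1.00, 0.00, 0.00, 0.00],  # deterministic
    [0.70, 0.20, 0.05, 0.05],
]

forks = high_entropy_positions(dists)
print(forks)
```

A training loop that wanted to focus on these positions could use the returned indices as a loss mask. How the cited work actually applies the selection is not specified here.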

The thing you didn't know you wanted to know: both formats sit downstream of a bigger loss. Text itself is a lossy human abstraction that already strips physics, geometry, and causality before any tokenizer or embedder touches it (see "Are text-only language models fundamentally limited by abstraction?"). Yet a plain natural-language *description* of an image can bridge a recognition task better than raw embedding similarity does (see "Can describing images in text improve zero-shot recognition?"). So the real question isn't whether discrete tokens preserve more than continuous embeddings, but which abstraction keeps the information *your task actually needs*. Both are deliberately, usefully lossy.

Sources (9 notes)

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can a single model generate all modalities without external encoders?

MIO trains a foundation model on mixed discrete tokens across four modalities with causal modeling, achieving end-to-end generation in both directions. The model emergently produces interleaved video-text output and chain-of-visual-thought reasoning that dual-modal encoder-based systems cannot.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.
