SYNTHESIS NOTE
Model Architecture and Internals

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

Latent-Thought Language Models (LTMs) propose a scaling strategy other than adding parameters or lengthening contexts: explicit latent thought vectors that follow a prior model in latent space and guide autoregressive token generation. This creates additional scaling dimensions: increasing training compute per token yields higher sample efficiency, and trading model size for more inference steps yields further gains.

Architecture. Latent thought vectors form an abstract representation of the entire sequence and condition the decoder's generation of each token. Training uses variational Bayes with a dual-rate process: fast learning of local variational parameters for the posterior distribution of the latent vectors (adapting quickly to each specific input), coupled with slow learning of global decoder parameters (gradually accumulating general knowledge).
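The dual-rate process can be illustrated with a toy sketch. This is not the LTM implementation: it assumes a scalar linear-Gaussian "decoder" x ~ N(w·z, noise_var) with a standard normal prior on z, so that the evidence lower bound (ELBO) and its gradients have closed forms. The names elbo, infer_local and train, the learning rates and the step counts are all hypothetical choices made for illustration. The local variational parameters (mu, log_var) are re-fitted with several fast gradient steps for every input, while the global parameter w receives a single slow step per input.

```python
import math


def elbo(x, w, mu, log_var, noise_var=1.0):
    """Closed-form ELBO for x ~ N(w*z, noise_var), z ~ N(0, 1), q(z) = N(mu, exp(log_var))."""
    var = math.exp(log_var)
    # Expected log-likelihood under q.
    recon = (-0.5 * math.log(2 * math.pi * noise_var)
             - ((x - w * mu) ** 2 + w * w * var) / (2 * noise_var))
    # KL divergence from q(z) to the standard normal prior.
    kl = 0.5 * (mu * mu + var - 1.0 - log_var)
    return recon - kl


def infer_local(x, w, steps, lr_fast=0.2, noise_var=1.0):
    """Fast loop: gradient ascent on the local variational parameters of one input."""
    mu, log_var = 0.0, 0.0  # start from the prior
    for _ in range(steps):
        var = math.exp(log_var)
        g_mu = w * (x - w * mu) / noise_var - mu
        g_log_var = 0.5 - 0.5 * var * (w * w / noise_var + 1.0)
        mu += lr_fast * g_mu
        log_var += lr_fast * g_log_var
    return mu, log_var


def train(data, epochs=400, fast_steps=16, lr_slow=0.01, noise_var=1.0):
    """Slow loop: one small gradient step on the global parameter w per input."""
    w = 0.1  # hypothetical initial value
    for _ in range(epochs):
        for x in data:
            # Local parameters are re-inferred from scratch for each input (fast).
            mu, log_var = infer_local(x, w, fast_steps, noise_var=noise_var)
            var = math.exp(log_var)
            # ELBO gradient with respect to the global parameter (slow).
            g_w = ((x - w * mu) * mu - w * var) / noise_var
            w += lr_slow * g_w
    return w
```

Calling infer_local with a larger steps value tightens the ELBO for a fixed w, which is a small-scale analogue of spending more inference compute per input instead of enlarging the model. In this toy setting the exact posterior is Gaussian, so the bound can become tight; that property should not be assumed for a real decoder.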

Cognitive inspiration. The dual-rate scheme parallels established cognitive models that pair fast, input-specific learning with slow accumulation of general knowledge.

Scaling properties. LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models, and they significantly outperform these baselines on validation perplexity and zero-shot language modeling. Emergent few-shot in-context reasoning capabilities scale with both model size and latent size, which gives two independent scaling dimensions.

LTMs are related to existing latent reasoning approaches but differ in mechanism. The note "Can models reason without generating visible thinking tokens?" describes depth-recurrent architectures that iterate in latent space at inference time. LTMs instead use latent vectors as sequence-level abstractions that guide token generation, not as per-token iterative computation. Dual-rate learning also provides a training-time mechanism that depth recurrence lacks.

The Titans parallel is also notable: the note "Can neural memory modules scale language models beyond attention limits?" describes how Titans separates fast attention (short-term) from slow memory (long-term). LTMs separate fast local adaptation from slow global learning. Both architectures implement the fast-slow cognitive distinction but at different levels: Titans for memory, LTMs for generation.

Original note title

latent-thought language models introduce additional scaling dimensions beyond parameters by incorporating explicit latent thought vectors with dual-rate learning