SYNTHESIS NOTE

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

Latent-Thought Language Models (LTMs) propose a scaling strategy other than more parameters or longer contexts: explicit latent thought vectors that follow a prior model in latent space and guide autoregressive token generation. This opens additional scaling dimensions: sample efficiency improves as training compute per token increases, and further gains come from trading model size for more inference steps.
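
As a compact way to read the paragraph above, the generative structure can be written as a latent-variable factorization. The notation is assumed here for illustration and is not taken from the source: z stands for the latent thought vectors, alpha for the prior parameters, beta for the decoder parameters, and T for the sequence length.

```latex
p_{\alpha,\beta}(x, z) = p_\alpha(z)\,\prod_{t=1}^{T} p_\beta(x_t \mid x_{<t}, z),
\qquad
p_{\alpha,\beta}(x) = \int p_\alpha(z)\,\prod_{t=1}^{T} p_\beta(x_t \mid x_{<t}, z)\,dz
```

Every token factor conditions on the same z, which is what makes the latent vectors a sequence-level signal rather than a per-token one.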

Architecture. Latent thought vectors form an abstract representation of the entire sequence and condition the decoder's generation of each token. Training uses variational Bayes with a dual-rate process: fast learning of local variational parameters for the posterior distribution of latent vectors (adapting quickly to specific inputs) coupled with slow learning of global decoder parameters (gradually accumulating general knowledge).
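
The dual-rate idea can be illustrated with a deliberately small stand-in. The sketch below is not the LTM implementation: it swaps the autoregressive decoder for a toy linear-Gaussian decoder x ~ N(Wz, sigma2 I) with prior z ~ N(0, I), so that the evidence lower bound (ELBO) and its gradients have a closed form. All names (`elbo_and_grads`, `fast_adapt`, `train`), the step counts, the learning rates, and the choice to re-initialize local parameters for every batch are assumptions made for illustration.

```python
import numpy as np

def elbo_and_grads(W, X, mu, rho, sigma2=1.0):
    """Closed-form ELBO for the toy model, with q(z) = N(mu, diag(exp(2*rho)))."""
    s2 = np.exp(2.0 * rho)                    # (B, d) posterior variances
    resid = X - mu @ W.T                      # (B, D) error at the posterior mean
    col_norm2 = np.sum(W ** 2, axis=0)        # (d,) squared column norms of W
    D = X.shape[1]
    recon = (-0.5 * D * np.log(2 * np.pi * sigma2)
             - (np.sum(resid ** 2, axis=1) + s2 @ col_norm2) / (2 * sigma2))
    kl = 0.5 * np.sum(mu ** 2 + s2 - 1.0 - 2.0 * rho, axis=1)
    g_mu = resid @ W / sigma2 - mu                          # local gradient
    g_rho = 1.0 - s2 * (col_norm2 / sigma2 + 1.0)           # local gradient
    g_W = (resid.T @ mu - W * s2.sum(axis=0)) / (sigma2 * len(X))  # global, batch mean
    return recon - kl, g_mu, g_rho, g_W

def fast_adapt(W, X, n_fast, lr_fast=0.05):
    """Fast loop: fit per-example local parameters with the decoder W held fixed."""
    mu = np.zeros((len(X), W.shape[1]))       # start each posterior at the prior
    rho = np.zeros_like(mu)
    for _ in range(n_fast):
        _, g_mu, g_rho, _ = elbo_and_grads(W, X, mu, rho)
        mu += lr_fast * g_mu
        rho += lr_fast * g_rho
    return mu, rho

def mean_elbo(W, X, n_fast):
    mu, rho = fast_adapt(W, X, n_fast)
    return float(elbo_and_grads(W, X, mu, rho)[0].mean())

def train(X, W, n_slow=200, n_fast=20, lr_slow=0.02, batch=64, seed=0):
    """Slow loop: one small decoder step after each round of fast local adaptation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_slow):
        idx = rng.choice(len(X), size=batch, replace=False)
        mu, rho = fast_adapt(W, X[idx], n_fast)
        g_W = elbo_and_grads(W, X[idx], mu, rho)[3]
        W = W + lr_slow * g_W
    return W

# Synthetic data drawn from the same toy model (6 observed dims, 2 latent dims).
rng = np.random.default_rng(1)
W_true = rng.normal(size=(6, 2))
X = rng.normal(size=(256, 2)) @ W_true.T + rng.normal(size=(256, 6))

W_init = 0.1 * rng.normal(size=(6, 2))
W = train(X, W_init)
print("mean ELBO before training:", mean_elbo(W_init, X, n_fast=20))
print("mean ELBO after training: ", mean_elbo(W, X, n_fast=20))
print("after training, no fast adaptation:", mean_elbo(W, X, n_fast=0))
```

In this toy setting the two rates are visible as two nested loops: many cheap updates to `mu` and `rho` for each input, and a single small update to `W` per batch. Raising `n_fast` spends more compute per example without adding decoder parameters, which is the kind of trade-off the note describes.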

Cognitive inspiration. The dual-rate scheme parallels established cognitive models:

Declarative-procedural model (Ullman 2004): latent vectors and local parameters parallel declarative/episodic memory; global decoder parameters parallel procedural memory
Fast-slow learning (Kumaran et al. 2016): fast episodic learning interacts with slow schematic learning
Language of thought (Fodor 1975): latent thought vectors as "words" of an internal thought language

Scaling properties. LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these baselines on validation perplexity and zero-shot language modeling. Emergent few-shot in-context reasoning capabilities scale with both model size and latent size, giving two independent scaling dimensions.

LTMs are related to existing latent reasoning approaches but differ in mechanism. The note "Can models reason without generating visible thinking tokens?" describes depth-recurrent architectures that iterate in latent space at inference time. LTMs use latent vectors differently: as sequence-level abstractions that guide token generation rather than as per-token iterative computation. The dual-rate learning provides a training-time mechanism that depth-recurrence does not.

The Titans parallel is also notable. As covered in "Can neural memory modules scale language models beyond attention limits?", Titans separates fast attention (short-term) from slow memory (long-term), while LTMs separate fast local adaptation from slow global learning. Both architectures implement the fast-slow cognitive distinction, but at different levels: Titans for memory, LTMs for generation.

Related concepts in this collection (4)

Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
different latent approach: per-token iterative computation vs sequence-level latent vectors
Can neural memory modules scale language models beyond attention limits? Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans implements fast-slow at memory level; LTMs implement it at generation level
Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
LTMs demonstrate this: model size can be traded for inference steps
Can computational power accelerate scientific discovery itself? Does the pace of research breakthroughs scale with computing resources, like model performance does? ASI-ARCH tested this by running thousands of autonomous experiments to discover neural architectures.
LTMs provide new dimensions for architecture search
