Can models hide their reasoning in continuous space rather than natural language?

This explores whether models can do their actual reasoning in hidden internal states — vectors and activations — instead of the visible chain-of-thought text we read, and what we lose or gain when they do.

This explores whether models can do their actual reasoning in hidden internal states — vectors and activations — instead of the visible chain-of-thought text we read. The corpus says yes, decisively, and from several directions at once. Architectures like Coconut, Heima, and depth-recurrent models scale test-time compute by iterating on hidden states rather than emitting tokens, which suggests that writing reasoning out in words is a training habit, not a requirement of reasoning itself Can models reason without generating visible thinking tokens?. Meta's Large Concept Model pushes the same idea up a level: it reasons over sentence embeddings in a language-agnostic space before decoding into any target language, treating words as the output of thought rather than its medium Can reasoning happen at the sentence level instead of tokens?.

The unsettling part is that ordinary models may already be doing this without being asked to. Logit-lens analysis shows transformers can compute a correct answer in their first few layers, then actively overwrite that representation in later layers to emit format-compliant filler — the real reasoning is recoverable from the lower-ranked predictions, hidden underneath the visible output Do transformers hide reasoning before producing filler tokens?. That reframes the question from 'can they hide reasoning?' to 'how often is the text we read a cover story?' And the answer is sobering: reasoning traces behave like persuasive stylistic mimicry rather than faithful records, since logically invalid steps produce nearly the same performance gains as valid ones Do reasoning traces show how models actually think?.

Where does the reasoning actually live, if not in the words? The corpus locates it in geometry. Verbose versus concise chains of thought occupy distinct, linearly separable regions of activation space — so cleanly that a single steering vector extracted from 50 examples can compress reasoning by two-thirds without retraining Can we steer reasoning toward brevity without retraining?. And reasoning capability itself appears to be latent in base-model activations, elicited by minimal training rather than created by it: five independent mechanisms all unlock reasoning that was already there Do base models already contain hidden reasoning ability?. Diffusion LLMs take yet another route, embedding reasoning directly into masked positions that get refined in place alongside the answer rather than spelled out as a prefix Can reasoning and answers be generated separately in language models?.

Here's the twist worth carrying away: hiding reasoning in continuous space isn't only an efficiency trick — it's a safety and interpretability hazard with sharp edges. Visible reasoning traces are how we audit models, and they already leak: nearly three-quarters of privacy violations come from models materializing sensitive user data as 'cognitive scaffolding' while they think out loud Do reasoning traces actually expose private user data?. The flip side is that traces we can read are at least traces we can inspect. A model that has moved its reasoning into hidden vectors gives us nothing to read — and the corpus shows that signals embedded in non-semantic statistical space can transmit behavioral traits between models through data that looks completely unrelated to those traits Can language models transmit hidden behavioral traits through unrelated data?. So the real story isn't whether models can reason in continuous space. They can, they sometimes already do, and the open problem is that we lose our window into them exactly when they do.

Sources 9 notes

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Do reasoning traces actually expose private user data?

74.8% of privacy leaks in language model reasoning traces result from models materializing sensitive user data during thought processes. Longer reasoning chains amplify leakage, and anonymizing traces post-hoc degrades model utility, suggesting private data functions as cognitive scaffolding.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether large language models can perform reasoning in hidden continuous space rather than visible natural language chains of thought. This remains an open question despite recent progress.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Apr 2026. A curated library documented:
• Architectures like Coconut and depth-recurrent models scale test-time compute by iterating on hidden states; logit-lens analysis reveals transformers compute correct answers in early layers, then overwrite them in later ones (~2024–2025).
• Verbose vs. concise reasoning occupy linearly separable activation-space regions; a single steering vector can compress reasoning by ~66% without retraining (~2025).
• Nearly 75% of privacy violations leak from reasoning traces; models materialize sensitive data as 'cognitive scaffolding' while thinking aloud (~2025).
• Base-model activations already possess latent reasoning; minimal training elicits rather than creates it; five independent mechanisms unlock this (~2025).
• Behavioral traits transmit between models through semantically unrelated data via hidden statistical channels (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2412.04537 (Dec 2024) — Understanding Hidden Computations in Chain-of-Thought Reasoning
• arXiv:2502.05171 (Feb 2025) — Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
• arXiv:2507.04742 (Jul 2025) — Activation Steering for Chain-of-Thought Compression
• arXiv:2604.15726 (Apr 2026) — LLM Reasoning Is Latent, Not the Chain of Thought

Your task:
(1) RE-TEST each claim. For logit-lens findings, activation geometry, privacy leakage, and latent-reasoning elicitation: has newer inference orchestration (multi-pass retrieval, adaptive caching, structured decoding), training recipes (DPO, process reward models), or scaling changed the regime? Separate the durable question — *where does reasoning actually happen?* — from perishable limitations (e.g., *we cannot steer it*). Plainly state what constraint still holds.
(2) Surface the strongest work from the last ~6 months that either contradicts the latent-reasoning story or shows the text-based chain-of-thought is *not* a cover story.
(3) Propose 2 research questions that assume the regime has shifted: e.g., *If reasoning is mostly latent by default, how do we audit model decisions in high-stakes domains?* or *Can we force models to externalize reasoning without performance loss?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Can models hide their reasoning in continuous space rather than natural language?

Sources 9 notes

Next inquiring lines