SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel and in no fixed order. What makes likelihood computation intractable in diffusion, and can we work around it?

Synthesis note · 2026-05-03 · sourced from Diffusion LLM

The maturation of post-training techniques in autoregressive LLMs — RLHF, RLAIF, GRPO, DPO — has been a major source of capability gain. These methods all rely on the ability to efficiently compute the log-probability of a generated sequence, which is straightforward in AR models because the joint probability factorizes along sequence position. Each token's probability is conditioned on the prior tokens, so the sequence log-probability is just a sum of token log-probabilities computed in a single forward pass.
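The factorization described above can be made concrete with a minimal sketch. The bigram table `NEXT_TOKEN_PROBS` below is a hypothetical stand-in for a model (a real AR LLM conditions on the whole prefix, not just the previous token), but the structure of the computation is the same: one left-to-right pass yields every token log-probability, and the sequence log-probability is their sum.

```python
import math
from itertools import product

VOCAB = ["a", "b"]
BOS = "<s>"

# Hypothetical toy "model": next-token probabilities given the previous token only.
# An actual AR LLM would condition on the full prefix.
NEXT_TOKEN_PROBS = {
    BOS: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.3, "b": 0.7},
    "b": {"a": 0.8, "b": 0.2},
}

def token_log_probs(seq):
    """Return log p(x_t | x_<t) for each position in one left-to-right pass."""
    prev = BOS
    out = []
    for tok in seq:
        out.append(math.log(NEXT_TOKEN_PROBS[prev][tok]))
        prev = tok
    return out

def sequence_log_prob(seq):
    """Sequence log-probability is the sum of the token log-probabilities."""
    return sum(token_log_probs(seq))

sample = ["a", "b", "b"]
sample_log_prob = sequence_log_prob(sample)  # log(0.6) + log(0.7) + log(0.2)

# Sanity check: the factorized probabilities of all length-3 sequences sum to 1.
total_mass = sum(
    math.exp(sequence_log_prob(list(s))) for s in product(VOCAB, repeat=3)
)
print(sample_log_prob, total_mass)
```

Because each term is available from a single forward pass, policy-gradient and preference losses can use `sequence_log_prob` directly, which is the property the RL methods named above depend on.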

In diffusion language models (DLMs), this factorization breaks. Generation is iterative and non-sequential: tokens are denoised in parallel, with masked positions revealed across multiple steps in arbitrary order. The log-likelihood of a final sequence is no longer a simple sum but a marginalization over every denoising trajectory that could have produced it, which is intractable for realistic sequence lengths. This creates a significant technical barrier to applying the mature suite of RL algorithms developed for AR models to DLMs. The constraint travels with the AR factorization rather than with reasoning itself, which is why the related note "Does autoregressive generation uniquely enable LLM scaling?" reframes which AR couplings are actually contingent.
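A toy sketch can show where the marginalization comes from. It assumes a deliberately simplified process that is not any particular DLM: binary tokens, one position unmasked per step, and the unmasking order drawn uniformly at random. The denoiser `p_one` is hypothetical. Under these assumptions the exact likelihood of a sequence is an average over all n! orders, and each order assigns the same sequence a different probability.

```python
import math
from itertools import permutations, product

def p_one(revealed):
    """Hypothetical toy denoiser: P(token = 1) for a masked position,
    given the dict {position: token} of tokens revealed so far."""
    ones = sum(revealed.values())
    zeros = len(revealed) - ones
    return 1.0 / (1.0 + math.exp(-(0.5 + 1.5 * (ones - zeros))))

def path_prob(x, order):
    """Probability of producing x when positions are unmasked in `order`."""
    revealed = {}
    prob = 1.0
    for pos in order:
        p1 = p_one(revealed)
        prob *= p1 if x[pos] == 1 else 1.0 - p1
        revealed[pos] = x[pos]
    return prob

def exact_likelihood(x):
    """Marginalize over every unmasking order (uniform prior): n! terms."""
    orders = list(permutations(range(len(x))))
    return sum(path_prob(x, o) for o in orders) / len(orders)

def order_averaged_log_prob(x):
    """Average per-order log-probability: a lower bound on the
    log-likelihood by Jensen's inequality."""
    orders = list(permutations(range(len(x))))
    return sum(math.log(path_prob(x, o)) for o in orders) / len(orders)

x = (1, 0, 1)
per_order = {o: path_prob(x, o) for o in permutations(range(len(x)))}
print(per_order)                       # the same x, a different probability per order
print(math.log(exact_likelihood(x)))   # exact, but needs all n! orders
print(order_averaged_log_prob(x))      # a bound that can be estimated by sampling orders

# The number of orders grows factorially with sequence length.
order_counts = {n: math.factorial(n) for n in (4, 8, 16)}
print(order_counts)
```

The order-averaged quantity is only a bound on the true log-likelihood, and a sampled estimate of it is noisy. Real DLMs also reveal several tokens per step, which enlarges the trajectory space further than this sketch suggests.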

The literature has converged on three streams of workaround. First, parallelizing the reasoning chain: Diffusion-of-Thought (DoT) reformulates chain-of-thought (CoT) for parallel diffusion by treating reasoning steps as intermediate thoughts refined throughout the denoising process, with scheduled and coupled sampling for self-correction. Second, adapting policy gradient methods: GRPO variants have been introduced for DLMs, often relying on outcome rewards on the final answer rather than per-step likelihoods. Third, adapting preference optimization: DPO variants for DLMs work around the intractable likelihoods.
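To make the second and third streams concrete, here is a minimal sketch of the two standard quantities involved: the group-relative advantage used by GRPO and the DPO loss. The note does not say how specific DLM variants replace the exact sequence log-probability, so the `logp_*` arguments below are assumed to be some tractable surrogate (for example a sampled likelihood bound). That substitution is an assumption of this sketch, not a description of any published method.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize outcome rewards within one group of
    completions sampled for the same prompt. No per-step likelihood is needed
    to compute the advantages themselves."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)). For a DLM the four
    log-probabilities would have to be surrogates (an assumption here)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

# Outcome rewards on the final answer for four sampled completions.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantage, wrong ones negative

# Hypothetical surrogate log-probabilities for a preferred and a rejected answer.
loss_before = dpo_loss(-12.0, -11.0, ref_chosen=-12.0, ref_rejected=-11.0)
loss_after = dpo_loss(-10.0, -13.0, ref_chosen=-12.0, ref_rejected=-11.0)
print(loss_before, loss_after)
```

The advantages depend only on final-answer rewards, which is why outcome-based formulations are attractive here. The likelihood problem reappears only in the term that weights each advantage, and in every argument of the DPO loss.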

DCoLT (Diffusion Chain of Lateral Thought) illustrates what becomes possible once these adaptations exist. It treats each reverse diffusion step as a latent thinking action and optimizes the entire denoising trajectory with outcome-based RL, producing +9.8% on GSM8K and +19.5% on HumanEval over base LLaDA, partly through a learned Unmasking Policy Module that selects the token reveal order. The deeper point: DLMs do not lack reasoning capability; they lacked compatible post-training tools. The cognitive style they unlock (lateral, parallel thinking rather than sequential, vertical thinking) may be qualitatively different from AR-trained reasoning.
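The idea of treating each denoising step as an action can be sketched as follows. This is not DCoLT's implementation: `unmask_policy`, `token_policy`, and `outcome_reward` are hypothetical toy functions, tokens are binary, and one position is revealed per step. The point it illustrates is that the log-probability of the trajectory that was actually sampled is a plain sum over steps, even though the marginal likelihood of the final sequence is not.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def unmask_policy(state):
    """Hypothetical unmasking policy: a distribution over masked positions."""
    masked = [i for i, t in enumerate(state) if t is None]
    probs = softmax([-0.3 * i for i in masked])  # toy preference for early positions
    return masked, probs

def token_policy(state, pos):
    """Hypothetical token predictor: P(token = 1) given the revealed tokens.
    `pos` is unused in this toy but kept to show the intended interface."""
    ones = sum(1 for t in state if t == 1)
    return 1.0 / (1.0 + math.exp(-(0.2 + 0.8 * ones)))

def sample_trajectory(length, rng):
    """Denoise from all-masked to fully revealed, accumulating the
    log-probability of every action taken along the way."""
    state = [None] * length
    log_prob = 0.0
    steps = []
    while None in state:
        masked, probs = unmask_policy(state)
        k = rng.choices(range(len(masked)), weights=probs)[0]
        pos = masked[k]
        p1 = token_policy(state, pos)
        tok = 1 if rng.random() < p1 else 0
        log_prob += math.log(probs[k]) + math.log(p1 if tok == 1 else 1.0 - p1)
        state[pos] = tok
        steps.append((pos, tok))
    return state, steps, log_prob

def outcome_reward(seq):
    """Hypothetical verifier: reward 1 if the final sequence has even parity."""
    return 1.0 if sum(seq) % 2 == 0 else 0.0

rng = random.Random(0)
group = [sample_trajectory(4, rng) for _ in range(8)]
rewards = [outcome_reward(seq) for seq, _, _ in group]
baseline = sum(rewards) / len(rewards)

# REINFORCE-style surrogate: minimizing it raises the log-probability of
# trajectories that scored above the group baseline.
surrogate = -sum(
    (r - baseline) * lp for r, (_, _, lp) in zip(rewards, group)
) / len(group)
print(rewards, surrogate)
```

In a real training loop the log-probabilities would be differentiable tensors and `surrogate` would be backpropagated; here it is only a number, since the sketch uses the standard library alone.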

Original note title

applying RL to diffusion language models is hard because parallel non-sequential generation makes log-likelihood intractable — the technical barrier that blocks adapting GRPO and DPO