Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel and in non-sequential order. What makes likelihood computation intractable in diffusion models, and can we work around it?

Synthesis note · 2026-05-03 · sourced from Diffusion LLM

The maturation of post-training techniques in autoregressive (AR) LLMs, including RLHF, RLAIF, GRPO, and DPO, has been a major source of capability gain. These methods all rely on efficiently computing the log-probability of a generated sequence, which is straightforward in AR models because the joint probability factorizes along sequence position. Each token's probability is conditioned on the preceding tokens, so the sequence log-probability is just a sum of token log-probabilities computed in a single forward pass.
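As a minimal sketch of the sum described above, the snippet below scores one sequence from a matrix of per-position logits. The function names and the random toy logits are illustrative assumptions, not part of any particular library; a real causal model would produce the logits in one forward pass.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(logits, tokens):
    """Sum of per-token log-probabilities for one sequence.

    logits: (T, V) array; row t is assumed to hold the model's scores for
            token t given tokens 0..t-1, as a causal model returns in one pass.
    tokens: length-T integer array of the tokens actually generated.
    """
    log_probs = log_softmax(logits)                               # (T, V)
    token_log_probs = log_probs[np.arange(len(tokens)), tokens]   # (T,)
    return token_log_probs.sum()

# Toy stand-in for a model's output: 4 positions, vocabulary of 5.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 5))
toy_tokens = np.array([1, 3, 0, 2])
seq_lp = sequence_log_prob(toy_logits, toy_tokens)
```

This scalar is the quantity that policy-gradient and preference objectives differentiate in the AR setting.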

In diffusion language models (DLMs), this factorization breaks. Generation is iterative and non-sequential: tokens are denoised in parallel, with masked positions revealed across multiple steps in arbitrary order. The log-likelihood of a final sequence is no longer a simple sum but a marginalization over every denoising trajectory that could have produced it, which is intractable. This creates a significant technical barrier to applying the mature suite of RL algorithms developed for AR models to DLMs. The constraint travels with the AR factorization rather than with reasoning itself, which is why the related note "Does autoregressive generation uniquely enable LLM scaling?" reframes which properties coupled to AR generation are actually contingent.
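The contrast can be written out explicitly. The notation here is an assumption chosen for illustration: $x_0$ is the final sequence of length $L$, $x_T$ is the fully masked starting state, and $x_{1:T}$ are the intermediate partially masked states.

```latex
% Autoregressive: one term per position, all available from a single forward pass.
\log p_\theta(x_0) = \sum_{i=1}^{L} \log p_\theta\!\left(x_0^{i} \mid x_0^{<i}\right)

% Diffusion: the final sequence is reached through latent intermediate states,
% so its likelihood sums over every trajectory that ends in x_0.
p_\theta(x_0) = \sum_{x_{1:T}} p(x_T) \prod_{t=1}^{T} p_\theta\!\left(x_{t-1} \mid x_t\right)
```

The number of terms in the second sum grows combinatorially with sequence length and step count, which is why it cannot be evaluated exactly for realistic sequences.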

The literature has converged on three streams of workaround. First, parallelizing the reasoning chain: Diffusion-of-Thought (DoT) reformulates chain-of-thought (CoT) for parallel diffusion by treating reasoning steps as intermediate thoughts refined throughout the denoising process, with scheduled and coupled sampling for self-correction. Second, adapting policy gradient methods: GRPO variants have been introduced for DLMs, often relying on outcome rewards for the final answer rather than on per-step likelihoods. Third, adapting preference optimization: DPO variants for DLMs work around the intractable likelihoods.
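To make the outcome-reward idea in the second stream concrete, here is a minimal sketch of the group-relative advantage that GRPO-style methods compute from final-answer rewards alone. The function name and the example rewards are hypothetical, and the note does not specify how any given DLM variant pairs these advantages with a likelihood surrogate.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled completions.

    rewards: one scalar per completion sampled for the same prompt
             (for example 1.0 if the final answer is correct, else 0.0).
    Returns rewards centered on the group mean and scaled by the group std.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four hypothetical completions for one prompt: two correct, two incorrect.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantage depends only on the final reward, no per-token likelihood is needed at this stage; the likelihood (or an approximation of it) only enters when the advantage weights the policy update.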

DCoLT (Diffusion Chain of Lateral Thought) illustrates what becomes possible once these adaptations exist. It treats each reverse diffusion step as a latent thinking action and optimizes the entire denoising trajectory with outcome-based RL, yielding +9.8% on GSM8K and +19.5% on HumanEval over base LLaDA, partly through a learned Unmasking Policy Module that selects the token reveal order. The deeper point: DLMs do not lack reasoning capability; they lacked compatible post-training tools. The cognitive style they unlock (lateral, parallel thinking rather than sequential, vertical thinking) may be qualitatively different from AR-trained reasoning.
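The following toy rollout sketches why treating denoising steps as actions helps: the log-probability of one sampled trajectory is a plain sum over its steps, even though the marginal over all trajectories is not tractable. This is not DCoLT's implementation. The position scores, token logits, one-reveal-per-step schedule, and reward value are all hypothetical stand-ins.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax for a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def rollout(position_scores, token_logits, rng):
    """Unmask one position per step; return tokens and the trajectory log-prob.

    position_scores: (L,) hypothetical scores from an unmasking policy.
    token_logits:    (L, V) hypothetical token logits, held fixed here; a real
                     denoiser would recompute them after every reveal.
    """
    length = len(position_scores)
    masked = np.ones(length, dtype=bool)
    tokens = np.full(length, -1)
    traj_log_prob = 0.0
    for _ in range(length):
        # Action 1: choose which still-masked position to reveal.
        idx = np.flatnonzero(masked)
        pos_lp = log_softmax(position_scores[idx])
        k = rng.choice(len(idx), p=np.exp(pos_lp))
        pos = idx[k]
        # Action 2: choose the token written at that position.
        tok_lp = log_softmax(token_logits[pos])
        tok = rng.choice(len(tok_lp), p=np.exp(tok_lp))
        # The trajectory log-prob is the sum of both action log-probs.
        traj_log_prob += pos_lp[k] + tok_lp[tok]
        tokens[pos] = tok
        masked[pos] = False
    return tokens, traj_log_prob

rng = np.random.default_rng(0)
tokens, traj_lp = rollout(rng.normal(size=6), rng.normal(size=(6, 4)), rng)
reward = 1.0                  # hypothetical outcome reward on the final answer
loss = -reward * traj_lp      # REINFORCE-style surrogate for this one sample
```

In a trainable version the scores and logits would come from differentiable networks, so the gradient of `loss` would update both the reveal-order policy and the token predictor.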

Related concepts in this collection 5

Does autoregressive generation uniquely enable LLM scaling? Is the autoregressive factorization truly necessary for LLM scalability, or do other generative principles like diffusion achieve comparable performance? This matters because it shapes which architectural paths deserve investment.
extends: same paradigm shift; LLaDA shows why DLMs merit RL adaptation while this note shows the technical price
Can two simple techniques match complex RL algorithms? Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
complements: the AR-side PPO/GRPO innovations the diffusion stream cannot directly inherit and must rebuild
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
extends: if RLVR mainly improves sampling efficiency in AR models, the question for DLM RL is whether equivalent capability boundaries even exist under non-sequential decoding
Can parallel architectures solve inherently sequential problems? Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
tension: serial-scaling implies DLM lateral thinking faces a hard ceiling on inherently sequential problems regardless of RL adaptation
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
tension: DCoLT's outcome-based RL over the denoising trajectory turns "when to think" into a question of token reveal order rather than deployment timing
