Why can't we easily adapt reinforcement learning to diffusion language models?
Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel and in no fixed sequential order. What makes likelihood computation intractable in diffusion, and can we work around it?
The maturation of post-training techniques in autoregressive LLMs — RLHF, RLAIF, GRPO, DPO — has been a major source of capability gain. These methods all rely on the ability to efficiently compute the log-probability of a generated sequence, which is straightforward in AR models because the joint probability factorizes along sequence position. Each token's probability is conditioned on the prior tokens, so the sequence log-probability is just a sum of token log-probabilities computed in a single forward pass.
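As a minimal sketch of this factorization (not any particular library's API), the snippet below sums per-token log-probabilities read off a table of logits. The names `log_softmax`, `sequence_log_prob`, and `step_logits` are hypothetical; `step_logits[t]` is assumed to hold the logits a causal model would emit for position t given the preceding tokens in one teacher-forced forward pass.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(step_logits, tokens):
    """Chain-rule log-probability: sum_t log p(x_t | x_<t).

    Assumption: step_logits[t] was computed conditioned on tokens[:t],
    as a causal model does for every position in a single forward pass.
    """
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, tokens)
    )


# Toy usage: three positions, two-token vocabulary, uniform logits.
toy_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_tokens = [0, 1, 0]
toy_log_prob = sequence_log_prob(toy_logits, toy_tokens)
```

Because the result is a single scalar per sequence, a policy-gradient or preference loss can weight it directly by a reward or advantage.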
In diffusion language models (DLMs), this factorization breaks. Generation is iterative and non-sequential: tokens are denoised in parallel, with masked positions revealed across multiple steps in arbitrary order. The log-likelihood of a final sequence is no longer a simple sum but a marginalization over every denoising trajectory that could have produced it, which is intractable to compute exactly. This creates a significant technical barrier to applying the mature suite of RL algorithms developed for AR models to DLMs. The constraint travels with the AR factorization rather than with reasoning itself, which is why the related note "Does autoregressive generation uniquely enable LLM scaling?" reframes which AR coupling is actually contingent.
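To make the marginalization concrete, here is a toy sketch under strong simplifying assumptions: one token is revealed per step, reveal orders are drawn uniformly, and `toy_conditional` is a hypothetical stand-in for a denoiser (it is not a real model). The exact log-likelihood requires summing over all n! reveal orders, while averaging path log-probabilities gives only a lower bound, by Jensen's inequality.

```python
import itertools
import math

VOCAB = 3  # hypothetical toy vocabulary size


def toy_conditional(pos, token, revealed):
    """Hypothetical denoiser: p(x_pos = token | already revealed tokens).

    Left and right context act differently, so the product of
    conditionals depends on the order in which positions are revealed.
    """
    scores = [1.0] * VOCAB
    for p, t in revealed.items():
        if p < pos:
            scores[(t + 1) % VOCAB] += 1.0  # left context favours the "next" token
        else:
            scores[t] += 0.5  # right context favours a copy
    return scores[token] / sum(scores)


def path_log_prob(x, order):
    """Log-probability of producing x along one specific reveal order."""
    revealed = {}
    lp = 0.0
    for pos in order:
        lp += math.log(toy_conditional(pos, x[pos], revealed))
        revealed[pos] = x[pos]
    return lp


def compare(x):
    """Return (exact log-likelihood, order-averaged bound, number of orders)."""
    orders = list(itertools.permutations(range(len(x))))
    path_lps = [path_log_prob(x, o) for o in orders]
    exact = math.log(sum(math.exp(lp) for lp in path_lps) / len(orders))
    bound = sum(path_lps) / len(orders)  # E[log p] <= log E[p]
    return exact, bound, len(orders)


exact, bound, n_orders = compare([0, 1, 2])
```

Even in this restricted setting the number of orders grows factorially (20 tokens already give roughly 2.4e18 orders), and allowing several tokens to be revealed per step enlarges the trajectory space further. This is why practical methods fall back on bounds or sampled estimates instead of the exact sum.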
The literature has converged on three streams of workaround. First, parallelizing the reasoning chain: Diffusion-of-Thought (DoT) reformulates CoT for parallel diffusion by treating reasoning steps as intermediate thoughts refined throughout the denoising process, with scheduled and coupled sampling for self-correction. Second, adapting policy gradient methods: GRPO variants for DLMs often rely on outcome rewards for the final answer rather than on per-step likelihoods. Third, adapting preference optimization: DPO variants for DLMs work around the intractable likelihoods.
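The second and third streams can be illustrated with the standard AR-side formulas they start from. The sketch below shows GRPO-style group-relative advantages, which need only outcome rewards, and the DPO loss, which needs sequence log-probabilities. For a DLM those log-probability arguments would have to be estimates or bounds; how to obtain them is exactly the open problem, and this sketch does not take a position on it. The function names and default values (`eps`, `beta`) are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within a group of samples for one prompt.

    Uses only final-answer rewards, so no per-step likelihood is needed
    to compute the advantage itself.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Assumption for DLMs: the four log-probabilities are surrogate
    estimates, since exact values are intractable.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage: four sampled answers, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
neutral_loss = dpo_loss(-5.0, -5.0, -5.0, -5.0)
```

The asymmetry is visible in the signatures: the advantage computation is likelihood-free, whereas any loss that multiplies an advantage by a log-probability, or compares log-probabilities as DPO does, inherits the estimation problem.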
DCoLT (Diffusion Chain of Lateral Thought) is illustrative of what becomes possible once these adaptations exist. Treating each reverse diffusion step as a latent thinking action and optimizing the entire denoising trajectory with outcome-based RL yields gains of +9.8% on GSM8K and +19.5% on HumanEval over base LLaDA, partly through a learned Unmasking Policy Module that selects the token reveal order. The deeper point: DLMs do not lack reasoning capability; they lacked compatible post-training tools. The cognitive style they unlock (lateral, parallel thinking rather than sequential vertical thinking) may be qualitatively different from AR-trained reasoning.
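One way to picture "optimizing the denoising trajectory" is to treat the reveal order itself as a sampled action with a tractable log-probability. The sketch below scores an order with a Plackett-Luce ranking distribution and forms a REINFORCE-style loss from per-step log-probabilities and an outcome advantage. This parameterization is an illustrative assumption and is not taken from this note; `plackett_luce_log_prob`, `trajectory_loss`, and the example scores are hypothetical.

```python
import itertools
import math


def plackett_luce_log_prob(scores, order):
    """Log-probability of revealing positions in `order`.

    At each step the next position is chosen by a softmax over the
    scores of the positions that are still masked.
    """
    remaining = list(range(len(scores)))
    lp = 0.0
    for pos in order:
        m = max(scores[i] for i in remaining)
        log_z = m + math.log(sum(math.exp(scores[i] - m) for i in remaining))
        lp += scores[pos] - log_z
        remaining.remove(pos)
    return lp


def trajectory_loss(step_log_probs, advantage):
    """REINFORCE-style surrogate: -advantage * sum of step log-probs.

    Each entry of step_log_probs is assumed to be the log-probability of
    the action taken at one denoising step (which positions to reveal
    and which tokens to write), so the sum is tractable even though the
    marginal likelihood of the final text is not.
    """
    return -advantage * sum(step_log_probs)


# Toy usage: three masked positions with hypothetical policy scores.
scores = [0.2, 1.5, -0.3]
order_lp = plackett_luce_log_prob(scores, (1, 0, 2))
loss = trajectory_loss([order_lp, -0.7], advantage=1.0)
```

The design point is that the policy gradient is taken with respect to the sampled trajectory, not the marginal likelihood of the final sequence, which sidesteps the intractable sum over trajectories.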
Inquiring lines that use this note as a source (15)
This note is a source for these synthesized inquiries.
- Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
- Why do autoregressive models fail at controlling syntactic structure and semantic content?
- Can diffusion models condition on right context natively without special training for infilling?
- How can diffusion models predict future tokens without completing prior blocks?
- What makes asymmetric distillation effective for converting pretrained diffusion models?
- Why do hybrid paradigms outperform pure autoregressive or pure diffusion approaches?
- What makes diffusion sampling preserve multiple optimal solutions better than alternatives?
- What structural differences between diffusion and autoregressive models enable bidirectional prompting?
- Do diffusion language models learn differently than autoregressive models?
- Can diffusion language models match autoregressive inference speed in practice?
- Can diffusion models perform infilling and reverse generation as naturally as forward generation?
- Why is reinforcement learning harder to apply to diffusion language models?
- What makes the embers of autoregression framework predictive?
- Why do diffusion models fail at inherently sequential problems?
- How does selective looping in diffusion models differ from recurrence in autoregressive architectures?
Related concepts in this collection (5)
- Does autoregressive generation uniquely enable LLM scaling?
  Is the autoregressive factorization truly necessary for LLM scalability, or do other generative principles like diffusion achieve comparable performance? This matters because it shapes which architectural paths deserve investment.
  extends: same paradigm shift; LLaDA shows why DLMs merit RL adaptation while this note shows the technical price
- Can two simple techniques match complex RL algorithms?
  Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
  complements: the AR-side PPO/GRPO innovations the diffusion stream cannot directly inherit and must rebuild
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  extends: if RLVR is sampling-efficiency on AR, the DLM RL question is whether equivalent boundaries even exist under non-sequential decoding
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  tension: serial-scaling implies DLM lateral thinking faces a hard ceiling on inherently sequential problems regardless of RL adaptation
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  tension: DCoLT's outcome-based RL on denoising trajectory makes "when to think" into token-reveal-order rather than deployment timing
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Survey on Diffusion Language Models
- Large Language Diffusion Models
- Diffusion-LM Improves Controllable Text Generation
- Looped Diffusion Language Models
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
- Learn from your own latents and not from tokens: A sample-complexity theory
- Diffusion Language Models Know the Answer Before Decoding
Original note title
applying RL to diffusion language models is hard because parallel non-sequential generation makes log-likelihood intractable — the technical barrier that blocks adapting GRPO and DPO