Can diffusion models condition on right context natively without special training for infilling?
This explores whether diffusion language models can fill in missing text using both left and right context as a built-in property of how they generate — rather than needing a dedicated infilling objective the way autoregressive (left-to-right) models do.
The corpus suggests the answer is largely yes, and the reason traces back to a single architectural difference: diffusion models refine a whole sequence in parallel rather than predicting one token at a time from left to right. Because every position can attend to every other position (bidirectional attention), the surrounding text on *both* sides is already part of what the model conditions on at each denoising step. Infilling isn't a bolted-on mode; it's what happens when you leave some positions masked and let the model fill them while reading the context around them.
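To make the "leave some positions masked and fill them from both sides" idea concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the tiny corpus, the adjacency-count scorer standing in for a real denoiser, and the "commit the most confident masked position first" schedule are not taken from any of the cited notes.

```python
from collections import Counter

MASK = "<mask>"

# Hypothetical toy corpus, used only to build adjacency counts.
CORPUS = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat slept on the mat".split(),
]

# Count adjacent (left, right) token pairs across the corpus.
PAIRS = Counter((a, b) for sent in CORPUS for a, b in zip(sent, sent[1:]))
VOCAB = sorted({tok for sent in CORPUS for tok in sent})


def score_position(seq, i):
    """Toy stand-in for a denoiser call at position i.

    Scores each vocabulary token using BOTH neighbours when they are
    already known; masked or missing neighbours contribute nothing.
    Returns (best_token, confidence in (0, 1]).
    """
    left = seq[i - 1] if i > 0 else None
    right = seq[i + 1] if i + 1 < len(seq) else None
    scores = {}
    for tok in VOCAB:
        s = 1.0
        if left is not None and left != MASK:
            s *= 1 + PAIRS[(left, tok)]   # evidence from left context
        if right is not None and right != MASK:
            s *= 1 + PAIRS[(tok, right)]  # evidence from right context
        scores[tok] = s
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def infill(seq):
    """Iteratively commit the most confident masked position until none remain."""
    seq = list(seq)
    while MASK in seq:
        candidates = [
            (conf, -i, tok)
            for i in range(len(seq))
            if seq[i] == MASK
            for tok, conf in [score_position(seq, i)]
        ]
        _, neg_i, tok = max(candidates)  # ties go to the leftmost position
        seq[-neg_i] = tok
    return seq


# Left context alone ("the ?") cannot separate "cat" from "mat" here;
# the right neighbour "slept" is what settles it.
print(infill(["the", MASK, "slept", "on", "the", "mat"]))
# ['the', 'cat', 'slept', 'on', 'the', 'mat']
```

The point of the sketch is only that no infilling-specific step exists: the same scoring routine runs at every masked position, and right context enters because the scorer is allowed to look at it.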
The clearest evidence is Diffusion-LM, which succeeds on six fine-grained control tasks, infilling among them, where plug-and-play methods on autoregressive models fail (see "Can diffusion models enable control that autoregressive models cannot reach?"). Its continuous latent variables let gradients flow across the entire sequence at once, replacing the discrete left-to-right bottleneck. Infilling there is one instance of a more general capability: conditioning on global structure that an autoregressive model, committed to its prefix, can't easily reach back into.
The sharpest framing of the contrast comes from in-place prompting, which explicitly names what it removes: the "prefix-only constraint" of autoregressive models (see "Can reasoning and answers be generated separately in language models?"). Because attention is bidirectional, you can embed instructions or reasoning *directly into masked positions inside the sequence* and have them refined alongside the answer. A left-to-right model structurally cannot do this, since it only ever sees what came before. That's the deeper point: "infilling" and "conditioning on right context" are the same thing, and the model does both natively.
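The prefix-only constraint can be seen directly in the attention mask. The sketch below is illustrative only: the helper names are hypothetical, and it shows plain causal versus full masks rather than any specific model's implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to read."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 6      # a six-position sequence
slot = 2   # a masked slot in the middle of it

print(visible_positions(causal_mask(n), slot))         # [0, 1, 2]
print(visible_positions(bidirectional_mask(n), slot))  # [0, 1, 2, 3, 4, 5]
```

Under the causal mask, anything placed at positions 3 to 5 is invisible to the slot at position 2, so right context cannot influence it. Under the full mask it can, which is what lets reasoning and answer positions be refined together.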
The same parallel, non-sequential generation that makes this possible also has costs worth knowing about. It breaks the clean log-likelihood factorization that left-to-right models rely on, which is why adapting reinforcement learning to diffusion LLMs is genuinely hard: likelihoods become intractable and need workarounds (see "Why can't we easily adapt reinforcement learning to diffusion language models?"). There's also an upside in the same mechanism: because the model refines the whole sequence at once, it often "knows" the answer well before decoding finishes. Up to 99% of MMLU instances and 97% of GSM8K instances reach the correct answer by the midpoint of refinement, enabling early-exit speedups (see "Can diffusion models commit to answers before full decoding?").
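As a rough illustration of how an early exit might be triggered, the sketch below stops refinement once the gap between the two most probable answer candidates is large enough. The function names, the threshold value, and the probability trace are all hypothetical; the actual stopping criteria used by the systems in the notes are not specified here.

```python
def confidence_gap(probs):
    """Difference between the highest and second-highest probability."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def decode_with_early_exit(step_probs, threshold=0.5):
    """Walk through per-step answer distributions and stop at the first
    step whose confidence gap reaches the threshold.

    step_probs: non-empty list; each item is a list of probabilities
                over the same answer candidates at one refinement step.
    Returns (index of the last step used, index of the chosen candidate).
    """
    if not step_probs:
        raise ValueError("step_probs must contain at least one step")
    for step, probs in enumerate(step_probs):
        if confidence_gap(probs) >= threshold:
            break  # confident enough: skip the remaining steps
    best = max(range(len(probs)), key=probs.__getitem__)
    return step, best


# Hypothetical trace: the third candidate pulls ahead as refinement proceeds.
TRACE = [
    [0.30, 0.30, 0.40],
    [0.20, 0.20, 0.60],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
]

print(decode_with_early_exit(TRACE, threshold=0.5))  # (2, 2)
```

With this trace the loop stops at step index 2 of 3 and returns the same candidate the full run would have chosen; a stricter threshold simply runs to the end.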
The takeaway: infilling for diffusion models isn't a feature someone trained in; it's the default consequence of seeing both directions at once. The trade is that this same property, which makes right-context conditioning native, is exactly what makes likelihood-based tooling (like standard RL) awkward to port over. The capability and the difficulty come from the same place.
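The likelihood side of that trade can be written out in a few lines. The notation is assumed here for illustration and is not taken from the notes: $x$ is a sequence of length $L$, and $x_0, x_1, \dots, x_T$ are the states of a reverse (denoising) chain that ends in the clean text $x_0$.

```latex
% Autoregressive: the exact log-likelihood is a sum of per-token terms.
\log p_\theta(x) = \sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{<i}\right)

% Diffusion: the likelihood of the clean sequence x_0 marginalizes over
% all intermediate states x_1, ..., x_T of the denoising chain
% (for continuous latents the sum becomes an integral).
p_\theta(x_0) = \sum_{x_{1:T}} p(x_T) \prod_{t=1}^{T} p_\theta\left(x_{t-1} \mid x_t\right)
```

The logarithm of the second expression does not split into per-token terms, so methods that need $\log p_\theta(x)$ exactly typically have to work with a bound or a sampled estimate instead.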
Sources (4 notes)
Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.
Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.