Can generating entire videos at once beat keyframe interpolation?
Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling the gaps, produce more globally coherent motion? This note explores whether pipeline decomposition fundamentally limits motion consistency.
Text-to-video generation is harder than text-to-image because motion is sensitive to error and the added temporal dimension strains memory, compute, and data. The prevalent approach generates distant keyframes first, then fills the gaps with a cascade of temporal super-resolution models. Lumiere identifies an inherent limitation in this design: it cannot learn globally coherent motion, because the keyframe-then-interpolate pipeline never represents the whole temporal trajectory at once.
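The limitation can be illustrated with a toy example. This is a minimal sketch, not Lumiere's experiment or any real video model: it assumes a hypothetical 1-D "motion" (an object's x-position oscillating over 81 frames) and stands in plain linear interpolation for the learned temporal super-resolution cascade. The function name `keyframe_then_interpolate` and all parameters are illustrative assumptions.

```python
import numpy as np

def keyframe_then_interpolate(trajectory, stride):
    """Keep every `stride`-th frame, then linearly interpolate the gaps.

    Linear interpolation is an assumed stand-in for a learned
    temporal super-resolution model.
    """
    t = np.arange(len(trajectory))
    key_t = t[::stride]  # indices of the sparse keyframes
    return np.interp(t, key_t, trajectory[key_t])

# Hypothetical toy motion: x-position oscillating with a 16-frame period.
frames = np.arange(81)
true_x = np.sin(2 * np.pi * frames / 16)

# Keyframes spaced exactly one period apart all land on the same phase,
# so the reconstruction sees no motion at all between them.
recon = keyframe_then_interpolate(true_x, stride=16)
max_error = np.max(np.abs(recon - true_x))
print(f"max reconstruction error: {max_error:.3f}")
```

In this toy setting the keyframes alone carry no evidence of the oscillation, so no gap-filling stage working from them can recover it. A model that sees all 81 frames at once does not face that ambiguity.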
Lumiere's response is architectural: a Space-Time U-Net that generates the entire temporal duration of the video in a single pass, incorporating both spatial and temporal down- and up-sampling modules. Generating the full clip at once — rather than stitching independently-generated keyframes — is what produces coherent motion, and it generalizes to image-to-video, inpainting, and stylized generation.
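To make the idea of joint spatial and temporal down- and up-sampling concrete, here is a minimal shape-level sketch in NumPy. It is not Lumiere's implementation: the real network uses learned layers, whereas this sketch assumes fixed average pooling and nearest-neighbour upsampling, and the names `st_downsample` and `st_upsample`, the factor of 2, and the `(T, H, W, C)` layout are all illustrative assumptions.

```python
import numpy as np

def st_downsample(video, ft=2, fs=2):
    """Average-pool a (T, H, W, C) video by ft in time and fs in space."""
    T, H, W, C = video.shape
    if T % ft or H % fs or W % fs:
        raise ValueError("dimensions must be divisible by the pooling factors")
    # Split each pooled axis into (blocks, block_size) and average the blocks.
    v = video.reshape(T // ft, ft, H // fs, fs, W // fs, fs, C)
    return v.mean(axis=(1, 3, 5))

def st_upsample(video, ft=2, fs=2):
    """Nearest-neighbour upsampling by ft in time and fs in space."""
    v = np.repeat(video, ft, axis=0)  # time
    v = np.repeat(v, fs, axis=1)      # height
    return np.repeat(v, fs, axis=2)   # width

# Hypothetical clip: 16 frames of 32x32 RGB noise.
video = np.random.default_rng(0).normal(size=(16, 32, 32, 3))

coarse = st_downsample(video)                 # shape (8, 16, 16, 3)
coarser = st_downsample(coarse)               # shape (4, 8, 8, 3)
restored = st_upsample(st_upsample(coarser))  # shape (16, 32, 32, 3)
print(coarse.shape, coarser.shape, restored.shape)
```

The point of the sketch is that the whole clip passes through the network as one tensor: compressing time alongside space keeps the full duration within a compute budget, while every stage still covers the entire temporal extent rather than a handful of keyframes.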
The transferable keeper is a generation principle: coherence is a property of generating the whole at once, not of stitching locally-coherent pieces. Cascades that decompose a globally-structured output into independently-generated fragments lose the global structure exactly where it matters (motion, here). This rhymes with the note "Can iterative revision cycles match how humans actually write?": both treat a long structured artifact as something to denoise as a whole rather than assemble piecewise.
Related concepts in this collection 1
- Can iterative revision cycles match how humans actually write?
Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?
shared principle: generate/denoise the whole structured artifact rather than assemble locally-coherent fragments
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Lumiere: A Space-Time Diffusion Model for Video Generation
- Multistep Consistency Models
- Do Language Models Understand Time?
- MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
- Emerging Properties in Unified Multimodal Pretraining
- Consistency Models Made Easy
- Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs
Original note title
generating a video's full duration in a single space-time pass beats keyframe-plus-interpolation for globally coherent motion