Long-Context Autoregressive Video Modeling
with Next-Frame Prediction

ShowLab, National University of Singapore

📖TL;DR: FAR (i.e., Frame AutoRegressive Models) is a new baseline for autoregressive video generation, and achieves state-of-the-art performance on both short- and long-context video modeling.

Abstract

Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16× longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
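
A minimal sketch of the FlexRoPE idea (illustrative only; the exact formulation is in the paper): one way to add a flexible temporal decay on top of RoPE is a distance-dependent bias on the attention logits whose slope is a test-time knob. The function name and `decay_rate` parameter below are hypothetical.

```python
import torch

def flex_temporal_decay_bias(num_frames: int, tokens_per_frame: int,
                             decay_rate: float) -> torch.Tensor:
    """Hypothetical illustration (not the exact FlexRoPE formulation): a bias that
    decays attention logits with temporal distance, applied on top of standard RoPE."""
    # Frame index of every token in the flattened video sequence.
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Temporal distance (in frames) between each query/key token pair.
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs().float()
    # Linear decay with temporal distance; decay_rate can be adjusted at test time,
    # so longer contexts can use a gentler slope without retraining.
    return -decay_rate * dist

# Usage (assumed shapes): add the bias to RoPE-rotated attention logits before softmax.
# logits = (q_rope @ k_rope.transpose(-2, -1)) / q_rope.shape[-1] ** 0.5
# logits = logits + flex_temporal_decay_bias(T, N, decay_rate=0.05)
```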

What is the potential of FAR compared to video diffusion transformers?

1. Better Convergence:
  • FAR requires the same training cost as video diffusion transformers.
  • FAR achieves better convergence than video diffusion transformers with the same latent space.

2. Native Support for Vision Context:
  • Video diffusion transformers: require additional image-to-video fine-tuning to exploit image conditions.
  • FAR: provides native support for clean vision context at various lengths, achieving state-of-the-art performance in video generation (context frame = 0) and video prediction (context frame ≥ 1).

3. Test-time Temporal Extrapolation:
  • Video diffusion transformers: are usually unable to generate sequences longer than their training window.
  • FAR: supports 16× longer test-time temporal extrapolation without fine-tuning on long video sequences.

4. Efficient Training/Inference on Long Video Sequences:
  • Video diffusion transformers: cannot efficiently train/infer on long videos because the vision token length scales rapidly with the number of frames.
  • FAR: exploits long short-term context modeling to reduce redundant token length during training and fine-tuning on long videos. During inference, KV-cache is supported to ensure efficiency (see the sketch below).
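
The sketch below illustrates the KV-cache point only; `encode_context` and `denoise_frame` are hypothetical method names standing in for the actual interface, not the released code.

```python
import torch

@torch.no_grad()
def generate_frames(far_model, context_frames, num_new_frames):
    """Hypothetical frame-autoregressive sampling loop with a frame-level KV cache.
    far_model.encode_context / far_model.denoise_frame are assumed placeholders."""
    kv_cache = [far_model.encode_context(f) for f in context_frames]  # observed frames
    frames = list(context_frames)

    for _ in range(num_new_frames):
        # Sample the next frame with the frame-wise sampler, attending only to the
        # cached keys/values of previous frames instead of re-encoding the whole video.
        new_frame = far_model.denoise_frame(kv_cache)
        frames.append(new_frame)
        kv_cache.append(far_model.encode_context(new_frame))  # extend the cache

    return torch.stack(frames)
```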

Pipeline

FAR is trained using a frame-wise flow-matching objective with autoregressive contexts. The attention mask in FAR 👇 preserves causality between frames while allowing full attention within each frame.
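
A minimal sketch of such a frame-wise causal mask (illustrative, not the training code), assuming T frames of N latent tokens each, flattened into one sequence:

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed): full attention within a
    frame, causal attention across frames."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # A query token may attend to any key token whose frame index is <= its own,
    # i.e., its own frame and all earlier frames, but never later frames.
    return frame_idx[None, :] <= frame_idx[:, None]

# Example: 3 frames x 2 tokens per frame -> a 6x6 block lower-triangular mask.
mask = frame_causal_mask(3, 2)
```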


Key Techniques:
  1. Stochastic Clean Context: In training, we stochastically replace a portion of the noisy context with clean context frames and use timesteps outside the diffusion scheduler's range (e.g., -1) to mark them. This strategy bridges the gap between the contexts observed in training and inference without incurring additional training cost (see the first sketch after this list).
  2. Long Short-Term Context Modeling: When training/fine-tuning on long video sequences, we maintain a high-resolution short-term context window to model fine-grained temporal consistency, and an unlimited long-term context window that encodes remote frames with aggressive patchification to reduce redundant tokens (see the second sketch after this list).
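
A rough sketch of stochastic clean context (tensor shapes, the clean-context probability, and keeping the predicted frame noisy are assumptions, not the paper's exact recipe):

```python
import torch

def apply_stochastic_clean_context(clean, noisy, timesteps, clean_prob=0.3):
    """Hypothetical sketch. clean/noisy: (B, T, C, H, W) latent frames before/after
    noising; timesteps: (B, T) per-frame timesteps from the flow-matching scheduler."""
    B, T = timesteps.shape
    # Randomly keep a portion of the context frames clean during training.
    keep_clean = torch.rand(B, T, device=timesteps.device) < clean_prob
    keep_clean[:, -1] = False  # assumption: the frame being predicted stays noisy

    inputs = torch.where(keep_clean[..., None, None, None], clean, noisy)
    # Timestep -1 lies outside the scheduler's range and flags "clean context",
    # matching the clean observed frames the model sees at inference time.
    timesteps = torch.where(keep_clean, torch.full_like(timesteps, -1), timesteps)
    return inputs, timesteps
```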
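
And a rough sketch of long short-term context modeling (the window size, patch factor, and the use of average pooling as a stand-in for aggressive patchification are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def build_long_short_context(latents, short_window=8, long_patch=4):
    """Hypothetical sketch. latents: (B, T, C, H, W) latent frames. Recent frames
    keep full resolution; older frames are patchified more aggressively so they
    contribute far fewer tokens."""
    B, T, C, H, W = latents.shape
    short = latents[:, -short_window:]   # fine-grained short-term context
    long = latents[:, :-short_window]    # unlimited long-term context

    if long.shape[1] > 0:
        # Aggressive patchification: pool spatially so each remote frame yields
        # (H / long_patch) * (W / long_patch) tokens instead of H * W.
        long = F.avg_pool2d(long.flatten(0, 1), kernel_size=long_patch)
        long = long.view(B, -1, C, H // long_patch, W // long_patch)

    return long, short
```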

Main Results

1. Compared to previous methods, FAR effectively exploits the provided context frames (annotated with red boxes) and maintains consistency over long predictions (e.g., in 3D structure and wall patterns).

(red boxes: observed context frames; left: prediction; right: ground truth)


2. FAR achieves state-of-the-art performance on unconditional/conditional video generation, short-video prediction, and long-video prediction.


3. FAR demonstrates strong test-time temporal extrapolation performance on both periodic and non-periodic motion.

(16× extrapolation: trained on 16 frames while inferring 256 frames; displayed at 4× speedup)

BibTeX

@article{gu2025long,
      title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
      author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
      journal={arXiv preprint arXiv:2503.19325},
      year={2025}
}