Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling.
Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
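To make the design concrete, the PyTorch sketch below illustrates one way a block-wise temporal SSM scan can be paired with dense attention over a short window of recent frames. It is a minimal illustration rather than our released implementation: the (B, T, H, W, C) token layout, the 4x4 block size, the toy diagonal recurrence standing in for a Mamba-style selective SSM layer, and the newest-frame-query attention are all simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockwiseSSMScan(nn.Module):
    """Scan each spatial block of tokens independently along the time axis."""

    def __init__(self, dim, block=4):
        super().__init__()
        self.block = block
        # Toy diagonal state-space parameters; a real model would use a
        # selective SSM layer (e.g. Mamba) in this position.
        self.decay = nn.Parameter(torch.rand(dim))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, T, H, W, C) video tokens
        B, T, H, W, C = x.shape
        b = self.block                           # assumes H and W are divisible by b
        # Group tokens into b x b spatial blocks; each block gets its own
        # causal temporal scan, which keeps the recurrent state per scan small.
        x = x.view(B, T, H // b, b, W // b, b, C)
        x = x.permute(0, 2, 4, 3, 5, 1, 6)       # (B, Hb, Wb, b, b, T, C)
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        h = torch.zeros_like(u[..., 0, :])
        outs = []
        for t in range(T):                       # causal recurrence over time
            h = a * h + (1 - a) * u[..., t, :]
            outs.append(h)
        y = self.out_proj(torch.stack(outs, dim=-2))
        return y.permute(0, 5, 1, 3, 2, 4, 6).reshape(B, T, H, W, C)

def local_frame_attention(x, window=3):
    """Dense attention from the newest frame's tokens to the last few frames."""
    B, T, H, W, C = x.shape
    q = x[:, -1].reshape(B, H * W, C)
    kv = x[:, -window:].reshape(B, window * H * W, C)
    attn = F.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    return (attn @ kv).view(B, H, W, C)

Scanning each spatial block independently keeps the recurrent state compact enough to carry over very long horizons, while the dense attention over the last few frames restores the short-range spatial coherence that the block-wise scan gives up.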
We train our model on the TECO Minecraft dataset, provide it with 25 context frames, and generate 125 additional frames given a random sequence of actions. For comparison, we include results from DFoT on the right. Context frames are shown with a red border. Our model successfully generates high-fidelity videos over long time horizons.
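The generation loop behind these results follows the standard autoregressive pattern; the sketch below conveys the idea under assumed interfaces. Both the world_model object and its sample_next_frame method are hypothetical names rather than our released API, and for clarity the sketch re-stacks the full history at every step, whereas in practice the SSM state is carried forward so the per-step cost stays roughly constant.

import torch

@torch.no_grad()
def rollout(world_model, context_frames, actions, num_new_frames=125):
    """context_frames: (1, 25, C, H, W); actions: one action tensor per new frame."""
    frames = list(context_frames.unbind(dim=1))           # 25 context frames
    for t in range(num_new_frames):
        history = torch.stack(frames, dim=1)              # (1, len(frames), C, H, W)
        # Hypothetical interface: sample the next frame conditioned on the
        # history and the current action.
        next_frame = world_model.sample_next_frame(history, actions[t])
        frames.append(next_frame)
    return torch.stack(frames[-num_new_frames:], dim=1)   # (1, 125, C, H, W)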
We assess the long-term memory capabilities of our method through two separate tasks. In the spatial reasoning task, the model is given a random agent trajectory and the corresponding observations as context, and is then tasked with reconstructing the observations along a continuation of that trajectory. Assuming the context is long enough for the entire environment to be committed to memory, the model should reconstruct every observation along the continued trajectory.
In the spatial retrieval task, the model is given a random agent trajectory and the corresponding observations as context, and is then tasked with backtracking along the exact sequence of actions to the agent's starting position. Because the scene is static, the generated sequence should mirror the context frames in reverse order.
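The snippet below sketches how the two evaluations could be scored. Here generate_fn stands for any action-conditioned rollout (for example, the rollout sketch above with the model argument bound), and the action-inversion map and the MSE frame-similarity metric are illustrative assumptions rather than the exact benchmark protocol.

import torch
import torch.nn.functional as F

def spatial_retrieval_score(generate_fn, context_frames, context_actions, invert_action):
    """Backtrack the context trajectory: in a static scene, the generated
    frames should mirror the context frames in reverse order."""
    reversed_actions = [invert_action(a) for a in reversed(context_actions)]
    generated = generate_fn(context_frames, reversed_actions, len(reversed_actions))
    targets = torch.flip(context_frames, dims=[1])[:, : generated.shape[1]]
    return F.mse_loss(generated, targets)                  # lower = better memory

def spatial_reasoning_score(generate_fn, context_frames, future_actions, future_frames):
    """Continue a trajectory through an already-observed environment and
    compare against the ground-truth continuation."""
    generated = generate_fn(context_frames, future_actions, future_frames.shape[1])
    return F.mse_loss(generated, future_frames)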
Even over long horizons, our method outperforms relevant baselines on memory-related metrics.
@misc{po2025longcontextstatespacevideoworld,
      title={Long-Context State-Space Video World Models},
      author={Ryan Po and Yotam Nitzan and Richard Zhang and Berlin Chen and Tri Dao and Eli Shechtman and Gordon Wetzstein and Xun Huang},
      year={2025},
      eprint={2505.20171},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.20171}
}