2501.00602.md


STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degraded quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations (parameterized by 3D Gaussians and their velocities) in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200 ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422 m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.

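The key design named in the abstract, aggregating Gaussians from all frames by warping them to a target timestep via predicted velocities, can be sketched minimally as below. This is an illustration only, assuming a constant-velocity motion model over short time offsets; all function names are hypothetical and the paper's actual model is a Transformer that predicts the Gaussians, velocities, and scene flow in a single forward pass.

```python
def transform_to_time(means, velocities, t_source, t_target):
    """Shift Gaussian centers from t_source to t_target under a
    constant-velocity motion model: x(t_target) = x(t_source) + v * dt.

    means:      list of [x, y, z] Gaussian centers at t_source
    velocities: list of [vx, vy, vz] per-Gaussian velocities
    """
    dt = t_target - t_source
    return [[x + vx * dt for x, vx in zip(p, v)]
            for p, v in zip(means, velocities)]

def aggregate_frames(frames, t_target):
    """Union the Gaussians predicted at each observed frame, expressed
    at the target timestep ("amodal" aggregation across time).

    frames: list of (means, velocities, t_source) tuples.
    """
    out = []
    for means, velocities, t_source in frames:
        out.extend(transform_to_time(means, velocities, t_source, t_target))
    return out
```

Static Gaussians (zero velocity) pass through unchanged, so the same warp handles both static background and dynamic instances; the abstract's emergent motion masks would fall out of which Gaussians carry non-trivial velocity.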