We propose Long-LRM, a generalizable 3D Gaussian reconstruction model capable of reconstructing a large scene from a long sequence of input images. Specifically, our model processes 32 source images at 960×540 resolution in only 1.3 seconds on a single A100 80G GPU. Our architecture combines recent Mamba2 blocks with classical transformer blocks, allowing many more tokens to be processed than in prior work, and is further enhanced by efficient token-merging and Gaussian-pruning steps that balance quality against efficiency. Unlike previous feed-forward models, which are limited to 1–4 input images and can reconstruct only a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward pass. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, our method achieves performance comparable to optimization-based approaches while being two orders of magnitude more efficient.
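To make the hybrid design concrete, the sketch below interleaves a linear-time sequence-mixing stage (standing in for Mamba2 blocks) with standard transformer blocks, with a token-merging step in between so that quadratic attention runs on a shorter sequence. This is a minimal illustration, not the authors' implementation: the module names, block counts, dimensions, and the adjacent-pair merging scheme are all assumptions, and a real system would use actual Mamba2 blocks (e.g., from the mamba-ssm package) and the paper's own merging and pruning rules.

```python
# Minimal sketch of a hybrid Mamba2/transformer backbone with token merging.
# All names, sizes, and the merging scheme are illustrative assumptions; this
# is NOT the Long-LRM implementation.
import torch
import torch.nn as nn


class SequenceBlockStub(nn.Module):
    """Stand-in for a Mamba2 block: any linear-time sequence mixer fits here."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(self.norm(x))  # residual update, as in Mamba-style blocks


def merge_tokens(x: torch.Tensor) -> torch.Tensor:
    """Naive token merging: average adjacent token pairs (assumed scheme)."""
    b, n, d = x.shape
    n = n - (n % 2)  # drop a trailing token if the count is odd
    return x[:, :n].reshape(b, n // 2, 2, d).mean(dim=2)


class HybridBackbone(nn.Module):
    def __init__(self, dim: int = 256, n_mamba: int = 4, n_attn: int = 2, heads: int = 8):
        super().__init__()
        self.mamba_stage = nn.Sequential(*[SequenceBlockStub(dim) for _ in range(n_mamba)])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.attn_stage = nn.TransformerEncoder(layer, num_layers=n_attn)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        tokens = self.mamba_stage(tokens)  # cheap linear-time mixing over the long sequence
        tokens = merge_tokens(tokens)      # halve the sequence before quadratic attention
        return self.attn_stage(tokens)     # full attention on the merged tokens


if __name__ == "__main__":
    # 32 views × 128 patch tokens each = 4096 tokens (illustrative numbers only)
    x = torch.randn(1, 32 * 128, 256)
    print(HybridBackbone()(x).shape)  # torch.Size([1, 2048, 256])
```

The design choice this illustrates is the ordering: linear-time blocks absorb the very long raw token sequence, and attention is reserved for the shorter merged sequence, which is what makes processing many high-resolution input views tractable.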