3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis in the Simultaneous Localization and Mapping (SLAM). However, existing SLAM methods utilizing 3DGS have failed to provide high-quality novel view rendering for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods perform well for RGB-D cameras but suffer significant degradation in rendering quality for monocular cameras. In this paper, we present Scaffold-SLAM, which delivers simultaneous localization and high-quality photorealistic mapping across monocular, stereo, and RGB-D cameras. We introduce two key innovations to achieve this state-of-the-art visual quality. First, we propose Appearance-from-Motion embedding, enabling 3D Gaussians to better model image appearance variations across different camera poses. Second, we introduce a frequency regularization pyramid to guide the distribution of Gaussians, allowing the model to effectively capture finer details in the scene. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that Scaffold-SLAM significantly outperforms state-of-the-art methods in photorealistic mapping quality, e.g., PSNR is 16.76% higher in the TUM RGB-D datasets for monocular cameras.
3D 高斯点绘制(3D Gaussian Splatting, 3DGS)近年来在同步定位与建图(Simultaneous Localization and Mapping, SLAM)中的新视图合成任务中取得了革命性进展。然而,现有利用 3DGS 的 SLAM 方法尚未能够同时为单目、立体视觉和 RGB-D 摄像机提供高质量的新视图渲染。其中,一些方法在 RGB-D 摄像机中表现较好,但在单目摄像机中渲染质量显著下降。 本文提出了 Scaffold-SLAM,一种能够在单目、立体视觉和 RGB-D 摄像机中同时实现高质量光真实感建图和定位的系统。为实现这一最先进的视觉质量,我们提出了基于运动的外观嵌入(Appearance-from-Motion embedding),使得 3D 高斯能够更好地建模不同相机姿态下图像的外观变化。此外,我们引入了频率正则化金字塔(Frequency Regularization Pyramid),用于引导高斯分布,从而有效捕捉场景中的细节信息。 在单目、立体视觉和 RGB-D 数据集上的大量实验表明,Scaffold-SLAM 在光真实感建图质量上显著优于当前最先进的方法。例如,在 TUM RGB-D 数据集的单目摄像机实验中,PSNR 提高了 16.76%。