3D semantic field learning is crucial for applications such as autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse-view conditions because they rely on inefficient per-scene multi-view optimization, which is impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, enabling direct inference of 3D Gaussian Splatting (3DGS) scenes. By keeping SAM (Segment Anything Model) segmentations consistent across views through video tracking, and by using low-dimensional indices to reference high-dimensional CLIP features, SLGaussian embeds language information in 3D space efficiently and offers a robust solution for accurate 3D scene understanding under sparse-view conditions. In experiments on two-view sparse 3D object querying and segmentation on the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
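To make the idea of indexing high-dimensional features concrete, the following is a minimal sketch, not the SLGaussian implementation. It assumes each tracked segment owns one entry in a small feature table, and that every location stores only an integer index into that table. The names `codebook`, `index_map`, and `query_mask`, the 4-dimensional toy vectors (standing in for CLIP embeddings), and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_scores(codebook, text_embedding):
    """Cosine similarity between the query embedding and each codebook entry."""
    q = normalize(text_embedding)
    return [sum(a * b for a, b in zip(normalize(entry), q)) for entry in codebook]


def query_mask(index_map, codebook, text_embedding, threshold=0.5):
    """Mark every location whose indexed segment feature matches the query.

    Similarities are computed once per codebook entry, then looked up
    through the stored indices instead of being recomputed per location.
    """
    scores = cosine_scores(codebook, text_embedding)
    return [[scores[i] > threshold for i in row] for row in index_map]


# Toy stand-ins (assumptions): three segment features of dimension 4.
codebook = [
    [1.0, 0.0, 0.0, 0.0],  # segment 0
    [0.0, 1.0, 0.0, 0.0],  # segment 1
    [0.0, 0.0, 1.0, 0.0],  # segment 2
]
# Each location stores only the index of the segment it belongs to.
index_map = [
    [0, 0, 1],
    [2, 1, 1],
]
# Hypothetical text embedding that is closest to segment 1.
text_embedding = [0.1, 0.9, 0.0, 0.0]

mask = query_mask(index_map, codebook, text_embedding)
print(mask)  # [[False, False, True], [False, True, True]]
```

In this sketch the cost of a query grows with the number of segments rather than the number of locations, which is one plausible reason an index-based layout can keep open-vocabulary queries fast.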