Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works focus on rendering 2D feature maps from novel viewpoints, which yields an imprecise 3D language field with outlier language features and ultimately fails to align objects in 3D space. Because these approaches extract features from masked images, they also lack essential contextual information, which leads to inaccurate feature representations. To this end, we propose the Language-Embedded Surface Field (LangSurf), which accurately aligns the 3D language field with the surfaces of objects, enabling precise 2D and 3D segmentation from text queries and broadening downstream tasks such as object removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussians onto object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. We also introduce the Hierarchical-Context Awareness Module, which extracts features at the image level to retain contextual information and then performs hierarchical mask pooling with masks segmented by SAM to obtain fine-grained language features at different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our method can segment objects in 3D space, which improves its effectiveness in instance recognition, removal, and editing, as supported by comprehensive experiments.
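To make the hierarchical mask pooling step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dense per-pixel feature map has already been extracted from the full image, and that SAM has produced boolean masks grouped by hierarchy level. The function names `mask_pool` and `hierarchical_mask_pool`, the level names `"fine"` and `"coarse"`, and the choice of average pooling followed by L2 normalization are all illustrative assumptions.

```python
import numpy as np


def mask_pool(feature_map, masks):
    """Average an (H, W, C) feature map inside each (H, W) boolean mask.

    Returns an (N, C) array holding one L2-normalized feature per mask.
    An empty mask yields a zero vector. (Hypothetical helper.)
    """
    masks = np.asarray(masks, dtype=np.float64)              # (N, H, W)
    features = np.asarray(feature_map, dtype=np.float64)     # (H, W, C)
    sums = np.einsum("nhw,hwc->nc", masks, features)         # per-mask feature sums
    counts = masks.sum(axis=(1, 2))[:, None]                 # pixels per mask
    pooled = sums / np.maximum(counts, 1.0)                  # mean feature per mask
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-12)


def hierarchical_mask_pool(feature_map, masks_by_level):
    """Apply mask pooling separately at each hierarchy level (assumed layout)."""
    return {level: mask_pool(feature_map, m) for level, m in masks_by_level.items()}


# Toy image-level feature map: left half is [1, 0], right half is [0, 1].
feature_map = np.zeros((4, 4, 2))
feature_map[:, :2, 0] = 1.0
feature_map[:, 2:, 1] = 1.0

left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
whole = np.ones((4, 4), dtype=bool)

pooled = hierarchical_mask_pool(
    feature_map,
    {"fine": np.stack([left]), "coarse": np.stack([whole])},
)
```

Because the features are computed once on the unmasked image and pooled afterwards, each mask-level feature is still informed by the surrounding context, which is the property the abstract contrasts with feature extraction from masked images.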