3D scene representations have gained immense popularity in recent years. Methods based on Neural Radiance Fields (NeRF) are versatile for traditional tasks such as novel view synthesis. Recently, work has emerged that extends NeRF beyond view synthesis to semantically aware tasks such as editing and segmentation, using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts that reduce feature quality. More recently, 3D Gaussian Splatting (3DGS) has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian Splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields into the 3DGS framework leads to warp-level divergence. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel-view semantic segmentation, language-guided editing, and Segment Anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method provides comparable or better results while being significantly faster to both train and render. Additionally, to the best of our knowledge, ours is the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model.
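At a high level, feature field distillation supervises a feature map rendered from the 3D representation against the per-pixel features produced by a 2D foundation model. The following is a minimal NumPy sketch of that supervision signal, not the actual rasterizer: per-Gaussian feature vectors, the alpha-compositing weights, the L1 loss, and all variable names here are illustrative assumptions, with random data standing in for real teacher features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical sizes): N Gaussians, each carrying a
# D-dimensional learnable semantic feature; an H x W target image.
N, D, H, W = 5, 8, 4, 4
gauss_feats = rng.normal(size=(N, D))

# Stand-in for splatting: per-pixel compositing weights over Gaussians
# (in the real pipeline these come from alpha-blended rasterization).
weights = rng.random(size=(H * W, N))
weights /= weights.sum(axis=1, keepdims=True)

# "Render" the feature map as a weighted sum of Gaussian features.
rendered = weights @ gauss_feats                    # shape (H*W, D)

# Teacher features from a 2D foundation model (e.g. CLIP-LSeg),
# faked here with random values of the same shape.
teacher = rng.normal(size=(H * W, D))

# Distillation objective: L1 distance between rendered and teacher maps,
# backpropagated (in the real method) into the per-Gaussian features.
loss = np.abs(rendered - teacher).mean()
print(rendered.shape, loss >= 0.0)
```

The sketch only illustrates why the feature dimension D is arbitrary: the compositing weights are shared with color rendering, so the same splatting pass can carry features of any width.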