Recent works in 3D multimodal learning have made remarkable progress. However, existing 3D multimodal models are typically capable of handling only point clouds. Compared with the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), spatially sparse point clouds cannot depict the texture information of 3D objects, resulting in inferior reconstruction capability. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed by transformer layers initialized with weights from pre-trained point cloud models, yielding the 3DGS embeddings. CLIP-GS applies a contrastive loss between the 3DGS embeddings and the visual-text embeddings of CLIP, and we propose an image voting loss to guide the direction and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations. Leveraging these well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on a variety of 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
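To make the encoder pipeline concrete, the following is a minimal sketch of the GS Tokenizer and transformer stage described above. It assumes a simple grouping-based serialization of Gaussian parameters and a learned class token; the module names, per-Gaussian attribute layout, and dimensions are illustrative assumptions rather than the paper's released code, and the initialization from point cloud model weights is omitted.

```python
# Hypothetical sketch of the CLIP-GS encoder; names, dimensions, and the
# serialization scheme are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class GSTokenizer(nn.Module):
    """Groups raw 3DGS parameters into serialized Gaussian tokens (assumed design)."""
    def __init__(self, gaussian_dim=14, group_size=32, embed_dim=512):
        # gaussian_dim: per-Gaussian attributes, e.g. position (3), scale (3),
        # rotation quaternion (4), opacity (1), base color (3) -- an assumption.
        super().__init__()
        self.group_size = group_size
        self.proj = nn.Linear(gaussian_dim * group_size, embed_dim)

    def forward(self, gaussians):            # (B, N, gaussian_dim)
        B, N, D = gaussians.shape
        G = N // self.group_size
        groups = gaussians[:, :G * self.group_size].reshape(B, G, -1)
        return self.proj(groups)             # (B, G, embed_dim) Gaussian tokens

class CLIPGSEncoder(nn.Module):
    """GS Tokenizer followed by transformer layers; in CLIP-GS these layers
    would be initialized from a pre-trained point cloud model (omitted here)."""
    def __init__(self, embed_dim=512, depth=12, heads=8):
        super().__init__()
        self.tokenizer = GSTokenizer(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, gaussians):
        tokens = self.tokenizer(gaussians)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = self.blocks(torch.cat([cls, tokens], dim=1))
        return x[:, 0]                       # (B, embed_dim) 3DGS embedding
```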
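The training objective combines the 3DGS-CLIP contrastive term with the image voting loss. Since the exact formulation is not given here, the sketch below uses a standard symmetric InfoNCE contrastive loss and, as a labeled assumption, implements "voting" by weighting each rendered view's GS-image contrastive term by that view's agreement with the text embedding.

```python
# Hedged sketch of the CLIP-GS objective. The image voting loss here is an
# illustrative assumption, not the paper's formulation: per-view contrastive
# terms are weighted by each view's cosine agreement with the text embedding.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between two embedding batches."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def clip_gs_loss(gs_emb, text_emb, view_embs, temperature=0.07):
    # gs_emb: (B, D) 3DGS embeddings; text_emb: (B, D) CLIP text embeddings;
    # view_embs: (B, V, D) CLIP image embeddings of V rendered views.
    loss_gt = info_nce(gs_emb, text_emb, temperature)          # GS <-> text
    # Image voting (assumed): views that agree more with the text embedding
    # receive larger weight in the GS <-> image alignment term.
    votes = F.softmax(
        F.cosine_similarity(view_embs, text_emb.unsqueeze(1), dim=-1), dim=1)
    per_view = torch.stack(
        [info_nce(gs_emb, view_embs[:, v], temperature)
         for v in range(view_embs.size(1))])                   # (V,)
    loss_gi = (votes.mean(dim=0) * per_view).sum()             # GS <-> images
    return loss_gt + loss_gi
```

In practice, `gs_emb` would come from an encoder like `CLIPGSEncoder` above, while `text_emb` and `view_embs` would be produced by a frozen CLIP model applied to the text and rendered-image sides of the generated triplets.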