Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. The GSDecoder then efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
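To make the sampling idea behind a rectified flow model concrete, the following is a minimal sketch, not the SplatFlow implementation. It assumes the common rectified-flow convention in which a sample follows the straight path x_t = (1 - t) * noise + t * data and is generated by Euler integration of a velocity field from t = 0 to t = 1. The names `toy_velocity` and `sample_rectified_flow` are hypothetical, the velocity function is a closed-form stand-in for a learned, text-conditioned network, and the optional mask step is one generic way to do training-free inpainting (re-imposing known content at each step), which may differ from the procedure used in the paper.

```python
def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network v(x, t | text).

    On the straight path x_t = (1 - t) * noise + t * target, the velocity
    that points at the target is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_rectified_flow(velocity, noise, num_steps=8, known=None, mask=None):
    """Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1.

    If `known` and `mask` are given, entries where mask is True are reset
    after every step to the straight-path interpolation between the initial
    noise and the known value (a simple training-free inpainting scheme).
    """
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        # Euler update: move along the predicted velocity for one step.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        if known is not None and mask is not None:
            t_next = t + dt
            # Re-impose the known content at the noise level of t_next.
            x = [
                (1.0 - t_next) * n + t_next * kn if m else xi
                for xi, n, kn, m in zip(x, noise, known, mask)
            ]
    return x


# Toy usage: a 3-element "latent" is transported from noise to a target.
target = [1.0, -2.0, 0.5]
noise = [0.3, 0.7, -1.1]
generated = sample_rectified_flow(lambda x, t: toy_velocity(x, t, target), noise)

# Inpainting-style usage: the first entry is held to a known value.
inpainted = sample_rectified_flow(
    lambda x, t: toy_velocity(x, t, target),
    noise,
    known=[9.0, 0.0, 0.0],
    mask=[True, False, False],
)
```

In a real system the latent would be a tensor holding multi-view image, depth, and camera pose channels, and the velocity would come from a trained network; the control flow of the sampler would stay broadly the same under these assumptions.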