We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.
我们提出了 EnerVerse,一种专为机器人操作任务设计的综合性化身未来空间生成框架。EnerVerse 将卷积和双向注意力机制无缝整合用于块内空间建模,确保低层次的一致性和连续性。鉴于视频数据中固有的冗余性,我们提出了一种稀疏记忆上下文,结合基于块的单向生成范式,实现了无限长序列的生成。 为了进一步增强机器人能力,我们引入了自由锚视角空间(Free Anchor View, FAV space),提供灵活的观察视角以改善对场景的观察和分析。FAV 空间有效减轻了运动建模中的歧义问题,消除了受限环境中的物理约束,并显著提升了机器人在各种任务和环境中的泛化性与适应性。 针对获取多摄像机观测的高昂成本和劳动强度,我们提出了一种结合生成模型与 4D 高斯点绘制(4D Gaussian Splatting, 4DGS)的数据引擎管线。该管线利用生成模型的强泛化能力以及 4DGS 提供的空间约束,迭代提升数据质量和多样性,形成了一个数据飞轮效应,有效缩小了模拟到现实的差距(sim-to-real gap)。 实验结果表明,化身未来空间生成先验显著增强了策略预测能力,从而提升了整体性能,特别是在长距离机器人操作任务中表现尤为突出。