The evolution of VLA models reflects the iterative advancement of multimodal fusion and intelligent control technologies. This article systematically analyzes the progress and challenges of VLA from three perspectives, namely theory, technology, and application, with the aim of identifying its developmental patterns and exploring future research directions.
The expense and scarcity of embodied data are currently a major bottleneck for embodied intelligence, while large-scale, high-quality synthetic data offers a low-cost route to generalization for end-to-end embodied foundation models. Taking the end-to-end manipulation model GraspVLA and the end-to-end navigation model Uni-NaVid, among other work in this series, as examples, this talk examines the technical breakthroughs of vision-language-action (VLA) foundation model systems and how their generalization capability is achieved.
This article systematically analyzes the technologies underlying vision-language-action (VLA) models, covering their origin, development, and challenges, as well as directions for future technical breakthroughs.
A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To address these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish the latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as in real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. These results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
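The abstract's central mechanism, a latent action model that tokenizes the change between consecutive frames in DINO feature space, can be illustrated with a small sketch. The code below is an assumption-laden illustration, not the authors' released implementation: the module names, dimensions, VQ-VAE-style quantization, and the inverse/forward-dynamics pairing are placeholders chosen to match the abstract's description.

```python
# Hypothetical sketch of a latent action model over DINO features:
# an inverse-dynamics encoder maps a pair of consecutive frame features to a
# discrete latent action token, and a forward model reconstructs the next
# frame's features from the current features plus that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim=768, codebook_size=16, latent_dim=128):
        super().__init__()
        # Inverse dynamics: infer what "happened" between frames t and t+1.
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(), nn.Linear(512, latent_dim)
        )
        # Small codebook: latent actions are discrete tokens.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Forward dynamics: predict next features from current + latent action.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + latent_dim, 512), nn.GELU(), nn.Linear(512, feat_dim)
        )

    def quantize(self, z):
        # Nearest codebook entry (VQ-VAE style) with straight-through gradient.
        dist = torch.cdist(z, self.codebook.weight)        # (B, K)
        idx = dist.argmin(dim=-1)                          # (B,)
        z_q = self.codebook(idx)
        # VQ losses: pull codebook toward encoder outputs and vice versa.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_st = z + (z_q - z).detach()                      # straight-through estimator
        return z_st, idx, vq_loss

    def forward(self, feat_t, feat_tp1):
        z = self.inverse(torch.cat([feat_t, feat_tp1], dim=-1))
        z_q, idx, vq_loss = self.quantize(z)
        pred_tp1 = self.forward_model(torch.cat([feat_t, z_q], dim=-1))
        return F.mse_loss(pred_tp1, feat_tp1) + vq_loss, idx

# Toy usage on random stand-ins for DINO features of two consecutive frames.
lam = LatentActionModel()
f_t, f_tp1 = torch.randn(4, 768), torch.randn(4, 768)
loss, latent_tokens = lam(f_t, f_tp1)
loss.backward()
```

Because the tokens are learned from feature-space change rather than robot-specific commands, any video source (different robots, different viewpoints, human demonstrations) can in principle supply training pairs, which is what makes the cross-embodiment claim plausible.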
Learning embodied manipulation skills in the real world is expensive, so the prevailing practice is to learn in simulation and transfer from sim to real. Building a general-purpose, high-fidelity simulation environment remains very difficult, however, and even constructing a simulator for a single task is hard. Moreover, for agents trained in simulation to transfer to reality, one typically has to sample a high-dimensional space spanning geometry, structure, materials, dynamics, and more, so the curse of dimensionality becomes acute. If a mechanism-grounded, environment-specific world model could be constructed quickly for the target environment, robust and generalizable policy learning would require only small-range domain randomization of that model under mechanistic guidance. This talk examines two world-model-driven paradigms for embodied intelligence: (1) collect task-agnostic manipulation trajectories directly in the target environment and learn a physics-consistent specialized world model that supports learning multiple downstream tasks, where the core problem is how to learn an accurate, physics-consistent world model from sparse trajectory data; and (2) first pretrain a general world foundation model on large-scale simulation, then rapidly adapt it to the target environment to obtain a specialized world model for learning multiple downstream tasks there, where the core problem is how to adapt the general world model accurately and efficiently. The talk analyzes and surveys how these two paradigms drive embodied tasks such as navigation and grasping, and, drawing on recent advances in vision-language-action (VLA) architectures, discusses future directions including jointly data- and physics-driven modeling, synergy between simulated and real-world data, and lightweight world models.
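To make paradigm (1) concrete, here is a minimal, hypothetical sketch of the basic loop it implies: fit a one-step dynamics model on task-agnostic (state, action, next-state) tuples, then roll the learned model forward to evaluate candidate action sequences for downstream tasks. The WorldModel class, all dimensions, and the rollout interface are illustrative assumptions, not the talk's actual system, which would operate on richer observations and enforce physical consistency.

```python
# Illustrative sketch, under the assumptions stated above, of paradigm (1):
# learn an environment-specific dynamics model from trajectories, then use
# imagined rollouts of that model to support downstream task learning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WorldModel(nn.Module):
    """One-step dynamics model s_{t+1} = f(s_t, a_t); sizes are placeholders."""
    def __init__(self, state_dim=32, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.SiLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, s, a):
        # Predict the state delta; residual prediction is a common stabilizer.
        return s + self.net(torch.cat([s, a], dim=-1))

def train_step(model, opt, s, a, s_next):
    """One supervised update on (s_t, a_t, s_{t+1}) tuples from trajectories."""
    loss = F.mse_loss(model(s, a), s_next)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def imagined_rollout(model, s0, actions):
    """Roll the learned model forward to score an action sequence offline."""
    s, states = s0, []
    for a in actions:
        s = model(s, a)
        states.append(s)
    return torch.stack(states)

model = WorldModel()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
# Random stand-ins for task-agnostic trajectory data.
s, a, s_next = torch.randn(64, 32), torch.randn(64, 7), torch.randn(64, 32)
train_step(model, opt, s, a, s_next)
traj = imagined_rollout(model, torch.randn(1, 32), torch.randn(10, 1, 7))
```

In this framing, the small-range domain randomization the talk advocates would perturb the learned model's parameters or inputs only within mechanistically justified bounds, rather than sampling the full high-dimensional space a generic simulator would require.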
This talk focuses on vision-language-action (VLA) models, systematically reporting on the key factors in moving from a VLM to a VLA and analyzing in depth how multimodal embodied intelligence is deployed and optimized in practice. It divides multimodal embodied tasks into two categories, vision-language navigation and manipulation, explains the advantages of VLA as the core robot policy, and validates model performance and generalization on the CALVIN and SimplerEnv simulation benchmarks as well as 20 categories of real-robot tasks across 300 sets of experiments. The study is organized around three core questions: it identifies KosMos and PaliGemma as preferred VLA backbone networks; it compares single-step versus multi-step and discrete versus continuous action spaces, as well as interleaved versus policy-head architectures, confirming that continuous actions with a policy head offer stronger long-horizon performance, generalization, and data efficiency; and it probes the effects of MoE structures, action-chunk execution schemes, and loss functions, finding that MoE improves generalization but yields limited gains in fitting known scenarios. For cross-embodiment data usage, it proposes that same-embodiment data be given the highest priority and that cross-embodiment pretraining followed by post-training entails trade-offs across tasks. Finally, the talk summarizes practical guidance for VLA model selection, architecture design, and data and training strategy, providing a clear reference for deploying embodied intelligence.
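As a concrete reading of the "continuous action + policy head" design the talk finds strongest, the sketch below attaches a small regression head that predicts a chunk of continuous actions to pooled features from a stubbed-out VLM backbone. The names and dimensions (hidden_dim, an 8-step chunk, 7-DoF actions) are illustrative assumptions rather than the reported configuration.

```python
# Minimal sketch of a continuous-action policy head on top of a VLM backbone.
# The backbone (e.g. KosMos or PaliGemma) is replaced by random features here;
# all sizes are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionPolicyHead(nn.Module):
    def __init__(self, hidden_dim=2048, action_dim=7, chunk=8):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.GELU(),
            nn.Linear(512, chunk * action_dim),
        )

    def forward(self, vlm_features):
        # vlm_features: (B, hidden_dim) pooled from the VLM's last layer.
        # Returns a chunk of future actions, executed open-loop before replanning.
        return self.head(vlm_features).view(-1, self.chunk, self.action_dim)

head = ActionPolicyHead()
fake_vlm_out = torch.randn(2, 2048)       # stand-in for pooled VLM features
actions = head(fake_vlm_out)              # (2, 8, 7): 8 steps of 7-DoF actions
target = torch.zeros_like(actions)        # stand-in for demonstration actions
loss = F.mse_loss(actions, target)        # plain L2 regression loss
```

The contrast with the discrete alternative is that an interleaved design would emit action tokens from the VLM's own vocabulary one step at a time, whereas this head regresses a whole continuous chunk in a single forward pass, which is consistent with the data-efficiency and long-horizon advantages the talk reports.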