Haoqi Yuan袁昊琦

I received my Ph.D. from Peking University in 2026, advised by Prof. Zongqing Lu. I received my Bachelor's degree from the Turing Class at Peking University in 2021.

My research interests lie in Embodied AI and Reinforcement Learning (RL). My current research focuses on foundation models for robotic manipulation.

I am open to collaborations and discussions.

Email | Google Scholar | GitHub

Systems and Technical Reports

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan*, Zhixuan Liang*, Anzhe Chen*, Ye Wang*, Haoyang Li*, Pei Lin*, Yiyang Huang*, Zixing Lei*, Tong Zhang*, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu^§, Xiong-Hui Chen^§

2026

arXiv project page blog bibtex

Alignment unlocks scale, and scale unlocks generalization. Qwen-RobotManip aligns heterogeneous robot embodiments into a unified Vision-Language-Action framework for scalable multi-source robot learning. It substantially outperforms prior models across all evaluated OOD settings and ranks 1st on the RoboChallenge Table30 v1 Generalist Track, demonstrating strong generalization across unseen tasks, scenes, and robot platforms.

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Jiazhao Zhang*, Gengze Zhou*, Hale Yin*, Yiyang Huang*, Zixing Lei*, Qihang Peng*, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenxu Lv^§, Chenfei Wu^§, Xiong-Hui Chen^§

2026

arXiv project page blog bibtex

Qwen-RobotNav is a scalable navigation foundation model built for agentic navigation systems, with a parameterized interface that dynamically controls task modes and observation strategies. Trained on 15.6M samples, it achieves new state-of-the-art results across major navigation benchmarks and demonstrates strong zero-shot generalization in diverse environments.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang*, Xiaoyue Chen*, Anzhe Chen, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv^§, Xiong-Hui Chen^§, Chenfei Wu^§

2026

arXiv project page blog bibtex

Qwen-RobotWorld is a language-conditioned video world model that predicts physically grounded future visual trajectories across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. Built on a Double-Stream MMDiT and large-scale Embodied World Knowledge, it ranks 1st overall on EWMBench and DreamGen Bench, while demonstrating robust zero-shot generalization and multi-view consistency.

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

Chaoyi Xu, Yixuan Jiang, Jiahui Huan, Yuhui Fu, Haoyu Zhou, Weitian Yuan, Jiayi Yu, Wanpeng Zhang, Haoqi Yuan, Zongqing Lu^§

2026

arXiv project page bibtex

RealDexUMI is a wearable universal manipulation interface that aligns human demonstration collection with robot deployment through a shared dexterous end-effector, in-hand vision, and tactile sensing. Policies trained from RealDexUMI demonstrations achieve strong real-robot dexterous manipulation performance across various tasks, generalize to unseen initial poses, and transfer across robot embodiments.

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang*, Mingsheng Li*, Jian Guan*, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai^§, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lv, Zhibo Yang, Tao Yu, Xionghui Chen

2026

arXiv project page GitHub bibtex

Qwen-VLA is a unified Vision-Language-Action generalist model that casts manipulation, navigation, and trajectory prediction into a shared action-and-trajectory prediction framework. With embodiment-aware prompt conditioning, it serves multiple robot platforms with one set of weights and demonstrates competitive multi-task performance.

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan , Chi Zhang, Yiqing Wang, Yicheng Feng, Zongqing Lu^§

2026

arXiv project page GitHub bibtex

Being-H0.5 is a foundational VLA model that scales human-centric learning with unified action space to enable robust cross-embodiment robot control.

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan , Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu^§

ICML 2026

arXiv project page GitHub bibtex

We introduce Being-H0, the first dexterous Vision-Language-Action model pretrained from large-scale human videos. Being-H0 acquires dexterous manipulation skills from UniHand, the introduced large-scale dataset containing egocentric human videos, language, and hand motion annotations. By explicitly predicting human hand motions, the resulting foundation model seamlessly transfers to real-world robotic dexterous manipulation tasks.

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu^§

2025

arXiv project page GitHub bibtex blog

Being-0 is a hierarchical agent framework for humanoid robots, with a novel Vision-Language Model module bridging the gap between the Foundation Model's language-based task plans and the execution of low-level skills. Being-0 is capable of controlling humanoid robots with multi-fingered dexterous hands and active cameras, enhancing their dexterity in both navigation and manipulation tasks, and solving complex, long-horizon embodied tasks in the real world.

Creative Agents: Empowering Agents with Imagination for Creative Tasks

Penglin Cai*, Chi Zhang*, Yuhui Fu, Haoqi Yuan , Zongqing Lu^§

UAI 2025

conference paper arXiv project page GitHub bibtex blog

Creative tasks are challenging for open-ended agents, where the agent should give novel and diverse task solutions. We propose creative agents with the ability of imagination and introduce several variants in implementation. We benchmark creative tasks in the challenging open-world game Minecraft and propose novel evaluation metrics utilizing GPT-4V. Creative agents are the first AI agents accomplishing diverse building creation in Minecraft survival mode.

Plan4MC: Skill Reinforcement Learning and Planning for Open-World Minecraft Tasks

Haoqi Yuan^§ , Chi Zhang, Hongcheng Wang, Feiyang Xie, Penglin Cai, Hao Dong, Zongqing Lu^§

NeurIPS FMDM Workshop 2023

workshop paper arXiv project page GitHub bibtex blog

Plan4MC is an embodied agent in the Minecraft game, accomplishing 40 tasks including unlocking Iron Pickaxe from scratch. It has a hierarchical structure, integrating LLM-based planning and RL-based skill acquisition. It acquires three types of fine-grained basic skills through RL without demonstrations. With a skill graph pre-generated by the LLM, the skill search algorithm plans and interactively selects policies to solve complicated tasks.

Research Papers

Transport Discrepancy as a Reliability Signal for Vision-Language-Action Models

Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Sipeng Zheng, Qin Jin, Zongqing Lu^§

ECCV 2026

arXiv project page GitHub bibtex

DiG-Flow enhances VLA robustness through geometric regularization. DiG-Flow computes a discrepancy measure between empirical distributions of observation and action embeddings, maps it to a modulation weight via a monotone function, and applies residual updates to the observation embeddings before flow matching. DiG-Flow integrates into existing VLA architectures with negligible overhead and consistently improves performance, with particularly pronounced gains on complex multi-step tasks and under limited training data.

X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

Boyu Li, Chaoyi Xu, Haoqi Yuan, Xinrun Xu, Börje F. Karlsson, Dongbin Zhao, Haoran Li, Zongqing Lu^§

RSS 2026

bibtex

We propose X-DiffVLA, a diffusion-based cross-embodied VLA model that learns universal policies without embodiment-specific fine-tuning via Embodiment Forcing and Morphological Tree Diffusion, achieving significant improvements in both simulation and real-world settings.

Actionable Human Motion Generation via Latent Imitation and Fine-Grained Text Completion

Feiyang Xie, Haoqi Yuan, Zongqing Lu^§

CVPR Findings Workshop 2026

TAM (Text-to-Action-to-Motion) is a novel text2motion framework that can directly generate joint actions for simulated humanoids conditioned on text and convert to human motions via physics simulation. Extensive quantitative and qualitative results demonstrate that TAM achieves state-of-the-art motion generation quality while rigorously adhering to physical constraints.

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan , Zongqing Lu^§

arXiv 2026

arXiv project page GitHub bibtex

To avoid averaging over conflicting data in heterogeneous robot datasets, PTR reweights each training sample by how attributable its post-action outcome is, enabling conservative offline adaptation without reward labels.

Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

Ye Wang*, Sipeng Zheng*, Hao Luo*, Wanpeng Zhang*, Haoqi Yuan , Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin^§

arXiv 2026

arXiv project page GitHub bibtex

We present a systematic, controlled study of VLA scaling that revisits core training choices for pre-training across diverse robots. Our analysis targets three dimensions of VLA scaling: Physical Alignment, Embodiment Mixture, and Training Regularization.

UniTacHand: Unified Spatio-Tactile Representation for Human to Robotic Hand Skill Transfer

Chi Zhang*, Penglin Cai*, Haoqi Yuan, Chaoyi Xu, Zongqing Lu^§

arXiv 2025

arXiv Project Page GitHub bibtex

We propose a unified representation to align robotic tactile information captured by dexterous hands with human hand touch obtained from haptic gloves. By training policies on cheap human data, it enables efficient human-to-robot policy transfer.

Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

Chuan Mao, Haoqi Yuan, Ziye Huang, Chaoyi Xu, Kai Ma, Zongqing Lu^§

CVPR 2026

arXiv Project Page GitHub bibtex

DemoFunGrasp is the follow-up to DemoGrasp, focusing on functional grasping. The learned policy generalizes to unseen combinations of objects and functional grasping conditions, and achieves zero-shot sim-to-real transfer. For the same object, the policy can produce diverse grasps by adjusting the grasping style and affordance.

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan , Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu^§

CVPR 2026

JALA learns a predictive action embedding aligned with both inverse dynamics and real actions, yielding a transition-aware, behavior-centric latent space for training VLAs from heterogeneous human data and robot data.

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan , Sipeng Zheng, Zongqing Lu^§

CVPR 2026

arXiv Project Page GitHub bibtex

VIPA-VLA learns 2D-to-3D visual-physical grounding from human videos with spatial-aware VLA pre-training, enabling robot policies with stronger spatial understanding and generalization.

DemoGrasp: Universal Dexterous Grasping from a Single Demonstration

Haoqi Yuan*, Ziye Huang*, Ye Wang, Chuan Mao, Chaoyi Xu, Zongqing Lu^§

ICLR 2026

conference paper arXiv Project Page GitHub bibtex blog talk

DemoGrasp is a simple yet effective method for universal dexterous grasping, formulating the problem as a one-step MDP of editing a single successful demonstration. It achieves a SOTA success rate of 95% on DexGraspNet objects with Shadow Hand and shows strong transferability to diverse dexterous hand embodiments and zero-shot generalization to unseen object datasets. On the real robot, the policy successfully grasps 110 unseen objects, including small, thin items. It generalizes to spatial, background, and lighting changes, supports both RGB and depth inputs, and extends to language-guided grasping in cluttered scenes.

DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation

Yuhui Fu*, Feiyang Xie*, Chaoyi Xu, Jing Xiong, Haoqi Yuan, Zongqing Lu^§

RA-L 2026

paper arXiv project page GitHub bibtex

DemoHLM is a framework for humanoid loco-manipulation that enables scalable data generation and policy learning for any object-centric tasks. For each task, DemoHLM requires only a single teleoperation demonstration in simulation and automatically synthesizes hundreds to thousands of successful trajectories that complete the same task in varied environments. The learned policies successfully complete ten tasks on a real Unitree G1 robot with grippers and an active RGB-D camera.

Towards Proprioception-Aware Embodied Planning for Dual-Arm Humanoid Robots

Boyu Li, Siyuan He, Hang Xu, Haoqi Yuan, Yu Zang, Liwei Hu, Junpeng Yue, Zhenxiong Jiang, Pengbo Hu, Börje F. Karlsson, Yehui Tang, Zongqing Lu^§

ICRA 2026

arXiv bibtex

Proprio-MLLM enhances MLLM's embodiment awareness by incorporating proprioception with motion-based position embedding and a cross-spatial encoder. Experiments in our DualTHOR benchmark show that Proprio-MLLM achieves an improvement of 19.75% in embodied planning.

Learning Diverse Bimanual Dexterous Manipulation Skills from Human Demonstrations

Bohan Zhou, Haoqi Yuan, Yuhui Fu, Zongqing Lu^§

AAAI 2026 oral

arXiv bibtex blog

BiDexHD is a unified and scalable RL framework to learn bimanual manipulation skills, automatically constructing tasks from human trajectories and employing a teacher-student framework to obtain a vision-based policy tackling similar tasks. We demonstrate mastering 141 tasks from TACO dataset with a success rate of 74.59%.

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu^§

ICLR 2025

conference paper arXiv project page GitHub bibtex blog talk slides poster

CrossDex is an RL-based method for cross-embodiment dexterous grasping. Inspired by human teleoperation, we propose universal eigengrasp actions, which are converted to actions of various dexterous hands through retargeting. CrossDex successfully controls various hand embodiments with a single policy, and effectively transfers to unseen embodiments through zero-shot generalization and finetuning.

Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping

Ziye Huang, Haoqi Yuan, Yuhui Fu, Zongqing Lu^§

ICLR 2025

conference paper arXiv project page GitHub bibtex blog talk poster

ResDex, our RL-based framework for universal dexterous grasping, achieves SOTA performance on DexGraspNet. ResDex employs residual policy learning for efficient multi-task RL, equipped with a mixture of geometry-unaware base policies that enhances generalization and diversity in grasping styles. ResDex masters grasping 3200 objects within 12-hours' training on a single RTX 4090 GPU, achieving zero generalization gaps.

Offline Model-Based Skill Stitching

Penglin Cai, Feiyang Xie, Haoqi Yuan, Zongqing Lu^§

2024

manuscript bibtex

Given pre-trained short-term skills, how can we stitch them to solve long-horizon tasks accurately? We propose a data-efficient framework for offline, model-based skill stitching, enabling effective transitions between skills.

Pre-Trained Multi-Goal Transformers with Prompt Optimization for Efficient Online Adaptation

Haoqi Yuan, Yuhui Fu, Feiyang Xie, Zongqing Lu^§

NeurIPS 2024

conference paper GitHub bibtex poster

Previous works in skill pre-training utilize offline, task-agnostic dataset to accelerate RL. However, these approaches still require substantial RL steps to learn a new task. We propose MGPO, a method that leverages the power of Transformer-based policies to model sequences of goals during offline pre-training, enabling efficient online adaptation through prompt optimization.

RL-GPT: Integrating Reinforcement Learning and Code-as-Policy

Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia^§

NeurIPS 2024 oral

conference paper arXiv project page bibtex poster

RL-GPT equips Large Language Models (LLMs) with Reinforcement Learning (RL) tools, empowering LLM agents to solve challenging tasks in complex, open-world environments. It has a hierarchical framework: a slow LLM agent plans subtasks and selects proper tools (RL or code-as-policy); a fast LLM agent instantiates RL training pipelines or generates code to learn subtasks. LLM agents can perform self-improvement via trial-and-error efficiently. RL-GPT shows great efficiency on solving diverse Minecraft tasks, obtaining Diamond at 8% success rate within 3M environment steps.

Pre-Training Goal-Based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie, Zongqing Lu^§

ICLR 2024 oral (top 1.2%)

conference paper project page GitHub bibtex talk slides poster

PTGM pre-trains on task-agnostic datasets to accelerate learning downstream tasks with RL. The pre-trained models provide: 1. a low-level, goal-conditioned policy that can perform diverse short-term behaviors; 2. a discrete high-level action space consisting of clustered goals in the dataset; 3. a goal prior model that guides and stablize downstream RL to train the high-level policy. PTGM can extend to the complicated domain Minecraft with large datasets, showing great sample efficiency, task performance, interpretability, and generalization of the acquired low-level skills.

Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning

Haoqi Yuan, Zongqing Lu^§

ICML 2022

conference paper GitHub bibtex talk slides poster

Offline meta-RL is a data-efficient RL paradigm that learns from offline data to adapt to new tasks. We propose a contrastive learning framework for robust task representations in context-based offline meta-RL. Our method improves the adaptation performance on unseen tasks, especially when the context is out-of-distribution.

DMotion: Robotic Visuomotor Control with Unsupervised Forward Model Learned from Videos

Haoqi Yuan*, Ruihai Wu*, Andrew Zhao*, Haipeng Zhang, Zihan Ding, Hao Dong^§

IROS 2021

conference paper arXiv project page GitHub bibtex blog talk poster

We study learning world models from action-free videos. Our unsupervised learning method leverages spatial transformers to disentangle the motion of controllable agent, learns a forward model conditioned on the explicit representation of actions. Using a few samples labelled with true actions, our method achieves superior performance on video prediction and model predictive control tasks.

DLGAN: Disentangling Label-Specific Fine-Grained Features for Image Manipulation

Guanqi Zhan*, Yihao Zhao*, Bingchan Zhao, Haoqi Yuan , Baoquan Chen, Hao Dong^§

arXiv 2019

arXiv bibtex

The first work to utilize discrete multi-labels to control which features to be disentangled, and enable interpolation between two domains without using continuous labels. An end-to-end method to support image manipulation conditioned on both images and labels, enabling both smooth and immediate changes simultaneously.

Services

Conference Reviewer

ICML'22,24,25; NeurIPS'22,23,24,25; ICLR'24,25,26; AAAI'23,24,25,26; CVPR'24,26; CORL'25; IROS'25.

Teaching Assistant

Deep Reinforcement Learning, Zongqing Lu, 2023 Spring
Computational Thinking in Social Science, Xiaoming Li, 2020 Autumn
Deep Generative Models, Hao Dong, 2020 Spring