2025
RSS
From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
Oral Presentation at the ICLR 2025 World Model Workshop [6.6%]
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from the low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions, as these are represented fundamentally differently from the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework that unlocks the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states so it can reason about the consequences of actions in its native representation, natural language, and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.
@inproceedings{wu2024forewarn,title={From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment},author={Wu, Yilin and Tian, Ran and Swamy, Gokul and Bajcsy, Andrea},booktitle={Robotics: Science and Systems (RSS)},year={2025},}
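A minimal sketch of the foresight-then-forethought loop described in the abstract above, assuming hypothetical interfaces for the generative policy, the latent world model, and the VLM verifier (sample_plan, encode, rollout, describe_outcome, and score are illustrative names, not a released API):

```python
import numpy as np

def steer(policy, world_model, vlm, obs, task, num_candidates=8, horizon=16):
    """Pick the candidate action plan whose imagined outcome the VLM prefers."""
    # Sample diverse low-level action plans from the imperfect generative policy.
    plans = [policy.sample_plan(obs, horizon) for _ in range(num_candidates)]
    # Foresight: imagine each plan's future by rolling out the latent world model.
    z0 = world_model.encode(obs)
    futures = [world_model.rollout(z0, plan) for plan in plans]
    # Forethought: map predicted latent states into natural language so the VLM
    # can evaluate outcomes in its native representation, then rank candidates.
    outcomes = [world_model.describe_outcome(z) for z in futures]
    scores = [vlm.score(task, outcome) for outcome in outcomes]
    return plans[int(np.argmax(scores))]
```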
2024
arXiv
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies, due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation, and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments on the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking a first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
@article{tian2024rapl,title={Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment},author={Tian, Ran and Wu, Yilin and Xu, Chenfeng and Tomizuka, Masayoshi and Malik, Jitendra and Bajcsy, Andrea},journal={International Journal of Robotics Research (IJRR), special issue on Foundation Models and Neural-Symbolic AI for Robotics},note={Under submission},year={2024},}
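As an illustration of the "dense visual reward via feature matching" idea above, here is a minimal sketch assuming a vision encoder already fine-tuned on human preference feedback; the cosine-similarity form and all names are illustrative assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def visual_reward(encoder, obs_frame, reference_frame):
    """Dense reward: feature similarity between the current observation and a
    reference (e.g., preferred-outcome) observation, measured in the
    preference-aligned representation space."""
    with torch.no_grad():
        z_obs = F.normalize(encoder(obs_frame), dim=-1)
        z_ref = F.normalize(encoder(reference_frame), dim=-1)
    # High when the rollout visually matches what the end-user prefers.
    return (z_obs * z_ref).sum(dim=-1)
```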
IROS
Learning Generalizable Tool-use Skills through Trajectory Generation
Autonomous systems that efficiently utilize tools can assist humans in completing many common tasks such as cooking and cleaning. However, current systems fall short of human-level intelligence in adapting to novel tools. Prior work based on affordances often makes strong assumptions about the environment and cannot scale to more complex, contact-rich tasks. In this work, we tackle this challenge and explore how agents can learn to use previously unseen tools to manipulate deformable objects. We propose to learn a generative model of tool-use trajectories as sequences of tool point clouds, which generalizes across tool shapes. Given a novel tool, we first generate a tool-use trajectory and then optimize the sequence of tool poses to align with the generated trajectory. We train a single model on four challenging deformable object manipulation tasks, using demonstration data from only one tool per task. The model generalizes to various novel tools, significantly outperforming baselines. We further test our trained policy in the real world with unseen tools, where it achieves performance comparable to humans.
@inproceedings{qitooluse2024,title={Learning Generalizable Tool-use Skills through Trajectory Generation},author={Qi, Carl and Wu, Yilin and Yu, Lifan and Liu, Haoyue and Jiang, Bowen and Lin, Xingyu and Held, David},booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},year={2024},}
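A sketch of the two-stage pipeline above: generate a tool-use trajectory as a sequence of tool point clouds, then fit a rigid pose per step so the real tool tracks the generated clouds. The generator interface is a hypothetical stand-in for the learned generative model, and the pose fit shown is a standard Kabsch alignment assuming per-point correspondences:

```python
import numpy as np

def fit_rigid_pose(source_pts, target_pts):
    """Least-squares rigid (Kabsch) alignment of the tool's point cloud to a
    generated point cloud; returns rotation R and translation t."""
    mu_s, mu_t = source_pts.mean(axis=0), target_pts.mean(axis=0)
    H = (source_pts - mu_s).T @ (target_pts - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t

def plan_tool_motion(generator, tool_pts, scene_obs):
    """Generate a point-cloud trajectory, then optimize one pose per step."""
    generated_clouds = generator.sample(scene_obs)  # list of (N, 3) arrays
    return [fit_rigid_pose(tool_pts, cloud) for cloud in generated_clouds]
```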
RSS
HACMan++: Spatially-Grounded Motion Primitives for Manipulation
In this work, we introduce spatially-grounded parameterized motion primitives to improve policy generalization for robotic manipulation tasks. By grounding the primitives at spatial locations in the environment, our method generalizes effectively across variations in object shape and pose.
@inproceedings{jiang2024hacman++,title={HACMan++: Spatially-Grounded Motion Primitives for Manipulation},author={Jiang, Bowen and Wu, Yilin and Zhou, Wenxuan and Paxton, Chris and Held, David},booktitle={Robotics: Science and Systems (RSS)},year={2024},}
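To make "grounding a primitive at a spatial location" concrete, here is an illustrative sketch that scores every (scene point, primitive) pair and executes the highest-scoring primitive at its grounding point. The primitive names and the critic interface are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

PRIMITIVES = ["poke", "grasp_and_move", "place"]   # illustrative primitive set

def select_grounded_primitive(critic, scene_points):
    """scene_points: (N, 3) point cloud; the critic scores each point for each
    primitive, so the chosen action is spatially grounded in the observed scene."""
    scores = critic(scene_points)                   # (N, num_primitives)
    point_idx, prim_idx = np.unravel_index(np.argmax(scores), scores.shape)
    # Execute the chosen primitive, parameterized at this 3D location.
    return PRIMITIVES[prim_idx], scene_points[point_idx]
```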
RSS
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350h of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance, greater robustness, and improved generalization ability.
@inproceedings{khazatsky2024droid,title={DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset},author={Team, DROID Dataset},booktitle={Robotics: Science and Systems (RSS)},year={2024},}
ICRA
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Team
@inproceedings{open_x_embodiment_rt_x_2023,title={Open {X-E}mbodiment: Robotic Learning Datasets and {RT-X} Models},author={Team, Open X-Embodiment},booktitle={IEEE International Conference on Robotics and Automation (ICRA)},year={2024},}
2023
CoRL
Stabilize to Act: Learning to Coordinate for Bimanual Manipulation
We present a bimanual manipulation system that coordinates its arms by assigning roles: a stabilizing arm holds an object stationary to simplify the environment, while an acting arm performs the task.
@inproceedings{grannen2023stabilize,title={Stabilize to Act: Learning to Coordinate for Bimanual Manipulation},author={Grannen, Jennifer and Wu, Yilin and Vu, Brandon and Sadigh, Dorsa},booktitle={Proceedings of the 7th Conference on Robotic Learning (CoRL)},year={2023},}
ICRA
In-Mouth Robotic Bite Transfer with Visual and Haptic Sensing
We build a semi-autonomous robotic system for safe and comfortable in-mouth transfer of food to people with disabilities. The system comprises a force-reactive controller that safely accommodates the user's motions throughout the transfer, a novel dexterous wrist-like end effector that reduces discomfort, and a visual sensor that locates the user's mouth.
@inproceedings{shaikewitz2023bite,title={In-Mouth Robotic Bite Transfer with Visual and Haptic Sensing},author={Shaikewitz, Lorenzo and Wu, Yilin and Belkhale, Suneel and Grannen, Jennifer and Sundaresan, Priya and Sadigh, Dorsa},booktitle={IEEE International Conference on Robotics and Automation (ICRA)},year={2023},}
2022
CoRL
Learning Bimanual Scooping Policies for Food Acquisition
We propose a general bimanual scooping primitive and an adaptive stabilization strategy that enable successful acquisition of foods with diverse geometries and physical properties using closed-loop visual feedback.
@inproceedings{grannen2022learning,title={Learning Bimanual Scooping Policies for Food Acquisition},author={Grannen, Jennifer and Wu, Yilin and Belkhale, Suneel and Sadigh, Dorsa},booktitle={Proceedings of the 6th Conference on Robot Learning (CoRL)},year={2022},}
2021
ICLR
Solving Compositional Reinforcement Learning Problems via Task Reduction
We train an RL agent to acquire rope-spreading and cloth-spreading skills without any human demonstrations; the method transfers to real robots after domain adaptation.
@inproceedings{li2021solving,title={Solving Compositional Reinforcement Learning Problems via Task Reduction},author={Li, Yunfei and Wu, Yilin and Xu, Huazhe and Wang, Xiaolong and Wu, Yi},booktitle={International Conference on Learning Representations (ICLR)},year={2021},}
2020
RSS
Learning to Manipulate Deformable Objects without Demonstrations
In this work, we use model-free reinforcement learning to learn policies that manipulate deformable objects such as cloth and rope without any human demonstrations.
@inproceedings{wu2020learning,title={Learning to Manipulate Deformable Objects without Demonstrations},author={Wu, Yilin and Yan, Wilson and Kurutach, Thanard and Pinto, Lerrel and Abbeel, Pieter},booktitle={Robotics: Science and Systems (RSS)},year={2020},}