2025
RSS
From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
Oral Presentation at the ICLR 2025 World Model Workshop [6.6%]
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from the low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions, as these are represented fundamentally differently from the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework that unlocks the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states so it can reason about the consequences of actions in its native representation, natural language, and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.
@inproceedings{wu2024forewarn,title={From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment},author={Wu, Yilin and Tian, Ran and Swamy, Gokul and Bajcsy, Andrea},booktitle={Robotics: Science and Systems (RSS)},year={2025},}
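A minimal sketch of the foresight-then-forethought loop described in the abstract above, assuming hypothetical interfaces for the generative policy, the latent world model, and the VLM verifier (sample_plan, encode, rollout, describe_outcome, and score are illustrative names, not a released API):

```python
import numpy as np

def steer(policy, world_model, vlm, obs, task, num_candidates=8, horizon=16):
    """Pick the candidate action plan whose imagined outcome the VLM prefers."""
    # Sample diverse low-level action plans from the imperfect generative policy.
    plans = [policy.sample_plan(obs, horizon) for _ in range(num_candidates)]
    # Foresight: imagine each plan's future by rolling out the latent world model.
    z0 = world_model.encode(obs)
    futures = [world_model.rollout(z0, plan) for plan in plans]
    # Forethought: map predicted latent states into natural language so the VLM
    # can evaluate outcomes in its native representation, then rank candidates.
    outcomes = [world_model.describe_outcome(z) for z in futures]
    scores = [vlm.score(task, outcome) for outcome in outcomes]
    return plans[int(np.argmax(scores))]
```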
2024
arXiv
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies, due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation, and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments on the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking a first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
@article{tian2024rapl,title={Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment},author={Tian, Ran and Wu, Yilin and Xu, Chenfeng and Tomizuka, Masayoshi and Malik, Jitendra and Bajcsy, Andrea},journal={International Journal of Robotics Research (IJRR), special issue on Foundation Models and Neural-Symbolic AI for Robotics},note={Under submission},year={2024},}
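As an illustration of the "dense visual reward via feature matching" idea above, here is a minimal sketch assuming a vision encoder already fine-tuned on human preference feedback; the cosine-similarity form and all names are illustrative assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def visual_reward(encoder, obs_frame, reference_frame):
    """Dense reward: feature similarity between the current observation and a
    reference (e.g., preferred-outcome) observation, measured in the
    preference-aligned representation space."""
    with torch.no_grad():
        z_obs = F.normalize(encoder(obs_frame), dim=-1)
        z_ref = F.normalize(encoder(reference_frame), dim=-1)
    # High when the rollout visually matches what the end-user prefers.
    return (z_obs * z_ref).sum(dim=-1)
```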
IROS
Learning Generalizable Tool-use Skills through Trajectory Generation
Autonomous systems that efficiently utilize tools can assist humans in completing many common tasks such as cooking and cleaning. However, current systems fall short of human-level intelligence in adapting to novel tools. Prior work based on affordances often makes strong assumptions about the environment and cannot scale to more complex, contact-rich tasks. In this work, we tackle this challenge and explore how agents can learn to use previously unseen tools to manipulate deformable objects. We propose to learn a generative model of tool-use trajectories as sequences of tool point clouds, which generalizes across tool shapes. Given a novel tool, we first generate a tool-use trajectory and then optimize the sequence of tool poses to align with the generated trajectory. We train a single model on four challenging deformable object manipulation tasks, using demonstration data from only one tool per task. The model generalizes to various novel tools, significantly outperforming baselines. We further test our trained policy in the real world with unseen tools, where it achieves performance comparable to humans.
@inproceedings{qitooluse2024,title={Learning Generalizable Tool-use Skills through Trajectory Generation},author={Qi, Carl and Wu, Yilin and Yu, Lifan and Liu, Haoyue and Jiang, Bowen and Lin, Xingyu and Held, David},booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},year={2024},}
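A sketch of the two-stage pipeline above: generate a tool-use trajectory as a sequence of tool point clouds, then fit a rigid pose per step so the real tool tracks the generated clouds. The generator interface is a hypothetical stand-in for the learned generative model, and the pose fit shown is a standard Kabsch alignment assuming per-point correspondences:

```python
import numpy as np

def fit_rigid_pose(source_pts, target_pts):
    """Least-squares rigid (Kabsch) alignment of the tool's point cloud to a
    generated point cloud; returns rotation R and translation t."""
    mu_s, mu_t = source_pts.mean(axis=0), target_pts.mean(axis=0)
    H = (source_pts - mu_s).T @ (target_pts - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t

def plan_tool_motion(generator, tool_pts, scene_obs):
    """Generate a point-cloud trajectory, then optimize one pose per step."""
    generated_clouds = generator.sample(scene_obs)  # list of (N, 3) arrays
    return [fit_rigid_pose(tool_pts, cloud) for cloud in generated_clouds]
```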
RSS
HACMan++: Spatially-Grounded Motion Primitives for Manipulation
In this work, we introduce spatially-grounded parameterized motion primitives to improve policy generalization for robotic manipulation tasks. By grounding the primitives at spatial locations in the environment, our method generalizes effectively across variations in object shape and pose.
@inproceedings{jiang2024hacman++,title={HACMan++: Spatially-Grounded Motion Primitives for Manipulation},author={Jiang, Bowen and Wu, Yilin and Zhou, Wenxuan and Paxton, Chris and Held, David},booktitle={Robotics: Science and Systems (RSS)},year={2024},}
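To make "grounding a primitive at a spatial location" concrete, here is an illustrative sketch that scores every (scene point, primitive) pair and executes the highest-scoring primitive at its grounding point. The primitive names and the critic interface are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

PRIMITIVES = ["poke", "grasp_and_move", "place"]   # illustrative primitive set

def select_grounded_primitive(critic, scene_points):
    """scene_points: (N, 3) point cloud; the critic scores each point for each
    primitive, so the chosen action is spatially grounded in the observed scene."""
    scores = critic(scene_points)                   # (N, num_primitives)
    point_idx, prim_idx = np.unravel_index(np.argmax(scores), scores.shape)
    # Execute the chosen primitive, parameterized at this 3D location.
    return PRIMITIVES[prim_idx], scene_points[point_idx]
```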
RSS
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350h of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance, greater robustness, and improved generalization ability.
@inproceedings{khazatsky2024droid,title={DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset},author={Team, DROID Dataset},booktitle={Robotics: Science and Systems (RSS)},year={2024},}
ICRA
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Team
@inproceedings{open_x_embodiment_rt_x_2023,title={Open {X-E}mbodiment: Robotic Learning Datasets and {RT-X} Models},author={Team, Open X-Embodiment},booktitle={IEEE International Conference on Robotics and Automation (ICRA)},year={2024},}
2023
CoRL
Stabilize to Act: Learning to Coordinate for Bimanual Manipulation
We present a bimanual manipulation system that coordinates its arms by assigning roles: a stabilizing arm holds an object stationary to simplify the environment, while an acting arm performs the task.
@inproceedings{grannen2023stabilize,title={Stabilize to Act: Learning to Coordinate for Bimanual Manipulation},author={Grannen, Jennifer and Wu, Yilin and Vu, Brandon and Sadigh, Dorsa},booktitle={Proceedings of the 7th Conference on Robotic Learning (CoRL)},year={2023},}
ICRA
In-Mouth Robotic Bite Transfer with Visual and Haptic Sensing
We build a semi-autonomous robotic system for safe and comfortable in-mouth transfer of food to people with disabilities. The system comprises a force-reactive controller that safely accommodates the user's motions throughout the transfer, a novel dexterous wrist-like end effector that reduces discomfort, and a visual sensor that locates the user's mouth.
@inproceedings{shaikewitz2023bite,title={In-Mouth Robotic Bite Transfer with Visual and Haptic Sensing},author={Shaikewitz, Lorenzo and Wu, Yilin and Belkhale, Suneel and Grannen, Jennifer and Sundaresan, Priya and Sadigh, Dorsa},booktitle={IEEE International Conference on Robotics and Automation (ICRA)},year={2023},}
2022
CoRL
Learning Bimanual Scooping Policies for Food Acquisition
We propose a general bimanual scooping primitive and an adaptive stabilization strategy that enable successful acquisition of foods with diverse geometries and physical properties using closed-loop visual feedback.
@inproceedings{grannen2022learning,title={Learning Bimanual Scooping Policies for Food Acquisition},author={Grannen, Jennifer and Wu, Yilin and Belkhale, Suneel and Sadigh, Dorsa},booktitle={Proceedings of the 6th Conference on Robot Learning (CoRL)},year={2022},}
2021
ICLR
Solving Compositional Reinforcement Learning Problems via Task Reduction
We train an RL agent to acquire rope-spreading and cloth-spreading skills without any human demonstrations; the method transfers to real robots after domain adaptation.
@inproceedings{li2021solving,title={Solving Compositional Reinforcement Learning Problems via Task Reduction},author={Li, Yunfei and Wu, Yilin and Xu, Huazhe and Wang, Xiaolong and Wu, Yi},booktitle={International Conference on Learning Representations (ICLR)},year={2021},}
2020
RSS
Learning to Manipulate Deformable Objects without Demonstrations
In this work, we use model-free reinforcement learning to learn policies that manipulate deformable objects such as cloth and rope without any human demonstrations.
@inproceedings{wu2020learning,title={Learning to Manipulate Deformable Objects without Demonstrations},author={Wu, Yilin and Yan, Wilson and Kurutach, Thanard and Pinto, Lerrel and Abbeel, Pieter},booktitle={Robotics: Science and Systems (RSS)},year={2020},}