From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment

1 Carnegie Mellon University  2 UC Berkeley

FOREWARN is a VLM-in-the-loop policy steering system that improves a multi-modal generative policy's performance at deployment time without retraining.

Abstract

While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions, as these are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM’s burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation—natural language—and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.

FOREWARN

Filtering Options via REpresenting World-model Action Rollouts via Narration


Method Figure

We present our method FOREWARN, a VLM-in-the-loop policy steering algorithm for multi-modal generative robot policies. Our key idea is to decouple the VLM’s burden of predicting action outcomes from that of evaluation. By predicting action outcomes with a pre-trained latent dynamics model and aligning a VLM to reason about these latent states in text, FOREWARN can select the action plans at runtime that are most appropriate for new task contexts and user needs.
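The decoupled loop described above can be sketched in a few lines. This is a hypothetical illustration only: the `GenerativePolicy`-style objects, their method names (`sample`, `imagine`, `narrate`, `rank`), and the number of candidate plans are assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the FOREWARN steering loop. All object interfaces
# (policy.sample, world_model.imagine, vlm.narrate, vlm.rank) are
# illustrative stand-ins, not the paper's real implementation.

def steer(policy, world_model, vlm, obs, task_description, num_plans=5):
    """Select the action plan whose predicted outcome best matches the task."""
    # 1. Sample diverse candidate action plans from the generative policy.
    plans = [policy.sample(obs) for _ in range(num_plans)]

    # 2. Foresight: roll each plan forward in the world model's latent space.
    latent_rollouts = [world_model.imagine(obs, plan) for plan in plans]

    # 3. Forethought: narrate each predicted outcome in natural language,
    #    then ask the same VLM to pick the plan best aligned with the task.
    narrations = [vlm.narrate(z) for z in latent_rollouts]
    best = vlm.rank(task_description, narrations)
    return plans[best]
```

Keeping the narration step (2) separate from the selection step (3) is what lets the VLM evaluate consequences in its native representation, natural language, rather than raw actions.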

Training Pipeline


Training Pipeline

On the left, a world model is pretrained to learn informative latent embeddings of the dynamics, conditioned on observations and actions. On the right, the sequence of learned latent embeddings is projected through a linear layer into the text embedding space, analogous to the vision-token processing in the original Llama-3.2 model. The projection layer and the Llama model are finetuned jointly with LoRA, while the world model remains frozen. To align the latent embeddings with their underlying textual representation, we finetune the VLM on a visual question answering task that asks it to generate behavior narrations capturing nuanced motion details.
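The projection step can be sketched as a single linear map from the world model's latent dimension to the language model's token embedding dimension. This is a minimal sketch under assumed dimensions; the class name and sizes below are illustrative, not the authors' configuration.

```python
# Minimal sketch: projecting world-model latents into the LLM's text
# embedding space. Dimensions (latent_dim, text_embed_dim) are assumed
# placeholders, not the values used in the paper.
import torch
import torch.nn as nn

class LatentToTextProjector(nn.Module):
    def __init__(self, latent_dim=512, text_embed_dim=4096):
        super().__init__()
        # A single linear layer, analogous to how vision tokens are
        # projected before being consumed by the language model.
        self.proj = nn.Linear(latent_dim, text_embed_dim)

    def forward(self, latents):
        # latents: (batch, seq_len, latent_dim) -> (batch, seq_len, text_embed_dim)
        return self.proj(latents)

# During finetuning, the world model is frozen and only the projector
# plus the VLM's LoRA adapters receive gradients, e.g.:
#   for p in world_model.parameters(): p.requires_grad = False
```

The projected sequence is then interleaved with ordinary text tokens, so the finetuned VLM treats predicted latents like any other context when generating narrations.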

Real World Results

Qualitative Results for Behavior Narration

Baselines
FOREWARN: our proposed method, which uses the world model to predict future latent states and a finetuned VLM to generate behavior narrations.
FOREWARN-Oracle: an upper bound on our approach's performance, assuming access to ground-truth future observations (instead of relying on the latent dynamics model to predict future outcomes).
VLM-Act: directly finetuning the original Llama-3.2-11B-Vision-Instruct model to generate behavior narrations end-to-end from the current observation and an action plan (represented as text), without explicitly predicting outcomes with a world model.
VLM-Img: using GPT-4o to generate behavior narrations zero-shot from visual observations predicted by the world model.
VLM-Img-Oracle: an upper bound on the performance of VLM-Img, assuming access to ground-truth visual observations.

On the left are videos of the ground-truth observations. On the right are the prompt used for querying and the generated behavior narrations. Only FOREWARN and FOREWARN-Oracle consistently produce accurate outcome narrations, effectively capturing nuanced motion details. In contrast, the baselines frequently hallucinate or fail to capture critical contact details between the gripper and objects.

Cup Task

Left image

Bag Task

Left image

Quantitative Results for Behavior Narration

We conduct experiments to measure the alignment between the behavior narrations predicted by each method and the ground-truth narrations. FOREWARN outperforms all baselines across both tasks and achieves performance comparable to FOREWARN-Oracle, which has access to ground-truth action outcomes and represents the upper bound for our approach.

behavior narration

Qualitative Results for Policy Steering

Baselines
FOREWARN: our proposed method, which (1) predicts action outcomes with the world model in latent space; (2) generates behavior narrations with the finetuned VLM from the predicted future latent states; and (3) selects the best action plan using the same VLM, based on the task description and the narrations.
VLM-Act: directly finetuning the VLM to generate behavior narrations from observations and action plans without an explicit world model, keeping step (3) the same as in our method.
VLM-DynLat-Category: keeping step (1) the same as in our method, while directly finetuning the VLM to output the indices of valid action plans from the task description and predicted future latent states, merging steps (2) and (3).
Classifier-Dyn-Latent: similar to VLM-DynLat-Category, but using a learned classifier instead of a VLM to predict success/failure for each plan, then randomly selecting from the successful plans.
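Step (3) amounts to formatting the task description and candidate narrations into a single selection query for the VLM. The helper below is a hypothetical illustration of such a prompt; the authors' actual prompt wording is not reproduced here.

```python
# Hypothetical prompt construction for the plan-selection step (3).
# The wording is illustrative, not the paper's actual prompt.
def build_selection_prompt(task_description, narrations):
    """Ask the VLM to choose the plan whose narrated outcome fits the task."""
    lines = [f"Task: {task_description}", "Candidate action outcomes:"]
    for i, narration in enumerate(narrations):
        lines.append(f"{i}: {narration}")
    lines.append("Reply with the index of the plan that best satisfies the task.")
    return "\n".join(lines)
```

Because selection is posed in plain text over narrations, swapping in a novel task description at runtime requires no retraining, unlike the category- or classifier-based baselines above.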

Our results demonstrate that FOREWARN can effectively steer the policy towards safe and aligned behavior modes by leveraging the VLM as an interpreter and evaluator of predicted latent action outcomes. It outperforms VLM-Act on all tasks, and outperforms Classifier-Dyn-Latent and VLM-DynLat-Category on novel task descriptions (not seen during training).

Task Scenario 1
Base Policy
FOREWARN
VLM-Act
VLM-DynLat-Category
Classifier-Dyn-Latent
Task Scenario 2
Base Policy
FOREWARN
VLM-Act
VLM-DynLat-Category
Classifier-Dyn-Latent
Task Scenario 3
Base Policy
FOREWARN
VLM-Act
VLM-DynLat-Category
Classifier-Dyn-Latent
Task Scenario 4
Base Policy
FOREWARN
VLM-Act
VLM-DynLat-Category
Classifier-Dyn-Latent

Quantitative Results for Policy Steering

Failure Modes


Interactive Visualization: Drag the slider to visualize different failure modes.

Failure Case:  


In this example, actually executing the action plan knocks the cup over on the table, while the world model mistakenly imagines the robot successfully picking the cup up from the table by its handle.


BibTeX

@misc{wu2025forewarn,
      title={From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment},
      author={Yilin Wu and Ran Tian and Gokul Swamy and Andrea Bajcsy},
      year={2025},
      eprint={2502.01828},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.01828},
}