Inference-time Policy Steering via Vision and Touch

Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy
Carnegie Mellon University

ViTaL steers a pre-trained diffusion policy using bi-level visuo-tactile guidance: visual mode selection over a long horizon and tactile refinement over a short horizon.

Method

ViTaL decomposes multimodal steering into a bi-level optimization: long-horizon visual mode selection followed by short-horizon tactile refinement.

ViTaL overview: visual mode selection and tactile contact refinement
Multi-Modal Policy Steering. ViTaL selects visual modes (which cup to target) and refines local contact (grasp force, slip avoidance), steering the base policy toward actions that satisfy both task goals and contact constraints. The high-level visual verifier picks the globally best action sequence; the low-level tactile verifier refines the first few steps via diffusion editing.
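The two-level structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: `steer`, `toy_sample_plans`, `toy_visual_score`, and `toy_tactile_score` are hypothetical names, the verifiers are toy functions, and the low-level step uses a simple random-perturbation search as a stand-in for diffusion editing, which is not reproduced here.

```python
import numpy as np

def steer(sample_plans, visual_score, tactile_score,
          n_samples=16, k_refine=4, n_edits=32, noise=0.05, rng=None):
    """Hypothetical bi-level steering step (illustrative only).

    Returns (refined_plan, selected_plan), each of shape (horizon, action_dim).
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # High level: sample candidate action sequences from the base policy
    # and keep the one the visual verifier scores highest.
    plans = sample_plans(n_samples, rng)
    scores = np.array([visual_score(p) for p in plans])
    selected = plans[int(np.argmax(scores))].copy()

    # Low level: refine only the first k_refine steps under the tactile
    # verifier. ASSUMPTION: random-perturbation search replaces diffusion editing.
    refined = selected.copy()
    best_t = tactile_score(refined[:k_refine])
    for _ in range(n_edits):
        cand = refined.copy()
        cand[:k_refine] += noise * rng.standard_normal(cand[:k_refine].shape)
        t = tactile_score(cand[:k_refine])
        if t > best_t:  # accept only edits that improve the tactile score
            refined, best_t = cand, t
    return refined, selected

# Toy stand-ins (all hypothetical): two visual modes at -1 and +1.
def toy_sample_plans(n, rng, horizon=8, action_dim=2):
    modes = np.where(np.arange(n) % 2 == 0, -1.0, 1.0).reshape(n, 1, 1)
    return modes + 0.1 * rng.standard_normal((n, horizon, action_dim))

def toy_visual_score(plan):
    return -np.mean((plan - 1.0) ** 2)   # prefers the +1 mode

def toy_tactile_score(first_steps):
    return -np.mean((first_steps - 0.8) ** 2)  # prefers a target contact level

refined, selected = steer(toy_sample_plans, toy_visual_score, toy_tactile_score)
```

In this sketch the steps after `k_refine` are left exactly as selected by the visual verifier, which mirrors the idea that tactile refinement only touches the near-term portion of the plan.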

Contributions

Evaluation of ViTaL

Multimodal Visuo-Tactile Policy Steering: all 20 evaluation trials per task, played simultaneously. Each cell shows one independent rollout; double-click a video to view it in full screen.

Instruction: Transfer the liquid to the blue cup and return (Trials 1–10)
Instruction: Transfer the liquid to the yellow cup and return (Trials 11–20)
Instruction: Wipe the orange marks from the whiteboard (Trials 1–10)
Instruction: Wipe the black marks from the whiteboard (Trials 11–20)
Instruction: Insert the purple peg into the top-left hole of the tabletop (Trials 1–10)
Instruction: Insert the purple peg into the top-right hole of the tabletop (Trials 11–20)

Interactive Reward Visualization

Drag the scrubber (or click either plot) to step through a trial. Each plot shows ground-truth (GT, solid) and predicted (dashed) reward curves for every textual phase objective. Visual and tactile rewards use independent scales.

Panels: Visual Observations (Ground Truth, Predicted), Tactile Observations from GelSight (Ground Truth, Predicted), Visual Reward, Tactile Reward. Timeline: t = 0 to t = 99.

World Model Imagination

The world model predicts future observations for both cameras and the tactile sensor. Each card shows the ground truth (top row), the model prediction (middle row), and their difference (bottom row) across three views: front camera, wrist camera, and tactile image. Double-click any card to view it in full screen.
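As a rough illustration of the bottom row, the snippet below computes a per-pixel difference between a ground-truth frame and a predicted frame. It assumes the difference is the absolute per-pixel difference of 8-bit images; the page does not state the exact definition, and `abs_difference` is a hypothetical helper name.

```python
import numpy as np

def abs_difference(gt, pred):
    """Per-pixel absolute difference of two uint8 images with equal shape."""
    if gt.shape != pred.shape:
        raise ValueError("ground-truth and predicted frames must share a shape")
    # Cast to a signed type first so the subtraction cannot wrap around in uint8.
    diff = np.abs(gt.astype(np.int16) - pred.astype(np.int16))
    return diff.astype(np.uint8)

# Tiny made-up example frames (1 x 2 pixels, single channel).
gt_frame = np.array([[10, 200]], dtype=np.uint8)
pred_frame = np.array([[20, 50]], dtype=np.uint8)
diff_frame = abs_difference(gt_frame, pred_frame)
```

The signed cast matters: subtracting `uint8` arrays directly would wrap modulo 256 wherever the prediction is brighter than the ground truth.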

Results

ViTaL achieves consistent gains in both visual task completion and contact success across all three real-world tasks. Hover over any bar to see the success rate p and its standard error √(p(1−p)/n), computed over n = 20 trials.
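For reference, the standard error formula above can be evaluated as follows. The success count used here is a made-up example, not a result from the page, and `success_rate_with_se` is a hypothetical helper name.

```python
import math

def success_rate_with_se(successes, n):
    """Return (p, se) where p = successes / n and se = sqrt(p * (1 - p) / n)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical example: 15 successes out of 20 trials.
p, se = success_rate_with_se(15, 20)
print(f"success rate = {p:.2f}, standard error = {se:.3f}")
```

Note that this formula yields a standard error of exactly zero when p is 0 or 1, so bars at 0% or 100% will show no error bar even though only 20 trials were run.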

Success rates (%) on Pipette, Wiping, Insertion, and the average across tasks. ViTaL improves overall success by 51% over the base policy and outperforms all unimodal baselines.

How Do Baselines Compare to ViTaL?

Select a task and a baseline to see representative failure rollouts. Each card shows the front camera (left) and the tactile image (right). Double-click any card to view it in full screen.