ViTaL: Inference-time Policy Steering via Vision and Touch

ViTaL steers a pre-trained diffusion policy using bi-level visuo-tactile guidance: visual mode selection over the long horizon and tactile refinement over the short horizon.

Method

ViTaL decomposes multimodal steering into a bi-level optimization: long-horizon visual mode selection followed by short-horizon tactile refinement.

Figure: ViTaL overview, showing visual mode selection and tactile contact refinement.

**Multi-Modal Policy Steering.** ViTaL selects visual modes (which cup to target) and refines local contact (grasp force, slip avoidance), steering the base policy toward actions that satisfy both task goals and contact constraints. The high-level visual verifier picks the globally best action sequence; the low-level tactile verifier refines the first few steps via diffusion editing.
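The two levels described above can be illustrated with a toy sketch. This is not the ViTaL implementation: the function `steer`, its parameters, and the toy scorers are hypothetical, and the low-level step uses a simple random perturbation search as a stand-in for diffusion editing, which the page does not specify. It only shows the control flow: pick the candidate action sequence with the best visual score, then adjust only its first `k` steps to improve a tactile score.

```python
import random

def steer(candidates, visual_score, tactile_score, k=2, n_edits=64, sigma=0.05, seed=0):
    """Toy bi-level steering (hypothetical names; not the authors' code).

    candidates: list of action sequences, each a list of per-step action lists.
    visual_score / tactile_score: callables mapping a sequence to a float (higher is better).
    """
    rng = random.Random(seed)

    # High level: choose the candidate sequence with the best visual score.
    idx = max(range(len(candidates)), key=lambda i: visual_score(candidates[i]))
    plan = [list(step) for step in candidates[idx]]  # copy so inputs stay untouched

    # Low level: perturb only the first k steps and keep edits that raise the
    # tactile score. ASSUMPTION: random search stands in for diffusion editing.
    best_t = tactile_score(plan)
    for _ in range(n_edits):
        trial = [list(step) for step in plan]
        for step in trial[:k]:
            for j in range(len(step)):
                step[j] += rng.gauss(0.0, sigma)
        t = tactile_score(trial)
        if t > best_t:
            plan, best_t = trial, t
    return idx, plan

# Toy data: each action is [x_position, grip_force]; two modes (left cup, right cup).
candidates = [
    [[-0.3, 0.2], [-0.6, 0.2], [-1.0, 0.2]],
    [[0.3, 0.9], [0.6, 0.9], [1.0, 0.9]],
]
visual_score = lambda plan: -abs(plan[-1][0] - 1.0)                    # goal: end at x = +1
tactile_score = lambda plan: -sum((s[1] - 0.5) ** 2 for s in plan[:2])  # prefer force near 0.5

idx, plan = steer(candidates, visual_score, tactile_score)
```

In this sketch the steps after the first `k` are left exactly as the selected candidate produced them, which mirrors the idea that touch refines only the near-term portion of the visually selected plan.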

Contributions

1. We propose a bi-level multimodal guidance framework that uses vision for long-horizon semantic mode selection and touch for targeted contact refinement.
2. We introduce, to our knowledge, the first language-conditioned tactile reward in the world model's latent space, and use it with language-conditioned visual rewards and a visuo-tactile latent world model for outcome-based policy steering without task-specific reward learning.
3. We conduct extensive real-world experiments across three contact-rich manipulation tasks, where ViTaL improves overall success by 51% over the base policy, exceeds unimodal steering by at least 33%, and outperforms naive multimodal fusion by at least 20%.

Evaluation of ViTaL

Multimodal Visuo-Tactile Policy Steering: all 20 evaluation trials per task play simultaneously. Each cell shows one independent rollout; double-click a video to view it in full screen.

Instruction: Transfer the liquid to the blue cup and return (Trials 1–10)

Instruction: Transfer the liquid to the yellow cup and return (Trials 11–20)

Instruction: Wipe the orange marks from the whiteboard (Trials 1–10)

Instruction: Wipe the black marks from the whiteboard (Trials 11–20)

Instruction: Insert the purple peg into the top-left hole of the tabletop (Trials 1–10)

Instruction: Insert the purple peg into the top-right hole of the tabletop (Trials 11–20)

Interactive Reward Visualization

Drag the scrubber (or click either plot) to step through a trial. Each plot shows the ground-truth (solid) and predicted (dashed) reward curves for each textual phase objective. The visual and tactile rewards are plotted on independent scales.

Panels: Visual Observations (ground-truth and predicted frames), Tactile Observations (GelSight; ground-truth and predicted frames), Visual Reward, and Tactile Reward. The scrubber runs from t = 0 to t = 99.

World Model Imagination

The world model predicts future observations for both cameras and the tactile sensor. Each card shows the ground truth (top row), the model prediction (middle row), and their difference (bottom row) across three views: front camera, wrist camera, and tactile image. Double-click any card to view it in full screen.
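As a rough illustration of the difference row, the sketch below computes a per-pixel absolute difference between a ground-truth frame and a predicted frame. The page does not state how the difference images are produced, so the absolute-difference choice, the grayscale nested-list representation, and the function name `abs_difference` are all assumptions made for this example.

```python
def abs_difference(gt, pred):
    """Per-pixel absolute difference of two equally sized grayscale frames.

    ASSUMPTION: frames are nested lists of intensities; the actual page may
    use a different difference measure or color images.
    """
    if len(gt) != len(pred) or any(len(a) != len(b) for a, b in zip(gt, pred)):
        raise ValueError("frames must have the same shape")
    return [[abs(g - p) for g, p in zip(row_gt, row_pred)]
            for row_gt, row_pred in zip(gt, pred)]

# Tiny 2x2 example frames (made-up values).
diff = abs_difference([[0, 128], [255, 10]], [[0, 100], [250, 20]])
```

Larger values in `diff` mark pixels where the imagined frame departs most from what was actually observed.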

Results

ViTaL achieves consistent gains in both visual task completion and contact success across all three real-world tasks. Hover over any bar to see the success rate and its standard error, √(p(1−p)/n), where p is the success rate over n = 20 trials.
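For reference, the standard error formula √(p(1−p)/n) can be evaluated as in the short sketch below. The success counts used here are made-up illustrations, not results from the experiments, and the helper name `standard_error` is hypothetical.

```python
import math

def standard_error(successes, n):
    """Binomial standard error sqrt(p * (1 - p) / n) of a success rate."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Hypothetical example: 15 successes in 20 trials gives p = 0.75.
se = standard_error(15, 20)
```

With n = 20 the standard error is largest at p = 0.5 (about 0.11) and falls to zero when every trial succeeds or every trial fails, so error bars on rates near 0% or 100% should be read with care.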

Success rates (%) on Pipette, Wiping, Insertion, and the average across tasks. ViTaL improves overall success by 51% over the base policy and outperforms all unimodal baselines.

How Do Baselines Compare to ViTaL?

Select a task and a baseline to see representative failure rollouts. Each card shows the front camera (left) and the tactile image (right). Double-click any card to view it in full screen.
