ViTaL decomposes multimodal steering into a bi-level optimization: long-horizon visual mode selection followed by short-horizon tactile refinement.
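The two-level structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the ViTaL implementation: the names `steer`, `visual_score`, and `tactile_score` are hypothetical, plans are flat lists of scalar actions, and the inner level uses plain random perturbation in place of whatever refinement procedure the method actually uses.

```python
import random
from typing import Callable, Sequence

Plan = list[float]  # hypothetical: one scalar action per time step


def steer(
    candidate_plans: Sequence[Plan],
    visual_score: Callable[[Plan], float],   # hypothetical long-horizon scorer
    tactile_score: Callable[[Plan], float],  # hypothetical short-horizon scorer
    refine_steps: int = 3,
    num_perturbations: int = 16,
    noise_std: float = 0.05,
    seed: int = 0,
) -> Plan:
    """Pick a plan by visual score, then refine its first steps by tactile score."""
    rng = random.Random(seed)

    # Outer level: choose the candidate (mode) whose full plan scores best visually.
    best_plan = max(candidate_plans, key=visual_score)

    # Inner level: perturb only the first `refine_steps` actions of that plan
    # and keep the variant with the highest tactile score.
    best_refined = list(best_plan)
    best_tactile = tactile_score(best_refined[:refine_steps])
    for _ in range(num_perturbations):
        trial = list(best_plan)
        for t in range(min(refine_steps, len(trial))):
            trial[t] += rng.gauss(0.0, noise_std)
        score = tactile_score(trial[:refine_steps])
        if score > best_tactile:
            best_refined, best_tactile = trial, score

    return best_refined
```

In this sketch the outer choice fixes the long-horizon mode and the inner search never changes actions beyond `refine_steps`, so tactile refinement cannot override the visually selected mode.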
Multimodal Visuo-Tactile Policy Steering: all 20 evaluation trials per task, played simultaneously. Each cell shows one independent rollout; double-click a video to view it in full screen.
Drag the scrubber (or click either plot) to step through a trial. Each plot shows ground-truth (solid) vs. predicted (dashed) reward curves for every textual phase objective. Visual and tactile rewards are plotted on independent scales.
The world model predicts future observations for both camera views and the tactile image. Each card shows the ground truth (top row), the model prediction (middle row), and their difference (bottom row) across three views: front camera, wrist camera, and tactile image. Double-click any card to view it in full screen.
ViTaL achieves consistent gains in both visual task completion and contact success across all three real-world tasks. Hover over any bar to see the success rate p and its standard error √(p(1−p)/n), computed over n = 20 trials.
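As a worked illustration of the standard error formula above, the following minimal sketch computes a success rate and its binomial standard error. The function name `success_rate_with_se` and the count of 10 successes are hypothetical and are not results from the evaluation.

```python
import math


def success_rate_with_se(successes: int, trials: int) -> tuple[float, float]:
    """Return (p, se), where p = successes / trials and se = sqrt(p * (1 - p) / trials)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    se = math.sqrt(p * (1.0 - p) / trials)
    return p, se


if __name__ == "__main__":
    # Hypothetical count, for illustration only: 10 successes in 20 trials.
    p, se = success_rate_with_se(10, 20)
    print(f"{100 * p:.1f}% ± {100 * se:.1f}%")  # prints "50.0% ± 11.2%"
```

With n = 20 the standard error is largest at p = 0.5 (about 11 percentage points) and falls to zero at p = 0 or p = 1, so bars near the extremes show little or no error even though the sample is small.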
Success rates (%) on Pipette, Wiping, Insertion, and the average across tasks. ViTaL improves overall success by 51% over the base policy and outperforms all unimodal baselines.
Select a task and a baseline to see representative failure rollouts. Each card shows the front camera (left) and the tactile image (right). Double-click any card to view it in full screen.