We present TASKVERSE, a simulation benchmark and structured evaluation framework for bimanual robotic manipulation. Unlike existing benchmarks that evaluate policies solely on task success, TASKVERSE introduces an initial suite of tiered, semantically diverse manipulation tasks with fine-grained diagnostic metrics that probe the capabilities and failure modes of learning-based agents. Each task targets specific skills such as coordination, precision, and interaction under variability, and is decomposed into variations that isolate spatial and physical changes. Alongside sparse rewards, TASKVERSE includes high-quality human demonstrations to support data-driven learning. Our evaluation shows that aggregate success rates often conceal critical skill deficiencies, and that TASKVERSE enables nuanced, stagewise insights into policy behavior. By systematically characterizing when, where, and why policies fail, TASKVERSE provides a new foundation for developing and evaluating generalizable robotic agents.
TASKVERSE is a benchmark for evaluating bimanual manipulation policies under diverse task settings. The first iteration consists of 10 base tasks and more than 3,000 human demonstrations. The tasks are drawn from activities that humans commonly perform in diverse settings, from service-style tasks such as lifting a tray, to warehouse tasks such as closing a box, to industrial tasks such as rotating hand-wheels. Each task includes multiple variations, ranging from static setups to dynamic shifts in object pose and semantic context, designed to assess policy performance systematically. To facilitate research in imitation learning and demonstration-driven policy training, we provide a suite of raw expert human demonstrations, along with fine-grained evaluation metrics such as trajectory smoothness and environment collisions.
Task Name | Variations | # Demos | Traj Len | Skills | Coordination Type |
---|---|---|---|---|---|
Bread in Toaster | Static, Pos | 151 | 98.229 | grasp, lift, insert | Loosely Coord. |
Cube Handover | Static, Pos, Rot, PR, Vertical | 511 | 93.631 | grasp, hold | Loosely Coord. |
Lift Pot | Static, Pos, Rot, PR | 390 | 58.561 | grasp, lift | Tight Sym. |
Lift Tray | Static, Pos, Rot, PR, Drag | 730 | 77.318 | grasp, lift | Tight Sym. |
Pack Box | Static, Pos, Rot, PR | 312 | 123.016 | push | Uncoord. |
Pick Single Book From Table | Static, Pos, Rot, PR | 359 | 103.364 | grasp, lift | Loosely Coord. |
Rotate Valve | Static, Pos, Rot, PR | 456 | 112.484 | grasp, rotate along axis | Uncoord. |
Stack Single Book Shelf | Static, Pos, PR | 199 | 187.280 | push, grasp, lift, place | Loosely Coord. |
Stack Two Block | Static, Pos, Rot, PR | 400 | 108.368 | grasp, hold, place | Loosely Coord. |
Sweep Table | Static | 104 | 121.327 | grasp, sweep | Loosely Coord. |
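The task table can also be held as a small in-memory catalog, which makes it easy to check the demonstration counts or to group tasks by coordination type. The sketch below is only illustrative: `TaskSpec`, `TASKS`, and the helper functions are hypothetical names and not part of any released TASKVERSE API. The values are copied from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """One row of the task table (hypothetical container, not an official API)."""
    name: str
    variations: tuple
    num_demos: int
    coordination: str


# Values copied from the task table above.
TASKS = [
    TaskSpec("Bread in Toaster", ("Static", "Pos"), 151, "Loosely Coord."),
    TaskSpec("Cube Handover", ("Static", "Pos", "Rot", "PR", "Vertical"), 511, "Loosely Coord."),
    TaskSpec("Lift Pot", ("Static", "Pos", "Rot", "PR"), 390, "Tight Sym."),
    TaskSpec("Lift Tray", ("Static", "Pos", "Rot", "PR", "Drag"), 730, "Tight Sym."),
    TaskSpec("Pack Box", ("Static", "Pos", "Rot", "PR"), 312, "Uncoord."),
    TaskSpec("Pick Single Book From Table", ("Static", "Pos", "Rot", "PR"), 359, "Loosely Coord."),
    TaskSpec("Rotate Valve", ("Static", "Pos", "Rot", "PR"), 456, "Uncoord."),
    TaskSpec("Stack Single Book Shelf", ("Static", "Pos", "PR"), 199, "Loosely Coord."),
    TaskSpec("Stack Two Block", ("Static", "Pos", "Rot", "PR"), 400, "Loosely Coord."),
    TaskSpec("Sweep Table", ("Static",), 104, "Loosely Coord."),
]


def total_demos(tasks):
    """Sum the demonstration counts over all tasks."""
    return sum(task.num_demos for task in tasks)


def demos_by_coordination(tasks):
    """Group demonstration counts by coordination type."""
    counts = defaultdict(int)
    for task in tasks:
        counts[task.coordination] += task.num_demos
    return dict(counts)


if __name__ == "__main__":
    print("total demonstrations:", total_demos(TASKS))
    for coordination, count in demos_by_coordination(TASKS).items():
        print(f"{coordination}: {count}")
```

Summing the `# Demos` column this way gives 3,612 demonstrations across the 10 base tasks.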
We evaluate policy performance across trajectory, precision, task progression, and bimanual coordination.
Trajectory-Based Metrics
Spatial Precision Metrics
Task Progression Metrics
Bimanual Coordination Metrics
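The text names trajectory smoothness as one of the fine-grained metrics but does not give its definition here. As a rough illustration of what a trajectory-based metric can look like, the sketch below computes mean squared jerk from sampled end-effector positions. This is a common smoothness proxy and an assumption on our part, not necessarily the formula TASKVERSE uses. The function names and the fixed timestep `dt` are hypothetical.

```python
from typing import List, Sequence


def finite_difference(samples: Sequence[Sequence[float]], dt: float) -> List[List[float]]:
    """Forward difference between consecutive samples, per coordinate."""
    return [
        [(b - a) / dt for a, b in zip(prev, nxt)]
        for prev, nxt in zip(samples, samples[1:])
    ]


def mean_squared_jerk(positions: Sequence[Sequence[float]], dt: float) -> float:
    """Mean squared jerk of a trajectory; lower values mean smoother motion.

    ASSUMPTION: jerk is approximated by three successive forward differences
    of positions sampled at a fixed timestep `dt`.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return sum(sum(c * c for c in j) for j in jerk) / len(jerk)


if __name__ == "__main__":
    # Constant-velocity straight line: no jerk at all.
    line = [[float(i), 0.0, 0.0] for i in range(6)]
    # Cubic motion along x: constant, non-zero jerk.
    cubic = [[float(i) ** 3, 0.0, 0.0] for i in range(6)]
    print(mean_squared_jerk(line, dt=1.0))
    print(mean_squared_jerk(cubic, dt=1.0))
```

For a bimanual policy, the same function could be applied to each arm's end-effector trajectory separately and the two values reported side by side.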
Method | Overall: Success | Overall: Rank | Overall: SPL | Lift Tray: Static | Lift Tray: Pos | Lift Tray: Ori | Lift Tray: P+O | Lift Tray: T | Stack Two Cubes: Static | Stack Two Cubes: Pos | Stack Two Cubes: Ori | Stack Two Cubes: P+O |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ACT | 0.35 | 1.28 | 0.27 | 1.00 | 0.52 | 0.45 | 0.82 | 1.00 | 0.16 | 0.09 | 0.02 | 0.11 |
BC | 0.09 | 2.95 | 0.06 | 0.56 | 0.35 | 0.34 | 0.31 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 |
DP | 0.19 | 2.36 | 0.15 | 0.68 | 0.04 | 0.43 | 0.37 | 0.15 | 0.00 | 0.00 | 0.00 | 0.01 |
OpenVLA | 0.16 | 2.27 | 0.08 | 1.00 | 0.20 | 0.54 | 0.32 | 0.00 | 0.00 | 0.04 | 0.02 | 0.04 |
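Method | Stack Single Book Shelf: Static | Stack Single Book Shelf: Pos | Stack Single Book Shelf: P+O | Rod Handover: Static | Rod Handover: Pos | Rod Handover: Ori | Rod Handover: P+O | Rod Handover: P+O+T | Lift Pot: Static | Lift Pot: Pos | Lift Pot: Ori | Lift Pot: P+O |
---|---|---|---|---|---|---|---|---|---|---|---|---|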
ACT | 0.00 | 0.00 | 0.03 | 0.63 | 0.82 | 0.51 | 0.32 | 0.46 | 1.00 | 0.53 | 0.73 | 0.22 |
BC | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.05 | 0.05 | 0.00 |
DP | 0.00 | 0.00 | 0.00 | 0.64 | 0.00 | 0.11 | 0.00 | 0.00 | 0.91 | 0.00 | 0.77 | 0.21 |
OpenVLA | 0.00 | 0.00 | 0.00 | 1.00 | 0.08 | 0.06 | 0.00 | 0.14 | 0.98 | 0.06 | 0.08 | 0.04 |
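Method | Pack Box: Static | Pack Box: Pos | Pack Box: Ori | Pack Box: P+O | Pick Book from Table: Static | Pick Book from Table: Pos | Pick Book from Table: Ori | Pick Book from Table: P+O | Rotate Valve: Static | Rotate Valve: Pos | Rotate Valve: P+O | Rotate Valve: T |
---|---|---|---|---|---|---|---|---|---|---|---|---|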
ACT | 0.33 | 0.82 | 0.11 | 0.37 | 0.00 | 0.13 | 0.20 | 0.18 | 1.00 | 0.12 | 0.09 | 0.03 |
BC | 0.00 | 0.51 | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.89 | 0.00 | 0.00 | 0.01 |
DP | 0.16 | 0.73 | 0.23 | 0.32 | 0.17 | 0.00 | 0.15 | 0.01 | 1.00 | 0.00 | 0.00 | 0.01 |
OpenVLA | 0.00 | 0.10 | 0.06 | 0.06 | 0.00 | 0.00 | 0.00 | 0.02 | 1.00 | 0.02 | 0.00 | 0.02 |
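The claim that aggregate success conceals skill deficiencies can be checked directly against the per-variation numbers. The sketch below takes the ACT success rates for three tasks from the results tables above and compares each task's mean with its static and worst-case variation. The unweighted mean is our assumption and may differ from how the paper's overall success is aggregated. `ACT_RESULTS` and `summarize` are hypothetical names.

```python
from statistics import mean

# ACT success rates copied from the results tables above.
ACT_RESULTS = {
    "Lift Tray": {"Static": 1.00, "Pos": 0.52, "Ori": 0.45, "P+O": 0.82, "T": 1.00},
    "Lift Pot": {"Static": 1.00, "Pos": 0.53, "Ori": 0.73, "P+O": 0.22},
    "Rotate Valve": {"Static": 1.00, "Pos": 0.12, "P+O": 0.09, "T": 0.03},
}


def summarize(per_variation):
    """Compare the mean success rate with the static and worst variations.

    ASSUMPTION: an unweighted mean over variations stands in for the
    aggregate score; the benchmark may weight variations differently.
    """
    static = per_variation["Static"]
    worst_name, worst_rate = min(per_variation.items(), key=lambda item: item[1])
    return {
        "mean": mean(per_variation.values()),
        "static": static,
        "worst_variation": worst_name,
        "drop_from_static": static - worst_rate,
    }


if __name__ == "__main__":
    for task, rates in ACT_RESULTS.items():
        s = summarize(rates)
        print(
            f"{task}: mean={s['mean']:.2f}, static={s['static']:.2f}, "
            f"worst={s['worst_variation']} (drop {s['drop_from_static']:.2f})"
        )
```

On these numbers, ACT solves the static Rotate Valve setup every time yet succeeds in at most 12% of trials under any of the listed variations, a gap that a single averaged score would hide.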
This section provides a visual summary of performance degradation using radial metric plots and failure-mode heatmaps. These visualizations support fine-grained interpretation of agent performance under different task variations.