Taskverse: Where Robotic Manipulation Meets Structured and Scalable Evaluation

Abstract

We present TASKVERSE, a simulation benchmark and structured evaluation framework for bimanual robotic manipulation. Unlike existing benchmarks that evaluate policies solely on task success, TASKVERSE introduces an initial suite of tiered, semantically diverse manipulation tasks with fine-grained diagnostic metrics to probe the capabilities and failure modes of learning-based agents. The tasks target specific skills such as coordination, precision, and interaction under variability, and each is decomposed into variations that isolate spatial and physical variability. Alongside sparse rewards, TASKVERSE includes high-quality human demonstrations to support data-driven learning. Our evaluation shows that aggregate success rates often conceal critical skill deficiencies, and that TASKVERSE enables nuanced, stage-wise insights into policy behavior. By systematically characterizing when, where, and why policies fail, TASKVERSE provides a new foundation for developing and evaluating generalizable robotic agents.

Benchmark

Task Overview

(a) Lift Pot, (b) Lift Tray, (c) Move Tray, (d) Pack Box, (e) Reach Target, (f) Rotate Valve, (g) Cube Handover

Taskverse Benchmark

TASKVERSE is a benchmark for evaluating bimanual manipulation policies under diverse task settings. The first iteration consists of 10 base tasks and 3000+ human demonstrations. The tasks are derived from activities that humans perform in diverse settings, from service-style tasks such as lifting a tray, to warehouse tasks such as closing a box, to industrial tasks such as rotating hand-wheels. Each task includes multiple variations, ranging from static setups to dynamic shifts in object pose and semantic context, designed to assess policy performance systematically. To facilitate research in imitation learning and demonstration-driven policy training, we provide a suite of raw expert human demonstrations, along with fine-grained evaluation metrics such as trajectory smoothness and environment collisions.

Base Task Set in TASKVERSE

Task Name	Variations	# Demos	Traj Len	Skills	Coordination Type
Bread in Toaster	Static, Pos	151	98.229	grasp, lift, insert	Loosely Coord.
Cube Handover	Static, Pos, Rot, PR, Vertical	511	93.631	grasp, hold	Loosely Coord.
Lift Pot	Static, Pos, Rot, PR	390	58.561	grasp, lift	Tight Sym.
Lift Tray	Static, Pos, Rot, PR, Drag	730	77.318	grasp, lift	Tight Sym.
Pack Box	Static, Pos, Rot, PR	312	123.016	push	Uncoord.
Pick Single Book From Table	Static, Pos, Rot, PR	359	103.364	grasp, lift	Loosely Coord.
Rotate Valve	Static, Pos, Rot, PR	456	112.484	grasp, rotate along axis	Uncoord.
Stack Single Book Shelf	Static, Pos, PR	199	187.280	push, grasp, lift, place	Loosely Coord.
Stack Two Block	Static, Pos, Rot, PR	400	108.368	grasp, hold, place	Loosely Coord.
Sweep Table	Static	104	121.327	grasp, sweep	Loosely Coord.

Evaluation Metrics

We evaluate policy performance with four groups of metrics: trajectory quality, spatial precision, task progression, and bimanual coordination.

Trajectory-Based Metrics

Joint Path Length: Total angular joint distance during execution.
Cartesian Path Length: 3D distance traveled by end-effectors.
Jerk (Joint / Cartesian): Measures motion smoothness.
Collision Counts: Number of robot/environment collisions.
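
The trajectory metrics above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the benchmark's implementation: the function names, the array layouts, the choice of mean squared jerk as the smoothness summary, and the per-step boolean contact flag are all assumptions made for the example.

```python
import numpy as np

def joint_path_length(q):
    """Sum of absolute joint-angle changes over a (T, n_joints) trajectory (radians)."""
    q = np.asarray(q, dtype=float)
    return float(np.abs(np.diff(q, axis=0)).sum())

def cartesian_path_length(p):
    """Total Euclidean distance traveled by one end-effector, given (T, 3) positions."""
    p = np.asarray(p, dtype=float)
    return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())

def mean_squared_jerk(x, dt):
    """Mean squared norm of the third finite difference of a (T, d) signal.

    Works for joint angles or Cartesian positions; lower values mean smoother motion.
    Requires at least four samples.
    """
    x = np.asarray(x, dtype=float)
    jerk = np.diff(x, n=3, axis=0) / dt**3
    return float((jerk ** 2).sum(axis=1).mean())

def collision_count(in_contact):
    """Count distinct collision events as rising edges of a per-step contact flag.

    Assumes a non-empty sequence of booleans (hypothetical logging format).
    """
    c = np.asarray(in_contact, dtype=bool).astype(int)
    return int(c[0] + (np.diff(c) == 1).sum())
```

Counting rising edges rather than contact timesteps keeps one long scrape from being reported as many collisions; whether the benchmark counts events or timesteps is not stated in the text.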

Spatial Precision Metrics

Final Distance to Target: Distance between final and goal object pose.
Orientation Error: Geodesic difference between object rotations.
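
As a minimal sketch of the two precision metrics, assuming object orientations are available as quaternions (the text does not say which rotation representation the benchmark uses), the geodesic orientation error is the angle of the relative rotation. The function names below are hypothetical.

```python
import numpy as np

def final_distance_to_target(final_pos, goal_pos):
    """Euclidean distance between the final and goal object positions."""
    diff = np.asarray(final_pos, dtype=float) - np.asarray(goal_pos, dtype=float)
    return float(np.linalg.norm(diff))

def orientation_error(q_final, q_goal):
    """Geodesic angle in radians between two rotations given as quaternions.

    Both quaternions must use the same component order. The absolute value of
    the dot product makes the result identical for q and -q, which encode the
    same rotation, so the error always lies in [0, pi].
    """
    q1 = np.asarray(q_final, dtype=float)
    q2 = np.asarray(q_goal, dtype=float)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    d = abs(float(np.dot(q1, q2)))
    return float(2.0 * np.arccos(np.clip(d, -1.0, 1.0)))
```

For example, comparing the identity with a 90 degree rotation about any single axis gives an error of pi/2 radians.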

Task Progression Metrics

Stage-wise Success: Binary success indicators for each sub-task.
Time in Each Stage: Timesteps spent per task stage.
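
One way to derive both progression metrics from a single episode log is sketched below. It assumes, hypothetically, that the simulator records one stage label per timestep and that stages occur in a fixed order; a stage counts as successful if the episode moved past it, or if it is the last stage reached and the episode succeeded. The stage names in the usage note are invented for illustration.

```python
from collections import Counter

def stage_metrics(stage_per_step, stage_names, task_success):
    """Return (stage_success, time_in_stage) for one episode.

    stage_per_step: one stage label per timestep (hypothetical logging format).
    stage_names:    stages in the order the task is expected to pass through them.
    task_success:   final binary success of the episode.
    """
    counts = Counter(stage_per_step)
    time_in_stage = {name: counts.get(name, 0) for name in stage_names}
    # Indices of the stages that were entered at least once.
    reached = [i for i, name in enumerate(stage_names) if time_in_stage[name] > 0]
    last = reached[-1] if reached else -1
    stage_success = {
        name: (i < last) or (i == last and bool(task_success))
        for i, name in enumerate(stage_names)
    }
    return stage_success, time_in_stage
```

Called on an episode that spends three steps in "reach" and two in "grasp" before failing, with stages ("reach", "grasp", "lift"), this marks only "reach" as successful and reports zero steps in "lift", which localizes the failure to the grasp.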

Bimanual Coordination Metrics

Gripper Vertical Sync: Height difference between the two grippers.
EE Velocity Difference: Difference between the two end-effector (EE) velocities, a measure of arm coordination.
Slip Count: Tracks unintended object drops.
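
The coordination metrics can be sketched as follows. The aggregation choices (mean over the episode), the z-up convention, and the per-step grasp flag are assumptions for illustration; the text does not specify how the benchmark summarizes these quantities.

```python
import numpy as np

def gripper_vertical_sync(left_pos, right_pos):
    """Mean absolute height (z) difference between the two grippers, (T, 3) inputs."""
    left = np.asarray(left_pos, dtype=float)
    right = np.asarray(right_pos, dtype=float)
    return float(np.abs(left[:, 2] - right[:, 2]).mean())

def ee_velocity_difference(left_pos, right_pos, dt):
    """Mean norm of the difference between finite-difference EE velocities."""
    v_left = np.diff(np.asarray(left_pos, dtype=float), axis=0) / dt
    v_right = np.diff(np.asarray(right_pos, dtype=float), axis=0) / dt
    return float(np.linalg.norm(v_left - v_right, axis=1).mean())

def slip_count(is_grasped):
    """Count grasp-to-release transitions in a per-step boolean grasp flag.

    Assumes the flag covers only the phase in which the object should stay
    held, so every falling edge is an unintended drop.
    """
    g = np.asarray(is_grasped, dtype=bool).astype(int)
    return int((np.diff(g) == -1).sum())
```

For tightly coordinated tasks such as Lift Tray, low values of the first two quantities indicate that the arms move together; for uncoordinated tasks they are less informative.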

Simulation Results

Performance on Bimanual Tasks with Variations

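Method	Overall: Success	Overall: Rank	Overall: SPL	Lift Tray: Static	Lift Tray: Pos	Lift Tray: Ori	Lift Tray: P+O	Lift Tray: T	Stack Two Cubes: Static	Stack Two Cubes: Pos	Stack Two Cubes: Ori	Stack Two Cubes: P+O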
ACT	0.35	1.28	0.27	1.00	0.52	0.45	0.82	1.00	0.16	0.09	0.02	0.11
BC	0.09	2.95	0.06	0.56	0.35	0.34	0.31	0.00	0.00	0.03	0.00	0.00
DP	0.19	2.36	0.15	0.68	0.04	0.43	0.37	0.15	0.00	0.00	0.00	0.01
OpenVLA	0.16	2.27	0.08	1.00	0.20	0.54	0.32	0.00	0.00	0.04	0.02	0.04

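Method	Stack Single Book Shelf: Static	Stack Single Book Shelf: Pos	Stack Single Book Shelf: P+O	Rod Handover: Static	Rod Handover: Pos	Rod Handover: Ori	Rod Handover: P+O	Rod Handover: P+O+T	Lift Pot: Static	Lift Pot: Pos	Lift Pot: Ori	Lift Pot: P+O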
ACT	0.00	0.00	0.03	0.63	0.82	0.51	0.32	0.46	1.00	0.53	0.73	0.22
BC	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.04	0.04	0.05	0.05	0.00
DP	0.00	0.00	0.00	0.64	0.00	0.11	0.00	0.00	0.91	0.00	0.77	0.21
OpenVLA	0.00	0.00	0.00	1.00	0.08	0.06	0.00	0.14	0.98	0.06	0.08	0.04

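Method	Pack Box: Static	Pack Box: Pos	Pack Box: Ori	Pack Box: P+O	Pick Book from Table: Static	Pick Book from Table: Pos	Pick Book from Table: Ori	Pick Book from Table: P+O	Rotate Valve: Static	Rotate Valve: Pos	Rotate Valve: P+O	Rotate Valve: T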
ACT	0.33	0.82	0.11	0.37	0.00	0.13	0.20	0.18	1.00	0.12	0.09	0.03
BC	0.00	0.51	0.00	0.10	0.00	0.00	0.00	0.00	0.89	0.00	0.00	0.01
DP	0.16	0.73	0.23	0.32	0.17	0.00	0.15	0.01	1.00	0.00	0.00	0.01
OpenVLA	0.00	0.10	0.06	0.06	0.00	0.00	0.00	0.02	1.00	0.02	0.00	0.02
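
To illustrate how an aggregate success rate can hide sensitivity to variation, the sketch below takes the Lift Pot columns (Static, Pos, Ori, P+O) from the results above and compares each method's mean over all four settings with its drop from the static setting to the varied ones. The summary statistics and the function name are choices made for this example, not metrics defined by the benchmark.

```python
# Lift Pot success rates copied from the results table.
lift_pot = {
    "ACT":     {"Static": 1.00, "Pos": 0.53, "Ori": 0.73, "P+O": 0.22},
    "BC":      {"Static": 0.04, "Pos": 0.05, "Ori": 0.05, "P+O": 0.00},
    "DP":      {"Static": 0.91, "Pos": 0.00, "Ori": 0.77, "P+O": 0.21},
    "OpenVLA": {"Static": 0.98, "Pos": 0.06, "Ori": 0.08, "P+O": 0.04},
}

def robustness_summary(per_variation):
    """Summarize one method on one task (illustrative statistics only)."""
    varied = [v for k, v in per_variation.items() if k != "Static"]
    mean_all = sum(per_variation.values()) / len(per_variation)
    mean_varied = sum(varied) / len(varied)
    return {
        "mean_all": mean_all,
        "mean_varied": mean_varied,
        # How much success is lost once the setup is no longer static.
        "static_drop": per_variation["Static"] - mean_varied,
    }

summary = {method: robustness_summary(rates) for method, rates in lift_pot.items()}

for method, stats in summary.items():
    print(f"{method:8s} mean={stats['mean_all']:.2f} "
          f"varied={stats['mean_varied']:.2f} drop={stats['static_drop']:.2f}")
```

On these numbers OpenVLA and DP both look near-perfect in the static setting, yet their per-variation breakdowns differ sharply: DP retains most of its success under orientation changes while failing under position changes, and OpenVLA degrades under every variation.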

Real-world experiments

To validate the trends observed in our simulation benchmarks, we conducted real-world experiments on three tasks (Lift Tray, Stack Two Cubes, and Rod Handover) using a bimanual Franka Panda robot setup. These tasks closely mirror their simulated counterparts and include multiple task variations. For each task, we collected 100 demonstrations under the static variation using an Oculus VR controller, and fine-tuned OpenVLA until training accuracy plateaued. We then evaluated each task variation over 25 trials. The results, shown in Figure 4, demonstrate a strong correlation between the real-world task performance of OpenVLA and the trends observed in simulation: performance decreases as the variations increase task complexity.
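
With 25 trials per variation, each measured success rate carries noticeable sampling uncertainty. As a reading aid, the sketch below computes a Wilson score interval for a binomial success rate. This analysis is not part of the reported results; the function name and the 95% level (z = 1.96) are assumptions chosen for the example.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (default is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For instance, a variation with zero successes in 25 trials still has an upper bound of roughly 0.13 under this interval, so small differences between variations should be interpreted with care.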

Failure Analysis & Metric Summary

This section provides a visual summary of performance degradation using radial metric plots and failure-mode heatmaps. These visualizations allow fine-grained interpretation of agent performance under different task variations.
