ContractBench-DROID Lite: runtime-readiness evaluation for robot policies.
A benchmark protocol for studying whether robot policy data carries the contracts, safety context, replay records, and evidence needed to evaluate learned behavior before physical deployment.
Runtime readiness
Measures whether robot episodes can be represented as bounded, reviewable actions under an explicit runtime contract.
Derived benchmark layer
Uses schemas, manifests, splits, and metadata references to evaluate public robot datasets without changing the underlying dataset.
Evidence-first evaluation
Treats replayability, validation status, and evidence references as first-class benchmark outputs, not afterthoughts.
Robot learning benchmarks usually measure task success, but deployment teams also need to know whether learned behavior can be checked, constrained, replayed, and audited before it is trusted near physical systems.
ContractBench-DROID Lite introduces a contract-first evaluation protocol for robot policy data. Instead of re-hosting raw robot data, it publishes derived metadata: upstream episode references, runtime contract schemas, evidence manifest schemas, split files, coverage metrics, and missing-metadata taxonomy.
The Lite v0 split references five successful DROID v1.0.1 episodes from distinct collection sites. The first baseline shows that the references can be represented with valid contracts and evidence manifests, while deployment-critical safety metadata remains largely absent from the public episode references.
Why this benchmark exists.
Robot learning benchmarks usually focus on task success. Deployment teams also need to know whether a policy output can be checked, constrained, replayed, and explained under real physical assumptions.
ContractBench-DROID Lite asks a deployment-facing question: does the data path expose enough structure to evaluate whether an action would be admissible under a runtime contract?
Evaluation dimensions
Contract completeness
Whether each episode can be mapped to an explicit action, observation, timing, and robot-profile contract.
Safety metadata coverage
Whether workspace bounds, contact limits, stop conditions, and operator context are available or must be inferred.
Replayability score
Whether observations, actions, timestamps, language goals, and calibration references are sufficient for deterministic review.
Evidence coverage
Whether an episode can produce a reviewable manifest with hashes, validation status, and failure annotations.
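As a rough illustration of how a completeness-style dimension could be scored, the sketch below flattens a nested contract and reports the fraction of leaf fields that are populated. The field names, the placeholder values, and the use of `None` as the explicit "unavailable" marker are assumptions made for this example; they are not taken from the published schemas, and the actual metric definition may differ.

```python
from typing import Any


def leaf_fields(node: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into {dotted.path: value} for its leaf fields."""
    if not isinstance(node, dict):
        return {prefix: node}
    leaves: dict[str, Any] = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        leaves.update(leaf_fields(value, path))
    return leaves


def completeness(contract: dict[str, Any]) -> float:
    """Fraction of leaf fields whose value is not the explicit None marker."""
    leaves = leaf_fields(contract)
    if not leaves:
        return 0.0
    populated = sum(1 for value in leaves.values() if value is not None)
    return populated / len(leaves)


# Hypothetical contract fragment with placeholder values (not real benchmark data).
example_contract = {
    "action": {"space": "placeholder_action_space", "dim": 7},
    "timing": {"nominal_control_hz": 15},
    "workspace": {"bounds": None},  # unavailable, but kept explicit
}

print(completeness(example_contract))  # 3 of 4 leaf fields populated -> 0.75
```

Keeping unavailable fields in the contract as explicit `None` values, rather than dropping them, is what lets a score like this distinguish "not provided" from "not part of the schema".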
Benchmark construction.
The Lite protocol is designed as a derived evaluation layer over public robotics datasets. The first target is DROID-style real-world robot manipulation data, normalized into contract and evidence manifest fields.
1. Reference
Link upstream public episodes, dataset versions, and source identifiers.
2. Normalize
Map actions, observations, timing, and robot context into contract fields.
3. Validate
Run schema checks and mark missing deployment-critical metadata.
4. Record
Emit evidence manifests that can be cited, hashed, and reviewed.
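The Record step above can be sketched as follows, assuming a manifest is a JSON object that carries a hash of its upstream reference and a hash of its own body. The function names, field names, status strings, and the choice of SHA-256 over canonical JSON are all assumptions for illustration; the published manifest schema may define these differently.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so hashes are stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(episode_id: str, dataset_version: str, missing_fields: list[str]) -> dict:
    """Build a hypothetical evidence manifest for one upstream episode reference."""
    source_ref = {"dataset": "DROID", "version": dataset_version, "episode_id": episode_id}
    body = {
        "source_reference": source_ref,
        "source_reference_hash": sha256_hex(canonical_json(source_ref)),
        "validation_status": "valid_with_missing_metadata" if missing_fields else "valid",
        "missing_fields": sorted(missing_fields),
    }
    # The manifest hash covers every field except the hash itself.
    return {**body, "manifest_hash": sha256_hex(canonical_json(body))}


def verify_manifest(manifest: dict) -> bool:
    """Recompute the manifest hash and compare it with the stored value."""
    body = {key: value for key, value in manifest.items() if key != "manifest_hash"}
    return manifest.get("manifest_hash") == sha256_hex(canonical_json(body))


manifest = build_manifest(
    "AUTOLab+44bb9c36+2023-11-25-10h-11m-14s", "1.0.1", ["workspace_bounds"]
)
print(verify_manifest(manifest))
```

Hashing a canonical serialization means a reviewer can recompute both hashes from the manifest alone and detect any later edit to the reference or the validation fields.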
Lite v0 source set.
Lite v0 uses real DROID v1.0.1 episode identifiers from the public DROID annotation mapping. The artifact stores references and derived metadata only; it does not re-host raw robot video, sensor streams, actions, trajectories, or annotation payloads.
| Collection site | Episode ID |
|---|---|
| AUTOLab | AUTOLab+44bb9c36+2023-11-25-10h-11m-14s |
| CLVR | CLVR+236539bc+2023-05-09-01h-17m-11s |
| GuptaLab | GuptaLab+553d1bd5+2023-04-20-12h-41m-59s |
| ILIAD | ILIAD+50aee79f+2023-08-15-23h-13m-19s |
| IPRL | IPRL+41a1536a+2023-11-14-14h-07m-16s |
DROID is credited as the upstream dataset. The DROID project page describes 76k demonstration trajectories, 350 hours of interaction, 564 scenes, and 86 tasks.
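The episode identifiers in the table above appear to follow a `site+token+timestamp` pattern. A minimal parsing sketch under that assumption is shown below; the `EpisodeRef` type and its field names are hypothetical, the middle token is treated as opaque, and the timestamp is parsed as a naive datetime because the identifier carries no time zone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EpisodeRef:
    site: str            # collection site prefix, e.g. "AUTOLab"
    token: str           # opaque middle token; its meaning is not assumed here
    recorded_at: datetime  # naive timestamp parsed from the identifier


def parse_episode_id(episode_id: str) -> EpisodeRef:
    """Split a 'site+token+timestamp' identifier into its three parts."""
    parts = episode_id.split("+")
    if len(parts) != 3:
        raise ValueError(f"unexpected episode ID format: {episode_id!r}")
    site, token, stamp = parts
    recorded_at = datetime.strptime(stamp, "%Y-%m-%d-%Hh-%Mm-%Ss")
    return EpisodeRef(site=site, token=token, recorded_at=recorded_at)


LITE_V0_EPISODE_IDS = [
    "AUTOLab+44bb9c36+2023-11-25-10h-11m-14s",
    "CLVR+236539bc+2023-05-09-01h-17m-11s",
    "GuptaLab+553d1bd5+2023-04-20-12h-41m-59s",
    "ILIAD+50aee79f+2023-08-15-23h-13m-19s",
    "IPRL+41a1536a+2023-11-14-14h-07m-16s",
]

refs = [parse_episode_id(episode_id) for episode_id in LITE_V0_EPISODE_IDS]
distinct_sites = {ref.site for ref in refs}
print(len(distinct_sites))  # 5 distinct collection sites
```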
Lite v0 baseline results
| Metric | Value | Interpretation |
|---|---|---|
| Contract schema validity | 1.000 | The draft runtime contract contains all required top-level fields. |
| Manifest schema validity | 1.000 | All five evidence manifests contain required top-level fields. |
| Split reference validity | 1.000 | All split rows reference known contract and manifest IDs. |
| Contract completeness | 0.952 | Most nested contract fields are populated; unavailable fields remain explicit. |
| Safety metadata coverage | 0.000 | Safety-critical fields are neither present nor inferred in the current public episode references. |
| Replay field coverage | 0.822 | Most replay-critical reference fields are present or inferred. |
| Average replay readiness score | 0.560 | The current Lite v0 references are partially replay-ready, with blockers. |
| Evidence reference coverage | 1.000 | Each manifest has a source reference hash and manifest hash. |
These values are generated from the Lite v0 derived metadata layer. They measure reference and metadata readiness, not robot task success.
Missing metadata taxonomy
The repeated missing fields show the gap between robot-learning data and deployment-readiness evidence. This does not mean the upstream dataset is deficient; DROID was designed as a large-scale robot learning dataset, not a deployment certification package.
ContractBench-DROID Lite measures the additional metadata layer needed when learned behavior is evaluated for runtime admissibility.
| Missing field | Count |
|---|---|
| camera_calibration | 3 |
| contact_limits | 5 |
| nominal_control_hz | 1 |
| operator_context | 5 |
| robot_world_transform | 5 |
| safety_zones | 5 |
| stop_conditions | 5 |
| workspace_bounds | 5 |
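The counts above can be read as per-field presence rates over the five Lite v0 episodes. The sketch below does that arithmetic. Which fields count toward safety metadata coverage is an assumption here: the list is taken from the Safety metadata coverage description (workspace bounds, contact limits, stop conditions, operator context), and the averaging rule is illustrative rather than the validator's confirmed formula.

```python
EPISODE_COUNT = 5  # size of the Lite v0 split

# Missing-field counts copied from the taxonomy table.
MISSING_COUNTS = {
    "camera_calibration": 3,
    "contact_limits": 5,
    "nominal_control_hz": 1,
    "operator_context": 5,
    "robot_world_transform": 5,
    "safety_zones": 5,
    "stop_conditions": 5,
    "workspace_bounds": 5,
}

# Assumed grouping of safety fields (hypothetical; see the lead-in).
SAFETY_FIELDS = ["workspace_bounds", "contact_limits", "stop_conditions", "operator_context"]


def presence_rate(missing: int, total: int = EPISODE_COUNT) -> float:
    """Fraction of episodes in which the field is present or inferred."""
    return (total - missing) / total


presence = {field: presence_rate(count) for field, count in MISSING_COUNTS.items()}
safety_coverage = sum(presence[field] for field in SAFETY_FIELDS) / len(SAFETY_FIELDS)

print(presence["camera_calibration"])  # 2 of 5 episodes -> 0.4
print(presence["nominal_control_hz"])  # 4 of 5 episodes -> 0.8
print(safety_coverage)                 # every assumed safety field missing in all 5 -> 0.0
```

Under this reading, the safety result stays at zero whether or not `safety_zones` is added to the group, since it is also missing from all five episodes.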
Reproducibility.
The research artifact contains JSON schemas, a Lite split, evidence manifest examples, a Hugging Face-ready dataset card, baseline metrics, and a validator. The validator regenerates the result table from the derived metadata files.
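To make the schema-validity rows concrete, here is a minimal sketch of a required-top-level-field check of the kind a validator might run. The required field names are hypothetical, loosely derived from the runtime contract description (robot profile, action space, observation space, timing, workspace, safety policy, calibration); the released validator and schemas may use different names and stricter rules.

```python
# Hypothetical required top-level fields for a runtime contract document.
REQUIRED_CONTRACT_FIELDS = (
    "robot_profile",
    "action_space",
    "observation_space",
    "timing",
    "workspace",
    "safety_policy",
    "calibration",
)


def missing_top_level(doc: dict, required=REQUIRED_CONTRACT_FIELDS) -> list[str]:
    """Return required keys absent from the document (a None value still counts as present)."""
    return [field for field in required if field not in doc]


def schema_validity(docs: list[dict], required=REQUIRED_CONTRACT_FIELDS) -> float:
    """Fraction of documents that contain every required top-level field."""
    if not docs:
        return 0.0
    valid = sum(1 for doc in docs if not missing_top_level(doc, required))
    return valid / len(docs)


# Placeholder documents: one complete, one missing two fields.
complete_doc = {field: None for field in REQUIRED_CONTRACT_FIELDS}
partial_doc = {
    field: None
    for field in REQUIRED_CONTRACT_FIELDS
    if field not in ("safety_policy", "calibration")
}

print(missing_top_level(partial_doc))
print(schema_validity([complete_doc, partial_doc]))
```

A check like this separates structural validity from content coverage: a contract can pass with every safety field explicitly marked unavailable, which is consistent with a schema validity of 1.000 alongside a safety metadata coverage of 0.000.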
The Xolver-authored derived materials are intended for CC BY 4.0 release after final approval. The artifact does not include or relicense upstream DROID raw data or annotation payloads.
Upstream attribution.
ContractBench-DROID Lite references DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset and the public DROID annotation repository hosted on Hugging Face.
Citation
@misc{xolver_contractbench_droid_lite_2026,
title = {ContractBench-DROID Lite: Runtime-Readiness Evaluation for Robot Policies},
author = {Xolver Research},
year = {2026},
howpublished = {Xolver Research},
url = {https://xolver.ai/research/contractbench-droid/}
}
Please also cite the upstream DROID dataset when using the ContractBench-DROID Lite references.
Research artifacts
Runtime contract schema
Defines the robot profile, action space, observation space, timing, workspace, safety policy, and calibration assumptions.
Evidence manifest schema
Records upstream references, validation status, field coverage, replay readiness, and evidence hashes for each episode.
Lite v0 split
References to five successful DROID v1.0.1 episodes, each from a distinct collection site.
Validator and metrics
A reproducible baseline that reports schema validity, metadata coverage, replay readiness, and missing fields.
Scope and limitations.
ContractBench-DROID Lite evaluates the data and evidence path around robot policy behavior. It does not certify hardware deployment by itself.
Its purpose is to make the deployment gap measurable: contracts, safety metadata, replay records, and evidence must be visible before learned behavior can be responsibly evaluated near physical systems.
The Lite v0 split is intentionally small and should not be interpreted as representative of the full DROID dataset. The current task-family label is coarse because the first pass uses upstream episode references without re-hosting annotation payloads.
Partner with us on deployment-grade evaluation.
Xolver is building the measurement layer between learned robot policies and real physical execution. We work with teams that care about evidence, safety boundaries, and field readiness.