Xolver Research

ContractBench-DROID Lite: runtime-readiness evaluation for robot policies.

A benchmark protocol for studying whether robot policy data carries the contracts, safety context, replay records, and evidence needed to evaluate learned behavior before physical deployment.


Runtime readiness

Measures whether robot episodes can be represented as bounded, reviewable actions under an explicit runtime contract.

Derived benchmark layer

Uses schemas, manifests, splits, and metadata references to evaluate public robot datasets without changing the underlying dataset.

Evidence-first evaluation

Treats replayability, validation status, and evidence references as first-class benchmark outputs, not afterthoughts.

Abstract

Robot learning benchmarks usually measure task success, but deployment teams also need to know whether learned behavior can be checked, constrained, replayed, and audited before it is trusted near physical systems.

ContractBench-DROID Lite introduces a contract-first evaluation protocol for robot policy data. Instead of re-hosting raw robot data, it publishes derived metadata: upstream episode references, runtime contract schemas, evidence manifest schemas, split files, coverage metrics, and a missing-metadata taxonomy.

The Lite v0 split references five successful DROID v1.0.1 episodes from distinct collection sites. The first baseline shows that the references can be represented with valid contracts and evidence manifests, while deployment-critical safety metadata is absent from the public episode references (safety metadata coverage of 0.000).

Why this benchmark exists.

Task success alone does not tell a deployment team whether a policy output can be checked, constrained, replayed, and explained under real physical assumptions.

ContractBench-DROID Lite asks a deployment-facing question: does the data path expose enough structure to evaluate whether an action would be admissible under a runtime contract?

Evaluation dimensions

Contract completeness

Whether each episode can be mapped to an explicit action, observation, timing, and robot-profile contract.

Safety metadata coverage

Whether workspace bounds, contact limits, stop conditions, and operator context are available or must be inferred.

Replayability score

Whether observations, actions, timestamps, language goals, and calibration references are sufficient for reviewable deployment analysis.

Evidence coverage

Whether an episode can produce a reviewable manifest with hashes, validation status, and failure annotations.

Benchmark construction.

The Lite protocol is designed as a derived evaluation layer over public robotics datasets. The first target is DROID-style real-world robot manipulation data, normalized into contract and evidence manifest fields.

1. Reference

Link upstream public episodes, dataset versions, and source identifiers.

2. Normalize

Map actions, observations, timing, and robot context into contract fields.

3. Validate

Run schema checks and mark missing deployment-critical metadata.

4. Record

Emit evidence manifests that can be cited, hashed, and reviewed.
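
The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the artifact's validator: the function names, the manifest keys, and the choice of SHA-256 over canonical JSON are all assumptions, and the list of deployment-critical fields is borrowed from the field names in the missing-metadata table.

```python
import hashlib
import json

# Hypothetical list of deployment-critical fields (assumption: names borrowed
# from the missing-metadata table; the real schema may define more or fewer).
DEPLOYMENT_CRITICAL = (
    "workspace_bounds",
    "contact_limits",
    "stop_conditions",
    "operator_context",
)


def reference(episode_id, dataset="DROID", version="1.0.1"):
    """Step 1: point at an upstream episode without copying any payload."""
    return {"dataset": dataset, "version": version, "episode_id": episode_id}


def normalize(ref, context):
    """Step 2: map known context into contract fields; unknown fields stay None."""
    contract = {"source": ref}
    for field in DEPLOYMENT_CRITICAL:
        contract[field] = context.get(field)
    return contract


def validate(contract):
    """Step 3: list the deployment-critical fields that are still missing."""
    return sorted(f for f in DEPLOYMENT_CRITICAL if contract.get(f) is None)


def record(contract, missing):
    """Step 4: emit a manifest plus a hash of its canonical JSON form."""
    manifest = {
        "contract": contract,
        "missing_fields": missing,
        "validation_status": "incomplete" if missing else "complete",
    }
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return manifest


ref = reference("AUTOLab+44bb9c36+2023-11-25-10h-11m-14s")
contract = normalize(ref, context={})  # no safety context is known upstream
manifest = record(contract, validate(contract))
print(manifest["validation_status"], manifest["missing_fields"])
```

Keeping missing fields as explicit `None` values, rather than dropping them, is what lets the validation step report them instead of silently passing.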

Lite v0 source set.

Lite v0 uses real DROID v1.0.1 episode identifiers from the public DROID annotation mapping. The artifact stores references and derived metadata only; it does not re-host raw robot video, sensor streams, actions, trajectories, or annotation payloads.

Collection site	Episode ID
AUTOLab	AUTOLab+44bb9c36+2023-11-25-10h-11m-14s
CLVR	CLVR+236539bc+2023-05-09-01h-17m-11s
GuptaLab	GuptaLab+553d1bd5+2023-04-20-12h-41m-59s
ILIAD	ILIAD+50aee79f+2023-08-15-23h-13m-19s
IPRL	IPRL+41a1536a+2023-11-14-14h-07m-16s
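
The identifiers in the table appear to follow a three-part `site+token+timestamp` pattern. The sketch below parses them on that assumption; the `EpisodeRef` type is hypothetical, the middle token is treated as an opaque string because its meaning is not stated here, and the timestamp format is inferred from the five IDs shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EpisodeRef:
    site: str              # collection site prefix, e.g. "AUTOLab"
    token: str             # opaque middle component (meaning not assumed)
    recorded_at: datetime  # parsed from the trailing timestamp


def parse_episode_id(episode_id: str) -> EpisodeRef:
    """Split an ID of the assumed form site+token+YYYY-MM-DD-HHh-MMm-SSs."""
    parts = episode_id.split("+")
    if len(parts) != 3:
        raise ValueError(f"unexpected episode ID shape: {episode_id!r}")
    site, token, stamp = parts
    recorded_at = datetime.strptime(stamp, "%Y-%m-%d-%Hh-%Mm-%Ss")
    return EpisodeRef(site, token, recorded_at)


LITE_V0_IDS = [
    "AUTOLab+44bb9c36+2023-11-25-10h-11m-14s",
    "CLVR+236539bc+2023-05-09-01h-17m-11s",
    "GuptaLab+553d1bd5+2023-04-20-12h-41m-59s",
    "ILIAD+50aee79f+2023-08-15-23h-13m-19s",
    "IPRL+41a1536a+2023-11-14-14h-07m-16s",
]

refs = [parse_episode_id(i) for i in LITE_V0_IDS]
sites = {r.site for r in refs}
print(f"{len(refs)} episodes from {len(sites)} distinct sites")
```

A check like this is one way to confirm that the split really draws each episode from a different collection site.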

DROID is credited as the upstream dataset. The DROID project page describes 76k demonstration trajectories, 350 hours of interaction, 564 scenes, and 86 tasks.

Lite v0 baseline results

Metric	Value	Interpretation
Contract schema validity	1.000	The draft runtime contract contains all required top-level fields.
Manifest schema validity	1.000	All five evidence manifests contain required top-level fields.
Split reference validity	1.000	All split rows reference known contract and manifest IDs.
Contract completeness	0.952	Most nested contract fields are populated; unavailable fields remain explicit.
Safety metadata coverage	0.000	Safety-critical fields are neither present nor inferred in the current public episode references.
Replay field coverage	0.822	Most replay-critical reference fields are present or inferred.
Average replay readiness score	0.560	The current Lite v0 references are partially replay-ready, with blockers.
Evidence reference coverage	1.000	Each manifest has a source reference hash and manifest hash.

These values are generated from the Lite v0 derived metadata layer. They measure reference and metadata readiness, not robot task success.
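
As an illustration of how a ratio such as split reference validity might be computed, the sketch below counts the split rows whose contract and manifest IDs both resolve. The row keys and ID strings are hypothetical, and the artifact's validator may define the metric differently.

```python
def split_reference_validity(split_rows, contract_ids, manifest_ids):
    """Fraction of split rows whose contract and manifest IDs are both known."""
    if not split_rows:
        return 0.0
    valid = sum(
        1
        for row in split_rows
        if row["contract_id"] in contract_ids and row["manifest_id"] in manifest_ids
    )
    return valid / len(split_rows)


# Hypothetical IDs: one draft contract shared by five manifests.
contract_ids = {"contract-draft"}
manifest_ids = {f"manifest-{i}" for i in range(5)}
rows = [{"contract_id": "contract-draft", "manifest_id": f"manifest-{i}"} for i in range(5)]

print(f"{split_reference_validity(rows, contract_ids, manifest_ids):.3f}")

# A row pointing at an unknown manifest lowers the score.
dangling = rows + [{"contract_id": "contract-draft", "manifest_id": "manifest-99"}]
print(f"{split_reference_validity(dangling, contract_ids, manifest_ids):.3f}")
```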

Missing metadata taxonomy

The repeated missing fields show the gap between robot-learning data and deployment-readiness evidence. This does not mean the upstream dataset is deficient; DROID was designed as a large-scale robot learning dataset, not a deployment certification package.

ContractBench-DROID Lite measures the additional metadata layer needed when learned behavior is evaluated for runtime admissibility.

Missing field	Count
camera_calibration	3
contact_limits	5
nominal_control_hz	1
operator_context	5
robot_world_transform	5
safety_zones	5
stop_conditions	5
workspace_bounds	5
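
The counts above can be turned into per-field presence rates over the five Lite v0 episodes. The sketch below does that arithmetic and averages the four fields named under "Safety metadata coverage". That grouping and the unweighted mean are assumptions; the validator's actual formula is not given here, although this reading agrees with the reported 0.000.

```python
EPISODES = 5  # size of the Lite v0 split

# Missing-field counts copied from the table above.
MISSING_COUNTS = {
    "camera_calibration": 3,
    "contact_limits": 5,
    "nominal_control_hz": 1,
    "operator_context": 5,
    "robot_world_transform": 5,
    "safety_zones": 5,
    "stop_conditions": 5,
    "workspace_bounds": 5,
}

# Assumed grouping: the fields listed in the safety metadata coverage dimension.
SAFETY_FIELDS = ("workspace_bounds", "contact_limits", "stop_conditions", "operator_context")


def presence_rate(field):
    """Share of episodes in which the field is not reported missing."""
    missing = MISSING_COUNTS.get(field, 0)
    return (EPISODES - missing) / EPISODES


safety_coverage = sum(presence_rate(f) for f in SAFETY_FIELDS) / len(SAFETY_FIELDS)

for field in sorted(MISSING_COUNTS):
    print(f"{field}: {presence_rate(field):.1f}")
print(f"safety coverage (assumed unweighted mean): {safety_coverage:.3f}")
```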

Reproducibility.

The research artifact contains JSON schemas, a Lite split, evidence manifest examples, a Hugging Face-ready dataset card, baseline metrics, and a validator. The validator regenerates the result table from the derived metadata files.

The Xolver-authored derived materials are intended for CC BY 4.0 release after final approval. The artifact does not include or relicense upstream DROID raw data or annotation payloads.

Upstream attribution.

ContractBench-DROID Lite references two upstream sources: the DROID dataset ("DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset") and the public DROID annotation repository maintained on Hugging Face.


Citation

@misc{xolver_contractbench_droid_lite_2026,
  title        = {ContractBench-DROID Lite: Runtime-Readiness Evaluation for Robot Policies},
  author       = {Xolver Research},
  year         = {2026},
  howpublished = {Xolver Research},
  url          = {https://xolver.ai/research/contractbench-droid/}
}

Please also cite the upstream DROID dataset when using the ContractBench-DROID Lite references.

Research artifacts

Runtime contract schema

Defines the robot profile, action space, observation space, timing, workspace, safety policy, and calibration assumptions.

Evidence manifest schema

Records upstream references, validation status, field coverage, replay readiness, and evidence hashes for each episode.

Lite v0 split

References to five successful DROID v1.0.1 episodes, each from a distinct collection site.

Validator and metrics

A reproducible baseline that reports schema validity, metadata coverage, replay readiness, and missing fields.

Scope and limitations.

ContractBench-DROID Lite evaluates the data and evidence path around robot policy behavior. It does not certify hardware deployment by itself.

Its purpose is to make the deployment gap measurable: contracts, safety metadata, replay records, and evidence must be visible before learned behavior can be responsibly evaluated near physical systems.

The Lite v0 split is intentionally small and should not be interpreted as representative of the full DROID dataset. The current task-family label is coarse because the first pass uses upstream episode references without re-hosting annotation payloads.

Partner with us on deployment-grade evaluation.

Xolver is building the measurement layer between learned robot policies and real physical execution. We work with teams that care about evidence, safety boundaries, and field readiness.
