Research · Dec 16, 2025 · 7 min read

How to train an RFM (Robotics Foundation Model)

#Robotics #Foundation Models #Training #Research

Training a robotics foundation model is not an exercise in scaling parameters. It is an exercise in deciding what kind of world you want a machine to survive in. Unlike language or vision models, an RFM does not live in files, tokens, or frozen datasets. It lives in time, friction, latency, contact, failure, and recovery.

This is why most attempts at large scale robotics learning fail quietly. They begin with the wrong abstraction. They assume robotics is just another multimodal problem. It is not. Robotics is a closed loop system where perception, reasoning, and action continuously interfere with one another.

An RFM is best understood as a general policy that can span tasks, environments, and embodiments. Not a task specific controller. Not a demo trained for one arm, one table, one lighting condition. It is a system that can perceive intent, reason under uncertainty, and act in ways that remain stable when the world pushes back.

Today, drawing on our experience, we share how such a model is actually trained, end to end, without pretending that physics can be abstracted away.


Define intelligence before you define models

Before collecting data or choosing architectures, define what intelligence means in your system.

Is the robot expected to manipulate objects, navigate spaces, or collaborate with humans? Is it required to learn new tasks from language, or only to execute known skills? Are failures acceptable if recoverable, or must the system be conservative by default? What latency budget does the system have? Does inference need to run at 5 Hz, 10 Hz, or 50 Hz?

These questions are not philosophical. They directly constrain architecture, training signals, and deployment.
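The latency question in particular has a mechanical answer. As a minimal sketch, with invented component names and timings, you can check whether a perception-to-action pipeline fits inside one control step at a given frequency:

```python
# Hypothetical latency-budget check. The pipeline stages and their
# timings below are illustrative assumptions, not measurements.

def step_budget_ms(control_hz: float) -> float:
    """Time available per control step at a given frequency."""
    return 1000.0 / control_hz

def fits_budget(latencies_ms: dict[str, float], control_hz: float) -> bool:
    """True if the summed pipeline latency fits inside one control step."""
    return sum(latencies_ms.values()) <= step_budget_ms(control_hz)

pipeline = {"perception": 25.0, "policy_forward": 60.0, "decode": 5.0}
ok_at_10hz = fits_budget(pipeline, control_hz=10)  # 90 ms vs a 100 ms budget
ok_at_50hz = fits_budget(pipeline, control_hz=50)  # 90 ms vs a 20 ms budget
```

A pipeline that comfortably fits at 10 Hz can be impossible at 50 Hz, which is exactly why the frequency question constrains architecture before any training begins.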

Most RFM projects fail because teams gather data before defining the distribution shift the model must survive. As a result, the model performs well in controlled settings and collapses under mild perturbations.

If you cannot articulate the failure modes you are designing for, you are not training a foundation model. You are collecting demos.


Data is not volume, it is causality

RFM training is often framed as a data scaling problem. That is only partially true. The real challenge is not how much data you have, but whether the data teaches causality rather than correlation.

Robotics data is expensive, biased, and shaped by embodiment. Sensors encode perspective. Actuators encode constraints. Human operators encode habits. If you simply aggregate trajectories, the model learns these biases instead of the task.

Robust RFM pipelines deliberately combine four classes of data.

Real world demonstrations anchor the model to physics. Teleoperation, kinesthetic teaching, or expert policies teach contact dynamics, friction, and feasibility.

Simulation rollouts provide breadth. They allow exploration of rare events, edge cases, and failures that are unsafe or slow to produce in reality. Domain randomization here is not about noise, but about uncertainty that mirrors reality.

Corrective and recovery data is the most valuable and the most ignored. Successful trajectories teach optimism. Near misses, aborts, and human interventions teach robustness. Without this data, models fail catastrophically instead of gracefully.

Language grounded annotations provide abstraction. Not just task names, but intent, constraints, and success conditions. This is what allows generalisation beyond memorised trajectories.

The objective is not balance. The objective is coverage of cause and effect.
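The four data classes above are typically combined through a weighted sampling mixture. A minimal sketch, with illustrative weights that are assumptions rather than recommendations:

```python
import random

# Hypothetical mixture over the four data classes described above.
# The weights are illustrative, not tuned recommendations.
MIXTURE = {
    "real_demos": 0.35,          # teleoperation, kinesthetic teaching
    "sim_rollouts": 0.35,        # domain-randomised simulation
    "recovery_data": 0.20,       # near misses, aborts, interventions
    "language_annotated": 0.10,  # intent, constraints, success conditions
}

def sample_source(rng: random.Random) -> str:
    """Draw one data source according to the mixture weights."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
batch_sources = [sample_source(rng) for _ in range(8)]
```

The point of making the mixture explicit is that coverage of cause and effect becomes something you can audit, rather than an accident of whatever data was easiest to collect.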


Embodiment is a first class problem

Generalising across embodiments is not a slogan. It is a technical challenge.

A 7 degree of freedom arm, a 4 degree of freedom arm, and a quadruped do not share an action space. If you ignore this, cross embodiment generalisation is impossible.

Most successful RFMs standardise actions and state through abstraction. Proprioception is normalised into relative joint states, velocities, or end effector frames. Actions are tokenised into latent representations rather than raw torques.

Common approaches include action chunking, trajectory prediction, or latent action codes learned via VQ style encoders. The model does not predict individual motor commands. It predicts short horizon behaviours that can be mapped onto different bodies through embodiment specific decoders.

This separation is what allows the same policy to control different hardware without retraining from scratch.
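The separation can be sketched concretely. Assuming invented shapes and a random matrix standing in for a learned decoder, a shared latent action chunk maps onto bodies with different degrees of freedom:

```python
import numpy as np

# Illustrative sketch of embodiment-specific decoding: the shared policy
# emits a short-horizon latent action chunk; a per-embodiment decoder maps
# it into that body's joint space. All names and shapes are assumptions.

LATENT_DIM, CHUNK = 16, 8  # latent size, actions per chunk

class EmbodimentDecoder:
    def __init__(self, dof: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stand-in for a learned linear map; in practice this is trained.
        self.w = rng.normal(size=(LATENT_DIM, dof)) * 0.1

    def decode(self, latent_chunk: np.ndarray) -> np.ndarray:
        """(CHUNK, LATENT_DIM) latents -> (CHUNK, dof) joint targets."""
        return latent_chunk @ self.w

latents = np.zeros((CHUNK, LATENT_DIM))          # one chunk from the shared policy
arm7 = EmbodimentDecoder(dof=7).decode(latents)  # targets for a 7 DoF arm
arm4 = EmbodimentDecoder(dof=4).decode(latents)  # targets for a 4 DoF arm
```

The same latent chunk produces correctly shaped commands for either body; only the decoder changes per embodiment.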


Architecture follows control, not fashion

Most modern RFMs use a vision language action structure. Cameras provide state. Language provides goal conditioning. The model outputs actions or plans.

The critical architectural decision is not the transformer variant. It is how time and feedback are handled.

High capacity models are slow. Motors are fast. Physics does not wait for attention layers to converge.

For this reason, RFMs rarely operate at motor control frequencies. Instead, they act at a semantic rate. They predict short horizon trajectories, action chunks, or goal states. Classical controllers handle interpolation, stabilisation, and safety at high frequency.

This frequency separation is not a compromise. It is how biological systems work.

A typical stack looks like this. The RFM runs at low frequency and reasons about intent and strategy. A mid level controller translates these outputs into feasible motion plans. A low level controller enforces safety, smoothness, and constraints.

End to end purity is attractive in papers. Hybrid systems survive contact with reality.
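The frequency separation in the stack above can be sketched as two nested loops. All rates and the trivial interpolation scheme are illustrative assumptions:

```python
# Sketch of frequency separation: a slow policy sets targets, a fast
# inner loop interpolates toward them. Numbers and names are invented.

POLICY_HZ, CONTROL_HZ = 5, 100
STEPS_PER_TARGET = CONTROL_HZ // POLICY_HZ  # inner-loop ticks per policy output

def slow_policy(t: int) -> float:
    """Stand-in for the RFM: emits a new setpoint at low frequency."""
    return float(t)  # e.g. a joint target

def inner_loop(current: float, target: float) -> list[float]:
    """Mid/low level: step toward the target at control frequency."""
    step = (target - current) / STEPS_PER_TARGET
    return [current + step * (i + 1) for i in range(STEPS_PER_TARGET)]

state = 0.0
trajectory = []
for t in range(1, 4):  # three slow-policy ticks
    ticks = inner_loop(state, slow_policy(t))
    trajectory.extend(ticks)
    state = ticks[-1]
```

Each slow tick yields twenty high-frequency commands, so the motors never wait on the model; in a real stack the inner loop would also enforce limits and smoothness rather than just interpolate.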


Training happens in stages, not once

RFMs are not trained end to end in a single pass. They are grown through stages.

First comes representation learning. Vision, proprioception, and language are aligned into a shared latent space. Masked prediction, contrastive objectives, and future state modelling are common here. Control is not yet involved.

Second comes imitation. The model learns to map states and goals to action representations using demonstrations. Losses are supervised. Stability matters more than optimality.

Third comes interaction. Reinforcement learning, online fine tuning, or human in the loop correction exposes the model to its own mistakes. This is where robustness is learned.

These stages are not linear. Teams loop between them continuously as new data exposes new failure modes.
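Of the three stages, imitation is the simplest to make concrete. A minimal sketch of its supervised objective, with toy tensor shapes that are assumptions:

```python
import numpy as np

# Minimal sketch of the imitation stage: a supervised loss between
# predicted and demonstrated action chunks. Shapes are illustrative.

def bc_loss(pred_actions: np.ndarray, demo_actions: np.ndarray) -> float:
    """Mean squared error over a batch of action chunks."""
    return float(np.mean((pred_actions - demo_actions) ** 2))

demo = np.ones((32, 8, 7))   # batch, chunk length, degrees of freedom
pred = np.zeros_like(demo)   # untrained policy output
loss = bc_loss(pred, demo)
```

Representation learning and interaction use richer objectives, but the imitation stage really is this plain: regress toward what the demonstrator did, and prioritise stable optimisation over squeezing out the last bit of loss.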


Sim to real is a loop, not a bridge

Sim to real transfer is often described as a milestone. In practice, it is a gradient.

Simulation enables scale. Reality provides truth.

Early training leans heavily on simulation to explore. As deployment begins, real world logs are fed back into the simulator. Physics parameters are recalibrated. Contact models are refined. Latency and sensor noise are updated.

This real to sim feedback creates a living digital twin. Simulation becomes less idealised and more predictive. It stops being a sandbox and starts becoming an instrument.
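The recalibration step can be sketched as an optimisation over simulator parameters against real logs. Everything below, including the one-parameter physics model and the log values, is invented for illustration:

```python
import numpy as np

# Toy sketch of real-to-sim calibration: choose the sim friction value
# that best reproduces logged real decelerations. Model and data invented.

def sim_deceleration(friction: float, v0: float) -> float:
    """Trivial stand-in physics model: deceleration scales with friction."""
    return friction * v0

real_logs = [(1.0, 0.42), (2.0, 0.84), (0.5, 0.21)]  # (v0, observed decel)

def calibrate(candidates: np.ndarray) -> float:
    """Grid search for the friction value minimising log discrepancy."""
    errors = [
        sum((sim_deceleration(f, v0) - obs) ** 2 for v0, obs in real_logs)
        for f in candidates
    ]
    return float(candidates[int(np.argmin(errors))])

best_friction = calibrate(np.linspace(0.0, 1.0, 101))
```

Real calibration loops fit contact models, latency, and sensor noise rather than a single scalar, but the shape is the same: real logs define the loss, and the simulator's parameters are what gets trained.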

If real failures do not exist in your simulator, your simulator is lying to you.


Safety is trained and enforced

Safety in RFMs is both architectural and learned.

Certain constraints must be hard. Joint limits. Collision boundaries. Emergency stops. These are enforced outside the model.

Other behaviours can and should be learned. When to slow down. When to abort. How to behave under uncertainty.

This requires explicit signals. Unsafe actions are penalised. Near misses are logged. Human overrides are treated as supervision, not noise.
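The hard-constraint side is the easiest to make concrete: a filter outside the model that clamps whatever the policy proposes. Joint limits below are invented for illustration:

```python
import numpy as np

# Sketch of a hard safety layer outside the model: learned actions are
# clamped to joint limits before reaching hardware. Limits are invented.

JOINT_LOW = np.array([-2.9, -1.8, -2.9])
JOINT_HIGH = np.array([2.9, 1.8, 2.9])

def safety_filter(action: np.ndarray) -> np.ndarray:
    """Enforce joint limits regardless of what the policy proposed."""
    return np.clip(action, JOINT_LOW, JOINT_HIGH)

proposed = np.array([3.5, 0.0, -4.0])  # out-of-range policy output
safe = safety_filter(proposed)
```

Because the filter sits outside the model, it holds even when the policy is wrong, which is precisely why joint limits, collision boundaries, and emergency stops should never be left to learning alone.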

Evaluation must reflect this. Success rate alone is meaningless. Intervention frequency, recovery time, and degradation under stress matter far more.
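As a sketch of what those metrics look like in practice, assuming an invented event-log format:

```python
# Sketch of the evaluation metrics above, computed from a hypothetical
# event log. The log format, events, and run length are invented.

events = [
    {"t": 10.0, "type": "intervention"},
    {"t": 12.5, "type": "recovered"},
    {"t": 40.0, "type": "intervention"},
    {"t": 45.0, "type": "recovered"},
]
run_hours = 2.0

intervention_times = [e["t"] for e in events if e["type"] == "intervention"]
recovery_times = [e["t"] for e in events if e["type"] == "recovered"]

interventions_per_hour = len(intervention_times) / run_hours
mean_recovery_s = sum(
    r - i for i, r in zip(intervention_times, recovery_times)
) / len(intervention_times)
```

Two interventions over two hours with recoveries of 2.5 s and 5 s says far more about a deployed system than a success-rate percentage on a curated task set.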


Evaluate how systems degrade, not how they peak

RFMs are often judged by demos. This is misleading.

The real test is degradation. How performance changes as lighting shifts. How behaviour changes when latency increases. What happens after hours of continuous operation.

Foundation models are valuable not because they never fail, but because they fail predictably and recoverably.
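A degradation evaluation can be sketched as a sweep: hold the task fixed, increase a perturbation such as lighting shift or added latency, and watch the success rate fall. The stand-in policy below, with its assumed decay, is purely illustrative:

```python
import random

# Sketch of a degradation sweep. The policy is a stand-in whose success
# probability decays with perturbation strength; all values are invented.

def toy_policy_success(perturbation: float, rng: random.Random) -> bool:
    """Stand-in rollout: success probability shrinks as perturbation grows."""
    return rng.random() < max(0.0, 1.0 - perturbation)

def degradation_curve(levels, trials=200, seed=0):
    """Estimated success rate at each perturbation level."""
    rng = random.Random(seed)
    return {
        lvl: sum(toy_policy_success(lvl, rng) for _ in range(trials)) / trials
        for lvl in levels
    }

curve = degradation_curve([0.0, 0.3, 0.6, 0.9])
```

The shape of this curve, not its peak, is the honest report card: a model that degrades gradually and predictably is deployable, while one that falls off a cliff at mild perturbation is a demo.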


Why this approach matters

Teams that make real progress in robotics do not treat it as a model problem. They treat it as a system problem. At Xolver, this methodology did not emerge from imitation. It emerged from first principles. From building systems that must operate in the physical world, under uncertainty, at scale. If intelligence is to move beyond screens and into environments, it must be trained with respect for physics, obsession with feedback loops, and humility about what models can and cannot do.

A robotics foundation model is not trained once. It is raised.

