<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
  <title>Xolver Blog</title>
  <atom:link href="https://xolver.ai/feed.xml" rel="self" type="application/rss+xml" />
  <link>https://xolver.ai</link>
  <description>Physical intelligence platform for safe robotic operations.</description>
  <language>en-us</language>
  <lastBuildDate>Tue, 31 Mar 2026 12:12:01 GMT</lastBuildDate>
  
    <item>
      <title>The calculus of smooth motion. Solving for robotic snap</title>
      <link>https://xolver.ai/blog/calculus-of-smooth-motion</link>
      <guid isPermaLink="true">https://xolver.ai/blog/calculus-of-smooth-motion</guid>
      <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
      <description>To avoid mechanical stress and rapid hardware degradation, an AI cannot simply connect the dots. We explore the higher-order derivatives of physical motion and why deterministic interpolation must sit between intent and actuation.</description>
      <content:encoded><![CDATA[Anyone who has ever deployed a heavy industrial manipulator knows the sound of 'robotic snap'. It is the violent, shuddering crack that occurs when a high-torque actuator is given an abrupt change in trajectory. In the digital realm, a command can change instantly. In the physical realm, it collides with inertia.<p></p>In a simulation, a foundation model can easily output a piecewise trajectory. The arm is at Point A, and in the next timestep, it is told to be at Point B. Many early Vision-Language-Action (VLA) implementations treated robot pathing like digital drawing, connecting dots with straight lines. When these discrete, angular paths are fed directly into the high-gain feedback loops of physical motor controllers, the result is crippling mechanical stress, jitter, and rapid hardware degradation.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The Higher-Order Derivatives of Motion</h2><p></p>Moving a physical mass smoothly requires respecting the calculus of motion. It is not enough to guarantee continuous position ($C^0$ continuity) or even continuous velocity ($C^1$ continuity). To avoid snapping a gearbox or dropping a payload, the control system must ensure continuous acceleration ($C^2$ continuity) and manage the rate of change of acceleration, known as jerk.<p></p>This turns trajectory generation from a simple geometric problem into a non-linear optimal control problem involving higher-order derivatives of position with respect to time. Mathematically, true fluidity requires minimizing an objective function that penalizes abruptness, the most common being the integral of squared jerk:<p></p>$$ J = \frac{1}{2} \int_0^T \left\| \frac{d^3 \mathbf{x}(t)}{dt^3} \right\|^2 dt $$<p></p>subject to the boundary conditions of initial and final position, velocity, and acceleration. 
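<p></p>The objective above can be probed numerically before any theory is applied. A minimal pure-Python sketch (the grid size and the perturbation are our own illustrative choices) evaluates the discrete squared-jerk cost for two rest-to-rest trajectories with identical boundary conditions:<p></p>

```python
# Discrete check of the squared-jerk objective J = 0.5 * integral ||x'''(t)||^2 dt
# for 1-D rest-to-rest moves from x(0)=0 to x(1)=1.
# Grid size and profiles are illustrative choices, not fixed by the text.

def squared_jerk(x, dt):
    # Third forward differences approximate x'''; the sum approximates the integral.
    jerk = [(x[i + 3] - 3 * x[i + 2] + 3 * x[i + 1] - x[i]) / dt ** 3
            for i in range(len(x) - 3)]
    return 0.5 * sum(j * j for j in jerk) * dt

N = 1000
dt = 1.0 / N
ts = [i * dt for i in range(N + 1)]

# A smooth candidate, and a perturbation of it whose position, velocity,
# and acceleration still vanish at both endpoints (same boundary conditions).
smooth = [10 * t**3 - 15 * t**4 + 6 * t**5 for t in ts]
bumped = [x + 5.0 * t**3 * (1 - t)**3 for x, t in zip(smooth, ts)]

J_smooth = squared_jerk(smooth, dt)   # analytically 360 for this profile
J_bumped = squared_jerk(bumped, dt)   # strictly larger
```

<p></p>Any admissible perturbation of the smooth candidate raises the cost, which is exactly the optimality condition the calculus of variations formalizes.<p></p>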
Applying the calculus of variations and Pontryagin's Minimum Principle to this functional reveals that the unconstrained optimal trajectory in 1D space is a fifth-order (quintic) polynomial:<p></p>$$ x(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5 $$<p></p>The coefficients $\{a_0, \dots, a_5\}$ strictly depend on the boundary states. However, in physical robotics, the problem is highly constrained. We must enforce hard limits on joint velocities $\dot{q}_{max}$, accelerations $\ddot{q}_{max}$, and torques $\tau_{max}$. Thus, the true constrained optimization problem becomes:<p></p>$$ \min_{\mathbf{x}(t)} \int_0^T \left\| \dddot{\mathbf{x}}(t) \right\|^2 dt \quad \text{subject to} \quad \mathbf{x}(t) \in \mathcal{C}_{free}, \; \left\| \dot{\mathbf{q}}(t) \right\| \le \dot{\mathbf{q}}_{max}, \; \left\| \boldsymbol{\tau}(t) \right\| \le \boldsymbol{\tau}_{max} $$<p></p>This requires solving inverse kinematics (mapping task-space $\mathbf{x}$ to joint-space $\mathbf{q}$ via the Jacobian $\mathbf{J}(\mathbf{q})$) and recursive Newton-Euler dynamics at high frequency.<p></p>When a neural network outputs discrete action chunks, it is fundamentally ignorant of these continuous-time constraints. It proposes where the arm should go, not the physics of how it gets there.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The Algorithmic Shock Absorber</h2><p></p>This is exactly why a foundation model should never directly command a motor. At Xolver, we structure our control spine to respect this boundary.<p></p>In our architecture, the foundation model acts as an intent engine. It operates at a relatively low frequency (e.g., 10 Hz), outputting a sequence of semantic waypoints or latent action chunks based on its interpretation of the scene.<p></p>These discrete outputs are then intercepted by the Deterministic Enforcement Layer. This layer acts as an algorithmic shock absorber. 
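<p></p>Fitting that quintic to a given set of boundary states is nothing more than a small linear solve. A self-contained pure-Python sketch (function names are ours; this is a toy, not a production planner):<p></p>

```python
# Fit x(t) = a0 + a1 t + a2 t^2 + a3 t^3 + a4 t^4 + a5 t^5 to six boundary
# conditions: position, velocity, and acceleration at t=0 and at t=T.
# Illustrative sketch only; no hardware limits are enforced here.

def quintic_coeffs(x0, v0, a0, xT, vT, aT, T):
    c0, c1, c2 = x0, v0, a0 / 2.0          # conditions at t=0 fix a0..a2
    # Remaining 3x3 linear system for a3..a5 from the conditions at t=T.
    A = [[T**3,     T**4,      T**5],
         [3 * T**2, 4 * T**3,  5 * T**4],
         [6 * T,    12 * T**2, 20 * T**3]]
    b = [xT - (c0 + c1 * T + c2 * T**2),
         vT - (c1 + 2 * c2 * T),
         aT - 2 * c2]
    # Tiny Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        sol[r] = (b[r] - sum(A[r][c] * sol[c] for c in range(r + 1, 3))) / A[r][r]
    return [c0, c1, c2] + sol

# A rest-to-rest unit move over one second recovers 10t^3 - 15t^4 + 6t^5.
coeffs = quintic_coeffs(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0)
```

<p></p>A fit of this kind, with the hardware's velocity, acceleration, and jerk envelopes added as hard constraints, is what the enforcement layer performs.<p></p>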
It takes the rough, probabilistic waypoints and fits a physically realizable, $C^2$-continuous spline that strictly respects the hardware's maximum torque, velocity, and jerk envelopes.<p></p>The enforcement layer effectively translates the VLA's step-functions into smooth, drivable trajectories. It then streams this optimized signal to the edge runtime and hardware controllers at high frequency (e.g., 500 Hz or 1000 Hz). The physical robot never feels the 'thought process' of the neural network; it only feels the mathematically verified momentum.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Intelligence Requires Grounding</h2><p></p>Intelligence without physical grounding is destructive. By architecturally separating the 'nervous system' that plans from the 'spinal cord' that executes, we ensure that the scale and complexity of modern foundation models do not destroy the machinery they are tasked to operate.<p></p><span class='font-semibold text-terracotta'>At Xolver, we believe that true physical AI must speak the language of continuous calculus, not just discrete representation.</span>]]></content:encoded>
    </item>
    <item>
      <title>The mathematics of spatial constraints. Why foundation models need bounded execution</title>
      <link>https://xolver.ai/blog/mathematics-of-spatial-constraints</link>
      <guid isPermaLink="true">https://xolver.ai/blog/mathematics-of-spatial-constraints</guid>
      <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
      <description>To guarantee safe actuation, we must look beyond the weights of a neural network and return to the mathematics of spatial constraints. We explore why implicit learning is insufficient for physical autonomy and how differentiable optimization at the edge guarantees safe execution.</description>
      <content:encoded><![CDATA[The arrival of Vision-Language-Action (VLA) models represents a paradigm shift, seemingly solving two of the hardest problems in robotics, intent interpretation and visual perception. By scaling up foundation models, we can now issue a high-level semantic command and have a system parse the visual scene to propose a surprisingly coherent sequence of actions.<p></p>Yet, moving from digital token generation to physical actuation introduces a critical, often overlooked disconnect. The digital realm is forgiving. A hallucinated word merely requires a backspace. The physical world, however, is bound by hard, unforgiving geometry and classical mechanics. When building infrastructure for physical autonomy, we quickly confront a stark reality. Implicit learning, simply scaling a model until it "learns" not to crash, is fundamentally insufficient for safety-critical operations. The long tail of physical edge cases cannot be brute-forced by data alone.<p></p>To guarantee safe actuation, we must look beyond the weights of a neural network and return to the rigorous mathematics of spatial constraints.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The Geometry of Reachability and Configuration Spaces</h2><p></p>When a robotics foundation model proposes an action, it typically operates in task space, a human-interpretable 3D Cartesian coordinate system. However, the robot itself moves in Configuration Space ($\mathcal{C}$-space), a high-dimensional manifold where every point represents a complete specification of every joint angle and actuator state. 
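<p></p>The structure of configuration space is easiest to see in a toy version. The sketch below (pure Python; the two-link geometry and the disc obstacle are invented for illustration) maps a joint configuration to link segments and tests it for collision:<p></p>

```python
# A planar 2-DOF arm: each configuration (q1, q2) maps to two line segments.
# A configuration is valid iff neither link intersects the disc obstacle.
# Link lengths, obstacle centre, and radius are invented for illustration.
import math

L1, L2 = 1.0, 1.0
OBS_C, OBS_R = (1.2, 0.8), 0.4

def seg_point_dist(p, q, c):
    """Euclidean distance from point c to the segment p-q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    seg2 = dx * dx + dy * dy
    t = 0.0 if seg2 == 0 else max(0.0, min(1.0,
        ((c[0] - p[0]) * dx + (c[1] - p[1]) * dy) / seg2))
    ex, ey = p[0] + t * dx, p[1] + t * dy
    return math.hypot(c[0] - ex, c[1] - ey)

def collision_free(q1, q2):
    """True iff neither link of the arm touches the obstacle."""
    elbow = (L1 * math.cos(q1), L1 * math.sin(q1))
    wrist = (elbow[0] + L2 * math.cos(q1 + q2),
             elbow[1] + L2 * math.sin(q1 + q2))
    return (seg_point_dist((0.0, 0.0), elbow, OBS_C) > OBS_R and
            seg_point_dist(elbow, wrist, OBS_C) > OBS_R)

# Reaching straight through the obstacle collides; reaching away does not.
toward = math.atan2(0.8, 1.2)
```

<p></p>Even with only two joints, the boundary between valid and colliding configurations is curved and non-convex; real manipulators add many more joints, self-collision, and moving obstacles.<p></p>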
For a standard 6-DOF or 7-DOF robotic manipulator, this space is incredibly complex.<p></p>Within this manifold, the space is partitioned into $\mathcal{C}_{free}$ (the set of all valid, collision-free configurations) and $\mathcal{C}_{obs}$ (the set of configurations that result in self-collisions, environmental collisions, or kinematic singularities). The boundary separating these two sets is highly non-linear, non-convex, and computationally intensive to map.<p></p>Foundation models, by their very nature, are probabilistic mapping functions. They output a distribution over possible next tokenized actions. Even a highly capable, massively scaled model might assign a 99.9% probability to a trajectory that exists entirely within $\mathcal{C}_{free}$, while assigning a 0.1% probability to a trajectory that barely intersects $\mathcal{C}_{obs}$.<p></p>In a text-generation task, a 0.1% error rate is an engineering triumph. In physical operations, whether navigating a cluttered manufacturing cell, orchestrating precision logistics, or operating in unstructured field environments, a 0.1% chance of executing a catastrophic trajectory is unacceptable. The foundation model cannot be trusted to independently self-enforce the rigid mathematical boundaries of $\mathcal{C}_{free}$.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Formulating the Deterministic Enforcement Layer</h2><p></p>To bridge the gap between probabilistic intent and physical safety, advanced AI architectures must strictly decouple proposal from execution. The neural model proposes an intent, but a deterministic enforcement layer must constrain it.<p></p>Mathematically, this translates to a constrained non-linear optimization problem. 
Given a proposed state $x_{prop}$ generated by the foundation model, we must find an executable state $x_{exec}$ that minimizes the deviation from the model's proposal, subject to the strict condition that $x_{exec}$ lies safely within the bounds of physical reality.<p></p>$$ \min_{x_{exec}} || x_{exec} - x_{prop} ||^2 $$<p></p>$$ \text{subject to } x_{exec} \in \mathcal{C}_{free} $$<p></p>This optimization acts as a mathematical projection, requiring the system to account for multiple intersecting constraints simultaneously.<p></p><ul class='list-disc list-inside space-y-2 mt-4 ml-4'><li><b>Kinematic Constraints.</b> Ensuring target poses are mathematically reachable given the manipulator's inverse kinematics and joint limits.</li><li><b>Spatial Constraints.</b> Preventing intersection with known geometric obstacles, dynamic actors, and defined keep-out zones using bounding volumes and spatial fields.</li><li><b>Dynamic Constraints.</b> Bounding velocity, acceleration, and jerk to maintain physical stability and protect hardware integrity.</li></ul><p></p>By treating the foundation model's output as an unconstrained structural prior and passing it through this deterministic projection, we guarantee that the final execution trace is mathematically bounded, strictly safe, and auditable.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Differentiable Optimization at the Edge</h2><p></p>Historically, solving these non-convex spatial optimization problems in real-time was a massive bottleneck. 
Traditional motion planners and sequential solvers require significant CPU overhead, making them poorly suited for the low-latency, high-frequency control loops that modern robotic systems demand.<p></p>However, the mathematics of modern AI provides a remarkably elegant solution to the very problem it created.<p></p>By formulating these spatial, kinematic, and dynamic constraints as entirely differentiable functions, we can harness the exact same hardware-accelerated tensor primitives (GPUs/TPUs) used to run the foundation models. Utilizing modern frameworks designed for high-performance numerical computing (such as JAX or PyTorch), we can evaluate massive batches of spatial constraints in parallel.<p></p>Instead of searching for a path through discrete sampling, differentiable optimization allows us to compute the gradients of the constraint violations. If a proposed trajectory intersects an obstacle, the gradient points precisely in the direction of the safest, most efficient route back into $\mathcal{C}_{free}$. This allows systems to optimize reachability, layout compliance, and collision avoidance simultaneously in a matter of milliseconds.<p></p>This is the profound architectural shift happening in physical AI. We are moving constraint solving out of the realm of slow, classical CPU loops and directly into the realm of hardware-accelerated, differentiable tensor operations.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Intelligence, Constrained</h2><p></p>The future of robotics AI will not be defined by a single, omniscient neural network that perfectly predicts every granular physical interaction. The real world contains too much sensor noise, environmental entropy, and geometric complexity for pure deep learning to operate safely in an unbounded manner.<p></p>The path to scalable, safe, and deployable autonomy lies in rigorous system architecture. 
It requires foundation models that intuitively propose state and intent, inextricably coupled with a deterministic enforcement layer that speaks the immutable language of spatial mathematics. By translating probabilistic inferences into guaranteed physical boundaries, we can finally leverage the power of modern AI in environments where failure is expensive.]]></content:encoded>
    </item>
    <item>
      <title>The geometry of non-uniqueness and why robotics is also a diffusion problem</title>
      <link>https://xolver.ai/blog/geometry-of-non-uniqueness</link>
      <guid isPermaLink="true">https://xolver.ai/blog/geometry-of-non-uniqueness</guid>
      <pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate>
      <description>Real-world robotics is multi-modal. We explore why traditional regression fails in the face of non-uniqueness and how diffusion models provide a mathematical bridge between noise and physical intent.</description>
      <content:encoded><![CDATA[As we move deeper into the era of general-purpose physical AI, we are forced to confront a mathematical reality that the symbolic world of LLMs rarely touches, the problem of non-uniqueness. In language, there may be many ways to say the same thing, but in robotics, there are many ways to *do* the same thing, and picking the wrong combination results in physical failure.<p></p>At Xolver, we believe that the transition from traditional robotics to truly agentic systems depends on moving away from deterministic "correctness" and toward the geometry of uncertainty. This is why we see robotics not as a regression problem, but as a diffusion problem.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The Fallacy of the Average</h2><p></p>Traditional robotics models, including many early Vision-Language-Action (VLA) architectures, are built on a dangerous assumption, that for every observation $s$, there exists a single optimal action vector $a$. This is the mathematical framework of regression.<p></p>The problem arises when the world offers more than one valid path. Imagine a robot tasked with picking up a tool that can be approached from either the left or the right. A deterministic model, trained on both types of demonstrations, will attempt to minimize the mean squared error. Mathematically, it computes the expectation<p></p>$$a_{pred} = \int a \cdot p(a|s) \, da$$<p></p>This is the "regression to the mean." If the distribution $p(a|s)$ is bi-modal, the average of two valid paths is often a path that goes directly through the center, colliding with the table or grasping at thin air. In the physical world, the "average" of two right answers is almost always a wrong answer.<p></p>Intelligence is not about finding the one "right" answer. 
It is about acknowledging the infinite set of almost-right answers while strictly avoiding the sea of impossible ones.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Action Manifolds and Multi-modality</h2><p></p>To solve for non-uniqueness, we must redefine the "Action Space." It is not merely a 7-DOF vector of joint angles. It is a complex, non-Euclidean manifold $\mathcal{M}$ shaped by the constraints of physics and the requirements of the task.<p></p>Real-world tasks are inherently multi-modal. A single visual input does not map to a point, it maps to a distribution. When a human reaches for a cup, their nervous system isn't solving for a single coordinate. It is navigating a high-probability "valley" in a manifold where millions of trajectories are equally valid.<p></p>Deterministic models collapse this manifold into a single, fragile point. To build robust robots, we need models that can represent the entire distribution, preserving the "valleys" of intent while respecting the high-energy "ridges" of physical impossibility.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Diffusion as a Vector Field of Intent</h2><p></p>This is where diffusion models change the game. Instead of predicting a path directly, diffusion treats action generation as a process of refinement. It starts with noise and iteratively "pulls" it toward a valid state.<p></p>This is formalized through Stochastic Differential Equations (SDEs). The model learns the Score Function, which is the gradient of the log-density of the data<p></p>$$\mathbf{s}(x, t) = \nabla_x \log p_t(x)$$<p></p>Think of this score function as a vector field or a "current." No matter where you start in the action space, the score function points you toward the nearest valid manifold.<p></p>Diffusion doesn't just "generate" an action. It refines it against the implicit constraints of the environment. 
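<p></p>The pull of the score field is easy to reproduce in one dimension. Below, a toy equal-weight mixture with two valid action modes (all numbers invented); plain gradient ascent on $\log p$ stands in for the sampler:<p></p>

```python
# Score-following on a two-mode action distribution p(a): equal-weight
# Gaussians at a = -1 and a = +1, toy stand-ins for "approach from the left"
# and "approach from the right". All numbers are illustrative.
import math

SIGMA = 0.2
MODES = (-1.0, 1.0)

def score(a):
    """d/da log p(a): the 'current' that points toward the nearest mode."""
    w = [math.exp(-((a - mu) ** 2) / (2 * SIGMA ** 2)) for mu in MODES]
    dw = [wi * (mu - a) / SIGMA ** 2 for wi, mu in zip(w, MODES)]
    return sum(dw) / sum(w)

def follow_score(a, lr=0.01, steps=2000):
    # Deterministic gradient ascent on log p; a real sampler adds annealed
    # noise, but the pull toward the manifold is the same.
    for _ in range(steps):
        a += lr * score(a)
    return a

mean_action = sum(MODES) / len(MODES)   # what MSE regression converges to
left, right = follow_score(-0.3), follow_score(0.3)
```

<p></p>Regression lands at $0.0$, squarely between the two valid answers; score-following lands on whichever mode the noisy start is nearest. Diffusion policies industrialize exactly this behaviour.<p></p>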
It is the mathematical bridge between "noise" (uncertainty) and "score" (intent). By learning the vector field rather than the point, the robot gains the ability to recover from perturbations and handle multi-modal choices without ever "averaging" its way into a collision.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>From Brains to Nervous Systems</h2><p></p>The philosophical shift here is profound. LLMs deal with discrete tokens where "correctness" is a matter of sequence and probability over a finite vocabulary. Robotics deals with continuous flows where "correctness" is a matter of survival in a non-linear physical world.<p></p>Diffusion models represent the first time we've had a mathematical framework that respects the "messiness" of reality without attempting to simplify it. By embracing the stochastic nature of motion, we move closer to how biological nervous systems operate, not by executing a pre-computed script, but by continuous, score-driven refinement.<p></p>Traditional models are fragile because they expect the world to match their single prediction. Diffusion models are resilient because they are designed to walk through the probability of the world, constantly correcting their course toward the manifold of success.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Conclusion. Solving for the Infinitesimal</h2><p></p>The next era of robotics won't be defined by bigger models or more parameters, but by models that can navigate the geometry of uncertainty. We are moving away from the "Fallacy of the Average" and toward a physics-grounded understanding of non-uniqueness.<p></p>To touch the world, we must first learn to walk through the probability of it. 
At Xolver, we are building the mathematical scaffolding that allows machines to do exactly that, solving for the infinitesimal adjustments that turn a noisy intent into a certain action.<p></p><span class='font-semibold text-terracotta'>Intelligence is the ability to navigate the many ways to be right, while understanding exactly how to not be wrong.</span>]]></content:encoded>
    </item>
    <item>
      <title>The new mathematics of touch, solving for tactile intelligence</title>
      <link>https://xolver.ai/blog/tactile-intelligence-2026</link>
      <guid isPermaLink="true">https://xolver.ai/blog/tactile-intelligence-2026</guid>
      <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
      <description>Touch is no longer an auxiliary sense. It is the central bottleneck of general-purpose physical intelligence. We explore how tactile intelligence is formalized through continuum mechanics, information theory, and control.</description>
      <content:encoded><![CDATA[As 2026 begins, a fundamental truth is becoming unavoidable in robotics. Touch is no longer an auxiliary sense. It is the central bottleneck of general-purpose physical intelligence.<p></p>Vision provides geometry. Language provides intent. Touch provides the closed-loop reality check. The moment a robot makes contact with the world, abstraction ends and physics begins.<p></p>Tactile intelligence is not a signal processing problem. It is a problem of non-smooth dynamics, stochastic control, and energy transfer across matter. Scaling models helps, but only up to the point where the laws of mechanics reassert themselves.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Touch as a physical tensor field, not a signal</h2><p></p>Modern tactile sensors do not observe a scalar pressure value. They sample a discretized version of the Cauchy stress tensor $\boldsymbol{\sigma}(x, t)$ at the contact interface.<p></p>Touch is best described as a time-varying mapping from the contact manifold $\mathcal{M}_c$ to force and torque space,<p></p>$$\mathbf{f}(t) = \int_{\mathcal{M}_c} \boldsymbol{\sigma}(x, t)\,\mathbf{n}(x)\,dA$$<p></p>where $\mathbf{n}$ is the surface normal.<p></p>Unlike vision, which passively observes photons, touch measures the transmission of energy through deformable matter. The so-called messiness of tactile data is not noise. It is the high-frequency structure of shear stress $\tau$ and normal pressure $p$ that determines whether an object remains stable or begins to slip.<p></p>This is why touch scales differently from vision. Increasing taxel density without understanding the physics simply produces more chaos.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The discontinuity problem, contact is not smooth</h2><p></p>The hardest part of tactile intelligence is not dimensionality. 
It is discontinuity.<p></p>The transition from free space to contact is governed by the Signorini complementarity condition,<p></p>$$g_n \ge 0,\quad \lambda_n \ge 0,\quad g_n \lambda_n = 0$$<p></p>where $g_n$ is the contact gap and $\lambda_n$ is the normal force.<p></p>This is a true mathematical discontinuity. There is no smooth interpolation between touching and not touching. Classical approaches tried to smooth this transition away. In 2026, the shift is toward embracing it.<p></p>Differentiable contact models now allow gradients to flow through stick-slip transitions defined by the Coulomb friction cone,<p></p>$$\|\boldsymbol{\tau}\| \le \mu \lambda_n$$<p></p>This matters because manipulation lives at the boundary of stability. Slip, micro-vibration, and deformation are not edge cases. They are the task.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>From tactile fields to latent manifolds</h2><p></p>Raw tactile observations $x_t$ live in an extremely high-dimensional space. Thousands of taxels across time quickly become intractable unless structured.<p></p>The modern approach is to project these observations onto a lower-dimensional tactile manifold $\mathcal{Z}$, learned through interaction rather than geometry.<p></p>This is increasingly formalized through variational information bottlenecks. We seek a latent representation $z_t$ that preserves predictive power while discarding irrelevant variation,<p></p>$$\max I(z_t; x_{t+1}) - \beta I(z_t; x_t)$$<p></p>These latent variables function as tactile tokens. They are not symbolic labels like "slip" or "stable." They are coordinates in a physical interaction space where distance encodes risk. Moving closer to the boundary of a cluster corresponds to a rising probability of failure.<p></p>In this framing, tactile intelligence is not classification. 
It is navigation on a learned manifold shaped by physics.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Prediction, not reaction, the active inference view</h2><p></p>Humans do not respond to touch after the fact. We anticipate it.<p></p>This is best captured through predictive processing and active inference. The robot maintains an internal generative model that predicts expected tactile feedback $\hat{x}_t$ given vision $x_v$ and action $u_t$,<p></p>$\hat{x}_t = g(x_v, u_t)$<p></p>The key signal is not touch itself, but surprisal,<p></p>$\delta_t = x_t - \hat{x}_t$<p></p>This prediction error drives immediate belief updates in the internal state $b_t$, often bypassing higher-level reasoning. A spike in $\delta_t$ is what causes instant grip correction when an object turns out to be heavier, softer, or slipperier than expected.<p></p>In physical systems, prediction error is faster and more reliable than symbolic reasoning. This is why tactile control cannot wait for language-level planning.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Hierarchical control under physical constraints</h2><p></p>Tactile intelligence operates across time scales.<p></p>At the millisecond level, reflexive control loops stabilize contact. At the second level, higher policies reason about task completion. This structure is naturally modeled as a hierarchical stochastic optimal control problem.<p></p>At the low level, stability is governed by energy dissipation. At the high level, value functions encode task intent. The unifying object is the value function of the underlying Hamilton-Jacobi-Bellman equation,<p></p>$$V(x) = \min_u \mathbb{E} \left[ \int_0^T \left( \|\delta_t\|^2 + \lambda \|\tau(t)\|^2 \right) dt \right]$$<p></p>When tactile error $\delta_t$ exceeds a safety threshold, the value function collapses around stability. The policy shifts instantly from goal-seeking to damage prevention. 
This is not a heuristic. It is optimal behavior under physical risk.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Why touch is the ultimate test of truth</h2><p></p>In virtual domains, models can hallucinate. In the physical world, conservation of momentum and energy act as non-negotiable loss functions.<p></p>The shift in 2026 is the recognition that physics cannot be trained away. It must be embedded into representation, prediction, and control.<p></p>Tactile intelligence is where geometry meets dynamics, where probability meets friction, and where intelligence finally becomes accountable to reality.<p></p><span class='font-semibold text-terracotta'>At Xolver, we see touch not as another modality, but as the grounding layer of physical intelligence. It is at the contact patch, where bits meet atoms, that artificial intelligence stops being impressive and starts being real.</span>]]></content:encoded>
    </item>
    <item>
      <title>Predictions for the mathematics of robotics AI in 2026, from tokens to touch</title>
      <link>https://xolver.ai/blog/ai-robotics-2026</link>
      <guid isPermaLink="true">https://xolver.ai/blog/ai-robotics-2026</guid>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
      <description>If 2024 and 2025 were about giving robots a brain through LLMs and VLMs, 2026 feels like the year we finally give them a functioning nervous system. We explore how physics, control theory, and modern machine learning merge in earnest.</description>
      <content:encoded><![CDATA[Looking ahead to 2026, we see a clear shift in the industry's trajectory. If 2024 and 2025 were about giving robots a brain through LLMs and VLMs, 2026 feels like the year we finally give them a functioning nervous system.<p></p>We are moving past the novelty of robots that can see and speak, and into the much harder engineering reality of robots that can act with the subtle, contact-rich fidelity of a human. This transition is not cosmetic. It is rooted in changes to how we model physics, control, and learning itself.<p></p>Below is how we expect the mathematics and physics of robotics AI to evolve in 2026.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>1. The tokenization of continuous physics</h2><p></p>The most profound shift in 2026 is mathematical. We are dissolving the long-standing barrier between discrete reasoning and continuous control.<p></p>Until recently, these were treated as separate domains. Language models predicted the next text token, while control policies minimized a cost function $J$ over continuous state-space trajectories $x(t)$, $u(t)$. The coupling between the two was fragile and largely hand-engineered.<p></p>In 2026, Vision-Language-Action architectures mature into systems that treat physical force and motion as just another language. The core idea is the discretization of the manifold of useful actions. Instead of outputting a continuous voltage or torque command for each motor, the model predicts a sequence of action or motion tokens. These tokens are then decoded by a low-level diffusion or control policy into high-frequency actuation, often at 100 Hz or higher.<p></p>What emerges is a unified latent space $z$, where semantic intent, such as "twist the cap," aligns with dynamic affordances like torque vectors, friction cones, and compliance regions. We are also seeing the quantization of force itself. 
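<p></p>A minimal version of this discretization, uniform per-channel binning into 256 tokens, fits in a few lines (the bin count and normalized action range are illustrative; deployed systems often use learned or quantile bins):<p></p>

```python
# Uniform binning of one continuous action channel into discrete tokens and
# back. Bin count and action range are illustrative choices.

N_BINS = 256
A_MIN, A_MAX = -1.0, 1.0   # normalized action range

def encode(a):
    """Continuous action -> token id in [0, N_BINS - 1]."""
    a = min(max(a, A_MIN), A_MAX)
    t = int((a - A_MIN) / (A_MAX - A_MIN) * N_BINS)
    return min(t, N_BINS - 1)

def decode(token):
    """Token id -> bin-centre continuous action."""
    width = (A_MAX - A_MIN) / N_BINS
    return A_MIN + (token + 0.5) * width

a = 0.137
token = encode(a)          # 145
recovered = decode(token)  # 0.13671875, within half a bin width of a
```

<p></p>The round-trip error is bounded by half a bin width, which sets the resolution the low-frequency token stream hands to the high-frequency decoder.<p></p>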
By training on large-scale teleoperation datasets, models learn to predict action tokens $a_t$ that implicitly encode compliance and contact dynamics. The math shifts from explicit inverse kinematics toward probabilistic token prediction,<p></p>$$a_t \sim p(a \mid o_{\le t}, c)$$<p></p>where $a_t$ is the action token, $o_{\le t}$ is the history of observations, and $c$ is the high-level language command. Control becomes a sequence modeling problem, without sacrificing physical realism.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>2. Solving contact-rich manipulation through differentiable physics</h2><p></p>For decades, the sim-to-real gap has been the graveyard of robotics startups. Robots trained in rigid-body simulators failed in the real world because simulators could not model deformation, frictional slip, or micro-collisions. A rubber seal, a greasy bolt, or a slightly misaligned part was enough to cause failure.<p></p>In 2026, the key breakthrough is the practical adoption of differentiable soft-body simulation. Instead of using non-differentiable physics engines that block gradient flow, the physics equations themselves become part of the computational graph.<p></p>This enables end-to-end learning across perception, control, and physics. A physical failure, such as dropping a cup, generates an error signal that propagates backward through the simulator,<p></p>$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_T} \cdot \frac{\partial x_T}{\partial \theta}$$<p></p>where $\theta$ includes both policy parameters and physical simulation parameters. The robot can effectively learn by dreaming physics.<p></p>This reframing also changes how classic manipulation problems are solved. Peg-in-hole assembly is no longer treated as a purely geometric constraint. Instead, it is modeled as an energy minimization problem involving friction, deformation, and contact forces. 
The robot learns to introduce small corrective motions that reduce system energy, allowing parts to slide into place rather than jam.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>3. Visuotactile foundation models</h2><p></p>Vision alone is insufficient for manipulation. You cannot see the weight of a hammer or the slipperiness of a soap bar. Humans rely heavily on touch to regulate grip force and adapt instantly to unexpected changes.<p></p>We expect 2026 to be the year visuotactile multimodality becomes foundational. Robots move beyond RGB inputs toward dense tactile fields generated by high-resolution sensors. Many of these sensors use internal cameras to observe gel deformation, producing high-dimensional tactile signals.<p></p>Mathematically, the goal is to fuse visual observations $x_v$ and tactile observations $x_t$ into a shared representation $z$. The likely breakthrough lies in cross-modal prediction. Before contact occurs, the visual system predicts the expected tactile signal $\hat{x}_t$,<p></p>$$\hat{x}_t = f(x_v)$$<p></p>Once contact happens, the difference between expected and actual touch,<p></p>$$\delta_t = x_t - \hat{x}_t$$<p></p>becomes the primary learning signal. This surprisal drives rapid adaptation of grip and force. It mirrors how humans instantly adjust when an object turns out to be heavier or slipperier than it appears.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>4. Energy-optimal control as a first-class objective</h2><p></p>A critical but often overlooked dimension is energy. Generative models are computationally expensive, and humanoid actuators are power-hungry. In 2026, efficiency becomes a first-class term in the objective function.<p></p>Control policies are no longer optimized solely for task success. 
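As a hedged sketch of what such an objective looks like in code, the following toy cost combines a task term with a discretized energy integral. The function name, weighting, and discretization are illustrative assumptions, not any production formulation.

```python
import numpy as np

def trajectory_cost(positions, torques, target, dt=0.01, w_energy=0.1):
    # Task term: squared distance of the final state from the target.
    task_cost = np.sum((positions[-1] - target) ** 2)
    # Energy term: discrete approximation of the integral of squared torque.
    energy_cost = np.sum(torques ** 2) * dt
    return task_cost + w_energy * energy_cost
```

Raising `w_energy` biases the optimizer toward slower motions that exploit momentum and gravity rather than fighting them.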
Instead, they incorporate energy directly through a minimum-effort objective, in the spirit of the principle of least action,<p></p>$$\min_{\tau} \int_0^T \left\| \tau(t) \right\|^2 \, dt$$<p></p>where $\tau(t)$ is the vector of joint torques. By penalizing energy use during training, including through RLHF-style objectives for efficiency, robots learn to exploit momentum, gravity, and passive dynamics. The resulting motion looks fluid and almost lazy, and can extend battery life by 20 to 30 percent without hardware changes.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>The convergence that defines 2026</h2><p></p>What makes 2026 distinctive is not a single breakthrough paper, but the convergence of these mathematical streams.<p></p>Tokenized action allows the brain to communicate fluently with the body. Differentiable physics teaches the brain the body's constraints. Visuotactile sensing gives the body real feedback grounded in contact and force.<p></p><span class='font-semibold text-terracotta'>We believe the winners of 2026 will not be the teams with the largest language models, but the ones who successfully ground those models in the unforgiving, nonlinear mathematics of the physical world.</span>]]></content:encoded>
    </item>
    <item>
      <title>How to train an RFM (Robotics Foundation Model)</title>
      <link>https://xolver.ai/blog/how-to-train-rfm</link>
      <guid isPermaLink="true">https://xolver.ai/blog/how-to-train-rfm</guid>
      <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
      <description>Training a robotics foundation model is not an exercise in scaling parameters. It is an exercise in deciding what kind of world you want a machine to survive in. Unlike language or vision models, an RFM lives in time, friction, latency, contact, failure, and recovery.</description>
      <content:encoded><![CDATA[Training a robotics foundation model is not an exercise in scaling parameters. It is an exercise in deciding what kind of world you want a machine to survive in. Unlike language or vision models, an RFM does not live in files, tokens, or frozen datasets. It lives in time, friction, latency, contact, failure, and recovery.<p></p>This is why most attempts at large scale robotics learning fail quietly. They begin with the wrong abstraction. They assume robotics is just another multimodal problem. It is not. Robotics is a closed loop system where perception, reasoning, and action continuously interfere with one another.<p></p>An RFM is best understood as a general policy that can span tasks, environments, and embodiments. Not a task specific controller. Not a demo trained for one arm, one table, one lighting condition. It is a system that can perceive intent, reason under uncertainty, and act in ways that remain stable when the world pushes back.<p></p>Today we share, from our experience, how such a model is actually trained, end to end, without pretending that physics can be abstracted away.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Define intelligence before you define models</h2><p></p>Before collecting data or choosing architectures, define what intelligence means in your system.<p></p>Is the robot expected to manipulate objects, navigate spaces, or collaborate with humans? Is it required to learn new tasks from language, or only execute known skills? Are failures acceptable if recoverable, or must the system be conservative by default? What latency budget does the system have? Does inference need to run at 5 Hz, 10 Hz, or 50 Hz?<p></p>These questions are not philosophical. They directly constrain architecture, training signals, and deployment.<p></p>Most RFM projects fail because teams gather data before defining the distribution shift the model must survive. 
As a result, the model performs well in controlled settings and collapses under mild perturbations.<p></p>If you cannot articulate the failure modes you are designing for, you are not training a foundation model. You are collecting demos.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Data is not volume, it is causality</h2><p></p>RFM training is often framed as a data scaling problem. That is only partially true. The real challenge is not how much data you have, but whether the data teaches causality rather than correlation.<p></p>Robotics data is expensive, biased, and shaped by embodiment. Sensors encode perspective. Actuators encode constraints. Human operators encode habits. If you simply aggregate trajectories, the model learns these biases instead of the task.<p></p>Robust RFM pipelines deliberately combine four classes of data.<p></p>Real world demonstrations anchor the model to physics. Teleoperation, kinesthetic teaching, or expert policies teach contact dynamics, friction, and feasibility.<p></p>Simulation rollouts provide breadth. They allow exploration of rare events, edge cases, and failures that are unsafe or slow to produce in reality. Domain randomization here is not about noise, but about uncertainty that mirrors reality.<p></p>Corrective and recovery data is the most valuable and the most ignored. Successful trajectories teach optimism. Near misses, aborts, and human interventions teach robustness. Without this data, models fail catastrophically instead of gracefully.<p></p>Language grounded annotations provide abstraction. Not just task names, but intent, constraints, and success conditions. This is what allows generalisation beyond memorised trajectories.<p></p>The objective is not balance. 
The objective is coverage of cause and effect.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Embodiment is a first class problem</h2><p></p>Generalising across embodiments is not a slogan. It is a technical challenge.<p></p>A 7 degree of freedom arm, a 4 degree of freedom arm, and a quadruped do not share an action space. If you ignore this, cross embodiment generalisation is impossible.<p></p>Most successful RFMs standardise actions and state through abstraction. Proprioception is normalised into relative joint states, velocities, or end effector frames. Actions are tokenised into latent representations rather than raw torques.<p></p>Common approaches include action chunking, trajectory prediction, or latent action codes learned via VQ style encoders. The model does not predict individual motor commands. It predicts short horizon behaviours that can be mapped onto different bodies through embodiment specific decoders.<p></p>This separation is what allows the same policy to control different hardware without retraining from scratch.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Architecture follows control, not fashion</h2><p></p>Most modern RFMs use a vision language action structure. Cameras provide state. Language provides goal conditioning. The model outputs actions or plans.<p></p>The critical architectural decision is not the transformer variant. It is how time and feedback are handled.<p></p>High capacity models are slow. Motors are fast. Physics does not wait for attention layers to converge.<p></p>For this reason, RFMs rarely operate at motor control frequencies. Instead, they act at a semantic rate. They predict short horizon trajectories, action chunks, or goal states. Classical controllers handle interpolation, stabilisation, and safety at high frequency.<p></p>This frequency separation is not a compromise. 
It is how biological systems work.<p></p>A typical stack looks like this. The RFM runs at low frequency and reasons about intent and strategy. A mid level controller translates these outputs into feasible motion plans. A low level controller enforces safety, smoothness, and constraints.<p></p>End to end purity is attractive in papers. Hybrid systems survive contact with reality.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Training happens in stages, not once</h2><p></p>RFMs are not trained end to end in a single pass. They are grown through stages.<p></p>First comes representation learning. Vision, proprioception, and language are aligned into a shared latent space. Masked prediction, contrastive objectives, and future state modelling are common here. Control is not yet involved.<p></p>Second comes imitation. The model learns to map states and goals to action representations using demonstrations. Losses are supervised. Stability matters more than optimality.<p></p>Third comes interaction. Reinforcement learning, online fine tuning, or human in the loop correction exposes the model to its own mistakes. This is where robustness is learned.<p></p>These stages are not linear. Teams loop between them continuously as new data exposes new failure modes.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Sim to real is a loop, not a bridge</h2><p></p>Sim to real transfer is often described as a milestone. In practice, it is a gradient.<p></p>Simulation enables scale. Reality provides truth.<p></p>Early training leans heavily on simulation to explore. As deployment begins, real world logs are fed back into the simulator. Physics parameters are recalibrated. Contact models are refined. Latency and sensor noise are updated.<p></p>This real to sim feedback creates a living digital twin. Simulation becomes less idealised and more predictive. 
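As a toy illustration of one recalibration step, the sketch below fits a Coulomb friction coefficient so that a simulated sliding block reproduces a logged real-world stopping distance. The sliding-block model, function names, and grid search are illustrative assumptions, not a description of any particular pipeline.

```python
import numpy as np

def stopping_distance(v0, mu, g=9.81):
    # Point mass sliding with Coulomb friction: d = v0^2 / (2 * mu * g).
    return v0**2 / (2.0 * mu * g)

def calibrate_friction(v0, observed_d, grid=np.linspace(0.05, 1.0, 2000)):
    # Pick the friction coefficient whose simulated stop best matches reality.
    errors = np.abs(stopping_distance(v0, grid) - observed_d)
    return grid[np.argmin(errors)]
```

Each deployment log tightens the fit, which is what turns the simulator from a guess into a measurement.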
It stops being a sandbox and starts becoming an instrument.<p></p>If real failures do not exist in your simulator, your simulator is lying to you.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Safety is trained and enforced</h2><p></p>Safety in RFMs is both architectural and learned.<p></p>Certain constraints must be hard. Joint limits. Collision boundaries. Emergency stops. These are enforced outside the model.<p></p>Other behaviours can and should be learned. When to slow down. When to abort. How to behave under uncertainty.<p></p>This requires explicit signals. Unsafe actions are penalised. Near misses are logged. Human overrides are treated as supervision, not noise.<p></p>Evaluation must reflect this. Success rate alone is meaningless. Intervention frequency, recovery time, and degradation under stress matter far more.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Evaluate how systems degrade, not how they peak</h2><p></p>RFMs are often judged by demos. This is misleading.<p></p>The real test is degradation. How performance changes as lighting shifts. How behaviour changes when latency increases. What happens after hours of continuous operation.<p></p>Foundation models are valuable not because they never fail, but because they fail predictably and recoverably.<p></p><hr class='my-8 border-subtle' /><p></p><h2 class='text-xl font-semibold text-charcoal mt-8 mb-4'>Why this approach matters</h2><p></p>Teams that make real progress in robotics do not treat it as a model problem. They treat it as a system problem. At Xolver, this methodology did not emerge from imitation. It emerged from first principles. From building systems that must operate in the physical world, under uncertainty, at scale. 
If intelligence is to move beyond screens and into environments, it must be trained with respect for physics, obsession with feedback loops, and humility about what models can and cannot do.<p></p><span class='font-semibold text-terracotta'>A robotics foundation model is not trained once. It is raised.</span>]]></content:encoded>
    </item>
    <item>
      <title>Why we chose to open source.</title>
      <link>https://xolver.ai/blog/opensource</link>
      <guid isPermaLink="true">https://xolver.ai/blog/opensource</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>We chose to open source part of Xolver not as a marketing gesture, but as an architectural decision. Closed systems create the illusion of progress. Open systems reveal where reality pushes back.</description>
      <content:encoded><![CDATA[We chose to open source part of Xolver not as a marketing gesture, but as an architectural decision. Physical intelligence is not something that can be built in isolation. It sits at the intersection of perception, control, systems engineering, and real world messiness. No single company, no matter how well funded or well staffed, has a monopoly on insight in this space.<p></p><span class='font-semibold text-terracotta'>Closed systems create the illusion of progress. Open systems reveal where reality pushes back.</span><p></p>When intelligence leaves the screen and enters the physical world, assumptions break quickly. Latency matters. Sensors drift. Edge cases dominate. Open sourcing core components forces us to confront these truths early. It exposes our ideas to environments we did not anticipate and to scrutiny we cannot control. That is uncomfortable, but it is also how systems mature.<p></p>We also believe that trust in physical intelligence cannot be earned through claims alone. When software is responsible for actions in the real world, seeing how it works matters. Operators, partners, and developers need to understand behavior, failure modes, and limits. Open code makes this possible. It turns black boxes into inspectable systems and fear into informed judgment.<p></p>Another reason is ecosystem health. Physical intelligence is still early. Tools, data formats, simulators, and evaluation methods are fragmented. By open sourcing parts of our stack, especially runtimes, interfaces, and tooling, we reduce friction for others building adjacent systems. A healthier ecosystem benefits everyone, including us. Standards emerge faster when they are built in the open.<p></p>Open source also keeps us honest. It creates a forcing function against brittle design and hidden shortcuts. If something only works under perfect conditions, it will be discovered quickly. 
That pressure improves quality far more effectively than internal reviews alone.<p></p>Importantly, open source does not mean giving away the business. We are deliberate about what we open and what we keep proprietary. Core research ideas, production hardened intelligence, safety layers, and customer specific systems remain closed. What we open are the foundations that should be shared, the scaffolding that helps the field move forward together.<p></p>Many of the most important infrastructure layers in technology followed this path. Operating systems. Databases. Cloud primitives. They succeeded not because they were closed, but because they were trusted, extensible, and shaped by real use. Physical intelligence will follow a similar trajectory.<p></p>Finally, this is about alignment with our long term vision. Xolver is not trying to win by secrecy. We are trying to win by building systems that work, systems that last, and systems others rely on. Open sourcing part of our work is a signal of confidence in our direction and respect for the community building alongside us.<p></p><span class='font-semibold text-charcoal'>Physical intelligence will define how machines coexist with people in the real world. That responsibility is too large to keep entirely behind closed doors.</span>]]></content:encoded>
    </item>
    <item>
      <title>How we think and what we do.</title>
      <link>https://xolver.ai/blog/manifesto</link>
      <guid isPermaLink="true">https://xolver.ai/blog/manifesto</guid>
      <pubDate>Sun, 14 Dec 2025 00:00:00 GMT</pubDate>
      <description>Xolver starts from a simple belief: Intelligence only matters when it survives contact with the real world. Our work begins where clean data ends and uncertainty begins.</description>
      <content:encoded><![CDATA[Xolver starts from a simple belief. Intelligence only matters when it survives contact with the real world. We are not interested in models that look impressive in isolation but fail under noise, delay, and unpredictability. Our work begins where clean data ends and uncertainty begins.<p></p>We see the physical world as a continuous stream, not a sequence of snapshots. Objects move, environments drift, and intent changes over time. Any system that treats perception as a one-time act and decision making as a static output will eventually fail. Xolver is built around closed loops where seeing, reasoning, acting, and verifying happen continuously.<p></p>We do not separate intelligence from responsibility. Every decision a system makes in the physical world has consequences. Safety, explainability, and failure handling are not add-ons. They are part of the core architecture. If a system cannot explain why it acted or recognize when it is wrong, it does not belong in production.<p></p>Xolver is software first but not software only. We design our systems to run on existing cameras, drones, machines, and edge devices. At the same time, we accept that some problems demand full stack ownership. We will let product market fit decide when hardware becomes necessary. Until then, we remain hardware flexible and architecture disciplined.<p></p>We believe platforms matter more than point solutions. Features solve today’s problem. Platforms survive tomorrow’s variability. Xolver builds reusable intelligence systems that can be adapted across environments rather than rebuilt for each use case. This allows our customers to scale capability without multiplying complexity.<p></p>We are not a services company disguised as a product. Integration and deployment exist to unlock value, not to become the business. 
Our goal is to build systems that work out of the box, improve over time, and reduce operational burden rather than add to it.<p></p>We measure success differently. Not by demo accuracy, but by uptime. Not by benchmark scores, but by trust earned in real environments. When operators stop watching dashboards and start relying on the system, we know we are doing our job.<p></p>Xolver is being built for the long term. Physical intelligence is infrastructure, not a trend. It will quietly sit beneath cities, factories, vehicles, and public spaces, making them safer, more efficient, and more aware. We intend to build that layer with humility, rigor, and respect for the real world it serves.]]></content:encoded>
    </item>
</channel>
</rss>