UNISafe: Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures

Carnegie Mellon University
Under Review, 2025

Abstract

Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space--spanning both the latent representation and the epistemic uncertainty--we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions.
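As background for the reachability analysis mentioned above, the safety value function in this setting can be written over an augmented state $s=(z,u)$ of latent state $z$ and epistemic uncertainty $u$. The following is a hedged sketch in our own notation of the standard discounted safety Bellman formulation from prior HJ reachability work, not necessarily the paper's exact objective:

$$ V(s) \;=\; (1-\gamma)\,\ell(s) \;+\; \gamma \min\Big\{\, \ell(s),\; \max_{a}\, V\big(f(s,a)\big) \Big\}, $$

where $\ell(s)$ is a failure margin that is negative both on known failure states and when $u$ exceeds the calibrated threshold, $f$ denotes the latent dynamics, and a state is treated as unsafe when $V(s) \le 0$. As $\gamma \to 1$, this recovers the undiscounted reachability value $V(s) = \min\{\ell(s), \max_a V(f(s,a))\}$.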


Challenge: Unreliable World Models Can Result in OOD Failures

Challenge illustration

While latent safety filters can compute control strategies that prevent hard-to-model failures, their training and runtime filtering rely on imagined futures generated by the latent dynamics model. However, a pretrained world model can hallucinate in uncertain scenarios where it lacks knowledge, leading to OOD failures.

Consider the simple example in the figure above, where a Dubins car must avoid two failure sets: a grey circular region and a purple rectangular region. The world model is trained on RGB images of the environment and angular-velocity actions, but its training data is limited and never shows the robot entering the purple failure set. When the world model imagines an action sequence in which the robot enters this region, it hallucinates as soon as the scenario goes out-of-distribution: the imagined robot teleports away from the failure region to a safe state. This phenomenon leads to latent safety filters that cannot prevent unseen failures, or even known ones, because they produce optimistic safety estimates for uncertain, out-of-distribution scenarios.

UNISafe: UNcertainty-aware Imagination for Safety Filtering

Static diagram

(Left): We quantify the world model’s epistemic uncertainty for detecting unseen failures in latent space and calibrate an uncertainty threshold via conformal prediction, resulting in an OOD failure set. (Center): Uncertainty-aware latent reachability analysis synthesizes a safety monitor and fallback policy that steers the system away from both known and OOD failures. (Right): Our safety filter reliably safeguards arbitrary task policies during hard-to-model vision-based tasks, like a teleoperator playing the game of Jenga.
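As a rough illustration of the left panel, epistemic uncertainty is commonly quantified as disagreement across an ensemble of one-step latent-dynamics predictors; the sketch below assumes a hypothetical list `ensemble` of such predictors and is only an illustrative proxy, not the paper's exact estimator.

import numpy as np

def epistemic_uncertainty(ensemble, z, a):
    """Disagreement among K hypothetical one-step latent predictors f_k(z, a).
    Illustrative proxy only, not the exact estimator used in the paper."""
    preds = np.stack([f_k(z, a) for f_k in ensemble])  # (K, latent_dim)
    # Mean per-dimension variance across ensemble members: high disagreement
    # flags state-action pairs the world model has not learned well.
    return preds.var(axis=0).mean()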


Benchmark Safe Control Task with a 3D Dubins Car

We first conduct experiments on a low-dimensional benchmark safe-navigation task where privileged information about the state, dynamics, safe set, and safety controller is available.

OOD failure visualization

UNISafe reliably identifies the OOD failure: To evaluate OOD detection, we first consider a setting where failure states are never observed in $\mathcal{D}_{\mathrm{train}}$. The ground-truth failure set is defined as $|p_y|>0.6$, while the offline dataset contains only 1,000 safe trajectories that never enter this region—making the failure set entirely OOD. As shown in Fig. X, our method reliably infers the OOD failure set from the quantified uncertainty, and the resulting safety value function accurately identifies the unsafe region.
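As a sketch of how an uncertainty threshold can be calibrated, the snippet below uses split conformal prediction over per-step uncertainty scores; the miscoverage level `alpha` and the helper names are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def calibrate_ood_threshold(calib_scores, alpha=0.05):
    """Split-conformal threshold: for in-distribution data, a fresh prediction's
    uncertainty score falls below `delta` with probability at least 1 - alpha.
    `calib_scores` are uncertainty scores on a held-out in-distribution set."""
    n = len(calib_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample conformal quantile level
    return np.quantile(calib_scores, min(q, 1.0), method="higher")

def in_ood_failure_set(uncertainty, delta):
    # Latent states whose predictive uncertainty exceeds the calibrated
    # threshold are treated as OOD failures during reachability analysis.
    return uncertainty > delta

Taking the union of this OOD failure set with the known failure set is what lets the reachability analysis steer the system away from both.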


Quantitative evaluation

UNISafe robustly learns safety filters despite high uncertainties in the world models: We evaluate whether our method can synthesize a robust safety filter under uncertainty due to limited data coverage. Here, the vehicle must avoid a circular obstacle of radius $0.5m$ at the center, with the failure set defined as $p_x^2 + p_y^2 < 0.5^2$, and $\mathcal{D}_{\mathrm{train}}$ consists of both safe and unsafe trajectories. We construct a dataset of 1,000 expert trajectories that never enter the ground-truth unsafe set and 50 random trajectories that may include failure states. Expert trajectories are generated using the ground-truth safety value—applying fallback actions near the unsafe boundary and random actions elsewhere—inducing high uncertainty around the boundary. Fig. Y shows that UNISafe robustly learns the safety monitor with higher balanced accuracy, whereas the baseline overconfidently misclassifies unsafe states as safe. In rollouts from 181 challenging safe initial states (all oriented toward failure), UNISafe also achieves higher safety rates.
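For reference, the balanced-accuracy metric used above can be computed as the mean of the monitor's recall on unsafe and on safe states; the small helper below is a sketch, and the label conventions are ours.

import numpy as np

def balanced_accuracy(pred_unsafe, true_unsafe):
    """Mean of sensitivity (recall on unsafe states) and specificity (recall on
    safe states); robust to the imbalance between safe and unsafe labels."""
    pred_unsafe = np.asarray(pred_unsafe, dtype=bool)
    true_unsafe = np.asarray(true_unsafe, dtype=bool)
    tpr = (pred_unsafe & true_unsafe).sum() / max(true_unsafe.sum(), 1)
    tnr = (~pred_unsafe & ~true_unsafe).sum() / max((~true_unsafe).sum(), 1)
    return 0.5 * (tpr + tnr)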

Simulation: Block Plucking

We scale our method to a visual manipulation task using IsaacLab, where a Franka manipulator must pluck the middle block from a stack of three while ensuring the top block remains stacked on the bottom one. Observations consist of images from a wrist-mounted camera and a tabletop camera, along with 7-D proprioceptive inputs. Actions consist of a 6-DoF end-effector delta pose and a discrete gripper command.
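Concretely, the observation and action interface described above might be organized as below; the key names and groupings are illustrative assumptions for exposition, not the released environment's API.

# Illustrative layout of the block-plucking task interface (assumed names).
observation = {
    "wrist_rgb": "RGB image from the wrist-mounted camera",
    "table_rgb": "RGB image from the tabletop camera",
    "proprio":   "7-D proprioceptive vector",
}
action = {
    "ee_delta_pose": "6-DoF end-effector delta pose (continuous)",
    "gripper":       "discrete open/close command",
}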


UNISafe minimizes failures by preventing safety overestimation: UNISafe, which incorporates both known and OOD failures, achieves the lowest failure rates and model errors. In contrast, LatentSafe, which does not incorporate OOD failures, overestimates the safety of OOD actions, leading to unsafe action proposals.


Hardware Experiments: Vision-based Jenga with a Robotic Manipulator

We evaluate our method on a real-world robotic manipulation task using a fixed-base Franka Research 3 arm equipped with a third-person camera and a wrist-mounted camera. The robot must extract a target block from the tower without collapsing it, and then place the block on top.

Jenga experiment result

Teleoperator Playing Jenga with Safety Filters. UNISafe enables non-conservative yet effective filtering of the teleoperator’s actions, ensuring the system remains within in-distribution regions. In contrast, the uncertainty-unaware safety filter (LatentSafe) optimistically treats uncertain actions as safe, leading to failure.
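A minimal sketch of this filtering logic (least-restrictive, value-based switching) is shown below; `wm.imagine`, `value_fn`, and `fallback_policy` are hypothetical handles, not the released API.

def filter_action(wm, value_fn, fallback_policy, z, a_teleop):
    """Execute the teleoperator's action only if the uncertainty-aware safety
    value of the imagined next latent state is non-negative; otherwise apply
    the learned fallback action. Illustrative sketch, not the exact implementation."""
    z_next = wm.imagine(z, a_teleop)   # one-step latent prediction of the proposed action
    if value_fn(z_next) >= 0.0:        # predicted to stay safe and in-distribution
        return a_teleop                # least restrictive: keep the human's action
    return fallback_policy(z)          # steer back toward safe, in-distribution states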

Video Results: Robot Jenga with Latent Safety Filters

Our latent safety filter (UNISafe) allows stable block removal that is safe and predictable.

UNISafe reliably corrects the teleoperator by proposing in-distribution safe actions.

The uncertainty-unaware latent safety filter (LatentSafe) fails due to its optimistic imagination of futures, which drives the system into high-uncertainty states.


System-level Failure Detection (Generalization with Visual OOD Inputs)

OOD Visual inputs. Although the block colors differ from those seen during training, such visual variations do not necessarily imply out-of-distribution inputs. Instead, the decision to halt is based on the reliability of the filtering system. If the color change falls within the model’s generalization capacity, the latent dynamics model remains accurate, and its predictive uncertainty stays below the safety threshold. In contrast, when the visual input significantly departs from the training distribution, the model’s predictions become unreliable. The resulting increase in uncertainty causes the safety filter to trigger a halt, preventing potentially unsafe actions.
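In code, this system-level halt amounts to checking imagined rollouts against the calibrated threshold before execution. The sketch below reuses the hypothetical `wm.imagine` and uncertainty helpers from the earlier sketches and is only illustrative.

def system_level_check(wm, uncertainty_fn, delta, z, plan):
    """Before executing a candidate action sequence, roll it out in imagination
    and halt if any step's epistemic uncertainty exceeds the calibrated
    threshold `delta`, i.e., if the filter's own predictions are unreliable."""
    for a in plan:
        if uncertainty_fn(z, a) > delta:
            return "halt"        # visual input is effectively OOD; stop instead of acting
        z = wm.imagine(z, a)     # advance the imagined latent state
    return "proceed"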

BibTeX

@article{seo2025uncertainty,
  title={Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures},
  author={Seo, Junwon and Nakamura, Kensuke and Bajcsy, Andrea},
  journal={arXiv preprint arXiv:2505.00779},
  year={2025}
}