World models are neural networks that understand the dynamics of the real world, including physics and spatial properties. They can use input data, including text, image, video, and movement, to generate videos that simulate realistic physical environments. Physical AI developers use world models to generate custom synthetic data or to build downstream AI models for training robots and autonomous vehicles.
Building world models for physical AI systems like self-driving cars requires extensive real-world data, particularly video and images from diverse terrains and conditions. Gathering this data means collecting petabytes of information and millions of hours of simulation footage, followed by thousands of hours of human effort for filtering and data preparation. Neural networks with billions of parameters then learn from this massive dataset to build and update internal representations of 3D environments, enabling robots to understand dynamic behaviors, predict changes such as motion and depth, and prepare reactions to potential events. Continuous improvement through deep learning allows world models to adapt to new scenarios and understand complex physical interactions. Training these large models costs millions of dollars in GPU compute resources.
There can be different types of world models:
World foundation models (WFMs), like NVIDIA Cosmos™ models, are a specialized class of world models that meet the scale and generalizability requirements of foundation models. These neural networks, trained on massive unlabeled datasets, can be adapted for a broad range of physical AI tasks. Due to their generalizability, they can significantly accelerate the development of various physical AI applications by serving as pretrained base models that developers can post-train on smaller, task-specific datasets.
These WFMs allow developers to extend generative AI beyond the confines of 2D software and bring its capabilities into the real world while reducing the need for real-world trials. While AI’s power has traditionally been harnessed in digital domains, world models will unlock AI for tangible, real-world experiences.
Here are some of the key components for building world models:
Data curation is a crucial step for pretraining and continuous training of world models, especially when working with large-scale multimodal data. It involves processing steps like filtering, annotation, classification, and deduplication of image or video data to ensure high quality when training or post-training highly accurate models.
In video processing, data curation starts with splitting and transcoding the video into smaller segments, followed by quality filtering to retain only high-quality segments. State-of-the-art vision language models annotate key objects and actions, while video embeddings enable semantic deduplication to remove redundant data.
The data is then organized and cleaned for training. Throughout this process, efficient data orchestration ensures a smooth data flow among the GPUs, enabling them to handle large-scale data and achieve high throughput.
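The filtering and deduplication steps above can be illustrated with a minimal Python sketch. The clip names, quality scores, and two-dimensional embeddings here are invented stand-ins for what a real pipeline would compute at scale with learned quality classifiers and video embedding models.

```python
# Hypothetical curation sketch: quality filtering followed by semantic
# deduplication. All clips, scores, and embeddings are invented.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def curate(clips, min_quality=0.5, dedup_threshold=0.98):
    """Keep high-quality clips, then drop near-duplicate embeddings."""
    kept = []
    for clip in sorted(clips, key=lambda c: -c["quality"]):
        if clip["quality"] < min_quality:
            continue  # quality filter removes low-quality footage
        if any(cosine(clip["embedding"], k["embedding"]) > dedup_threshold
               for k in kept):
            continue  # semantic duplicate of an already-kept clip
        kept.append(clip)
    return kept

clips = [
    {"name": "clip_a", "quality": 0.9, "embedding": [1.0, 0.0]},
    {"name": "clip_b", "quality": 0.8, "embedding": [0.99, 0.05]},  # near-dup of clip_a
    {"name": "clip_c", "quality": 0.2, "embedding": [0.0, 1.0]},    # low quality
    {"name": "clip_d", "quality": 0.7, "embedding": [0.0, 1.0]},
]
curated = curate(clips)
print([c["name"] for c in curated])  # → ['clip_a', 'clip_d']
```

Real systems run these steps distributed across GPUs, but the order of operations, filter first, then deduplicate survivors, is the same.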
Once data is curated, developers must be able to search through it to find scenarios for specific test cases. Given the size of these datasets, this process can be like finding a needle in a haystack. However, with powerful embedding models trained from world models, developers can perform semantic search quickly and easily, retrieving targeted scenarios and shortening post-training cycles from years to days.
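The retrieval idea reduces to ranking stored scenario embeddings by similarity to an embedded query. The catalog entries and query vector below are invented; a production system would use embeddings produced by a trained model over far larger catalogs.

```python
# Minimal embedding-based semantic search sketch; vectors are invented.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query_embedding, catalog, top_k=2):
    """Return the top_k scenario names most similar to the query."""
    ranked = sorted(catalog.items(),
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

catalog = {
    "night_rain_merge":   [0.9, 0.1, 0.0],
    "sunny_highway":      [0.1, 0.9, 0.0],
    "foggy_intersection": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # e.g., an embedded "low-visibility driving" query
print(search(query, catalog))  # → ['night_rain_merge', 'foggy_intersection']
```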
Tokenization converts high-dimensional visual data into smaller units called tokens, facilitating machine learning processing. Tokenizers transform pixel redundancies in images and video into compact, semantic tokens, enabling efficient training of large-scale generative models and inference on limited resources. There are two main methods: continuous tokenization, which maps visual data to continuous latent embeddings, and discrete tokenization, which maps it to indices in a finite vocabulary of tokens. Both approaches compress the input, enhancing model learning speed and performance.
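A toy example of the discrete variant: each continuous value is replaced by the index of its nearest codebook entry. The five-entry codebook here is invented; real tokenizers learn their codebooks (for example, via vector quantization) and operate on image patches rather than single values.

```python
# Toy discrete tokenization: continuous values become integer token ids
# via nearest-codebook lookup. The codebook is invented; real tokenizers
# learn theirs from data.
codebook = [0.0, 0.25, 0.5, 0.75, 1.0]  # learned embeddings in practice

def tokenize(values):
    """Map each continuous value to the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            for v in values]

def detokenize(tokens):
    """Reconstruct an approximation of the original values from token ids."""
    return [codebook[t] for t in tokens]

pixels = [0.02, 0.23, 0.77, 0.49, 0.98]
tokens = tokenize(pixels)
print(tokens)              # → [0, 1, 3, 2, 4]
print(detokenize(tokens))  # → [0.0, 0.25, 0.75, 0.5, 1.0] (lossy)
```

The reconstruction is lossy, which is the point: the model trains on a compact integer sequence instead of raw pixels.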
Developers can train a world model architecture from scratch or post-train a pretrained foundation model for downstream tasks using additional data.
WFMs serve as generalist models, trained on extensive visual datasets to simulate physical environments. Using post-training frameworks, these models can be specialized for precise applications in robotics, autonomous systems, and other physical AI domains. There are multiple approaches to post-training a model, depending on the task and the data available.
To get started easily and streamline the end-to-end development process, developers can leverage training frameworks, which include libraries, SDKs, and tools for data preparation, model training, optimization, and performance evaluation and deployment.
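To make the post-training idea concrete, here is a deliberately tiny sketch in which a "pretrained" feature extractor is frozen and only a small task-specific head is fit on a handful of task examples. The functions and numbers are invented and bear no relation to real WFM training, which operates on billions of parameters with dedicated frameworks.

```python
# Toy post-training sketch: frozen "pretrained" backbone, trainable head.
def pretrained_features(x):
    """Stands in for a frozen foundation-model backbone (not updated)."""
    return [x, x * x]

# Small task-specific dataset: the task is to learn y = 2*x + 3*x^2
task_data = [(x, 2 * x + 3 * x * x) for x in (-2, -1, 0, 1, 2)]

# Post-train: gradient descent on the head weights only
weights = [0.0, 0.0]
for _ in range(500):
    for x, y in task_data:
        feats = pretrained_features(x)
        pred = sum(w * f for w, f in zip(weights, feats))
        err = pred - y
        weights = [w - 0.05 * err * f for w, f in zip(weights, feats)]

print([round(w, 2) for w in weights])  # → [2.0, 3.0]
```

The key structural point survives the simplification: far less data and compute are needed because most of the model (the backbone) is reused as-is.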
Reasoning models are built by post-training pretrained large language models or large vision language models. They also use reinforcement learning to analyze a problem and reason through it before reaching a decision.
Reinforcement learning (RL) is a machine learning approach where an AI agent learns by interacting with an environment and receiving rewards or penalties based on its actions. Over time, it optimizes decision-making to achieve the best possible outcome.
RL enables world models to adapt, plan, and make informed decisions, making it essential for robotics, autonomous systems, and AI assistants that need to reason through complex tasks.
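The reward-driven loop described above can be illustrated with tabular Q-learning on an invented one-dimensional corridor: the agent earns a reward for reaching the rightmost position and gradually learns a policy that always moves right. Real RL for physical AI uses far richer states, actions, and function approximators; this is only the skeleton of the idea.

```python
# Tabular Q-learning on an invented 4-cell corridor environment.
import random

N_STATES, GOAL = 4, 3
ACTIONS = [-1, 1]  # move left or right along the corridor

def step(state, action):
    """Toy environment: clip to the corridor, reward reaching GOAL."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(300):                      # episodes
    state = 0
    for _ in range(30):                   # steps per episode
        action = random.choice(ACTIONS)   # explore; Q-learning is off-policy
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # temporal-difference update toward reward + discounted future value
        q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
        state = next_state
        if state == GOAL:
            break

# The learned greedy policy moves right from every non-goal state
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # → {0: 1, 1: 1, 2: 1}
```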
World models extend AI capabilities with deep understanding of spatial relationships and physical behavior in three-dimensional environments. This enables them to simulate realistic cause-and-effect scenarios, such as predicting how objects will move and interact in complex scenes.
Developers can leverage the power of world models to generate high-quality data for training AI models in industrial and robotics applications, such as factory robots, warehouse automation, and autonomous vehicles operating on highways or in challenging terrains. Physical AI systems require large-scale, visually, spatially, and physically accurate data for learning through realistic simulations. World models can generate this data efficiently at scale for numerous applications.
World models can create more realistic and physically accurate visual content by understanding the underlying principles of how objects move and interact. In certain cases, outputs from highly accurate world models can take the form of synthetic data, which can be leveraged for training perception AI.
Current AI video generation can struggle with complex scenes and has a limited understanding of cause-and-effect relationships. However, world models paired with 3D simulation platforms and software are showing potential to demonstrate a deeper understanding of cause and effect in visual scenarios, such as simulating an industrial robot picking up a heavy object covered with debris.
World models help physical AI systems learn, adapt, and make better decisions by simulating real-world actions and predicting outcomes. They enable systems to “imagine” different scenarios, test actions, and learn from virtual feedback—much like a self-driving car practicing in a simulator to handle sudden obstacles or adverse weather conditions. By predicting possible outcomes, an autonomous machine can plan smarter actions without needing real-world trials, saving time and reducing risk.
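This "imagine, test, learn" loop can be sketched as planning by rollout: candidate action sequences are simulated inside a toy model of the world, and the plan with the best predicted outcome is selected without any real-world trial. The obstacle position and action set below are invented; a real world model would predict rich video or state trajectories instead of a single coordinate.

```python
# Toy rollout planning: evaluate every action sequence in "imagination"
# and keep the best one that avoids a predicted collision.
import itertools

OBSTACLE = 3  # imagined obstacle position on a 1-D track

def rollout(position, action_seq):
    """Predict the final position, or None if the path hits the obstacle."""
    for step in action_seq:
        position += step
        if position == OBSTACLE:
            return None  # predicted collision: discard this plan
    return position

def plan(position, horizon=3, actions=(0, 1, 2)):
    """Try every action sequence in imagination; keep the best safe one."""
    best_seq, best_final = None, None
    for seq in itertools.product(actions, repeat=horizon):
        final = rollout(position, seq)
        if final is not None and (best_final is None or final > best_final):
            best_seq, best_final = seq, final
    return best_seq, best_final

seq, final = plan(0)
print(seq, final)  # → (2, 2, 2) 6  (skips straight over the obstacle)
```

Exhaustive enumeration only works for tiny toy problems; practical planners sample or optimize over action sequences, but the principle of scoring imagined futures is the same.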
When combined with large language models (LLMs), world models help AI understand instructions in natural language and interact more effectively. For example, a delivery robot could interpret a spoken request to “find the fastest route” and simulate different paths to determine the best one.
This predictive intelligence makes physical AI models more efficient, adaptable, and safer—helping robots, autonomous vehicles, intelligent traffic systems, and industrial machines operate smarter in complex, real-world environments.
Policy learning involves exploring strategies to determine the most effective actions. A policy model helps a system, such as a robot, determine the best action to take based on its current state and the broader state of the world. It links the system’s state (e.g., position) to an action (e.g., movement) to achieve a goal or improve performance. A policy model can be derived from post-training a model. Policy models are commonly used in RL, where they learn through interaction and feedback.
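As a minimal illustration of a policy model in this sense, a mapping from the system's state to an action, the sketch below builds a tabular policy from invented demonstrations of moving toward a target position, then rolls it out. Real policy models are neural networks learned through RL interaction rather than a lookup table.

```python
# Toy policy model: a state -> action mapping fit from invented
# demonstrations of moving toward a target coordinate.
TARGET = 5

# Demonstrations: at each state, the demonstrated action steps toward TARGET
demos = [(s, 1 if s < TARGET else (-1 if s > TARGET else 0))
         for s in range(10)]

# "Train" a tabular policy from the (state, action) pairs
policy = {state: action for state, action in demos}

def act(state):
    """Policy model: pick the action for the current state."""
    return policy[state]

# Rolling the policy out drives the system to the target and holds it there
state = 0
for _ in range(8):
    state += act(state)
print(state)  # → 5
```

The essential structure matches the description above: the policy links the system's state (position) to an action (movement) chosen to achieve a goal.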
A reasoning world model can also be used to filter and critique synthetic data, improving its quality and relevance at speed.
World models enable strategy exploration, rewarding the most effective outcomes. Developers can add a reward module to run simulations and build cost models that track resource use, boosting both performance and efficiency for real-world tasks.
World models, when used with 3D simulators, serve as virtual environments to safely streamline and scale training for autonomous machines. With the ability to generate, curate, and encode video data, developers can better train autonomous machines to sense, perceive, and interact with dynamic surroundings.
World models bring significant benefits to every stage of the autonomous vehicle (AV) pipeline. With pre-labeled, encoded video data, developers can curate and train the AV stack to recognize the behavior of vehicles, pedestrians, and objects more accurately. These models can create predictive video simulations from text and visual inputs and generate new scenarios, such as different traffic patterns, road conditions, weather, and lighting. These generated scenarios can be used to post-train the reasoning vision-language-action model powering the vehicle and to accelerate testing and validation.
World models generate photorealistic synthetic data and predictive world states to help robots develop spatial intelligence. Using virtual simulations powered by physical simulators, these models let robots practice tasks safely and efficiently, accelerating learning through rapid testing and training. They help robots adapt to new situations by learning from diverse data and experiences.
Modified world models enhance planning by simulating object interactions, predicting human behavior, and guiding robots to reach goals accurately. They also enhance decision-making by conducting multiple simulations and learning from the feedback. With virtual simulations, developers can reduce real-world testing risks, cutting time, costs, and resources.
Trained with rich, multimodal data and advanced reasoning capabilities, world models can perform complex video analytics on massive amounts of recorded and live videos. These models enable natural language Q&A, automated summarization, object detection, event localization, and richer contextual understanding of visual content in videos—capabilities that surpass traditional computer vision methods. World models also generate photorealistic synthetic data on corner cases, helping to better train AI models to detect critical incidents.
Common applications of world models for video analytics are found in both industrial and smart city settings to improve safety and operational efficiency. Examples include identifying injury risks and unsafe behaviors for industrial safety, providing a detailed cause-and-effect understanding for rapid incident investigation, monitoring traffic, crowd flows, public safety incidents, and environmental hazards in smart cities, and identifying defects and irregularities on manufacturing lines through visual inspection for quality control.