TAIROS Technical Report (Tencent)
Abstract
Recent advancements in embodied intelligence and robotics have witnessed groundbreaking
innovations across hardware and AI model architectures. While significant progress has been
made in specialized foundation models for reasoning, multi-modal perception, manipula-
tion and locomotion, there remains a critical gap in unified platforms capable of seamless
cross-embodiment deployment for real-world robot applications. We present TAIROS, a
comprehensive embodied AI platform that integrates multi-modal perception, long-horizon
planning, and dexterous action capabilities into a unified modular architecture. Building
upon state-of-the-art LLM, VLM, and VLA models, TAIROS features three interoperable
modules: Embodied Perception, Embodied Planning, and Perception-Action, designed for
both integrated agent deployment and standalone functionality. Our platform demonstrates
exceptional generalization across diverse robotic embodiments (humanoids, quadrupeds,
bimanual manipulators) and real-world tasks including complex manipulation, dynamic
locomotion, and multi-modal interaction. Extensive validation on industrial and domestic
scenarios confirms TAIROS’s capabilities in bridging the gap between AI advancements and
physical-world applications.
1. Introduction
The advent of foundation models has ushered in a new paradigm for artificial intelligence
systems, with transformative impacts across vision, language, and decision-making domains.
These models, trained on Internet-scale datasets that encompass trillions of tokens and millions
of images, have demonstrated unprecedented generalization and adaptation capabilities.
Seminal works like GPT-4 [1] and Gemini [2] have shown how large-scale pretraining can yield
models that transfer effectively to downstream tasks with minimal fine-tuning. Particularly
in embodied intelligence and robotics, foundation models offer the promise of overcoming
longstanding challenges in generalization, sample efficiency, and multi-modal understanding
that have constrained traditional approaches.
Embodied intelligence represents an interdisciplinary field that integrates mechanical
engineering, embodiment design, control theory, and AI. The rapid advancement of foundation
models in AI has recently led to the emergence of numerous specialized models addressing
different aspects of embodied intelligence, which can be broadly categorized into four types:
multi-modal foundation models for embodied understanding, large language models for
embodied reasoning, vision-language-action (VLA) models for manipulation and navigation,
and simulation-based reinforcement learning for locomotion and whole-body control (WBC).
world-level common sense with visual states via mutual information maximization, demon-
strating robust performance in both closed-set and open-vocabulary scenarios. Parallel efforts
have developed frameworks that concurrently process visual and linguistic planning signals to
overcome spatial imagination limitations in pure LLM-based approaches [14]. The TaPA [15]
framework further advances this direction by grounding LLM-generated plans in physical
scene constraints through visual perception integration, while Robo2VLM [16] contributes a
data generation pipeline that derives VQA queries from real robot trajectories to improve
spatial reasoning in vision-language models.
Task decomposition and adaptive planning constitute another focus of research. Recent
innovations include multi-modal grounded planning systems that achieve data-efficient learning
in complex environments [17], and Egocentric Planning which combines symbolic planning
with Object-oriented POMDPs for scalable task achievement [18]. The InterPreT [19]
framework enables robots to learn symbolic predicates from non-expert language feedback,
facilitating generalization to novel tasks. SMART-LLM [20] demonstrates how LLMs can
coordinate multi-robot systems through programmatic task decomposition and coalition
formation, while MPO [21] introduces meta-plans, reusable high-level templates optimized
via execution feedback. The Embodied-Reasoner [22] extends visual search and reasoning to
interactive tasks through a three-stage training pipeline, and PRED [23] enhances robustness
by preemptively revising actions based on environmental discrepancy detection.
Benchmark development remains critical for evaluating progress in embodied planning.
The Embodied Agent Interface [24] establishes standardized evaluation using Linear Temporal
Logic to systematically assess 18 LLMs across key tasks such as goal interpretation and action
sequencing. However, there remains a notable scarcity of large-scale benchmarks in this
domain. To address this critical gap, we propose a novel benchmark specifically designed for
evaluating complex long-horizon planning tasks, which will serve as a comprehensive testbed
for assessing various foundational planning models of embodied intelligence.
The work most analogous to our Embodied Planning Module is the Cooperative
Embodied Language Agent (CoELA) [25], a modular framework that integrates perception,
memory, and communication modules for decentralized multi-agent collaboration. These
advancements collectively push the boundaries of embodied AI by addressing fundamental
challenges in reasoning, perception, and adaptive execution across diverse real-world scenarios.
using PaLM [29] for high-level goal interpretation. ACT [30] introduced temporal ensem-
bling and action chunking to achieve sub-millimeter precision in bimanual manipulation
through its CVAE-Transformer architecture. The emergence of diffusion-based methods
began with Diffusion Policy, which modeled multimodal action distributions and later incor-
porated UMI [31] framework improvements. Octo [32] set new benchmarks as a generalist
diffusion policy trained on over 4 million trajectories across 22 platforms using the Open
X-Embodiment dataset. OpenVLA [33] demonstrated efficient transfer through LLaMA-2
adaptation with DINOv2/SigLIP visual encoders. RDT-1B [34] advanced diffusion models
through its 1.2B-parameter architecture featuring a unified action space representation.
More recently, π0 [35] implemented flow matching for high-frequency control using
PaliGemma components and demonstrated exceptional cross-platform deployment capabilities
for robotic manipulation tasks. FAST [36] introduced frequency-space action tokenization
for 15x inference acceleration. Gemini Robotics leveraged Gemini 2.0 foundation model
capabilities for dexterous manipulation. Helix [37] achieved 200Hz humanoid control through
optimized transformer policies, while GR00T [38] developed a unified diffusion framework for
humanoid systems using Eagle-2 VLM components based on real-world robot data and exten-
sive IsaacSim data. These advances collectively demonstrate rapid progress along multiple
dimensions: scaling through foundation model approaches, specialization for particular control
regimes, and novel architectural innovations in action representation and policy learning.
Our Perception-Action Module adopts π0 as its foundational architecture. The module’s
implementation involves a two-phase training approach: initializing with the pre-trained π0
model’s parameters followed by domain-specific post-training using our proprietary dataset
collected through extensive teleoperation and simulation experiments. This dataset com-
prises multi-modal observations paired with corresponding action trajectories across diverse
manipulation tasks, enabling the model to maintain π0 ’s robust generalization capabilities
while adapting to our target operational environments and task requirements.
Figure 1: Paradigm shift from sensing-planning-acting to SLAP in the field of robotics.
1.5. Summary
While the field of embodied intelligence has witnessed significant progress across multiple
research directions, there remains a notable absence of comprehensive systems capable of
addressing all these aspects in an integrated manner. The TAIROS platform proposed in this
work represents a holistic system encompassing perception, planning, and execution capabili-
ties through three core functional modules: the Embodied Perception Module, Embodied
Planning Module, and Perception-Action Module. These modules are seamlessly integrated
through standardized interfaces to form a unified embodied intelligence agent capable of
executing end-to-end robotic tasks with cross-platform adaptability. Robot platforms meeting
hardware specifications can directly access TAIROS services through API calls or SDK-based
edge deployment. Importantly, the platform maintains flexible modularity, allowing each
component to be independently invoked for specific functions such as visual question an-
swering (VQA) in perception, query-based planning, or edge deployment of VLA and WBC
execution. The TAIROS platform has already been successfully deployed across multiple
robotic platforms from Unitree, PaXini, Leju, Dobot, Engine AI, and others, demonstrating its
practical applicability and versatility in real-world scenarios.
Figure 2: A framework overview of the TAIROS platform.
After years of continuous research and development, our colleagues at Tencent Robotics
X Lab have refined this framework through persistent iteration. Now it has evolved into a
more comprehensive and robust core technological framework, which we call the SLAP3
system (Sensing-Learning-Action: Perception, Planning, PAction, where PAction stands for
Perception-Action). The TAIROS platform is built upon the SLAP3 framework. See Figure 2
for an overview. TAIROS consists of three main modules that focus on perception, planning,
and execution, respectively. The Embodied Perception Module receives multi-modal signals
from various sensors, including robot proprioception, images from cameras, depth maps or point
clouds from depth cameras or LiDAR, and signals from tactile and force sensors. It then
reconstructs a dense 3D point cloud with object-level geometric fusion and semantic labeling
to compose a hierarchical scene graph. The scene graph is organized into multiple layers to
summarize information at the object level, view level, room level, and floor level for efficient
query and retrieval. The entire scene graph is organized into a structured data format and stored
in memory for RAG [48]. The Embodied Planning Module is an LLM-based reasoning agent
that receives user prompts and environment context from the Embodied Perception Module,
and then performs long-horizon reasoning through MCTS [49], CoT [50], and tool calling [51],
etc., to decompose a difficult task into sub-tasks, each of which can be completed by calling
the Perception-Action (PAction) Module. The PAction Module receives commands from the
Embodied Planning Module and vision-tactile-force-language embeddings from the Embodied
Perception Module to output robot actions. The Perception-Action module currently contains
two specific models for legged robot locomotion and gripper/dexterous-hand manipulation
separately. The locomotion model is trained in simulation using RL and deployed on real
robots through a general sim2real pipeline. The manipulation policy is a VLA model based
on an architecture similar to π0 [35]. In the future, the locomotion and manipulation models
Figure 3: The pipeline of the Embodied Perception Module.
will be unified. The three modules compose the complete embodied agent for end-to-end
deployment on any robot hardware platform that meets certain hardware requirements. Meanwhile,
each of the three modules can be called independently via self-contained APIs (service from
the cloud) or SDKs (for edge deployment). For example, the Embodied Perception Module
enables text prompt interaction with users, acting like a VLM for question answering and
scene understanding; the Embodied Planning Module can chat with users and help solve
long-horizon decision problems via text responses; and the Perception-Action Module can be
deployed on robot hardware for direct locomotion and manipulation tasks. Please refer to the
official site for direct usage: https://tairos.tencent.com/docs. In the following sections,
we describe each module in technical detail.
3.1. 3D Reconstruction
The pipeline begins with sensor data acquisition, which requires robots to be equipped with RGB-D
cameras and a 360-degree field-of-view LiDAR with a built-in IMU. For visual input, 2D object
detection is performed using YOLO-World [5] for open-vocabulary recognition, followed
by instance association via SAM2 [6] for robust multi-view tracking. Robot self-occlusions
are filtered out to isolate external scene observations. A dense 3D point cloud is then
reconstructed using SLAM techniques, with object-level geometric fusion to establish accurate
spatial positions. We utilize depth information and the camera’s intrinsic/extrinsic parameters
to map detected objects to the world coordinate system, generating corresponding object
point clouds and Z-axis-aligned oriented 3D bounding boxes. Subsequently, we perform point
cloud fusion for the same object based on spatial similarity in the world coordinate system
and semantic similarity.
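To make the back-projection step concrete, the following is a minimal Python sketch of lifting a masked depth image into the world coordinate system using the camera intrinsics and extrinsics; the function names, the pinhole model, and the simplified min/max bounding box are illustrative assumptions rather than the platform's implementation.

import numpy as np

def backproject_to_world(depth, mask, K, T_world_cam):
    """Lift masked depth pixels into the world frame.

    depth:       (H, W) depth map in meters
    mask:        (H, W) boolean instance mask (e.g., from SAM2 multi-view tracking)
    K:           (3, 3) camera intrinsic matrix
    T_world_cam: (4, 4) camera-to-world extrinsic transform
    Returns an (N, 3) object point cloud in world coordinates.
    """
    v, u = np.nonzero(mask)                   # pixel coordinates inside the instance mask
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]    # drop invalid depth readings

    # Pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous coordinates

    # Transform into the world coordinate system with the camera extrinsics.
    return (T_world_cam @ pts_cam.T).T[:, :3]

def simple_z_aligned_bbox(pts_world):
    """Simplified stand-in for the Z-axis-aligned box: min/max corners of the point cloud."""
    return pts_world.min(axis=0), pts_world.max(axis=0)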
which, along with the 3D BBOX, form an object node data structure. The VLM also infers
spatial relationships between objects, adding edges (e.g., “on” or “in”) between relevant object
nodes. At the View Level, key frame information, including camera poses, is integrated
into a view node data structure. Object nodes observed in a given key frame are linked to
their corresponding view node via hierarchical edges (“child of” or “parent of”), establishing
a connection between the object and view levels. The Room Level represents a room or a
defined spatial region, structured as a room node. The VLM clusters temporally and spatially
related views (e.g., five consecutive key frames all depicting a kitchen), linking their view
nodes to a single room node (e.g., “Kitchen”) via hierarchical edges. Finally, at the Floor Level,
the current implementation supports only single-floor scenes, meaning all room nodes are
connected to a single floor node through hierarchical edges. This multi-layered graph enables
a structured and hierarchical representation of the scene, integrating object detections, spatial
relationships, and semantic context across different levels of granularity.
(:objects
box.n.01_1 - box.n.01
chair.n.01_1 chair.n.01_2 chair.n.01_3 chair.n.01_4 - chair.n.01
coffeetable.n.01_1 - coffeetable.n.01
creditcard.n.01_1 - creditcard.n.01
diningtable.n.01_1 - diningtable.n.01
drawer.n.01_1 - drawer.n.01
floor.n.01_1 - floor.n.01
......
television.n.01_1 - television.n.01
vase.n.01_1 - vase.n.01
watch.n.01_1 - watch.n.01
window.n.01_1 window.n.01_2 window.n.01_3 - window.n.01
)
(:init
(open box.n.01_1)
(toggled_on floorlamp.n.01_1)
(open laptop.n.01_1)
(toggled_on lightswitch.n.01_1)
(ontop sofa.n.01_2 chair.n.01_4)
(ontop remotecontrol.n.01_1 coffeetable.n.01_1)
(ontop box.n.01_1 coffeetable.n.01_1)
(ontop keychain.n.01_1 coffeetable.n.01_1)
(ontop watch.n.01_1 coffeetable.n.01_1)
.....
(inroom window.n.01_1)
(inroom window.n.01_2)
(inroom window.n.01_3)
)
This workflow bridges low-level perception with high-level scene abstraction, enabling
robots to reason about environments in both geometric and semantic dimensions.
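As a minimal sketch of the representation described above, the hierarchical scene graph could be stored with node and edge records such as the following; the field names, edge labels, and the toy example are illustrative assumptions rather than the platform's data schema.

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    level: str                     # "object" | "view" | "room" | "floor"
    attributes: dict = field(default_factory=dict)   # e.g., caption, 3D bbox, camera pose

@dataclass
class Edge:
    src: str
    dst: str
    relation: str                  # spatial ("on", "in") or hierarchical ("child of")

class SceneGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, src, dst, relation):
        self.edges.append(Edge(src, dst, relation))

    def children(self, node_id):
        """Nodes one level below in the hierarchy, e.g., objects seen in a view."""
        return [e.src for e in self.edges
                if e.dst == node_id and e.relation == "child of"]

# Example: a cup on a table, observed in key frame "view_3" inside the kitchen.
g = SceneGraph()
g.add_node(Node("floor_1", "floor"))
g.add_node(Node("kitchen_1", "room"))
g.add_node(Node("view_3", "view", {"camera_pose": [0.4, 1.2, 0.8]}))
g.add_node(Node("cup_1", "object", {"label": "cup"}))
g.add_node(Node("table_1", "object", {"label": "table"}))
g.link("kitchen_1", "floor_1", "child of")
g.link("view_3", "kitchen_1", "child of")
g.link("cup_1", "view_3", "child of")
g.link("table_1", "view_3", "child of")
g.link("cup_1", "table_1", "on")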
The system processes user queries through a structured pipeline where the LLM first
interprets the input instruction to determine whether it falls within the scope of multi-modal
perception capabilities. If the query lies outside this operational domain, the system directly
generates an appropriate rejection response. For valid queries, the LLM dynamically selects
the optimal query modality, currently including options such as current field of view search,
directional/distance-based search, room-specific search (either current or designated), or global
environment search while simultaneously determining the corresponding query parameters.
The retrieved results, combined with the original user instruction, are then formatted into a
comprehensive prompt for the LLM, which subsequently generates both a natural language
response and visualizable outputs (such as target object IDs for 3D visualization or navigation
point computation). This integrated approach enables context-aware information retrieval
while maintaining robust rejection handling for out-of-distribution requests.
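A condensed sketch of this query pipeline is given below; the llm and retrieve callables, the JSON interchange format, and the mode names are placeholders for the actual components rather than the platform's API.

import json

QUERY_MODES = ["field_of_view", "direction_distance", "room", "global"]

def answer_query(user_query, scene_graph, llm, retrieve):
    """llm: callable(prompt) -> str; retrieve: callable(graph, mode, params) -> list of dicts."""
    # 1. Scope check: reject queries outside multi-modal perception capabilities.
    scope = llm("Is this answerable from the scene graph? Reply yes or no.\n" + user_query)
    if scope.strip().lower() != "yes":
        return {"response": "Sorry, this request is outside my perception capabilities."}

    # 2. Let the LLM pick a query modality and its parameters (JSON output assumed).
    plan = json.loads(llm(
        f"Choose a mode from {QUERY_MODES} and parameters for: {user_query}. "
        'Reply as JSON: {"mode": "...", "params": {}}'))

    # 3. Retrieve matching nodes from the hierarchical scene graph.
    results = retrieve(scene_graph, plan["mode"], plan["params"])

    # 4. Compose the final prompt; return a text answer plus visualizable outputs.
    answer = llm(f"Query: {user_query}\nRetrieved: {results}\nAnswer concisely.")
    return {"response": answer,
            "object_ids": [r.get("id") for r in results]}   # e.g., for 3D visualization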
4.1. Router
Upon receiving a voice instruction, the module first converts it into natural language
text via automatic speech recognition (ASR), then employs a router LLM for binary
classification with linguistic output (e.g., generating “simple” or “hard” labels) to categorize
task complexity. Simple tasks, such as direct verbal responses or basic action commands
for the Perception-Action Module, are handled by the Fast Embodied LLM, while complex
tasks requiring long-horizon planning are delegated to the Planning Embodied LLM. This
Figure 4: The pipeline of the Embodied Planning Module.
upon receiving initial task instructions, the agent can independently determine multiple rounds
of exploratory actions (e.g., <get_memory> to retrieve historical observations, <action> to
perform physical interactions). The agent dynamically adjusts the direction of each round
of actions based on the current scene graph and visual observations, continuing until the
termination condition is triggered. This mechanism transforms exploration from a one-
time blind search into an adaptive process with contextual memory. Second, reward-driven
exploration optimization: we have designed a fine-grained reward function that incorporates
multi-dimensional metrics such as object-matching F1-score, exploration path efficiency, and
format compliance. In particular, an “exploration reward” is introduced to quantitatively
evaluate the environmental feedback from each round of actions, encouraging the agent to
prioritize interactions with regions that maximize information gain.
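As a hedged illustration of such a composite reward, the sketch below combines an object-matching F1-score, a path-efficiency term, a format-compliance term, and a saturating exploration bonus; the weights and the information-gain proxy (newly observed scene-graph nodes) are assumptions, not the exact reward used in training.

def step_reward(pred_objects, gt_objects, path_len, optimal_len,
                format_ok, new_nodes, w=(1.0, 0.3, 0.2, 0.5)):
    """Composite reward for one exploration round.

    pred_objects / gt_objects: sets of object IDs the agent reported vs. ground truth
    path_len / optimal_len:    executed vs. shortest exploration path length
    format_ok:                 whether the agent's output followed the required format
    new_nodes:                 scene-graph nodes newly observed this round (information-gain proxy)
    """
    tp = len(pred_objects & gt_objects)
    precision = tp / max(len(pred_objects), 1)
    recall = tp / max(len(gt_objects), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)

    path_eff = min(optimal_len / max(path_len, 1e-8), 1.0)   # 1.0 means the path was optimal
    fmt = 1.0 if format_ok else 0.0
    explore = min(new_nodes / 5.0, 1.0)                      # saturating exploration bonus

    w_f1, w_path, w_fmt, w_exp = w
    return w_f1 * f1 + w_path * path_eff + w_fmt * fmt + w_exp * explore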
Interruption. There are two types of interruptions: instruction interruption and action
interruption. Instruction interruption occurs when the current instruction is in the process
of tool invocation or queuing, but action execution has not yet started. At this point, the
instruction processing is interrupted, and the result is stored in the historical instructions.
Action interruption occurs when the task initiated by the instruction sends an action sequence
and is awaiting the result. An additional interruption action is sent to stop the robot’s actions
promptly.
Instruction Tracking. The framework systematically manages the user input instruction
process, recording reflection content, tool invocation status, task decomposition, and execution
results. For any output action sequence, text response, or system message, the corresponding
instruction ID is bound. Once instruction processing is complete or interrupted, the data is
stored in the historical memory.
Agentic LLM. We propose two agents: a reactive agent based on a 32B base model,
which processes historical information following instructions, selects the appropriate tools,
and generates the corresponding tool parameters. The results of tool calls, as well as any
exceptions encountered during tool invocation, are handled through reflection by this agent.
Any user’s instruction is processed until the Terminate tool is invoked or the instruction
is interrupted. The second is a standard operating procedure (SOP) agent, which invokes tools
following a defined procedure. When the current mid-level task needs to be broken down into
actions, the Action tool is invoked. If errors occur during the execution of the actions, the
Error-Handle tool is called. Upon completion of the task, the Critic tool is invoked to assess
whether the task execution aligns with expectations.
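A minimal sketch of the SOP agent's control flow around the Action, Error-Handle, and Critic tools follows; the tool signatures, result fields, and retry policy are illustrative assumptions.

def run_sop_agent(mid_level_task, tools, max_retries=2):
    """tools: dict mapping tool name -> callable, e.g. {"Action": ..., "Error-Handle": ..., "Critic": ...}."""
    result = tools["Action"](mid_level_task)      # break the mid-level task into actions and execute

    retries = 0
    while result.get("error") and retries < max_retries:
        # On execution errors, let the Error-Handle tool propose a recovery and retry.
        recovery = tools["Error-Handle"](mid_level_task, result["error"])
        result = tools["Action"](recovery)
        retries += 1

    # After completion, the Critic tool judges whether execution matched expectations.
    verdict = tools["Critic"](mid_level_task, result)
    return {"result": result, "critic": verdict}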
Search. In embodied tasks, particularly execution in real-world or simulated environments,
obtaining ground truth trajectories is extremely difficult, and manual labeling is costly. As a
result, acquiring a large-scale supervised fine-tuning (SFT) dataset becomes a major challenge.
Once a model has acquired basic capabilities with a small amount of data, self-improvement
through generating its own training data becomes a reasonable approach. However, planning
problems can still be complex, especially when dealing with long sequences or rare actions,
as simple random sampling may fail to yield successful trajectories. MCTS addresses this by
simulating future states and evaluating action paths through a tree structure, allowing for
more informed decision-making. In the evaluation stage, the value is calculated via a trained
Figure 5: Ablation study of incorporating MCTS.
value model and MC rollouts. The action model, together with MCTS, can generate
trajectories of higher quality. Furthermore, we establish a self-improving loop in which
MCTS produces better trajectories, which are then filtered and used to train the model,
which in turn yields even better trajectories. We have observed that the model's performance
continues to improve after 5 to 10 iterations. We conduct experiments with a 7B base model
in an embodied symbolic simulation environment. On the same tasks, the final performance
in terms of pass rate outperforms all the baselines, including closed-source LLMs as well
as the latest reasoning models. Figure 5 shows the comparison results of various models.
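The self-improving loop can be summarized by the sketch below, in which trajectories generated with MCTS are filtered by quality and fed back as training data; mcts_rollout, score, and finetune are placeholders for the actual search, filtering, and training components, and the thresholds are assumptions.

def self_improve(policy, tasks, mcts_rollout, score, finetune,
                 iterations=8, keep_threshold=0.8):
    """Repeated rounds of: search -> filter -> retrain."""
    for _ in range(iterations):
        dataset = []
        for task in tasks:
            # MCTS simulates future states and evaluates action paths with a value
            # model plus MC rollouts, yielding higher-quality trajectories than
            # plain sampling from the policy.
            traj = mcts_rollout(policy, task)
            if score(traj) >= keep_threshold:        # keep only high-quality trajectories
                dataset.append(traj)
        policy = finetune(policy, dataset)           # filtered trajectories become training data
    return policy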
RL. The emergence of pure RL training paradigms on LLMs, such as DeepSeek-R1 [53],
provides another path for training embodied agents. It learns to understand the environment
(i.e., transitions) and solve tasks through exploration, interaction with the environment, and
reward feedback. We started directly from a general model without going through embodied
task-specific SFT before proceeding with PPO/GRPO. For reward feedback, we provided
sparse outcome rewards generated by a CoT-Reward Model. The trained reward model scores
based on the effectiveness, rationality, and efficiency of the final execution, producing an
overall score. Additionally, if an error occurs during execution, there is an extra penalty.
We used the Verl framework [54] and integrated a large number of cloud-based simulations,
providing execution status and results via headless execution for RL rollouts. We used a
7B model as the base model and AI2Thor (ALFRED) as the environment. Experimental
results in Figure 6 show that RL training can effectively improve the model’s capabilities,
Figure 6: Ablation study of incorporating RL training.
and that the trained model outperforms other closed-source and open-source models, as well as
some prompting-based methods specifically designed for ALFRED.
Reward Model. The Reward Model plays a crucial role in the training process of RL. We
leverage the CoT Reward Model to score the execution trajectories of agents. A trained 7B
model, through deep thinking, primarily evaluates a trajectory based on three key aspects:
1) Effectiveness (task completeness - for example, if the task is to fry an egg, the egg needs
to be cooked in the pan); 2) Rationality, such as deeming the trajectory irrational if it
contains illegal actions during execution; 3) Efficiency, which involves avoiding redundant
and ineffective actions or explorations. Ultimately, we generate an overall score based on
these three perspectives, serving as an evaluation of the trajectory’s execution effect.
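A hedged sketch of how the three aspects and the error penalty might be combined into an overall trajectory score is shown below; the prompt wording, aspect weights, and penalty value are assumptions rather than the trained reward model's exact recipe.

def score_trajectory(trajectory, reward_llm, error_penalty=0.2, weights=(0.5, 0.3, 0.2)):
    """trajectory: list of step dicts; reward_llm: callable(prompt) -> dict of aspect scores in [0, 1]."""
    aspects = reward_llm(
        "Think step by step, then rate this execution trajectory from 0 to 1 on "
        "effectiveness (was the task completed?), rationality (any illegal actions?), and "
        "efficiency (any redundant actions or explorations?).\n" + str(trajectory))

    w_eff, w_rat, w_effic = weights
    score = (w_eff * aspects["effectiveness"]
             + w_rat * aspects["rationality"]
             + w_effic * aspects["efficiency"])

    # Extra penalty whenever an execution error occurred along the trajectory.
    if any(step.get("error") for step in trajectory):
        score -= error_penalty
    return max(score, 0.0)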
model. Initially, we manually curate a small set of planning data (a few dozen examples) and
train the planning model via SFT to develop basic planning capabilities. Following this, the
planning model, operating at a higher temperature, performs extensive sampling for each
task. These generated plans are then executed in simulation by the action model. A critic
evaluates the execution’s success, and based on the results, we refine the planning model
through DPO. The criteria for positive and negative examples are as follows:
Through this training process, our planning model, integrated with the action model, significantly
improves the success rate on benchmark object-interaction tasks. Compared to the
baseline SFT version initialized with the same 32B base model, the success rate improved
from 37.3% to 45.7%. The trained planning model’s thinking process is as follows:
<think>
The task is to move the book to the bed and then move the phone to where
the book was originally located.
KEY OBJECTS:
- book.n.01_1 (ontop desk.n.01_1)
- bed.n.01_1 (ontop floor.n.01_1)
- cellphone.n.01_1 (ontop desk.n.01_2)
The robot must ensure that the book is moved to the bed before moving the
phone to the desk.
</think>
<mid_tasks>
1. Pick up the book.n.01_1 from desk.n.01_1 and place it on bed.n.01_1.
2. Pick up the cellphone.n.01_1 from desk.n.01_2 and place it on desk.n.01_1
where the book was originally located.
</mid_tasks>
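Complementing the DPO procedure described above, the following sketch shows one plausible way to assemble preference pairs from critic verdicts over high-temperature plan samples; since the original positive/negative criteria are not reproduced here, the pairing rule (successful versus failed plan for the same task) is an assumption, and planner, critic, and run_in_sim are placeholders.

def build_dpo_pairs(tasks, planner, critic, run_in_sim,
                    samples_per_task=16, temperature=1.0):
    """Sample plans at high temperature, execute them with the action model, and
    turn critic verdicts into (chosen, rejected) preference pairs for DPO."""
    pairs = []
    for task in tasks:
        plans = [planner(task, temperature=temperature) for _ in range(samples_per_task)]
        outcomes = [(plan, critic(run_in_sim(plan))) for plan in plans]

        positives = [p for p, ok in outcomes if ok]       # executed successfully per the critic
        negatives = [p for p, ok in outcomes if not ok]

        # One preference pair per (success, failure) combination for this task.
        for chosen in positives:
            for rejected in negatives:
                pairs.append({"prompt": task, "chosen": chosen, "rejected": rejected})
    return pairs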
information), along with the timestamp of each event. Robot info includes robot types
(e.g., humanoid robots with two arms, quadrupedal robots, etc.), robot functions (such as
navigation, grasping, etc.), and robot descriptions (e.g., name and owner). The interaction
memory and robot info are organized as prompt context.
4.5. Benchmark
Furthermore, we have developed a comprehensive benchmark specifically designed for
evaluating long-horizon and challenging embodied decision-making tasks. This benchmark
comprises 1,011 task samples distributed across seven primary categories: Object-Interaction
(363 samples, 35.9%), QA-Attribute (144 samples, 14.2%), QA-Alignment (131 samples,
13.0%), QA-Self-awareness (122 samples, 12.1%), QA-Spatial (104 samples, 10.3%), Navigation
(80 samples, 7.9%), and QA-Temporal (67 samples, 6.6%). It supports three distinct robot
configurations: Single-armed Robot, Dual-armed Robot, and Mobile Base (2.8%), and is
primarily tested in four typical indoor environments: Kitchen (39.9%), Bedroom (29.0%),
Living Room (22.3%), and Bathroom (8.7%). The evaluation framework is designed for end-
to-end assessment, providing a multi-dimensional analysis that includes task comprehension,
action sequence generation, reflective capability, task success rate, and user satisfaction.
Additionally, it incorporates a hierarchical scoring system tailored to different task categories,
employing metrics such as exact match, answer similarity, LLM-based scoring, and success
rate to ensure a thorough and nuanced evaluation of embodied AI systems.
Task Rejection Accuracy. Compare the boolean field acceptTask with the ground truth.
The score $S_{\mathrm{rej}}$ is
$$S_{\mathrm{rej}} = \mathrm{EM}(\text{acceptTask}, \text{gt\_acceptTask}) \tag{2}$$
Tool Usage. Match the correct tool call (e.g., get_time(), vqa(text), weather()) and
verify parameter correctness and simulation success:
$$S_{\mathrm{tool}} = \frac{1}{2}\left(\mathrm{Match}(\text{tool}) + \mathrm{Success}(\text{execution})\right) \tag{3}$$
Action Sequence and Task Completion. Let A be the predicted action sequence and G
the ground truth sequence. Define:
• Success Rate (Succ.): Fraction of tasks completed successfully.
• Goal Condition Success (GcS): Fraction of predicates in final state matched to the goal.
• Success weighted by Path Length (SPL):
$$\mathrm{SPL} = \frac{1}{|D|}\sum_{i=1}^{|D|} \frac{S_i L_i}{\max(P_i, L_i)} \tag{4}$$
where $S_i$ is success (1 or 0), $L_i$ is the length of the optimal path, and $P_i$ is the path length taken.
These scores are averaged to form $S_{\mathrm{action}}$:
$$S_{\mathrm{action}} = \frac{1}{3}\left(\mathrm{Succ.} + \mathrm{GcS} + \mathrm{SPL}\right) \tag{5}$$
Certain subcategories require LLM-based grading for Saction , using structured prompts
incorporating ground truth and prediction.
Answer Similarity (QA). Depending on category and subcategory:
$$S_{\mathrm{qa}} = \begin{cases} \mathrm{sim}(\text{answer}, \text{gt\_answer}) & \text{if similarity-based} \\ \mathrm{LLMScore}(\text{answer}, \text{gt\_answer}) & \text{if model-based} \end{cases} \tag{6}$$
Reflective Ability. Defined as the proportion of failed actions that were corrected with a
successful follow-up, forming the score $S_{\mathrm{ref}}$:
$$S_{\mathrm{ref}} = \frac{\mathrm{Num}(\text{Effective Reflections})}{\mathrm{Num}(\text{Failure Events})} \tag{7}$$
Alternatively, LLM-based grading can assess whether recovery was adequate.
Total Score. We define a weighted total score for each sample:
$$S_{\mathrm{total}} = \sum_{i \in D_{\mathrm{enabled}}} w_i S_i \tag{8}$$
where $D_{\mathrm{enabled}}$ is the set of active evaluation dimensions and $w_i$ is the weight (equal weights by
default, or set via config). All scores are reported per subcategory and aggregated by category.
Dimensions that are not applicable (i.e., $w_i = 0$) are omitted from $S_{\mathrm{total}}$. This structured
framework enables in-depth evaluation and fair comparison across baselines such as ReAct
and plan-and-execute, offering insight into performance across all critical embodied reasoning
dimensions.
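For reference, a short Python sketch of the per-sample scoring defined in Eqs. (4), (5), and (8) is given below; the field names and example values are illustrative.

def spl(successes, optimal_lens, path_lens):
    """Success weighted by Path Length, Eq. (4)."""
    terms = [s * l / max(p, l) for s, l, p in zip(successes, optimal_lens, path_lens)]
    return sum(terms) / len(terms)

def action_score(succ_rate, gcs, spl_value):
    """S_action, Eq. (5): average of Succ., GcS, and SPL."""
    return (succ_rate + gcs + spl_value) / 3.0

def total_score(scores, weights):
    """S_total, Eq. (8): weighted sum over enabled dimensions (weight 0 disables a dimension)."""
    return sum(weights[d] * s for d, s in scores.items() if weights.get(d, 0) > 0)

# Example: three episodes, then combining two dimensions with equal weights.
spl_value = spl([1, 0, 1], [4, 6, 3], [5, 6, 4])
s_action = action_score(succ_rate=2 / 3, gcs=0.8, spl_value=spl_value)
s_total = total_score({"action": s_action, "qa": 0.7}, {"action": 0.5, "qa": 0.5})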
We compared the ReAct and plan-and-execute planning frameworks using different models.
ReAct is a framework that combines reasoning and action in a recursive manner, where
the model is able to adaptively react to the environment based on intermediate feedback.
Plan-and-execute refers to a framework that focuses on first generating a high-level plan,
followed by execution of the individual steps in the plan, often leveraging symbolic reasoning
and task decomposition. This framework is more structured and hierarchical, in contrast to
the flexibility of ReAct’s recursive approach. We tested models such as GPT-4o, DeepSeek-
R1, Qwen-Max-Latest, Robobrain-7B/32B, among others, within both frameworks. For
various tasks, such as Object Interaction, Alignment, etc., we conducted evaluations using a
comprehensive pipeline specifically designed for embodied tasks, covering the entire process
from task understanding to user satisfaction. This end-to-end evaluation encompasses key
phases such as goal state prediction, task rejection, action sequence generation, reflective
abilities, tool usage, and task completion. The framework also incorporates a rich and
detailed scoring system that evaluates the models’ performance across multiple dimensions.
For instance, LLM-based scoring assesses how well the model’s responses align with the
expected outputs. Additionally, similarity metrics are used to compare generated responses
to reference answers, including exact match and partial match evaluations. Task completion
rates, including sub-task completion rates and execution path efficiency, are also measured.
These metrics provide a quantitative assessment of how effectively the model generates
coherent, accurate, and efficient responses. Furthermore, the task rejection component
ensures that models can accurately identify tasks outside their capabilities, a crucial skill for
real-world applications. The overall performance comparison is given in Table 1.
5. Perception-Action Module
The Perception-Action Module adopts a dual-model architecture consisting of a VLA
model for manipulation tasks and a simulation-based RL training pipeline for locomotion
tasks, reflecting the prevailing technical approaches for these distinct task categories in the
field. In our current system implementation, the manipulation model and locomotion model
remain decoupled. This design ensures operational clarity while requiring careful coordination
at the command level. The Embodied Planning Module addresses this by sequentially
Table 1: Overall performance of all compared models. The scores represent the weighted
total scores of all active evaluation dimensions of each model (row) under each task (column).
Models Performance Metrics ↑
Object Interaction Alignment Self-awareness Attribute Spatial Temporal Navigation
Tairos-Planning 60.82 66.74 74.38 66.07 52.62 53.44 62.09
GPT-4o+ReAct 44.06 70.00 71.69 60.70 39.10 48.13 58.00
DeepSeek-R1+ReAct 45.93 67.54 71.66 60.99 38.24 46.51 60.82
Claude-4.0-Sonnet+ReAct 50.56 69.19 68.14 65.87 48.90 51.87 57.98
Gemini-2.5-Pro+ReAct 48.01 37.33 51.95 64.38 49.07 48.09 53.16
Qwen-max-Latest+ReAct 43.90 58.60 63.91 56.90 38.90 47.60 60.50
Robobrain-7B+ReAct 36.85 20.76 25.33 49.90 50.70 41.02 52.69
Robobrain-32B+ReAct 37.98 21.43 26.12 51.30 50.92 42.35 54.40
GPT-4o+plan-and-execute 40.30 70.70 73.90 57.90 35.90 46.10 58.80
DeepSeek-R1+plan-and-execute 45.80 67.40 71.10 60.80 37.90 46.30 57.80
Claude-4.0-Sonnet+plan-and-execute 49.54 69.88 68.74 64.93 48.89 50.25 56.36
Gemini-2.5-Pro+plan-and-execute 47.09 37.61 50.89 62.17 49.24 47.72 52.87
Qwen-max-Latest+plan-and-execute 42.50 58.30 64.14 55.60 36.80 47.20 59.10
Robobrain-7B+plan-and-execute 35.47 20.52 25.68 49.43 49.02 39.63 51.64
Robobrain-32B+plan-and-execute 36.93 22.21 27.31 49.63 50.40 43.23 53.38
outputting corresponding commands for each model in its task execution pipeline, thereby
preventing conflicting calls and maintaining system stability during concurrent manipulation
and mobility operations. This architecture provides modular flexibility while ensuring reliable
task execution through explicit command sequencing.
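A minimal sketch of this sequential command dispatch is shown below; the command schema and result fields are illustrative assumptions rather than the platform's interface.

def dispatch(commands, locomotion_model, manipulation_model):
    """Execute planner commands strictly in order, invoking one model at a time.

    commands: list of dicts such as {"type": "move", "goal": ...} or
              {"type": "manipulate", "instruction": ...}
    """
    for cmd in commands:
        if cmd["type"] == "move":
            # Locomotion: RL policy trained in simulation and deployed via sim2real.
            result = locomotion_model(cmd["goal"])
        elif cmd["type"] == "manipulate":
            # Manipulation: VLA policy based on a pi0-like architecture.
            result = manipulation_model(cmd["instruction"])
        else:
            raise ValueError(f"unknown command type: {cmd['type']}")

        if not result.get("success", False):
            return {"status": "failed", "at": cmd, "detail": result}   # report back to the planner
    return {"status": "done"}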
Figure 7: An illustration of the VLA model for manipulation.
to calculate the joint angles of the target robotic arm by solving inverse kinematics. Then,
we leverage the known overhead camera extrinsics and calculated joint angles to render this
virtual robotic arm composited onto the inpainted background, aligning its end-effector pose
precisely with the recorded UMI gripper pose. This process generates synthetic overhead-view
sequences that visually simulate the robots’ view, significantly enhancing visual consistency
for cross-embodiment policy transfer.
3D Alignment. Multi-view images are widely used in recent VLA approaches due to their
implicit encoding of 3D information, which is crucial for spatial manipulation. However, learn-
ing robust multi-view representations typically requires large-scale real-world teleoperation
data, which is often limited in robotics. To inject stronger cross-view spatial understanding
into VLA models, we leverage external 3D visual representations, rather than relying solely on
the VLA models to learn them independently. Specifically, we adopt the 3D foundation model
VGGT [57], which has shown strong 3D perception capabilities from 2D images, as a teacher
model to guide VLA in learning powerful 3D visual correspondence. Nonetheless, VGGT is
originally trained on scene-level datasets with moderate pose variation and overlapping views,
while robotic settings, particularly those using head-mounted and wrist-mounted cameras,
involve much greater variation in pose and appearance. This domain gap hinders the direct
applicability of VGGT to embodied tasks.
To bridge this gap, we generate a multi-view dataset of 58K photorealistic synthetic
images, where a simulated Franka robot manipulates various objects in diverse indoor scenes.
The dataset provides precise labels for multi-view camera poses and point cloud alignment.
We use this high-quality dataset to fine-tune VGGT, enabling it to adapt to the head–wrist
camera configuration and demonstrate zero-shot generalization in our real-world dual-arm
robot scenarios. The fine-tuned VGGT is then used to generate cross-view-consistent features,
which supervise the output hidden states of the prefix VLM model via an alignment loss. This
guidance enables the VLM model to efficiently learn more powerful 3D visual representations
from the limited robotics dataset. However, pre-trained VLM models are typically trained
on large-scale internet data and encode strong semantic alignment between images and text.
Directly aligning VLM features with VGGT features may lead to a loss of this large-scale
pre-trained knowledge. To mitigate semantic forgetting during training, we therefore continue
to train the VLM on VQA and object localization tasks using a next-token-prediction loss in
parallel. Optionally, our method can leverage depth images as a known prior when available,
providing additional guidance for the network to produce more accurate predictions with
auxiliary information. The depth modality is processed by a block-specific MLP and added
token-wise in the middle of the transformer block.
The entire framework is trained end-to-end. VGGT is used only during training and
removed at inference. This design enables the robot to better reason over diverse image
streams (e.g., stereo, head, and wrist views), enhancing its understanding of 3D spatial
relationships in complex manipulation tasks.
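As a hedged, PyTorch-style sketch of the combined objective described above, the snippet below couples an alignment loss between the prefix VLM hidden states and the fine-tuned VGGT teacher features with a next-token-prediction loss on VQA/localization data; the projection head, cosine-based alignment term, and loss weight are assumptions rather than the exact training recipe.

import torch
import torch.nn.functional as F

def training_losses(vlm_hidden, vggt_feats, ntp_logits, ntp_labels,
                    proj, align_weight=0.5):
    """vlm_hidden:  (B, N, D_vlm) prefix-VLM hidden states for the visual tokens
       vggt_feats:  (B, N, D_3d) cross-view-consistent teacher features (training only)
       proj:        learned linear head mapping D_vlm -> D_3d
       ntp_logits / ntp_labels: next-token-prediction batch for VQA/localization data
    """
    # Alignment loss: pull the VLM's visual hidden states toward the 3D teacher features.
    aligned = proj(vlm_hidden)
    align_loss = 1.0 - F.cosine_similarity(aligned, vggt_feats, dim=-1).mean()

    # Next-token-prediction loss keeps the VLM's semantic (image-text) knowledge intact.
    ntp_loss = F.cross_entropy(ntp_logits.flatten(0, 1), ntp_labels.flatten())

    return ntp_loss + align_weight * align_loss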
Applications. We utilize the Dobot X-Trainer robot, a dual-arm system equipped with
wrist-mounted cameras on each gripper and an externally mounted overhead camera. The
task is to enable the robot to accurately grasp essence bottles and insert them vertically
into a container that features a hole at its bottom, with a diameter closely matching that of
the bottle. This setup poses a significant challenge due to the tight clearance of the hole,
requiring high-precision manipulation. Moreover, the initial positions of both bottles and
containers are randomized, demanding strong spatial generalization from the model. To
support post-training, we collect 1,000 demonstration trajectories. The deployed VLA model
achieves a success rate of over 80%. We also employ the PaXini Tora One humanoid robot as
our experimental platform to address a representative industrial task involving the packing
of multiple bottles with varying sizes and appearances (including laundry detergent and
water bottles) on a moving conveyor belt assembly line. For this specific task scenario, we
collected a dataset comprising 300 complete execution trajectories. Subsequent post-training
on our base model using this dataset demonstrates significant performance improvement,
achieving an average task success rate of over 80% (packing three objects counts as one task) in the
target industrial packing application. This result validates both the robot’s capability in
handling dynamic industrial manipulation tasks and the effectiveness of our data-driven
training approach for complex robotic operations. In addition to teleoperation data, we
also leverage data collected using hand-held grippers for model fine-tuning. This approach
enables the acquisition of high-quality, dexterous manipulation trajectories that are otherwise
challenging to obtain through teleoperation alone. We deploy and evaluate the fine-tuned
model on the JAKA-K1 robot, which is equipped with the same type of gripper used during
data collection (TEK CTAG2F90-C). Owing to the increased dexterity afforded by hand-held
data collection, we are able to extend the previous packing task by introducing a bimanual
handover step, in which bottles must be precisely transferred from one gripper to the other
before insertion. We collect 500 demonstrations, enabling the fine-tuned model to achieve over
80% success rate. These results show that our hand-held gripper data enables fine-grained
skill learning and successful transfer to real robots.
Figure 8: An illustration of the locomotion model.
6. Conclusion
We present TAIROS, an integrated platform comprising three core modules: a multi-modal
perception module, a long-horizon planning module, and a unified perception-action module.
These components are designed to operate independently through standardized APIs/SDKs
and collectively as a complete agent, providing robots with comprehensive end-to-end task exe-
cution capabilities. TAIROS is specifically engineered to address practical industrial demands,
supporting diverse robotic applications through its flexible architecture. The platform enables
robot manufacturers to offer embodied intelligence services via standardized interfaces, signif-
icantly lowering the development barrier for third-party integration. Additionally, TAIROS
incorporates cloud-based simulation capabilities that allow instant deployment of virtual
environments for planning and perception model validation, complete with pre-configured
robotic agents, scenarios, and tasks to accelerate capability demonstration. Looking forward,
the platform will leverage cloud inference clusters to deliver a fully integrated development
ecosystem encompassing data collection/annotation, algorithm training, model validation,
and OTA deployment to physical robots, creating a closed-loop workflow that enhances
research efficiency and industrial adoption in embodied intelligence. Currently, TAIROS has
demonstrated compatibility with diverse robotic morphologies, including bipedal/wheeled
humanoids, quadrupeds, and robotic arms, supporting various end-effectors from grippers to
dexterous hands. The platform has been successfully deployed in collaboration with multiple
robotics companies in industries including manufacturing, automotive, home appliances, and
exhibition services, validating its practical applicability across sectors.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren-
cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat,
et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu
Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a
family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini
Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
transferable visual models from natural language supervision. In International conference
on machine learning, pages 8748–8763. PMLR, 2021.
[4] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang,
Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with
grounded pre-training for open-set object detection. In European conference on computer
vision, pages 38–55. Springer, 2024.
[5] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-
world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 16901–16911, 2024.
[6] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu
Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2:
Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[7] Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen,
Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al.
Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In 2024
IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028.
IEEE, 2024.
[8] Bangguo Yu, Yuzhen Liu, Lei Han, Hamidreza Kasaei, Tingguang Li, and Ming Cao.
Vln-game: Vision-language equilibrium search for zero-shot semantic navigation. arXiv
preprint arXiv:2411.11609, 2024.
[9] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang,
Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint
arXiv:2502.13923, 2025.
[10] Wonje Choi, Woo Kyung Kim, Minjong Yoo, and Honguk Woo. Embodied cot distillation
from llm to off-the-shelf agents. arXiv preprint arXiv:2412.11499, 2024.
[11] Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. Epo: Hierarchical llm agents
with environment preference optimization. arXiv preprint arXiv:2408.16090, 2024.
[12] Hanwen Wan, Yifei Chen, Zeyu Wei, Dongrui Li, Zexin Lin, Donghao Wu, Jiu Cheng,
Yuxiang Zhang, and Xiaoqiang Ji. Embodiedagent: A scalable hierarchical approach to
overcome practical challenge in multi-robot control. arXiv preprint arXiv:2504.10030,
2025.
[13] Dejie Yang, Zijing Zhao, and Yang Liu. Planllm: Video procedure planning with refinable
large language models. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 39, pages 9166–9174, 2025.
[14] Jun Cen, Chenfei Wu, Xiao Liu, Shengming Yin, Yixuan Pei, Jinglong Yang, Qifeng
Chen, Nan Duan, and Jianguo Zhang. Using left and right brains together: Towards
vision and language planning. arXiv preprint arXiv:2402.10534, 2024.
[15] Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. Embodied task planning
with large language models. arXiv preprint arXiv:2307.01848, 2023.
[16] Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, and Ken Goldberg.
Robo2vlm: Visual question answering from large-scale in-the-wild robot manipulation
datasets. arXiv preprint arXiv:2505.15517, 2025.
[17] Taewoong Kim, Byeonghwi Kim, and Jonghyun Choi. Multi-modal grounded planning
and efficient replanning for learning embodied agents with a few examples. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 39, pages 4329–4337, 2025.
[18] Xiaotian Liu, Hector Palacios, and Christian Muise. Egocentric planning for scal-
able embodied task achievement. Advances in Neural Information Processing Systems,
36:54586–54613, 2023.
[19] Muzhi Han, Yifeng Zhu, Song-Chun Zhu, Ying Nian Wu, and Yuke Zhu. Interpret:
Interactive predicate learning from language feedback for generalizable task planning.
arXiv preprint arXiv:2405.19758, 2024.
[20] Shyam Sundar Kannan, Vishnunandan LN Venkatesh, and Byung-Cheol Min. Smart-llm:
Smart multi-agent robot task planning using large language models. In 2024 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 12140–12147.
IEEE, 2024.
[21] Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang,
and Sujian Li. Mpo: Boosting llm agents with meta plan optimization. arXiv preprint
arXiv:2503.02682, 2025.
[22] Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen,
Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, et al. Embodied-reasoner: Synergizing
visual search, reasoning, and action for embodied interactive tasks. arXiv preprint
arXiv:2503.21696, 2025.
[23] Jinyeon Kim, Cheolhong Min, Byeonghwi Kim, and Jonghyun Choi. Pre-emptive action
revision by environmental feedback for embodied instruction following agents. In 8th
Annual Conference on Robot Learning, 2024.
[24] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava,
Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface:
Benchmarking llms for embodied decision making. Advances in Neural Information
Processing Systems, 37:100428–100534, 2024.
[25] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenen-
baum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly
with large language models. arXiv preprint arXiv:2307.02485, 2023.
[26] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis,
Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu,
et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint
arXiv:2212.06817, 2022.
[27] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas,
Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al.
Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv
preprint arXiv:2311.01977, 2023.
[28] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron
David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al.
Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint
arXiv:2204.01691, 2022.
[29] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan
Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e:
An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[30] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained
bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
[31] Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng,
Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot
teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024.
[32] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees,
Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source
generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[33] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna,
Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla:
An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[34] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang,
Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual
manipulation. arXiv preprint arXiv:2410.07864, 2024.
[35] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn,
Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-
language-action flow model for general robot control. arXiv preprint arXiv:2410.24164,
2024.
[36] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong,
Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for
vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
[38] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi
Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open
foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734,
2025.
[39] Lei Han, Qingxu Zhu, Jiapeng Sheng, Chong Zhang, Tingguang Li, Yizheng Zhang,
He Zhang, Yuzhen Liu, Cheng Zhou, Rui Zhao, et al. Lifelike agility and play in
quadrupedal robots using reinforcement learning and generative pre-trained models.
Nature Machine Intelligence, 6(7):787–798, 2024.
[40] Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris
Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-
humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858,
2024.
[41] Huayi Wang, Zirui Wang, Junli Ren, Qingwei Ben, Tao Huang, Weinan Zhang, and
Jiangmiao Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds.
arXiv preprint arXiv:2502.10363, 2025.
[42] Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and
Xiaolong Wang. Exbody2: Advanced expressive humanoid whole-body control. arXiv
preprint arXiv:2412.13196, 2024.
[43] Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qingwei Ben, Muning Wen, Xiao Chen,
Jianan Li, and Jiangmiao Pang. Learning humanoid standing-up control across diverse
postures. arXiv preprint arXiv:2502.08378, 2025.
[44] Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong
Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint
arXiv:2506.14770, 2025.
[45] Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz,
Changliu Liu, Guanya Shi, Xiaolong Wang, et al. Hover: Versatile neural whole-body
controller for humanoid robots. arXiv preprint arXiv:2410.21229, 2024.
[46] Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi
Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation
and real-world physics for learning agile humanoid whole-body skills. arXiv preprint
arXiv:2502.01143, 2025.
[48] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al.
Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural
information processing systems, 33:9459–9474, 2020.
[49] Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and
Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing.
Advances in Neural Information Processing Systems, 37:52723–52748, 2024.
[50] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V
Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language
models. Advances in neural information processing systems, 35:24824–24837, 2022.
[52] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto
Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al.
Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic
simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.
[53] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao
Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning
capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[54] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang,
Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf
framework. arXiv preprint arXiv: 2409.19256, 2024.
[55] Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without
robots using only human videos. arXiv preprint arXiv:2503.00779, 2025.
[56] Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter:
Improving propagation and transformer for video inpainting. In Proceedings of the
IEEE/CVF international conference on computer vision, pages 10477–10486, 2023.
[57] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht,
and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the
Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[58] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis,
Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged
robots. Science Robotics, 4(26):eaau5872, 2019.