On-Device Diffusion Transformer Policy for Efficient Robot Manipulation

Yiming Wu1    Huan Wang2    Zhenghao Chen3    Jianxin Pang4    Dong Xu1 
1 School of Computing and Data Science, The University of Hong Kong
2 School of Engineering, Westlake University
3 School of Information and Physical Sciences, University of Newcastle
4 UBTech Robotics Corp.
{yimingwu, dongxu}@hku.hk
Corresponding authors: Huan Wang and Dong Xu.
Abstract

Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and extensive memory footprint. In this paper, we propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis on existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline, optimizing the model’s post-pruning recoverability explicitly. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on the standard datasets, i.e., PushT, Robomimic, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments. Extensive real-world experiments also show the proposed LightDP can achieve performance comparable to state-of-the-art Diffusion Policies.

1 Introduction

Diffusion Policies have demonstrated significant success in robotic manipulation tasks through imitation learning, as evidenced by various studies [51, 15, 4, 53, 8, 36, 37, 26, 46, 24, 1, 44]. This success fuels the ambition to deploy general-purpose embodied agents in robots, particularly those with limited computation resources. However, this endeavor presents multifaceted challenges: 1) Diffusion Policies require multiple denoising steps, which slows down the generation process; 2) the standard architectures [8, 36, 37] involve billions of parameters, leading to high memory usage. These factors impede real-time applications on resource-constrained platforms like mobile robots and drones. To address these challenges, recent work by DeeR-VLA [49] introduces a multi-exit architecture built on the Roboflamingo framework [26], enabling dynamic termination of the computation process to accelerate action prediction. While this design achieves considerable computation reduction on GPU devices, its early exit strategy remains suboptimally tuned for mobile platforms.

In this work, we introduce LightDP, a novel framework that enables Diffusion Policies to run in real time on mobile devices. We focus on two strategies: compressing the denoising network to speed up each inference step, and reducing the number of sampling steps. First, we analyze two Diffusion Policies, DiffusionPolicy Transformer (DP-T) [8] and MDT-V [36]. A component-wise evaluation shows that the denoiser is the major bottleneck in both (see Table 1). We follow the conventional model pruning pipeline, in which the model is pruned and then retrained to recover the lost performance. In previous pruning approaches based on importance metrics [31], oracle design [23], or the lottery ticket hypothesis [13], pruning and retraining are separate stages, which can lead to suboptimal performance. In contrast, we integrate pruning and retraining in a unified framework that explicitly models and optimizes the post-finetuning performance of the pruned model, thereby enhancing the recoverability of Diffusion Policies. Second, reducing the sampling steps is another straightforward way to speed up diffusion policies, but without distillation it inevitably degrades performance. To preserve action prediction accuracy with fewer inference steps, we combine the proposed pruning strategy with consistency distillation [43, 41]. With LightDP, diffusion policies achieve real-time generation on mobile devices with competitive performance on the PushT, Robomimic, CALVIN, and LIBERO benchmarks. Our contributions are summarized as follows:

  • We present a novel framework that yields an efficient diffusion transformer for Diffusion Policies, achieving real-time action prediction on mobile devices at significantly lower latency than the original models.

  • To our knowledge, this is the first work to address deploying Diffusion Policies on mobile devices. We provide a comprehensive analysis of these policies’ computational cost and memory footprint.

  • We integrate pruning and step distillation in a unified framework that enhances the recoverability of the models, and benchmark it extensively on widely used datasets, e.g., Push-T, Robomimic, CALVIN, and LIBERO. Extensive real-world evaluations demonstrate the effectiveness of our approach in practical scenarios.

2 Related Work

2.1 Diffusion Policies

Several studies have investigated the application of diffusion models [40, 22, 19] to policy learning, such as BESO [35], Diffusion Policy [8], MDT [36], and MoDE [37]. Some approaches integrate pretrained visual-language models [36] directly into end-to-end visuomotor manipulation policies, but these often involve significant architectural constraints or require calibrated cameras, limiting their generalizability. Further extensions to 3D representations [50] enable the model to tackle complex 3D robotic manipulation tasks, demonstrating superior performance compared to traditional methods. Despite their success, these methods often require extensive fine-tuning and are computationally expensive, limiting their deployment on resource-constrained devices. Reuss et al. [37] propose an MoE-based policy network that can be trained end-to-end and activates only a few parameters during inference, significantly reducing the computational cost. Some concurrent works [15, 49, 2, 39] explore accelerating the inference of VLA models.

In this work, we focus on compressing the policy models and deploying the model on resource-constrained devices, such as smartphones and NVIDIA Jetson devices.

2.2 Network Pruning for Diffusion Models

Due to the significant computational demands of diffusion models, many works aim to enhance efficiency by either pruning network components [17, 25, 45, 9, 7, 6] or employing knowledge distillation [18, 38, 32]. The former targets reducing the model’s size while the latter cuts down on the number of required denoising steps. For instance, Li et al. introduced SnapFusion [27], an early method that accelerates diffusion models by modifying the architecture through channel and block pruning alongside distillation techniques. SnapFusion determines the importance of each block by evaluating both the degradation in CLIP score and the gain in inference speed, and the blocks are removed using a “trial-and-error” procedure [34, 33]: those causing the smallest drop in CLIP score and the largest boost in speed are considered less critical. Additionally, SnapFusion incorporates a CFG-aware distillation loss to better align the outputs of a pruned (student) model with those of its original (teacher) one after classifier-free guidance is applied.

In a similar vein, BK-SDM [23] accelerates Stable Diffusion by eliminating entire weight blocks, although it relies solely on the CLIP score to assess importance. A subsequent finetuning step based on feature distillation helps recover performance, achieving a reduction in model size of around 30% to 50% with marginal performance loss. The resultant model is then further refined into EdgeFusion [5] based on a robust distillation method named LCM [29].

Furthermore, Google’s MobileDiffusion [52] applies pruning to shrink model size but goes a step further by introducing additional architectural modifications. These include adding more transformer layers in the U-Net’s intermediate stages, reducing the number of channels, and decoupling self-attention from cross-attention to enhance performance. Complemented by a specific distillation loss inspired by SnapFusion and UFOGen [48], it achieves remarkably fast inference speeds reportedly around 0.2 seconds on iPhone 15 Pro.

In parallel, SANA-1.5 [47] presents a linear diffusion transformer that introduces a block-level importance analysis for model depth pruning, enabling compression to arbitrary sizes with minimal quality drop. The pruned SANA models can even be scaled back up at inference via a repeated sampling strategy to match larger-model performance. In the realm of on-device applications, Edge-SD-SR [16] adapts Stable Diffusion for super-resolution by trimming the model to only 169M parameters through a specialized bidirectional conditioning design and joint training, enabling $4\times$ upscaling in 1.1 s on mobile hardware while matching or surpassing dedicated super-resolution methods in quality.

3 Preliminaries

Diffusion Models. Diffusion models [42, 22] are a class of generative models that iteratively produce data by gradually adding and removing noise. They involve two main processes: 1. Forward Diffusion Process: Noise is progressively added to the input data, transforming it into a noise-like distribution. 2. Reverse Denoising Process: The original input is reconstructed from the noisy data by progressively removing the added noise. Within a continuous-time framework, adding independent and identically distributed (i.i.d.) Gaussian noise with standard deviation $\sigma$ to the data distribution $p_{\text{data}}(\boldsymbol{x}_0)$ results in a noisy distribution $p(\boldsymbol{x};\sigma)$. As $\sigma$ increases from a small value $\sigma_{\min}$ to a large value $\sigma_{\max}$, $p(\boldsymbol{x};\sigma_{\max})$ approximates pure noise. The probability flow ordinary differential equation (PF-ODE) describes the evolution of the data under this noise addition:

$$\mathrm{d}\boldsymbol{x} = -\dot{\sigma}_t\,\sigma_t\,\nabla_{\boldsymbol{x}}\log p(\boldsymbol{x},\sigma_t)\,\mathrm{d}t, \tag{1}$$

where $\nabla_{\boldsymbol{x}}\log p(\boldsymbol{x},\sigma_t)$ is the score function, often approximated by $\frac{D_\theta(\boldsymbol{x};\sigma_t)-\boldsymbol{x}}{\sigma_t^2}$. Within the EDM [22] framework, the denoising function $D_\theta(\boldsymbol{x}_t,\sigma_t)$ is parameterized as:

$$D_\theta = c_{\text{skip}}(t)\,\boldsymbol{x}_t + c_{\text{out}}(t)\,f_\theta\big(c_{\text{in}}(t)\,\boldsymbol{x}_t,\; c_{\text{noise}}(t)\big), \tag{2}$$

where $f_\theta$ is a neural network trained to minimize the $L^2$ denoising error, and $c_{\text{skip}}$, $c_{\text{in}}$, $c_{\text{out}}$, and $c_{\text{noise}}$ are time-dependent coefficients.
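To make the parameterization in Equation 2 concrete, the following is a minimal sketch that plugs in the coefficient definitions published in the EDM paper [22], with the noise schedule $\sigma(t) = t$. Whether LightDP uses exactly these coefficients is an assumption here, and the names `SIGMA_DATA`, `edm_coefficients`, and `denoise` are illustrative only.

```python
import math

# Assumption: sigma_data = 0.5 is the EDM default; this paper does not state its value.
SIGMA_DATA = 0.5


def edm_coefficients(sigma: float, sigma_data: float = SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) as defined in the EDM paper."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise


def denoise(f_theta, x_t, sigma, sigma_data=SIGMA_DATA):
    """Equation 2: D = c_skip * x_t + c_out * f_theta(c_in * x_t, c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma, sigma_data)
    return c_skip * x_t + c_out * f_theta(c_in * x_t, c_noise)
```

With these definitions, $c_{\text{skip}}$ tends to 1 and $c_{\text{out}}$ tends to 0 as $\sigma$ approaches 0, so the denoiser returns its input unchanged at vanishing noise.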

Consistency Models. Consistency models, a family of generative models, are designed to generate data efficiently by directly mapping noisy inputs to their clean counterparts in a single step. They enforce a self-consistency property that ensures the model’s outputs remain invariant across different noise levels, i.e., $f_\theta(\boldsymbol{x}_t,t)=f_\theta(\boldsymbol{x}_{t'},t')$, where $\boldsymbol{x}_t$ and $\boldsymbol{x}_{t'}$ are samples taken at different time steps $t$ and $t'$ along the ODE trajectory. In the EDM framework, consistency models adopt the boundary conditions $c_{\text{skip}}(0)=1$ and $c_{\text{out}}(0)=0$. One approach to training these models, known as consistency distillation, involves refining a pre-trained diffusion model by minimizing the consistency loss:

$$\mathcal{L}_{CD}(\theta,\theta^{-};\Psi) = \mathbb{E}\left[d\left(f_\theta\left(\boldsymbol{x}_{t_{n+k}},t_{n+k}\right),\; f_{\theta^{-}}\left(\hat{\boldsymbol{x}}_{t_n}^{\Psi,\omega},t_n\right)\right)\right], \tag{3}$$

where $d$ is a distance function, $\hat{\boldsymbol{x}}_{t_n}^{\Psi,\omega}$ is the data reversed by an ODE solver $\Psi$ with classifier-free guidance weight $\omega$, $n$ is the time step of the pre-trained diffusion model, and $k$ is the step interval.
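The loss in Equation 3 can be illustrated with a small NumPy sketch. It rests on several assumptions that are not taken from the paper: the distance $d$ is a mean squared error, the solver $\Psi$ is a single Euler step of the PF-ODE with $\sigma(t)=t$, classifier-free guidance ($\omega$) is omitted, and the momentum value is a placeholder. All function names are hypothetical.

```python
import numpy as np


def euler_solver_step(denoiser, x, sigma_from, sigma_to):
    """One Euler step of the PF-ODE with sigma(t) = t: dx/dsigma = (x - D(x, sigma)) / sigma."""
    slope = (x - denoiser(x, sigma_from)) / sigma_from
    return x + (sigma_to - sigma_from) * slope


def consistency_distillation_loss(student, target, teacher_denoiser, x0, sigmas, n, k, rng):
    """Squared-L2 instance of Equation 3 for one pair of noise levels (t_n, t_{n+k})."""
    noise = rng.standard_normal(x0.shape)
    x_hi = x0 + sigmas[n + k] * noise  # sample at the noisier level t_{n+k}
    # Teacher moves the sample back to t_n along the ODE trajectory.
    x_lo = euler_solver_step(teacher_denoiser, x_hi, sigmas[n + k], sigmas[n])
    diff = student(x_hi, sigmas[n + k]) - target(x_lo, sigmas[n])
    return float(np.mean(diff**2))


def ema_update(target_params, student_params, momentum=0.95):
    """Momentum (EMA) update of the target parameters; 0.95 is an assumed value."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(target_params, student_params)]
```

In an actual training loop the loss would be computed in an autodiff framework so that gradients reach the student parameters; NumPy is used here only to show the data flow.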

4 Method

4.1 Problem Formulation

Recent advances in imitation learning have enabled robots to learn complex manipulation tasks from demonstrations collected by human experts. Given a set of demonstrations $\mathcal{T}$, a trajectory $\tau\in\mathcal{T}$ is a sequence of observations $\boldsymbol{o}$ and robot actions $\boldsymbol{a}$, denoted as $\tau=\{(\boldsymbol{o}_1,\boldsymbol{a}_1),\ldots,(\boldsymbol{o}_{N_\tau},\boldsymbol{a}_{N_\tau})\}$. A diffusion policy $\pi_\phi(\boldsymbol{a}|\boldsymbol{o},\mathbf{g})$ is trained to imitate the expert’s behavior by maximizing the log-likelihood of the action $\boldsymbol{a}$ given the observation $\boldsymbol{o}$ and goal $\mathbf{g}$. Under the multi-modal setting, the goal $\mathbf{g}$ is a high-level instruction that specifies the desired outcome of the task, which can be a language instruction or a target observation. Generally, the diffusion policy parameterized by $\phi$ is composed of an observation encoder $\boldsymbol{E}$, a diffusion transformer $\boldsymbol{D}$, and a goal encoder $\boldsymbol{G}$. The observation encoder $\boldsymbol{E}$ extracts features from the observation $\boldsymbol{o}$, while the diffusion transformer $\boldsymbol{D}$ generates the action $\boldsymbol{a}$ conditioned on the observation $\boldsymbol{o}$ and goal $\mathbf{g}$.
By substituting the notations into Equation 1, the diffusion policy estimates the score function $\nabla_{\boldsymbol{a}}\log p(\boldsymbol{a}|\boldsymbol{o},\mathbf{g})$ at timestep $t$ via score matching as follows:

$$\mathcal{L}_{DM}=\mathbb{E}_{\sigma,\boldsymbol{a},\boldsymbol{\epsilon}}\big[\alpha(\sigma_t)\,\|\pi_\phi(\boldsymbol{a}_t,\boldsymbol{o},\mathbf{g},\sigma_t)-\boldsymbol{a}\|_2^2\big], \tag{4}$$

where $\pi_\phi=\boldsymbol{a}+\sigma_t^2\nabla_{\boldsymbol{a}}\log p(\boldsymbol{a}|\boldsymbol{o},\mathbf{g})$ is the neural network, $\boldsymbol{a}_t$ is the noised action at timestep $t$, and $\alpha(\sigma_t)$ is the loss weight. The diffusion model is trained by minimizing the score matching loss $\mathcal{L}_{DM}$, which encourages the model to generate actions that are consistent with the expert’s demonstrations. In this work, we focus on accelerating the pretrained policy models with pruning and distillation algorithms, and then deploy the models on mobile devices for real-time robot manipulation.
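A Monte Carlo estimate of the loss in Equation 4 could look like the sketch below. It assumes a discrete set of noise levels sampled uniformly and a constant loss weight by default; neither choice comes from the paper, and the function and argument names are hypothetical.

```python
import numpy as np


def diffusion_policy_loss(policy, actions, obs, goal, sigmas, rng, weight=np.ones_like):
    """Batch estimate of Equation 4 for expert actions of shape (batch, action_dim)."""
    # Assumption: one noise level per sample, drawn uniformly from a discrete set.
    sigma = rng.choice(sigmas, size=(actions.shape[0], 1))
    noised = actions + sigma * rng.standard_normal(actions.shape)  # a_t
    pred = policy(noised, obs, goal, sigma)  # predicted clean action
    per_sample = np.sum((pred - actions) ** 2, axis=1, keepdims=True)
    return float(np.mean(weight(sigma) * per_sample))
```

A policy that always recovers the expert action drives this quantity to zero, while a policy that simply returns its noised input is penalized in proportion to the injected noise.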

4.2 Latency Analysis of Diffusion Policies

Figure 1: The network architecture of the MDT-V model. The model consists of three main components: the observation encoder $\boldsymbol{E}$, the goal encoder $\boldsymbol{G}$, and the diffusion transformer $\boldsymbol{D}$.
(a) DP-T Model

| Model | Metric | IE | DT |
|---|---|---|---|
| Original | Latency (ms) | 1.28 | 0.906 |
| Original | Parameter (M) | 11.2 | 8.97 |
| Original | NFE | 1 | 100 |
| Original | Total Latency (ms) | 1.28 | 90.6 |
| Pruned | Latency (ms) | 1.28 | 0.68 |
| Pruned | Parameter (M) | 11.2 | 4.76 |
| Pruned | NFE | 1 | 4 |
| Pruned | Total Latency (ms) | 1.28 | 2.72 |

(b) MDT-V Model

| Model | Metric | GLE | IE | DT |
|---|---|---|---|---|
| Original | Latency (ms) | 3.74 | 3.78 | 2.25 |
| Original | Parameter (M) | 151.28 | 111.05 | 22.52 |
| Original | NFE | 1 | 2 | 10 |
| Original | Total Latency (ms) | 3.74 | 7.56 | 22.25 |
| Pruned | Latency (ms) | 3.74 | 3.78 | 1.025 |
| Pruned | Parameter (M) | 151.28 | 111.05 | 12.47 |
| Pruned | NFE | 1 | 2 | 4 |
| Pruned | Total Latency (ms) | 3.74 | 7.56 | 4.1 |

Table 1: Time analysis for the (a) DiffusionPolicy Transformer (DP-T) and (b) MDT-V models on iPhone 13 (the "Original" rows show the original models, and the "Pruned" rows show the pruned models). The device features a 16-core Apple Neural Engine capable of 16 trillion operations per second. With the aid of LightDP, the diffusion transformers in DP-T and MDT-V achieve latency reductions from 90.6 ms and 22.25 ms to 2.72 ms and 4.1 ms, respectively. IE: Image Encoder; DT: Diffusion Transformer; GLE: Goal Language Encoder; NFE: number of score function evaluations, i.e., inference steps; M: million; ms: milliseconds.

Since the diffusion policy is designed for real-time robot manipulation, it is crucial to assess the on-device latency of the policy models. Given the structural similarities among these models, we use the MDT-V model as an example. As shown in Figure 1, the MDT-V model supports multiple modalities of input, including an observation encoder for extracting the image features (i.e., the Voltron Network [21] for MDT-V model), a goal encoder for processing the high-level instruction (i.e., the CLIP Text Encoder), and a diffusion transformer for generating the robot action.

As shown in Table 1, we evaluate the latency of the DP-T and MDT-V models on an iPhone 13. The DP-T network consists of two major components. The image encoder employs a ResNet18 model to convert the input image into an embedding that conditions the diffusion transformer, and accounts for only a tiny portion of the total latency (1.28 ms). The diffusion transformer is an 8-layer transformer that requires 100 iterative denoising steps to produce the final action prediction, making it the main bottleneck of the model (90.6 ms). A similar observation holds for the MDT-V model, where the Voltron network costs relatively little time (7.56 ms) compared to the diffusion transformer (22.25 ms), which slows down on-device generation. Breaking down the architecture of the policy models thus identifies the diffusion transformer as the bottleneck in both. The diffusion transformer can be formulated as a stack of $N$ transformer blocks, where each block contains a multi-head attention (MHA) layer and a feed-forward network (FFN) layer, i.e., $\boldsymbol{\phi}_i=\text{FFN}(\text{MHA}(\cdot))$. Because the diffusion transformer requires multiple denoising steps to generate the action prediction, its per-step cost is multiplied by the number of steps, which leads to high overall latency. To address this issue, we propose to accelerate the model by pruning and distillation, as described in the following sections.
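The denoiser totals in Table 1 follow from multiplying the per-step latency by the NFE. The short sketch below reproduces that arithmetic using only numbers copied from the table; the dictionary layout and the function name are illustrative and not part of the paper.

```python
# Per-step denoiser latency (ms) and NFE copied from Table 1.
# "encoders_ms" sums the reported total latencies of the non-denoiser components.
TABLE_1 = {
    "DP-T": {"encoders_ms": 1.28, "original": (0.906, 100), "pruned": (0.68, 4)},
    "MDT-V": {"encoders_ms": 3.74 + 7.56, "original": (2.25, 10), "pruned": (1.025, 4)},
}


def denoiser_total_ms(step_ms: float, nfe: int) -> float:
    """Total denoiser latency = per-step latency x number of function evaluations."""
    return step_ms * nfe


for name, row in TABLE_1.items():
    before = denoiser_total_ms(*row["original"])
    after = denoiser_total_ms(*row["pruned"])
    print(f"{name}: denoiser {before:.2f} ms -> {after:.2f} ms "
          f"({before / after:.1f}x), end-to-end {row['encoders_ms'] + after:.2f} ms")
```

For the original MDT-V denoiser the product is 2.25 × 10 = 22.5 ms, slightly above the 22.25 ms listed in Table 1; the sketch cannot tell which of the two figures carries the rounding.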

4.3 Prune the Model by Learning

Figure 2: The training pipeline of our proposed LightDP. In the left figure, we present the consistency distillation pipeline adopted in our method. The Student Model $f_\phi$ is initialized with the Teacher Model $f_\psi$ and then pruned by the learnable pruning technique introduced in Section 4.3. Given the sampled demonstration data $(\boldsymbol{o},\boldsymbol{a},\mathbf{g})$, we first add noise to obtain the noised action $\boldsymbol{a}_t$ at timestep $t$; the Teacher Model $f_\psi$ is then used to predict the noised action $\boldsymbol{a}_{t+k}$ at timestep $t+k$. Next, the two noised actions $\boldsymbol{a}_t$ and $\boldsymbol{a}_{t+k}$ are fed into the Student Model $f_\phi$ and the Target Model $f_{\phi^\star}$ to calculate the consistency loss. The Target Model is updated by the Student Model with a momentum update. In the right figure, we present the prune-by-learning technique used in our method, where a set of Bernoulli variables (gate scores) is learned to perform differentiable sampling of the pruned model and is jointly optimized with the model parameters during the pruning process.

To obtain a smaller model, we adopt layer pruning to remove redundant layers in the diffusion transformer. Given the $N$-layer diffusion transformer, we aim to find a binary mask $\mathcal{M}(N)=\{m_1,m_2,\ldots,m_N\}$ identifying the layers to be pruned, where $m_i\in\{0,1\}$ indicates whether the $i$-th layer is retained ($m_i=1$) or pruned ($m_i=0$). Conventionally, the pruning process is formulated as an optimization problem that minimizes the loss $\mathcal{L}$ after pruning, $\min_{\mathcal{M},\pi_{\hat{\phi}}}\mathbb{E}_x\left[\mathcal{L}(x,\pi_\phi,\mathcal{M})\right]$, where $\pi_\phi=\Pi_{i=1}^{N}\boldsymbol{\phi}_i$ is the vanilla model and $\pi_{\hat{\phi}}$ is the model after pruning.

However, this pruning problem is NP-hard [3, 14] since both the mask $\mathcal{M}$ and weight $\hat{\phi}$ are jointly optimized. To address this, a common approach is a two-stage pruning process: first determine the mask $\mathcal{M}$ (by minimizing the loss $\mathcal{L}$ with a given criterion), then fine-tune the pruned model to recover performance. However, this two-step approach can be suboptimal, since the model may not fully recover performance after pruning. To address this issue, we propose to use a single-stage pruning method [10], where the mask $\mathcal{M}$ and weight $\hat{\phi}$ are jointly optimized to minimize the loss $\mathcal{L}$ after pruning.

Specifically, the mask $\mathcal{M}$ is modeled as a probability distribution $\mathcal{M}_i\sim\text{Bernoulli}(p_i)$, where $p_i$ is the gate score optimized during the training process. We leverage Singular Value Decomposition (SVD) to estimate layer importance, since SVD is a common technique in model compression [17, 25]. Compared to alternatives like Canonical Polyadic or Kronecker product decompositions, SVD provides singular values that capture the most significant components of a weight matrix. We initialize the gate score with the SVD decomposition, which is formulated as:

$$\mathcal{I}(\boldsymbol{W})=\|\boldsymbol{W}-\mathrm{SVD}(\boldsymbol{W},k)\|_F=\|\boldsymbol{W}-\boldsymbol{U}_k\boldsymbol{S}_k\boldsymbol{V}_k^{T}\|_F, \tag{5}$$

where $\boldsymbol{W}$ is the weight matrix of the transformer block, and $\mathrm{SVD}(\boldsymbol{W},k)$ is the reconstructed weight matrix using the top-$k$ singular values. Specifically, the SVD decomposition is applied to the weight matrix of each transformer block, including the query, key, and value weight matrix of the attention layer and MLP layers in the FFN module. Then, the gate score is initialized with the importance score by $p_i=\frac{\mathcal{I}(\phi_i)}{\sum_{j=1}^{L}\mathcal{I}(\phi_j)}$, where $\phi_i$ is the weight matrix in the $i$-th block of the diffusion transformer.
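The importance score of Equation 5 and the gate-score initialization can be sketched in NumPy as follows. How the scores of the several matrices inside one block (query, key, value, MLP) are combined is not specified in the text, so summing them per block is an assumption made here, as are the function names.

```python
import numpy as np


def svd_importance(weight: np.ndarray, k: int) -> float:
    """Equation 5: Frobenius norm of W minus its rank-k truncated SVD reconstruction."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    low_rank = (u[:, :k] * s[:k]) @ vt[:k, :]
    return float(np.linalg.norm(weight - low_rank, ord="fro"))


def init_gate_scores(blocks: list[list[np.ndarray]], k: int) -> np.ndarray:
    """Assumed aggregation: sum the importance of every matrix in a block, then normalize."""
    scores = np.array([sum(svd_importance(w, k) for w in block) for block in blocks])
    return scores / scores.sum()
```

By the Eckart-Young theorem the residual norm equals the root of the sum of the squared discarded singular values, so the score could also be read off the singular values directly without forming the low-rank product.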

As shown in Figure 2, the model is trained with a learnable gate selection mechanism via Gumbel-Softmax trick [20, 11], which could be used to select the block to be pruned. If the iiitalic_i-th block is dropped during training, we make its output identical to its input (an identity mapping)., which could be formulated as:

$$x_{i+1} = m_{i}\boldsymbol{\phi}_{i}(x_{i}) + (1-m_{i})x_{i}, \tag{6}$$

where $x_{i}$ and $\boldsymbol{\phi}_{i}(x_{i})$ represent the input and output of layer $\boldsymbol{\phi}_{i}$, respectively, and $m_{i}$ is the mask of the $i$-th block ($m_{i}=0$ yields the identity mapping). The gate scores are updated during training and are then used to select the blocks to be pruned. At the end of training, to obtain an $N$-layer diffusion transformer, we keep the $N$ layers with the highest gate scores. To further recover the performance after pruning, we continue to fine-tune the model without the mask selection process.
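The gating of Eq. (6) and the final top-$N$ selection can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the toy blocks are scalar functions, the gate scores are made up, and the way the relaxed Gumbel-Softmax sample is combined with candidate masks is an assumption.

```python
import numpy as np

def gated_forward(x, blocks, mask):
    """Eq. (6): x_{i+1} = m_i * phi_i(x_i) + (1 - m_i) * x_i."""
    for phi, m in zip(blocks, mask):
        x = m * phi(x) + (1 - m) * x
    return x

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample over categories via the Gumbel-Softmax trick."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    z = (logits + g) / tau
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def select_top_n(gate_scores, n):
    """Binary mask that keeps the n blocks with the highest gate scores."""
    mask = np.zeros(len(gate_scores), dtype=int)
    mask[np.argsort(gate_scores)[-n:]] = 1
    return mask

# Hypothetical toy blocks and gate scores.
blocks = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x - 3.0, lambda x: x * x]
gate_scores = np.array([0.4, 0.1, 0.3, 0.2])
mask = select_top_n(gate_scores, n=2)     # keeps blocks 0 and 2
out = gated_forward(1.0, blocks, mask)    # dropped blocks act as identity

# During training, a relaxed sample over candidate masks keeps the choice differentiable.
rng = np.random.default_rng(0)
candidates = np.array([[1, 0], [0, 1]])   # the two masks of a 1:2 local block
y = gumbel_softmax(np.log(np.array([0.7, 0.3])), tau=1.0, rng=rng)
soft_mask = y @ candidates
```

With the mask `[1, 0, 1, 0]`, the input passes through the first and third blocks only, which is exactly the network that remains after the dropped blocks are physically removed.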

4.4 Step Distillation

With the pruned model, the speed of a single denoising step can be significantly improved. However, the model still requires multiple denoising steps to obtain a high-quality action prediction, which incurs a non-negligible computation cost. To address this issue, we employ consistency distillation to train the model as a consistency model, which achieves performance comparable to the original model with fewer denoising steps.

As introduced in Section 3, consistency distillation aims to train the model $\pi_{\phi}$ to satisfy the consistency property across different noise levels, denoted as $\pi_{\phi}(\boldsymbol{a}_{t},\boldsymbol{o},\mathbf{g},\sigma_{t})=\pi_{\phi}(\boldsymbol{a}_{t^{\prime}},\boldsymbol{o},\mathbf{g},\sigma_{t^{\prime}})$. The distilled model is reparameterized as in EDM, which is formulated as:

$$\pi_{\phi}(\boldsymbol{a}_{t},\boldsymbol{o},\mathbf{g},\sigma_{t}) = c_{\text{skip}}(t)\,\boldsymbol{a}_{t} + c_{\text{out}}(t)\,f_{\phi}\big(c_{\text{in}}(t)\,\boldsymbol{a}_{t},\, c_{\text{noise}}(t)\big), \tag{7}$$

where $c_{\text{skip}}$, $c_{\text{in}}$, $c_{\text{out}}$, and $c_{\text{noise}}$ satisfy the boundary condition, and $f_{\phi}$ is the distilled model.
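To make the boundary condition concrete, the sketch below uses the coefficient choices that are common in the consistency-model literature, with $\sigma_{\text{data}}=0.5$ and a smallest noise level $\epsilon=0.002$. These formulas and constants are assumptions for illustration and are not taken from this paper; the conditioning on $\boldsymbol{o}$ and $\mathbf{g}$ is omitted for brevity.

```python
import math

SIGMA_DATA = 0.5   # assumed data standard deviation
EPS = 0.002        # assumed smallest noise level

def c_skip(sigma):
    return SIGMA_DATA**2 / ((sigma - EPS) ** 2 + SIGMA_DATA**2)

def c_out(sigma):
    return SIGMA_DATA * (sigma - EPS) / math.sqrt(SIGMA_DATA**2 + sigma**2)

def c_in(sigma):
    return 1.0 / math.sqrt(sigma**2 + SIGMA_DATA**2)

def c_noise(sigma):
    return 0.25 * math.log(sigma)

def policy(f, a, sigma):
    """Eq. (7): c_skip * a + c_out * f(c_in * a, c_noise)."""
    return c_skip(sigma) * a + c_out(sigma) * f(c_in(sigma) * a, c_noise(sigma))

# At the smallest noise level the network output is ignored and the input is returned.
dummy_net = lambda x, noise_cond: 123.0   # hypothetical stand-in for f_phi
print(policy(dummy_net, 0.7, EPS))
```

Since `c_skip(EPS)` is 1 and `c_out(EPS)` is 0, the parameterized policy reduces to the identity at the smallest noise level regardless of the network weights, which is what the boundary condition requires.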

As shown in Figure 2, the Student Model $f_{\phi}$ is initialized with the Teacher Model $f_{\psi}$ and then pruned by the learnable pruning technique introduced in Section 4.3. Given the sampled demonstration data $(\boldsymbol{o},\boldsymbol{a},\mathbf{g})$, we first add noise to obtain the noised action $\boldsymbol{a}_{t+k}$ at timestep $t+k$. The Teacher Model $f_{\psi}$ is then used to predict the noised action $\boldsymbol{a}_{t}$ at timestep $t$. Finally, the two noised actions $\boldsymbol{a}_{t+k}$ and $\boldsymbol{a}_{t}$ are fed into the Student Model $f_{\phi}$ and the Target Model $f_{\phi^{\star}}$, respectively, to calculate the consistency loss $\mathcal{L}_{\text{CD}}$ as follows:

$$\mathcal{L}_{\text{CD}} = \mathbb{E}\left[\left\|f_{\phi}(\boldsymbol{a}_{t+k},\boldsymbol{o},\mathbf{g}) - f_{\phi^{\star}}(\boldsymbol{a}_{t},\boldsymbol{o},\mathbf{g})\right\|_{2}^{2}\right], \tag{8}$$

where $\|\cdot\|_{2}$ is the $\ell_{2}$ norm. The Target Model $f_{\phi^{\star}}$ is updated with the exponential moving average (EMA) of the parameters of $f_{\phi}$, defined as $f_{\phi^{\star}}\leftarrow\texttt{sg}(\mu f_{\phi^{\star}}+(1-\mu)f_{\phi})$, where $\texttt{sg}(\cdot)$ denotes the stopgrad operation and $\mu$ satisfies $0\leq\mu<1$. Both the Student Model and the Target Model are initialized with the Teacher Model.
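The loss of Eq. (8) and the EMA update of the Target Model can be sketched on a toy linear denoiser as below. This is only an illustration under simplifying assumptions: the teacher's solver step is replaced by a stand-in array, the student's gradient step is faked with a constant shift, and all names are hypothetical.

```python
import numpy as np

def consistency_loss(student_out, target_out):
    """Squared l2 distance of Eq. (8), averaged over the batch."""
    diff = student_out - target_out
    return float(np.mean(np.sum(diff**2, axis=-1)))

def ema_update(target, student, mu):
    """theta_target <- mu * theta_target + (1 - mu) * theta_student."""
    return {name: mu * target[name] + (1.0 - mu) * student[name] for name in target}

rng = np.random.default_rng(0)
student = {"W": rng.standard_normal((2, 2))}
target = {name: w.copy() for name, w in student.items()}  # same initialization
w0 = student["W"].copy()

a_tk = rng.standard_normal((8, 2))   # stand-in for the noised actions a_{t+k}
a_t = 0.9 * a_tk                     # stand-in for the teacher's prediction a_t

# Student sees a_{t+k}; the target (no gradient) sees a_t.
loss = consistency_loss(a_tk @ student["W"], a_t @ target["W"])

student["W"] = student["W"] - 0.1    # placeholder for an optimizer step on the loss
target = ema_update(target, student, mu=0.95)
print(loss)
```

With $\mu=0.95$, the Target Model moves only 5% of the way toward the updated Student Model, which keeps the regression target of Eq. (8) slowly varying.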

5 Experiments

In this section, we first introduce the experimental settings, including the baselines, benchmarks, and evaluation metrics, in Section 5.1. We then describe the baselines used in our experiments in more detail, as well as the implementation details, in Section 5.2. Subsequently, we present the main results and the analysis of our experiments in Section 5.3.

5.1 Benchmarks and Evaluation Metrics

We evaluate our method on the following benchmarks:

  • Push-T was first introduced in IBC [12] and is used to evaluate the performance of Diffusion Policies. The task is designed to test the embodied agent’s ability to manipulate objects with a fixed end-effector. The agent is required to push a T-shaped block into a target goal zone, which is marked by green lines on a table. The task is varied by changing the initial positions of the block and the end-effector. It provides two types of observations: RGB images and keypoint-based states. In the experiments, we use both types of observations to evaluate our method, and we follow the evaluation protocol adopted in Diffusion Policy [8] to measure the success rate of the manipulation task.

  • CALVIN [30] is a simulation benchmark for measuring the performance of long-horizon language-conditioned tasks. The benchmark dataset is split into four manipulation environments: A, B, C, and D. The environments share a similar structure (a table with objects on it), but the objects and the goals are not always the same. The agent is asked to follow language instructions to manipulate the objects on the table and achieve the goal. Each environment contains 6 hours of human-teleoperated recordings, of which only 1% is annotated with language instructions. We use the Average Rollout Length as the main evaluation metric in the experiments.

  • LIBERO [28] was developed for lifelong robotic decision making, with the aim of building generalist agents that can perform a wide range of tasks. The benchmark comprises 130 tasks across 4 suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-100. The first three suites are designed to test the agent’s ability to disentangle the transfer of declarative and procedural knowledge, while LIBERO-100 is a suite of 100 tasks with entangled knowledge transfer.
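For reference, the Average Rollout Length used for CALVIN can be computed as sketched below, assuming each evaluation chain consists of 5 consecutive instructions and stops at the first failure. The per-chain counts here are hypothetical and serve only to show the relation between the per-position success rates and the average length.

```python
def average_rollout_length(completed_per_chain):
    """Mean number of instructions completed in a row per chain."""
    return sum(completed_per_chain) / len(completed_per_chain)

def success_rate_at(completed_per_chain, k):
    """Fraction of chains that completed at least k instructions in a row."""
    return sum(c >= k for c in completed_per_chain) / len(completed_per_chain)

# Hypothetical results for five chains of up to 5 instructions each.
chains = [5, 3, 0, 5, 2]
avg_len = average_rollout_length(chains)
rates = [success_rate_at(chains, k) for k in range(1, 6)]

# The average length equals the sum of the per-position success rates.
print(avg_len, rates, sum(rates))
```

This identity explains why a small drop in the success rate of early instructions lowers the success rates of all later positions and hence the average length.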

5.2 Implementation Details

Base Models. Throughout this work, we have referred to both DiffusionPolicy Transformer and MDT-V because of their wide use in imitation learning, especially in object manipulation tasks. Since our purpose is to compress the model to make it more efficient and faster on mobile devices, we choose these two models as our base models. DiffusionPolicy Transformer is a transformer-based policy network that only supports image input. The model consists of a diffusion transformer and a visual encoder.

MDT [36] is a multi-modal policy network that integrates the pre-trained multi-modal feature extractor Voltron [21] and has achieved good results on the CALVIN dataset. We also implement MoDE, an MoE-based policy network that achieves state-of-the-art performance on the CALVIN and LIBERO benchmarks. In the experiments, we consider compressing the widely used Diffusion Policies, namely Diffusion-Policy-T [8] and MDT [36].

Implementation Details. Our implementation is based on PyTorch. We conducted training on NVIDIA RTX 3090 and H800 GPUs. We then converted the GPU-trained model to the Core ML model format (mlpackage, based on Apple’s ml-stable-diffusion) and measured latency with Xcode Instruments on an iPhone 13 (A15 Bionic, iOS 18.3.1). For network pruning, we adopt the local block pruning scheme from TinyFusion [10] to build local blocks with an $N$:$M$ scheme. In this $N$:$M$ scheme, each group of $M$ consecutive layers (a ‘block’) is pruned down to $N$ layers. For instance, when we keep $N=3$ layers from a local block with $M=4$ layers in total, we have $\binom{4}{3}=4$ choices, corresponding to $\mathcal{M}=[[1,1,1,0],[1,1,0,1],[1,0,1,1],[0,1,1,1]]$. Our consistency distillation is applied to the model’s $x_{0}$ prediction (predicting the denoised action), following common practice, and we start the EMA decay rate at 0.95 and gradually increase it to 0.999 over the course of training to stabilize the Target Model updates. We use the DDIM solver [40] for distillation, with a skip interval of 10 steps (i.e., distill every 10th diffusion step). We keep most hyper-parameters consistent with the original implementations of the base models. For DP-T, the input is a hybrid of RGB images and low-dimensional states; the image size is $84\times 84$ and the observation sequence length is 2. The diffusion transformer in DP-T has 8 layers, each with a hidden size of 256 and 4 attention heads.
For MDT, the input is multi-modal: two RGB images from different views as the observation and a language instruction as the goal. We adopt AdamW as the optimizer with a learning rate of $1\times 10^{-4}$ and a batch size of 64. We train the model for 30 epochs on the CALVIN datasets, and the Student Model $f_{\phi}$ is pruned based on the gate scores at the 20th epoch.
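The candidate masks of an $N$:$M$ local block can be enumerated with a few lines of standard-library Python. The sketch below is illustrative only; the helper name `local_block_masks` is hypothetical and is not taken from the TinyFusion code base.

```python
from itertools import combinations

def local_block_masks(n_keep, m_total):
    """All binary masks that keep n_keep out of m_total consecutive layers."""
    masks = []
    for kept in combinations(range(m_total), n_keep):
        masks.append([1 if i in kept else 0 for i in range(m_total)])
    return masks

masks = local_block_masks(3, 4)
for mask in masks:
    print(mask)
```

For the 3:4 scheme this yields the four masks listed above; a 1:2 scheme yields only two candidates per local block, which is why a larger $M$ offers more pruning choices at the same depth.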

5.3 Evaluation on DiffusionPolicy Transformer

In this section, we conduct experiments based on DP-T. As reported in Table 2, the pruned model achieves a success rate comparable to that of the original model, with a smaller model size and faster inference speed.

| Method | Depth | Param (M) | NFE | GFLOPs | Inference Latency (ms) | Success Rate |
|---|---|---|---|---|---|---|
| DP-T | 8 | 8.97 | 100 | 4.39 | 90.6 | 0.772±0.039 |
| DP-T$^{\star}$ | | | | | | 0.754±0.023 |
| DP-T-D6/6-8 | 6 | 6.87 | 4 | 0.134 | 4.79 | 0.752±0.019 |
| DP-T-D6/4-4 | | | | | | 0.732±0.034 |
| DP-T-D4/4-8 | 4 | 4.76 | 4 | 0.091 | 2.72 | 0.747±0.010 |
| DP-T-D4/2-4 | | | | | | 0.732±0.013 |
| DP-T-D4/1-2 | | | | | | 0.757±0.018 |
| DP-T-D2/2-8 | 2 | 2.65 | 4 | 0.049 | 0.97 | 0.730±0.022 |
| DP-T-D2/1-4 | | | | | | 0.724±0.030 |

Table 2: Performance comparison of LightDP compressed models with varying depth and inference steps. All models are trained on the same Push-T dataset for 3K epochs. DP-T$^{\star}$ refers to the baseline model evaluated by us. DP-T-D$L$/$N$-$M$ indicates that $L$ blocks are retained during the pruning process, with a local block scheme of $N$:$M$. NFE is short for the number of score function evaluations, i.e., inference steps. Blank cells share the value of the row above. Detailed experiments on the Robomimic dataset are provided in Section H.

Quantitative Results. The vanilla DP-T model contains 8 transformer blocks with alternating Multi-head Cross-Attention layers and Feed-Forward layers. The model is first trained to obtain an optimal pruning mask with the network weights updated jointly; the model is then pruned and trained via a consistency distillation loss. In our setting, we compress the model into 2, 4, and 6 layers. The results show that, with our method, the pruned model achieves a success rate comparable to that of the vanilla model. As discussed above, when $N$:$M$ = 1:2, every two successive blocks are grouped and one of them is pruned. At the same depth, we observe that performance tends to drop slightly when the capacity $M$ of the block is reduced, since a large $M$ provides more diverse pruning choices. Besides, from the perspective of the depth of the pruned model, we find that deeper models remain better than shallower ones, which is consistent with intuition, but the performance gap is not significant. In particular, a 2-layer diffusion transformer achieves a success rate of 0.724, which is quite close to the 0.754 of the original model. At the same time, the latency of the pruned model is greatly reduced compared to the DP-T model. With the number of inference steps cut down to 4 and the depth limited to 2, we attain an approximately 93 times speed improvement (90.6 ms vs. 0.97 ms), and the FLOPs are decreased by 89.6%. These results indicate that our proposed LightDP successfully compresses the model while preserving the original model’s performance.
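The latency speedup quoted above follows directly from the Table 2 entries for DP-T and the 2-layer model, as the short calculation below shows. The parameter reduction is an additional figure derived from the same two rows rather than one stated in the text.

```python
# Values copied from Table 2 (DP-T vs. the 2-layer, 4-step model).
baseline = {"params_m": 8.97, "latency_ms": 90.6}
pruned = {"params_m": 2.65, "latency_ms": 0.97}

speedup = baseline["latency_ms"] / pruned["latency_ms"]
param_reduction = 1.0 - pruned["params_m"] / baseline["params_m"]

print(f"{speedup:.1f}x faster, {param_reduction:.1%} fewer parameters")
```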

| Training → Test | Method | Param (M) | GFLOPs | Latency (ms) | Instructions in a Row (1000 chains): 1 | 2 | 3 | 4 | 5 | Average Length |
|---|---|---|---|---|---|---|---|---|---|---|
| ABCD→D | MDT-V | 22.52 | 1.21 | 22.25 | 98.6% | 95.8% | 91.6% | 86.2% | 80.1% | 4.52±(0.02) |
| | MDT-V/E3-D3 | 17.50 | 0.36 | 8.7 | 98.3% | 94.6% | 91.5% | 85.8% | 79.6% | 4.50±(0.06) |
| | MDT-V/E2-D2 | 12.47 | 0.25 | 4.1 | 95.1% | 87.9% | 80.5% | 71.9% | 64.1% | 3.94±(0.08) |
| | MDT-V/E1-D1 | 7.45 | 0.13 | 3.39 | 92.3% | 85.4% | 77.2% | 65.9% | 61.4% | 3.44±(0.05) |
| D→D | MDT-V | 22.52 | 1.21 | 22.25 | 93.7% | 84.5% | 74.1% | 64.4% | 55.6% | 3.72±(0.06) |
| | MDT-V/E3-D3 | 17.50 | 0.36 | 8.7 | 92.4% | 82.1% | 71.2% | 60.5% | 52.2% | 3.65±(0.05) |
| | MDT-V/E2-D2 | 12.47 | 0.25 | 4.1 | 87.1% | 71.2% | 58.7% | 48.3% | 37.9% | 3.00±(0.03) |
| | MDT-V/E1-D1 | 7.45 | 0.13 | 3.39 | 79.9% | 63.2% | 47.8% | 35.0% | 23.1% | 2.48±(0.07) |

Table 3: Performance comparison of LightDP compressed MDT-V models with different depth and inference steps. All models are trained on CALVIN D or CALVIN ABCD for 30 epochs, and then tested on the CALVIN D dataset.

5.4 Evaluation on MDT-V

| Method / Task suite | Spatial | Object | Goal | Long | 90 | Average |
|---|---|---|---|---|---|---|
| MDT-V | 78.5±1.5 | 87.5±0.9 | 73.5±2.0 | 64.8±1.5 | 67.2±1.1 | 74.3±9.1 |
| MDT-V/E3-D3 | 77.9±1.9 | 86.5±2.1 | 71.5±3.1 | 63.2±2.3 | 66.8±0.8 | 73.2±9.9 |

Table 4: Performance comparison of the LightDP compressed MDT-V/E3-D3 model on the LIBERO benchmark. For each task, the achieved score is presented along with its variability (mean ± standard deviation).

In this section, we conduct experiments based on MDT-V, as reported in Table 3. Since MDT-V consists of a 4-layer TransformerEncoder and a 4-layer TransformerDecoder, we keep the number of encoder layers equal to the number of decoder layers and therefore compress the model into 2, 4, and 6 layers, as with DP-T. Compared with the original model, the 6-layer model (MDT-V/E3-D3) achieves comparable performance, while the 4-layer model (MDT-V/E2-D2) shows a significant performance drop and the 2-layer model (MDT-V/E1-D1) performs worst. These results suggest that the MDT-V model is more compact than the DP-T model. In addition, as detailed in Table 3, the ABCD→D results reveal that the full MDT-V model attains very high success rates across the chain (98.6% on the first instruction, gradually decreasing to 80.1% on the fifth), with an average chain length of 4.52. In contrast, the more heavily pruned variants show a noticeable decline in performance: MDT-V/E1-D1, for instance, achieves only 92.3% on the first instruction and drops to 61.4% on the fifth, with a reduced average chain length of 3.44. Similarly, in the D→D scenario, all models register lower performance, with the most compressed model suffering a steep decline in both success rate and average chain length. These observations underscore the trade-off between model compactness and performance: reducing the depth beyond MDT-V/E3-D3 substantially impacts the ability to sustain performance over extended instruction chains.

Besides, we also conduct experiments on the LIBERO benchmark, as shown in Table 4. Comparing MDT-V and MDT-V/E3-D3 across the LIBERO task suites, we find that the pruned model achieves performance comparable to that of the original model. On average, MDT-V shows a marginally better overall result with less variability (74.3±9.1 vs. 73.2±9.9), while the inference latency and model size of the pruned model are reduced significantly, which is beneficial for deployment on mobile devices.

5.5 Ablation Study

| Method | Param (M) | GFLOPs | Latency (ms) | Average Length |
|---|---|---|---|---|
| MDT-V | 22.52 | 1.21 | 22.25 | 3.72±(0.06) |
| MDT-V w/ prune | 17.50 | 0.91 | 18.87 | 3.70±(0.08) |
| MDT-V w/ CD | 22.52 | 0.48 | 11.34 | 3.69±(0.02) |
| MDT-V/E3-D3 | 17.50 | 0.36 | 8.70 | 3.65±(0.05) |

Table 5: Ablation study on the effect of the proposed learnable pruning and step distillation based on MDT-V. The performance is evaluated on the CALVIN D→D task suite. w/ prune denotes the learnable pruning technique, and w/ CD denotes step distillation. MDT-V/E3-D3 combines learnable pruning and step distillation.

In this section, we study the effectiveness of each component of the proposed method by removing either consistency distillation or learnable pruning. As shown in Table 5, when only learnable pruning is applied (MDT-V w/ prune), we observe a reduction in the number of parameters and GFLOPs, along with slightly reduced latency (from 22.25 ms to 18.87 ms), while preserving similar behavior in the generated actions. Likewise, employing only consistency distillation (MDT-V w/ CD) considerably reduces the GFLOPs and latency (from 22.25 ms to 11.34 ms) with only a minimal reduction in the average rollout length. Notably, the combined approach (MDT-V/E3-D3) delivers the best trade-off, achieving the lowest latency and computational cost without significant degradation in performance.

5.6 Qualitative Results

Figure 3: Qualitative comparison of the pruned models and original models. We observe that the pruned models can mimic the behaviors of the original models, which demonstrates the step distillation process is capable of transferring the knowledge from the original model to the pruned model.

Figure 3 displays rollouts of the pruned DP-T and MDT-V models on the Push-T and LIBERO tasks. In the Push-T task, the pruned model successfully pushes the T-shaped block into the goal zone without any failure during manipulation. In the LIBERO task suite, which requires the agent to follow instructions to manipulate the objects on the table to achieve the goal, the pruned model also completes the task successfully. By applying LightDP to the original DP-T and MDT-V models, we obtain lightweight policy models, and here we present a visual comparison between the pruned and original models using rollouts in the Push-T task and the CALVIN tasks. In Figure 3, the upper two rows show the pruned model DP-T-D2/2-8 and DP-T on the Push-T task, and the bottom two rows show the pruned model MDT-V/E3-D3 and the original model MDT-V. We observe that the pruned models can mimic the behaviors of the original models, which demonstrates that the step distillation process is capable of transferring knowledge from the original model to the pruned model. In addition to the experiments in simulation environments, we also conduct real-world experiments on robotic arms, as presented in Section I. The results show that the pruned model achieves a success rate comparable to that of the original model, which demonstrates the effectiveness of our method in real-world scenarios.

6 Conclusion and Limitation

In this paper, we introduced the LightDP framework, which aims at accelerating Diffusion Policies on mobile devices. Specifically, we analyzed the architectures of the widely used DP-T and MDT-V baselines and observed that the iterative denoising process and the high cost of network inference hinder the real-time application of these models on mobile robots. To address this issue, we employed two strategies: 1) adopting a lightweight network architecture via a learnable pruning method, and 2) reducing the number of inference steps to speed up the denoising process. We have benchmarked the proposed LightDP framework on the Push-T, Robomimic, CALVIN, and LIBERO datasets, demonstrating a significant improvement in terms of inference speed and memory consumption.

Limitations. In this paper, we mainly focus on Diffusion Policies, while the newly proposed VLA models are not well explored in this work. We leave this for future work.

References

  • Black et al. [2024] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. $\pi_{0}$: A vision-language-action flow model for general robot control, 2024.
  • Black et al. [2025] Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025.
  • Blumensath and Davies [2008] Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14:629–654, 2008.
  • Brohan et al. [2023] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael S. Ryoo, Grecia Salazar, Pannag R. Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong T. Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: robotics transformer for real-world control at scale. In Robotics: Science and Systems, 2023.
  • Castells et al. [2024] Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, and Tae-Ho Kim. Edgefusion: On-device text-to-image generation. arXiv preprint arXiv:2404.11925, 2024.
  • Chen et al. [2023] Zhenghao Chen, Lucas Relic, Roberto Azevedo, Yang Zhang, Markus Gross, Dong Xu, Luping Zhou, and Christopher Schroers. Neural video compression with spatio-temporal cross-covariance transformers. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8543–8551, 2023.
  • Chen et al. [2024] Zhenghao Chen, Luping Zhou, Zhihao Hu, and Dong Xu. Group-aware parameter-efficient updating for content-adaptive neural video compression. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11022–11031, 2024.
  • Chi et al. [2023] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2023.
  • Fang et al. [2023] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. In CVPR, 2023.
  • Fang et al. [2025a] Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. In CVPR, pages 18144–18154, 2025a.
  • Fang et al. [2025b] Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. Maskllm: Learnable semi-structured sparsity for large language models. Advances in Neural Information Processing Systems, 37:7736–7758, 2025b.
  • Florence et al. [2022] Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In Conference on robot learning, pages 158–168. PMLR, 2022.
  • Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
  • Frantar and Alistarh [2023] Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.
  • Fu et al. [2024] Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In Conference on Robot Learning, 2024.
  • Hadji et al. [2025] Isma Hadji, Mehdi Noroozi, Victor Escorcia, Anestis Zaganidis, Brais Martinez, and Georgios Tzimiropoulos. Edge-sd-sr: Low latency and parameter efficient on-device super-resolution with stable diffusion via bidirectional conditioning. In CVPR, pages 12789–12798, 2025.
  • Han et al. [2015] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
  • Hinton et al. [2014] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Workshop, 2014.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Karamcheti et al. [2023] Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems, 2023.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. pages 26565–26577, 2022.
  • Kim et al. [2024a] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-sdm: A lightweight, fast, and cheap version of stable diffusion. In ECCV, 2024a.
  • Kim et al. [2024b] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, 2024b.
  • Li et al. [2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
  • Li et al. [2024] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In ICLR, 2024.
  • Li et al. [2023] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. In NeurIPS, 2023.
  • Liu et al. [2023] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
  • Luo et al. [2023] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
  • Mees et al. [2022] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
  • Men et al. [2024] Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024.
  • Meng et al. [2023] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In CVPR, 2023.
  • Molchanov et al. [2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
  • Mozer and Smolensky [1988] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NeurIPS, 1988.
  • Reuss et al. [2023] Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies, 2023.
  • Reuss et al. [2024] Moritz Reuss, Ömer Erdinç Yagmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. In Robotics: Science and Systems, 2024.
  • Reuss et al. [2025] Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. In ICLR, 2025.
  • Salimans and Ho [2022] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.
  • Shukor et al. [2025] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Song and Dhariwal [2024] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In ICLR, 2024.
  • Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Int. Conf. Mach. Learn., pages 32211–32252. PMLR, 2023.
  • Team et al. [2024] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024.
  • Wang et al. [2021] Huan Wang, Can Qin, Yulun Zhang, and Yun Fu. Neural pruning via growing regularization. In ICLR, 2021.
  • Wu et al. [2024] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In ICLR, 2024.
  • Xie et al. [2025] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, and Song Han. SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. In Int. Conf. Mach. Learn., 2025.
  • Xu et al. [2024] Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image generation via diffusion gans. In CVPR, 2024.
  • Yue et al. [2024] Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-VLA: Dynamic inference of multimodal large language models for efficient robot execution. In NeurIPS, 2024.
  • Ze et al. [2024] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024.
  • Zhao et al. [2023a] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023a.
  • Zhao et al. [2023b] Yang Zhao, Yanwu Xu, Zhisheng Xiao, and Tingbo Hou. Mobilediffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567, 2023b.
  • Zitkovich et al. [2023] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski, Yao Lu, Sergey Levine, Lisa Lee, Tsang-Wei Edward Lee, Isabel Leal, Yuheng Kuang, Dmitry Kalashnikov, Ryan Julian, Nikhil J. Joshi, Alex Irpan, Brian Ichter, Jasmine Hsu, Alexander Herzog, Karol Hausman, Keerthana Gopalakrishnan, Chuyuan Fu, Pete Florence, Chelsea Finn, Kumar Avinava Dubey, Danny Driess, Tianli Ding, Krzysztof Marcin Choromanski, Xi Chen, Yevgen Chebotar, Justice Carbajal, Noah Brown, Anthony Brohan, Montserrat Gonzalez Arenas, and Kehang Han. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

Supplementary Material

G Supplementary Material

The supplement consists of the following sections:

  • Section H presents extensive experimental results on the Robomimic dataset based on the Diffusion Policy Transformer (DP-T) model.

  • Section I describes the real-world experiments based on DP-T and MoDE models, including the experimental setup and results.

We provide a webpage that visualizes the results of the pruned and original models at https://weleen.github.io/LightDP/.

H Extensive Experiments based on DP-T

| Models | Lift-ph | Can-ph | Square-ph | Transport-ph | Push-T | ToolHang-ph |
| DP-T | 1.000 | 1.000 | 1.000 | 0.955 | 0.772 | 0.713 |
| DP-T-D6/6-8 | 1.000 | 1.000 | 1.000 | 0.950 | 0.752 | 0.707 |

| Models | Lift-mh | Can-mh | Square-mh | Transport-mh | Kitchen | Block Push |
| DP-T | 1.000 | 1.000 | 0.940 | 0.727 | 0.574 | 1.000 |
| DP-T-D6/6-8 | 1.000 | 1.000 | 0.955 | 0.773 | 0.571 | 1.000 |

Table A1: The extensive evaluation on DP-T tasks (Push-T and Robomimic), showing the success rates of the original model (DP-T) and the pruned model (DP-T-D6/6-8). The pruned model maintains performance across most tasks, with only minor drops in success rates.

In Table A1, we report the success rates on all tasks (i.e., Push-T and Robomimic) from the Diffusion Policy [8] work. The pruned model DP-T-D6/6-8 preserves the baseline's performance on most tasks: the largest drop is 0.020 (on Push-T), and the pruned model slightly improves on Square-mh and Transport-mh.
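The comparison can be checked directly from the numbers in Table A1. The following is a minimal sketch, not part of the released code; the dictionary layout and the helper name `success_rate_deltas` are illustrative assumptions, while the values are copied from the table.

```python
# Success rates copied from Table A1 as (DP-T, DP-T-D6/6-8).
# The dictionary layout and helper name are illustrative assumptions.
TABLE_A1 = {
    "Lift-ph": (1.000, 1.000),
    "Can-ph": (1.000, 1.000),
    "Square-ph": (1.000, 1.000),
    "Transport-ph": (0.955, 0.950),
    "Push-T": (0.772, 0.752),
    "ToolHang-ph": (0.713, 0.707),
    "Lift-mh": (1.000, 1.000),
    "Can-mh": (1.000, 1.000),
    "Square-mh": (0.940, 0.955),
    "Transport-mh": (0.727, 0.773),
    "Kitchen": (0.574, 0.571),
    "Block Push": (1.000, 1.000),
}


def success_rate_deltas(table):
    """Return pruned-minus-original success rate per task, rounded to 3 decimals."""
    return {task: round(pruned - base, 3) for task, (base, pruned) in table.items()}


deltas = success_rate_deltas(TABLE_A1)
worst_task = min(deltas, key=deltas.get)                # task with the largest drop
improved = [task for task, d in deltas.items() if d > 0]  # tasks where pruning helped

print(f"largest drop: {worst_task} ({deltas[worst_task]:+.3f})")
print(f"improved tasks: {improved}")
```

Negative values indicate a drop after pruning, and positive values indicate that the pruned model scored higher than the original.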

I Real-world Experiments

Figure A1: Real-world experiments for DP-T (first column) and MoDE (other columns). Task descriptions are shown below each image. This figure contains an animated video. For optimal viewing, please zoom in and use a professional PDF reader.

| Models | Task 1 |
| DP-T | 0.80 |
| DP-T-D6/6-8 | 0.75 |

| Models | Task 2 | Task 3 | Task 4 |
| MoDE | 0.80 | 0.55 | 0.30 |
| MoDE-10/10-12 | 0.75 | 0.50 | 0.30 |

Table A2: Real-world evaluation results based on DP-T (on an Inovo robot) and MoDE (on a Lebai robot). The success rates are shown for each task. The pruned models (DP-T-D6/6-8 and MoDE-10/10-12) maintain performance across most tasks, with only minor drops in success rates.

Building on the two models DP-T and MoDE, we deploy LightDP on two robotic arms (an Inovo robot for DP-T and a Lebai robot for MoDE), executing each task 20 times. As shown in Figure A1 and Table A2, the pruned models achieve success rates comparable to those of the original models on these real-world tasks. Because most household users are reluctant to purchase advanced devices, we selected the most accessible and portable device (i.e., an iPhone) as the computing platform for our robotic development setup. We also evaluate our approach on a Jetson Orin NX (16 GB, JetPack 5.1.1), where the latency is 244.68 ms for DP-T and 37.69 ms for DP-T-D6/6-8.
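To put the Jetson Orin NX latencies in perspective, the short sketch below derives the speedup and the corresponding upper bound on the prediction rate. The helper names `speedup` and `max_rate_hz` are hypothetical, and the rate assumes that the reported latency is the only per-prediction cost, which a full control loop (sensing, communication, actuation) would not satisfy.

```python
# Latencies on the Jetson Orin NX, as reported in the text (milliseconds).
DP_T_LATENCY_MS = 244.68
PRUNED_LATENCY_MS = 37.69  # DP-T-D6/6-8


def speedup(base_ms, pruned_ms):
    """Ratio of the original latency to the pruned-model latency."""
    return base_ms / pruned_ms


def max_rate_hz(latency_ms):
    """Upper bound on predictions per second.

    Assumption: the reported latency is the only cost per prediction.
    """
    return 1000.0 / latency_ms


print(f"speedup: {speedup(DP_T_LATENCY_MS, PRUNED_LATENCY_MS):.2f}x")
print(f"DP-T rate bound: {max_rate_hz(DP_T_LATENCY_MS):.2f} Hz")
print(f"DP-T-D6/6-8 rate bound: {max_rate_hz(PRUNED_LATENCY_MS):.2f} Hz")
```

Under these assumptions, the pruned model is roughly 6.5 times faster than DP-T on this device.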