
introduce complex structures [49, 50, 67, 6, 47] or multiple
hybrid modules [47, 52], which makes them hard to optimize
for practical applications. So far, little work has been done to
explore attention-based counterparts to IRB, and this inspires
us to think: Can we build a lightweight IRB-like infrastruc-
ture for attention-based models with only basic operators?
Based on the above motivation, we rethink the efficient In-
verted Residual Block in MobileNetv2 [54] and the effective
MHSA/FFN modules in the Transformer [66] from a unified
perspective, expecting to integrate both advantages at the
infrastructure design level. As shown in Fig. 2-Left, while
working to bring the one-residual IRB with its inductive bias
into the attention model, we observe that the two underlying
submodules (i.e., FFN and MHSA) in the two-residual
Transformer share a structure similar to IRB. Thus, we
inductively abstract a one-residual Meta Mobile Block
(MMB, c.f., Sec. 2.2) that takes an expansion ratio λ and an
efficient operator F as parametric arguments to instantiate
different modules, i.e., IRB, MHSA, and FFN. We argue that
MMB reveals the consistent underlying structure of these
three modules, and it can be regarded as an improved
lightweight concentrated aggregate of the Transformer.
Furthermore, we deduce a simple yet effective Inverted
Residual Mobile Block (iRMB) that contains only a
fundamental Depth-Wise Convolution and our improved
EW-MHSA (c.f., Sec. 2.3), and we build a ResNet-like
4-phase Efficient MOdel (EMO) with only iRMBs (c.f.,
Sec. 2.4). Surprisingly, our method outperforms SoTA
lightweight attention-based models even without complex
structures, as shown in Fig. 1. In summary, this work
follows simple design criteria while gradually producing an
efficient attention-based lightweight model.
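To make the abstraction concrete, the following minimal NumPy sketch illustrates how a one-residual Meta Mobile Block could be parameterized by an expansion ratio λ and an efficient operator F. The function names, toy dimensions, and the choice of F here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def meta_mobile_block(x, w_up, w_down, op):
    """One-residual Meta Mobile Block (MMB) sketch.
    x:      (tokens, C) input features
    w_up:   (C, lambda*C) expansion projection (ratio lambda)
    w_down: (lambda*C, C) projection back to C channels
    op:     efficient operator F applied in the expanded space
    """
    h = x @ w_up    # 1) expand channels by ratio lambda
    h = op(h)       # 2) apply efficient operator F
    h = h @ w_down  # 3) shrink back to the input width
    return x + h    # 4) single residual connection

# Instantiating F as an element-wise nonlinearity recovers an FFN-like
# module; swapping in a depth-wise convolution or attention would recover
# IRB- or MHSA-like blocks under the same template.
rng = np.random.default_rng(0)
C, lam = 8, 4
x = rng.standard_normal((16, C))
w_up = rng.standard_normal((C, lam * C)) * 0.1
w_down = rng.standard_normal((lam * C, C)) * 0.1
y = meta_mobile_block(x, w_up, w_down, op=lambda h: np.maximum(h, 0.0))
print(y.shape)  # (16, 8)
```

The point of the template is that only steps 1 and 3 are fixed; the identity of F and the value of λ are the free parameters that distinguish the instantiated modules.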
Our contributions are four-fold: 1) We extend the CNN-
based IRB to the two-residual Transformer and abstract a
one-residual Meta Mobile Block (MMB) for lightweight
model design. This meta paradigm can describe current
efficient modules and is expected to guide the design of
novel efficient modules. 2) Based
on the inductive MMB, we deduce a simple yet effective modern
Inverted Residual Mobile Block (iRMB) and build a ResNet-
like Efficient MOdel (EMO) with only iRMBs for down-
stream applications. In detail, iRMB consists only of a naive
DW-Conv and our improved EW-MHSA, which model short-
and long-distance dependencies, respectively. 3) We provide de-
tailed studies of our method and give some experimental find-
ings on building attention-based lightweight models, hoping
our study will inspire the research community to design power-
ful and efficient models. 4) Even without introducing com-
plex structures, our method still achieves very competitive
results compared with concurrent attention-based methods on
several benchmarks, e.g., our EMO-1M/2M/5M reach 71.5,
75.1, and 78.4 Top-1 accuracy, surpassing current SoTA
CNN-/Transformer-based models. Besides, SSDLite armed
with EMO-1M/2M/5M obtains 22.0/25.2/27.9 mAP with only
2.3M/3.3M/6.0M parameters and 0.6G/0.9G/1.8G FLOPs,
exceeding the recent MobileViTv2 [50] by +0.8↑/+0.6↑/+0.1↑
with FLOPs decreased by -33%↓/-50%↓/-62%↓; DeepLabv3
armed with EMO-1M/2M/5M obtains 33.5/35.3/37.98 mIoU
with only 5.6M/6.9M/10.3M parameters and 2.4G/3.5G/5.8G
FLOPs, surpassing MobileViTv2 by +1.6↑/+0.6↑/+0.8↑ with
much lower FLOPs.
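As a rough illustration of how iRMB pairs local and global mixing, the sketch below composes a 1-D depth-wise convolution (short-distance dependency) with a single-head self-attention (long-distance dependency) over token features inside one residual branch. This is a simplified stand-in under our own assumptions; it is not the paper's EW-MHSA, and the function names and toy attention (no learned projections) are illustrative.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Per-channel (depth-wise) 1-D convolution with same padding.
    x: (L, C) tokens; kernels: (K, C), one kernel per channel."""
    L, C = x.shape
    K = kernels.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.sum(xp[i:i + K] * kernels, axis=0) for i in range(L)])

def self_attention(x):
    """Single-head self-attention without learned projections (toy)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ x

def irmb_like(x, kernels):
    """Residual block mixing long-range (attention) then local (DW-Conv)."""
    h = self_attention(x)             # long-distance dependency
    h = depthwise_conv1d(h, kernels)  # short-distance dependency
    return x + h
```

The design point this sketch captures is that both mixers are cheap, basic operators, so the whole block stays within the usability and uniformity criteria of Sec. 2.1.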
2. Methodology: Induction and Deduction
2.1. Criteria for General Efficient Model
When designing efficient visual models for mobile appli-
cations, we advocate, subjectively and empirically, the
following criteria that an efficient model should satisfy as
far as possible: ➀ Usability. Simple implementation that
avoids complex operators and is easy to optimize for
applications. ➁ Uniformity. As few core modules as
possible to reduce model complexity and accelerate
deployment. ➂ Effectiveness. Good performance for
classification and dense prediction. ➃ Efficiency. Fewer
parameters and computations with an acceptable accuracy
trade-off. We summarize current efficient models in Tab. 1:
1) The performance of the MobileNet series [20, 54, 67] is
now slightly lower, and its parameter counts slightly higher,
than those of counterparts. 2) The recent MobileViT series
[49, 50, 67] achieves notable performance, but suffers from
higher FLOPs and slightly complex modules. 3) EdgeNeXt
[47] and EdgeViT [52] obtain good results, but their basic
blocks also consist of elaborate modules. Comparably, the
design principle of our EMO follows the above criteria
without introducing complicated operations (c.f., Sec. 2.4),
yet it still obtains impressive results on multiple vision
tasks (c.f., Sec. 3).
Table 1: Criterion comparison for current efficient mod-
els. ➀: Usability; ➁: Uniformity; ➂: Effectiveness; ➃: Effi-
ciency. ✔: Satisfied. ✚: Partially satisfied. ✘: Unsatisfied.
Method vs. Criterion ➀ ➁ ➂ ➃
MobileNet Series [20, 54, 67] ✔ ✔ ✚ ✚
MobileViT Series [49, 50, 67] ✚ ✚ ✔ ✚
EdgeNeXt [47] ✚ ✘ ✔ ✔
EdgeViT [52] ✔ ✚ ✔ ✚
EMO (Ours) ✔ ✔ ✔ ✔
2.2. Meta Mobile Block
Motivation. 1) Recent Transformer-based works [74, 42,
12, 56, 39, 62, 63] are dedicated to improving spatial token
mixing under the MetaFormer [75] paradigm for high-
performance networks. The CNN-based Inverted Residual
Block (IRB) [54] is recognized as the infrastructure of
efficient models [54, 61], but little work has been done to
explore an attention-based counterpart. This inspires us to
build a lightweight IRB-like infrastructure for attention-based
models. 2) While working to bring the one-residual IRB
with its inductive bias into the atten-