
introduce complex structures [49, 50, 67, 6, 47] or multiple
hybrid modules [47, 52], which makes them hard to optimize
for practical applications. So far, little work has been done to
explore attention-based counterparts to IRB, and this inspires
us to think: Can we build a lightweight IRB-like infrastruc-
ture for attention-based models with only basic operators?
Based on the above motivation, we rethink the efficient In-
verted Residual Block in MobileNetv2 [54] and the effective
MHSA/FFN modules in the Transformer [66] from a unified
perspective, expecting to integrate both advantages at the
infrastructure design level. As shown in Fig. 2-Left, while
working to bring the one-residual IRB with its inductive bias
into the attention model, we observe that the two underlying
submodules (i.e., FFN and MHSA) in the two-residual
Transformer share a structure similar to IRB. Thus, we
inductively abstract a one-residual Meta Mobile Block
(MMB, c.f., Sec. 2.2) that takes an expansion ratio λ and an
efficient operator F as parametric arguments to instantiate
different modules, i.e., IRB, MHSA, and FFN. We argue that
MMB reveals the consistent underlying structure of these
three modules, and it can be regarded as an improved
lightweight concentrated aggregate of the Transformer.
Furthermore, we deduce a simple yet effective Inverted
Residual Mobile Block (iRMB) that contains only a
fundamental Depth-Wise Convolution and our improved
EW-MHSA (c.f., Sec. 2.3), and we build a ResNet-like
4-phase Efficient MOdel (EMO) with only iRMBs (c.f.,
Sec. 2.4). Surprisingly, our method outperforms SoTA
lightweight attention-based models even without complex
structures, as shown in Fig. 1. In summary, this work
follows simple design criteria while gradually producing an
efficient attention-based lightweight model.
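To make the abstraction concrete, the following minimal NumPy sketch illustrates how a one-residual Meta Mobile Block could be parameterized by an expansion ratio λ and an efficient operator F. The function names, toy dimensions, and the choice of F here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def meta_mobile_block(x, w_up, w_down, op):
    """One-residual Meta Mobile Block (MMB) sketch.
    x:      (tokens, C) input features
    w_up:   (C, lambda*C) expansion projection (ratio lambda)
    w_down: (lambda*C, C) projection back to C channels
    op:     efficient operator F applied in the expanded space
    """
    h = x @ w_up    # 1) expand channels by ratio lambda
    h = op(h)       # 2) apply efficient operator F
    h = h @ w_down  # 3) shrink back to the input width
    return x + h    # 4) single residual connection

# Instantiating F as an element-wise nonlinearity recovers an FFN-like
# module; swapping in a depth-wise convolution or attention would recover
# IRB- or MHSA-like blocks under the same template.
rng = np.random.default_rng(0)
C, lam = 8, 4
x = rng.standard_normal((16, C))
w_up = rng.standard_normal((C, lam * C)) * 0.1
w_down = rng.standard_normal((lam * C, C)) * 0.1
y = meta_mobile_block(x, w_up, w_down, op=lambda h: np.maximum(h, 0.0))
print(y.shape)  # (16, 8)
```

The point of the template is that only steps 1 and 3 are fixed; the identity of F and the value of λ are the free parameters that distinguish the instantiated modules.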
Our contributions are four-fold: 1) We extend the CNN-
based IRB to the two-residual Transformer and abstract a
one-residual Meta Mobile Block (MMB) for lightweight
model design. This meta paradigm can describe current
efficient modules and is expected to guide the design of
novel efficient modules. 2) Based
on the inductive MMB, we deduce a simple yet effective modern
Inverted Residual Mobile Block (iRMB) and build a ResNet-
like Efficient MOdel (EMO) with only iRMBs for down-
stream applications. In detail, iRMB consists only of a naive
DW-Conv and our improved EW-MHSA, which model short-
and long-distance dependencies, respectively. 3) We provide de-
tailed studies of our method and give some experimental find-
ings on building attention-based lightweight models, hoping
our study will inspire the research community to design power-
ful and efficient models. 4) Even without introducing com-
plex structures, our method still achieves very competitive
results compared with concurrent attention-based methods on
several benchmarks, e.g., our EMO-1M/2M/5M reach 71.5,
75.1, and 78.4 Top-1 accuracy, surpassing current SoTA
CNN-/Transformer-based models. Besides, SSDLite armed
with EMO-1M/2M/5M obtains 22.0/25.2/27.9 mAP with only
2.3M/3.3M/6.0M parameters and 0.6G/0.9G/1.8G FLOPs,
exceeding the recent MobileViTv2 [50] by +0.8↑/+0.6↑/+0.1↑
with FLOPs decreased by -33%↓/-50%↓/-62%↓; DeepLabv3
armed with EMO-1M/2M/5M obtains 33.5/35.3/37.98 mIoU
with only 5.6M/6.9M/10.3M parameters and 2.4G/3.5G/5.8G
FLOPs, surpassing MobileViTv2 by +1.6↑/+0.6↑/+0.8↑ with
much lower FLOPs.
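As a rough illustration of how iRMB pairs local and global mixing, the sketch below composes a 1-D depth-wise convolution (short-distance dependency) with a single-head self-attention (long-distance dependency) over token features inside one residual branch. This is a simplified stand-in under our own assumptions; it is not the paper's EW-MHSA, and the function names and toy attention (no learned projections) are illustrative.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Per-channel (depth-wise) 1-D convolution with same padding.
    x: (L, C) tokens; kernels: (K, C), one kernel per channel."""
    L, C = x.shape
    K = kernels.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.sum(xp[i:i + K] * kernels, axis=0) for i in range(L)])

def self_attention(x):
    """Single-head self-attention without learned projections (toy)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ x

def irmb_like(x, kernels):
    """Residual block mixing long-range (attention) then local (DW-Conv)."""
    h = self_attention(x)             # long-distance dependency
    h = depthwise_conv1d(h, kernels)  # short-distance dependency
    return x + h
```

The design point this sketch captures is that both mixers are cheap, basic operators, so the whole block stays within the usability and uniformity criteria of Sec. 2.1.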
2. Methodology: Induction and Deduction
2.1. Criteria for General Efficient Model
When designing efficient visual models for mobile appli-
cations, we advocate, subjectively and empirically, the
following criteria that an efficient model should satisfy as
far as possible: ➀ Usability. Simple implementation that
avoids complex operators and is easy to optimize for
applications. ➁ Uniformity. As few core modules as
possible to reduce model complexity and accelerate
deployment. ➂ Effectiveness. Good performance for
classification and dense prediction. ➃ Efficiency. Fewer
parameters and computations with an acceptable accuracy
trade-off. We summarize current efficient models in Tab. 1:
1) The performance of the MobileNet series [20, 54, 67] is
now slightly lower, and its parameter counts slightly higher,
than those of counterparts. 2) The recent MobileViT series
[49, 50, 67] achieves notable performance, but suffers from
higher FLOPs and slightly complex modules. 3) EdgeNeXt
[47] and EdgeViT [52] obtain good results, but their basic
blocks also consist of elaborate modules. Comparably, the
design principle of our EMO follows the above criteria
without introducing complicated operations (c.f., Sec. 2.4),
yet it still obtains impressive results on multiple vision
tasks (c.f., Sec. 3).
Table 1: Criterion comparison for current efficient mod-
els. ➀: Usability; ➁: Uniformity; ➂: Effectiveness; ➃: Effi-
ciency. ✔: Satisfied. ✚: Partially satisfied. ✘: Unsatisfied.
Method vs. Criterion ➀ ➁ ➂ ➃
MobileNet Series [20, 54, 67] ✔ ✔ ✚ ✚
MobileViT Series [49, 50, 67] ✚ ✚ ✔ ✚
EdgeNeXt [47] ✚ ✘ ✔ ✔
EdgeViT [52] ✔ ✚ ✔ ✚
EMO (Ours) ✔ ✔ ✔ ✔
2.2. Meta Mobile Block
Motivation. 1) Recent Transformer-based works [74, 42,
12, 56, 39, 62, 63] are dedicated to improving spatial token
mixing under the MetaFormer [75] paradigm for high-
performance networks. The CNN-based Inverted Residual
Block (IRB) [54] is recognized as the infrastructure of
efficient models [54, 61], but little work has been done to
explore an attention-based counterpart. This inspires us to
build a lightweight IRB-like infrastructure for attention-based
models. 2) While working to bring the one-residual IRB
with its inductive bias into the atten-