
Attention Augmented Convolutional Networks

Irwan Bello   Barret Zoph   Ashish Vaswani   Jonathon Shlens   Quoc V. Le
Google Brain
{ibello,barretzoph,avaswani,shlens,qvl}@google.com

Abstract

Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a 1.3% top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as Squeeze-and-Excitation [17]. It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline.

Figure 1. Attention Augmentation systematically improves image classification across a large variety of networks of different scales. ImageNet classification accuracy [9] versus the number of parameters for baseline models (ResNet) [14], models augmented with channel-wise attention (SE-ResNet) [17] and our proposed architecture (AA-ResNet).

1. Introduction

Convolutional Neural Networks have enjoyed tremendous success in many computer vision applications, especially in image classification [24, 23]. The design of the convolutional layer imposes 1) locality via a limited receptive field and 2) translation equivariance via weight sharing. Both these properties prove to be crucial inductive biases when designing models that operate over images. However, the local nature of the convolutional kernel prevents it from capturing global contexts in an image, often necessary for better recognition of objects in images [33].

Self-attention [43], on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. The key idea behind self-attention is to produce a weighted average of values computed from hidden units. Unlike the pooling or the convolutional operator, the weights used in the weighted average operation are produced dynamically via a similarity function between hidden units. As a result, the interaction between input signals depends on the signals themselves rather than being predetermined by their relative location like in convolutions. In particular, this allows self-attention to capture long range interactions without increasing the number of parameters.


Figure 2. Attention-augmented convolution: For each spatial location (h, w), Nh attention maps over the image are computed from
queries and keys. These attention maps are used to compute Nh weighted averages of the values V. The results are then concatenated,
reshaped to match the original volume’s spatial dimensions and mixed with a pointwise convolution. Multi-head attention is applied in
parallel to a standard convolution operation and the outputs are concatenated.

In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We develop a novel two-dimensional relative self-attention mechanism [37] that maintains translation equivariance while being infused with relative position information, making it well suited for images. Our self-attention formulation proves competitive for replacing convolutions entirely; however, we find in control experiments that the best results are obtained when combining both. We therefore do not completely abandon the idea of convolutions, but instead propose to augment convolutions with this self-attention mechanism. This is achieved by concatenating convolutional feature maps, which enforce locality, to self-attentional feature maps capable of modeling longer range dependencies (see Figure 2).

We test our method on the CIFAR-100 and ImageNet classification [22, 9] and the COCO object detection [27] tasks, across a wide range of architectures at different computational budgets, including a state-of-the-art resource constrained architecture [42]. Attention Augmentation yields systematic improvements with minimal additional computational burden and notably outperforms the popular Squeeze-and-Excitation [17] channelwise attention approach in all experiments. In particular, Attention Augmentation achieves a 1.3% top-1 accuracy improvement on ImageNet on top of a ResNet50 baseline and a 1.4 mAP increase in COCO object detection on top of a RetinaNet baseline. Surprisingly, experiments also reveal that fully self-attentional models, a special case of Attention Augmentation, only perform slightly worse than their fully convolutional counterparts on ImageNet, indicating that self-attention is a powerful stand-alone computational primitive for image classification.

2. Related Work

2.1. Convolutional networks

Modern computer vision has been built on powerful image featurizers learned on image classification tasks such as CIFAR-10 [22] and ImageNet [9]. These datasets have been used as benchmarks for delineating better image featurizations and network architectures across a broad range of tasks [21]. For example, improving the "backbone" network typically leads to improvements in object detection [19] and image segmentation [6]. These observations have inspired the research and design of new architectures, which are typically derived from the composition of convolution operations across an array of spatial scales and skip connections [23, 41, 39, 40, 14, 47, 13]. Indeed, automated search strategies for designing architectures based on convolutional primitives result in state-of-the-art accuracy on large-scale image classification tasks that translate across a range of tasks [55, 21].

2.2. Attention mechanisms in networks

Attention has enjoyed widespread adoption as a computational module for modeling sequences because of its ability to capture long distance interactions [2, 44, 4, 3]. Most notably, Bahdanau et al. [2] first proposed to combine attention with a Recurrent Neural Network [15] for alignment in Machine Translation. Attention was further extended by Vaswani et al. [43], where the self-attentional Transformer architecture achieved state-of-the-art results in Machine Translation. Using self-attention in cooperation with convolutions is a theme shared by recent work in Natural Language Processing [49] and Reinforcement Learning [52]. For example, the QANet [50] and Evolved Transformer [38] architectures alternate between self-attention layers and convolution layers for Question Answering applications and Machine Translation respectively.

Additionally, multiple attention mechanisms have been proposed for visual tasks to address the weaknesses of convolutions [17, 16, 7, 46, 45, 53]. For instance, Squeeze-and-Excitation [17] and Gather-Excite [16] reweigh feature channels using signals aggregated from entire feature maps, while BAM [31] and CBAM [46] refine convolutional features independently in the channel and spatial dimensions. In non-local neural networks [45], improvements are shown in video classification and object detection via the additive use of a few non-local residual blocks that employ self-attention in convolutional architectures. However, non-local blocks are only added to the architecture after ImageNet pretraining and are initialized in such a way that they do not break pretraining.

In contrast, our attention augmented networks do not rely on pretraining of their fully convolutional counterparts and employ self-attention along the entire architecture. The use of multi-head attention allows the model to attend jointly to both spatial and feature subspaces. Additionally, we enhance the representational power of self-attention over images by extending relative self-attention [37, 18] to two-dimensional inputs, allowing us to model translation equivariance in a principled way. Finally, our method produces additional feature maps, rather than recalibrating convolutional features via addition [45, 53] or gating [17, 16, 31, 46]. This property allows us to flexibly adjust the fraction of attentional channels and consider a spectrum of architectures, ranging from fully convolutional to fully attentional models.

3. Methods

We now formally describe our proposed Attention Augmentation method. We use the following naming conventions: H, W and F_in refer to the height, width and number of input filters of an activation map. N_h, d_v and d_k respectively refer to the number of heads, the depth of values and the depth of queries and keys in multi-head attention (MHA). We further assume that N_h divides d_v and d_k evenly and denote by d_v^h and d_k^h the depth of values and queries/keys per attention head.

3.1. Self-attention over images

Given an input tensor of shape (H, W, F_in) (we omit the batch dimension for simplicity), we flatten it to a matrix X ∈ R^{HW x F_in} and perform multihead attention as proposed in the Transformer architecture [43]. The output of the self-attention mechanism for a single head h can be formulated as:

    O_h = Softmax( (X W_q)(X W_k)^T / sqrt(d_k^h) ) (X W_v)    (1)

where W_q, W_k ∈ R^{F_in x d_k^h} and W_v ∈ R^{F_in x d_v^h} are learned linear transformations that map the input X to queries Q = X W_q, keys K = X W_k and values V = X W_v. The outputs of all heads are then concatenated and projected again as follows:

    MHA(X) = Concat[ O_1, ..., O_{N_h} ] W^O    (2)

where W^O ∈ R^{d_v x d_v} is a learned linear transformation. MHA(X) is then reshaped into a tensor of shape (H, W, d_v) to match the original spatial dimensions. We note that multi-head attention incurs a complexity of O((HW)^2 d_k) and a memory cost of O((HW)^2 N_h) as it requires storing attention maps for each head.
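For concreteness, the following is a minimal TensorFlow sketch of the plain, position-unaware multi-head self-attention of Equations (1)-(2) applied to an image tensor. The helper name mhsa_2d and its layout choices are illustrative assumptions, not the reference implementation provided in Appendix A.3.

```python
import tensorflow as tf

def mhsa_2d(inputs, dk, dv, num_heads):
    """Position-unaware multi-head self-attention over an image, Eq. (1)-(2).

    inputs: (B, H, W, Fin) with static H and W. dk, dv: total depths of
    queries/keys and values, both assumed divisible by num_heads.
    Returns a (B, H, W, dv) tensor.
    """
    height, width = inputs.shape[1], inputs.shape[2]
    dkh, dvh = dk // num_heads, dv // num_heads

    # A single 1x1 convolution produces Q, K and V for all heads at once.
    qkv = tf.keras.layers.Conv2D(2 * dk + dv, 1)(inputs)
    q, k, v = tf.split(qkv, [dk, dk, dv], axis=-1)
    q *= dkh ** -0.5  # fold the 1/sqrt(d_k^h) scaling of Eq. (1) into Q

    def split_heads(x, depth_per_head):
        # (B, H, W, d) -> (B, Nh, H*W, d/Nh): flatten pixels, separate heads.
        x = tf.reshape(x, [-1, height * width, num_heads, depth_per_head])
        return tf.transpose(x, [0, 2, 1, 3])

    q, k, v = split_heads(q, dkh), split_heads(k, dkh), split_heads(v, dvh)

    # N_h attention maps of shape (HW, HW) per example: the O((HW)^2) cost.
    weights = tf.nn.softmax(tf.matmul(q, k, transpose_b=True))
    o = tf.matmul(weights, v)                               # (B, Nh, HW, dvh)

    # Concatenate heads, restore the spatial shape and mix with W^O (a 1x1 conv).
    o = tf.transpose(o, [0, 2, 1, 3])
    o = tf.reshape(o, [-1, height, width, dv])
    return tf.keras.layers.Conv2D(dv, 1)(o)
```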
3.1.1 Two-dimensional Positional Encodings

Without explicit information about positions, self-attention is permutation equivariant:

    MHA(π(X)) = π(MHA(X))

for any permutation π of the pixel locations, making it ineffective for modeling highly structured data such as images. Multiple positional encodings that augment activation maps with explicit spatial information have been proposed to alleviate related issues. In particular, the Image Transformer [32] extends the sinusoidal waves first introduced in the original Transformer [43] to 2 dimensional inputs and CoordConv [29] concatenates positional channels to an activation map.

However, these encodings did not help in our experiments on image classification and object detection (see Section 4.5). We hypothesize that this is because such positional encodings, while not permutation equivariant, do not satisfy translation equivariance, which is a desirable property when dealing with images. As a solution, we propose to extend the use of relative position encodings [37] to two dimensions and present a memory efficient implementation based on the Music Transformer [18].

Relative positional encodings: Introduced in [37] for the purpose of language modeling, relative self-attention augments self-attention with relative position encodings and enables translation equivariance while preventing permutation equivariance. We implement two-dimensional relative self-attention by independently adding relative height information and relative width information.

The attention logit for how much pixel i = (i_x, i_y) attends to pixel j = (j_x, j_y) is computed as:

    l_{i,j} = ( q_i^T / sqrt(d_k^h) ) ( k_j + r^W_{j_x - i_x} + r^H_{j_y - i_y} )    (3)

where q_i is the query vector for pixel i (the i-th row of Q), k_j is the key vector for pixel j (the j-th row of K) and r^W_{j_x - i_x} and r^H_{j_y - i_y} are learned embeddings for relative width j_x - i_x and relative height j_y - i_y, respectively. The output of head h now becomes:

    O_h = Softmax( (Q K^T + S^rel_H + S^rel_W) / sqrt(d_k^h) ) V    (4)

where S^rel_H, S^rel_W ∈ R^{HW x HW} are matrices of relative position logits along the height and width dimensions that satisfy S^rel_H[i, j] = q_i^T r^H_{j_y - i_y} and S^rel_W[i, j] = q_i^T r^W_{j_x - i_x}. As we consider relative height and width information separately, S^rel_H and S^rel_W also satisfy the properties S^rel_W[i, j] = S^rel_W[i, j + W] and S^rel_H[i, j] = S^rel_H[i + H, j], which prevents us from having to compute the logits for all (i, j) pairs.

The relative attention algorithm in [37] explicitly stores all relative embeddings r_{ij} in a tensor of shape (HW, HW, d_k^h), thus incurring an additional memory cost of O((HW)^2 d_k^h). This compares to O((HW)^2 N_h) for the position-unaware version of self-attention that does not use position encodings. As we typically have N_h < d_k^h, such an implementation can prove extremely prohibitive and restrict the number of images that can fit in a minibatch. Instead, we extend the memory efficient relative masked attention algorithm presented in [18] to unmasked relative self-attention over 2 dimensional inputs. Our implementation has a memory cost of O(HW d_k^h). We leave the Tensorflow code of the algorithm in the Appendix.

The relative positional embeddings r^H and r^W are learned and shared across heads but not layers. For each layer, we add (2(H + W) - 2) d_k^h parameters to model relative distances along height and width.
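To make Equation (4) concrete, the sketch below materializes S^rel_H and S^rel_W explicitly by gathering the learned embeddings r^H and r^W for every pair of pixels. This is the straightforward O((HW)^2) construction rather than the memory-efficient O(HW d_k^h) variant described above (which extends the Music Transformer skewing trick and is given in the Appendix); the function and argument names are illustrative assumptions.

```python
import tensorflow as tf

def relative_logits_2d(q, rel_w, rel_h):
    """Explicitly build S^rel_H and S^rel_W from Eq. (4).

    q:     (B, Nh, H, W, dkh) per-head queries laid out on the 2D grid.
    rel_w: (2W - 1, dkh) embeddings r^W for offsets -(W-1) .. W-1.
    rel_h: (2H - 1, dkh) embeddings r^H for offsets -(H-1) .. H-1.
    Returns two tensors of shape (B, Nh, H*W, H*W), matching the row-major
    flattening of pixels used for X.
    """
    nh, height, width = q.shape[1], q.shape[2], q.shape[3]

    # Tables r^W_{jx-ix} of shape (W_i, W_j, dkh) and r^H_{jy-iy} of (H_i, H_j, dkh).
    off_w = tf.range(width)[None, :] - tf.range(width)[:, None] + width - 1
    off_h = tf.range(height)[None, :] - tf.range(height)[:, None] + height - 1
    rw = tf.gather(rel_w, off_w)
    rh = tf.gather(rel_h, off_h)

    # q_i . r^W_{jx-ix}: independent of j_y, so broadcast over it afterwards.
    logits_w = tf.einsum('bnyxd,xjd->bnyxj', q, rw)         # (B, Nh, H, W_i, W_j)
    logits_w = tf.tile(logits_w[:, :, :, :, None, :], [1, 1, 1, 1, height, 1])

    # q_i . r^H_{jy-iy}: independent of j_x.
    logits_h = tf.einsum('bnyxd,yjd->bnyxj', q, rh)         # (B, Nh, H_i, W, H_j)
    logits_h = tf.tile(logits_h[:, :, :, :, :, None], [1, 1, 1, 1, 1, width])

    flat = [-1, nh, height * width, height * width]
    return tf.reshape(logits_h, flat), tf.reshape(logits_w, flat)
```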
3.2. Attention Augmented Convolution

Multiple previously proposed attention mechanisms over images [17, 16, 31, 46] suggest that the convolution operator is limited by its locality and lack of understanding of global contexts. These methods capture long-range dependencies by recalibrating convolutional feature maps. In particular, Squeeze-and-Excitation (SE) [17] and Gather-Excite (GE) [16] perform channelwise reweighing while BAM [31] and CBAM [46] reweigh both channels and spatial positions independently. In contrast to these approaches, we 1) use an attention mechanism that can attend jointly to spatial and feature subspaces (each head corresponding to a feature subspace) and 2) introduce additional feature maps rather than refining them. Figure 2 summarizes our proposed augmented convolution.

Concatenating convolutional and attentional feature maps: Formally, consider an original convolution operator with kernel size k, F_in input filters and F_out output filters. The corresponding attention augmented convolution can be written as

    AAConv(X) = Concat[ Conv(X), MHA(X) ].

We denote by υ = d_v / F_out the ratio of attentional channels to the number of original output filters and by κ = d_k / F_out the ratio of key depth to the number of original output filters. Similarly to the convolution, the proposed attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions. We include Tensorflow code for the proposed attention augmented convolution in Appendix A.3.
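The concatenation itself is only a few lines. The sketch below wires a standard convolution together with the mhsa_2d helper from the earlier sketch; it omits the relative position logits and the batch normalization used in the experiments, and the helper name augmented_conv2d is an illustrative assumption rather than the reference code in Appendix A.3.

```python
import tensorflow as tf

def augmented_conv2d(inputs, fout, kernel_size, dk, dv, num_heads):
    """AAConv(X) = Concat[Conv(X), MHA(X)] with dv attentional output channels."""
    conv_out = tf.keras.layers.Conv2D(fout - dv, kernel_size,
                                      padding='same')(inputs)
    attn_out = mhsa_2d(inputs, dk=dk, dv=dv, num_heads=num_heads)
    return tf.concat([conv_out, attn_out], axis=-1)

# Example: a 3x3 augmented convolution with F_out = 256, N_h = 8 and
# dk = dv = 40 (both must be divisible by the number of heads).
x = tf.random.normal([2, 14, 14, 256])
y = augmented_conv2d(x, fout=256, kernel_size=3, dk=40, dv=40, num_heads=8)
assert y.shape == (2, 14, 14, 256)
```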
Effect on number of parameters: Multihead attention introduces a 1x1 convolution with F_in input filters and (2 d_k + d_v) = F_out (2κ + υ) output filters to compute queries, keys and values, and an additional 1x1 convolution with d_v = F_out υ input and output filters to mix the contribution of different heads. Considering the decrease in filters in the convolutional part, this leads to the following change in parameters:

    Δparams ∼ F_in F_out ( 2κ + (1 - k^2) υ + (F_out / F_in) υ^2 ),    (5)

where we ignore the parameters introduced by relative position embeddings for simplicity as these are negligible. In practice, this causes a slight decrease in parameters when replacing 3x3 convolutions and a slight increase in parameters when replacing 1x1 convolutions. Interestingly, we find in experiments that attention augmented networks still significantly outperform their fully convolutional counterparts while using fewer parameters.
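A quick numerical check of Equation (5), using the ImageNet setting κ = 2υ = 0.2 described in Section 4; the 256-channel layer width is an illustrative choice rather than a value taken from the paper:

```python
def delta_params(f_in, f_out, k, kappa, upsilon):
    """Approximate parameter change of Eq. (5) when augmenting a k x k convolution."""
    return f_in * f_out * (2 * kappa + (1 - k ** 2) * upsilon
                           + (f_out / f_in) * upsilon ** 2)

# kappa = 0.2, upsilon = 0.1 on a layer with F_in = F_out = 256:
print(delta_params(256, 256, k=3, kappa=0.2, upsilon=0.1))  # ~ -25559: 3x3 shrinks
print(delta_params(256, 256, k=1, kappa=0.2, upsilon=0.1))  # ~ +26870: 1x1 grows
```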
Attention Augmented Convolutional Architectures: In all our experiments, the augmented convolution is followed by a batch normalization [20] layer which can learn to scale the contribution of the convolution feature maps and the attention feature maps. We apply our augmented convolution once per residual block similarly to other visual attention mechanisms [17, 16, 31, 46] and along the entire architecture as memory permits (see Section 4 for more details).

Since the memory cost O(N_h (HW)^2) can be prohibitive for large spatial dimensions, we augment convolutions with attention starting from the last layer (with smallest spatial dimension) until we hit memory constraints.

To reduce the memory footprint of augmented networks, we typically resort to a smaller batch size and sometimes additionally downsample the inputs to self-attention in the layers with the largest spatial dimensions where it is applied. Downsampling is performed by applying 3x3 average pooling with stride 2, while the following upsampling (required for the concatenation) is obtained via bilinear interpolation.
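A sketch of this downsampled attention path, reusing the mhsa_2d helper from the earlier sketch; the exact placement of the pooling and interpolation is our reading of the description above rather than the released implementation.

```python
import tensorflow as tf

def downsampled_attention(inputs, dk, dv, num_heads):
    """Run self-attention at half resolution to save memory, then upsample.

    3x3 average pooling with stride 2 shrinks the attention input, and
    bilinear interpolation restores the original spatial size so the result
    can be concatenated with the convolutional feature maps.
    """
    height, width = inputs.shape[1], inputs.shape[2]
    x = tf.keras.layers.AveragePooling2D(pool_size=3, strides=2,
                                         padding='same')(inputs)
    x = mhsa_2d(x, dk=dk, dv=dv, num_heads=num_heads)
    return tf.image.resize(x, [height, width], method='bilinear')
```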
4. Experiments

In the subsequent experiments, we test Attention Augmentation on standard computer vision architectures such as ResNets [14, 47, 13] and MnasNet [42] on the CIFAR-100 [22], ImageNet [9] and COCO [25] datasets. Our experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a broad array of architectures and computational demands. We validate the utility of the proposed two-dimensional relative attention mechanism in ablation experiments. In all experiments, we substitute convolutional feature maps with self-attention feature maps as it makes for an easier comparison against the baseline models. Unless specified otherwise, all results correspond to our two-dimensional relative self-attention mechanism. Experimental details can be found in the Appendix.

4.1. CIFAR-100 image classification

We first investigate how Attention Augmentation performs on CIFAR-100 [22], a standard benchmark for low-resolution imagery, using a Wide ResNet architecture [51]. The Wide-ResNet-28-10 architecture is comprised of 3 stages of 4 residual blocks, each using two 3x3 convolutions. We augment the Wide-ResNet-28-10 by augmenting the first convolution of all residual blocks with relative attention, using N_h=8 heads, κ=2υ=0.2 and a minimum of 20 dimensions per head for the keys. We compare Attention Augmentation (AA) against other forms of attention including Squeeze-and-Excitation (SE) [17] and the parameter-free formulation of Gather-Excite (GE) [16]. Table 1 shows that Attention Augmentation improves performance both over the baseline network and Squeeze-and-Excitation at a similar parameter and complexity cost.

Architecture             Params   GFlops   top-1   top-5
Wide-ResNet [51]         36.3M    10.4     80.3    95.0
GE-Wide-ResNet [16]      36.3M    10.4     79.8    95.0
SE-Wide-ResNet [17]      36.5M    10.4     81.0    95.3
AA-Wide-ResNet (ours)    36.2M    10.9     81.6    95.2

Table 1. Image classification on the CIFAR-100 dataset [22] using the Wide-ResNet 28-10 architecture [51].

4.2. ImageNet image classification with ResNet

We next examine how Attention Augmentation performs on ImageNet [9, 21], a standard large-scale dataset for high resolution imagery, across an array of architectures. We start with the ResNet architecture [14, 47, 13] because of its widespread use and its ability to easily scale across several computational budgets. The building block in ResNet-34 comprises two 3x3 convolutions with the same number of output filters. ResNet-50 and its larger counterparts use a bottleneck block comprising 1x1, 3x3, 1x1 convolutions where the last pointwise convolution expands the number of filters and the first one contracts the number of filters. We modify all ResNets by augmenting the 3x3 convolutions as this decreases the number of parameters (we found that augmenting the pointwise expansions works just as well but does not save parameters or computations). We apply Attention Augmentation in each residual block of the last 3 stages of the architecture – when the spatial dimensions of the activation maps are 28x28, 14x14 and 7x7 – and downsample only during the first stage. All attention augmented networks use κ=2υ=0.2, except for ResNet-34 which uses κ=υ=0.25. The number of attention heads is fixed to N_h=8.

Architecture     Params (M)   ∆Infer   ∆Train   top-1
ResNet-50        25.6         -        -        76.4
SE [17]          28.1         +12%     +92%     77.5 (77.0)
BAM [31]         25.9         +19%     +43%     77.3
CBAM [46]        28.1         +56%     +132%    77.4 (77.4)
GALA [28]        29.4         +86%     +133%    77.5 (77.3)
AA (υ = 0.25)    24.3         +29%     +25%     77.7

Table 2. Image classification performance of different attention mechanisms on the ImageNet dataset. ∆ refers to the increase in latency compared to the ResNet50 on a single Tesla V100 GPU with Tensorflow using a batch size of 128. For fair comparison, we also include top-1 results (in parentheses) when scaling networks in width to match ∼25.6M parameters as the ResNet50 baseline.

Table 2 benchmarks Attention Augmentation against the channel and spatial attention mechanisms BAM [31], CBAM [46] and GALA [28] with channel reduction ratio σ = 16 on the ResNet50 architecture. Despite the lack of specialized kernels (see Appendix A.3), Attention Augmentation offers a competitive accuracy/computational trade-off compared to previously proposed attention mechanisms. Table 3 compares the non-augmented networks and Squeeze-and-Excitation (SE) [17] across different network scales. In all experiments, Attention Augmentation significantly increases performance over the non-augmented baseline and notably outperforms Squeeze-and-Excitation (SE) [17] while being more parameter efficient (Figure 1). Remarkably, our AA-ResNet-50 performs comparably to the baseline ResNet-101 and our AA-ResNet-101 outperforms the baseline ResNet-152.

These results suggest that attention augmentation is preferable to simply making networks deeper. We include and discuss attention map visualizations from different pixel positions in the appendix.

Architecture             GFlops   Params   top-1   top-5
ResNet-34 [14]           7.4      21.8M    73.6    91.5
SE-ResNet-34 [17]        7.4      22.0M    74.3    91.8
AA-ResNet-34 (ours)      7.1      20.7M    74.7    92.0
ResNet-50 [14]           8.2      25.6M    76.4    93.1
SE-ResNet-50 [17]        8.2      28.1M    77.5    93.7
AA-ResNet-50 (ours)      8.3      25.8M    77.7    93.8
ResNet-101 [14]          15.6     44.5M    77.9    94.0
SE-ResNet-101 [17]       15.6     49.3M    78.4    94.2
AA-ResNet-101 (ours)     16.1     45.4M    78.7    94.4
ResNet-152 [14]          23.0     60.2M    78.4    94.2
SE-ResNet-152 [17]       23.1     66.8M    78.9    94.5
AA-ResNet-152 (ours)     23.8     61.6M    79.1    94.6

Table 3. Image classification on the ImageNet dataset [9] across a range of ResNet architectures: ResNet-34, ResNet-50, ResNet-101, and ResNet-152 [14, 47, 13].

4.3. ImageNet classification with MnasNet

In this section, we inspect the use of Attention Augmentation in a resource constrained setting by conducting ImageNet experiments with the MnasNet architecture [42], which is an extremely parameter-efficient architecture. In particular, the MnasNet was found by neural architecture search [54], using only the highly optimized mobile inverted bottleneck block [36] and the Squeeze-and-Excitation operation [17] as the primitives in its search space. We apply Attention Augmentation to the mobile inverted bottleneck by replacing convolutional channels in the expansion pointwise convolution, using κ=2υ=0.1 and N_h=4 heads. Our augmented MnasNets use augmented inverted bottlenecks in the last 13 blocks out of 18 in the MnasNet architecture, starting when the spatial dimension is 28x28. We downsample only in the first stage where Attention Augmentation is applied. We leave the final pointwise convolution, also referred to as the "head", unchanged.

In Table 4, we report ImageNet accuracies for the baseline MnasNet and its attention augmented variants at different width multipliers. Our experiments show that Attention Augmentation yields accuracy improvements across all width multipliers. Augmenting MnasNets with relative self-attention incurs a slight parameter increase; however, we verify in Figure 3 that the accuracy improvements are not just explained by the parameter increase. Additionally, we note that the MnasNet architecture employs Squeeze-and-Excitation at multiple locations that were optimally selected via architecture search, further suggesting the benefits of our method.

Architecture        GFlops   Params   top-1   top-5
MnasNet-0.75        0.45     2.91M    73.3    91.3
AA-MnasNet-0.75     0.51     3.02M    73.9    91.6
MnasNet-1.0         0.63     3.89M    75.2    92.4
AA-MnasNet-1.0      0.70     4.06M    75.7    92.6
MnasNet-1.25        1.01     5.26M    76.7    93.2
AA-MnasNet-1.25     1.11     5.53M    77.2    93.6
MnasNet-1.4         1.17     6.10M    77.2    93.5
AA-MnasNet-1.4      1.29     6.44M    77.7    93.8

Table 4. Baseline and attention augmented MnasNet [42] accuracies with width multipliers 0.75, 1.0, 1.25 and 1.4.

Figure 3. ImageNet top-1 accuracy as a function of number of parameters for MnasNet (black) and Attention-Augmented-MnasNet (red) with depth multipliers 0.75, 1.0, 1.25 and 1.4.

4.4. Object Detection with COCO dataset

We next investigate the use of Attention Augmentation on the task of object detection on the COCO dataset [27]. We employ the RetinaNet architecture with a ResNet-50 and ResNet-101 backbone as done in [26], using the open-sourced RetinaNet codebase (https://github.com/tensorflow/tpu/tree/master/models/official/retinanet). We apply Attention Augmentation uniquely on the ResNet backbone, modifying it similarly as in our ImageNet classification experiments.

Our relative self-attention mechanism improves the performance of the RetinaNet on both ResNet-50 and ResNet-101 as shown in Table 5. Most notably, Attention Augmentation yields a 1.4 mAP improvement over a strong RetinaNet baseline from [26]. In contrast to the success of Squeeze-and-Excitation in image classification with ImageNet, our experiments show that adding Squeeze-and-Excitation operators in the backbone network of the RetinaNet significantly hurts performance, in spite of grid searching over the squeeze ratio σ ∈ {4, 8, 16}.

Backbone architecture     GFlops   Params   mAP_COCO   mAP50   mAP75
ResNet-50 [26]            182      33.4M    36.8       54.5    39.5
SE-ResNet-50 [17]         183      35.9M    36.5       54.0    39.1
AA-ResNet-50 (ours)       182      33.1M    38.2       56.5    40.7
ResNet-101 [26]           243      52.4M    38.5       56.4    41.2
SE-ResNet-101 [17]        243      57.2M    37.4       55.0    39.9
AA-ResNet-101 (ours)      245      51.7M    39.2       57.8    41.9

Table 5. Object detection on the COCO dataset [27] using the RetinaNet architecture [26] with different backbone architectures. We report mean Average Precision at three different IoU values.

We hypothesize that localization requires precise spatial information, which SE discards during the spatial pooling operation, thereby negatively affecting performance. Self-attention, on the other hand, maintains spatial information and is likely able to identify object boundaries successfully. Visualizations of attention maps (see Figures 9 and 10 in the Appendix) reveal that some heads are indeed delineating objects from their background, which might be important for localization.

4.5. Ablation Study

Fully-attentional vision models: In this section, we investigate the performance of Attention Augmentation as a function of the fraction of attentional channels. As we increase this fraction to 100%, we begin to replace a ConvNet with a fully attentional model, only leaving pointwise convolutions and the stem unchanged. Table 6 presents the performance of Attention Augmentation on the ResNet-50 architecture for varying ratios κ=υ ∈ {0.25, 0.5, 0.75, 1.0}. Performance slightly degrades as the ratio of attentional channels increases, which we hypothesize is partly explained by the average pooling operation for downsampling at the first stage where Attention Augmentation is applied. Attention Augmentation proves however quite robust to the fraction of attentional channels. For instance, AA-ResNet-50 with κ=υ=0.75 outperforms its ResNet-50 counterpart, while being more parameter and flops efficient, indicating that mostly employing attentional channels is readily competitive.

Perhaps surprisingly, these experiments also reveal that our proposed self-attention mechanism is a powerful stand-alone computational primitive for image classification and that fully attentional models are viable for discriminative visual tasks. In particular, AA-ResNet-50 with κ=υ=1, which uses exclusively attentional channels, is only 2.5% worse in accuracy than its fully convolutional counterpart, in spite of downsampling with average pooling and having 25% fewer parameters. Notably, this fully attentional architecture also outperforms ResNet-34 while being more parameter and flops efficient (see Table 6). We consider pointwise convolutions as dense layers; this architecture employs 4 non-pointwise convolutions in the stem and the first stage of the architecture, but we believe such operations can be replaced by attention too.

Architecture       GFlops   Params   top-1   top-5
ResNet-34 [14]     7.4      21.8M    73.6    91.5
ResNet-50 [14]     8.2      25.6M    76.4    93.1
κ = υ = 0.25       7.9      24.3M    77.7    93.8
κ = υ = 0.5        7.3      22.3M    77.3    93.6
κ = υ = 0.75       6.8      20.7M    76.7    93.2
κ = υ = 1.0        6.3      19.4M    73.9    91.5

Table 6. Attention Augmented ResNet-50 with varying ratios of attentional channels.

Figure 4. Effect of relative position embeddings as the ratio of attentional channels increases on our Attention-Augmented ResNet50.

Importance of position encodings: In Figure 4, we show the effect of our proposed two-dimensional relative position encodings as a function of the fraction of attentional channels. As expected, experiments demonstrate that our relative position encodings become increasingly more important as the architecture employs more attentional channels. In particular, the fully self-attentional ResNet-50 gains 2.8% top-1 ImageNet accuracy when using relative position encodings, which indicates the necessity of maintaining position information for fully self-attentional vision models.

We additionally compare our proposed two-dimensional relative position encodings to other position encoding schemes. We apply Attention Augmentation using the same hyperparameters as in Section 4.2 with the following position encoding schemes: 1) the position-unaware version of self-attention (referred to as None), 2) a two-dimensional implementation of the sinusoidal positional waves (referred to as 2d Sine) as used in [32], 3) CoordConv [29], for which we concatenate (x, y, r) coordinate channels to the inputs of the attention function, and 4) our proposed two-dimensional relative position encodings (referred to as Relative).

In Tables 7 and 8, we present the results on ImageNet classification and the COCO object detection task respectively. On both tasks, Attention Augmentation without position encodings already yields improvements over the fully convolutional non-augmented variants. Our experiments also reveal that the sinusoidal encodings and the coordinate convolution do not provide improvements over the position-unaware version of Attention Augmentation. We obtain additional improvements when using our two-dimensional relative attention, demonstrating the utility of preserving translation equivariance while preventing permutation equivariance.

Architecture     Position Encodings   top-1   top-5
AA-ResNet-34     None                 74.4    91.9
AA-ResNet-34     2d Sine              74.4    92.0
AA-ResNet-34     CoordConv            74.4    92.0
AA-ResNet-34     Relative (ours)      74.7    92.0
AA-ResNet-50     None                 77.5    93.7
AA-ResNet-50     2d Sine              77.5    93.7
AA-ResNet-50     CoordConv            77.5    93.8
AA-ResNet-50     Relative (ours)      77.7    93.8

Table 7. Effects of different position encodings in Attention Augmentation on ImageNet classification.

Position Encodings   mAP_COCO   mAP50   mAP75
None                 37.7       56.0    40.2
CoordConv [29]       37.4       55.5    40.1
Relative (ours)      38.2       56.5    40.7

Table 8. Effects of different position encodings in Attention Augmentation on the COCO object detection task using a RetinaNet AA-ResNet-50 backbone.

5. Discussion and future work

In this work, we consider the use of self-attention for vision models as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism for images that enables training of competitive fully self-attentional vision models on image classification for the first time. We propose to augment convolutional operators with this self-attention mechanism and validate the superiority of this approach over other attention schemes. Extensive experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a wide range of architectures and computational settings.

Several open questions from this work remain. In future work, we will focus on the fully attentional regime and explore how different attention mechanisms trade off computational efficiency versus representational power. For instance, identifying a local attention mechanism may result in an efficient and scalable computational mechanism that could prevent the need for downsampling with average pooling [34]. Additionally, it is plausible that architectural design choices that are well suited when exclusively relying on convolutions are suboptimal when using self-attention mechanisms. As such, it would be interesting to see if using Attention Augmentation as a primitive in automated architecture search procedures proves useful to find even better models than those previously found in image classification [55], object detection [12], image segmentation [6] and other domains [5, 1, 35, 8]. Finally, one can ask to which degree fully attentional models can replace convolutional networks for visual tasks.

Acknowledgements

The authors would like to thank Tsung-Yi Lin, Prajit Ramachandran, Mingxing Tan, Yanping Huang and the Google Brain team for insightful comments and discussions.

References

[1] Maximilian Alber, Irwan Bello, Barret Zoph, Pieter-Jan Kindermans, Prajit Ramachandran, and Quoc V. Le. Backprop evolution. CoRR, abs/1808.02822, 2018.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
[3] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Huai-hsin Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. Seq2slate: Re-ranking and slate optimization with RNNs. CoRR, abs/1810.02019, 2018.
[4] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. 2016.
[5] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 459–468. JMLR.org, 2017.
[6] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pages 8713–8724, 2018.
[7] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. CoRR, abs/1810.11579, 2018.
[8] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[10] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[11] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10750–10760, 2018.
[12] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
[15] Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[16] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-Excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9423–9433, 2018.
[17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[18] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. Music Transformer. In Advances in Neural Processing Systems, 2018.
[19] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Learning Representations, 2015.
[21] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[28] Drew Linsley, Dan Scheibler, Sven Eberhardt, and Thomas Serre. Global-and-local attention networks for visual recognition. CoRR, abs/1805.08819, 2018.
[29] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pages 9628–9639, 2018.
[30] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[31] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In British Machine Vision Conference, 2018.
[32] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In International Conference on Machine Learning, 2018.
[33] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge Belongie. Objects in context. 2007.
[34] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.
[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.

[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[37] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[38] David R. So, Chen Liang, and Quoc V. Le. The Evolved Transformer. CoRR, abs/1901.11117, 2019.
[39] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In International Conference on Learning Representations Workshop Track, 2016.
[40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[41] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[44] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, pages 2692–2700, 2015.
[45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[46] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[47] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[48] Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. ShakeDrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
[49] Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu. Convolutional self-attention network. CoRR, abs/1810.13320, 2018.
[50] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations, 2018.
[51] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
[52] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In ICLR, 2019.
[53] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv:1805.08318, 2018.
[54] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
[55] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
