Attention Augmented Convolutional Networks (Bello et al., ICCV 2019)
Figure 2. Attention-augmented convolution: For each spatial location (h, w), Nh attention maps over the image are computed from
queries and keys. These attention maps are used to compute Nh weighted averages of the values V. The results are then concatenated,
reshaped to match the original volume’s spatial dimensions and mixed with a pointwise convolution. Multi-head attention is applied in
parallel to a standard convolution operation and the outputs are concatenated.
…Transformer [38] architectures alternate between self-attention layers and convolution layers for Question Answering applications and Machine Translation, respectively. Additionally, multiple attention mechanisms have been proposed for visual tasks to address the weaknesses of convolutions [17, 16, 7, 46, 45, 53]. For instance, Squeeze-and-Excitation [17] and Gather-Excite [16] reweigh feature channels using signals aggregated from entire feature maps, while BAM [31] and CBAM [46] refine convolutional features independently in the channel and spatial dimensions. In non-local neural networks [45], improvements are shown in video classification and object detection via the additive use of a few non-local residual blocks that employ self-attention in convolutional architectures. However, non-local blocks are only added to the architecture after ImageNet pretraining and are initialized in such a way that they do not break pretraining.

In contrast, our attention augmented networks do not rely on pretraining of their fully convolutional counterparts and employ self-attention along the entire architecture. The use of multi-head attention allows the model to attend jointly to both spatial and feature subspaces. Additionally, we enhance the representational power of self-attention over images by extending relative self-attention [37, 18] to two-dimensional inputs, allowing us to model translation equivariance in a principled way. Finally, our method produces additional feature maps, rather than recalibrating convolutional features via addition [45, 53] or gating [17, 16, 31, 46]. This property allows us to flexibly adjust the fraction of attentional channels and consider a spectrum of architectures, ranging from fully convolutional to fully attentional models.

3. Methods

We now formally describe our proposed Attention Augmentation method. We use the following naming conventions: H, W and F_in refer to the height, width and number of input filters of an activation map. N_h, d_v and d_k respectively refer to the number of heads, the depth of values and the depth of queries and keys in multi-head attention (MHA). We further assume that N_h divides d_v and d_k evenly and denote by d_v^h and d_k^h the depth of values and of queries/keys per attention head.

3.1. Self-attention over images

Given an input tensor of shape (H, W, F_in) (we omit the batch dimension for simplicity), we flatten it to a matrix X ∈ R^{HW × F_in} and perform multihead attention as proposed in the Transformer architecture [43]. The output of the self-attention mechanism for a single head h can be formulated as:

O_h = \mathrm{Softmax}\left(\frac{(X W_q)(X W_k)^T}{\sqrt{d_k^h}}\right)(X W_v)    (1)

where W_q, W_k ∈ R^{F_in × d_k^h} and W_v ∈ R^{F_in × d_v^h} are learned linear transformations that map the input X to queries Q = X W_q, keys K = X W_k and values V = X W_v. The outputs of all heads are then concatenated and projected again as follows:

\mathrm{MHA}(X) = \mathrm{Concat}\left[O_1, \dots, O_{N_h}\right] W^O    (2)

where W^O ∈ R^{d_v × d_v} is a learned linear transformation. MHA(X) is then reshaped into a tensor of shape (H, W, d_v) to match the original spatial dimensions. We note that multi-head attention incurs a complexity of O((HW)^2 d_k) and a memory cost of O((HW)^2 N_h), as it requires storing the attention maps of each head.
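For concreteness, the following NumPy sketch implements Equations (1) and (2) over a flattened input, with the batch dimension omitted as above. The function name and the use of a single wide projection split into heads (rather than separate per-head matrices) are illustrative choices, not the exact implementation used in our experiments:

import numpy as np

def mha_2d(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over a flattened image (Eqs. (1)-(2)).

    X: flattened input of shape (H*W, F_in), batch dimension omitted.
    W_q, W_k: (F_in, d_k); W_v: (F_in, d_v); W_o: (d_v, d_v).
    Splitting one wide projection into heads is equivalent to using
    separate per-head matrices W_q^h, W_k^h, W_v^h.
    """
    HW = X.shape[0]
    d_k, d_v = W_q.shape[1], W_v.shape[1]
    dkh, dvh = d_k // num_heads, d_v // num_heads

    # Per-head queries, keys and values: (num_heads, H*W, depth per head).
    Q = (X @ W_q).reshape(HW, num_heads, dkh).transpose(1, 0, 2)
    K = (X @ W_k).reshape(HW, num_heads, dkh).transpose(1, 0, 2)
    V = (X @ W_v).reshape(HW, num_heads, dvh).transpose(1, 0, 2)

    # Eq. (1): scaled dot-product attention, one (H*W, H*W) map per head.
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(dkh)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over keys
    O = weights @ V                                           # (num_heads, H*W, d_v^h)

    # Eq. (2): concatenate the heads and project with W^O.
    return O.transpose(1, 0, 2).reshape(HW, d_v) @ W_o        # (H*W, d_v)

The (num_heads, H*W, H*W) attention weights computed in the middle are exactly the maps responsible for the O((HW)^2 N_h) memory cost noted above; reshaping the (H*W, d_v) output to (H, W, d_v) recovers the spatial dimensions.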
3.1.1 Two-dimensional Positional Encodings

Without explicit information about positions, self-attention is permutation equivariant:

MHA(π(X)) = π(MHA(X))

for any permutation π of the pixel locations, making it ineffective for modeling highly structured data such as images. Multiple positional encodings that augment activation maps with explicit spatial information have been proposed to alleviate related issues. In particular, the Image Transformer [32] extends the sinusoidal waves first introduced in the original Transformer [43] to two-dimensional inputs, and CoordConv [29] concatenates positional channels to an activation map.

However, these encodings did not help in our experiments on image classification and object detection (see Section 4.5). We hypothesize that this is because such positional encodings, while not permutation equivariant, do not satisfy translation equivariance, which is a desirable property when dealing with images. As a solution, we propose to extend the use of relative position encodings [37] to two dimensions and present a memory efficient implementation based on the Music Transformer [18].

Relative positional encodings: Introduced in [37] for the purpose of language modeling, relative self-attention augments self-attention with relative position encodings and enables translation equivariance while preventing permutation equivariance. We implement two-dimensional relative self-attention by independently adding relative height information and relative width information. The attention logit for how much pixel i = (i_x, i_y) attends to pixel j = (j_x, j_y) is computed as:

l_{i,j} = \frac{q_i^T}{\sqrt{d_k^h}} \left( k_j + r^W_{j_x - i_x} + r^H_{j_y - i_y} \right)    (3)

where q_i is the query vector for pixel i (the i-th row of Q), k_j is the key vector for pixel j (the j-th row of K), and r^W_{j_x - i_x} and r^H_{j_y - i_y} are learned embeddings for relative width j_x − i_x and relative height j_y − i_y, respectively. The output of head h now becomes:

O_h = \mathrm{Softmax}\left(\frac{Q K^T + S^{rel}_H + S^{rel}_W}{\sqrt{d_k^h}}\right) V    (4)

where S^{rel}_H, S^{rel}_W ∈ R^{HW × HW} are matrices of relative position logits along the height and width dimensions that satisfy S^{rel}_H[i, j] = q_i^T r^H_{j_y - i_y} and S^{rel}_W[i, j] = q_i^T r^W_{j_x - i_x}. As we consider relative height and width information separately, S^{rel}_H and S^{rel}_W also satisfy the properties S^{rel}_W[i, j] = S^{rel}_W[i, j + W] and S^{rel}_H[i, j] = S^{rel}_H[i + H, j], which prevents having to compute the logits for all (i, j) pairs.

The relative attention algorithm in [37] explicitly stores all relative embeddings r_{ij} in a tensor of shape (HW, HW, d_k^h), thus incurring an additional memory cost of O((HW)^2 d_k^h). This compares to O((HW)^2 N_h) for the position-unaware version of self-attention that does not use position encodings. As we typically have N_h < d_k^h, such an implementation can prove extremely prohibitive and restrict the number of images that can fit in a minibatch. Instead, we extend the memory efficient relative masked attention algorithm presented in [18] to unmasked relative self-attention over two-dimensional inputs. Our implementation has a memory cost of O(HW d_k^h). We provide the Tensorflow code of the algorithm in the Appendix.

The relative positional embeddings r^H and r^W are learned and shared across heads but not across layers. For each layer, we add (2(H + W) − 2) d_k^h parameters to model relative distances along height and width.
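The key to the memory efficient computation is that per-pair embeddings are never materialized. The NumPy sketch below shows the one-dimensional building block under our own illustrative conventions (relative distances indexed from −(L − 1) to L − 1; the names rel_to_abs and relative_logits_1d are not from the Appendix code): relative logits are computed against the 2L − 1 learned embeddings and then rearranged into absolute-position logits with a padding-and-reshape trick in the spirit of [18]. In the two-dimensional case, this routine is applied along width with the height dimension folded into the batch axis, and analogously along height on transposed inputs, which, after broadcasting over the other spatial dimension and reshaping, yields S^rel_W and S^rel_H.

import numpy as np

def rel_to_abs(rel_logits):
    """Rearrange relative-position logits into absolute-position logits.

    rel_logits: (B, L, 2L - 1); entry [b, i, r] is the logit of query
    position i for relative distance r - (L - 1).
    Returns: (B, L, L); entry [b, i, j] is the logit of query i for key j,
    i.e. for relative distance j - i.
    """
    B, L, _ = rel_logits.shape
    # Pad one column, flatten, pad L - 1 entries and reshape so that every
    # row ends up shifted by one position with respect to the previous one.
    x = np.concatenate([rel_logits, np.zeros((B, L, 1))], axis=2)   # (B, L, 2L)
    x = x.reshape(B, 2 * L * L)
    x = np.concatenate([x, np.zeros((B, L - 1))], axis=1)           # (B, 2L^2 + L - 1)
    x = x.reshape(B, L + 1, 2 * L - 1)
    return x[:, :L, L - 1:]

def relative_logits_1d(q, rel_emb):
    """Relative logits along one spatial dimension of length L.

    q: queries of shape (B, L, d); rel_emb: (2L - 1, d) learned embeddings,
    one per relative distance in [-(L - 1), L - 1].
    """
    rel_logits = np.einsum('bld,rd->blr', q, rel_emb)               # (B, L, 2L - 1)
    return rel_to_abs(rel_logits)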
3.2. Attention Augmented Convolution

Multiple previously proposed attention mechanisms over images [17, 16, 31, 46] suggest that the convolution operator is limited by its locality and lack of understanding of global contexts. These methods capture long-range dependencies by recalibrating convolutional feature maps. In particular, Squeeze-and-Excitation (SE) [17] and Gather-Excite (GE) [16] perform channelwise reweighing, while BAM [31] and CBAM [46] reweigh both channels and spatial positions independently. In contrast to these approaches, we 1) use an attention mechanism that can attend jointly to spatial and feature subspaces (each head corresponding to a feature subspace) and 2) introduce additional feature maps rather than refining them. Figure 2 summarizes our proposed augmented convolution.

Concatenating convolutional and attentional feature maps: Formally, consider an original convolution operator with kernel size k, F_in input filters and F_out output filters. The corresponding attention augmented convolution can be written as

\mathrm{AAConv}(X) = \mathrm{Concat}\left[\mathrm{Conv}(X), \mathrm{MHA}(X)\right].

We denote by υ = d_v / F_out the ratio of attentional channels to the number of original output filters and by κ = d_k / F_out the ratio of key depth to the number of original output filters. Similarly to the convolution, the proposed attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions. We include Tensorflow code for the proposed attention augmented convolution in Appendix A.3.

Effect on number of parameters: Multihead attention introduces a 1x1 convolution with F_in input filters and (2d_k + d_v) = F_out(2κ + υ) output filters to compute queries, keys and values, and an additional 1x1 convolution with d_v = F_out υ input and output filters to mix the contributions of the different heads. Considering the decrease in filters in the convolutional part, this leads to the following change in parameters:

\Delta_{params} \sim F_{in} F_{out} \left( 2\kappa + (1 - k^2)\upsilon + \frac{F_{out}}{F_{in}} \upsilon^2 \right),    (5)

where we ignore the parameters introduced by relative position embeddings for simplicity, as these are negligible. In practice, this causes a slight decrease in parameters when replacing 3x3 convolutions and a slight increase in parameters when replacing 1x1 convolutions. Interestingly, we find in experiments that attention augmented networks still significantly outperform their fully convolutional counterparts while using fewer parameters.

Attention Augmented Convolutional Architectures: In all our experiments, the augmented convolution is followed by a batch normalization [20] layer which can learn to scale the contribution of the convolution feature maps and the attention feature maps. We apply our augmented convolution once per residual block, similarly to other visual attention mechanisms [17, 16, 31, 46], and along the entire architecture as memory permits (see Section 4 for more details).
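As a rough sanity check of Equation (5) (our own back-of-the-envelope illustration, taking the ImageNet setting κ = 2υ = 0.2, i.e. κ = 0.2 and υ = 0.1, and assuming F_out = F_in for simplicity): replacing a 3x3 convolution (k = 3) gives ∆params ≈ F_in F_out (0.4 − 0.8 + 0.01) = −0.39 F_in F_out, a reduction of roughly 4% of the 9 F_in F_out parameters of the replaced convolution, whereas replacing a 1x1 convolution (k = 1) gives ∆params ≈ +0.41 F_in F_out, consistent with the slight decrease and slight increase noted above.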
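Schematically, the augmented convolution is just this concatenation followed by the batch normalization mentioned above. The sketch below (not the Tensorflow code of Appendix A.3) uses placeholder callables conv_fn and mha_fn for a standard convolution with F_out − d_v output filters and for the multi-head attention of Section 3.1 reshaped back to (H, W, d_v):

import numpy as np

def aa_conv(X, conv_fn, mha_fn, F_out, v):
    """Attention augmented convolution: AAConv(X) = Concat[Conv(X), MHA(X)].

    X: input activation map of shape (H, W, F_in).
    conv_fn: convolution producing F_out - d_v output channels (d_v = v * F_out).
    mha_fn: multi-head self-attention producing d_v attentional channels,
            reshaped to the same spatial dimensions as X.
    """
    d_v = int(v * F_out)
    conv_out = conv_fn(X)                                  # (H, W, F_out - d_v)
    attn_out = mha_fn(X)                                   # (H, W, d_v)
    assert conv_out.shape[-1] + attn_out.shape[-1] == F_out
    return np.concatenate([conv_out, attn_out], axis=-1)   # (H, W, F_out)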
Since the memory cost O(N_h (HW)^2) can be prohibitive for large spatial dimensions, we augment convolutions with attention starting from the last layer (with the smallest spatial dimension) until we hit memory constraints. To reduce the memory footprint of augmented networks, we typically resort to a smaller batch size and sometimes additionally downsample the inputs to self-attention in the layers with the largest spatial dimensions where it is applied. Downsampling is performed by applying 3x3 average pooling with stride 2, while the following upsampling (required for the concatenation) is obtained via bilinear interpolation.
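To give a sense of scale, the back-of-the-envelope calculation below (our own illustration, assuming float32 attention maps and the N_h = 8 heads used in Section 4) evaluates the O(N_h (HW)^2) attention-map memory per image and per augmented layer at typical ResNet resolutions:

# Attention-map memory per image and per augmented layer:
# Nh * (H*W)^2 entries, assuming float32 storage (4 bytes each).
num_heads = 8
for H, W in [(14, 14), (28, 28), (56, 56)]:
    n_entries = num_heads * (H * W) ** 2
    print(f"{H}x{W}: {n_entries * 4 / 2**20:.1f} MiB")
# 14x14: 1.2 MiB   28x28: 18.8 MiB   56x56: 300.1 MiB

The quadratic growth with spatial extent is what motivates starting augmentation at the smallest feature maps and downsampling the largest augmented ones.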
4. Experiments

In the subsequent experiments, we test Attention Augmentation on standard computer vision architectures such as ResNets [14, 47, 13] and MnasNet [42] on the CIFAR-100 [22], ImageNet [9] and COCO [25] datasets. Our experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a broad array of architectures and computational demands. We validate the utility of the proposed two-dimensional relative attention mechanism in ablation experiments. In all experiments, we substitute convolutional feature maps with self-attention feature maps, as it makes for an easier comparison against the baseline models. Unless specified otherwise, all results correspond to our two-dimensional relative self-attention mechanism. Experimental details can be found in the Appendix.

4.1. CIFAR-100 image classification

We first investigate how Attention Augmentation performs on CIFAR-100 [22], a standard benchmark for low-resolution imagery, using a Wide ResNet architecture [51]. The Wide-ResNet-28-10 architecture is comprised of 3 stages of 4 residual blocks, each using two 3x3 convolutions. We augment the Wide-ResNet-28-10 by augmenting the first convolution of all residual blocks with relative attention, using N_h = 8 heads, κ = 2υ = 0.2 and a minimum of 20 dimensions per head for the keys. We compare Attention Augmentation (AA) against other forms of attention, including Squeeze-and-Excitation (SE) [17] and the parameter-free formulation of Gather-Excite (GE) [16]. Table 1 shows that Attention Augmentation improves performance both over the baseline network and over Squeeze-and-Excitation at a similar parameter and complexity cost.

Architecture Params GFlops top-1 top-5
Wide-ResNet [51] 36.3M 10.4 80.3 95.0
GE-Wide-ResNet [16] 36.3M 10.4 79.8 95.0
SE-Wide-ResNet [17] 36.5M 10.4 81.0 95.3
AA-Wide-ResNet (ours) 36.2M 10.9 81.6 95.2

Table 1. Image classification on the CIFAR-100 dataset [22] using the Wide-ResNet 28-10 architecture [51].

4.2. ImageNet image classification with ResNet

We next examine how Attention Augmentation performs on ImageNet [9, 21], a standard large-scale dataset for high resolution imagery, across an array of architectures. We start with the ResNet architecture [14, 47, 13] because of its widespread use and its ability to easily scale across several computational budgets. The building block in ResNet-34 comprises two 3x3 convolutions with the same number of output filters. ResNet-50 and its larger counterparts use a bottleneck block comprising 1x1, 3x3, 1x1 convolutions, where the last pointwise convolution expands the number of filters and the first one contracts the number of filters. We modify all ResNets by augmenting the 3x3 convolutions, as this decreases the number of parameters (we found that augmenting the pointwise expansions works just as well). We apply Attention Augmentation in each residual block of the last 3 stages of the architecture – when the spatial dimensions of the activation maps are 28x28, 14x14 and 7x7 – and downsample only during the first stage. All attention augmented networks use κ = 2υ = 0.2, except for ResNet-34 which uses κ = υ = 0.25. The number of attention heads is fixed to N_h = 8.

Architecture Params (M) ∆Infer ∆Train top-1
ResNet-50 25.6 - - 76.4
SE [17] 28.1 +12% +92% 77.5 (77.0)
BAM [31] 25.9 +19% +43% 77.3
CBAM [46] 28.1 +56% +132% 77.4 (77.4)
GALA [28] 29.4 +86% +133% 77.5 (77.3)
AA (υ = 0.25) 24.3 +29% +25% 77.7

Table 2. Image classification performance of different attention mechanisms on the ImageNet dataset. ∆ refers to the increase in latency compared to the ResNet-50 baseline on a single Tesla V100 GPU with Tensorflow using a batch size of 128. For fair comparison, we also include top-1 results (in parentheses) when scaling networks in width to match ∼25.6M parameters as in the ResNet-50 baseline.

Table 2 benchmarks Attention Augmentation against the channel and spatial attention mechanisms BAM [31], CBAM [46] and GALA [28] with channel reduction ratio σ = 16 on the ResNet-50 architecture. Despite the lack of specialized kernels (see Appendix A.3), Attention Augmentation offers a competitive accuracy/computational trade-off compared to previously proposed attention mechanisms. Table 3 compares the non-augmented networks and Squeeze-and-Excitation (SE) [17] across different network scales. In all experiments, Attention Augmentation significantly increases performance over the non-augmented baseline and notably outperforms Squeeze-and-Excitation (SE) [17] while being more parameter efficient (Figure 1). Remarkably, our AA-ResNet-50 performs comparably to the baseline ResNet-101 and our AA-ResNet-101 outperforms the baseline ResNet-152. These results suggest that…
Architecture GFlops Params top-1 top-5
ResNet-34 [14] 7.4 21.8M 73.6 91.5
SE-ResNet-34 [17] 7.4 22.0M 74.3 91.8
AA-ResNet-34 (ours) 7.1 20.7M 74.7 92.0
ResNet-50 [14] 8.2 25.6M 76.4 93.1
SE-ResNet-50 [17] 8.2 28.1M 77.5 93.7
AA-ResNet-50 (ours) 8.3 25.8M 77.7 93.8
ResNet-101 [14] 15.6 44.5M 77.9 94.0
SE-ResNet-101 [17] 15.6 49.3M 78.4 94.2
AA-ResNet-101 (ours) 16.1 45.4M 78.7 94.4
ResNet-152 [14] 23.0 60.2M 78.4 94.2
SE-ResNet-152 [17] 23.1 66.8M 78.9 94.5
AA-ResNet-152 (ours) 23.8 61.6M 79.1 94.6
Backbone architecture GFlops Params mAPCOCO mAP50 mAP75
ResNet-50 [26] 182 33.4M 36.8 54.5 39.5
SE-ResNet-50 [17] 183 35.9M 36.5 54.0 39.1
AA-ResNet-50 (ours) 182 33.1M 38.2 56.5 40.7
ResNet-101 [26] 243 52.4M 38.5 56.4 41.2
SE-ResNet-101 [17] 243 57.2M 37.4 55.0 39.9
AA-ResNet-101 (ours) 245 51.7M 39.2 57.8 41.9
Table 5. Object detection on the COCO dataset [27] using the RetinaNet architecture [26] with different backbone architectures. We report
mean Average Precision at three different IoU values.
Architecture Position Encodings top-1 top-5
AA-ResNet-34 None 74.4 91.9
AA-ResNet-34 2d Sine 74.4 92.0
AA-ResNet-34 CoordConv 74.4 92.0
AA-ResNet-34 Relative (ours) 74.7 92.0
AA-ResNet-50 None 77.5 93.7
AA-ResNet-50 2d Sine 77.5 93.7
AA-ResNet-50 CoordConv 77.5 93.8
AA-ResNet-50 Relative (ours) 77.7 93.8

Table 7. Effects of different position encodings in Attention Augmentation on ImageNet classification.

Position Encodings mAPCOCO mAP50 mAP75
None 37.7 56.0 40.2
CoordConv [29] 37.4 55.5 40.1
Relative (ours) 38.2 56.5 40.7

Table 8. Effects of different position encodings in Attention Augmentation on the COCO object detection task using a RetinaNet AA-ResNet-50 backbone.

…encodings, which indicates the necessity of maintaining position information for fully self-attentional vision models.

We additionally compare our proposed two-dimensional relative position encodings to other position encoding schemes. We apply Attention Augmentation using the same hyperparameters as in Section 4.2 with the following position encoding schemes: 1) the position-unaware version of self-attention (referred to as None), 2) a two-dimensional implementation of the sinusoidal positional waves (referred to as 2d Sine) as used in [32], 3) CoordConv [29], for which we concatenate (x, y, r) coordinate channels to the inputs of the attention function, and 4) our proposed two-dimensional relative position encodings (referred to as Relative).

In Tables 7 and 8, we present the results on ImageNet classification and the COCO object detection task, respectively. On both tasks, Attention Augmentation without position encodings already yields improvements over the fully convolutional non-augmented variants. Our experiments also reveal that the sinusoidal encodings and the coordinate convolution do not provide improvements over the position-unaware version of Attention Augmentation. We obtain additional improvements when using our two-dimensional relative attention, demonstrating the utility of preserving translation equivariance while preventing permutation equivariance.
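For reference, the CoordConv variant above amounts to appending three coordinate channels before attention is applied. The following minimal NumPy sketch is our own illustration (with coordinates normalized to [−1, 1] and r taken as the distance to the image center, following the convention of [29]), not the exact implementation used in these experiments:

import numpy as np

def add_coord_channels(X):
    """Append (x, y, r) coordinate channels to an activation map X of shape (H, W, C)."""
    H, W, _ = X.shape
    y = np.tile(np.linspace(-1.0, 1.0, H)[:, None], (1, W))    # row coordinate
    x = np.tile(np.linspace(-1.0, 1.0, W)[None, :], (H, 1))    # column coordinate
    r = np.sqrt(x ** 2 + y ** 2)                               # distance to the image center
    return np.concatenate([X, x[..., None], y[..., None], r[..., None]], axis=-1)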
5. Discussion and future work

In this work, we consider the use of self-attention for vision models as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism for images that enables training of competitive fully self-attentional vision models on image classification for the first time. We propose to augment convolutional operators with this self-attention mechanism and validate the superiority of this approach over other attention schemes. Extensive experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a wide range of architectures and computational settings.

Several open questions from this work remain. In future work, we will focus on the fully attentional regime and explore how different attention mechanisms trade off computational efficiency against representational power. For instance, identifying a local attention mechanism may result in an efficient and scalable computational mechanism that removes the need for downsampling with average pooling [34]. Additionally, it is plausible that architectural design choices that are well suited when exclusively relying on convolutions are suboptimal when using self-attention mechanisms. As such, it would be interesting to see whether using Attention Augmentation as a primitive in automated architecture search procedures proves useful for finding even better models than those previously found in image classification [55], object detection [12], image segmentation [6] and other domains [5, 1, 35, 8]. Finally, one can ask to what degree fully attentional models can replace convolutional networks for visual tasks.

Acknowledgements

The authors would like to thank Tsung-Yi Lin, Prajit Ramachandran, Mingxing Tan, Yanping Huang and the Google Brain team for insightful comments and discussions.

References

[1] Maximilian Alber, Irwan Bello, Barret Zoph, Pieter-Jan Kindermans, Prajit Ramachandran, and Quoc V. Le. Backprop evolution. CoRR, abs/1808.02822, 2018.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
[3] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Huai-hsin Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. Seq2slate: Re-ranking and slate optimization with RNNs. CoRR, abs/1810.02019, 2018.
[4] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. 2016.
[5] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 459–468. JMLR.org, 2017.
[6] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pages 8713–8724, 2018.
[7] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A²-nets: Double attention networks. CoRR, abs/1810.11579, 2018.
[8] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[10] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[11] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10750–10760, 2018.
[12] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
[15] Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[16] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9423–9433, 2018.
[17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[18] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, and Douglas Eck. Music transformer. In Advances in Neural Information Processing Systems, 2018.
[19] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Learning Representations, 2015.
[21] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[28] Drew Linsley, Dan Scheibler, Sven Eberhardt, and Thomas Serre. Global-and-local attention networks for visual recognition. CoRR, abs/1805.08819, 2018.
[29] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pages 9628–9639, 2018.
[30] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[31] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In British Machine Vision Conference, 2018.
[32] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, 2018.
[33] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge Belongie. Objects in context. 2007.
[34] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.
[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.
[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[37] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[38] David R. So, Chen Liang, and Quoc V. Le. The evolved transformer. CoRR, abs/1901.11117, 2019.
[39] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In International Conference on Learning Representations Workshop Track, 2016.
[40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[41] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[44] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, pages 2692–2700, 2015.
[45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[46] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[47] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[48] Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. Shakedrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
[49] Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu. Convolutional self-attention network. CoRR, abs/1810.13320, 2018.
[50] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. QAnet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations, 2018.
[51] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
[52] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In ICLR, 2019.
[53] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv:1805.08318, 2018.
[54] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
[55] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.