Various Types of Attention
To Learn More …
• Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
• Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
How to make self-attention efficient?
(Figure: with sequence length N, every query attends to every key, giving an N × N attention matrix.)
Notice
• Self-attention is only a module in a larger network.
• Self-attention dominates computation when N is large.
• Efficient variants are usually developed for image processing, where N is large: a 256 × 256 image gives N = 256 × 256 = 65536.
Skip Some Calculations with Human Knowledge
Can we fill in some values with human knowledge?
Local Attention / Truncated Attention
Each query only calculates attention weights for the keys in its neighbourhood; all other entries of the matrix are set to 0. The result is similar to CNN. (A minimal mask sketch follows.)
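A minimal NumPy sketch (mine, not from the lecture) of a local attention layer: the window size and the mask-before-softmax implementation are illustrative assumptions.

```python
import numpy as np

def local_attention(Q, K, V, window=2):
    """Q, K: (N, d); V: (N, d_v). Each query i only attends to keys j with |i - j| <= window."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # full scores, formed only for clarity
    idx = np.arange(N)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)            # masked entries -> weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

In practice the banded structure would be exploited so the full N × N score matrix is never formed; the dense version here is only for illustration.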
Stride Attention
Each query attends only to keys a fixed stride away; the remaining entries are set to 0. (A stride mask appears in the sketch after the Global Attention description.)
Global Attention
Add special tokens to the original sequence:
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token can read that global information.
There is no attention between non-special tokens.
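The stride and global patterns can be expressed as boolean masks in the same way. The sketch below is an assumption-laden illustration: stride attention is taken to mean attending to positions a multiple of `stride` away, and the special tokens are assumed to sit at the first `num_special` positions.

```python
import numpy as np

def stride_mask(N, stride=3):
    """Attend to positions whose distance is a multiple of the stride (including itself)."""
    idx = np.arange(N)
    return (np.abs(idx[:, None] - idx[None, :]) % stride) == 0

def global_mask(N, num_special=1):
    """Special tokens (assumed to be the first positions) attend to and are attended by every token;
    there is no attention between non-special tokens."""
    mask = np.zeros((N, N), dtype=bool)
    mask[:num_special, :] = True
    mask[:, :num_special] = True
    return mask

# Different heads can simply be given different masks (local, stride, global, ...).
```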
Many Different Choices …
Which one to choose? Different heads can use different patterns.
Many Different Choices …
• Longformer https://arxiv.org/abs/2004.05150
• Big Bird https://arxiv.org/abs/2007.14062
Can we only focus on Critical Parts?
Entries of the attention matrix with small values have little influence on the result, so they can be directly set to 0; only the entries with large values matter.
How can we quickly estimate which positions have small attention weights?
• Reformer https://openreview.net/forum?id=rkgNKkHtvB
• Routing Transformer https://arxiv.org/abs/2003.05997
Both rely on clustering.
Step 1: Cluster the queries and keys based on similarity, using an approximate and fast clustering method (in the illustration, the vectors fall into 4 clusters).
Step 2: Attention weights are only calculated between a query and a key that belong to the same cluster; entries for different clusters are set to 0. (A rough sketch follows.)
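A rough sketch of the two steps, not the actual Reformer (which uses locality-sensitive hashing) or Routing Transformer code: here a few rounds of k-means stand in for the approximate, fast clustering.

```python
import numpy as np

def clustered_attention_mask(Q, K, num_clusters=4, iters=5):
    """Cluster queries and keys together, then allow attention only within the same cluster."""
    X = np.concatenate([Q, K], axis=0)                            # (2N, d)
    centers = X[np.random.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):                                        # approximate & fast clustering
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(num_clusters):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    q_cluster, k_cluster = assign[:len(Q)], assign[len(Q):]
    return q_cluster[:, None] == k_cluster[None, :]               # (N, N): True = compute, False = 0
```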
Learnable Patterns: Sinkhorn Sorting Network
https://arxiv.org/abs/2002.11296
Whether a grid cell of the attention matrix should be skipped or not is decided by another module that takes the input sequence as input and is jointly learned with the rest of the network (a simplified view). (A toy sketch follows.)
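A toy sketch of the "learned module decides which grid cells to keep" idea. Everything here (the block summaries, the parameter `W_block`, the sigmoid threshold) is a made-up simplification; the actual Sinkhorn Sorting Network works differently.

```python
import numpy as np

def learned_block_mask(X, W_block, block=4, threshold=0.5):
    """X: (N, d) input sequence (N divisible by block); W_block: (2*d,) hypothetical learned weights.
    Returns an (N//block, N//block) keep/skip grid over blocks of the attention matrix."""
    B = X.reshape(-1, block, X.shape[1]).mean(axis=1)             # one summary vector per block
    nb = B.shape[0]
    feats = np.concatenate([np.repeat(B, nb, axis=0),             # query-block summary
                            np.tile(B, (nb, 1))], axis=1)         # key-block summary
    logits = feats @ W_block                                      # jointly learned with the network
    return (1 / (1 + np.exp(-logits)) > threshold).reshape(nb, nb)
```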
Do we need the full attention matrix?
Linformer https://arxiv.org/abs/2006.04768
The N × N attention matrix has many redundant columns; it is low rank. Instead of attending to all N keys, select K representative keys (with their K corresponding values) and compute an N × K attention matrix, whose columns weight the K values to produce the output. Reducing the number of queries, by contrast, would change the output sequence length.
Reduce Number of Keys
• Compressed Attention https://arxiv.org/abs/1801.10198 : compress the N key vectors (each of dimension d) into K with convolutions.
• Linformer https://arxiv.org/abs/2006.04768 : form K linear combinations of the N vectors (an N × K projection). (A sketch follows.)
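A minimal sketch of the Linformer-style reduction: `E` stands for the learned N → K projection (the name is mine), applied to both keys and values so the attention matrix becomes N × K.

```python
import numpy as np

def linformer_attention(Q, K, V, E):
    """Q: (N, d), K: (N, d), V: (N, d_v), E: (K_proj, N) learned projection."""
    K_small = E @ K                                   # (K_proj, d): K_proj representative keys
    V_small = E @ V                                   # (K_proj, d_v): matching values
    scores = Q @ K_small.T / np.sqrt(Q.shape[1])      # (N, K_proj) instead of (N, N)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V_small                          # output length is still N
```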
Attention Mechanism is Three-Matrix Multiplication (Review)
Q = W^q I   (d × N)
K = W^k I   (d × N)
V = W^v I   (d' × N)
A = K^T Q   (N × N),   A' = softmax(A)   (ignore the softmax for now)
O = V A'
Ignoring the softmax, O ≈ V K^T Q, with V: d' × N, K^T: N × d, Q: d × N. What is the difference between the two orders of multiplication?
• Compute K^T Q first: forming the N × N attention matrix costs N × d × N multiplications, and V (K^T Q) costs another d' × N × N, for a total of (d + d') N^2.
• Compute V K^T first: it costs d' × N × d multiplications, and (V K^T) Q costs another d' × d × N, for a total of 2 d' d N.
(d + d') N^2 > 2 d' d N when N is large, so multiplying V K^T first gives the same result with far less computation. (A numerical check follows.)
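A quick numerical check of the argument above (the shapes and sizes are arbitrary choices): both orders give the same O, but the multiplication counts differ by more than an order of magnitude.

```python
import numpy as np

N, d, d_v = 1000, 64, 64
I = np.random.randn(d, N)
Wq, Wk, Wv = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d_v, d)
Q, K, V = Wq @ I, Wk @ I, Wv @ I                  # Q, K: (d, N); V: (d_v, N)

O1 = V @ (K.T @ Q)                                # K^T Q first
O2 = (V @ K.T) @ Q                                # V K^T first
print(np.allclose(O1, O2))                        # True: same result
print((d + d_v) * N**2, 2 * d_v * d * N)          # 128000000 vs 8192000 multiplications
```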
Put the softmax back …

b^1 = \sum_{i=1}^{N} \alpha'_{1,i} \, v^i
    = \sum_{i=1}^{N} \frac{\exp(q^1 \cdot k^i)}{\sum_{j=1}^{N} \exp(q^1 \cdot k^j)} \, v^i

Suppose we can find a transformation \phi such that \exp(q \cdot k) \approx \phi(q) \cdot \phi(k). Then

b^1 \approx \sum_{i=1}^{N} \frac{\phi(q^1) \cdot \phi(k^i)}{\sum_{j=1}^{N} \phi(q^1) \cdot \phi(k^j)} \, v^i
    = \frac{\sum_{i=1}^{N} \big( \phi(q^1) \cdot \phi(k^i) \big) \, v^i}{\phi(q^1) \cdot \sum_{j=1}^{N} \phi(k^j)}

Write \phi(q^1) = (q_{11}, q_{21}, \dots)^T and \phi(k^i) = (k_{1i}, k_{2i}, \dots)^T, each with M dimensions. The numerator expands as

\sum_{i=1}^{N} \big( \phi(q^1) \cdot \phi(k^i) \big) \, v^i
  = (q_{11} k_{11} + q_{21} k_{21} + \cdots) v^1 + (q_{11} k_{12} + q_{21} k_{22} + \cdots) v^2 + \cdots
  = q_{11} (k_{11} v^1 + k_{12} v^2 + \cdots) + q_{21} (k_{21} v^1 + k_{22} v^2 + \cdots) + \cdots

That is, the numerator is a combination of the M vectors \sum_{j=1}^{N} k_{1j} v^j, \sum_{j=1}^{N} k_{2j} v^j, \dots, weighted by the M components of \phi(q^1). These M vectors, and the sum \sum_{j=1}^{N} \phi(k^j) in the denominator, depend only on the keys and values, so they are computed once and shared:

b^1 = \frac{\sum_{m=1}^{M} \phi(q^1)_m \sum_{j=1}^{N} \phi(k^j)_m \, v^j}{\big( \sum_{j=1}^{N} \phi(k^j) \big) \cdot \phi(q^1)}

For b^2, …, b^N only \phi(q^1) is replaced by \phi(q^2), …, \phi(q^N); don't compute the M vectors again.
Comparison with the original attention: there, b^1 is a weighted sum of v^1, …, v^4 with weights computed from q^1 and every k^i. Here, b^1 is instead built from the M precomputed vectors (formed from \phi(k^1), …, \phi(k^4) and v^1, …, v^4), weighted by the M components q_{11}, q_{21}, … of \phi(q^1), together with the shared normalizer \big(\sum_{j} \phi(k^j)\big) \cdot \phi(q^1). For b^2, only \phi(q^2) changes; the M vectors are reused.
Realization: different papers propose different \phi with \exp(q \cdot k) \approx \phi(q) \cdot \phi(k)
• Efficient Attention https://arxiv.org/pdf/1812.01243.pdf
• Linear Transformer https://linear-transformers.com/
• Random Feature Attention https://arxiv.org/pdf/2103.02143.pdf
• Performer https://arxiv.org/pdf/2009.14794.pdf
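A minimal sketch of the resulting linear attention. The feature map below is the elu(x) + 1 choice from the Linear Transformer; it keeps the weights positive rather than approximating exp(q · k) exactly, and Performer or Random Feature Attention would use random-feature maps instead.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1, always positive

def linear_attention(Q, K, V):
    """Q, K: (N, d); V: (N, d_v). Cost is linear in N."""
    Qf, Kf = phi(Q), phi(K)                       # (N, M) with M = d for this phi
    S = Kf.T @ V                                  # (M, d_v): the M shared vectors, computed once
    z = Kf.sum(axis=0)                            # (M,): shared normalizer term
    return (Qf @ S) / (Qf @ z)[:, None]           # b_i = phi(q_i)^T S / (phi(q_i) . z)
```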
Do we need q and k to compute attention?
Synthesizer! https://arxiv.org/abs/2005.00743
The output is still b^1 = \sum_{i=1}^{N} \alpha'_{1,i} v^i with a row-wise softmax over the N × N matrix of attention weights, but the weights \alpha_{i,j} are not computed from q and k: they are network parameters learned directly. (A minimal sketch follows.)
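A minimal sketch of the random-Synthesizer flavour of this idea: the attention weights are a learned N × N parameter matrix (called `A_learned` here, my name), and only the values come from the input.

```python
import numpy as np

def synthesizer_attention(X, W_v, A_learned):
    """X: (N, d) input; W_v: (d, d_v) value projection; A_learned: (N, N) learned weights."""
    V = X @ W_v                                              # values still come from the input
    weights = np.exp(A_learned - A_learned.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax
    return weights @ V                                       # (N, d_v)
```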
Attention-free?
• FNet: Mixing Tokens with Fourier Transforms https://arxiv.org/abs/2105.03824
• Pay Attention to MLPs https://arxiv.org/abs/2105.08050
• MLP-Mixer: An all-MLP Architecture for Vision https://arxiv.org/abs/2105.01601
(A tiny FNet-style mixing sketch follows.)
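For intuition, a tiny FNet-style token-mixing sub-layer (heavily simplified; the full model adds the usual residuals, norms, and feed-forward layers): attention is replaced by Fourier transforms over the hidden and sequence dimensions, keeping the real part.

```python
import numpy as np

def fnet_mixing(X):
    """X: (N, d). FFT over the hidden dimension, then over the sequence dimension; keep the real part."""
    return np.fft.fft(np.fft.fft(X, axis=-1), axis=-2).real
```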
Summary
• Human knowledge: Local Attention, Big Bird
• Clustering: Reformer
• Learnable pattern: Sinkhorn
• Representative keys: Linformer
• Change the multiplication order (k, q first → v, k first): Linear Transformer, Performer
• New framework: Synthesizer