Efficient Attention Mechanisms

Various Kinds of Attention

To Learn More …

• Long Range Arena: A Benchmark for Efficient Transformers
  https://arxiv.org/abs/2011.04006
• Efficient Transformers: A Survey
  https://arxiv.org/abs/2009.06732
How to make self-attention efficient?

In self-attention, every query attends to every key, so for a sequence of length N the attention matrix is N × N (a minimal sketch of the full computation follows this slide).

Notice
• Self-attention is only a module in a larger network.
• Self-attention dominates the computation when N is large.
• Efficient variants are usually developed for image processing, where N is large: treating a 256 × 256 image as a sequence of pixels gives N = 256 × 256.
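To make the quadratic cost concrete, here is a minimal NumPy sketch (not from the slides) of standard self-attention in the lecture's shape convention: Q and K are d × N, V is d′ × N; the sizes and names are my own. Later sketches in this document reuse this softmax helper and the same convention.

import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    A = K.T @ Q                   # N x N attention matrix: the O(N^2) bottleneck
    A_prime = softmax(A, axis=0)  # normalize each column over the key dimension
    return V @ A_prime            # d' x N output

N, d, d_prime = 1024, 64, 64
Q = np.random.randn(d, N); K = np.random.randn(d, N); V = np.random.randn(d_prime, N)
O = full_attention(Q, K, V)       # O.shape == (d_prime, N)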
Skip Some Calculations with Human Knowledge

Can we fill in some values of the attention matrix with human knowledge, instead of computing them?
Local Attention / Truncated Attention

Each query only calculates attention weights with its neighbouring keys; all other entries of the attention matrix are set to 0. The result is similar to a CNN. A masking sketch follows.
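A minimal sketch of the local pattern, continuing the NumPy example above (same shapes and softmax helper). The window size is an assumption, and masking the full N × N matrix only illustrates the pattern rather than actually skipping the computation.

def local_attention(Q, K, V, radius=2):
    N = Q.shape[1]
    idx = np.arange(N)
    keep = np.abs(idx[:, None] - idx[None, :]) <= radius  # True inside the local window
    A = np.where(keep, K.T @ Q, -np.inf)                  # masked entries become 0 after softmax
    return V @ softmax(A, axis=0)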
Stride Attention

Each query attends to keys a fixed stride away (e.g., every second or third position); everything else is set to 0, as in the sketch below.
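A sketch of the stride pattern under the same assumptions as the local-attention sketch; the stride value is arbitrary.

def stride_attention(Q, K, V, stride=3):
    N = Q.shape[1]
    idx = np.arange(N)
    keep = (np.abs(idx[:, None] - idx[None, :]) % stride) == 0  # keep keys a multiple of `stride` away
    A = np.where(keep, K.T @ Q, -np.inf)
    return V @ softmax(A, axis=0)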
Global Attention

Add special tokens into the original sequence:
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token obtains global information from it.
• There is no attention between non-special tokens.
A masking sketch follows.
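A sketch of the global pattern, assuming the first g positions of the sequence are the special tokens; again the full matrix is masked purely for illustration.

def global_attention(Q, K, V, g=1):
    N = Q.shape[1]
    keep = np.zeros((N, N), dtype=bool)
    keep[:g, :] = True                    # special-token keys are attended by every query
    keep[:, :g] = True                    # special-token queries attend to every key
    A = np.where(keep, K.T @ Q, -np.inf)  # no attention between non-special tokens
    return V @ softmax(A, axis=0)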
Many Different Choices …

Which pattern should we choose? Different heads can use different patterns.
• Longformer https://arxiv.org/abs/2004.05150
• Big Bird https://arxiv.org/abs/2007.14062
Can we only focus on Critical Parts?

Entries of the attention matrix with small values have only a small influence on the results, so we can directly set them to 0 and compute only the entries with large values. The question: how do we quickly estimate which portion of the matrix has small attention weights?
Clustering
• Reformer https://openreview.net/forum?id=rkgNKkHtvB
• Routing Transformer https://arxiv.org/abs/2003.05997

Step 1: cluster the queries and keys based on similarity, using an approximate and fast clustering method (in the slide's illustration, each vector is assigned to one of four clusters).

Step 2: a query and a key that belong to the same cluster calculate an attention weight; pairs from different clusters are set to 0. A sketch of both steps follows.
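A rough sketch of the two steps, using ordinary k-means from scikit-learn as a stand-in for the fast approximate clustering the papers actually use (LSH in Reformer, online k-means in Routing Transformer); the cluster count and the diagonal guard are my own choices. It continues the NumPy sketches above.

from sklearn.cluster import KMeans

def clustered_attention(Q, K, V, n_clusters=4):
    N = Q.shape[1]
    # Step 1: cluster queries and keys jointly, based on similarity
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.concatenate([Q.T, K.T]))
    q_lab, k_lab = labels[:N], labels[N:]
    # Step 2: attention only within a cluster; other pairs are set to 0
    keep = (k_lab[:, None] == q_lab[None, :]) | np.eye(N, dtype=bool)  # keep the diagonal as a guard
    A = np.where(keep, K.T @ Q, -np.inf)
    return V @ softmax(A, axis=0)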
Learnable Patterns
Sinkhorn Sorting Network https://arxiv.org/abs/2002.11296

Whether a grid of the attention matrix should be skipped or not is decided by another module: a neural network, jointly learned with the rest of the model, takes the input sequence and produces the pattern (this is a simplified account; a toy sketch follows).
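A deliberately simplified sketch of the idea that the sparsity pattern itself comes from a learned module. W_pattern is a hypothetical learned weight matrix, and the hard threshold ignores how the real Sinkhorn Sorting Network keeps this step differentiable (it learns a block sorting instead); only the overall shape of the idea is illustrated.

def learned_pattern_mask(I, W_pattern, block_size=16):
    # I: d x N input sequence; W_pattern: (N // block_size) x d hypothetical learned weights
    scores = W_pattern @ I                              # one score per key block per query
    keep_blocks = scores > 0                            # hard keep/skip decision per grid block
    # expand to an N x N boolean mask (rows = keys, columns = queries, as in earlier sketches)
    return np.repeat(keep_blocks, block_size, axis=0)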
Do we need the full attention matrix?
Linformer https://arxiv.org/abs/2006.04768

The N × N attention matrix often has many redundant columns, i.e., it is low rank. So instead of attending to all N keys, pick K representative keys (and the corresponding K representative values), compute a smaller N × K attention matrix, and output the weighted sum of the K representative values. Can we likewise reduce the number of queries? Not in general, because that would change the output sequence length. A projection sketch follows.
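A sketch of attention with K representative keys and values obtained by a learned N → K linear projection, in the spirit of Linformer. E_k and E_v are assumed projection matrices of shape K × N; the softmax helper and the column-wise shape convention come from the first sketch.

def low_rank_attention(Q, K, V, E_k, E_v):
    K_rep = K @ E_k.T                    # d  x k representative keys (k = E_k.shape[0])
    V_rep = V @ E_v.T                    # d' x k representative values
    A = K_rep.T @ Q                      # k x N attention matrix instead of N x N
    return V_rep @ softmax(A, axis=0)    # output is still d' x N: the queries are not reduced

k = 128
E_k = np.random.randn(k, N); E_v = np.random.randn(k, N)   # stand-ins for learned projections
O_lr = low_rank_attention(Q, K, V, E_k, E_v)               # O_lr.shape == (d_prime, N)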
Reduce Number of Keys
• Compressed Attention https://arxiv.org/abs/1801.10198 — compress the N key vectors (each d-dimensional) into K vectors with a strided convolution along the sequence (sketched after this list).
• Linformer https://arxiv.org/abs/2006.04768 — form K linear combinations of the N vectors, i.e., multiply by an N × K matrix.
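And a sketch of the convolutional alternative: compress the N key (or value) vectors into roughly N / stride vectors by a strided operation along the sequence. A strided average stands in here for the learned convolution used in Compressed Attention.

def compress_along_sequence(X, stride=4):
    d, N = X.shape
    N_trim = (N // stride) * stride
    # average each group of `stride` consecutive vectors -> d x (N_trim // stride)
    return X[:, :N_trim].reshape(d, -1, stride).mean(axis=2)

K_small = compress_along_sequence(K)   # fewer keys
V_small = compress_along_sequence(V)   # compress the values the same way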
Attention Mechanism is Three-Matrix Multiplication

Review, with input I:
  Q = W^q I   (d × N)
  K = W^k I   (d × N)
  V = W^v I   (d' × N)
  A = K^T Q,   A' = softmax(A),   O = V A'
Ignore the softmax for the moment, so that O is just a product of three matrices.
O ≈ V K^T Q   (softmax ignored)
  V: d' × N,   K^T: N × d,   Q: d × N

What is the difference between the two orders of multiplying these three matrices?

Order 1: compute the attention matrix A = K^T Q first.
  K^T Q takes N × d × N multiplications; V A takes d' × N × N more.
  Total: (d + d') N^2.

Order 2: compute V K^T first.
  V K^T takes d' × N × d multiplications and gives a d' × d matrix;
  (V K^T) Q then takes d' × d × N more.
  Total: 2 d' d N.

(d + d') N^2 > 2 d' d N when N is large, so the second order is far cheaper, even though both orders give exactly the same O.
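A quick NumPy check (my own, with arbitrary sizes) that the two orders give the same O while the second never forms the N × N matrix:

import numpy as np
import time

N, d, d_prime = 4096, 64, 64
Q = np.random.randn(d, N); K = np.random.randn(d, N); V = np.random.randn(d_prime, N)

t0 = time.time(); O1 = V @ (K.T @ Q); t1 = time.time() - t0   # ~ (d + d') N^2 multiplications
t0 = time.time(); O2 = (V @ K.T) @ Q; t2 = time.time() - t0   # ~ 2 d' d N multiplications
print(np.allclose(O1, O2), f"{t1:.3f}s vs {t2:.3f}s")         # same result, much cheaper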
Put the softmax back …

b^1 = \sum_{i=1}^{N} \alpha'_{1,i} \, v^i = \sum_{i=1}^{N} \frac{\exp(q^1 \cdot k^i)}{\sum_{j=1}^{N} \exp(q^1 \cdot k^j)} \, v^i

Approximate the exponential of a dot product by a dot product of transformed vectors, \exp(q \cdot k) \approx \phi(q) \cdot \phi(k). Then

b^1 \approx \sum_{i=1}^{N} \frac{\phi(q^1) \cdot \phi(k^i)}{\sum_{j=1}^{N} \phi(q^1) \cdot \phi(k^j)} \, v^i
    = \frac{\sum_{i=1}^{N} \big(\phi(q^1) \cdot \phi(k^i)\big) \, v^i}{\phi(q^1) \cdot \sum_{j=1}^{N} \phi(k^j)}

because \phi(q^1) can be pulled out of the sum in the denominator.

Writing \phi(q^1) = (q_{11}, q_{21}, \dots)^T and \phi(k^i) = (k_{1i}, k_{2i}, \dots)^T, the numerator expands as

\sum_{i=1}^{N} \big(\phi(q^1) \cdot \phi(k^i)\big) \, v^i
  = (q_{11} k_{11} + q_{21} k_{21} + \dots) \, v^1 + (q_{11} k_{12} + q_{21} k_{22} + \dots) \, v^2 + \dots
  = q_{11} (k_{11} v^1 + k_{12} v^2 + \dots) + q_{21} (k_{21} v^1 + k_{22} v^2 + \dots) + \dots
If \phi(\cdot) maps to M dimensions, the numerator is therefore a combination of M query-independent vectors:

\sum_{i=1}^{N} \big(\phi(q^1) \cdot \phi(k^i)\big) \, v^i = \sum_{m=1}^{M} \phi(q^1)_m \left( \sum_{j=1}^{N} \phi(k^j)_m \, v^j \right)

so that

b^1 = \frac{\sum_{m=1}^{M} \phi(q^1)_m \sum_{j=1}^{N} \phi(k^j)_m \, v^j}{\phi(q^1) \cdot \sum_{j=1}^{N} \phi(k^j)}

The M vectors \sum_{j} \phi(k^j)_m \, v^j and the vector \sum_{j} \phi(k^j) depend only on the keys and values. Compute them once; don't compute them again. Every output b^1, b^2, …, b^N is a weighted sum of the same M precomputed vectors (each of which is itself a weighted sum of v^1, v^2, v^3, v^4 in the slides' four-token illustration), and only the query-dependent weights \phi(q^1), \phi(q^2), … change from one output to the next. The cost is therefore linear in N instead of quadratic. A NumPy sketch of this computation follows.
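Below is a minimal NumPy sketch of this linear-attention computation, assuming the feature map φ(x) = elu(x) + 1 used by the Linear Transformer; other papers (Performer, Random Feature Attention) choose φ differently. Shapes follow the lecture's d × N / d′ × N convention.

import numpy as np

def elu_plus_one(x):
    # example feature map phi(x) = elu(x) + 1, positive everywhere
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    phi_Q = elu_plus_one(Q)      # M x N  (here M = d)
    phi_K = elu_plus_one(K)      # M x N
    S = V @ phi_K.T              # d' x M : columns are the M vectors  sum_j phi(k^j)_m v^j
    z = phi_K.sum(axis=1)        # M      : sum_j phi(k^j)
    num = S @ phi_Q              # d' x N : numerators of every b^i at once
    den = z @ phi_Q              # N      : denominators phi(q^i) . sum_j phi(k^j)
    return num / den             # column i is b^i; no N x N matrix is ever formed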
Realization
The papers below propose different choices of φ for the approximation exp(q · k) ≈ φ(q) · φ(k):
• Efficient Attention https://arxiv.org/pdf/1812.01243.pdf
• Linear Transformer https://linear-transformers.com/
• Random Feature Attention https://arxiv.org/pdf/2103.02143.pdf
• Performer https://arxiv.org/pdf/2009.14794.pdf
Do we need q and k to compute attention?

Synthesizer! https://arxiv.org/abs/2005.00743

The outputs are still weighted sums of the values, b^1 = \sum_{i=1}^{N} \alpha_{1,i} v^i, with the weights passed through a softmax, but the attention weights α_{i,j} are not computed from q and k at all: they are network parameters, learned directly. A sketch follows.
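A sketch of the simplest reading of this slide: the N × N attention weights are a free parameter matrix, not a function of q and k (the Synthesizer paper also has a variant that generates the weights from the input with an MLP). W_attn is an assumed parameter matrix; softmax comes from the first sketch.

def synthesizer_attention(V, W_attn):
    # W_attn: N x N learned parameters standing in for the q.k attention logits
    A_prime = softmax(W_attn, axis=0)   # still normalized over the key dimension
    return V @ A_prime                  # d' x N output computed without Q or K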
Attention-free?
• FNet: Mixing Tokens with Fourier Transforms https://arxiv.org/abs/2105.03824 (sketched below)
• Pay Attention to MLPs https://arxiv.org/abs/2105.08050
• MLP-Mixer: An All-MLP Architecture for Vision https://arxiv.org/abs/2105.01601
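For FNet specifically, the token-mixing step can be sketched in a couple of lines: apply a 2-D discrete Fourier transform over the sequence and hidden dimensions and keep the real part, with no attention weights at all. The surrounding feed-forward and normalization layers are omitted here.

def fnet_token_mixing(X):
    # X: d x N block input; DFT over both axes, keep the real part (FNet's mixing step)
    return np.real(np.fft.fft2(X))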
Summary
• Human knowledge
  • Local Attention, Big Bird
• Clustering
  • Reformer
• Learnable pattern
  • Sinkhorn
• Representative keys
  • Linformer
• Compute k, q first → compute v, k first
  • Linear Transformer, Performer
• New framework
  • Synthesizer
