Various Types of Attention
To Learn More …
• Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
• Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
How to make self-attention efficient?
(Figure: with sequence length N, every query attends to every key, giving an N × N attention matrix.)
Notice
• Self-attention is only a module in a larger network.
• Self-attention dominates computation when N is large.
• Efficient variants are usually developed for image processing, where N is large: a 256 × 256 image gives N = 256 × 256 = 65536.
Skip Some Calculations with Human Knowledge
Can we fill in some values with human knowledge?
Local Attention / Truncated Attention
Each query only calculates attention weights for the keys in its neighbourhood; all other entries of the matrix are set to 0. The result is similar to CNN. (A minimal mask sketch follows.)
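A minimal NumPy sketch (mine, not from the lecture) of a local attention layer: the window size and the mask-before-softmax implementation are illustrative assumptions.

```python
import numpy as np

def local_attention(Q, K, V, window=2):
    """Q, K: (N, d); V: (N, d_v). Each query i only attends to keys j with |i - j| <= window."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # full scores, formed only for clarity
    idx = np.arange(N)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)            # masked entries -> weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

In practice the banded structure would be exploited so the full N × N score matrix is never formed; the dense version here is only for illustration.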
Stride Attention
Each query attends only to keys a fixed stride away; the remaining entries are set to 0. (A stride mask appears in the sketch after the Global Attention description.)
Global Attention
Add special tokens to the original sequence:
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token can read that global information.
There is no attention between non-special tokens.
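The stride and global patterns can be expressed as boolean masks in the same way. The sketch below is an assumption-laden illustration: stride attention is taken to mean attending to positions a multiple of `stride` away, and the special tokens are assumed to sit at the first `num_special` positions.

```python
import numpy as np

def stride_mask(N, stride=3):
    """Attend to positions whose distance is a multiple of the stride (including itself)."""
    idx = np.arange(N)
    return (np.abs(idx[:, None] - idx[None, :]) % stride) == 0

def global_mask(N, num_special=1):
    """Special tokens (assumed to be the first positions) attend to and are attended by every token;
    there is no attention between non-special tokens."""
    mask = np.zeros((N, N), dtype=bool)
    mask[:num_special, :] = True
    mask[:, :num_special] = True
    return mask

# Different heads can simply be given different masks (local, stride, global, ...).
```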
Many Different Choices …
Which one to choose? Different heads can use different patterns.
Many Different Choices …
• Longformer https://arxiv.org/abs/2004.05150
• Big Bird https://arxiv.org/abs/2007.14062
Can we only focus on Critical Parts?
Entries of the attention matrix with small values have little influence on the result, so they can be directly set to 0; only the entries with large values matter.
How can we quickly estimate which positions have small attention weights?
• Reformer https://openreview.net/forum?id=rkgNKkHtvB
• Routing Transformer https://arxiv.org/abs/2003.05997
Both rely on clustering.
Step 1: Cluster the queries and keys based on similarity, using an approximate and fast clustering method (in the illustration, the vectors fall into 4 clusters).
Step 2: Attention weights are only calculated between a query and a key that belong to the same cluster; entries for different clusters are set to 0. (A rough sketch follows.)
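A rough sketch of the two steps, not the actual Reformer (which uses locality-sensitive hashing) or Routing Transformer code: here a few rounds of k-means stand in for the approximate, fast clustering.

```python
import numpy as np

def clustered_attention_mask(Q, K, num_clusters=4, iters=5):
    """Cluster queries and keys together, then allow attention only within the same cluster."""
    X = np.concatenate([Q, K], axis=0)                            # (2N, d)
    centers = X[np.random.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):                                        # approximate & fast clustering
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(num_clusters):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    q_cluster, k_cluster = assign[:len(Q)], assign[len(Q):]
    return q_cluster[:, None] == k_cluster[None, :]               # (N, N): True = compute, False = 0
```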
Learnable Patterns: Sinkhorn Sorting Network
https://arxiv.org/abs/2002.11296
Whether a grid cell of the attention matrix should be skipped or not is decided by another module that takes the input sequence as input and is jointly learned with the rest of the network (a simplified view). (A toy sketch follows.)
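A toy sketch of the "learned module decides which grid cells to keep" idea. Everything here (the block summaries, the parameter `W_block`, the sigmoid threshold) is a made-up simplification; the actual Sinkhorn Sorting Network works differently.

```python
import numpy as np

def learned_block_mask(X, W_block, block=4, threshold=0.5):
    """X: (N, d) input sequence (N divisible by block); W_block: (2*d,) hypothetical learned weights.
    Returns an (N//block, N//block) keep/skip grid over blocks of the attention matrix."""
    B = X.reshape(-1, block, X.shape[1]).mean(axis=1)             # one summary vector per block
    nb = B.shape[0]
    feats = np.concatenate([np.repeat(B, nb, axis=0),             # query-block summary
                            np.tile(B, (nb, 1))], axis=1)         # key-block summary
    logits = feats @ W_block                                      # jointly learned with the network
    return (1 / (1 + np.exp(-logits)) > threshold).reshape(nb, nb)
```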
Do we need the full attention matrix?
Linformer https://arxiv.org/abs/2006.04768
The N × N attention matrix has many redundant columns; it is low rank. Instead of attending to all N keys, select K representative keys (with their K corresponding values) and compute an N × K attention matrix, whose columns weight the K values to produce the output. Reducing the number of queries, by contrast, would change the output sequence length.
Reduce Number of Keys
• Compressed Attention https://arxiv.org/abs/1801.10198 : compress the N key vectors (each of dimension d) into K with convolutions.
• Linformer https://arxiv.org/abs/2006.04768 : form K linear combinations of the N vectors (an N × K projection). (A sketch follows.)
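A minimal sketch of the Linformer-style reduction: `E` stands for the learned N → K projection (the name is mine), applied to both keys and values so the attention matrix becomes N × K.

```python
import numpy as np

def linformer_attention(Q, K, V, E):
    """Q: (N, d), K: (N, d), V: (N, d_v), E: (K_proj, N) learned projection."""
    K_small = E @ K                                   # (K_proj, d): K_proj representative keys
    V_small = E @ V                                   # (K_proj, d_v): matching values
    scores = Q @ K_small.T / np.sqrt(Q.shape[1])      # (N, K_proj) instead of (N, N)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V_small                          # output length is still N
```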
Attention Mechanism is Three-Matrix Multiplication (Review)
Q = W^q I   (d × N)
K = W^k I   (d × N)
V = W^v I   (d' × N)
A = K^T Q   (N × N),   A' = softmax(A)   (ignore the softmax for now)
O = V A'
Ignoring the softmax, O ≈ V K^T Q, with V: d' × N, K^T: N × d, Q: d × N. What is the difference between the two orders of multiplication?
• Compute K^T Q first: forming the N × N attention matrix costs N × d × N multiplications, and V (K^T Q) costs another d' × N × N, for a total of (d + d') N^2.
• Compute V K^T first: it costs d' × N × d multiplications, and (V K^T) Q costs another d' × d × N, for a total of 2 d' d N.
(d + d') N^2 > 2 d' d N when N is large, so multiplying V K^T first gives the same result with far less computation. (A numerical check follows.)
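A quick numerical check of the argument above (the shapes and sizes are arbitrary choices): both orders give the same O, but the multiplication counts differ by more than an order of magnitude.

```python
import numpy as np

N, d, d_v = 1000, 64, 64
I = np.random.randn(d, N)
Wq, Wk, Wv = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d_v, d)
Q, K, V = Wq @ I, Wk @ I, Wv @ I                  # Q, K: (d, N); V: (d_v, N)

O1 = V @ (K.T @ Q)                                # K^T Q first
O2 = (V @ K.T) @ Q                                # V K^T first
print(np.allclose(O1, O2))                        # True: same result
print((d + d_v) * N**2, 2 * d_v * d * N)          # 128000000 vs 8192000 multiplications
```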
Put the softmax back …

b^1 = \sum_{i=1}^{N} \alpha'_{1,i} \, v^i
    = \sum_{i=1}^{N} \frac{\exp(q^1 \cdot k^i)}{\sum_{j=1}^{N} \exp(q^1 \cdot k^j)} \, v^i

Suppose we can find a transformation \phi such that \exp(q \cdot k) \approx \phi(q) \cdot \phi(k). Then

b^1 \approx \sum_{i=1}^{N} \frac{\phi(q^1) \cdot \phi(k^i)}{\sum_{j=1}^{N} \phi(q^1) \cdot \phi(k^j)} \, v^i
    = \frac{\sum_{i=1}^{N} \big( \phi(q^1) \cdot \phi(k^i) \big) \, v^i}{\phi(q^1) \cdot \sum_{j=1}^{N} \phi(k^j)}

Write \phi(q^1) = (q_{11}, q_{21}, \dots)^T and \phi(k^i) = (k_{1i}, k_{2i}, \dots)^T, each with M dimensions. The numerator expands as

\sum_{i=1}^{N} \big( \phi(q^1) \cdot \phi(k^i) \big) \, v^i
  = (q_{11} k_{11} + q_{21} k_{21} + \cdots) v^1 + (q_{11} k_{12} + q_{21} k_{22} + \cdots) v^2 + \cdots
  = q_{11} (k_{11} v^1 + k_{12} v^2 + \cdots) + q_{21} (k_{21} v^1 + k_{22} v^2 + \cdots) + \cdots

That is, the numerator is a combination of the M vectors \sum_{j=1}^{N} k_{1j} v^j, \sum_{j=1}^{N} k_{2j} v^j, \dots, weighted by the M components of \phi(q^1). These M vectors, and the sum \sum_{j=1}^{N} \phi(k^j) in the denominator, depend only on the keys and values, so they are computed once and shared:

b^1 = \frac{\sum_{m=1}^{M} \phi(q^1)_m \sum_{j=1}^{N} \phi(k^j)_m \, v^j}{\big( \sum_{j=1}^{N} \phi(k^j) \big) \cdot \phi(q^1)}

For b^2, …, b^N only \phi(q^1) is replaced by \phi(q^2), …, \phi(q^N); don't compute the M vectors again.
Comparison with the original attention: there, b^1 is a weighted sum of v^1, …, v^4 with weights computed from q^1 and every k^i. Here, b^1 is instead built from the M precomputed vectors (formed from \phi(k^1), …, \phi(k^4) and v^1, …, v^4), weighted by the M components q_{11}, q_{21}, … of \phi(q^1), together with the shared normalizer \big(\sum_{j} \phi(k^j)\big) \cdot \phi(q^1). For b^2, only \phi(q^2) changes; the M vectors are reused.
Realization: different papers propose different \phi with \exp(q \cdot k) \approx \phi(q) \cdot \phi(k)
• Efficient Attention https://arxiv.org/pdf/1812.01243.pdf
• Linear Transformer https://linear-transformers.com/
• Random Feature Attention https://arxiv.org/pdf/2103.02143.pdf
• Performer https://arxiv.org/pdf/2009.14794.pdf
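A minimal sketch of the resulting linear attention. The feature map below is the elu(x) + 1 choice from the Linear Transformer; it keeps the weights positive rather than approximating exp(q · k) exactly, and Performer or Random Feature Attention would use random-feature maps instead.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1, always positive

def linear_attention(Q, K, V):
    """Q, K: (N, d); V: (N, d_v). Cost is linear in N."""
    Qf, Kf = phi(Q), phi(K)                       # (N, M) with M = d for this phi
    S = Kf.T @ V                                  # (M, d_v): the M shared vectors, computed once
    z = Kf.sum(axis=0)                            # (M,): shared normalizer term
    return (Qf @ S) / (Qf @ z)[:, None]           # b_i = phi(q_i)^T S / (phi(q_i) . z)
```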
Do we need q and k to compute attention?
Synthesizer! https://arxiv.org/abs/2005.00743
The output is still b^1 = \sum_{i=1}^{N} \alpha'_{1,i} v^i with a row-wise softmax over the N × N matrix of attention weights, but the weights \alpha_{i,j} are not computed from q and k: they are network parameters learned directly. (A minimal sketch follows.)
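A minimal sketch of the random-Synthesizer flavour of this idea: the attention weights are a learned N × N parameter matrix (called `A_learned` here, my name), and only the values come from the input.

```python
import numpy as np

def synthesizer_attention(X, W_v, A_learned):
    """X: (N, d) input; W_v: (d, d_v) value projection; A_learned: (N, N) learned weights."""
    V = X @ W_v                                              # values still come from the input
    weights = np.exp(A_learned - A_learned.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax
    return weights @ V                                       # (N, d_v)
```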
Attention-free?
• FNet: Mixing Tokens with Fourier Transforms https://arxiv.org/abs/2105.03824
• Pay Attention to MLPs https://arxiv.org/abs/2105.08050
• MLP-Mixer: An all-MLP Architecture for Vision https://arxiv.org/abs/2105.01601
(A tiny FNet-style mixing sketch follows.)
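For intuition, a tiny FNet-style token-mixing sub-layer (heavily simplified; the full model adds the usual residuals, norms, and feed-forward layers): attention is replaced by Fourier transforms over the hidden and sequence dimensions, keeping the real part.

```python
import numpy as np

def fnet_mixing(X):
    """X: (N, d). FFT over the hidden dimension, then over the sequence dimension; keep the real part."""
    return np.fft.fft(np.fft.fft(X, axis=-1), axis=-2).real
```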
Summary
• Human knowledge: Local Attention, Big Bird
• Clustering: Reformer
• Learnable pattern: Sinkhorn
• Representative keys: Linformer
• Change the multiplication order (k, q first → v, k first): Linear Transformer, Performer
• New framework: Synthesizer