Ada2024 Gpu 3
Caroline Collange
she/her
[email protected]
https://team.inria.fr/pacap/members/collange/
Master 2 SIF
ADA - 2024
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT
Implementations of SPMD vectorization: SIMT vs. explicit SIMD
SIMT
All parallelism expressed using threads
Warp size implementation-defined
Dynamic vectorization
Multi-core + explicit SIMD
Combination of threads, vectors
Vector length fixed at compile-time
Static vectorization
[Figure: SIMT groups threads into a warp; multi-core + explicit SIMD runs threads that each operate on vectors]
Bridging the gap between SPMD and SIMD
Hardware: SIMD CPU, GPU, Xeon Phi...
[Figure: hardware lanes, each with a register file (RF) and an ALU]
SPMD to SIMD: hardware or software?
Hardware: SIMD CPU, GPU, Xeon Phi...
[Figure: hardware lanes, each with a register file (RF) and an ALU]
The third school of vectorization
Straight-line program (SLP) vectorization
Loop vectorization
SPMD vectorization
[Figure: in each case, scalar instructions (add, mul, sub) are combined into vector instructions (vadd, vmul, vsub); in SPMD vectorization, the instructions of the threads of a warp are combined]
Managing SIMD control flow in software
Use cases
Compiling shaders and OpenCL for AMD GCN GPUs
Compiling OpenCL for Xeon Phi
ispc: Intel SPMD Program Compiler, targets various SIMD instruction sets
Target instruction sets
Supports at least
Scalar instructions and scalar registers
Vector instructions and vector registers
Vector blend / select instructions: blend(m, X, Y)
Masked vector stores and loads: store(m, a, X)
May also support
Masked versions of all vector instructions
Vector gather and scatter: gather(A)
[Figure: lanes T1, T2, ..., Tn; blend selects between registers X and Y; a masked store writes register X to memory; a gather reads memory into registers]
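The three operations named on this slide can be modelled on plain Python lists, one element per lane. This is only a sketch of plausible semantics, not the definition used by any particular instruction set. It assumes that blend(m, X, Y) picks X where the mask is set, that store(m, a, X) writes to consecutive addresses starting at base address a, and that A in gather(A) is a vector of addresses. The `memory` list is a hypothetical stand-in for memory.

```python
def blend(m, x, y):
    """Per-lane select: x[i] where m[i] is set, y[i] otherwise (assumed operand order)."""
    return [xi if mi else yi for mi, xi, yi in zip(m, x, y)]


def masked_store(memory, m, a, x):
    """Write x[i] to memory[a + i], only for lanes whose mask bit is set."""
    for lane, (mi, xi) in enumerate(zip(m, x)):
        if mi:
            memory[a + lane] = xi


def gather(memory, addresses):
    """Read one element per lane from arbitrary (non-consecutive) addresses."""
    return [memory[addr] for addr in addresses]


m = [True, False, True, False]
X = [10, 11, 12, 13]
Y = [20, 21, 22, 23]
memory = [0] * 8

blended = blend(m, X, Y)
masked_store(memory, m, 2, X)
gathered = gather(memory, [4, 2, 2, 0])
```

Lanes whose mask bit is clear leave memory untouched, which is what makes the masked store usable for if-converted code.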
Compiling SPMD to predicated SIMD
x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}
Compiling SPMD to predicated SIMD: scalar control-flow graph
[Figure: scalar control-flow graph of the code above]
x=0
Branch on tid > 17: x=1 if tid > 17, skipped if tid ≤ 17
Branch on tid < 2: skipped if tid ≥ 2
Branch on tid == 0: x=2 if tid = 0, x=3 if tid ≠ 0
Compiling SPMD to predicated SIMD: if-conversion
if-conversion: flatten if statements, convert into predicated (masked) instructions
Need masked instructions
x=0
x=1 where tid > 17
Modern usages
Dynamic: register renaming in superscalar microarchitectures
Static: Static Single Assignment representation in compilers
L. F. Menabrea and A. A. Lovelace. Sketch of the Analytical Engine invented by Charles Babbage. 1842
Static Single Assignment (SSA)
Add φ functions to select proper name after reconvergence
[Figure: control-flow graph in SSA form]
x0 = 0
if(tid > 17) {
    x1 = 1
}
x2 = φ(x0, x1)
if(tid < 2) {
    if(tid == 0) {
        x3 = 2
    }
    else {
        x4 = 3
    }
    x5 = φ(x4, x3)
}
x6 = φ(x2, x5)
Gated Single Assignment (GSA)
Gated single-assignment: include predicates in φ functions
[Figure: control-flow graph in GSA form]
x0 = 0
if(tid > 17) {
    x1 = 1
}
x2 = γ(tid > 17 ? x1 : x0)
if(tid < 2) {
    if(tid == 0) {
        x3 = 2
    }
    else {
        x4 = 3
    }
    x5 = γ(tid == 0 ? x3 : x4)
}
x6 = γ(tid < 2 ? x5 : x2)
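Because each γ function carries its own predicate, the gated form can be evaluated without any branch. The sketch below checks this for one thread at a time. The helper `gamma` and the function names are illustrative, not part of the slides.

```python
def original(tid):
    """The source program, executed by one thread."""
    x = 0
    if tid > 17:
        x = 1
    if tid < 2:
        if tid == 0:
            x = 2
        else:
            x = 3
    return x


def gamma(cond, if_true, if_false):
    """Gated phi: select a value according to the predicate."""
    return if_true if cond else if_false


def gated(tid):
    """Same program in gated single-assignment form, with no control flow."""
    x0 = 0
    x1 = 1
    x2 = gamma(tid > 17, x1, x0)
    x3 = 2
    x4 = 3
    x5 = gamma(tid == 0, x3, x4)
    x6 = gamma(tid < 2, x5, x2)
    return x6


results = [gated(tid) for tid in range(32)]
```

All assignments are executed unconditionally, which is safe here because they have no side effects. Only the γ selections depend on the predicates.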
Omitting branch conditions
[Figure: the same control-flow graph with branch conditions removed from the edges; the predicates remain in the γ functions]
x0 = 0
x1 = 1
x2 = γ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = γ(tid == 0 ? x3 : x4)
x6 = γ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: linearized CFG
[Figure: basic blocks laid out as a single straight-line sequence]
x0 = 0
x1 = 1
x2 = γ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = γ(tid == 0 ? x3 : x4)
x6 = γ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: vector code
tid = (0, 1, 2, 3)
x0 = (0, 0, 0, 0)
x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
x3 = (2, 2, 2, 2)
x4 = (3, 3, 3, 3)
x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
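The vector values on this slide can be reproduced with a small list-based model of a 4-lane warp. This is a sketch under the assumption that blend(c ? a : b) takes a where the condition holds. The names `blend` and `if_converted` are illustrative.

```python
def blend(cond, if_true, if_false):
    """Per-lane select driven by a vector of booleans."""
    return [t if c else f for c, t, f in zip(cond, if_true, if_false)]


def if_converted(tid):
    """Straight-line vector code: every assignment runs, blends pick the result."""
    n = len(tid)
    x0 = [0] * n
    x1 = [1] * n
    x2 = blend([t > 17 for t in tid], x1, x0)
    x3 = [2] * n
    x4 = [3] * n
    x5 = blend([t == 0 for t in tid], x3, x4)
    x6 = blend([t < 2 for t in tid], x5, x2)
    return x6


first_warp = if_converted([0, 1, 2, 3])
other_warp = if_converted([16, 17, 18, 19])
```

With tid = (0, 1, 2, 3) the result matches x6 on the slide. The second call shows a warp in which the first condition is divergent instead.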
Compiling SPMD to predicated SIMD: vector code + control-flow
tid = (0, 1, 2, 3)
x0 = (0, 0, 0, 0)
if any(tid > 17):            skipped when all(tid ≤ 17)
    x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
if any(tid < 2):             skipped when all(tid ≥ 2)
    if any(tid == 0):        skipped when all(tid ≠ 0)
        x3 = (2, 2, 2, 2)
    if any(tid ≠ 0):         skipped when all(tid = 0)
        x4 = (3, 3, 3, 3)
    x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
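Adding branches back means guarding each block with a warp-wide any() test, so that a block is skipped when no lane needs it. The sketch below is one possible rendering, not the deck's implementation. It records which blocks ran in a hypothetical `executed` list, and uses None for registers left undefined by a skipped block, which no blend ever selects.

```python
def blend(cond, if_true, if_false):
    """Per-lane select driven by a vector of booleans."""
    return [t if c else f for c, t, f in zip(cond, if_true, if_false)]


def with_branches(tid):
    """If-converted code where each block is guarded by a uniform any() branch."""
    n = len(tid)
    executed = []
    undefined = [None] * n  # value of a register whose defining block was skipped

    x0 = [0] * n
    c1 = [t > 17 for t in tid]
    x1 = undefined
    if any(c1):
        x1 = [1] * n
        executed.append("x1")
    x2 = blend(c1, x1, x0)

    c2 = [t < 2 for t in tid]
    x5 = undefined
    if any(c2):
        c3 = [t == 0 for t in tid]
        x3 = x4 = undefined
        if any(c3):
            x3 = [2] * n
            executed.append("x3")
        if any(not c for c in c3):
            x4 = [3] * n
            executed.append("x4")
        x5 = blend(c3, x3, x4)
    x6 = blend(c2, x5, x2)
    return x6, executed


divergent = with_branches([0, 1, 2, 3])
skipped = with_branches([4, 5, 6, 7])
uniform = with_branches([18, 19, 20, 21])
```

A warp whose threads all fail a condition pays nothing for the corresponding block, while a divergent warp still executes both sides under masks.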
Compiling SPMD to predicated SIMD
Scalars in SPMD code
Some values and operations are inherently scalar
Loop counters, addresses of consecutive accesses…
Same value for all threads of a warp
Uniform vector
Or sequence of evenly-spaced values
Affine vector
SPMD code
mov i ← tid
loop:
load t ← X[i]
mul t ← a×t
store X[i] ← t
add i ← i+tnum
branch i<n? loop
[Figure: per-thread values for threads 0, 1, 2, 3…: t is a generic vector; a = 17 17 17 … 17 and n = 51 51 51 … 51 are uniform; i = 0 1 2 3 4 … 15 is affine; the load, mul, store and add instructions and the branch are executed by all threads]
Uniform and affine vectors
Uniform vector
In a warp, v[i] = c
Value does not depend on lane ID
Examples (one value per thread, two warps): 5 5 5 5 5 5 5 5 (c=5), 3 3 3 3 3 3 3 3 (c=3)
Affine vector
In a warp, v[i] = b + i s
Base b, stride s
Affine relation between value and lane ID
Examples: 8 9 10 11 12 13 14 15 (b=8, s=1), 0 2 4 6 8 10 12 14 (b=0, s=2)
Generic vector: anything else
Examples: 2 8 0 -4 4 4 5 8, 2 3 7 1 0 3 3 4
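The three categories can be told apart by checking v[i] = b + i s against the values of one warp. The function below is a minimal sketch of that test on concrete values. The name `classify` and its return format are made up for illustration, and a compiler would have to prove the property statically instead of inspecting values.

```python
def classify(v):
    """Classify the values of one warp as uniform, affine or generic."""
    b = v[0]
    s = v[1] - v[0] if len(v) > 1 else 0
    if all(x == b + i * s for i, x in enumerate(v)):
        if s == 0:
            return ("uniform", b)      # v[i] = c
        return ("affine", b, s)        # v[i] = b + i*s
    return ("generic",)


kinds = [
    classify([5, 5, 5, 5, 5, 5, 5, 5]),
    classify([8, 9, 10, 11, 12, 13, 14, 15]),
    classify([0, 2, 4, 6, 8, 10, 12, 14]),
    classify([2, 8, 0, -4, 4, 4, 5, 8]),
]
```

A uniform vector is the special case of an affine vector with stride 0, so a single check covers both.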
Most vectors are scalars in disguise!
Scalarization
Scalarization optimization
demotes uniform and affine vectors into scalars
Vector instructions → scalar instructions
Vector registers → scalar registers
SIMT branches → uniform (scalar) branches
Gather-scatter load-store → vector load-store or broadcast
After scalarization
Obvious benefits
Scalar registers instead of vector
Scalar instructions instead of vector
SIMD+scalar code
mov i ← 0
loop:
vload T ← X[i]
vmul T ← a×T
vstore X[i] ← T
add i ← i+16
branch i<n? loop
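The scalarized loop can be checked against the per-thread SPMD loop with a small simulation. This sketch assumes a single warp of 16 threads (so tnum = 16) and an array length n that is a non-zero multiple of 16, so that no masking of the last vector access is needed. The function names are illustrative.

```python
WARP = 16  # assumed warp size, matching the "i+16" increment


def spmd_loop(X, a, n, tnum=WARP):
    """Per-thread view: each thread runs the scalar loop with its own i."""
    X = list(X)
    for tid in range(tnum):
        i = tid                      # mov i <- tid
        while True:
            X[i] = a * X[i]          # load, mul, store
            i += tnum                # add i <- i+tnum
            if not i < n:            # branch i<n? loop
                break
    return X


def scalarized_loop(X, a, n, width=WARP):
    """Scalar+vector view: i is one scalar, T is a vector of `width` elements."""
    X = list(X)
    i = 0                            # mov i <- 0
    while True:
        T = X[i:i + width]           # vload
        T = [a * t for t in T]       # vmul (a is a scalar operand)
        X[i:i + width] = T           # vstore
        i += width                   # add i <- i+16
        if not i < n:                # scalar branch
            break
    return X


data = list(range(48))
spmd_result = spmd_loop(data, 17, 48)
scalar_result = scalarized_loop(data, 17, 48)
```

The affine index vector i has become a scalar base address, and the per-thread loads and stores have become contiguous vector accesses.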
Typing-based approach
No automatic solution!
SIMT vs. branch prediction
Branch prediction: predicts a single path
SIMT: the execution flow is a graph, not a sequence
Groups of threads can converge and diverge
1 path through the graph: 1 set of threads
[Figure: instruction stream between fetch and commit]
SIMT vs. renaming
Baseline CPU architecture
[Figure: block diagram. Multi-thread frontend: thread select, PC, rename (RAT). High-throughput backend: vector issue, vector PRF, vector units. Low-latency backend: scalar issue, scalar PRF, scalar units, branch issue, branch unit]
Proposed SIMT CPU architecture
[Figure: block diagram. Path-switching frontend: warp select, path select, PC, fetch & decode, rename (PIRAT). High-throughput backend: vector issue, vector PRF, vector units. Low-latency backend: scalar issue, scalar PRF, scalar units, branch issue, branch unit. Pathtable. Several blocks are flagged "New!"]
SIMT vs. branch prediction: solution
Branch prediction: predicts a single path
SIMT: the execution flow is a graph, not a sequence
Groups of threads can converge and diverge
1 path through the graph: 1 set of threads
Problem: without a total order, how to roll back?
Solution
Assign a total order at fetch, restore the same order at commit
Convergence at fetch, divergence at execution
→ each thread follows 1 speculative path at a time
Masks are tracked in a Pathtable
Masked commit
→ operations whose mask is entirely zero are ignored and produce no effect
[Figure: instruction stream between fetch and commit. Path A: t0, t3. Path B: t2. Path C: t1]
Anita Tino, Caroline Collange, André Seznec. SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores. TACO 2020
Path tracking
Convergence: paths h4 and h5 merge into h6
h6 = h4 | h5
Divergence: path h1 splits on predicate p into h2 and h3
h3 = h1 & p
Clear bits p in all paths from h2
Path management: example
Pathtable
Path Mask
h1 1111
[Figure: program execution flow (h1) and in-flight instructions between fetch and commit]
Path management: example
Fetch of the branch
Start a new path h2 = h1 to prepare for potential divergence
Pathtable
Path Mask
h1 1111
h2 1111
[Figure: program execution flow (h1, h2) and in-flight instructions between fetch and commit]
Path management: example
Pathtable
Path Mask
h1 1111
h2 1001
h3 0110
[Figure: program execution flow (h1 splits into h2 and h3) and in-flight instructions between fetch and commit]
Path management: example
Pathtable
Path Mask
h1 1111
h2 1001
h3 0110
h4 0110
[Figure: program execution flow (h1, h2, h3, h4) and in-flight instructions between fetch and commit]
Path management: example
Divergence
h5 = h3 & 0100
h4 &= ~0100
Pathtable
Path Mask
h1 1111
h2 1001
h3 0110
h4 0010
h5 0100
[Figure: program execution flow (h1, h2, h3, h4, h5) and in-flight instructions between fetch and commit]
Path management: example
Convergence: h6 = h4 | h5
Pathtable
Path Mask
h1 1111
h2 1001
h3 0110
h4 0010
h5 0100
h6 0110
[Figure: program execution flow (h1, h2, h3, h4, h5, h6) and in-flight instructions between fetch and commit]
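The Pathtable contents of this example can be replayed with three bitmask operations. This is a rough sketch, not the SIMT-X hardware algorithm. The helper names are hypothetical, and the first divergence mask (0110) is inferred from the table contents, since the slides only state the second one (0100) explicitly.

```python
def start_path(table, new, parent):
    """Branch fetched: open a new path with the parent's mask."""
    table[new] = table[parent]


def diverge(table, parent, taken, p, younger):
    """Threads in p leave for the `taken` path and are cleared from younger paths."""
    table[taken] = table[parent] & p
    for name in younger:
        table[name] &= ~p


def converge(table, new, a, b):
    """Two paths reach the same point: merge their masks."""
    table[new] = table[a] | table[b]


table = {"h1": 0b1111}
start_path(table, "h2", "h1")                # h2 = h1
diverge(table, "h1", "h3", 0b0110, ["h2"])   # h3 = h1 & p, clear p in h2
start_path(table, "h4", "h3")                # h4 = h3
diverge(table, "h3", "h5", 0b0100, ["h4"])   # h5 = h3 & 0100, h4 &= ~0100
converge(table, "h6", "h4", "h5")            # h6 = h4 | h5

masks = {name: format(mask, "04b") for name, mask in table.items()}
```

The final `masks` dictionary holds the same six rows as the last Pathtable of the example.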
Path management: example
SIMT vs. renaming: solution
Renaming: two paths are enough
Reads as: r1 is in p2 for the threads contained in h1, otherwise in p1
Renaming: two paths are enough
After the merge:
The active entry of r1 is freed
The other entry can be used as a source by the following instructions
Source code
r1 ← 42
if(p)
r1 ← 17
r2 ← r1 + 1
Renamed code
p1 = 42
if(h1)
p2 = 17
p4 = merge(h1?p2:p1)
p5 = p4 + 1
PIRAT (columns: Other, Active)
r1 p4
r2 p5
Pathtable (column: Mask)
h1 1100
h2 1111
Conclusions
Sequential control: branch prediction
Why? → go faster
How? → break control dependences
Parallel control: SIMT
Why? → be more efficient
How? → factor out redundant work