
GPU architecture part 3:

compiling from threads to SIMD

Caroline Collange
she/her
[email protected]
https://team.inria.fr/pacap/members/collange/
Master 2 SIF
ADA - 2024
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

Implementations of SPMD vectorization: SIMT vs. explicit SIMD

SIMT (example: Nvidia GPUs):
- All parallelism expressed using threads
- Warp size implementation-defined
- Dynamic vectorization: threads are grouped into warps at run time

Multi-core + explicit SIMD (example: most CPUs, Intel Xeon Phi, AMD GCN GPUs):
- Combination of threads and vectors
- Vector length fixed at compile-time
- Static vectorization: threads are grouped into vectors by the compiler
Bridging the gap between SPMD and SIMD

Software: OpenMP, graphics shaders, OpenCL, CUDA...
1 kernel, many threads:

kernel void scale(float a, float * X) {
    X[tid] = a * X[tid];
}

Hardware: SIMD CPU, GPU, Xeon Phi...
(Figure: a SIMD datapath with one register file (RF) and one ALU per lane.)
SPMD to SIMD: hardware or software?

Software: OpenMP, OpenCL, CUDA, Gfx shaders...
1 kernel, many threads:

kernel void scale(float a, float * X) {
    X[tid] = a * X[tid];
}

Two routes from threads down to SIMD hardware:
- SIMT microarchitecture (NVIDIA): threads compile to a scalar ISA; the hardware vectorizes them
- SIMD compiler (Intel, AMD): the compiler turns threads into vectors targeting a vector ISA

Hardware: SIMD CPU, GPU, Xeon Phi...
The third school of vectorization

- Straight-line program (SLP) vectorization: packs independent scalar instructions (add, mul, add, sub) from straight-line code into vector instructions (vadd, ...)
- Loop vectorization: maps successive iterations of a scalar loop (add, mul per iteration) onto the lanes of vector instructions (vadd, vmul)
- SPMD vectorization: maps the threads of a warp onto lanes, turning each instruction executed by the warp (add, mul, sub) into one vector instruction (vadd, vmul, vsub)
Managing SIMD control flow in software

Use cases:
- Compiling shaders and OpenCL for AMD GCN GPUs
- Compiling OpenCL for Xeon Phi
- ispc: Intel SPMD Program Compiler, targets various SIMD instruction sets

The compiler generates code to compute execution masks and branch directions.
Same ideas as hardware-based SIMT, but a different set of possible optimizations.
Target instruction sets

Support at least:
- Scalar instructions and vector registers
- Vector instructions and vector registers
- Vector blend / select instructions: blend(m, X, Y) picks lanes of X or Y under mask m
- Masked vector stores and loads: store(m, a, X) writes only the active lanes to memory

May also support:
- Masked versions of all vector instructions
- Vector gather and scatter: gather(A) loads each lane from its own address
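To make these primitive semantics concrete, here is a minimal scalar model in C; the 4-lane width and the helper names (blend4, masked_store4, gather4) are illustrative choices, not any particular ISA:

#define LANES 4

/* blend(m, X, Y): per lane, take X where the mask bit is set, else Y. */
static void blend4(const int m[LANES], const float X[LANES],
                   const float Y[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = m[i] ? X[i] : Y[i];
}

/* store(m, a, X): write only the active lanes; inactive lanes of memory
   are left untouched, which is what makes if-conversion safe. */
static void masked_store4(const int m[LANES], float *a, const float X[LANES]) {
    for (int i = 0; i < LANES; i++)
        if (m[i]) a[i] = X[i];
}

/* gather(A): one memory access per lane, each at its own index. */
static void gather4(const float *base, const int idx[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = base[idx[i]];
}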
Compiling SPMD to predicated SIMD

x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}
Compiling SPMD to predicated SIMD: scalar control-flow graph

x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}

(Figure: the corresponding scalar CFG. The branch tid > 17 / tid ≤ 17 guards the block x=1; the branch tid < 2 / tid ≥ 2 guards a nested diamond where tid = 0 leads to x=2 and tid ≠ 0 leads to x=3.)
Compiling SPMD to predicated SIMD: if-conversion

If-conversion: flatten if statements, convert them into predicated (masked) instructions.

x = 0
x = 1    where tid > 17
x = 2    where tid < 2 & tid = 0
x = 3    where tid < 2 & tid ≠ 0

Consequences:
- Needs masked instructions
- Redundant execution: every block is executed for all lanes
- Introduces false dependencies between the two sides of a branch
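A minimal sketch of what the if-converted code computes, simulated lane by lane in C (the 4-lane warp and the mask names m1-m4 are illustrative; a real target would use masked vector instructions):

#define LANES 4

static void kernel_if_converted(int x[LANES]) {
    /* Every lane executes every assignment; masks decide which writes land. */
    for (int tid = 0; tid < LANES; tid++) {
        int m1 = (tid > 17);                /* if mask, uniform here */
        int m2 = (tid < 2);
        int m3 = m2 && (tid == 0);
        int m4 = m2 && (tid != 0);
        x[tid] = 0;                         /* x = 0 */
        x[tid] = m1 ? 1 : x[tid];           /* x = 1 where tid > 17 */
        x[tid] = m3 ? 2 : x[tid];           /* x = 2 where tid < 2 & tid = 0 */
        x[tid] = m4 ? 3 : x[tid];           /* x = 3 where tid < 2 & tid ≠ 0 */
    }
}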
Single assignment: idea

Create a new identifier each time a variable is written.
Ada Lovelace's notation: ²V1 is the 2nd assignment of variable 1.
Allows expressing a program as equations:

imperative: assignments        functional: equations
V3 ← V1 + V2                   ³V3 = ¹V1 + ²V2
V1 ← V2 × V2                   ²V1 = ²V2 × ²V2

Removes write-after-read dependencies.

Modern usages:
- Dynamic: register renaming in superscalar microarchitectures
- Static: Static Single Assignment representation in compilers

L. F. Menabrea and A. A. Lovelace. Sketch of the Analytical Engine invented by Charles Babbage. 1842
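The same idea in a small C sketch (the _n suffixes play the role of Lovelace's assignment numbers; the names are ours):

static void imperative_form(float v1, float v2) {
    /* The second statement overwrites v1, which the first statement reads:
       a write-after-read dependency forces their order. */
    float v3 = v1 + v2;
    v1 = v2 * v2;
    (void)v3; (void)v1;
}

static void single_assignment_form(float v1_1, float v2_2) {
    /* Fresh name per write: the two equations are now independent and
       can execute in either order, or in parallel. */
    float v3_3 = v1_1 + v2_2;   /* 3V3 = 1V1 + 2V2 */
    float v1_2 = v2_2 * v2_2;   /* 2V1 = 2V2 x 2V2 */
    (void)v3_3; (void)v1_2;
}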
Static Single Assignment (SSA)

Add φ functions to select the proper name after reconvergence.

x = 0;                        x0 = 0
// Uniform condition
if(tid > 17) {
    x = 1;                    x1 = 1
}                             x2 = φ(x0, x1)
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;                x3 = 2
    }
    else {
        x = 3;                x4 = 3
    }                         x5 = φ(x4, x3)
}                             x6 = φ(x2, x5)
Gated Single Assignment (GSA)

Gated single assignment: include predicates in the φ functions (written ɣ).

x = 0;                        x0 = 0
// Uniform condition
if(tid > 17) {
    x = 1;                    x1 = 1
}                             x2 = ɣ(tid > 17 ? x1 : x0)
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;                x3 = 2
    }
    else {
        x = 3;                x4 = 3
    }                         x5 = ɣ(tid == 0 ? x3 : x4)
}                             x6 = ɣ(tid < 2 ? x5 : x2)
Omitting branch conditions

Since the ɣ functions carry the predicates, the branch conditions on the control-flow edges become redundant and can be omitted, which prepares the linearization below:

x0 = 0
x1 = 1
x2 = ɣ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = ɣ(tid == 0 ? x3 : x4)
x6 = ɣ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: linearized CFG

The basic blocks are laid out in a single straight-line sequence; the ɣ functions now express all the data flow:

x0 = 0
x1 = 1
x2 = ɣ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = ɣ(tid == 0 ? x3 : x4)
x6 = ɣ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: vector code

With a 4-thread warp, tid = (0, 1, 2, 3), the linearized code evaluates to:

x0 = (0, 0, 0, 0)
x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
x3 = (2, 2, 2, 2)
x4 = (3, 3, 3, 3)
x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
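The same evaluation as a runnable C sketch over a hypothetical 4-lane warp, with the blends written as per-lane selects:

#include <stdio.h>

#define LANES 4

int main(void) {
    int tid[LANES] = {0, 1, 2, 3};
    int x0[LANES], x1[LANES], x2[LANES], x3[LANES], x4[LANES],
        x5[LANES], x6[LANES];
    for (int i = 0; i < LANES; i++) {
        x0[i] = 0;
        x1[i] = 1;
        x2[i] = (tid[i] > 17) ? x1[i] : x0[i];   /* blend */
        x3[i] = 2;
        x4[i] = 3;
        x5[i] = (tid[i] == 0) ? x3[i] : x4[i];   /* blend */
        x6[i] = (tid[i] < 2) ? x5[i] : x2[i];    /* blend */
    }
    for (int i = 0; i < LANES; i++)
        printf("%d ", x6[i]);                    /* prints: 2 3 0 0 */
    printf("\n");
    return 0;
}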
Compiling SPMD to predicated SIMD: vector code + control-flow

Branches can be added back to skip entire blocks: a block is entered only when at least one thread needs it (any), and skipped when all threads bypass it (all).

tid = (0, 1, 2, 3)
x0 = (0, 0, 0, 0)
skip if all(tid ≤ 17), enter if any(tid > 17):
    x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
skip if all(tid ≥ 2), enter if any(tid < 2):
    skip if all(tid ≠ 0), enter if any(tid = 0):
        x3 = (2, 2, 2, 2)
    skip if all(tid = 0), enter if any(tid ≠ 0):
        x4 = (3, 3, 3, 3)
    x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
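A sketch of one such guarded region in C, on the same hypothetical 4-lane warp; any() reduces a lane mask to a single scalar branch condition:

#define LANES 4

static int any(const int m[LANES]) {            /* at least one lane active? */
    int r = 0;
    for (int i = 0; i < LANES; i++) r |= m[i];
    return r;
}

static void guarded_block(const int tid[LANES], int x[LANES]) {
    int m1[LANES];
    for (int i = 0; i < LANES; i++) m1[i] = (tid[i] > 17);
    /* Jump over the block when no lane takes the branch; for
       tid = (0, 1, 2, 3) the whole region is skipped. */
    if (any(m1)) {
        for (int i = 0; i < LANES; i++)
            x[i] = m1[i] ? 1 : x[i];            /* masked x = 1 */
    }
}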
Compiling SPMD to predicated SIMD

x = 0;                       (m0) mov x←0           // m0 is current mask
                             (m0) cmp c←tid>17      // vector comparison
// Uniform condition              and m1←m0&c       // compute if mask
if(tid > 17) {                    jcc(m1=0) endif1  // skip if null
    x = 1;                   (m1) mov x←1
}                            endif1:
                             (m0) cmp c←tid<2
// Divergent conditions           and m2←m0&c
if(tid < 2) {                     jcc(m2=0) endif2
    if(tid == 0) {           (m2) cmp c←tid==0
                                  and m3←m2&c
                                  jcc(m3=0) else
        x = 2;               (m3) mov x←2
    }                        else:
    else {                        and m4←m2&~c
                                  jcc(m4=0) endif2
        x = 3;               (m4) mov x←3
    }                        endif2:
}
Benefits and shortcomings of software SIMT

Benefits:
- No stack structure to maintain: use mask registers directly
- Register allocation takes care of reuse and spills to memory
- The compiler knows the precise execution order, which enables more optimizations:
  - Turning masking into "zeroing": critical for out-of-order architectures
  - Scalarization: demoting uniform vectors into scalars

Shortcomings:
- Every branch is divergent unless proven otherwise
- Need to allocate mask registers either way
- Restricts the freedom of the microarchitecture for runtime optimization
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

Scalars in SPMD code

Some values and operations are inherently scalar: loop counters, addresses of consecutive accesses...

SPMD code:

mov i ← tid
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+tnum
    branch i<n? loop

Same values for all threads of a warp: uniform vector.
Or a sequence of evenly-spaced values: affine vector.

(Example per-thread values across the warp: a = (17, 17, 17, ...) and n = (51, 51, 51, ...) are uniform; i = (0, 1, 2, 3, ...) is affine; t is generic.)
Uniform and affine vectors

Uniform vector:
- In a warp (the granularity), v[i] = c: the value does not depend on the lane ID
- Examples: (5, 5, 5, 5, 5, 5, 5, 5) with c = 5; (3, 3, 3, 3, 3, 3, 3, 3) with c = 3

Affine vector:
- In a warp, v[i] = b + i×s, with base b and stride s: affine relation between value and lane ID
- Examples: (8, 9, 10, 11, 12, 13, 14, 15) with b = 8, s = 1; (0, 2, 4, 6, 8, 10, 12, 14) with b = 0, s = 2

Generic vector: anything else, e.g. (2, 8, 0, -4, 4, 4, 5, 8) or (2, 3, 7, 1, 0, 3, 3, 4)
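A minimal dynamic classifier for these three kinds, as a C sketch (a compiler would establish the same classification statically; the names are ours):

#define LANES 8

typedef enum { UNIFORM, AFFINE, GENERIC } vec_kind;

/* v is uniform when all lanes are equal (stride 0), affine when the lanes
   form v[i] = b + i*s for some base b and stride s, generic otherwise. */
static vec_kind classify(const int v[LANES]) {
    int stride = v[1] - v[0];
    for (int i = 2; i < LANES; i++)
        if (v[i] - v[i-1] != stride)
            return GENERIC;
    return stride == 0 ? UNIFORM : AFFINE;
}

For instance, classify() returns UNIFORM for (5, 5, 5, 5, 5, 5, 5, 5), AFFINE for (8, 9, 10, 11, 12, 13, 14, 15), and GENERIC for (2, 8, 0, -4, 4, 4, 5, 8).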
Most vectors are scalars in disguise!

In GPGPU kernels, most integer arithmetic is affine (or uniform). Integer, i.e. not floating-point; GPGPU, i.e. not graphics shaders.
What is inside a GPU register file?

Non-affine registers alive in the inner loop:
- MatrixMul: 3 non-affine / 14
- Convolution: 4 non-affine in the hotspot / 14
- Needleman-Wunsch: 2 non-affine / 24

50% - 92% of the GPU RF contains affine variables.
More than register reads: non-affine variables are short-lived.
Very high potential for register pressure reduction in GPGPU apps.
Scalarization

Explicit SIMD architectures have scalar units:
- Intel Xeon Phi: has good old x86
- AMD GCN GPUs: have scalar units and registers

The scalarization optimization demotes uniform and affine vectors into scalars:
- Vector instructions → scalar instructions
- Vector registers → scalar registers
- SIMT branches → uniform (scalar) branches
- Gather-scatter load-store → vector load-store or broadcast

Divergence analysis guides scalarization: a data-flow analysis on the GSA representation, as sketched below.
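A sketch of the divergence lattice and two of its transfer functions in C, assuming the usual ordering uniform ⊑ affine ⊑ generic (the rule names are ours):

typedef enum { UNIFORM = 0, AFFINE = 1, GENERIC = 2 } tag;

static tag join(tag a, tag b) { return a > b ? a : b; }

/* Addition: uniform+uniform is uniform; affine+affine stays affine
   (bases and strides add); anything involving generic is generic. */
static tag transfer_add(tag a, tag b) { return join(a, b); }

/* Gamma(p ? a : b): if the predicate is divergent (non-uniform), lanes
   pick different sources, so the result is generic even when both
   inputs are uniform. */
static tag transfer_gamma(tag p, tag a, tag b) {
    return (p == UNIFORM) ? join(a, b) : GENERIC;
}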
After scalarization

SIMD+scalar code:

mov i ← 0
loop:
    vload T ← X[i]
    vmul T ← a×T
    vstore X[i] ← T
    add i ← i+16
    branch i<n? loop

Obvious benefits:
- Scalar registers instead of vector registers
- Scalar instructions instead of vector instructions

Less obvious benefits:
- Contiguous vector load and store
- Scalar branches, no masking
- Affine vector → single scalar: the stride has been constant-propagated!
- No dependency between scalar and vector code except through loads and stores: enables decoupling

(Example values: a = 17, i = 0, n = 51 live in scalar registers; only T remains a vector.)
Scalarization across function calls

Which parameters are uniform or affine?

kernel void scale(float a, float * X) {
    // Called for each thread tid
    X[tid] = mul(a, X[tid]);
}

float mul(float u, float v) {
    return u * v;
}

kernel void scale2(float a, float * X) {
    // Called for each thread tid
    mul_ptr(&a, &X[tid]);
}

void mul_ptr(float* u, float *v) {
    *v = (*u) * (*v);
}

It depends on the call site:
- Not visible to the compiler before link-time, or requires interprocedural optimization (expensive)
- Different call sites may have different sets of uniform/affine parameters
Typing-based approach

Used in Intel Cilk+: the programmer qualifies parameters explicitly.

__declspec (vector uniform(u)) float mul(float u, float v)
{
    return u * v;
}

__declspec (vector uniform(u) linear(v)) void mul_ptr(float* u, float *v)
{
    *v = (*u) * (*v);
}

Different variations are C++ function overloads. By default, everything is a generic vector.
No automatic solution!
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

SIMT vs. branch prediction

Branch prediction: predicts a single path.

SIMT: the execution flow is a graph, not a sequence.
- Groups of threads can converge and diverge
- 1 path through the graph corresponds to 1 set of threads

Problem: without a total order, how do we roll back?

(Figure: a fetch front and a commit front advance over an execution graph; Path A: t0, t3; Path B: t2; Path C: t1.)
SIMT vs. register renaming

Register renaming eliminates write-after-read dependencies:

add r3 ← r1+r2        add r3a ← r1a+r2a
mul r1 ← r2×r2        mul r1b ← r2a×r2a

SIMT case: partial dependencies.
- Instructions are implicitly masked
- Vector registers are written partially
- A register can have several producers

if(p)                 if(p)
    r1 ← 42               r1a ← 42
else                  else
    r1 ← 17               r1b ← 17
r2 ← r1               r2a ← r1??

Problem: different threads access different registers: there is no unique name.
Baseline CPU architecture

(Figure: a multi-thread frontend (fetch & decode, rename: RAT, thread select, PC) feeds two backends. The high-throughput backend has a vector issue queue, a vector PRF and vector units; the low-latency backend has a scalar issue queue, a scalar PRF and scalar units, plus a branch issue queue and a branch unit.)
Proposed SIMT CPU architecture

(Figure: the same pipeline with the new components highlighted. The multi-thread frontend becomes a path-switching frontend: warp select and path select replace thread select, rename uses the PIRAT instead of the RAT, and a path table feeds the branch unit. The high-throughput vector backend and the low-latency scalar backend are unchanged.)
SIMT vs. branch prediction: the solution

Branch prediction predicts a single path. In SIMT, the execution flow is a graph, not a sequence: groups of threads can converge and diverge, and 1 path through the graph corresponds to 1 set of threads (Path A: t0, t3; Path B: t2; Path C: t1). Problem: without a total order, how do we roll back?

Solution:
- Assign a total order at fetch, restore the same order at commit
- Convergence at fetch, divergence at execution → each thread follows 1 speculative path at a time
- Track masks in a path table
- Masked commit → operations whose mask is entirely null are ignored and produce no effect

Anita Tino, Caroline Collange, André Seznec. SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores. TACO 2020
Path tracking

Each path has a bit mask:
- It represents the set of threads that follow this path
- Initially speculative, by over-approximation
- Fully known once the path's instructions commit

Path registers hold the masks of the paths in flight. They are read and written by convergence and divergence micro-instructions, which turns control flow into data flow:

Convergence: h6 = h4 | h5
Divergence on predicate p: h3 = h1 & p; clear bits p in all paths from h2
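A bitmask sketch of these micro-instructions in C, with one bit per thread of the warp (the helper names are ours):

#include <stdint.h>

typedef uint32_t pathmask;   /* set of threads following a path */

/* Convergence: the merged path is followed by the union of both sets,
   e.g. h6 = converge(h4, h5). */
static pathmask converge(pathmask h4, pathmask h5) {
    return h4 | h5;
}

/* Divergence on predicate p: threads in p leave for a new path
   (h3 = h1 & p) and are cleared from every path past the branch. */
static pathmask diverge(pathmask h1, pathmask p,
                        pathmask *downstream, int n) {
    for (int i = 0; i < n; i++)
        downstream[i] &= ~p;   /* clear bits p in all paths from h2 on */
    return h1 & p;
}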
Path management: example

Initial path taken by all threads: h1 = 1111. In-flight instructions sit between the fetch front and the commit front along the program's execution flow.

Path table:
Path  Mask
h1    1111
Path management: example

Fetching the branch: start a new path h2 = h1 to prepare for a potential divergence.

Path table:
Path  Mask
h1    1111
h2    1111
Path management: example

The branch resolves and threads 1-2 diverge:
- Cancel threads 1-2 in all paths past the branch: h2 &= ~0110
- Start a new path h3 = h1 & 0110

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
Path management: example

A further divergence is predicted likely: h4 = h3.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0110
Path management: example

Divergence: h5 = h3 & 0100; h4 &= ~0100.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0010
h5    0100
Path management: example

Convergence: h6 = h4 | h5.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0010
h5    0100
h6    0110
Path management: example

Free path h1 when the branch commits. Convergence: h7 = h2 | h6.

Path table:
Path  Mask
h2    1001
h3    0110
h4    0010
h5    0100
h6    0110
h7    1111
SIMT vs. renaming: the solution

Register renaming eliminates write-after-read dependencies:

add r3 ← r1+r2        add r3a ← r1a+r2a
mul r1 ← r2×r2        mul r1b ← r2a×r2a

SIMT case: partial dependencies. Instructions are implicitly masked, vector registers are written partially, and a register can have several producers:

if(p)                 if(p)
    r1 ← 42               r1a ← 42
else                  else
    r1 ← 17               r1b ← 17
r2 ← r1               r2a ← r1??

Solution: the PIRAT (Path-Identifying Register Alias Table).
- Different paths may use different physical registers
- Inject merge micro-operations on demand after convergence:

if(p)
    r1a ← 42
else
    r1b ← 17
r2a ← merge(p? r1a : r1b)
Renaming: two paths are enough

How many physical registers per architectural register?
- As many as threads per warp in the worst case: far too many to implement in hardware
- Observation: 2 are enough in practice!
- 1 for the current active path, 1 for the other paths
- On overflow, insert merge micro-operations
- Cost: dependencies between instructions

r1 ← 42               p1 = 42
if(p)                 if(h1)
    r1 ← 17               p2 = 17
r2 ← r1 + 1           p4 = merge(h1? p2 : p1)
                      p5 = p4 + 1

PIRAT:                          Pathtable:
      Other  Active  Mask             Mask
r1    p1     p2      h1         h1    1100
r2    p3

Read as: r1 is in p2 for the threads contained in h1, otherwise in p1.
Renaming: two paths are enough

After the merge:
- The active entry of r1 is freed
- The other entry becomes usable as a source by subsequent instructions

r1 ← 42               p1 = 42
if(p)                 if(h1)
    r1 ← 17               p2 = 17
r2 ← r1 + 1           p4 = merge(h1? p2 : p1)
                      p5 = p4 + 1

PIRAT:                          Pathtable:
      Other  Active  Mask             Mask
r1    p4                        h1    1100
r2    p5                        h2    1111

Read as: r1 is in p4 for all threads. A sketch of a PIRAT entry and its merge follows.
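A sketch of one PIRAT entry and its merge in C, following the tables above (the structure and names are ours, not the hardware layout):

#include <stdint.h>

typedef struct {
    int      other;    /* physical register for the other paths (p1, p4, ...) */
    int      active;   /* physical register for the active path, -1 if none   */
    uint32_t mask;     /* threads covered by the active entry (h1, ...)       */
} pirat_entry;

/* On a third producer, inject a merge micro-op
   p_new = merge(mask ? active : other), then free the active entry:
   the architectural register now lives in p_new for all threads. */
static void merge_entry(pirat_entry *e, int p_new) {
    /* ... emit micro-op: p_new = merge(e->mask ? e->active : e->other) ... */
    e->other  = p_new;
    e->active = -1;
    e->mask   = 0;
}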
Conclusions

Sequential control: branch prediction.
- Why? → to go faster
- How? → break control dependencies

Parallel control: SIMT.
- Why? → to be more efficient
- How? → factor out redundant work

Sequential and parallel control can be combined.

Research direction: formally study the properties of SIMT. The single-assignment concept keeps coming back: out-of-order microarchitectures, SIMT architectures, high-level synthesis.
