
GPU architecture part 3:

compiling from threads to SIMD

Caroline Collange
she/her
[email protected]
https://team.inria.fr/pacap/members/collange/
Master 2 SIF
ADA - 2024
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

Implementations of SPMD vectorization: SIMT vs. explicit SIMD

SIMT (example: Nvidia GPUs):
- All parallelism expressed using threads
- Warp size implementation-defined
- Dynamic vectorization: threads are grouped into warps at run time

Multi-core + explicit SIMD (example: most CPUs, Intel Xeon Phi, AMD GCN GPUs):
- Combination of threads and vectors
- Vector length fixed at compile-time
- Static vectorization: threads are grouped into vectors by the compiler
Bridging the gap between SPMD and SIMD

Software: OpenMP, graphics shaders, OpenCL, CUDA...
1 kernel, many threads:

kernel void scale(float a, float * X) {
    X[tid] = a * X[tid];
}

Hardware: SIMD CPU, GPU, Xeon Phi...
(Figure: a SIMD datapath with one register file (RF) and one ALU per lane.)
SPMD to SIMD: hardware or software?

Software: OpenMP, OpenCL, CUDA, Gfx shaders...
1 kernel, many threads:

kernel void scale(float a, float * X) {
    X[tid] = a * X[tid];
}

Two routes from threads down to SIMD hardware:
- SIMT microarchitecture (NVIDIA): threads compile to a scalar ISA; the hardware vectorizes them
- SIMD compiler (Intel, AMD): the compiler turns threads into vectors targeting a vector ISA

Hardware: SIMD CPU, GPU, Xeon Phi...
The third school of vectorization

- Straight-line program (SLP) vectorization: packs independent scalar instructions (add, mul, add, sub) from straight-line code into vector instructions (vadd, ...)
- Loop vectorization: maps successive iterations of a scalar loop (add, mul per iteration) onto the lanes of vector instructions (vadd, vmul)
- SPMD vectorization: maps the threads of a warp onto lanes, turning each instruction executed by the warp (add, mul, sub) into one vector instruction (vadd, vmul, vsub)
Managing SIMD control flow in software

Use cases:
- Compiling shaders and OpenCL for AMD GCN GPUs
- Compiling OpenCL for Xeon Phi
- ispc: Intel SPMD Program Compiler, targets various SIMD instruction sets

The compiler generates code to compute execution masks and branch directions.
Same ideas as hardware-based SIMT, but a different set of possible optimizations.
Target instruction sets

Support at least:
- Scalar instructions and vector registers
- Vector instructions and vector registers
- Vector blend / select instructions: blend(m, X, Y) picks lanes of X or Y under mask m
- Masked vector stores and loads: store(m, a, X) writes only the active lanes to memory

May also support:
- Masked versions of all vector instructions
- Vector gather and scatter: gather(A) loads each lane from its own address
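To make these primitive semantics concrete, here is a minimal scalar model in C; the 4-lane width and the helper names (blend4, masked_store4, gather4) are illustrative choices, not any particular ISA:

#define LANES 4

/* blend(m, X, Y): per lane, take X where the mask bit is set, else Y. */
static void blend4(const int m[LANES], const float X[LANES],
                   const float Y[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = m[i] ? X[i] : Y[i];
}

/* store(m, a, X): write only the active lanes; inactive lanes of memory
   are left untouched, which is what makes if-conversion safe. */
static void masked_store4(const int m[LANES], float *a, const float X[LANES]) {
    for (int i = 0; i < LANES; i++)
        if (m[i]) a[i] = X[i];
}

/* gather(A): one memory access per lane, each at its own index. */
static void gather4(const float *base, const int idx[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = base[idx[i]];
}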
Compiling SPMD to predicated SIMD

x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}
Compiling SPMD to predicated SIMD: scalar control-flow graph

x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}

(Figure: the corresponding scalar CFG. The branch tid > 17 / tid ≤ 17 guards the block x=1; the branch tid < 2 / tid ≥ 2 guards a nested diamond where tid = 0 leads to x=2 and tid ≠ 0 leads to x=3.)
Compiling SPMD to predicated SIMD: if-conversion

If-conversion: flatten if statements, convert them into predicated (masked) instructions.

x = 0
x = 1    where tid > 17
x = 2    where tid < 2 & tid = 0
x = 3    where tid < 2 & tid ≠ 0

Consequences:
- Needs masked instructions
- Redundant execution: every block is executed for all lanes
- Introduces false dependencies between the two sides of a branch
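A minimal sketch of what the if-converted code computes, simulated lane by lane in C (the 4-lane warp and the mask names m1-m4 are illustrative; a real target would use masked vector instructions):

#define LANES 4

static void kernel_if_converted(int x[LANES]) {
    /* Every lane executes every assignment; masks decide which writes land. */
    for (int tid = 0; tid < LANES; tid++) {
        int m1 = (tid > 17);                /* if mask, uniform here */
        int m2 = (tid < 2);
        int m3 = m2 && (tid == 0);
        int m4 = m2 && (tid != 0);
        x[tid] = 0;                         /* x = 0 */
        x[tid] = m1 ? 1 : x[tid];           /* x = 1 where tid > 17 */
        x[tid] = m3 ? 2 : x[tid];           /* x = 2 where tid < 2 & tid = 0 */
        x[tid] = m4 ? 3 : x[tid];           /* x = 3 where tid < 2 & tid ≠ 0 */
    }
}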
Single assignment: idea

Create a new identifier each time a variable is written.
Ada Lovelace's notation: ²V1 is the 2nd assignment of variable 1.
Allows expressing a program as equations:

imperative: assignments        functional: equations
V3 ← V1 + V2                   ³V3 = ¹V1 + ²V2
V1 ← V2 × V2                   ²V1 = ²V2 × ²V2

Removes write-after-read dependencies.

Modern usages:
- Dynamic: register renaming in superscalar microarchitectures
- Static: Static Single Assignment representation in compilers

L. F. Menabrea and A. A. Lovelace. Sketch of the Analytical Engine invented by Charles Babbage. 1842
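The same idea in a small C sketch (the _n suffixes play the role of Lovelace's assignment numbers; the names are ours):

static void imperative_form(float v1, float v2) {
    /* The second statement overwrites v1, which the first statement reads:
       a write-after-read dependency forces their order. */
    float v3 = v1 + v2;
    v1 = v2 * v2;
    (void)v3; (void)v1;
}

static void single_assignment_form(float v1_1, float v2_2) {
    /* Fresh name per write: the two equations are now independent and
       can execute in either order, or in parallel. */
    float v3_3 = v1_1 + v2_2;   /* 3V3 = 1V1 + 2V2 */
    float v1_2 = v2_2 * v2_2;   /* 2V1 = 2V2 x 2V2 */
    (void)v3_3; (void)v1_2;
}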
Static Single Assignment (SSA)

Add φ functions to select the proper name after reconvergence.

x = 0;                        x0 = 0
// Uniform condition
if(tid > 17) {
    x = 1;                    x1 = 1
}                             x2 = φ(x0, x1)
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;                x3 = 2
    }
    else {
        x = 3;                x4 = 3
    }                         x5 = φ(x4, x3)
}                             x6 = φ(x2, x5)
Gated Single Assignment (GSA)

Gated single assignment: include predicates in the φ functions (written ɣ).

x = 0;                        x0 = 0
// Uniform condition
if(tid > 17) {
    x = 1;                    x1 = 1
}                             x2 = ɣ(tid > 17 ? x1 : x0)
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;                x3 = 2
    }
    else {
        x = 3;                x4 = 3
    }                         x5 = ɣ(tid == 0 ? x3 : x4)
}                             x6 = ɣ(tid < 2 ? x5 : x2)
Omitting branch conditions

Since the ɣ functions carry the predicates, the branch conditions on the control-flow edges become redundant and can be omitted, which prepares the linearization below:

x0 = 0
x1 = 1
x2 = ɣ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = ɣ(tid == 0 ? x3 : x4)
x6 = ɣ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: linearized CFG

The basic blocks are laid out in a single straight-line sequence; the ɣ functions now express all the data flow:

x0 = 0
x1 = 1
x2 = ɣ(tid > 17 ? x1 : x0)
x3 = 2
x4 = 3
x5 = ɣ(tid == 0 ? x3 : x4)
x6 = ɣ(tid < 2 ? x5 : x2)
Compiling SPMD to predicated SIMD: vector code

With a 4-thread warp, tid = (0, 1, 2, 3), the linearized code evaluates to:

x0 = (0, 0, 0, 0)
x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
x3 = (2, 2, 2, 2)
x4 = (3, 3, 3, 3)
x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
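The same evaluation as a runnable C sketch over a hypothetical 4-lane warp, with the blends written as per-lane selects:

#include <stdio.h>

#define LANES 4

int main(void) {
    int tid[LANES] = {0, 1, 2, 3};
    int x0[LANES], x1[LANES], x2[LANES], x3[LANES], x4[LANES],
        x5[LANES], x6[LANES];
    for (int i = 0; i < LANES; i++) {
        x0[i] = 0;
        x1[i] = 1;
        x2[i] = (tid[i] > 17) ? x1[i] : x0[i];   /* blend */
        x3[i] = 2;
        x4[i] = 3;
        x5[i] = (tid[i] == 0) ? x3[i] : x4[i];   /* blend */
        x6[i] = (tid[i] < 2) ? x5[i] : x2[i];    /* blend */
    }
    for (int i = 0; i < LANES; i++)
        printf("%d ", x6[i]);                    /* prints: 2 3 0 0 */
    printf("\n");
    return 0;
}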
Compiling SPMD to predicated SIMD: vector code + control-flow

Branches can be added back to skip entire blocks: a block is entered only when at least one thread needs it (any), and skipped when all threads bypass it (all).

tid = (0, 1, 2, 3)
x0 = (0, 0, 0, 0)
skip if all(tid ≤ 17), enter if any(tid > 17):
    x1 = (1, 1, 1, 1)
x2 = blend(tid > 17 ? x1 : x0) = (0, 0, 0, 0)
skip if all(tid ≥ 2), enter if any(tid < 2):
    skip if all(tid ≠ 0), enter if any(tid = 0):
        x3 = (2, 2, 2, 2)
    skip if all(tid = 0), enter if any(tid ≠ 0):
        x4 = (3, 3, 3, 3)
    x5 = blend(tid == 0 ? x3 : x4) = (2, 3, 3, 3)
x6 = blend(tid < 2 ? x5 : x2) = (2, 3, 0, 0)
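A sketch of one such guarded region in C, on the same hypothetical 4-lane warp; any() reduces a lane mask to a single scalar branch condition:

#define LANES 4

static int any(const int m[LANES]) {            /* at least one lane active? */
    int r = 0;
    for (int i = 0; i < LANES; i++) r |= m[i];
    return r;
}

static void guarded_block(const int tid[LANES], int x[LANES]) {
    int m1[LANES];
    for (int i = 0; i < LANES; i++) m1[i] = (tid[i] > 17);
    /* Jump over the block when no lane takes the branch; for
       tid = (0, 1, 2, 3) the whole region is skipped. */
    if (any(m1)) {
        for (int i = 0; i < LANES; i++)
            x[i] = m1[i] ? 1 : x[i];            /* masked x = 1 */
    }
}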
Compiling SPMD to predicated SIMD

x = 0;                       (m0) mov x←0           // m0 is current mask
                             (m0) cmp c←tid>17      // vector comparison
// Uniform condition              and m1←m0&c       // compute if mask
if(tid > 17) {                    jcc(m1=0) endif1  // skip if null
    x = 1;                   (m1) mov x←1
}                            endif1:
                             (m0) cmp c←tid<2
// Divergent conditions           and m2←m0&c
if(tid < 2) {                     jcc(m2=0) endif2
    if(tid == 0) {           (m2) cmp c←tid==0
                                  and m3←m2&c
                                  jcc(m3=0) else
        x = 2;               (m3) mov x←2
    }                        else:
    else {                        and m4←m2&~c
                                  jcc(m4=0) endif2
        x = 3;               (m4) mov x←3
    }                        endif2:
}
Benefits and shortcomings of software SIMT

Benefits:
- No stack structure to maintain: use mask registers directly
- Register allocation takes care of reuse and spills to memory
- The compiler knows the precise execution order, which enables more optimizations:
  - Turning masking into "zeroing": critical for out-of-order architectures
  - Scalarization: demoting uniform vectors into scalars

Shortcomings:
- Every branch is divergent unless proven otherwise
- Need to allocate mask registers either way
- Restricts the freedom of the microarchitecture for runtime optimization
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

Scalars in SPMD code

Some values and operations are inherently scalar: loop counters, addresses of consecutive accesses...

SPMD code:

mov i ← tid
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+tnum
    branch i<n? loop

Same values for all threads of a warp: uniform vector.
Or a sequence of evenly-spaced values: affine vector.

(Example per-thread values across the warp: a = (17, 17, 17, ...) and n = (51, 51, 51, ...) are uniform; i = (0, 1, 2, 3, ...) is affine; t is generic.)
Uniform and affine vectors

Uniform vector:
- In a warp (the granularity), v[i] = c: the value does not depend on the lane ID
- Examples: (5, 5, 5, 5, 5, 5, 5, 5) with c = 5; (3, 3, 3, 3, 3, 3, 3, 3) with c = 3

Affine vector:
- In a warp, v[i] = b + i×s, with base b and stride s: affine relation between value and lane ID
- Examples: (8, 9, 10, 11, 12, 13, 14, 15) with b = 8, s = 1; (0, 2, 4, 6, 8, 10, 12, 14) with b = 0, s = 2

Generic vector: anything else, e.g. (2, 8, 0, -4, 4, 4, 5, 8) or (2, 3, 7, 1, 0, 3, 3, 4)
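A minimal dynamic classifier for these three kinds, as a C sketch (a compiler would establish the same classification statically; the names are ours):

#define LANES 8

typedef enum { UNIFORM, AFFINE, GENERIC } vec_kind;

/* v is uniform when all lanes are equal (stride 0), affine when the lanes
   form v[i] = b + i*s for some base b and stride s, generic otherwise. */
static vec_kind classify(const int v[LANES]) {
    int stride = v[1] - v[0];
    for (int i = 2; i < LANES; i++)
        if (v[i] - v[i-1] != stride)
            return GENERIC;
    return stride == 0 ? UNIFORM : AFFINE;
}

For instance, classify() returns UNIFORM for (5, 5, 5, 5, 5, 5, 5, 5), AFFINE for (8, 9, 10, 11, 12, 13, 14, 15), and GENERIC for (2, 8, 0, -4, 4, 4, 5, 8).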
Most vectors are scalars in disguise!

In GPGPU kernels, most integer arithmetic is affine (or uniform). Integer, i.e. not floating-point; GPGPU, i.e. not graphics shaders.
What is inside a GPU register file?

Non-affine registers alive in the inner loop:
- MatrixMul: 3 non-affine / 14
- Convolution: 4 non-affine in the hotspot / 14
- Needleman-Wunsch: 2 non-affine / 24

50% - 92% of the GPU RF contains affine variables.
More than register reads: non-affine variables are short-lived.
Very high potential for register pressure reduction in GPGPU apps.
Scalarization

Explicit SIMD architectures have scalar units:
- Intel Xeon Phi: has good old x86
- AMD GCN GPUs: have scalar units and registers

The scalarization optimization demotes uniform and affine vectors into scalars:
- Vector instructions → scalar instructions
- Vector registers → scalar registers
- SIMT branches → uniform (scalar) branches
- Gather-scatter load-store → vector load-store or broadcast

Divergence analysis guides scalarization: a data-flow analysis on the GSA representation, as sketched below.
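A sketch of the divergence lattice and two of its transfer functions in C, assuming the usual ordering uniform ⊑ affine ⊑ generic (the rule names are ours):

typedef enum { UNIFORM = 0, AFFINE = 1, GENERIC = 2 } tag;

static tag join(tag a, tag b) { return a > b ? a : b; }

/* Addition: uniform+uniform is uniform; affine+affine stays affine
   (bases and strides add); anything involving generic is generic. */
static tag transfer_add(tag a, tag b) { return join(a, b); }

/* Gamma(p ? a : b): if the predicate is divergent (non-uniform), lanes
   pick different sources, so the result is generic even when both
   inputs are uniform. */
static tag transfer_gamma(tag p, tag a, tag b) {
    return (p == UNIFORM) ? join(a, b) : GENERIC;
}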
After scalarization

SIMD+scalar code:

mov i ← 0
loop:
    vload T ← X[i]
    vmul T ← a×T
    vstore X[i] ← T
    add i ← i+16
    branch i<n? loop

Obvious benefits:
- Scalar registers instead of vector registers
- Scalar instructions instead of vector instructions

Less obvious benefits:
- Contiguous vector load and store
- Scalar branches, no masking
- Affine vector → single scalar: the stride has been constant-propagated!
- No dependency between scalar and vector code except through loads and stores: enables decoupling

(Example values: a = 17, i = 0, n = 51 live in scalar registers; only T remains a vector.)
Scalarization across function calls

Which parameters are uniform or affine?

kernel void scale(float a, float * X) {
    // Called for each thread tid
    X[tid] = mul(a, X[tid]);
}

float mul(float u, float v) {
    return u * v;
}

kernel void scale2(float a, float * X) {
    // Called for each thread tid
    mul_ptr(&a, &X[tid]);
}

void mul_ptr(float* u, float *v) {
    *v = (*u) * (*v);
}

It depends on the call site:
- Not visible to the compiler before link-time, or requires interprocedural optimization (expensive)
- Different call sites may have different sets of uniform/affine parameters
Typing-based approach

Used in Intel Cilk+: the programmer qualifies parameters explicitly.

__declspec (vector uniform(u)) float mul(float u, float v)
{
    return u * v;
}

__declspec (vector uniform(u) linear(v)) void mul_ptr(float* u, float *v)
{
    *v = (*u) * (*v);
}

Different variations are C++ function overloads. By default, everything is a generic vector.
No automatic solution!
Outline
Compiling from threads to SIMD
If-conversion
Internal representations: SSA and GSA
Generalized if-conversion
Adding branches back
Scalarization
Scalars in SPMD code
Code generation for scalar+vector architectures
Out-of-order SIMT

SIMT vs. branch prediction

Branch prediction: predicts a single path.

SIMT: the execution flow is a graph, not a sequence.
- Groups of threads can converge and diverge
- 1 path through the graph corresponds to 1 set of threads

Problem: without a total order, how do we roll back?

(Figure: a fetch front and a commit front advance over an execution graph; Path A: t0, t3; Path B: t2; Path C: t1.)
SIMT vs. register renaming

Register renaming eliminates write-after-read dependencies:

add r3 ← r1+r2        add r3a ← r1a+r2a
mul r1 ← r2×r2        mul r1b ← r2a×r2a

SIMT case: partial dependencies.
- Instructions are implicitly masked
- Vector registers are written partially
- A register can have several producers

if(p)                 if(p)
    r1 ← 42               r1a ← 42
else                  else
    r1 ← 17               r1b ← 17
r2 ← r1               r2a ← r1??

Problem: different threads access different registers: there is no unique name.
Baseline CPU architecture

(Figure: a multi-thread frontend (fetch & decode, rename: RAT, thread select, PC) feeds two backends. The high-throughput backend has a vector issue queue, a vector PRF and vector units; the low-latency backend has a scalar issue queue, a scalar PRF and scalar units, plus a branch issue queue and a branch unit.)
Proposed SIMT CPU architecture

(Figure: the same pipeline with the new components highlighted. The multi-thread frontend becomes a path-switching frontend: warp select and path select replace thread select, rename uses the PIRAT instead of the RAT, and a path table feeds the branch unit. The high-throughput vector backend and the low-latency scalar backend are unchanged.)
SIMT vs. branch prediction: the solution

Branch prediction predicts a single path. In SIMT, the execution flow is a graph, not a sequence: groups of threads can converge and diverge, and 1 path through the graph corresponds to 1 set of threads (Path A: t0, t3; Path B: t2; Path C: t1). Problem: without a total order, how do we roll back?

Solution:
- Assign a total order at fetch, restore the same order at commit
- Convergence at fetch, divergence at execution → each thread follows 1 speculative path at a time
- Track masks in a path table
- Masked commit → operations whose mask is entirely null are ignored and produce no effect

Anita Tino, Caroline Collange, André Seznec. SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores. TACO 2020
Path tracking

Each path has a bit mask:
- It represents the set of threads that follow this path
- Initially speculative, by over-approximation
- Fully known once the path's instructions commit

Path registers hold the masks of the paths in flight. They are read and written by convergence and divergence micro-instructions, which turns control flow into data flow:

Convergence: h6 = h4 | h5
Divergence on predicate p: h3 = h1 & p; clear bits p in all paths from h2
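A bitmask sketch of these micro-instructions in C, with one bit per thread of the warp (the helper names are ours):

#include <stdint.h>

typedef uint32_t pathmask;   /* set of threads following a path */

/* Convergence: the merged path is followed by the union of both sets,
   e.g. h6 = converge(h4, h5). */
static pathmask converge(pathmask h4, pathmask h5) {
    return h4 | h5;
}

/* Divergence on predicate p: threads in p leave for a new path
   (h3 = h1 & p) and are cleared from every path past the branch. */
static pathmask diverge(pathmask h1, pathmask p,
                        pathmask *downstream, int n) {
    for (int i = 0; i < n; i++)
        downstream[i] &= ~p;   /* clear bits p in all paths from h2 on */
    return h1 & p;
}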
Path management: example

Initial path taken by all threads: h1 = 1111. In-flight instructions sit between the fetch front and the commit front along the program's execution flow.

Path table:
Path  Mask
h1    1111
Path management: example

Fetching the branch: start a new path h2 = h1 to prepare for a potential divergence.

Path table:
Path  Mask
h1    1111
h2    1111
Path management: example

The branch resolves and threads 1-2 diverge:
- Cancel threads 1-2 in all paths past the branch: h2 &= ~0110
- Start a new path h3 = h1 & 0110

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
Path management: example

A further divergence is predicted likely: h4 = h3.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0110
Path management: example

Divergence: h5 = h3 & 0100; h4 &= ~0100.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0010
h5    0100
Path management: example

Convergence: h6 = h4 | h5.

Path table:
Path  Mask
h1    1111
h2    1001
h3    0110
h4    0010
h5    0100
h6    0110
Path management: example

Free path h1 when the branch commits. Convergence: h7 = h2 | h6.

Path table:
Path  Mask
h2    1001
h3    0110
h4    0010
h5    0100
h6    0110
h7    1111
SIMT vs. renaming: the solution

Register renaming eliminates write-after-read dependencies:

add r3 ← r1+r2        add r3a ← r1a+r2a
mul r1 ← r2×r2        mul r1b ← r2a×r2a

SIMT case: partial dependencies. Instructions are implicitly masked, vector registers are written partially, and a register can have several producers:

if(p)                 if(p)
    r1 ← 42               r1a ← 42
else                  else
    r1 ← 17               r1b ← 17
r2 ← r1               r2a ← r1??

Solution: the PIRAT (Path-Identifying Register Alias Table).
- Different paths may use different physical registers
- Inject merge micro-operations on demand after convergence:

if(p)
    r1a ← 42
else
    r1b ← 17
r2a ← merge(p? r1a : r1b)
Renaming: two paths are enough

How many physical registers per architectural register?
- As many as threads per warp in the worst case: far too many to implement in hardware
- Observation: 2 are enough in practice!
- 1 for the current active path, 1 for the other paths
- On overflow, insert merge micro-operations
- Cost: dependencies between instructions

r1 ← 42               p1 = 42
if(p)                 if(h1)
    r1 ← 17               p2 = 17
r2 ← r1 + 1           p4 = merge(h1? p2 : p1)
                      p5 = p4 + 1

PIRAT:                          Pathtable:
      Other  Active  Mask             Mask
r1    p1     p2      h1         h1    1100
r2    p3

Read as: r1 is in p2 for the threads contained in h1, otherwise in p1.
Renaming: two paths are enough

After the merge:
- The active entry of r1 is freed
- The other entry becomes usable as a source by subsequent instructions

r1 ← 42               p1 = 42
if(p)                 if(h1)
    r1 ← 17               p2 = 17
r2 ← r1 + 1           p4 = merge(h1? p2 : p1)
                      p5 = p4 + 1

PIRAT:                          Pathtable:
      Other  Active  Mask             Mask
r1    p4                        h1    1100
r2    p5                        h2    1111

Read as: r1 is in p4 for all threads. A sketch of a PIRAT entry and its merge follows.
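A sketch of one PIRAT entry and its merge in C, following the tables above (the structure and names are ours, not the hardware layout):

#include <stdint.h>

typedef struct {
    int      other;    /* physical register for the other paths (p1, p4, ...) */
    int      active;   /* physical register for the active path, -1 if none   */
    uint32_t mask;     /* threads covered by the active entry (h1, ...)       */
} pirat_entry;

/* On a third producer, inject a merge micro-op
   p_new = merge(mask ? active : other), then free the active entry:
   the architectural register now lives in p_new for all threads. */
static void merge_entry(pirat_entry *e, int p_new) {
    /* ... emit micro-op: p_new = merge(e->mask ? e->active : e->other) ... */
    e->other  = p_new;
    e->active = -1;
    e->mask   = 0;
}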
Conclusions

Sequential control: branch prediction.
- Why? → to go faster
- How? → break control dependencies

Parallel control: SIMT.
- Why? → to be more efficient
- How? → factor out redundant work

Sequential and parallel control can be combined.

Research direction: formally study the properties of SIMT. The single-assignment concept keeps coming back: out-of-order microarchitectures, SIMT architectures, high-level synthesis.
