MODULE 3
DATA LEVEL PARALLELISM
SIMD EXTENSIONS
• Media applications operate on data types narrower than 32 bits.
• Graphics systems use 8 bits for each of the three primary colours and 8 bits for transparency.
• Audio samples are represented with 8 or 16 bits.
• A 256-bit SIMD unit can therefore perform simultaneous operations on:
  • 32 8-bit operands
  • 16 16-bit operands
  • 8 32-bit operands
  • 4 64-bit operands
• Instruction categories: unsigned add/subtract, maximum/minimum, average, shift right/left, floating point.
• SIMD limitations/omissions:
  • No vector length register: the number of data operands is encoded into the opcode, which led to the addition of hundreds of instructions in the MMX, SSE, and AVX extensions.
  • No sophisticated addressing modes (such as strided or gather-scatter accesses).
  • No mask registers.

Programming Multimedia SIMD Architectures
• Advanced compilers today can generate SIMD floating-point instructions.
• Programmers must be sure to align all the data in memory to the width of the SIMD unit, as illustrated in the sketch below.
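As a concrete illustration of both bullets, here is a minimal sketch (our own example, not from the slides) that uses x86 AVX2 intrinsics to add 32 unsigned 8-bit operands with one SIMD instruction, with the arrays aligned to the 256-bit SIMD width. Compile with AVX2 enabled (e.g., -mavx2).

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        // alignas(32) matches the data alignment to the 256-bit SIMD unit.
        alignas(32) unsigned char a[32], b[32], c[32];
        for (int i = 0; i < 32; i++) { a[i] = i; b[i] = 2 * i; }

        __m256i va = _mm256_load_si256((const __m256i*)a);  // aligned 256-bit load
        __m256i vb = _mm256_load_si256((const __m256i*)b);
        __m256i vc = _mm256_add_epi8(va, vb);                // 32 simultaneous 8-bit adds
        _mm256_store_si256((__m256i*)c, vc);                 // aligned 256-bit store

        printf("%d\n", c[31]);  // 31 + 62 = 93
        return 0;
    }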
The Roofline Visual Performance Model
• Used to compare the potential floating-point performance of variations of SIMD architectures.
• Combines floating-point performance, memory performance, and arithmetic intensity in a two-dimensional graph.
• Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed.
• Peak floating-point performance can be found from the hardware specification.
• The attainable performance is the minimum of the peak floating-point performance and the peak memory bandwidth multiplied by the arithmetic intensity, as computed in the sketch below.
Figure: Roofline models for the NEC SX-9 vector processor and the Intel Core i7 920 multicore processor.
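A minimal sketch of how that roofline bound is computed; the peak numbers used here (102.4 GFLOP/s and 25.6 GB/s) are hypothetical machine parameters, not figures from the slides.

    #include <algorithm>
    #include <cstdio>

    // Attainable GFLOP/s = min(peak floating-point performance,
    //                          peak memory bandwidth * arithmetic intensity)
    double attainable_gflops(double peak_gflops, double peak_gb_per_s,
                             double arithmetic_intensity /* FLOPs per byte */) {
        return std::min(peak_gflops, peak_gb_per_s * arithmetic_intensity);
    }

    int main() {
        const double ais[] = {0.25, 1.0, 4.0, 16.0};
        for (double ai : ais)
            printf("AI = %5.2f FLOP/byte -> bound = %6.1f GFLOP/s\n",
                   ai, attainable_gflops(102.4, 25.6, ai));
        return 0;
    }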
Graphics Processing Units
• GPU computing, when combined with a programming language, made GPUs easier to program.
• The primary ancestors of GPUs are graphics accelerators.

Programming the GPUs
• The challenges for the GPU programmer are:
  • Getting good performance on the GPU.
  • Coordinating the scheduling of computation on the system processor and the GPU.
  • The transfer of data between system memory and GPU memory.
• NVIDIA decided to develop a C-like language and programming environment: CUDA (Compute Unified Device Architecture).
• CUDA produces C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device).
• The unifying theme of all these forms of parallelism is the CUDA Thread.
• CUDA Threads are grouped together to utilize the various styles of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism.
• NVIDIA classifies the CUDA programming model as single instruction, multiple thread (SIMT).
• Threads are blocked together and executed in groups of threads, called a Thread Block.
• The hardware that executes a whole block of threads is called a multithreaded SIMD Processor.
• To distinguish between functions for the GPU (device) and functions for the system processor (host), CUDA uses __device__ or __global__ for the former and __host__ for the latter.
• CUDA variables declared with __device__ are allocated to GPU Memory (see below), which is accessible by all multithreaded SIMD Processors.
• The extended function call syntax for a function name that runs on the GPU is name<<<dimGrid, dimBlock>>>(... parameter list ...), where dimGrid and dimBlock specify the dimensions of the code (in Thread Blocks) and the dimensions of a block (in threads).
• In addition to the identifier for blocks (blockIdx) and the identifier for each thread in a block (threadIdx), CUDA provides a keyword for the number of threads per block (blockDim), which comes from the dimBlock parameter in the preceding bullet.

NVIDIA GPU Computational Structures
• GPUs work well only with data-level parallel problems.
• GPU processors have more registers than vector processors do.
• GPUs implement certain features in hardware that vector processors would implement in software.

GPU Terms
• Consider the mapping of a Grid (vectorizable loop), Thread Blocks (SIMD basic blocks), and threads of SIMD instructions to a vector-vector multiply, with each vector being 8192 elements long (see the sketch after this list).
• Each thread of SIMD instructions calculates 32 elements per instruction; in this example, each Thread Block contains 16 threads of SIMD instructions and the Grid contains 16 Thread Blocks.
• The hardware Thread Block Scheduler assigns Thread Blocks to multithreaded SIMD Processors, and the hardware SIMD Thread Scheduler picks which thread of SIMD instructions to run each clock cycle within a SIMD Processor.
• Only SIMD Threads in the same Thread Block can communicate via local memory. (The maximum number of SIMD Threads that can execute simultaneously per Thread Block is 32 for Pascal GPUs.)
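The sketch below (function and variable names are ours) writes this 8192-element vector-vector multiply as a CUDA kernel, launched with dimGrid = 16 Thread Blocks and dimBlock = 512 CUDA Threads per block, i.e., 16 threads of SIMD instructions of 32 lanes each, matching the mapping described above.

    #include <cuda_runtime.h>

    // Each CUDA Thread computes one element: blockIdx, blockDim, and threadIdx
    // together give the thread a unique index into the 8192-element vectors.
    __global__ void vecmul(const double *a, const double *b, double *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] * b[i];
    }

    // Host-side code: allocate GPU Memory, copy the inputs from system memory,
    // launch with the name<<<dimGrid, dimBlock>>>(...) syntax, and copy back.
    // n is assumed to be a multiple of 512 (8192 in this example).
    void run(const double *h_a, const double *h_b, double *h_c, int n) {
        double *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(double);
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // system memory -> GPU Memory
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        vecmul<<<n / 512, 512>>>(d_a, d_b, d_c, n);             // 16 Thread Blocks x 512 threads
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // GPU Memory -> system memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }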
• GPU hardware has two levels of hardware schedulers: (1) the Thread Block Scheduler, which assigns Thread Blocks (bodies of vectorized loops) to multithreaded SIMD Processors, and (2) the SIMD Thread Scheduler within a SIMD Processor, which schedules when threads of SIMD instructions should run.
• The SIMD instructions of these threads are 32 wide, so each thread of SIMD instructions in this example computes 32 of the elements of the computation.
• The SIMD Processor must have parallel functional units, called SIMD Lanes, to perform the operation.
• With the Pascal GPU, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes 2 clock cycles to complete (see the sketch below).
• The number of lanes in a GPU SIMD Processor can be anything up to the number of threads in a Thread Block, just as the number of lanes in a vector processor can vary between 1 and the maximum vector length.
• The SIMD Thread Scheduler can pick whatever thread of SIMD instructions is ready, and need not stick with the next SIMD instruction in the sequence within a thread.
• Scoreboard: used to keep track of up to 64 threads of SIMD instructions to see which SIMD instruction is ready to go.
• Each multithreaded SIMD Processor must load 32 elements of two vectors from memory into registers, perform the multiply by reading and writing registers, and store the product back from registers into memory.
• To hold these memory elements, a SIMD Processor has an impressive 32,768–65,536 32-bit registers, depending on the model of the Pascal GPU.
• Just like a vector processor, these registers are divided logically across the Vector Lanes or, in this case, the SIMD Lanes.
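A tiny worked example, using only the constants from the bullets above, of how a 32-wide thread of SIMD instructions executes on 16 physical SIMD Lanes and how many SIMD threads the scoreboard keeps in flight.

    #include <cstdio>

    int main() {
        const int simd_width   = 32;   // elements per SIMD instruction
        const int simd_lanes   = 16;   // physical SIMD Lanes per SIMD Processor (Pascal)
        const int simd_threads = 64;   // threads of SIMD instructions tracked by the scoreboard

        printf("clock cycles per SIMD instruction: %d\n",
               simd_width / simd_lanes);                    // 32 / 16 = 2
        printf("elements covered by the scheduled SIMD threads: %d\n",
               simd_threads * simd_width);                  // 64 * 32 = 2048
        return 0;
    }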
NVIDIA GPU Instruction Set Architecture
• PTX (Parallel Thread Execution) is an abstraction of the hardware instruction set.
• PTX provides a stable instruction set for compilers as well as compatibility across generations of GPUs; the hardware instruction set is hidden from the programmer.
• PTX instructions describe the operations of a single CUDA Thread and usually map one-to-one with hardware instructions.
• PTX uses an unlimited number of write-once registers, so the compiler must run a register allocation procedure to map the PTX registers to the fixed number of read-write hardware registers available on the actual device. The optimizer runs subsequently and can reduce register use even further; it also eliminates dead code.
• The format of a PTX instruction is opcode.type d, a, b, c; where d is the destination operand and a, b, and c are source operands (for example, add.f64 RD0, RD0, RD2).

Conditional Branching in GPUs
• There are strong similarities between how vector architectures handle IF statements (in software) and how GPUs handle them (in hardware).
• At the PTX assembler level, the control flow of one CUDA Thread is described by the PTX instructions branch, call, return, and exit, plus individual per-thread-lane predication of each instruction, specified by the programmer with per-thread-lane 1-bit predicate registers.
• The PTX assembler analyzes the PTX branch graph and optimizes it to the fastest GPU hardware instruction sequence.
• Each CUDA Thread can make its own decision on a branch and does not need to be in lockstep.
• At the GPU hardware instruction level, control flow includes branch, jump, jump indexed, call, call indexed, return, exit, and special instructions that manage the branch synchronization stack.
• GPU hardware provides each SIMD Thread with its own stack; a stack entry contains an identifier token, a target instruction address, and a target thread-active mask.
• There are GPU special instructions that push stack entries for a SIMD Thread, and special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask.
• GPU hardware instructions also have individual per-lane predication (enable/disable), specified with a 1-bit predicate register for each lane.
• The PTX assembler identifies loop branches and generates GPU branch instructions that branch to the top of the loop.
• GPU indexed jump and indexed call instructions push entries on the stack so that when all lanes complete the switch statement or function call, the SIMD Thread converges.
• A GPU set-predicate instruction evaluates the conditional part of the IF statement.
• The SIMD instructions in the threads inside the THEN part of the IF statement broadcast operations to all the SIMD Lanes.
• Those lanes with the predicate set to 1 perform the operation and store the result; the other SIMD Lanes don't perform an operation or store a result.

NVIDIA GPU Memory Structures
• Each SIMD Lane in a multithreaded SIMD Processor is given a private section of off-chip DRAM, which we call the private memory.
• Local memory is on-chip, shared by the SIMD Lanes within a multithreaded SIMD Processor, and limited in size, typically to 48 KiB.
• The multithreaded SIMD Processor dynamically allocates portions of the local memory to a Thread Block when it creates the Thread Block.
• The system processor, called the host, can read or write GPU Memory.
• Local memory is unavailable to the host, as it is private to each multithreaded SIMD Processor.
• GPUs traditionally use smaller streaming caches.
• To improve memory bandwidth and reduce overhead, PTX data transfer instructions, in cooperation with the memory controller, coalesce individual parallel thread requests from the same SIMD Thread into a single memory block request when the addresses fall in the same block. (The sketch below labels these memory spaces in a small CUDA kernel.)
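To make the memory spaces concrete, here is a small hypothetical CUDA kernel (names and sizes are ours). Note the terminology mapping: what the text calls local memory is CUDA's __shared__ memory, GPU Memory is CUDA's global memory, and per-lane private memory holds spilled per-thread values.

    #include <cuda_runtime.h>
    #define BLOCK 256

    // GPU Memory: visible to all multithreaded SIMD Processors; the host can
    // read or write it (e.g., with cudaMemcpyToSymbol / cudaMemcpyFromSymbol).
    __device__ double gpu_table[1024];

    // Launched as: scale_and_reduce<<<numBlocks, BLOCK>>>(in, out, n);
    __global__ void scale_and_reduce(const double *in, double *out, int n) {
        // Local memory: allocated per Thread Block when the block is created,
        // shared by its SIMD Lanes, and not accessible from the host.
        __shared__ double local_buf[BLOCK];

        // Per-thread scalars live in registers or, if spilled, in the off-chip
        // private memory of the SIMD Lane running this CUDA Thread.
        double x = 0.0;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x = in[i] * gpu_table[i % 1024];

        local_buf[threadIdx.x] = x;
        __syncthreads();   // threads in the same block communicate via local memory

        if (threadIdx.x == 0) {
            double sum = 0.0;
            for (int j = 0; j < BLOCK; j++) sum += local_buf[j];
            out[blockIdx.x] = sum;   // result written back to GPU Memory
        }
    }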
Innovations in the Pascal GPU Architecture
• Each new generation of GPU typically adds new features that increase performance or make it easier for programmers, for example:
  • Fast single-precision, double-precision, and half-precision floating-point arithmetic.
  • High-bandwidth memory.
  • High-speed chip-to-chip interconnect.
  • Unified virtual memory and paging support.

Similarities and Differences Between Vector Architectures and GPUs
• Similarity: both exploit data-level parallelism.
• Major difference: multithreading, which is fundamental to GPUs and missing from most vector processors.

Similarities and Differences Between Multimedia SIMD Computers and GPUs

Detecting and Enhancing Loop-Level Parallelism
• Compiler technology is used for discovering the amount of parallelism that we can exploit in a program.
• Loop-level parallelism is normally investigated at the source level.
• Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop.
• The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Consider the loop
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• In this loop, the two uses of x[i] are dependent, but the dependence is within a single iteration and is not loop-carried.
• There is a loop-carried dependence between successive uses of i in different iterations, but this dependence involves the induction variable and is easily recognized and eliminated.
• Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, a compiler can do this analysis more easily at or near the source level, in contrast to the machine-code level.

Finding Dependences
• Finding the dependences in a program is important both to determine which loops might contain parallelism and to eliminate name dependences.
• The complexity of dependence analysis arises also because of the presence of arrays and pointers in languages such as C or C++, or pass-by-reference parameter passing in Fortran.
• How does the compiler detect dependences in general? Assume that array indices are affine.
• In simplest terms, a one-dimensional array index is affine if it can be written in the form a*i + b, where a and b are constants and i is the loop index variable.
• The index of a multidimensional array is affine if the index in each dimension is affine.
• A dependence exists if two conditions hold (a brute-force check of these conditions is sketched below):
  1. There are two iteration indices, j and k, that are both within the limits of the for loop; that is, m ≤ j ≤ n and m ≤ k ≤ n.
  2. The loop stores into an array element indexed by a*j + b and later fetches from that same array element when it is indexed by c*k + d; that is, a*j + b = c*k + d.
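A brute-force illustration of the two conditions (a real compiler would use a symbolic test such as the GCD test rather than enumeration). The loop bounds and the coefficients a=2, b=3, c=2, d=0, which correspond to a hypothetical loop that stores to x[2*i+3] and reads x[2*i], are our own example.

    #include <cstdio>

    // Returns true if some iteration j stores to a*j+b and the same or a later
    // iteration k fetches c*k+d from the same element, for m <= j, k <= n.
    bool has_dependence(int a, int b, int c, int d, int m, int n) {
        for (int j = m; j <= n; j++)
            for (int k = j; k <= n; k++)
                if (a * j + b == c * k + d)
                    return true;    // both conditions hold -> dependence
        return false;
    }

    int main() {
        // 2*j + 3 is always odd and 2*k is always even, so they never match:
        printf("%s\n", has_dependence(2, 3, 2, 0, 0, 99) ? "dependent"
                                                         : "no dependence");
        return 0;
    }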
Eliminating Dependent Computations
• Often we cannot determine whether a dependence exists at compile time.
• One of the most important forms of dependent computation is a recurrence.
• Although such a loop is not parallel, it has a very specific structure called a reduction (see the sketch below).
• Reductions are also a key part of the primary parallelism primitive, MapReduce, used in warehouse-scale computers.
• In general, any associative function can be used as a reduction operator; common cases include operators such as max and min.
• Reductions are sometimes handled by special hardware in a vector and SIMD architecture that allows the reduce step to be done much faster than it could be done in scalar mode.
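A minimal sketch of the reduction structure mentioned above (our own dot-product example). The only loop-carried dependence is on sum; because addition is associative, the iterations can be split into partial sums and combined in parallel by the compiler, the programmer, or special reduction hardware.

    #include <cstdio>

    int main() {
        double x[1000], y[1000];
        for (int i = 0; i < 1000; i++) { x[i] = 1.0; y[i] = 2.0; }

        double sum = 0.0;
        for (int i = 0; i < 1000; i++)
            sum = sum + x[i] * y[i];    // recurrence on sum -> a sum reduction

        printf("%f\n", sum);            // 1000 * (1.0 * 2.0) = 2000.0
        return 0;
    }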