MODULE 3

DATA LEVEL PARALLELISM


SIMD EXTENSIONS
 Media applications operate on data types narrower than 32 bits.
 Graphics systems use:
 8 bits for each of the three primary colours
 8 bits for transparency
 Audio samples are represented with 8 or 16 bits.
• Simultaneous operations within one 256-bit register on:
• 32 8-bit operands
• 16 16-bit operands
• 8 32-bit operands
• 4 64-bit operands
• Instruction categories (see the sketch below):
 Unsigned add/subtract
 Maximum/minimum
 Average
 Shift right/left
 Floating point
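• To make the operand widths and the unsigned add category above concrete, here is a minimal sketch using Intel AVX2 intrinsics (the function name is illustrative, and an AVX2-capable x86 compiler and CPU are assumed); a single 256-bit instruction performs 32 8-bit additions at once:

    #include <immintrin.h>   /* Intel AVX/AVX2 intrinsics */

    /* Add two byte arrays; each loop iteration issues one 256-bit add that
       operates on 32 8-bit operands simultaneously. n is assumed to be a
       multiple of 32 to keep the sketch short. */
    void add_bytes(const unsigned char *a, const unsigned char *b,
                   unsigned char *out, int n) {
        for (int i = 0; i < n; i += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            __m256i vc = _mm256_add_epi8(va, vb);   /* 32 simultaneous 8-bit adds */
            _mm256_storeu_si256((__m256i *)(out + i), vc);
        }
    }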
• SIMD limitations/omissions:
 No vector length register: the number of data operands is encoded in the opcode, which led to the addition of hundreds of instructions in the MMX, SSE, and AVX extensions
 No sophisticated addressing modes
 No mask registers
Programming Multimedia SIMD Architectures
• Advanced compilers today can generate SIMD floating-point instructions.
• Programmers must be sure to align all the data in memory to the width of the SIMD unit.
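• A minimal sketch of the alignment point, assuming C11's aligned_alloc and a 256-bit (32-byte) SIMD unit; the function name and sizes are illustrative:

    #include <stdlib.h>

    /* Allocate the arrays aligned to the 32-byte width of the SIMD unit so the
       compiler can emit aligned SIMD loads/stores when it vectorizes the loop.
       n is assumed to be a multiple of 8 so each size is a multiple of 32 bytes. */
    void scaled_add(int n) {
        float *x = aligned_alloc(32, n * sizeof(float));
        float *y = aligned_alloc(32, n * sizeof(float));
        /* ... initialize x and y ... */
        for (int i = 0; i < n; i++)
            y[i] = y[i] + 2.5f * x[i];   /* simple loop a compiler can turn into SIMD FP instructions */
        free(x);
        free(y);
    }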
The Roofline Visual Performance Model
• Used to compare the potential floating-point performance of variations of SIMD architectures.
• Combines floating-point performance, memory performance, and arithmetic intensity in a two-dimensional graph.
• Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed.
• Peak floating-point performance can be found from the hardware specifications.
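• The two bounds combine into a simple formula, attainable GFLOPs/sec = min(peak memory bandwidth × arithmetic intensity, peak floating-point performance); a minimal sketch (the function name and units are illustrative):

    /* Roofline bound: performance is limited either by the memory system or by
       peak floating-point throughput, whichever ceiling is lower. */
    double attainable_gflops(double peak_gflops,       /* peak FP performance, GFLOP/s */
                             double peak_bw_gbs,       /* peak memory bandwidth, GB/s  */
                             double arith_intensity) { /* FLOPs per byte accessed      */
        double memory_bound = peak_bw_gbs * arith_intensity;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }

• For example, a hypothetical machine with a 100 GFLOP/s peak and 25 GB/s of memory bandwidth running a kernel with an arithmetic intensity of 2 FLOPs/byte attains min(25 × 2, 100) = 50 GFLOP/s; it stays memory-bound until the intensity reaches 4 FLOPs/byte.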
Roofline model for the NEC SX-9 vector processor and
the Intel Core i7 920 Multicore Processor
Graphics Processing Units
• GPU computing, when combined with a programming language, made GPUs easier to program.
• The primary ancestors of GPUs are graphics accelerators.
Programming the GPUs
• The challenges for the GPU programmer are:
 Getting good performance on the GPU
 Coordinating the scheduling of computation
on the system processor and the GPU
 The transfer of data between system memory
and GPU memory.
• NVIDIA decided to develop a C-like language and programming environment: CUDA (Compute Unified Device Architecture).
• CUDA produces C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device).
• The unifying theme of all these forms of parallelism is the CUDA Thread.
• CUDA Threads are composed together to utilize the various styles of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism.
• NVIDIA classifies the CUDA programming model as single instruction, multiple thread (SIMT).
• Threads are blocked together and executed in groups of threads, called a Thread Block.
• The hardware that executes a whole block of threads is called a multithreaded SIMD Processor.
• To distinguish between functions for the GPU (device) and functions for the system processor (host), CUDA uses __device__ or __global__ for the former and __host__ for the latter.
• CUDA variables declared with __device__ are allocated to the GPU Memory (see below), which is accessible by all multithreaded SIMD Processors.
• The extended function-call syntax for a function name that runs on the GPU is name<<<dimGrid, dimBlock>>>(… parameter list …), where dimGrid and dimBlock specify the dimensions of the code (in Thread Blocks) and the dimensions of a block (in threads).
• In addition to the identifier for blocks (blockIdx) and the identifier for each thread in a block (threadIdx), CUDA provides a keyword for the number of threads per block (blockDim), which comes from the dimBlock parameter in the preceding bullet.
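• A minimal sketch tying these keywords together: a hypothetical DAXPY-style kernel (the kernel name and the block size of 256 threads are illustrative assumptions, not taken from the slides):

    // Device code: each CUDA Thread computes one element of y = a*x + y.
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index for this CUDA Thread
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host code: launch n CUDA Threads grouped into Thread Blocks of 256 threads each.
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);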
NVIDIA GPU Computational
Structures
• GPUs work well only with data-level parallel
problems.
• GPU processors have more registers than do
vector processors.
• GPUs implement certain features in hardware
that vector processors would implement in
software.
GPU Terms
The mapping of a Grid (vectorizable loop), Thread Blocks (SIMD basic blocks), and threads of
SIMD instructions to a vector-vector multiply, with each vector being 8192 elements long.
• Each thread of SIMD instructions calculates 32 elements per instruction,
and in this example, each Thread Block contains 16 threads of SIMD
instructions and the Grid contains 16 Thread Blocks.
• The hardware Thread Block Scheduler assigns Thread Blocks to
multithreaded SIMD Processors, and the hardware Thread Scheduler picks
which thread of SIMD instructions to run each clock cycle within a SIMD
Processor.
• Only SIMD Threads in the same Thread Block can communicate via local
memory. (The maximum number of SIMD Threads that can execute
simultaneously per Thread Block is 32 for Pascal GPUs.)
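• Checking the arithmetic of the Grid mapping above: 16 Thread Blocks × 16 threads of SIMD instructions per block × 32 elements per thread of SIMD instructions = 8192 elements, which matches the vector length.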
• GPU hardware has two levels of hardware schedulers:
 (1) the Thread Block Scheduler that assigns Thread Blocks
(bodies of vectorized loops) to multithreaded SIMD
Processors
 (2) the SIMD Thread Scheduler within a SIMD Processor,
which schedules when threads of SIMD instructions should
run.
• The SIMD instructions of these threads are 32 wide.
• Each thread of SIMD instructions in this example would compute 32 of the elements of the computation.
• The SIMD Processor must have parallel functional units, called SIMD Lanes, to perform the operation.
• With the Pascal GPU, each 32-wide thread of SIMD
instructions is mapped to 16 physical SIMD Lanes
• Each SIMD instruction in a thread of SIMD
instructions takes 2 clock cycles to complete.
• The number of lanes in a GPU SIMD Processor can
be anything up to the number of threads in a
Thread Block, just as the number of lanes in a
vector processor can vary between 1 and the
maximum vector length.
• The SIMD Thread Scheduler can pick whatever
thread of SIMD instructions is ready, and need not
stick with the next SIMD instruction in the sequence
within a thread.
• Scoreboard: keeps track of up to 64 threads of SIMD instructions to see which SIMD instruction is ready to go.
• Each multithreaded SIMD Processor must load 32 elements of two vectors from memory into registers, perform the multiply by reading and writing registers, and store the product back from registers into memory.
• To hold these memory elements, a SIMD Processor has an impressive 32,768–65,536 32-bit registers, depending on the model of the Pascal GPU.
• Just like a vector processor, these registers are divided logically across the Vector Lanes or, in this case, the SIMD Lanes.
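• As an illustrative calculation from the figures above: 65,536 32-bit registers divided across 16 SIMD Lanes gives 4,096 registers per lane, and these are shared among all the threads of SIMD instructions resident on that SIMD Processor.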
NVIDIA GPU Instruction Set Architecture
 The instruction set is an abstraction of the hardware instruction set.
 PTX (Parallel Thread Execution) provides a stable instruction set for compilers as well as compatibility across generations of GPUs.
 The hardware instruction set is hidden from the programmer.
 PTX instructions describe the operations on a single CUDA Thread and usually map one-to-one with hardware instructions.
 PTX uses an unlimited number of write-once registers, and the compiler must run a register allocation procedure to map the PTX registers to a fixed number of read-write hardware registers available on the actual device.
 The optimizer runs subsequently and can reduce register use even further; this optimizer also eliminates dead code.
• The format of a PTX instruction is
opcode.type d, a, b, c;
• where d is the destination operand and a, b, and c are source operands.
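• A minimal sketch in this format for one element of a DAXPY-style computation (the register names are illustrative, and the addressing and rounding details are simplified):

    ld.global.f64  RD0, [X+R8]     // RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]     // RD2 = Y[i]
    mul.f64        RD0, RD0, RD4   // RD0 = RD0 * RD4 (RD4 holds the scalar a)
    add.f64        RD0, RD0, RD2   // RD0 = RD0 + RD2
    st.global.f64  [Y+R8], RD0     // Y[i] = a * X[i] + Y[i]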
Conditional Branching in GPUs
• There are strong similarities between how vector architectures handle IF statements (in software) and how GPUs handle them (in hardware).
• At the PTX assembler level, control flow of one CUDA Thread is
described by the PTX instructions branch, call, return, and exit, plus
individual per-thread-lane predication of each instruction, specified by the
programmer with per-thread-lane 1-bit predicate registers.
• The PTX assembler analyzes the PTX branch graph and optimizes it to the
fastest GPU hardware instruction sequence.
• Each CUDA Thread can make its own decision on a branch and does not need to proceed in lockstep.
• At the GPU hardware instruction level, control flow includes branch, jump, jump
indexed, call, call indexed, return, exit, and special instructions that manage the
branch synchronization stack.
• GPU hardware provides each SIMD Thread with its own stack; a stack entry
contains an identifier token, a target instruction address, and a target thread-active
mask.
• There are GPU special instructions that push stack entries for a SIMD Thread and
special instructions and instruction markers that pop a stack entry or unwind the
stack to a specified entry and branch to the target instruction address with the
target thread-active mask.
• GPU hardware instructions also have an individual per-lane predication
(enable/disable), specified with a 1-bit predicate register for each lane.
• The PTX assembler identifies loop branches and generates GPU branch instructions that branch to the top of the loop.
• GPU indexed jump and indexed call instructions push entries on the stack so that when all lanes complete the switch statement or function call, the SIMD Thread converges.
• A GPU set predicate instruction evaluates the conditional part of the IF
statement.
• The SIMD instructions in the threads inside the THEN part of the IF
statement broadcast operations to all the SIMD Lanes.
• Those lanes with the predicate set to 1 perform the operation and store
the result, and the other SIMD Lanes don’t perform an operation or store
a result.
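• A minimal sketch of such an IF inside a CUDA kernel (the kernel name is an illustrative assumption): the comparison compiles to a set predicate instruction, and only the lanes whose predicate is 1 perform the store.

    // Clamp negative elements to zero; threads whose element is non-negative do nothing.
    __global__ void clamp_negative(int n, double *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] < 0.0)   // evaluated per lane into a 1-bit predicate register
            x[i] = 0.0;            // store executed only by lanes with predicate = 1
    }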
NVIDIA GPU Memory Structures
• Each SIMD Lane in a multithreaded SIMD Processor is given a private section of off-
chip DRAM, which we call the private memory.
• Local memory is limited in size, typically to 48 KiB
• The multithreaded SIMD Processor dynamically allocates portions of the local
memory to a Thread Block when it creates the Thread Block.
• The system processor, called the host, can read or write GPU Memory.
• Local memory is unavailable to the host, as it is private to each multithreaded SIMD Processor.
• GPUs traditionally use smaller streaming caches.
• To improve memory bandwidth and reduce overhead, as mentioned, PTX data
transfer instructions in cooperation with the memory controller coalesce individual
parallel thread requests from the same SIMD Thread together into a single
memory block request when the addresses fall in the same block.
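• A minimal sketch of the coalescing point (the kernel name and the stride parameter are illustrative): adjacent CUDA Threads of the same SIMD Thread should touch adjacent addresses so that their requests fall in the same memory block.

    __global__ void copy_example(int n, int stride, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = x[i];              // coalesced: the 32 lanes read 32 consecutive words -> one block request
            // y[i] = x[i * stride];  // strided (stride > 1): addresses spread across blocks, requests cannot be merged
        }
    }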
Innovations in the Pascal GPU Architecture
• Each new generation of GPU typically adds some new features that
increase performance or make it easier for programmers.
• Fast single-precision, double-precision, and half-precision floating-point arithmetic.
• High-bandwidth memory
• High-speed chip-to-chip interconnect.
• Unified virtual memory and paging support.
Similarities and Differences Between Vector
Architectures and GPUs
• Similarity: Both include Data Level Parallelism
• Major Difference: multithreading, which is
fundamental to GPUs and missing from most
vector processors.
Similarities and Differences Between Multimedia
SIMD Computers and GPUs
Detecting and Enhancing Loop-Level
Parallelism
• Compiler technology used for discovering the amount of parallelism that
we can exploit in a program.
• Loop-level parallelism is normally investigated at the source level.
• Loop-level analysis involves determining what dependences exist among
the operands in a loop across the iterations of that loop.
• The analysis of loop-level parallelism focuses on determining whether data
accesses in later iterations are dependent on data values produced in
earlier iterations; such dependence is called a loop-carried dependence.
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
 In this loop, the two uses of x[i] are dependent, but this dependence is within a single iteration and is not loop-carried.
 There is a loop-carried dependence between successive uses of i in different iterations, but it involves the induction variable and is easily recognized and eliminated, so the loop is still parallelizable.
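• For contrast, the following loop does carry a dependence across iterations, so its iterations cannot simply be executed in parallel:

    for (i = 1; i < 1000; i = i + 1)
        x[i] = x[i-1] + s;   /* iteration i uses the value produced by iteration i-1 */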
• Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, a compiler can do this analysis more easily at or near the source level than at the machine-code level.
Finding Dependences
• Finding the dependences in a program is important both to determine
which loops might contain parallelism and to eliminate name
dependences.
• The complexity of dependence analysis arises also because of the
presence of arrays and pointers in languages such as C or C++, or pass-by-
reference parameter passing in Fortran.
• How does the compiler detect dependences in general?
• Assume that array indices are affine.
• In simplest terms, a one-dimensional array index is affine if it can be
written in the form ai+b, where a and b are constants and i is the loop
index variable
• The index of a multidimensional array is affine if the index in each
dimension is affine.
• A dependence exists if two conditions hold:
 1. There are two iteration indices, j and k, that are both within the limits of the for loop; that is, m ≤ j ≤ n and m ≤ k ≤ n.
 2. The loop stores into an array element indexed by a*j+b and later fetches from that same array element when it is indexed by c*k+d; that is, a*j+b = c*k+d.
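• An illustrative application of these conditions (the loop is hypothetical):

    for (i = 0; i < 100; i = i + 1)
        x[2*i+3] = x[2*i] * 5.0;   /* store index: 2*i+3, load index: 2*i */

• Here the store index is 2*j+3 (a=2, b=3) and the load index is 2*k (c=2, d=0). A loop-carried dependence would require 2*j+3 = 2*k for some iterations j and k, i.e., 2*(k-j) = 3; the left side is even and the right side is odd, so no such integers exist and the loop carries no dependence through x.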
Eliminating Dependent Computations
• In general, we cannot always determine whether a dependence exists at compile time.
• One of the most important forms of dependent computation is a recurrence.
• Although such a loop is not parallel, it has a very specific structure called a reduction.
• Reductions are also a key part of the primary parallelism primitive MapReduce used in warehouse-scale computers.
• In general, any function can be used as a reduction operator, and common cases include operators such as max and min.
• Reductions are sometimes handled by special hardware in a
vector and SIMD architecture that allows the reduce step to
be done much faster than it could be done in scalar mode.
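• A sketch of a reduction, the sum of the elements of an array:

    sum = 0.0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + x[i];   /* loop-carried dependence on sum, but because addition is
                               (treated as) associative, partial sums can be computed in
                               parallel and combined at the end */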