MODULE 3
DATA LEVEL PARALLELISM
SIMD EXTENSIONS
• Media applications operate on data types narrower than 32 bits.
• Graphics systems use 8 bits for each of the three primary colours and 8 bits for transparency.
• Audio samples are represented with 8 or 16 bits.
• A 256-bit SIMD unit can therefore perform simultaneous operations on:
  • 32 8-bit operands
  • 16 16-bit operands
  • 8 32-bit operands
  • 4 64-bit operands
• Instruction categories: unsigned add/subtract, maximum/minimum, average, shift right/left, floating point.
• SIMD limitations/omissions:
  • No vector length register: the number of data operands is encoded into the opcode, which led to the addition of hundreds of instructions in the MMX, SSE, and AVX extensions.
  • No sophisticated addressing modes (such as strided or gather-scatter accesses).
  • No mask registers.

Programming Multimedia SIMD Architectures
• Advanced compilers today can generate SIMD floating-point instructions.
• Programmers must be sure to align all the data in memory to the width of the SIMD unit, as illustrated in the sketch below.
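As a concrete illustration of both bullets, here is a minimal sketch (our own example, not from the slides) that uses x86 AVX2 intrinsics to add 32 unsigned 8-bit operands with one SIMD instruction, with the arrays aligned to the 256-bit SIMD width. Compile with AVX2 enabled (e.g., -mavx2).

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        // alignas(32) matches the data alignment to the 256-bit SIMD unit.
        alignas(32) unsigned char a[32], b[32], c[32];
        for (int i = 0; i < 32; i++) { a[i] = i; b[i] = 2 * i; }

        __m256i va = _mm256_load_si256((const __m256i*)a);  // aligned 256-bit load
        __m256i vb = _mm256_load_si256((const __m256i*)b);
        __m256i vc = _mm256_add_epi8(va, vb);                // 32 simultaneous 8-bit adds
        _mm256_store_si256((__m256i*)c, vc);                 // aligned 256-bit store

        printf("%d\n", c[31]);  // 31 + 62 = 93
        return 0;
    }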
The Roofline Visual Performance Model
• Used to compare the potential floating-point performance of variations of SIMD architectures.
• Combines floating-point performance, memory performance, and arithmetic intensity in a two-dimensional graph.
• Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed.
• Peak floating-point performance can be found from the hardware specification.
• The attainable performance is the minimum of the peak floating-point performance and the peak memory bandwidth multiplied by the arithmetic intensity, as computed in the sketch below.
Figure: Roofline models for the NEC SX-9 vector processor and the Intel Core i7 920 multicore processor.
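A minimal sketch of how that roofline bound is computed; the peak numbers used here (102.4 GFLOP/s and 25.6 GB/s) are hypothetical machine parameters, not figures from the slides.

    #include <algorithm>
    #include <cstdio>

    // Attainable GFLOP/s = min(peak floating-point performance,
    //                          peak memory bandwidth * arithmetic intensity)
    double attainable_gflops(double peak_gflops, double peak_gb_per_s,
                             double arithmetic_intensity /* FLOPs per byte */) {
        return std::min(peak_gflops, peak_gb_per_s * arithmetic_intensity);
    }

    int main() {
        const double ais[] = {0.25, 1.0, 4.0, 16.0};
        for (double ai : ais)
            printf("AI = %5.2f FLOP/byte -> bound = %6.1f GFLOP/s\n",
                   ai, attainable_gflops(102.4, 25.6, ai));
        return 0;
    }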
Graphics Processing Units
• GPU computing, when combined with a programming language, made GPUs easier to program.
• The primary ancestors of GPUs are graphics accelerators.

Programming the GPUs
• The challenges for the GPU programmer are:
  • Getting good performance on the GPU.
  • Coordinating the scheduling of computation on the system processor and the GPU.
  • The transfer of data between system memory and GPU memory.
• NVIDIA decided to develop a C-like language and programming environment: CUDA (Compute Unified Device Architecture).
• CUDA produces C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device).
• The unifying theme of all these forms of parallelism is the CUDA Thread.
• CUDA Threads are grouped together to utilize the various styles of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism.
• NVIDIA classifies the CUDA programming model as single instruction, multiple thread (SIMT).
• Threads are blocked together and executed in groups of threads, called a Thread Block.
• The hardware that executes a whole block of threads is called a multithreaded SIMD Processor.
• To distinguish between functions for the GPU (device) and functions for the system processor (host), CUDA uses __device__ or __global__ for the former and __host__ for the latter.
• CUDA variables declared with __device__ are allocated to GPU Memory (see below), which is accessible by all multithreaded SIMD Processors.
• The extended function call syntax for a function name that runs on the GPU is name<<<dimGrid, dimBlock>>>(... parameter list ...), where dimGrid and dimBlock specify the dimensions of the code (in Thread Blocks) and the dimensions of a block (in threads).
• In addition to the identifier for blocks (blockIdx) and the identifier for each thread in a block (threadIdx), CUDA provides a keyword for the number of threads per block (blockDim), which comes from the dimBlock parameter in the preceding bullet.

NVIDIA GPU Computational Structures
• GPUs work well only with data-level parallel problems.
• GPU processors have more registers than vector processors do.
• GPUs implement certain features in hardware that vector processors would implement in software.

GPU Terms
• Consider the mapping of a Grid (vectorizable loop), Thread Blocks (SIMD basic blocks), and threads of SIMD instructions to a vector-vector multiply, with each vector being 8192 elements long (see the sketch after this list).
• Each thread of SIMD instructions calculates 32 elements per instruction; in this example, each Thread Block contains 16 threads of SIMD instructions and the Grid contains 16 Thread Blocks.
• The hardware Thread Block Scheduler assigns Thread Blocks to multithreaded SIMD Processors, and the hardware SIMD Thread Scheduler picks which thread of SIMD instructions to run each clock cycle within a SIMD Processor.
• Only SIMD Threads in the same Thread Block can communicate via local memory. (The maximum number of SIMD Threads that can execute simultaneously per Thread Block is 32 for Pascal GPUs.)
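The sketch below (function and variable names are ours) writes this 8192-element vector-vector multiply as a CUDA kernel, launched with dimGrid = 16 Thread Blocks and dimBlock = 512 CUDA Threads per block, i.e., 16 threads of SIMD instructions of 32 lanes each, matching the mapping described above.

    #include <cuda_runtime.h>

    // Each CUDA Thread computes one element: blockIdx, blockDim, and threadIdx
    // together give the thread a unique index into the 8192-element vectors.
    __global__ void vecmul(const double *a, const double *b, double *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] * b[i];
    }

    // Host-side code: allocate GPU Memory, copy the inputs from system memory,
    // launch with the name<<<dimGrid, dimBlock>>>(...) syntax, and copy back.
    // n is assumed to be a multiple of 512 (8192 in this example).
    void run(const double *h_a, const double *h_b, double *h_c, int n) {
        double *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(double);
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // system memory -> GPU Memory
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        vecmul<<<n / 512, 512>>>(d_a, d_b, d_c, n);             // 16 Thread Blocks x 512 threads
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // GPU Memory -> system memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }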
• GPU hardware has two levels of hardware schedulers: (1) the Thread Block Scheduler, which assigns Thread Blocks (bodies of vectorized loops) to multithreaded SIMD Processors, and (2) the SIMD Thread Scheduler within a SIMD Processor, which schedules when threads of SIMD instructions should run.
• The SIMD instructions of these threads are 32 wide, so each thread of SIMD instructions in this example computes 32 of the elements of the computation.
• The SIMD Processor must have parallel functional units, called SIMD Lanes, to perform the operation.
• With the Pascal GPU, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes 2 clock cycles to complete (see the sketch below).
• The number of lanes in a GPU SIMD Processor can be anything up to the number of threads in a Thread Block, just as the number of lanes in a vector processor can vary between 1 and the maximum vector length.
• The SIMD Thread Scheduler can pick whatever thread of SIMD instructions is ready, and need not stick with the next SIMD instruction in the sequence within a thread.
• Scoreboard: used to keep track of up to 64 threads of SIMD instructions to see which SIMD instruction is ready to go.
• Each multithreaded SIMD Processor must load 32 elements of two vectors from memory into registers, perform the multiply by reading and writing registers, and store the product back from registers into memory.
• To hold these memory elements, a SIMD Processor has an impressive 32,768–65,536 32-bit registers, depending on the model of the Pascal GPU.
• Just like a vector processor, these registers are divided logically across the Vector Lanes or, in this case, the SIMD Lanes.
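A tiny worked example, using only the constants from the bullets above, of how a 32-wide thread of SIMD instructions executes on 16 physical SIMD Lanes and how many SIMD threads the scoreboard keeps in flight.

    #include <cstdio>

    int main() {
        const int simd_width   = 32;   // elements per SIMD instruction
        const int simd_lanes   = 16;   // physical SIMD Lanes per SIMD Processor (Pascal)
        const int simd_threads = 64;   // threads of SIMD instructions tracked by the scoreboard

        printf("clock cycles per SIMD instruction: %d\n",
               simd_width / simd_lanes);                    // 32 / 16 = 2
        printf("elements covered by the scheduled SIMD threads: %d\n",
               simd_threads * simd_width);                  // 64 * 32 = 2048
        return 0;
    }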
NVIDIA GPU Instruction Set Architecture
• PTX (Parallel Thread Execution) is an abstraction of the hardware instruction set.
• PTX provides a stable instruction set for compilers as well as compatibility across generations of GPUs; the hardware instruction set is hidden from the programmer.
• PTX instructions describe the operations of a single CUDA Thread and usually map one-to-one with hardware instructions.
• PTX uses an unlimited number of write-once registers, so the compiler must run a register allocation procedure to map the PTX registers to the fixed number of read-write hardware registers available on the actual device. The optimizer runs subsequently and can reduce register use even further; it also eliminates dead code.
• The format of a PTX instruction is opcode.type d, a, b, c; where d is the destination operand and a, b, and c are source operands (for example, add.f64 RD0, RD0, RD2).

Conditional Branching in GPUs
• There are strong similarities between how vector architectures handle IF statements (in software) and how GPUs handle them (in hardware).
• At the PTX assembler level, the control flow of one CUDA Thread is described by the PTX instructions branch, call, return, and exit, plus individual per-thread-lane predication of each instruction, specified by the programmer with per-thread-lane 1-bit predicate registers.
• The PTX assembler analyzes the PTX branch graph and optimizes it to the fastest GPU hardware instruction sequence.
• Each CUDA Thread can make its own decision on a branch and does not need to be in lockstep.
• At the GPU hardware instruction level, control flow includes branch, jump, jump indexed, call, call indexed, return, exit, and special instructions that manage the branch synchronization stack.
• GPU hardware provides each SIMD Thread with its own stack; a stack entry contains an identifier token, a target instruction address, and a target thread-active mask.
• There are GPU special instructions that push stack entries for a SIMD Thread, and special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask.
• GPU hardware instructions also have individual per-lane predication (enable/disable), specified with a 1-bit predicate register for each lane.
• The PTX assembler identifies loop branches and generates GPU branch instructions that branch to the top of the loop.
• GPU indexed jump and indexed call instructions push entries on the stack so that when all lanes complete the switch statement or function call, the SIMD Thread converges.
• A GPU set-predicate instruction evaluates the conditional part of the IF statement.
• The SIMD instructions in the threads inside the THEN part of the IF statement broadcast operations to all the SIMD Lanes.
• Those lanes with the predicate set to 1 perform the operation and store the result; the other SIMD Lanes don't perform an operation or store a result.

NVIDIA GPU Memory Structures
• Each SIMD Lane in a multithreaded SIMD Processor is given a private section of off-chip DRAM, which we call the private memory.
• Local memory is on-chip, shared by the SIMD Lanes within a multithreaded SIMD Processor, and limited in size, typically to 48 KiB.
• The multithreaded SIMD Processor dynamically allocates portions of the local memory to a Thread Block when it creates the Thread Block.
• The system processor, called the host, can read or write GPU Memory.
• Local memory is unavailable to the host, as it is private to each multithreaded SIMD Processor.
• GPUs traditionally use smaller streaming caches.
• To improve memory bandwidth and reduce overhead, PTX data transfer instructions, in cooperation with the memory controller, coalesce individual parallel thread requests from the same SIMD Thread into a single memory block request when the addresses fall in the same block. (The sketch below labels these memory spaces in a small CUDA kernel.)
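To make the memory spaces concrete, here is a small hypothetical CUDA kernel (names and sizes are ours). Note the terminology mapping: what the text calls local memory is CUDA's __shared__ memory, GPU Memory is CUDA's global memory, and per-lane private memory holds spilled per-thread values.

    #include <cuda_runtime.h>
    #define BLOCK 256

    // GPU Memory: visible to all multithreaded SIMD Processors; the host can
    // read or write it (e.g., with cudaMemcpyToSymbol / cudaMemcpyFromSymbol).
    __device__ double gpu_table[1024];

    // Launched as: scale_and_reduce<<<numBlocks, BLOCK>>>(in, out, n);
    __global__ void scale_and_reduce(const double *in, double *out, int n) {
        // Local memory: allocated per Thread Block when the block is created,
        // shared by its SIMD Lanes, and not accessible from the host.
        __shared__ double local_buf[BLOCK];

        // Per-thread scalars live in registers or, if spilled, in the off-chip
        // private memory of the SIMD Lane running this CUDA Thread.
        double x = 0.0;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x = in[i] * gpu_table[i % 1024];

        local_buf[threadIdx.x] = x;
        __syncthreads();   // threads in the same block communicate via local memory

        if (threadIdx.x == 0) {
            double sum = 0.0;
            for (int j = 0; j < BLOCK; j++) sum += local_buf[j];
            out[blockIdx.x] = sum;   // result written back to GPU Memory
        }
    }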
Innovations in the Pascal GPU Architecture
• Each new generation of GPU typically adds new features that increase performance or make it easier for programmers, for example:
  • Fast single-precision, double-precision, and half-precision floating-point arithmetic.
  • High-bandwidth memory.
  • High-speed chip-to-chip interconnect.
  • Unified virtual memory and paging support.

Similarities and Differences Between Vector Architectures and GPUs
• Similarity: both exploit data-level parallelism.
• Major difference: multithreading, which is fundamental to GPUs and missing from most vector processors.

Similarities and Differences Between Multimedia SIMD Computers and GPUs

Detecting and Enhancing Loop-Level Parallelism
• Compiler technology is used for discovering the amount of parallelism that we can exploit in a program.
• Loop-level parallelism is normally investigated at the source level.
• Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop.
• The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Consider the loop
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• In this loop, the two uses of x[i] are dependent, but the dependence is within a single iteration and is not loop-carried.
• There is a loop-carried dependence between successive uses of i in different iterations, but this dependence involves the induction variable and is easily recognized and eliminated.
• Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, a compiler can do this analysis more easily at or near the source level, in contrast to the machine-code level.

Finding Dependences
• Finding the dependences in a program is important both to determine which loops might contain parallelism and to eliminate name dependences.
• The complexity of dependence analysis arises also because of the presence of arrays and pointers in languages such as C or C++, or pass-by-reference parameter passing in Fortran.
• How does the compiler detect dependences in general? Assume that array indices are affine.
• In simplest terms, a one-dimensional array index is affine if it can be written in the form a*i + b, where a and b are constants and i is the loop index variable.
• The index of a multidimensional array is affine if the index in each dimension is affine.
• A dependence exists if two conditions hold (a brute-force check of these conditions is sketched below):
  1. There are two iteration indices, j and k, that are both within the limits of the for loop; that is, m ≤ j ≤ n and m ≤ k ≤ n.
  2. The loop stores into an array element indexed by a*j + b and later fetches from that same array element when it is indexed by c*k + d; that is, a*j + b = c*k + d.
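A brute-force illustration of the two conditions (a real compiler would use a symbolic test such as the GCD test rather than enumeration). The loop bounds and the coefficients a=2, b=3, c=2, d=0, which correspond to a hypothetical loop that stores to x[2*i+3] and reads x[2*i], are our own example.

    #include <cstdio>

    // Returns true if some iteration j stores to a*j+b and the same or a later
    // iteration k fetches c*k+d from the same element, for m <= j, k <= n.
    bool has_dependence(int a, int b, int c, int d, int m, int n) {
        for (int j = m; j <= n; j++)
            for (int k = j; k <= n; k++)
                if (a * j + b == c * k + d)
                    return true;    // both conditions hold -> dependence
        return false;
    }

    int main() {
        // 2*j + 3 is always odd and 2*k is always even, so they never match:
        printf("%s\n", has_dependence(2, 3, 2, 0, 0, 99) ? "dependent"
                                                         : "no dependence");
        return 0;
    }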
Eliminating Dependent Computations
• Often we cannot determine whether a dependence exists at compile time.
• One of the most important forms of dependent computation is a recurrence.
• Although such a loop is not parallel, it has a very specific structure called a reduction (see the sketch below).
• Reductions are also a key part of the primary parallelism primitive, MapReduce, used in warehouse-scale computers.
• In general, any associative function can be used as a reduction operator; common cases include operators such as max and min.
• Reductions are sometimes handled by special hardware in a vector and SIMD architecture that allows the reduce step to be done much faster than it could be done in scalar mode.
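A minimal sketch of the reduction structure mentioned above (our own dot-product example). The only loop-carried dependence is on sum; because addition is associative, the iterations can be split into partial sums and combined in parallel by the compiler, the programmer, or special reduction hardware.

    #include <cstdio>

    int main() {
        double x[1000], y[1000];
        for (int i = 0; i < 1000; i++) { x[i] = 1.0; y[i] = 2.0; }

        double sum = 0.0;
        for (int i = 0; i < 1000; i++)
            sum = sum + x[i] * y[i];    // recurrence on sum -> a sum reduction

        printf("%f\n", sum);            // 1000 * (1.0 * 2.0) = 2000.0
        return 0;
    }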