Multicores, Multiprocessors, and Clusters
Introduction
Multiprocessors: scalability, availability, power efficiency
High throughput for independent jobs
Single program run on multiple processors
Multicore microprocessors: chips with multiple processors (cores)
Hardware
Serial: e.g., Pentium 4
Parallel: e.g., quad-core Xeon e5345
Software
Sequential: e.g., matrix multiplication
Concurrent: e.g., operating system
2.11: Parallelism and Instructions: Synchronization
3.6: Parallelism and Computer Arithmetic: Associativity
4.10: Parallelism and Advanced Instruction-Level Parallelism
5.8: Parallelism and Memory Hierarchies
Parallel Programming
Otherwise, just use a faster uniprocessor, since it's easier!
Difficulties: partitioning, coordination, communications overhead
Amdahl's Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
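A worked solution, using the standard Amdahl's Law speedup formula, where Fpar is the fraction of the original execution time that can be parallelized:

  Speedup = 1 / ((1 - Fpar) + Fpar/100) = 90
  (1 - Fpar) + Fpar/100 = 1/90
  Fpar = (1 - 1/90) / (1 - 1/100) ≈ 0.999

So the sequential part can be at most about 0.1% of the original execution time.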
Scaling Example
Workload: sum of 10 scalars plus a 10 × 10 matrix sum
Single processor: Time = (10 + 100) tadd
10 processors
Time = 10 tadd + 100/10 tadd = 20 tadd
Speedup = 110/20 = 5.5 (55% of potential)
100 processors
Time = 10 tadd + 100/100 tadd = 11 tadd
Speedup = 110/11 = 10 (10% of potential)
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) tadd
10 processors
Time = 10 tadd + 10000/10 tadd = 1010 tadd
Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
Time = 10 tadd + 10000/100 tadd = 110 tadd
Speedup = 10010/110 = 91 (91% of potential)
Strong scaling: problem size fixed, as in the example above
Weak scaling: problem size proportional to number of processors, e.g., 10 processors, 10 × 10 matrix
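The speedups above all come from one expression: total work divided by the work remaining after spreading the parallel part over p processors. A small C check of the arithmetic (the function name and the perfect-load-balance assumption are mine, not the slides'):

#include <stdio.h>

/* Speedup for a workload with `seq` sequential add-times and `par`
   parallelizable add-times spread evenly over p processors. */
static double speedup(double seq, double par, int p) {
    return (seq + par) / (seq + par / p);
}

int main(void) {
    printf("10x10 matrix,   10 procs:  %.1f\n", speedup(10, 100, 10));     /* 5.5  */
    printf("10x10 matrix,   100 procs: %.1f\n", speedup(10, 100, 100));    /* 10.0 */
    printf("100x100 matrix, 10 procs:  %.1f\n", speedup(10, 10000, 10));   /* 9.9  */
    printf("100x100 matrix, 100 procs: %.1f\n", speedup(10, 10000, 100));  /* 91.0 */
    return 0;
}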
Shared Memory
Hardware provides single physical address space for all processors
Synchronize shared variables using locks
Memory access time: uniform (UMA) or non-uniform (NUMA)
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor:

  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Reduction: divide and conquer
Half the processors add pairs, then a quarter, ...
Need to synchronize between reduction steps
  half = 100;
  repeat
    synch();
    if (half%2 != 0 && Pn == 0)
      sum[0] = sum[0] + sum[half-1];
      /* Conditional sum needed when half is odd;
         Processor 0 gets the missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half)
      sum[Pn] = sum[Pn] + sum[Pn+half];
  until (half == 1);
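A runnable C sketch of the same two phases (local sums, then tree reduction). The 100 processors are simulated sequentially here; on a real shared-memory machine each Pn would run on its own core, with a barrier playing the role of synch():

#include <stdio.h>

#define P 100            /* processors in the example        */
#define N_PER_PROC 1000  /* numbers summed by each processor */

double sum[P];           /* one partial sum per processor    */
double A[P * N_PER_PROC];

int main(void) {
    for (int i = 0; i < P * N_PER_PROC; i++) A[i] = 1.0;  /* sample data */

    /* Phase 1: each "processor" Pn sums its own 1000 numbers. */
    for (int Pn = 0; Pn < P; Pn++) {
        sum[Pn] = 0;
        for (int i = N_PER_PROC * Pn; i < N_PER_PROC * (Pn + 1); i++)
            sum[Pn] = sum[Pn] + A[i];
    }

    /* Phase 2: tree reduction as in the pseudocode above; the synch()
       barrier is implicit because this simulation is sequential. */
    int half = P;
    do {
        if (half % 2 != 0)            /* odd count: processor 0    */
            sum[0] += sum[half - 1];  /* picks up the unpaired sum */
        half = half / 2;
        for (int Pn = 0; Pn < half; Pn++)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);

    printf("total = %f\n", sum[0]);   /* expect 100000.0 */
    return 0;
}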
Message Passing
Each processor has private physical address space
Hardware sends/receives messages between processors
Then do partial sums on each processor:

  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];

Half the processors send, the other half receive and add
Then a quarter send, a quarter receive and add, ...
Reduction
Send/receive also provide synchronization Assumes send/receive take similar time to addition
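A hedged sketch of this send/receive reduction in C with MPI. The slides use abstract send()/receive() primitives; MPI, the rank/size variable names, and the dummy local data are my assumptions:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, limit;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);    /* this processor's ID  */
    MPI_Comm_size(MPI_COMM_WORLD, &limit); /* number of processors */

    double sum = 0.0;
    for (int i = 0; i < 1000; i++)         /* local partial sum over   */
        sum += 1.0;                        /* this rank's 1000 numbers */

    /* Tree reduction: in each step the upper half of the remaining
       ranks send their sums to the lower half, which receive and add. */
    int half = limit;
    while (half > 1) {
        int upper = half;
        half = (half + 1) / 2;             /* send/receive dividing line */
        if (Pn >= half && Pn < upper) {
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        } else if (Pn < upper - half) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (Pn == 0)
        printf("total = %f\n", sum);       /* final sum ends up on rank 0 */
    MPI_Finalize();
    return 0;
}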
Grid Computing
E.g., Internet connections
Work units farmed out, results sent back
E.g., SETI@home, World Community Grid
Multithreading
Replicate registers, PC, etc.
Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
Simultaneous Multithreading
Schedule instructions from multiple threads
Instructions from independent threads execute when function units are available
Within threads, dependencies handled by scheduling and register renaming
Two threads: duplicated registers, shared function units and caches
Multithreading Example
Future of Multithreading
An alternate classification
Data streams: single or multiple
SIMD (single instruction, multiple data streams): SSE instructions of x86
MIMD (multiple instruction, multiple data streams): Intel Xeon e5345
SIMD
Simplifies synchronization Reduced instruction control hardware Works best for highly data-parallel applications
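A minimal C sketch of what "SIMD: SSE instructions of x86" means in practice: one instruction operates on four floats at once. Assumes an SSE-capable x86 target and that n is a multiple of 4; the function name is illustrative:

#include <xmmintrin.h>

void add_arrays(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* load 4 elements of a       */
        __m128 vb = _mm_loadu_ps(b + i);  /* load 4 elements of b       */
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, 1 instruction */
        _mm_storeu_ps(c + i, vc);         /* store 4 results            */
    }
}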
Vector Processors
Highly pipelined function units Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: vector extension to MIPS
32 × 64-element registers (64-bit elements)
Vector instructions:
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Example: DAXPY (Y = a × X + Y)
Conventional MIPS code:

        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

Vector MIPS code:

        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
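For reference, a plain-C statement of what both listings compute (not from the slides); 64 elements matches the vector register length above:

/* DAXPY: Y = a * X + Y over 64 double-precision elements. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}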
Simplify data-parallel programming
Explicit statement of absence of loop-carried dependences
Regular access patterns benefit from interleaved and burst memory
Avoid control hazards by avoiding loops
More general than ad-hoc media extensions (such as MMX, SSE)
History of GPUs
Frame buffer memory with address generation for video output
Originally high-end computers (e.g., SGI)
Moore's Law → lower cost, higher density
3D graphics cards for PCs and game consoles
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping, rasterization
3D graphics processing
GPU Architectures
GPUs are highly multithreaded
Use thread switching to hide memory latency
Graphics memory is wide and high-bandwidth
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)
Streaming multiprocessor: 8 × streaming processors
Streaming Processors
Single-precision FP and integer units
Each SP is fine-grained multithreaded
Executed in parallel, SIMD style
Registers, PCs, ...
Classifying GPUs
But with performance degradation
Need to write general-purpose code with care
Static: discovered at compile time
Dynamic: discovered at runtime
Superscalar
Tesla Multiprocessor
Interconnection Networks
Network topologies
Bus
Ring
Multistage Networks
Network Characteristics
Performance
Parallel Benchmarks
Linpack: matrix linear algebra
SPECrate: parallel run of SPEC CPU programs
Job-level parallelism
Mix of kernels and applications, strong scaling
Computational fluid dynamics kernels
Code or Applications?
Traditional benchmarks
Fixed code and data sets
Should algorithms, programming languages, and tools be part of the system?
Compare systems, provided they implement a given application
E.g., Linpack, Berkeley Design Patterns
Modeling Performance
Arithmetic intensity of a kernel: FLOPs per byte of memory accessed
Measured using computational kernels from Berkeley Design Patterns
For a system: peak GFLOPS (from data sheet), peak memory bytes/sec (using Stream benchmark)
Roofline Diagram
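A minimal sketch of the bound the roofline diagram plots: attainable throughput is capped either by peak floating-point performance or by peak memory bandwidth times the kernel's arithmetic intensity. Function and parameter names are illustrative:

#include <math.h>

/* Attainable GFLOPs/sec for a kernel with the given arithmetic
   intensity (FLOPs per byte) on a machine with the given peaks. */
double roofline(double peak_gflops, double peak_gbytes_per_sec,
                double arithmetic_intensity) {
    return fmin(peak_gflops, peak_gbytes_per_sec * arithmetic_intensity);
}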
Comparing Systems
2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
Same memory system
Need high arithmetic intensity
Or working set must fit in the X4's 2MB L3 cache
Optimizing Performance
Optimize FP performance
Balance adds & multiplies Improve superscalar ILP and use of SIMD instructions Software prefetch
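A hedged sketch of the software prefetch item above, using the GCC/Clang __builtin_prefetch builtin (my choice, not the slides' own code); the 16-element prefetch distance is an illustrative tuning parameter:

/* Sum an array while requesting future cache lines early. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* hint: fetch ahead */
        s += a[i];
    }
    return s;
}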
Optimizing Performance
Kernels
SpMV (left), LBMHD (right)
Some optimizations change arithmetic intensity x86 systems have higher peak GFLOPs
Performance on SpMV
Sparse matrix/vector multiply
Irregular memory accesses, memory bound
Arithmetic intensity: 0.166 before memory optimization, 0.25 after
Similar peak FLOPS; Xeon limited by shared FSBs and chipset
20-30 vs. 75 peak GFLOPs; more cores and memory bandwidth
Performance on LBMHD
Each point: 75 FP read/write, 1300 FP ops
Arithmetic intensity: 0.70 before optimization, 1.07 after
More powerful cores, not limited by memory bandwidth
Still suffers from memory bottlenecks
Achieving Performance
If naive code performs well, it's easier to write high-performance code for the system
System             Kernel  Naive GFLOPs/sec         Optimized GFLOPs/sec  Naive as % of optimized
Intel Xeon         SpMV    1.0                      1.5                   64%
Intel Xeon         LBMHD   4.6                      5.6                   82%
AMD Opteron X4     SpMV    1.4                      3.6                   38%
AMD Opteron X4     LBMHD   7.1                      14.1                  50%
Sun UltraSPARC T2  SpMV    3.5                      4.1                   86%
Sun UltraSPARC T2  LBMHD   9.7                      10.5                  93%
IBM Cell QS20      SpMV    naive code not feasible  6.4                   --
IBM Cell QS20      LBMHD   naive code not feasible  16.7                  --
Fallacies
Fallacy: Amdahl's Law doesn't apply to parallel computers
Since we can achieve linear speedup
But only on applications with weak scaling
Fallacy: peak performance tracks observed performance
Marketers like this approach!
But compare the Xeon with the others in the example
Need to be aware of bottlenecks
Pitfalls
E.g., a single lock for a shared resource serializes accesses, even when they could be done in parallel
Use finer-granularity locking
Concluding Remarks
Developing parallel software
Devising appropriate architectures
Changing software and application environment
Chip-level multiprocessors with lower latency, higher bandwidth interconnect