
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2009
Jih-Kwon Peir
Computer & Information Science & Engineering
University of Florida
1

CIS 6930: Chip Multiprocessor:
Parallel Architecture and Programming

Acknowledgement: Slides borrowed from
o Accelerators for Science and Engineering Applications:
GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei
Hwu / University of Illinois, 2006-2008
(http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
o Course material posted on the CUDA Zone
(http://www.nvidia.com/object/cuda_education.html)
o Intel Software Network (http://software.intel.com/en-us/academic/)
o The Art of Multiprocessor Programming
(http://software.intel.com/en-us/academic/)
o Presentation slides from various papers
2

Course Goals
Learn how to program massively parallel
processors and achieve
- high performance
- functionality and maintainability
- scalability across future generations

Acquire technical knowledge required to achieve
the above goals
- principles and patterns of parallel programming
- processor architecture features and constraints
- programming API, tools and techniques

Learn new many-core general-purpose and GPU
processor architectures
- organization and memory systems

Parallel programming basics: locking, synchronization,
mutual exclusion, transactional memory, etc.

Course Outline
Week 1-2: Introduction, GPU architectures, CUDA programming
Week 3-6: CUDA threads, code blocks, grids, CUDA memory,
synchronization, performance
Week 7: Project selection and discussion
Week 8-9: Intel many-core architectures
Week 10-11: Parallel programming model, synchronization,
mutual exclusion, conditional synchronization, locks, barriers,
concurrency and correctness, sequential programs and
consistency (Fermi and Larrabee to be added)
Week 12-13: Discussion of advanced issues in multi-core
architecture and programming
Week 14-16: In-depth discussion of project topics and project
presentations
4

CUDA GPU Programming

Integrated host + device application C program
- Serial or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code

(Figure: execution alternates between serial host code and parallel device kernels:
  Serial Code (host)
  Parallel Kernel (device)   KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
  Parallel Kernel (device)   KernelB<<< nBlk, nTid >>>(args);)
5
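To make the host/device split concrete, here is a minimal sketch of an integrated CUDA C program in the style this slide describes. The kernel name scale, the array size N, and the launch configuration are illustrative assumptions, not taken from the slides.

#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical SPMD kernel (device code): each thread scales one element.
__global__ void scale(float* d_data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main(void)
{
    const int N = 1024;                      // illustrative problem size
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    // Serial code (host): allocate device memory and copy the input over.
    float* d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // Parallel kernel (device): the KernelA<<< nBlk, nTid >>>(args); launch style.
    int nTid = 256;
    int nBlk = (N + nTid - 1) / nTid;
    scale<<<nBlk, nTid>>>(d_data, 2.0f, N);

    // Serial code (host): copy the result back and clean up.
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}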

CUDA Thread Blocks and Threads

Each thread uses IDs to decide what data to work on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
(see the index sketch below)

Simplifies memory addressing when processing
multidimensional data
- Image processing
- Solving PDEs on volumes
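As a concrete illustration of how a thread combines its block ID and thread ID into a data index, here is a minimal sketch for a 2D grid of 2D blocks; the kernel name and image dimensions are assumptions for the example.

// Hypothetical kernel over a 2D image: each thread derives its (x, y)
// pixel coordinate from its block ID and thread ID.
__global__ void process(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height)
        img[y * width + x] *= 0.5f;                 // work on "my" pixel
}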

Matrix Multiplication
A Simple Example
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

(Figure: M, N, and P are WIDTH x WIDTH matrices; row i of M and column j of N combine to produce element P[i][j].)
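For comparison with the tiled kernel later in the deck, here is a sketch of a straightforward (untiled) CUDA device version of the same computation, assuming one thread per output element and a Width that is a multiple of the block size. The Md/Nd/Pd naming follows the slides' convention, but the code itself is illustrative.

// Untiled device kernel: each thread computes one element of Pd,
// reading a full row of Md and a full column of Nd from global memory.
__global__ void MatrixMulSimpleKernel(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}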

G80 Example: Thread Scheduling (cont.)


SM implements zero-overhead warp scheduling
At any time, only one of the warps is executed by an SM
Warps whose next instruction has its operands ready for
consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized
scheduling policy
All threads in a warp execute the same instruction when
selected

Thread Scheduling (cont.)

Each code block is assigned to one SM; each SM can take up to 8 blocks
Each block has up to 512 threads, divided into 32-thread warps; each warp is scheduled on 8 SPs, 4
threads per SP, and executed in SIMT mode
The SP pipeline is ~30 stages; fetch, decode, gather and write-back act on whole warps, so they
have a throughput of 1 warp/slow clock
Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs/SM), so its
throughput is 1 warp/4 fast clocks or 1 warp/2 slow clocks
The fetch/decode/... stages have a higher throughput to feed both the MAD and the SFU/MUL
units alternately; hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle
Need 6 warps (or 192 threads) per SM to hide the read-after-write latencies
9

G80 Implementation of CUDA Memories

Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read-only per-grid constant memory

(Figure: a grid of thread blocks; each block has its own shared memory and per-thread registers, and the host and all blocks access the per-grid global and constant memories.)
10

How about performance on G80?


All threads access global memory for their input matrix elements
- Two memory accesses (8 bytes) per floating-point multiply-add
- 4 bytes of memory bandwidth needed per FLOP
- 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
- 86.4 GB/s of available bandwidth limits the code to 21.6 GFLOPS
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to
the peak 346.5 GFLOPS

(Figure: the same G80 memory hierarchy diagram as on the previous slide - host, grid, blocks with shared memory and registers, global and constant memory.)
11
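The bound quoted above follows directly from the slide's numbers (346.5 GFLOPS peak, 86.4 GB/s memory bandwidth, 4 bytes of traffic per FLOP); written out as a check:

\[
4\,\tfrac{\mathrm{B}}{\mathrm{FLOP}} \times 346.5\ \mathrm{GFLOPS} = 1386\ \mathrm{GB/s},
\qquad
\frac{86.4\ \mathrm{GB/s}}{4\ \mathrm{B/FLOP}} = 21.6\ \mathrm{GFLOPS}.
\]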

Tiled Matrix Multiplication Kernel


__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width+Col] = Pvalue;
}

12
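A host-side launch for this kernel is not shown on the slide; the following sketch illustrates one way to configure it, assuming square matrices whose Width is a multiple of TILE_WIDTH and device pointers Md, Nd, Pd that have already been allocated and filled with cudaMalloc/cudaMemcpy.

#define TILE_WIDTH 16   // illustrative tile size

// Assumes Md, Nd, Pd are device pointers and Width % TILE_WIDTH == 0.
void LaunchMatrixMul(float* Md, float* Nd, float* Pd, int Width)
{
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per Pd element in a tile
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per output tile
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaDeviceSynchronize();                                // wait for the kernel to finish
}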

Today's Intel PC Architecture:
Single-Core System

FSB connection between the processor and the
Northbridge (82925X) Memory Controller Hub
Northbridge handles primary PCIe to video/GPU and DRAM
- PCIe x16 bandwidth at 8 GB/s (4 GB/s in each direction)
Southbridge (ICH6RW) handles other peripherals

GeForce-8 Series HW Overview


(Figure: block diagram of the G80 Streaming Processor Array. The array is built from Texture Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and two Streaming Multiprocessors (SMs); each SM has an instruction L1 cache, a data L1 cache, instruction fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2 special function units (SFUs).)
14

SM Warp Scheduling

SM hardware implements zero-overhead warp scheduling

(Figure: the SM multithreaded warp scheduler interleaves ready warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.)

Warps whose next instruction has its


operands ready for consumption are
eligible for execution
Eligible Warps are selected for
execution on a prioritized scheduling
policy
All threads in a Warp execute the same
instruction when selected

4 clock cycles needed to dispatch


the same instruction for all threads
in a Warp in G80

If one global memory access is needed


for every 4 instructions
A minimum of 13 warps is needed to
fully tolerate 200-cycle memory latency
15
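The 13-warp figure can be reproduced from the numbers on this slide (4 cycles to issue one instruction to a whole warp, one 200-cycle global memory access every 4 instructions): each other warp covers 4 x 4 = 16 cycles of a stalled warp's latency, so

\[
\left\lceil \frac{200\ \text{cycles}}{4\ \text{instructions} \times 4\ \text{cycles/instruction}} \right\rceil = \lceil 12.5 \rceil = 13\ \text{warps}.
\]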

CUDA Device Memory Space: Review

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W global, constant, and texture
memories using copy functions

(Figure: the device grid contains thread blocks; each block has shared memory and per-thread registers, each thread also has local memory, and the per-grid global, constant, and texture memories are accessible from the host.)
16
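Since the slide notes that the host reads and writes global, constant, and texture memory through copy functions, here is a minimal hedged sketch of that host-side pattern; the array names and sizes are illustrative, not from the slides.

#include <cuda_runtime.h>

__constant__ float d_coeffs[16];          // per-grid constant memory (device)

void HostCopies(void)
{
    float h_in[256], h_out[256], h_coeffs[16];
    for (int i = 0; i < 256; ++i) h_in[i] = 1.0f;
    for (int i = 0; i < 16; ++i)  h_coeffs[i] = 0.5f;

    float* d_global;
    cudaMalloc((void**)&d_global, sizeof(h_in));                          // per-grid global memory
    cudaMemcpy(d_global, h_in, sizeof(h_in), cudaMemcpyHostToDevice);     // host -> global
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));             // host -> constant

    // ... launch kernels that read d_coeffs and read/write d_global ...

    cudaMemcpy(h_out, d_global, sizeof(h_out), cudaMemcpyDeviceToHost);   // global -> host
    cudaFree(d_global);
}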

Memory Layout of a Matrix in C


(Figure: a 4x4 matrix whose elements are laid out linearly in memory in the order M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3; the access direction in the kernel code is overlaid on the 2D view, with threads T1-T4 each touching one element in time period 1 and the next element in time period 2.)
17
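The linear order shown above is just C's row-major rule; as a small illustrative helper (the function and parameter names are assumptions for the example):

// Row-major access: elements of the same row are adjacent in memory,
// so consecutive 'col' values give consecutive addresses.
float GetElement(const float* M, int Width, int row, int col)
{
    return M[row * Width + col];
}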

Bank Addressing Examples


2-way bank conflicts: linear addressing with stride == 2
(Figure: threads 0-15 of a half-warp access shared memory with stride 2, so pairs of threads - e.g. thread 0 and thread 8 - map to the same bank.)

8-way bank conflicts: linear addressing with stride == 8
(Figure: with stride 8 the 16 threads of a half-warp fall into only two banks (x8 threads each, e.g. banks 0 and 8), an 8-way conflict.)
18
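The two conflict patterns arise from shared-memory indexing of the following shape; this kernel is an illustrative sketch (assume it is launched with at most 64 threads per block so the strided indices stay in bounds), not code from the slides.

__global__ void BankStrideDemo(float* out)
{
    __shared__ float buf[512];
    for (int i = threadIdx.x; i < 512; i += blockDim.x)
        buf[i] = (float)i;                 // fill shared memory cooperatively
    __syncthreads();

    float a = buf[threadIdx.x];            // stride 1: conflict-free on G80's 16 banks
    float b = buf[2 * threadIdx.x];        // stride 2: 2-way bank conflicts
    float c = buf[8 * threadIdx.x];        // stride 8: 8-way bank conflicts
    out[threadIdx.x] = a + b + c;
}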

Control Flow Instructions

Main performance concern with branching is


divergence

Threads within a single warp take different paths


Different execution paths are serialized in G80

The control paths taken by the threads in a warp are traversed one at
a time until there are no more.

A common case: avoid divergence when branch


condition is a function of thread ID

Example with divergence:


If (threadIdx.x > 2) { }
This creates two different control paths for threads in a block
Branch granularity < warp size; threads 0, 1, and 2 follow a different path
than the rest of the threads in the first warp

Example without divergence:


If (threadIdx.x / WARP_SIZE > 2) { }
Also creates two different control paths for threads in a block
Branch granularity is a whole multiple of warp size; all threads in any
given warp follow the same path
19
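The two predicates above can be seen side by side in a small sketch; the kernel name and the WARP_SIZE definition are illustrative, following the slide's examples.

#define WARP_SIZE 32

__global__ void DivergenceDemo(float* data)
{
    // Divergent: threads 0-2 of the first warp take a different path
    // than threads 3-31 of that same warp.
    if (threadIdx.x > 2) {
        data[threadIdx.x] += 1.0f;
    }

    // Non-divergent: the branch decision is constant within any warp,
    // because threadIdx.x / WARP_SIZE is the same for all 32 threads of a warp.
    if (threadIdx.x / WARP_SIZE > 2) {
        data[threadIdx.x] *= 2.0f;
    }
}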

Vector Reduction with Branch Divergence

(Figure: pairwise reduction tree over the array elements. In the first iteration thread 0 computes 0+1, thread 2 computes 2+3, thread 4 computes 4+5, thread 6 computes 6+7, thread 8 computes 8+9, thread 10 computes 10+11, and so on; later iterations combine the partial sums 0...3, 4..7, 0..7, 8..11, 8..15. Only every other active thread survives each iteration, so the active threads become scattered within each warp.)
20

No Divergence until < 16 sub-sums


(Figure: in the first iteration each thread adds an element from the upper half of the array to its own - thread 0 computes 0+16, ..., thread 15 computes 15+31 - so the active threads stay packed at consecutive thread IDs, and no warp diverges until fewer than 16 partial sums remain.)

21
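The two pictures correspond to two ways of indexing a shared-memory reduction. The sketch below is modeled on the standard Kirk/Hwu reduction example rather than code from these slides, and assumes each block loads BLOCK_SIZE elements and writes one partial sum per block.

#define BLOCK_SIZE 256   // illustrative block size

// Variant 1: divergent reduction (active threads become interleaved).
__global__ void ReduceDivergent(const float* g_in, float* g_out)
{
    __shared__ float partialSum[BLOCK_SIZE];
    partialSum[threadIdx.x] = g_in[blockIdx.x * blockDim.x + threadIdx.x];
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (threadIdx.x % (2 * stride) == 0)
            partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
    }
    if (threadIdx.x == 0) g_out[blockIdx.x] = partialSum[0];
}

// Variant 2: contiguous active threads; no divergence until the active
// count drops below the warp size.
__global__ void ReduceContiguous(const float* g_in, float* g_out)
{
    __shared__ float partialSum[BLOCK_SIZE];
    partialSum[threadIdx.x] = g_in[blockIdx.x * blockDim.x + threadIdx.x];
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (threadIdx.x < stride)
            partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
    }
    if (threadIdx.x == 0) g_out[blockIdx.x] = partialSum[0];
}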

Fundamentals of Parallel
Computing
Parallel computing requires that
The problem can be decomposed into sub-problems
that can be safely solved at the same time
The programmer structures the code and data to solve
these sub-problems concurrently

The goals of parallel computing are


To solve problems in less time, and/or
To solve bigger problems, and/or
To achieve better solutions
The problems must be large enough to justify parallel
computing and to exhibit exploitable concurrency.

22

Challenges of Parallel
Programming
Finding and exploiting concurrency often requires
looking at the problem from a non-obvious angle
Computational thinking (J. Wing)

Dependences need to be identified and managed


The order of task execution may change the answers
Obvious: One step feeds result to the next steps
Subtle: numeric accuracy may be affected by ordering steps that are
logically parallel with each other

Performance can be drastically reduced by many


factors

Overhead of parallel processing


Load imbalance among processor elements
Inefficient data sharing patterns
Saturation of critical resources such as memory bandwidth
23

Fermi Implements CUDA

Definitions of memory scope, grid, thread block,
and thread are the same as in Tesla

Grid: array of thread blocks

Thread block: up to 1536 concurrent threads,
communicating through shared memory

GPU has an array of SMs; each SM executes one or
more thread blocks, and each block is grouped into
warps of 32 threads

Other resource constraints are implementation-dependent
24

Fermi GT300 Key Feature


32 cores per SM, 512 cores in total
Fully pipelined integer and floating-point units that
implement the new IEEE 754-2008 standard, including
fused multiply-add (FMA)
Two warps from different thread blocks (even different
kernels) can be issued and executed concurrently
ECC protection from the registers to DRAM
Linear addressing model with caching at all levels
Large shared memory / L1 cache
Double-precision performance is 8x faster than GT200,
reaching ~600 double-precision GFLOPS
25

Fermi GT300 Key Feature (cont.)

Fermi supports simultaneous execution of multiple
kernels from the same application, each kernel
distributed to one or more SMs
GigaThread hardware thread scheduler manages 1,536
simultaneously active threads for each SM across 16
kernels
Switching from one application to another is 20x
faster on Fermi
Fermi supports OpenCL, Fortran, C++, Java, MATLAB,
and Python
Each SM has 32 cores, 16 load/store units, and 4 SFUs
Fermi supports FMA for both single and double precision

26

Instruction Schedule Example

A total of 32 instructions from one or


two warps can be dispatched in each
cycle to any two of the four execution
blocks within a Fermi SM: two blocks
of 16 cores each, one block of four
Special Function Units, and one block
of load/store units. This figure shows
how instructions are issued to the
four execution blocks.

It takes two cycles for the 32 instructions in each warp to execute on the cores or
load/store units. A warp of 32 special-function instructions is issued in a single cycle
but takes eight cycles to complete on the four SFUs
Another major improvement in Fermi and PTX 2.0 is a new unified addressing model.
All addresses in the GPU are allocated from a continuous 40-bit (one terabyte) address
space. Global, shared, and local addresses are defined as ranges within this address
space and can be accessed by common load/store instructions. (The load/store
instructions support 64-bit addresses to allow for future growth.)

Multi-Core Architecture:
Intel Quad Core Technology of Today
Cache Structure

(Figure: Core 0 and Core 1 share one 4MB L2 cache, Core 2 and Core 3 share another 4MB L2 cache, and both pairs connect through the bus interface to a 1066 MHz / 1333 MHz FSB.)

The L2 cache of today's quad-core processors is not one
cache shared by all 4 cores. Instead, there are two L2
caches, each shared by two cores.

28

What Is OpenMP*?

(Figure: a cloud of OpenMP directives and API calls illustrating the spec, e.g. C$OMP FLUSH, C$OMP THREADPRIVATE(/ABC/), C$OMP parallel do shared(a, b, c), #pragma omp critical, CALL OMP_SET_NUM_THREADS(10), call omp_test_lock(jlok), C$OMP MASTER, call OMP_INIT_LOCK(ilok), C$OMP ATOMIC, C$OMP SINGLE PRIVATE(X), setenv OMP_SCHEDULE "dynamic", C$OMP PARALLEL DO ORDERED PRIVATE(A, B, C), C$OMP PARALLEL REDUCTION(+: A, B), C$OMP SECTIONS, #pragma omp parallel for private(A, B), C$OMP PARALLEL COPYIN(/blk/), Nthrds = OMP_GET_NUM_PROCS(), !$OMP BARRIER, C$OMP ORDERED, C$OMP DO lastprivate(XX), omp_set_lock(lck).)

http://www.openmp.org
Current spec is OpenMP 2.5: 250 pages (combined C/C++ and Fortran)

Programming with OpenMP*
29
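None of the directive fragments above form a complete program, so here is a minimal hedged OpenMP C example using two of the constructs shown (a parallel for loop with a reduction clause); it is illustrative, not taken from the course material. Compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp).

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000;
    double a[1000], sum = 0.0;

    // Parallel loop: the loop index is private to each thread,
    // and the partial sums are combined with a reduction.
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; ++i) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}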

More material
Intel Larrabee Architecture
Herlihy and Shavit, The Art of Multiprocessor Programming
- Chapter 1: Introduction
- Chapter 2: Mutual Exclusion
30
