CIS 6930: Chip
Multiprocessor: Parallel
Architecture and Programming
Fall 2009
Jih-Kwon Peir
Computer & Information Science and Engineering
University of Florida
CIS 6930: Chip Multiprocessor:
Parallel Architecture and Programming
Acknowledgement: Slides borrowed from
o Accelerators for Science and Engineering Applications: GPUs and
  Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of
  Illinois, 2006-2008
  (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
o Course material posted on the CUDA Zone
  (http://www.nvidia.com/object/cuda_education.html)
o Intel Software Network (http://software.intel.com/en-us/academic/)
o The Art of Multiprocessor Programming
o Presentation slides from various papers
Course Goals
Learn how to program massively parallel processors and achieve
  o High performance
  o Functionality and maintainability
  o Scalability across future generations
Acquire the technical knowledge required to achieve the above goals
  o Principles and patterns of parallel programming
  o Processor architecture features and constraints
  o Programming APIs, tools, and techniques
Learn new many-core general-purpose and GPU processor architectures
  o Organization and memory systems
  o Parallel programming basics: locking, synchronization, mutual
    exclusion, transactional memory, etc.
Course Outline
Week 1-2: Introduction, GPU architectures, CUDA programming
Week 3-6: CUDA threads, thread blocks, grids, CUDA memory,
synchronization, performance
Week 7: Project selection and discussion
Week 8-9: Intel many-core architectures
Week 10-11: Parallel programming model, synchronization, mutual exclusion,
conditional synchronization, locks, barriers, concurrency and correctness,
sequential programs and consistency
(Add Fermi and Larrabee)
Week 12-13: Discussion of advanced issues in multi-core architecture and
programming
Week 14-16: In-depth discussion of project topics and project presentations
CUDA GPU Programming
Integrated host + device application C program
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device SPMD kernel C code
Serial Code (host)
  Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
  ...
Serial Code (host)
  Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
  ...
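A minimal, hedged sketch of this host + device pattern (not from the course
materials): the kernel body, array contents, and launch sizes below are
illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void KernelA(float* d, int n)       // highly parallel part (device)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;             // each thread handles one element
}

int main()
{
    const int n = 1024;
    float h[n];                                // host data
    for (int i = 0; i < n; ++i) h[i] = (float)i;   // serial host code

    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    int nTid = 256;                            // threads per block
    int nBlk = (n + nTid - 1) / nTid;          // blocks in the grid
    KernelA<<<nBlk, nTid>>>(d, n);             // parallel kernel (device)

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);             // serial host code resumes
    return 0;
}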
5
CUDA Thread Blocks and Threads
Each thread uses its IDs to decide what data to work on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
(see the indexing sketch below)
  Image processing
  Solving PDEs on volumes
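As a hedged illustration (the kernel name, operation, and image layout are
assumptions), a kernel can combine 2D block and thread IDs to address one
element of a 2D image:

__global__ void brighten(unsigned char* img, int width, int height)
{
    // 2D block and thread IDs map naturally onto a 2D image
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int idx = row * width + col;           // row-major addressing
        int v = img[idx] + 10;                 // each thread updates one pixel
        img[idx] = (unsigned char)(v > 255 ? 255 : v);
    }
}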
Matrix Multiplication
A Simple Example
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
[Figure: M, N, and P are WIDTH x WIDTH matrices; row i of M is combined with
column j of N (index k runs along both) to produce element P[i][j].]
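For contrast with the tiled kernel later in the slides, here is a hedged
sketch of the straightforward device version, in which every thread reads its
operands directly from global memory; the parameter names follow the host
code, but the kernel itself is an illustration, not the course's reference
implementation.

__global__ void MatrixMulSimple(float* Md, float* Nd, float* Pd, int Width)
{
    // One thread computes one element of Pd, reading Md and Nd from global memory
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
}

Each multiply-add here performs two 4-byte global loads, which is exactly the
access pattern analyzed on the "How about performance on G80?" slide below.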
G80 Example: Thread Scheduling (cont.)
The SM implements zero-overhead warp scheduling
  At any time, only one warp is executed by an SM
  Warps whose next instruction has its operands ready for consumption are
  eligible for execution
  Eligible warps are selected for execution based on a prioritized
  scheduling policy
  All threads in a warp execute the same instruction when it is selected
Thread Scheduling (cont.)
Each thread block is assigned to one SM; each SM can take up to 8 blocks
Each block has up to 512 threads, divided into 32-thread warps; each warp is
scheduled on the 8 SPs, 4 threads per SP, and executes in SIMT mode
An SP is pipelined (~30 stages); fetch, decode, gather, and write-back act on
whole warps, so they have a throughput of 1 warp per slow clock
Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs
per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow
clocks
The fetch/decode/... stages have a higher throughput to feed both the MAD and
the SFU/MUL units alternately; hence the peak rate of 8 MADs + 8 MULs per
(fast) clock cycle
Need 6 warps (or 192 threads) per SM to hide the read-after-write latencies
G80 Implementation of CUDA
Memories
Each thread can:
  Read/write per-thread registers
  Read/write per-thread local memory
  Read/write per-block shared memory
  Read/write per-grid global memory
  Read-only per-grid constant memory
(a declaration sketch follows the figure)
[Figure: a grid of thread blocks; each block has its own shared memory and
per-thread registers, while global memory and constant memory are shared by
the whole grid and are accessible from the host.]
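As a hedged illustration of how these memory spaces appear in CUDA source
(the variable names, sizes, and kernel are assumptions, not from the slides):

__constant__ float coeff[16];      // per-grid constant memory, read-only in kernels
__device__   float table[1024];    // per-grid global memory

__global__ void memSpaces(float* gOut)   // gOut also lives in global memory
{
    __shared__ float tile[256];    // per-block shared memory (assumes <= 256 threads/block)
    int i = threadIdx.x;           // automatic scalar -> per-thread register
    float big[8];                  // a larger per-thread array may spill to local memory
    for (int k = 0; k < 8; ++k) big[k] = coeff[k % 16];
    tile[i] = table[i] + big[i % 8];
    __syncthreads();
    gOut[blockIdx.x * blockDim.x + i] = tile[i];
}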
How about performance on G80?
All threads access global memory for their input matrix elements
  Two memory accesses (8 bytes) per floating-point multiply-add
  4 bytes of memory bandwidth needed per FLOP
  4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
  The available 86.4 GB/s limits the code to 21.6 GFLOPS
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak
346.5 GFLOPS
[Figure: the same G80 memory hierarchy as before - per-block shared memory
and per-thread registers, plus grid-wide global and constant memory reached
through the host.]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
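A hedged sketch of how the host might configure and launch this kernel; the
TILE_WIDTH value of 16 and the wrapper function name are assumptions for
illustration.

#define TILE_WIDTH 16   // one tile per thread block; 16 x 16 = 256 threads

void MatrixMulOnDevice(float* Md, float* Nd, float* Pd, int Width)
{
    // One thread computes one Pd element; one block computes one tile
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}

With tiling, each Md and Nd element is read from global memory once per tile
rather than once per thread, cutting global-memory traffic by roughly a
factor of TILE_WIDTH.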
Today's Intel PC Architecture:
Single Core System
FSB connection between the processor and the Northbridge (82925X Memory
Controller Hub)
Northbridge handles the primary PCIe link to the video card/GPU and DRAM
  PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction)
Southbridge (ICH6RW) handles other peripherals
GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array is built from Texture Processor
Clusters (TPCs); each TPC contains a TEX unit and two Streaming
Multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction
fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2
special function units (SFUs).]
SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling
[Figure: the SM's multithreaded warp scheduler interleaves instructions over
time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction
95, ..., warp 8 instruction 12, warp 3 instruction 96.]
  Warps whose next instruction has its operands ready for consumption are
  eligible for execution
  Eligible warps are selected for execution based on a prioritized
  scheduling policy
  All threads in a warp execute the same instruction when it is selected
4 clock cycles are needed to dispatch the same instruction for all threads
in a warp on G80
  If one global memory access is needed for every 4 instructions, a minimum
  of 13 warps is needed to fully tolerate a 200-cycle memory latency
  (each warp supplies 4 instructions x 4 cycles = 16 cycles of independent
  work, and 200 / 16 = 12.5, rounded up to 13)
CUDA Device Memory Space: Review
Each thread can:
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read-only per-grid constant memory
  Read-only per-grid texture memory
The host can R/W the global, constant, and texture memories using the copy
functions (a sketch follows the figure)
[Figure: a (device) grid of thread blocks; each block has shared memory plus
per-thread registers and local memory, while global, constant, and texture
memory are shared by the whole grid and are accessible from the host.]
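As a hedged host-side sketch of those copy functions (the array names, sizes,
and the assumption that d_global came from a prior cudaMalloc are
illustrative):

float h_data[256];                   // host array
__constant__ float c_coeff[16];      // per-grid constant memory

void copyExamples(float* d_global)   // d_global allocated earlier with cudaMalloc
{
    // Host writes global memory
    cudaMemcpy(d_global, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    // Host reads global memory back
    cudaMemcpy(h_data, d_global, sizeof(h_data), cudaMemcpyDeviceToHost);
    // Host writes constant memory (read-only from kernels)
    float h_coeff[16] = {0};
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));
}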
Memory Layout of a Matrix in C
[Figure: a 4x4 matrix M stored in row-major order in C. The linear memory
layout is M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3
M1,3 M2,3 M3,3. The figure contrasts this layout with the access direction in
the kernel code, showing which elements threads T1-T4 touch during time
period 1 and time period 2.]
Bank Addressing Examples
[Figure, left: 2-way bank conflicts - linear addressing with stride == 2
maps pairs of threads (e.g. threads 0-4 and threads 8-11) onto the same
banks, so every bank that is used serves two threads.
Figure, right: 8-way bank conflicts - linear addressing with stride == 8
maps eight threads onto one bank and the other eight onto another (the x8
labels), across banks 0-15.]
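A hedged sketch of access patterns that would produce these cases on
G80-class shared memory (16 banks of 4-byte words); the kernel, array names,
and block size are illustrative assumptions.

__global__ void bankExamples(float* out)
{
    // Assumes a block of <= 64 threads so all indices stay in range
    __shared__ float s[512];
    int tid = threadIdx.x;
    s[tid] = (float)tid;
    __syncthreads();

    float a = s[tid];          // stride 1: a half-warp of 16 threads hits
                               // 16 different banks - no conflict
    float b = s[tid * 2];      // stride 2: threads t and t+8 share a bank -
                               // 2-way bank conflicts
    float c = s[tid * 8];      // stride 8: eight threads land on one bank -
                               // 8-way bank conflicts
    out[tid] = a + b + c;
}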
Control Flow Instructions
Main performance concern with branching is divergence
  Threads within a single warp take different paths
  Different execution paths are serialized in G80
    The control paths taken by the threads in a warp are traversed one at a
    time until there are no more
A common case: avoid divergence when the branch condition is a function of
the thread ID (see the sketch below)
  Example with divergence:
    if (threadIdx.x > 2) { }
    This creates two different control paths for threads in a block
    Branch granularity < warp size; threads 0, 1, and 2 follow a different
    path than the rest of the threads in the first warp
  Example without divergence:
    if (threadIdx.x / WARP_SIZE > 2) { }
    This also creates two different control paths for threads in a block
    Branch granularity is a whole multiple of warp size; all threads in any
    given warp follow the same path
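A hedged, self-contained version of those two cases (the kernel and the
writes to out are assumptions used only to make the branch granularity
visible):

#define WARP_SIZE 32

__global__ void branchExamples(int* out)
{
    int tid = threadIdx.x;

    // Divergent: within warp 0, threads 0-2 and threads 3-31 take different
    // paths, so the warp executes both paths serially
    if (threadIdx.x > 2) out[tid] = 1; else out[tid] = 0;

    // Non-divergent: the condition changes only at warp boundaries, so every
    // thread of a given warp takes the same path
    if (threadIdx.x / WARP_SIZE > 2) out[tid] += 2;
}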
Vector Reduction with Branch
Divergence
[Figure: in the first iteration, threads 0, 2, 4, 6, 8, 10, ... form the
pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ... over the array elements;
later iterations combine these into partial sums 0...3, 4...7, 8...11, then
0...7, 8...15, and so on. Because only every other (then every fourth, ...)
thread stays active, threads within the same warp take different paths.]
No Divergence until < 16 sub-sums
[Figure: each thread instead adds the element half the active range away -
thread 0 computes 0+16, ..., thread 15 computes 15+31 - so the active threads
are contiguous and whole warps take the same path until fewer than one warp's
worth of sub-sums remains.]
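A hedged sketch of the two indexing strategies for a shared-memory reduction
(the kernel, block size, and array names are assumptions; the active loop is
the pattern the slide above illustrates, and the divergent variant is shown
commented out for comparison):

__global__ void reduce(float* in, float* out)
{
    __shared__ float partial[256];            // assumes 256 threads per block
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Divergent version: interleaved pairs, so the active threads are
    // scattered across every warp:
    //   for (int stride = 1; stride < blockDim.x; stride *= 2) {
    //       if (tid % (2 * stride) == 0)
    //           partial[tid] += partial[tid + stride];
    //       __syncthreads();
    //   }

    // Non-divergent version: contiguous threads add the element half the
    // active range away; whole warps stay on the same path until the active
    // count drops below a warp
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = partial[0];
}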
Fundamentals of Parallel
Computing
Parallel computing requires that
The problem can be decomposed into sub-problems
that can be safely solved at the same time
The programmer structures the code and data to solve
these sub-problems concurrently
The goals of parallel computing are
To solve problems in less time, and/or
To solve bigger problems, and/or
To achieve better solutions
The problems must be large enough to justify parallel
computing and to exhibit exploitable concurrency.
Challenges of Parallel
Programming
Finding and exploiting concurrency often requires
looking at the problem from a non-obvious angle
Computational thinking (J. Wing)
Dependences need to be identified and managed
The order of task execution may change the answers
Obvious: One step feeds result to the next steps
Subtle: numeric accuracy may be affected by ordering steps that are
logically parallel with each other
Performance can be drastically reduced by many
factors
Overhead of parallel processing
Load imbalance among processor elements
Inefficient data sharing patterns
Saturation of critical resources such as memory bandwidth
Fermi Implements CUDA
Definitions of memory scope, grid, thread block, and thread are the same as
in Tesla
  Grid: array of thread blocks
  Thread block: up to 1,024 threads communicating through shared memory
  (an SM can hold up to 1,536 concurrent threads)
The GPU has an array of SMs; each executes one or more thread blocks, and
each block is grouped into warps of 32 threads per warp
Other resource constraints are implementation-specific
Fermi GT300 Key Feature
32 cores per SM, 512 cores total
Fully pipelined integer and floating-point units that implement the new
IEEE 754-2008 standard, including fused multiply-add (FMA)
Two warps from different thread blocks (even different kernels) can be
issued and executed concurrently
ECC protection from the registers to DRAM
Linear addressing model with caching at all levels
Large shared memory / L1 cache
Double-precision performance 8x faster than GT200, reaching ~600
double-precision GFLOPS
Fermi GT300 Key Feature
(cont.)
Fermi supports simultaneous execution of multiple kernels from the same
application, each kernel distributed to one or more SMs
The GigaThread hardware thread scheduler manages 1,536 simultaneously active
threads for each SM across 16 kernels
Switching from one application to another is 20x faster on Fermi
Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
Fermi supports FMA for both single and double precision
Instruction Schedule Example
A total of 32 instructions from one or two warps can be dispatched in each
cycle to any two of the four execution blocks within a Fermi SM: two blocks
of 16 cores each, one block of four Special Function Units, and one block of
load/store units. (The accompanying figure shows how instructions are issued
to the four execution blocks.)
It takes two cycles for the 32 instructions in each warp to execute on the
cores or load/store units. A warp of 32 special-function instructions is
issued in a single cycle but takes eight cycles to complete on the four SFUs.
Another major improvement in Fermi and PTX 2.0 is a new unified addressing
model. All addresses in the GPU are allocated from a continuous 40-bit (one
terabyte) address space. Global, shared, and local addresses are defined as
ranges within this address space and can be accessed by common load/store
instructions. (The load/store instructions support 64-bit addresses to allow
for future growth.)
Multi-Core Architecture:
Intel Quad Core Technology of Today
Cache Structure
[Figure: four cores (Core 0-3); Cores 0-1 share one 4MB L2 cache and Cores
2-3 share another 4MB L2 cache; both connect through the bus interface to a
1066 MHz / 1333 MHz FSB.]
The L2 cache of today's quad-core processors is not one cache shared by all
4 cores. Instead there are two L2 caches, each shared by two cores.
What Is OpenMP*?
http://www.openmp.org
Current spec is OpenMP 2.5 (250 pages, combined C/C++ and Fortran)
[Figure: a montage of OpenMP directives, API calls, and environment settings,
e.g. C$OMP FLUSH, C$OMP THREADPRIVATE(/ABC/), C$OMP parallel do shared(a, b, c),
#pragma omp critical, CALL OMP_SET_NUM_THREADS(10), call omp_test_lock(jlok),
C$OMP MASTER, call OMP_INIT_LOCK(ilok), C$OMP ATOMIC, C$OMP SINGLE PRIVATE(X),
setenv OMP_SCHEDULE dynamic, C$OMP PARALLEL DO ORDERED PRIVATE(A, B, C),
C$OMP PARALLEL REDUCTION(+: A, B), C$OMP ORDERED, C$OMP SECTIONS,
#pragma omp parallel for private(A, B), C$OMP PARALLEL COPYIN(/blk/),
Nthrds = OMP_GET_NUM_PROCS(), !$OMP BARRIER, C$OMP DO lastprivate(XX),
omp_set_lock(lck).]
Programming with OpenMP*
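As a hedged taste of what these directives look like in use (plain C host
code; the loop, array, and variable names are assumptions, not from the
slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    static double a[1000000];
    double sum = 0.0;

    // Work-sharing loop: iterations are divided among the team of threads
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = i * 0.5;

    // Reduction: each thread keeps a private partial sum that is combined
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}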
More Material
Intel Larrabee Architecture
Herlihy's book (The Art of Multiprocessor Programming)
  Chapter 1: Introduction
  Chapter 2: Mutual Exclusion