
Multicore and Multicore programming with

OpenMP
(Calcul Réparti et Grid Computing)

[email protected]

for an up-to-date version of the slides:


http://buttari.perso.enseeiht.fr
Section 1

Introduction
Why multicores? the three walls

What is the reason for the introduction of multicores?


Uniprocessor performance is leveling off due to the “three walls”:
I ILP wall: Instruction Level Parallelism is near its limits
I Memory wall: caches show diminishing returns
I Power wall: power per chip is getting painfully high
The ILP wall

There are two common approaches to exploit ILP:


I Vector instructions (SSE, AltiVec etc.)
I Out-of-order issue with in-order retirement, speculation,
register renaming, branch prediction etc.
Neither of these can generate much concurrency because of:
I irregular memory access patterns
I control-dependent computations
I data-dependent memory accesses
Multicore processors, on the other hand, exploit Thread Level
Parallelism (TLP), which can in principle achieve any degree of
concurrency
The Memory wall
The gap between processors and memory speed has increased
dramatically. Caches are used to improve memory performance
provided that data locality can be exploited.
To deliver twice the performance with the same bandwidth, the
cache miss rate must be cut in half; this means:
I For dense matrix-matrix multiply or dense LU, 4x bigger cache
I For sorting or FFTs, the square of its former size
I For sparse or dense matrix-vector multiply, forget it
What is the cost of complicated memory hierarchies?

LATENCY

TLP (that is, multicores) can help overcome this inefficiency by


means of multiple streams of execution where memory access
latency can be hidden.
The Power wall
ILP techniques are based on the exploitation of higher clock
frequencies.
Processor performance can be improved by a factor k by
increasing the clock frequency by the same factor.
Is this a problem? Yes, it is:

P ≈ Pdynamic = C · V² · f

where Pdynamic is the dynamic power, C the capacitance, V the
voltage and f the frequency; but

fmax ∝ V

so power consumption and heat dissipation grow as f³!
The Power wall

Is there any other way to increase performance without consuming
too much power?
Yes, with multicores:
a k-way multicore is k times faster than a unicore and consumes
only k times as much power.

Pdynamic ∝ C
Thus power consumption and heat dissipation grow linearly with
the number of cores (i.e., chip complexity or number of
transistors).
The Power wall
[Figure: power consumption vs. single-core speed-up; the power drawn
grows roughly as the cube of the speed-up]

It is even possible to reduce power consumption while still
increasing performance.
Assume a single-core processor with frequency f and capacitance C.
A quad-core running at frequency 0.6 × f will consume about 15% less
power, since 4 × 0.6³ ≈ 0.86, while delivering 2.4× higher
performance (4 × 0.6 = 2.4).
Moore’s Law
Moore’s law: the number of transistors in microprocessors
doubles every two years.
Moore’s law, take 2: the performance of microprocessors
doubles every 18 months.
Examples of multicore architectures
Conventional Multicores
What are the problems with all these designs?
I Core-to-core communication. Although cores lie on the same
piece of silicon, there is no direct communication channel
between them. The only option is to communicate through
main memory.
I Shared memory bus. On modern systems, processors are
much faster than memory; example:
Intel Woodcrest:
I at 3.0 GHz each core can process
3 × 4 (SSE) × 2 (dual issue) = 24 single-precision floating-point
values per nanosecond.
I at 10.5 GB/s the memory can provide 10.5/4 ≈ 2.6
single-precision floating-point values per nanosecond.
One core is about 9 times as fast as the memory!
Attaching more cores to the same bus only makes the problem
worse unless heavy data reuse is possible.
The future of multicores
TILE64 is a multicore processor manufactured by Tilera. It consists of
a mesh network of 64 ”tiles”, where each tile houses a general
purpose processor, cache, and a non-blocking router, which the tile
uses to communicate with the other tiles on the processor.

I 4.5 TB/s on-chip mesh interconnect


I 25 GB/s towards main memory
I no floating-point
Intel Polaris
Intel Polaris 80 cores prototype:
I 80 tiles arranged in a 8 × 10 grid
I on-chip mesh interconnect with 1.62 Tb/s
bisection bandwidth
I 3-D stacked memory (future)
I consumes only 62 Watts and is 275 square
millimeters
I each tile has:
I a router
I 3 KB instruction memory
I 2 KB data memory
I 2 SP FMAC units
I 32 SP registers
That makes 4 (flops) × 80 (tiles) × 3.16 GHz ≈ 1 TFlop/s. The
first TFlop/s machine was ASCI Red, made up of roughly 10,000 Pentium
Pro processors, occupying 250 m² and consuming 500 kW...
The IBM Cell
The Cell Broadband Engine was released in
2005 by the STI (Sony Toshiba IBM)
consortium. It is a 9-way multicore processor.

I 1 control core + 8 working cores


I computational power is achieved through
exploitation of two levels of parallelism:
I vector units
I multiple cores
I on-chip interconnect bus for core-to-core
communications
I caches are replaced by explicitly managed
local memories
I performance comes at a price: the Cell is
very hard to program
The Cell: architecture
I one POWER Processing Element (PPE):
this is almost like a PowerPC processor (it
does not have some ILP features) and it is
almost exclusively meant for control work.
I 8 Synergistic Processing Elements (SPEs)
(only 6 in the PS3)
I one Element Interconnect Bus (EIB):
on-chip ring bus connecting all the SPEs
and the PPE
I one Memory Interface Controller (MIC)
that connects the EIB to the main
memory
I PPE and SPEs have different ISAs and
thus we have to write different code and
use different compilers
Hello world! example

The PPU and the SPUs have different ISAs (Instruction Set
Architecture), therefore two different compilers must be used.
PPE code

#include <stdio.h>
#include <libspe.h>
#include <sys/wait.h>

extern spe_program_handle_t hello_spu;

int main(void){
  speid_t speid[8];
  int status[8];
  int i;

  for (i=0;i<8;i++)
    speid[i] = spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);

  for (i=0;i<8;i++){
    spe_wait(speid[i], &status[i], 0);
    printf("status = %d\n", WEXITSTATUS(status[i]));
  }
  return 0;
}

SPE code

#include <stdio.h>

int main(unsigned long long speid,
         unsigned long long argp,
         unsigned long long envp){
  printf("Hello world (0x%llx)\n", speid);
  return 0;
}
Other computing devices: GPUs

NVIDIA GPUs vs Intel processors: performance


Other computing devices: GPUs
NVIDIA GeForce 8800 GTX:

16 streaming multiprocessors of 8 thread processors each.


Other computing devices: GPUs

How to program GPUs?


I SPMD programming model
I coherent branches (i.e. SIMD style) preferred
I penalty for non-coherent branches (i.e., when different
processes take different paths)
I directly with OpenGL/DirectX: not suited for general purpose
computing
I with higher level GPGPU APIs:
I AMD/ATI HAL-CAL (Hardware Abstraction Level - Compute
Abstraction Level)
I NVIDIA CUDA: C-like syntax with pointers etc.
I RapidMind
I PeakStream
Other computing devices: GPUs

LU on an 8-core Xeon + a GeForce GTX 280:


Section 2

Single-core performance programming


Single-core programming
Simple matrix-vector multiply-add c=c+A*b in real, single
precision with A of size 32

void matvec(float *A, float *c, float *b)
{
  int m, n;
  for (n = 0; n < 32; n++)
    for (m = 0; m < 32; m++)
      c[m] += A[n*32+m] * b[n];
}

I good news: standard C code will compile and run correctly


on basically any computer
I bad news: no performance, i.e., 0.63 Gflop/s (∼ 3% peak on
a Core i7 @ 2.66 GHz)

Code available at:


http://buttari.perso.enseeiht.fr/stuff/matvec.c
Vectorization

Most modern processors are equipped with vector units. They


make it possible to perform the same operation on multiple data
with a single instruction. For this reason they are also called SIMD
(Single Instruction Multiple Data) units:

I x86 processors (Intel and AMD) have SSE (Streaming SIMD
Extensions) units with 128-bit vectors that can do either 4 single-
or 2 double-precision operations per instruction
I Power processors (IBM) have AltiVec units with 128-bit vectors
that can do 4 single-precision operations per instruction;
AltiVec, moreover, can also do fused multiply-add
Vectorization

[Figure: a vector unit applies the same operation to four packed
operands simultaneously]
Vectorization
The Intel compiler provides a library of intrinsics, i.e., functions
that map directly to specific x86 instructions. They can be used to
vectorize a code by hand. GNU compilers have similar features, as do
others (e.g., IBM XL).

#include "xmmintrin.h"

__m128 Av, cv, mul, bv;              // declare vectors
int i, j;

for(j=0; j<32; j++){
  bv = _mm_load1_ps(&b[j]);          // splat one coefficient of b
  for(i=0; i<32; i+=4){
    Av = _mm_loadu_ps(&A[j*N+i]);    // load 4 values of A(:,j)
    cv = _mm_loadu_ps(&c[i]);        // load 4 coefficients of c
    mul = _mm_mul_ps(Av, bv);        // multiply
    cv = _mm_add_ps(mul, cv);        // add
    _mm_storeu_ps(&c[i], cv);        // store the result in c
  }
}

Performance now is 1.26 Gflop/s, i.e., ∼ 6% of the peak


Unrolling
__m128 b;
__m128 Av0, mul0, c0; ...
__m128 Av7, mul7, c7;

c0 = _mm_loadu_ps(&c[0 ]); ...
c7 = _mm_loadu_ps(&c[28]);

for(j=0; j<B; j++){
  b = _mm_load1_ps(&b[j]);

  Av0 = _mm_loadu_ps(&A[j*N + 0 ]); ...
  Av7 = _mm_loadu_ps(&A[j*N + 28]);

  mul0 = _mm_mul_ps(Av0, b); ...
  mul7 = _mm_mul_ps(Av7, b);

  c0 = _mm_add_ps(mul0, c0); ...
  c7 = _mm_add_ps(mul7, c7);
}

_mm_storeu_ps(&c[0 ], c0); ...
_mm_storeu_ps(&c[28], c7);

With a complete unrolling of the loop on rows, the c vector gets
loaded into registers only once and the result is stored only once
at the end.
The performance is now 1.7 Gflop/s, i.e., ∼ 8% of the peak.
Prefetching
__m128 b;
__m128 Av0, mul0, c0; ...
__m128 Av7, mul7, c7;

c0 = _mm_loadu_ps(&c[0 ]); ...
c7 = _mm_loadu_ps(&c[28]);

_mm_prefetch((void *)&A[0 ]); ...
_mm_prefetch((void *)&A[28]);

for(j=0; j<B; j++){
  _mm_prefetch((void *)&A[(j+1)*N + 0 ]); ...
  _mm_prefetch((void *)&A[(j+1)*N + 28]);

  b = _mm_load1_ps(&b[j]);

  Av0 = _mm_loadu_ps(&A[j*N + 0 ]); ...
  Av7 = _mm_loadu_ps(&A[j*N + 28]);

  mul0 = _mm_mul_ps(Av0, b); ...
  mul7 = _mm_mul_ps(Av7, b);

  c0 = _mm_add_ps(mul0, c0); ...
  c7 = _mm_add_ps(mul7, c7);
}

_mm_storeu_ps(&c[0 ], c0); ...
_mm_storeu_ps(&c[28], c7);

When column j of A is being multiplied, column j+1 is pre-loaded
into cache to reduce the latency of access to data.
The performance is now 1.9 Gflop/s, i.e., ∼ 9% of the peak.
Performance evaluation
A highly optimized implementation of the matrix-vector product
(e.g., the SGEMV routine in the MKL BLAS) achieves 3.6 Gflop/s, but
this is still only 17% of the peak. Can we get any better performance?

The processor’s SP peak is 21.28 Gflop/s. The memory bandwidth
is 17 GB/s (i.e., 4.25 SP values per nanosecond). Thus one core is
much faster than memory!

Prefetching can totally hide the memory transfers only if many more
computations are done on each datum held in cache/registers. In that
case it is possible to get very close to the processor’s peak
performance. This explains the difference between Level-1, Level-2
and Level-3 BLAS routines.
BLAS operations

I Level-1 BLAS: vector-vector operations like inner product or
vector sum. O(n) operations are performed on O(n) data.
Vectorizable but limited by bus speed

I Level-2 BLAS: matrix-vector operations like matrix-vector
product. O(n²) operations are performed on O(n²) data.
Vectorizable but limited by bus speed

I Level-3 BLAS: matrix-matrix operations like matrix-matrix
product or rank-k update. O(n³) operations are performed on
O(n²) data. Vectorizable and very efficient thanks to good
exploitation of memory hierarchy
Section 3

OpenMP
How to program multicores: OpenMP

OpenMP (Open specifications for MultiProcessing) is an


Application Program Interface (API) to explicitly direct
multi-threaded, shared memory parallelism.
I Comprised of three primary API components:
I Compiler directives (OpenMP is a compiler technology)
I Runtime library routines
I Environment variables
I Portable:
I Specifications for C/C++ and Fortran
I Already available on many systems (including Linux, Win,
IBM, SGI etc.)
I Full specs
http://openmp.org
I Tutorial
https://computing.llnl.gov/tutorials/openMP/
How to program multicores: OpenMP

OpenMP is based on a fork-join execution model:

I Execution is started by a single thread called master thread


I when a parallel region is encountered, the master thread
spawns a set of threads
I the set of instructions enclosed in a parallel region is executed
I at the end of the parallel region all the threads synchronize
and terminate leaving only the master
How to program multicores: OpenMP

Parallel regions and other OpenMP constructs are defined by


means of compiler directives:
C/C++

#include <omp.h>

main () {

  int var1, var2, var3;

  /* Serial code */

  #pragma omp parallel private(var1, var2) \
                       shared(var3)
  {
    /* Parallel section executed
       by all threads */
  }

  /* Resume serial code */

}

Fortran

program hello

  integer :: var1, var2, var3

  ! Serial code

  !$omp parallel private(var1, var2) &
  !$omp&         shared(var3)

  ! Parallel section executed by all threads

  !$omp end parallel

  ! Resume serial code

end program hello
OpenMP: the PARALLEL construct
The PARALLEL one is the main OpenMP construct and identifies a
block of code that will be executed by multiple threads:
!$OMP PARALLEL [clause ...]
IF (scalar_logical_expression)
PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
REDUCTION (operator: list)
COPYIN (list)
NUM_THREADS (scalar-integer-expression)

block

!$OMP END PARALLEL

I The master is a member of the team and has thread number 0


I Starting from the beginning of the region, the code is
duplicated and all threads will execute that code.
I There is an implied barrier at the end of a parallel section.
I If any thread terminates within a parallel region, all threads in
the team will terminate.
OpenMP: the PARALLEL construct

How many threads do we have? The number of threads depends on:
I Evaluation of the IF clause
I Setting of the NUM_THREADS clause
I Use of the omp_set_num_threads() library function
I Setting of the OMP_NUM_THREADS environment variable
I Implementation default - usually the number of CPUs on a
node, though it could be dynamic
Hello world example:
program hello

integer :: nthreads, tid, &


& omp_get_num_threads, omp_get_thread_num

! Fork a team of threads giving them


! their own copies of variables
!$omp parallel private(tid)

! Obtain and print thread id


tid = omp_get_thread_num()
write(*,’("Hello from thread ",i2)’)tid

! Only master thread does this


if (tid .eq. 0) then
nthreads = omp_get_num_threads()
write(*,’("# threads: ",i2)’)nthreads
end if

! All threads join master thread and disband


!$omp end parallel

end program hello

I the PRIVATE clause says that each thread will have its own
copy of the tid variable (more later)
I omp_get_num_threads and omp_get_thread_num are
runtime library routines
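For comparison, a C version of the same hello-world sketch (assumed
equivalent; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
  /* fork a team of threads, each with its own private tid */
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    printf("Hello from thread %2d\n", tid);

    /* only the master thread (thread 0) reports the team size */
    if (tid == 0)
      printf("# threads: %2d\n", omp_get_num_threads());
  } /* implied barrier: all threads join the master here */
  return 0;
}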
OpenMP: Data scoping
I Most variables are shared by default
I Global variables include:
I Fortran: COMMON blocks, SAVE and MODULE variables
I C: File scope variables, static
I Private variables include:
I Loop index variables
I Stack variables in subroutines called from parallel regions
I Fortran: Automatic variables within a statement block
I The OpenMP Data Scope Attribute Clauses are used to
explicitly define how variables should be scoped. They include:

I PRIVATE
I FIRSTPRIVATE
I LASTPRIVATE
I SHARED
I DEFAULT
I REDUCTION
I COPYIN
OpenMP: Data scoping

I PRIVATE(list): a new object of the same type is created for


each thread (uninitialized!)
I FIRSTPRIVATE(list): Listed variables are initialized
according to the value of their original objects prior to entry
into the parallel or work-sharing construct.
I LASTPRIVATE(list): The value copied back into the original
variable object is obtained from the last (sequentially)
iteration or section of the enclosing construct.
I SHARED(list): only one object exists in memory and all the
threads access it
I DEFAULT(SHARED|PRIVATE|NONE): sets the default scoping
I REDUCTION(operator:list): performs a reduction on the
variables that appear in its list.
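A small C sketch (not from the slides) illustrating the effect of the most
common scoping clauses:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int a = 1, b = 2, sum = 0;

  /* a:   each thread gets its own, uninitialized copy
     b:   each thread gets its own copy, initialized to 2
     sum: each thread gets a private copy initialized to 0;
          the copies are combined with + at the end of the region */
  #pragma omp parallel private(a) firstprivate(b) reduction(+:sum)
  {
    a = omp_get_thread_num();   /* must be set before use */
    sum += a + b;
  }

  printf("sum = %d\n", sum);    /* the original sum now holds the reduction */
  return 0;
}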
OpenMP: worksharing constructs

I A work-sharing construct divides the execution of the enclosed


code region among the members of the team that encounter it
I Work-sharing constructs do not launch new threads
There are three main workshare constructs:
I DO/for construct: it is used to parallelize loops
I SECTIONS: used to identify portions of code that can be
executed in parallel
I SINGLE: specifies that the enclosed code is to be executed by
only one thread in the team.
OpenMP: worksharing constructs

The DO/for directive:


program do_example

integer :: i, chunk
integer, parameter :: n=1000, &
& chunksize=100
real(kind(1.d0)) :: a(n), b(n), c(n)

! Some sequential code...


chunk = chunksize

!$omp parallel shared(a,b,c) private(i)

!$omp do
do i = 1, n
c(i) = a(i) + b(i)
end do
!$omp end do

!$omp end parallel

end program do_example


OpenMP: worksharing constructs

The DO/for directive:


!$OMP DO [clause ...]
SCHEDULE (type [,chunk])
ORDERED
PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
SHARED (list)
REDUCTION (operator | intrinsic : list)

do_loop

!$OMP END DO [ NOWAIT ]

This directive specifies that the iterations of the loop immediately


following it must be executed in parallel by the team

There is an implied barrier at the end of the construct


OpenMP: worksharing constructs
The SCHEDULE clause in the DO/for construct specifies how the
iterations of the loop are assigned to threads:
I STATIC: loop iterations are divided into pieces of size chunk
and then statically assigned to threads in a round-robin
fashion
I DYNAMIC: loop iterations are divided into pieces of size chunk,
and dynamically scheduled among the threads; when a thread
finishes one chunk, it is dynamically assigned another
I GUIDED: for a chunk size of 1, the size of each chunk is
proportional to the number of unassigned iterations divided by
the number of threads, decreasing to 1. For a chunk size with
value k (greater than 1), the size of each chunk is determined
in the same way with the restriction that the chunks do not
contain fewer than k iterations
I RUNTIME: The scheduling decision is deferred until runtime by
the environment variable OMP_SCHEDULE
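As an illustration (hypothetical example, not from the slides), a C sketch
that records which thread executes each iteration under a dynamic schedule
with chunks of 4 iterations:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
  int i, owner[N];

  /* chunks of 4 consecutive iterations are handed out to idle threads */
  #pragma omp parallel for schedule(dynamic, 4)
  for (i = 0; i < N; i++)
    owner[i] = omp_get_thread_num();

  for (i = 0; i < N; i++)
    printf("iteration %2d executed by thread %d\n", i, owner[i]);
  return 0;
}

Changing the clause to schedule(static) or schedule(guided, 4), or using
schedule(runtime) together with the OMP_SCHEDULE variable, changes only the
mapping of chunks to threads, not the result.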
OpenMP: worksharing constructs
Example showing scheduling policies for a loop of size 200

[Figure: chunks of the 200 iterations assigned to the threads under
static, dynamic(7) and guided(7) scheduling]

OpenMP: worksharing constructs

program do_example

integer :: i, chunk
integer, parameter :: n=1000, &
& chunksize=100
real(kind(1.d0)) :: a(n), b(n), c(n)

! Some sequential code...


chunk = chunksize

!$omp parallel shared(a,b,c,chunk) private(i)

!$omp do schedule(dynamic,chunk)
do i = 1, n
c(i) = a(i) + b(i)
end do
!$omp end do

!$omp end parallel

end program do_example


OpenMP: worksharing constructs

The SECTIONS directive is a non-iterative work-sharing construct.


It specifies that the enclosed section(s) of code are to be divided
among the threads in the team.
!$OMP SECTIONS [clause ...]
PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION (operator | intrinsic : list)

!$OMP SECTION

block

!$OMP SECTION

block

!$OMP END SECTIONS [ NOWAIT ]

There is an implied barrier at the end of the construct


OpenMP: worksharing constructs

Example of the SECTIONS worksharing construct


program vec_add_sections

integer :: i
integer, parameter :: n=1000
real(kind(1.d0)) :: a(n), b(n), c(n), d(n)

! some sequential code

!$omp parallel shared(a,b,c,d), private(i)

!$omp sections

!$omp section
do i = 1, n
c(i) = a(i) + b(i)
end do

!$omp section
do i = 1, n
d(i) = a(i) * b(i)
end do

!$omp end sections


!$omp end parallel

end program vec_add_sections


OpenMP: worksharing constructs

The SINGLE directive specifies that the enclosed code is to be


executed by only one thread in the team.
!$OMP SINGLE [clause ...]
PRIVATE (list)
FIRSTPRIVATE (list)

block

!$OMP END SINGLE [ NOWAIT ]

There is an implied barrier at the end of the construct


OpenMP: synchronization constructs

The CRITICAL construct enforces exclusive access with respect to


all critical constructs with the same name in all threads
!$OMP CRITICAL [ name ]

block

!$OMP END CRITICAL

The MASTER directive specifies a region that is to be executed only


by the master thread of the team
!$OMP MASTER

block

!$OMP END MASTER

The BARRIER directive synchronizes all threads in the team


!$OMP BARRIER
OpenMP: synchronization all-in-one example
!$OMP PARALLEL
! all the threads do some stuff in parallel
...

!$OMP CRITICAL
! only one thread at a time will execute these instructions.
! Critical sections can be used to prevent simultaneous
! writes to some data
call one_thread_at_a_time()
!$OMP END CRITICAL

...

!$OMP MASTER
! only the master thread will execute these instructions.
! Some parts can be inherently sequential or need not be
! executed by all the threads
call only_master()
!$OMP END MASTER

! each thread waits for all the others to reach this point
!$OMP BARRIER
! After the barrier we are sure that every thread sees the
! results of the work done by other threads

...
! all the threads do more stuff in parallel

!$OMP END PARALLEL


OpenMP: synchronization constructs: ATOMIC
The ATOMIC directive specifies that a specific memory location
must be updated atomically, rather than letting multiple threads
attempt to write to it.
!$OMP ATOMIC

statement expression

[!$OMP END ATOMIC]

What is the difference with CRITICAL?

!$omp atomic
x = x + some_function()

With ATOMIC the function some_function will still be evaluated in
parallel, since only the update of x is atomic.

Another advantage: compare

!$omp critical
x[i] = v
!$omp end critical

with

!$omp atomic
x[i] = v

With ATOMIC, different coefficients of x can be updated in parallel,
whereas CRITICAL serializes all the updates.


OpenMP: synchronization constructs: ATOMIC

With ATOMIC it is possible to specify the access mode to the data:

Read a variable atomically:
!$omp atomic read
v = x

Write a variable atomically:
!$omp atomic write
x = v

Update a variable atomically:
!$omp atomic update
x = x+1

Capture a variable atomically:
!$omp atomic capture
x = x+1
v = x
!$omp end atomic

atomic regions enforce exclusive access with respect to other


atomic regions that access the same storage location x among all
the threads in the program without regard to the teams to which
the threads belong
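A short C sketch (hypothetical histogram example) of the difference in
practice: the atomic directive protects a single update, so updates to
different bins can still proceed concurrently, whereas an unnamed critical
section would serialize them all:

#include <stdio.h>

#define NB 8

int main(void) {
  int hist[NB] = {0};
  int i;

  #pragma omp parallel for
  for (i = 0; i < 1000; i++) {
    int b = i % NB;
    /* only this read-modify-write is made atomic */
    #pragma omp atomic
    hist[b]++;
  }

  for (i = 0; i < NB; i++)
    printf("hist[%d] = %d\n", i, hist[i]);
  return 0;
}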
OpenMP: reductions and conflicts

How to do reductions with OpenMP?


sum = 0
do i=1,n
sum = sum+a(i)
end do

Here is a wrong way of doing it:


sum = 0
!$omp parallel do shared(sum)
do i=1,n
sum = sum+a(i)
end do

What is wrong?

Concurrent access has to be synchronized otherwise we will end up


in a WAW conflict!
Conflicts
I Read-After-Write (RAW): a datum is read after an instruction
that modifies it. It is also called a true dependency.

    a = b+c                 do i=2, n
    d = a+c                   a(i) = a(i-1)*b(i)
                            end do

I Write-After-Read (WAR): a datum is written after an instruction
that reads it. It is also called an anti-dependency.

    a = b+c                 do i=1, n-1
    b = c*2                   a(i) = a(i+1)*b(i)
                            end do

I Write-After-Write (WAW): a datum is written after an instruction
that modifies it. It is also called an output dependency.

    c = a(i)*b(i)           do i=1, n
    c = 4                     c = a(i)*b(i)
                            end do
OpenMP: reductions
We could use the CRITICAL construct:
sum = 0
!$omp parallel do shared(sum)
do i=1,n
!$omp critical
sum = sum+a(i)
!$omp end critical
end do

but there’s a more intelligent way


sum = 0
!$omp parallel do reduction(+:sum)
do i=1,n
sum = sum+a(i)
end do

The reduction clause specifies an operator and one or more list


items. For each list item, a private copy is created in each implicit
task, and is initialized appropriately for the operator. After the end
of the region, the original list item is updated with the values of
the private copies using the specified operator.
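The same pattern in C, as a minimal sketch:

#include <stdio.h>

#define N 1000

int main(void) {
  double a[N], sum = 0.0;
  int i;

  for (i = 0; i < N; i++) a[i] = 1.0;

  /* each thread accumulates into a private copy of sum (initialized
     to 0); the private copies are combined with + after the loop */
  #pragma omp parallel for reduction(+:sum)
  for (i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f\n", sum);   /* 1000.0 */
  return 0;
}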
OpenMP: the task construct

The TASK construct defines an explicit task


!$OMP TASK [clause ...]
IF (scalar-logical-expression)
UNTIED
DEFAULT (PRIVATE | SHARED | NONE)
PRIVATE (list)
FIRSTPRIVATE (list)
SHARED (list)
block

!$OMP END TASK

When a thread encounters a TASK construct, a task is generated


(not executed!!!) from the code for the associated structured
block.
The encountering thread may immediately execute the task, or
defer its execution. In the latter case, any thread in the team may
be assigned the task.
OpenMP: the task construct

But, then, when are tasks executed? Execution of a task may be


assigned to a thread whenever it reaches a task scheduling point:
I the point immediately following the generation of an explicit
task
I after the last instruction of a task region
I in taskwait regions
I in implicit and explicit barrier regions
At a task scheduling point a thread can:
I begin execution of a tied or untied task
I resume a suspended task region that is tied to it
I resume execution of a suspended, untied task
OpenMP: the task construct

All the clauses in the TASK construct have the same meaning as for
the other constructs except for:
I IF: when the IF clause expression evaluates to false, the
encountering thread must suspend the current task region and
begin execution of the generated task immediately, and the
suspended task region may not be resumed until the
generated task is completed
I UNTIED: by default a task is tied. This means that, if the task
is suspended, then its execution may only be resumed by the
thread that started it. If, instead, the UNTIED clause is
present, any thread can resume its execution
OpenMP: the task construct
Example of the TASK construct:
program example_task

  integer :: i, n
  n = 10

  !$omp parallel
  !$omp master
  do i=1, n
    !$omp task
    call tsub(i)
    !$omp end task
  end do
  !$omp end master
  !$omp end parallel

  stop
end program example_task

subroutine tsub(i)
  integer :: i
  integer :: iam, nt, omp_get_num_threads, &
       &     omp_get_thread_num

  iam = omp_get_thread_num()
  nt = omp_get_num_threads()

  write(*,'("iam:",i2," nt:",i2," i:",i4)')iam,nt,i

  return
end subroutine tsub

result:
iam: 3 nt: 4 i: 3
iam: 2 nt: 4 i: 2
iam: 0 nt: 4 i: 4
iam: 1 nt: 4 i: 1
iam: 3 nt: 4 i: 5
iam: 0 nt: 4 i: 7
iam: 2 nt: 4 i: 6
iam: 1 nt: 4 i: 8
iam: 3 nt: 4 i: 9
iam: 0 nt: 4 i: 10
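The same task-generation pattern written in C (a sketch, assumed
equivalent to the Fortran example above):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i, n = 10;

  #pragma omp parallel
  #pragma omp master                 /* only the master generates tasks */
  for (i = 1; i <= n; i++) {
    /* firstprivate(i): each task captures the current value of i */
    #pragma omp task firstprivate(i)
    printf("iam: %d nt: %d i: %d\n",
           omp_get_thread_num(), omp_get_num_threads(), i);
  }                                  /* the tasks are executed by the team,
                                        at the latest at the implicit barrier
                                        that closes the parallel region */
  return 0;
}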
OpenMP Locks
Lock can be used to prevent simultaneous access to shared
resources according to the schema
I acquire (or set, or lock) the lock
I access the data
I release (or unset, or unlock) the lock
Acquisition of the lock is exclusive in the sense that only one
thread can hold the lock at a given time. A lock can be in one of
the following states:
I uninitialized: the lock is not active and cannot be
acquired/released by any thread;
I unlocked: the lock has been initialized and can be acquired
by any thread;
I locked: the lock has been acquired by one thread and cannot
be acquired by any other thread until the owner releases it.
OpenMP Locks

Locks are used through the following routines:

I omp_init_lock: initializes a lock
I omp_destroy_lock: uninitializes a lock
I omp_set_lock: waits until the lock is available, and then sets it
I omp_unset_lock: unsets a lock
I omp_test_lock: tests a lock, and sets it if it is available
OpenMP Locks

Examples:

omp_set_lock:

!$OMP MASTER
! initialize the lock
call omp_init_lock(lock)
!$OMP END MASTER
...
! do work in parallel
...
call omp_set_lock(lock)
! exclusive access to data
...
call omp_unset_lock(lock)
...
! do more work in parallel
...
! destroy the lock
call omp_destroy_lock(lock)

omp_test_lock:

!$OMP MASTER
! initialize the lock
call omp_init_lock(lock)
!$OMP END MASTER
...
! do work in parallel
...
if(omp_test_lock(lock)) then
  ! the lock is available: acquire it and
  ! have exclusive access to data
  ...
  call omp_unset_lock(lock)
else
  ! do other stuff and check for availability
  ! later
  ...
end if
...
! do more work in parallel
...
! destroy the lock
call omp_destroy_lock(lock)
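In C the same routines operate on the omp_lock_t type; a minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_lock_t lock;
  int counter = 0;

  omp_init_lock(&lock);            /* uninitialized -> unlocked */

  #pragma omp parallel
  {
    omp_set_lock(&lock);           /* wait until the lock is available */
    counter++;                     /* exclusive access to shared data  */
    omp_unset_lock(&lock);         /* let the other threads in         */
  }

  omp_destroy_lock(&lock);         /* back to uninitialized            */
  printf("counter = %d\n", counter);
  return 0;
}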
Section 4

OpenMP examples
Loop parallelism vs parallel region

Note that these two codes are essentially equivalent:

Loop parallelism

!$OMP PARALLEL DO
do i=1, n
  a(i) = b(i) + c(i)
end do

Parallel region

!$OMP PARALLEL PRIVATE(iam, nth, beg, nl, i)
iam = omp_get_thread_num()
nth = omp_get_num_threads()

! compute the number of loop iterations
! done by each thread
nl = (n-1)/nth+1

! compute the first iteration number
! for this thread
beg = iam*nl+1

do i=beg, min(beg+nl-1,n)
  a(i) = b(i) + c(i)
end do
!$OMP END PARALLEL

Loop parallelism is not always possible or may not be the best way
of parallelizing a code.
Loop parallelism vs parallel region
Another example: parallelize the maxval(x) routine which
computes the maximum value of an array x of length n
Parallel region

!$OMP PARALLEL PRIVATE(iam, nth, beg, loc_n, i) REDUCTION(max:max_value)
iam = omp_get_thread_num()
nth = omp_get_num_threads()

! each thread computes the length of its local part of the array
loc_n = (n-1)/nth+1

! each thread computes the beginning of its local part of the array
beg = iam*loc_n+1

! for the last thread the local part may be smaller
if (iam == nth-1) loc_n = n-beg+1

max_value = maxval(x(beg:beg+loc_n-1))
!$OMP END PARALLEL

[Figure: thread0..thread3 each compute a local maximum max_val0..max_val3;
the REDUCTION clause combines them with the max operator into the final
max_val]
OpenMP MM product

subroutine mmproduct(a, b, c)
...

do i=1, n
do j=1, n
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
end do
end do

end subroutine mmproduct

Sequential version
OpenMP MM product

subroutine mmproduct(a, b, c)
...
do i=1, n
do j=1, n
do k=1, n
!$omp task
c(i,j) = c(i,j)+a(i,k)*b(k,j)
!$omp end task
end do
end do
end do
end subroutine mmproduct

Incorrect parallelization: WAW, WAR and RAW conflicts on c(i,j)


OpenMP MM product

subroutine mmproduct(a, b, c)
!$omp parallel private(i,j)
do i=1, n
do j=1, n
!$omp do
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
!$omp end do
end do
end do
end subroutine mmproduct

Incorrect parallelization: WAW, WAR and RAW conflicts on c(i,j)


OpenMP MM product

subroutine mmproduct(a, b, c)
!$omp parallel reduction(+:c) private(i,j)
do i=1, n
do j=1, n
!$omp do
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
!$omp end do
end do
end do
end subroutine mmproduct

Correct parallelization, but an enormous waste of memory (c is replicated)


OpenMP MM product

subroutine mmproduct(a, b, c)

do i=1, n
do j=1, n
acc = 0
!$omp parallel do reduction(+:acc)
do k=1, n
acc = acc+a(i,k)*b(k,j)
end do
!$omp end parallel do
c(i,j) = c(i,j)+acc
end do
end do
end subroutine mmproduct

Correct parallelization, but low efficiency (many fork-joins)


OpenMP MM product

subroutine mmproduct(a, b, c)
!$omp parallel private(i,j,acc)
do i=1, n
do j=1, n
acc = 0
!$omp do reduction(+:acc)
do k=1, n
acc = acc+a(i,k)*b(k,j)
end do
!$omp end do
!$omp single
c(i,j) = c(i,j)+acc
!$omp end single
end do
end do
end subroutine mmproduct

Correct parallelization, but still low efficiency


OpenMP MM product

subroutine mmproduct(a, b, c)

!$omp parallel do private(j,k)


do i=1, n
do j=1, n
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
end do
end do
!$omp end parallel do
end subroutine mmproduct

Correct parallelization with good performance


OpenMP MM product

subroutine mmproduct(a, b, c)
...

do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+ &
& matmul(a(i:i+nb-1,k:k+nb-1), b(k:k+nb-1,j:j+nb-1))
end do
end do
end do

end subroutine mmproduct

Optimized version by blocking


OpenMP MM product

subroutine mmproduct(a, b, c)
...
!$omp parallel do
do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+ &
& matmul(a(i:i+nb-1,k:k+nb-1), b(k:k+nb-1,j:j+nb-1))
end do
end do
end do
!$omp end parallel do
end subroutine mmproduct

Optimized parallel version


OpenMP MM product

subroutine mmproduct(a, b, c)
...
!$omp parallel do
do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+ &
     & matmul(a(i:i+nb-1,k:k+nb-1), b(k:k+nb-1,j:j+nb-1))
end do
end do
end do
!$omp end parallel do
end subroutine mmproduct

 1 thread   --->  4.29 Gflop/s
 2 threads  --->  8.43 Gflop/s
 4 threads  ---> 16.57 Gflop/s
 8 threads  ---> 31.80 Gflop/s
16 threads  ---> 55.11 Gflop/s
The Cholesky factorization

[Figure: partially factorized lower triangular matrix, with already
factorized entries l and trailing entries ã still to be updated]

do k=1, n
  a(k,k) = sqrt(a(k,k))
  do i=k+1, n
    a(i,k) = a(i,k)/a(k,k)
    do j=k+1, n
      a(i,j) = a(i,j) - a(i,k)*a(j,k)
    end do
  end do
end do

The unblocked Cholesky factorization is extremely inefficient due
to poor cache reuse. No level-3 BLAS operations are possible.
The Cholesky factorization

[Figure: partially factorized lower triangular matrix split into
nb × nb blocks]

do k=1, n, nb
  call dpotf2( a(k:k+nb-1,k:k+nb-1) )

  call dtrsm ( a(k+nb:n, k:k+nb-1), &
       &       a(k:k+nb-1,k:k+nb-1) )

  call dsyrk ( a(k+nb:n,k+nb:n), &
       &       a(k+nb:n, k:k+nb-1) )
end do

The blocked Cholesky factorization is highly efficient thanks to the
usage of level-3 BLAS routines.

[Figure: blocks touched at each step by dpotf2, dtrsm and dsyrk]

No potential for parallelism? FALSE
The Cholesky factorization

[Figure: lower triangular matrix logically split into nb × nb blocks]

The matrix can be logically split into blocks of size nb × nb and the
factorization written exactly as the non-blocked one, where operations
on single values are replaced by equivalent operations on blocks.

do k=1, n, nb
  call dpotf2( a(k:k+nb-1,k:k+nb-1) )

  do i=k+nb, n, nb
    call dtrsm ( a(i:i+nb-1, k:k+nb-1), &
         &       a(k:k+nb-1,k:k+nb-1) )

    do j=k+nb, i, nb
      call dpoup ( a(i:i+nb-1,j:j+nb-1), &
           &       a(i:i+nb-1, k:k+nb-1), &
           &       a(j:j+nb-1, k:k+nb-1) )
    end do
  end do
end do
Blocked Cholesky: multithreading
First attempt:
!$omp parallel do
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )

do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )

do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do

end do

end do
!$omp end parallel do

WRONG!
This parallelization will lead to incorrect results. The steps of the
blocked factorization have to be performed in the right order.
Blocked Cholesky: multithreading
Second attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )
!$omp parallel do
do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )

do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do

end do
!$omp end parallel do
end do

WRONG!
This parallelization will lead to incorrect results. At step step, the
dpoup operation on block a(row,col) depends on the result of
the dtrsm operations on blocks a(row,step) and a(col,step).
This parallelization only respects the dependency on the first one.
Blocked Cholesky: multithreading
Third attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )

do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
!$omp parallel do
do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do
!$omp end parallel do
end do

end do

CORRECT!
This parallelization will lead to correct results. Because, at each
step, the order of the dtrsm operations is respected, once the
dtrsm operation on block a(row,step) is done, all the updates
along row row can be done independently. Not really efficient.
Blocked Cholesky: multithreading
Fourth attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )

!$omp parallel do
do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
end do
!$omp end parallel do

!$omp parallel do
do i=k+nb, n, nb
do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do
end do
!$omp end parallel do
end do

CORRECT and more EFFICIENT!


All the dtrsm operations at step step are independent and can be
done in parallel. Because all the dtrsm are done before the
updates, these can be done in parallel too. But not optimal.
Blocked Cholesky: multithreading
[Figure: fork-join execution of the blocked Cholesky steps (dpotf2, dtrsm, dsyrk)]

Fork-join parallelism suffers from:


I poor parallelism: some operations are inherently sequential
and pose many constraints to the parallelization of the whole
code
I synchronizations: any fork or join point is a synchronization
point. This makes the parallel flow of execution extremely
constrained, increases the idle time, limits the scalability
Blocked Cholesky: better multithreading
All the previous parallelization approaches are based on the
assumption that step step+1 can be started only when all the
operations related to step step are completed. This constraint is
too strict and can be partially relaxed.
Which conditions have to be necessarily respected?
1. the dpotf2 operation on the diagonal block a(step,step)
can be done only if the block is up to date with respect to
step step-1
2. the dtrsm operation on block a(row,step) can be done only
if the block is up to date with respect to step step-1 and the
dpotf2 of block a(step,step) is completed
3. the dpoup of block a(row,col) at step step can be done
only if the block is up to date with respect to step step-1
and the dtrsm of blocks a(row,step) and a(col,step) at
step step are completed
Blocked Cholesky: better multithreading

How is it possible to handle all this complexity? The order of the
operations may be captured in a Directed Acyclic Graph (DAG) where
nodes define the computational tasks and edges the dependencies
among them. Tasks in the DAG may be dynamically scheduled.
I fewer dependencies, i.e., fewer synchronizations and high
flexibility for the scheduling of tasks
I no idle time
I adaptivity
I better scaling

[Figure: DAG of the tasks of a blocked Cholesky factorization on a
3 × 3 block matrix]
Blocked Cholesky: better multithreading
Multithreaded blocked Cholesky
[Figure: GFlop/s vs. number of cores (up to 24) for the sequential code
and the parallel do v1, parallel do v2 and DAG-based parallelizations]

download the code at:


http://buttari.perso.enseeiht.fr/stuff/ompchol.F90
OpenMP: the task construct
recursive subroutine traverse ( p )
type node
type(node), pointer :: left, right
end type node
type(node) :: p
if (associated(p%left)) then
!$omp task ! p is firstprivate by default
call traverse(p%left)
!$omp end task
end if
if (associated(p%right)) then
!$omp task ! p is firstprivate by default
call traverse(p%right)
!$omp end task
end if
call process ( p )
end subroutine traverse

Although the sequential code will traverse the tree in postorder,


this is not true for the parallel execution since no synchronizations
are performed.
OpenMP: the task construct
recursive subroutine traverse ( p )
type node
type(node), pointer :: left, right
end type node
type(node) :: p
if (associated(p%left)) then
!$omp task ! p is firstprivate by default
call traverse(p%left)
!$omp end task
end if
if (associated(p%right)) then
!$omp task ! p is firstprivate by default
call traverse(p%right)
!$omp end task
end if
!$omp taskwait
call process ( p )
end subroutine traverse

The TASKWAIT construct is a synchronization point which forces a
thread to wait until the execution of its child tasks has completed.
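A classic C illustration of the same pattern (a sketch that deliberately
ignores granularity concerns; a real code would stop creating tasks below
some cut-off size):

#include <stdio.h>

/* each recursive call becomes an explicit task; taskwait makes the
   parent wait for its two children before combining their results */
long fib(int n) {
  long x, y;
  if (n < 2) return n;

  #pragma omp task shared(x)
  x = fib(n - 1);
  #pragma omp task shared(y)
  y = fib(n - 2);
  #pragma omp taskwait
  return x + y;
}

int main(void) {
  long r;
  #pragma omp parallel
  #pragma omp single        /* one thread generates the initial tasks */
  r = fib(20);
  printf("fib(20) = %ld\n", r);
  return 0;
}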
OMP tasks: example
Write a parallel version of the following subroutine using OpenMP
tasks:
function foo()
  integer :: foo
  integer :: a, b, c, x, y

  a = f_a()
  b = f_b()
  c = f_c()
  x = f1(b, c)
  y = f2(a, x)

  foo = y

end function foo


OMP tasks: example
Write a parallel version of the following subroutine using OpenMP
tasks:
!$omp parallel
!$omp single
!$omp task
a = f_a()
!$omp end task

!$omp task

!$omp task
b = f_b()
!$omp end task

!$omp task
c = f_c()
!$omp end task

!$omp taskwait

x = f1(b, c)
!$omp end task

!$omp taskwait

y = f2(a, x)

!$omp end single


!$omp end parallel
Section 5

OpenMP: odds & ends


NUMA: Memory locality

Even if every core can access any memory module, data will be
transferred at different speeds depending on the distance (number
of hops)
NUMA: memory locality
If an OpenMP parallel DGEMV (matrix-vector product) operation is not
correctly coded on such an architecture, only a speedup of 1.5 can
be achieved using all the 24 cores. Why?

If all the data is stored on only one memory module, the memory
bandwidth will be low and the conflicts/contentions will be high.
When possible, it is good to partition the data, store partitions on
different memory modules and force each core to access only local
data.
NUMA: memory locality

Implementing all this requires the ability to:

I control the placement of threads: we have to bind each thread
to a single core and prevent thread migrations. This can be
done in a number of ways, e.g. by means of tools such as
hwloc which allow thread pinning
I control the placement of data: we have to make sure that a
given piece of data physically resides on a specific NUMA
module. This can be done with:
I the first touch rule: the data is allocated close to the core that
makes the first reference
I hwloc or libnuma which provide NUMA-aware allocators
I detect the architecture: we have to figure out the
memory/cores layout in order to guide the work stealing. This
can be done with hwloc
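A minimal C sketch of the first-touch rule (assuming the operating system
places each page on the NUMA node of the thread that first writes to it):

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
  double *x = malloc(N * sizeof *x);
  long i;

  /* initialize in parallel with the same static distribution that the
     compute loops will use: each page is first touched, and therefore
     physically placed, near the core that will keep using it */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < N; i++)
    x[i] = 0.0;

  /* subsequent parallel loops with schedule(static) now access
     mostly local memory */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < N; i++)
    x[i] = 2.0 * x[i];

  printf("x[0] = %f\n", x[0]);
  free(x);
  return 0;
}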
NUMA: memory locality
When this optimization is applied, much better performance and
scalability are achieved.
Hybrid parallelism

How to exploit parallelism in a cluster of SMPs/Multicores? There


are two options:
I Use MPI all over: MPI works on distributed memory systems
as well as on shared memory
I Use an MPI/OpenMP hybrid approach: define one MPI task
for each node and one OpenMP thread for each core in the
node.
Hybrid parallelism

program hybrid

  use mpi

  integer :: mpi_id, ierr, mpi_nt
  integer :: omp_id, omp_nt, &
       &     omp_get_num_threads, &
       &     omp_get_thread_num

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mpi_id, ierr)
  call mpi_comm_size(mpi_comm_world, mpi_nt, ierr)

  !$omp parallel
  omp_id = omp_get_thread_num()
  omp_nt = omp_get_num_threads()

  write(*,'("Thread ",i1,"(",i1,") &
       & within MPI task ",i1,"(",i1,")")') &
       & omp_id,omp_nt,mpi_id,mpi_nt

  !$omp end parallel

end program hybrid

result:
Thread 0(2) within MPI task 0(2)
Thread 0(2) within MPI task 1(2)
Thread 1(2) within MPI task 1(2)
Thread 1(2) within MPI task 0(2)
Section 6

Mixed-precision Iterative refinement


Mixed-precision arithmetic

On modern systems, single-precision arithmetic has a clear


performance advantage over double-precision arithmetic for the
following reasons:
I Vector instructions: vector units can normally do twice as
many SP operations as DP ones every clock cycle. For
example, SSE units can do either 4 SP or 2 DP operations.
I Bus bandwidth: because SP values are half the size of DP
ones, the memory transfer rate for SP values is twice that
for DP
I Data locality: again, because SP data are half the size of DP,
twice as many SP values as DP values fit in the cache
Mixed-precision arithmetic
Performance comparison between single and double precision
arithmetic for matrix-matrix and matrix-vector product operations
on square matrices.

                     Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246      3000       2.00      5000       1.70
Sun UltraSparc-IIe   3000       1.64      5000       1.66
Intel PIII Copp.     3000       2.03      5000       2.09
PowerPC 970          3000       2.04      5000       1.44
Intel Woodcrest      3000       1.81      5000       2.18
Intel XEON           3000       2.04      5000       1.82
Intel Centrino Duo   3000       2.71      5000       2.21
Mixed-precision arithmetic

Is there a way to do computations at the speed of single-precision


while achieving the accuracy of double-precision? Sort of.
For some operations it is possible to do the bulk of the computation
in single precision and then recover double-precision accuracy by
means of an iterative method like Newton's method.
Mixed-precision arithmetic
Newton's method says that, given an approximate root of the
function f(x), we can refine it through iterations of the type:

    xk+1 = xk − f(xk) / f′(xk)
Mixed-precision arithmetic

Newton's method can be applied, for example, to refine the
root of the function f(x) = b − Ax (equivalent to solving the linear
system Ax = b):

    xk+1 = xk + A⁻¹ rk    where    rk = b − Axk

This leads to the well-known iterative refinement method:

    x0 ← A⁻¹ b               O(n³)
    repeat
      rk ← b − A xk−1        O(n²)
      zk ← A⁻¹ rk            O(n²)
      xk ← xk−1 + zk         O(n)
    until convergence
Mixed-precision arithmetic

The same iteration can mix precisions: the expensive factorization is
performed in single precision (εs) and the refinement steps in double
precision (εd):

    x0 ← A⁻¹ b               εs
    repeat
      rk ← b − A xk−1        εd
      zk ← A⁻¹ rk            εs
      xk ← xk−1 + zk         εd
    until convergence
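A toy, self-contained C sketch of this scheme (a diagonal system is used
so that the single-precision "factorized solve" is just a division per
entry; a real code would use an LU or Cholesky factorization instead):

#include <stdio.h>

#define N 4

/* toy "solve with the single-precision factorization": A is diagonal,
   so solving reduces to one division per entry, carried out (and
   rounded) in single precision */
static void solve_single(const float d[N], const double r[N], double z[N]) {
  for (int i = 0; i < N; i++)
    z[i] = (double)((float)r[i] / d[i]);
}

int main(void) {
  double d[N] = {3.0, 7.0, 1.5, 9.0};   /* diagonal of A            */
  double b[N] = {1.0, 2.0, 3.0, 4.0};
  float  ds[N];                         /* single-precision copy    */
  double x[N], r[N], z[N];

  for (int i = 0; i < N; i++) ds[i] = (float)d[i];

  solve_single(ds, b, x);               /* x0 = A^-1 b       (eps_s) */

  for (int it = 0; it < 3; it++) {
    for (int i = 0; i < N; i++)
      r[i] = b[i] - d[i]*x[i];          /* residual          (eps_d) */
    solve_single(ds, r, z);             /* correction        (eps_s) */
    for (int i = 0; i < N; i++)
      x[i] += z[i];                     /* update            (eps_d) */
  }

  for (int i = 0; i < N; i++)
    printf("x[%d] = %.16f  (exact %.16f)\n", i, x[i], b[i]/d[i]);
  return 0;
}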
Mixed Precision Iterative Refinement: results

[Figure: LU solve on an AMD Opteron 246 at 2.0 GHz; Gflop/s vs. problem
size (up to 5000) for full single, mixed precision and full double]
Mixed Precision Iterative Refinement: results

[Figure: Cholesky solve on the Cell Broadband Engine at 3.2 GHz; Gflop/s
vs. problem size (500 to 4000) for full single, mixed precision and the
double-precision peak]
Mixed Precision Iterative Refinement: results

[Figure: Intel Woodcrest 3.0 GHz; speedup over full double precision of
the full single and mixed-precision solvers on six test matrices]
Appendix: routines for blocked Cholesky
I dpotf2: this LAPACK routine does the unblocked Cholesky
factorization of a symmetric positive definite matrix using only
the lower or upper triangular part of the matrix
I dtrsm: this BLAS routine does the solution of the problem
AX=B where A is a lower or upper triangular matrix and B is
a matrix containing multiple right-hand-sides
I dgemm: this BLAS routine performs a product of the type
C=alpha*A*B+beta*c where alpha and beta are scalars, A, B
and C are dense matrices
I dsyrk: this BLAS routine performs a symmetric rank-k
update of the type A=B*B’+alpha*A where alpha is a scalar,
A is a symmetric matrix and B a rank-k matrix updating only
the upper or lower triangular part of A
I dpoup: this routine (not in BLAS nor in LAPACK) calls the
dgemm or the dsyrk routine to perform an update on an
off-diagonal block or a diagonal block, respectively
Reference and examples
The OpenMP reference document can be found at this address:
http://www.openmp.org/mp-documents/OpenMP3.1.pdf
The file contains detailed documentation about all the OpenMP
directives (included those that were not discussed in the lectures)
and many examples.
It is warmly recommended to study and analyze (at least) the
following examples:
I A.1 parallel loop
I A.5 parallel
I A.12 sections
I A.14 single
I A.15.5 task
I A.15.6 task
I A.18 master
I A.19 critical
I A.21 barrier
I A.22.1 atomic
I A.22.2 atomic
I A.32.1 private
I A.32.2 private
I A.36 reduction
