OpenMP
(Calcul Réparti et Grid Computing)
Introduction
Why multicores? The three walls
P ≃ Pdynamic = C·V²·f
where Pdynamic is the dynamic power, C the capacitance, V the voltage and
f the frequency. But, since fmax ∼ V, power consumption and heat
dissipation grow as f³!
The Power wall
Pdynamic ∝ C
Thus power consumption and heat dissipation grow linearly with
the number of cores (i.e., chip complexity or number of
transistors).
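A quick back-of-the-envelope comparison (not from the slides) using the model above: to double the frequency, the voltage must also roughly double (fmax ∼ V), so one core running at 2f consumes about C·(2V)²·(2f) = 8·C·V²·f, whereas two cores running at f consume about 2·C·V²·f for the same nominal throughput.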
The Power wall
[Figure: power consumption as a function of the speed multiplier ("x speed" vs. "x consumption").]
The PPU and the SPUs have different ISAs (Instruction Set Architectures);
therefore two different compilers must be used.
PPE code (fragment):
#include <stdio.h>
#include <libspe.h>
#include <sys/wait.h>
...
    /* print the exit status of each SPE thread */
    printf("status = %d\n",
           WEXITSTATUS(status[i]));
  }
  return 0;
}
Other computing devices: GPUs
Vectorization
The Intel compiler includes a library of intrinsics, i.e., functions that
map directly to x86 vector instructions. They can be used to vectorize
code by hand. GNU compilers offer similar features, as do others
(e.g., IBM XL).
#include "xmmintrin.h"
Prefetching can completely hide memory transfers only if many computations
are performed on each data item while it sits in cache/registers. In that
case it is possible to get very close to the processor's peak performance.
This explains the performance difference between Level-1, Level-2 and
Level-3 BLAS routines.
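A rough operation count (a back-of-the-envelope estimate, not from the slides) shows why:
- Level-1 BLAS (e.g., y ← y + αx): O(n) flops on O(n) data, i.e., O(1) operations per data item;
- Level-2 BLAS (e.g., y ← y + Ax): O(n²) flops on O(n²) data, again O(1) operations per data item;
- Level-3 BLAS (e.g., C ← C + AB): O(n³) flops on O(n²) data, i.e., O(n) operations per data item, which gives prefetching and caches enough work to hide the memory transfers behind.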
BLAS operations
OpenMP
How to program multicores: OpenMP
OpenMP: the PARALLEL construct
The PARALLEL construct is the main OpenMP construct; it identifies a
block of code that will be executed by multiple threads:
!$OMP PARALLEL [clause ...]
IF (scalar_logical_expression)
PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
REDUCTION (operator: list)
COPYIN (list)
NUM_THREADS (scalar-integer-expression)
block
!$OMP END PARALLEL
- the PRIVATE clause says that each thread will have its own copy of the
  tid variable (more later)
- omp_get_num_threads and omp_get_thread_num are runtime library routines
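A minimal sketch along these lines (illustrative code, not necessarily the original slide's example):

program hello
use omp_lib
integer :: tid, nth
!$omp parallel private(tid, nth)
tid = omp_get_thread_num()     ! each thread has its own copy of tid
nth = omp_get_num_threads()    ! number of threads in the parallel region
print *, 'hello from thread', tid, 'of', nth
!$omp end parallel
end program hello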
OpenMP: Data scoping
- Most variables are shared by default
- Global variables include:
  - Fortran: COMMON blocks, SAVE and MODULE variables
  - C: file scope variables, static variables
- Private variables include:
  - loop index variables
  - stack variables in subroutines called from parallel regions
  - Fortran: automatic variables within a statement block
- The OpenMP Data Scope Attribute Clauses are used to explicitly define
  how variables should be scoped. They include:
  - PRIVATE
  - FIRSTPRIVATE
  - LASTPRIVATE
  - SHARED
  - DEFAULT
  - REDUCTION
  - COPYIN
OpenMP: Data scoping
integer :: i, chunk
integer, parameter :: n=1000, &
& chunksize=100
real(kind(1.d0)) :: a(n), b(n), c(n)
do i = 1, n
c(i) = a(i) + b(i)
end do
integer :: i, chunk
integer, parameter :: n=1000, &
     & chunksize=100
real(kind(1.d0)) :: a(n), b(n), c(n)
!$omp parallel shared(a, b, c) private(i)
!$omp do
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end do
!$omp end parallel
[Figure: distribution of the do_loop iterations among threads under static, dynamic(7) and guided(7) scheduling.]
program do_example
integer :: i, chunk
integer, parameter :: n=1000, &
     & chunksize=100
real(kind(1.d0)) :: a(n), b(n), c(n)
chunk = chunksize
!$omp parallel shared(a, b, c, chunk) private(i)
!$omp do schedule(dynamic,chunk)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end do
!$omp end parallel
end program do_example
!$OMP SECTIONS [clause ...]
!$OMP SECTION
block
!$OMP SECTION
block
!$OMP END SECTIONS
integer :: i
integer, parameter :: n=1000
real(kind(1.d0)) :: a(n), b(n), c(n), d(n)
!$omp parallel private(i)
!$omp sections
!$omp section
! one thread computes the sums
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp section
! another thread computes the products
do i = 1, n
   d(i) = a(i) * b(i)
end do
!$omp end sections
!$omp end parallel
!$OMP CRITICAL
! only one thread at a time will execute these instructions.
! Critical sections can be used to prevent simultaneous
! writes to some data
call one_thread_at_a_time()
!$OMP END CRITICAL
...
!$OMP MASTER
! only the master thread will execute these instructions.
! Some parts can be inherently sequential or need not be
! executed by all the threads
call only_master()
!$OMP END MASTER
! each thread waits for all the others to reach this point
!$OMP BARRIER
! After the barrier we are sure that every thread sees the
! results of the work done by other threads
...
! all the threads do more stuff in parallel
OpenMP: the ATOMIC construct
The ATOMIC construct protects a single statement that updates a variable
with an expression; it is typically cheaper than a CRITICAL construct.
Another advantage: CRITICAL serializes all the threads entering the region,
whereas ATOMIC only serializes accesses to the same memory location (here
the element x[i]):

!$omp critical             !$omp atomic
x[i] = v                   x[i] = v
!$omp end critical
What is wrong?
- Write-After-Read (WAR): a variable is written after an instruction that
  reads it; this is also called an anti-dependency.
- Write-After-Write (WAW): a variable is written after an instruction that
  modifies it:

  do i=1, n
     c = a(i)*b(i)
     c = 4
  end do
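One way to remove such a dependency (a sketch, not from the slides) is to give each thread its own copy of c; LASTPRIVATE additionally copies back the value of the sequentially last iteration:

!$omp parallel do private(i) lastprivate(c)
do i=1, n
   c = a(i)*b(i)   ! each thread works on its private copy of c
   c = 4
end do
!$omp end parallel do
! after the loop, c holds the value it would have in the sequential code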
All the clauses in the TASK construct have the same meaning as for
the other constructs except for:
- IF: when the IF clause expression evaluates to false, the
  encountering thread must suspend the current task region and
  begin execution of the generated task immediately, and the
  suspended task region may not be resumed until the
  generated task is completed
- UNTIED: by default a task is tied. This means that, if the task
  is suspended, then its execution may only be resumed by the
  thread that started it. If, instead, the UNTIED clause is
  present, any thread can resume its execution
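A common use of the IF clause (a sketch; do_work, chunk, sz and cutoff are hypothetical names introduced for illustration) is to avoid the tasking overhead for small amounts of work:

!$omp task if(sz > cutoff)
call do_work(chunk, sz)   ! when sz <= cutoff the task is undeferred: it is
                          ! executed immediately by the encountering thread
!$omp end task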
OpenMP: the task construct
Example of the TASK construct:
program example_task
integer :: i, n
n = 10
!$omp parallel
!$omp master
do i=1, n
   !$omp task
   call tsub(i)
   !$omp end task
end do
!$omp end master
!$omp end parallel
stop
end program example_task

subroutine tsub(i)
integer :: i
integer :: iam, nt, omp_get_num_threads, &
     &omp_get_thread_num
iam = omp_get_thread_num()
nt = omp_get_num_threads()
write(*,'("iam:",i2," nt:",i2," i:",i3)') iam, nt, i
return
end subroutine tsub

Result (with 4 threads):
iam: 3 nt: 4 i: 3
iam: 2 nt: 4 i: 2
iam: 0 nt: 4 i: 4
iam: 1 nt: 4 i: 1
iam: 3 nt: 4 i: 5
iam: 0 nt: 4 i: 7
iam: 2 nt: 4 i: 6
iam: 1 nt: 4 i: 8
iam: 3 nt: 4 i: 9
iam: 0 nt: 4 i: 10
OpenMP Locks
Locks can be used to prevent simultaneous access to shared
resources according to the following schema:
- acquire (or set, or lock) the lock
- access the data
- release (or unset, or unlock) the lock
Acquisition of the lock is exclusive in the sense that only one
thread can hold the lock at a given time. A lock can be in one of
the following states:
- uninitialized: the lock is not active and cannot be
  acquired/released by any thread;
- unlocked: the lock has been initialized and can be acquired
  by any thread;
- locked: the lock has been acquired by one thread and cannot
  be acquired by any other thread until the owner releases it.
OpenMP Locks
Examples:
omp_set_lock:

!$OMP MASTER
! initialize the lock
call omp_init_lock(lock)
!$OMP END MASTER
...
! do work in parallel
...
call omp_set_lock(lock)
! exclusive access to data
...
call omp_unset_lock(lock)
...
! do more work in parallel
...
! destroy the lock
call omp_destroy_lock(lock)

omp_test_lock:

!$OMP MASTER
! initialize the lock
call omp_init_lock(lock)
!$OMP END MASTER
...
! do work in parallel
...
if(omp_test_lock(lock)) then
   ! the lock is available: acquire it and
   ! have exclusive access to data
   ...
   call omp_unset_lock(lock)
else
   ! do other stuff and check for availability
   ! later
   ...
end if
...
! do more work in parallel
...
! destroy the lock
call omp_destroy_lock(lock)
Section 4
OpenMP examples
Loop parallelism vs parallel region
!$OMP PARALLEL PRIVATE(iam, nth, nl, beg, i)
iam = omp_get_thread_num(); nth = omp_get_num_threads()
nl  = (n-1)/nth+1          ! length of each thread's chunk
beg = iam*nl+1             ! beginning of each thread's chunk
do i=beg, min(beg+nl-1,n)
   a(i) = b(i) + c(i)
end do
!$OMP END PARALLEL
Loop parallelism is not always possible or may not be the best way
of parallelizing a code.
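For comparison, the same vector sum written with loop parallelism would simply be (a sketch):

!$omp parallel do
do i = 1, n
   a(i) = b(i) + c(i)
end do
!$omp end parallel do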
Loop parallelism vs parallel region
Another example: parallelize the maxval(x) routine which
computes the maximum value of an array x of length n
Parallel region
!$OMP PARALLEL PRIVATE(iam, nth, beg, loc_n, i) REDUCTION(max:max_value)
iam = omp_get_thread_num()
nth = omp_get_num_threads()
! each thread computes the length of its local part of the array
loc_n = (n-1)/nth+1
! each thread computes the beginning of its local part of the array
beg = iam*loc_n+1
max_value = maxval(x(beg:min(beg+loc_n-1,n)))
!$OMP END PARALLEL
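The same computation can also be written with loop parallelism and a MAX reduction (a sketch, not from the slides):

max_value = -huge(max_value)
!$omp parallel do reduction(max:max_value)
do i = 1, n
   max_value = max(max_value, x(i))
end do
!$omp end parallel do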
OpenMP MM product
subroutine mmproduct(a, b, c)
...
do i=1, n
do j=1, n
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
end do
end do
Sequential version
OpenMP MM product
subroutine mmproduct(a, b, c)
...
do i=1, n
do j=1, n
do k=1, n
!$omp task
c(i,j) = c(i,j)+a(i,k)*b(k,j)
!$omp end task
end do
end do
end do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
!$omp parallel private(i,j)
do i=1, n
do j=1, n
!$omp do
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
!$omp end do
end do
end do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
!$omp parallel reduction(+:c) private(i,j)
do i=1, n
do j=1, n
!$omp do
do k=1, n
c(i,j) = c(i,j)+a(i,k)*b(k,j)
end do
!$omp end do
end do
end do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
do i=1, n
do j=1, n
acc = 0
!$omp parallel do reduction(+:acc)
do k=1, n
acc = acc+a(i,k)*b(k,j)
end do
!$omp end parallel do
c(i,j) = c(i,j)+acc
end do
end do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
!$omp parallel private(i,j,acc)
do i=1, n
do j=1, n
acc = 0
!$omp do reduction(+:acc)
do k=1, n
acc = acc+a(i,k)*b(k,j)
end do
!$omp end do
!$omp single
c(i,j) = c(i,j)+acc
!$omp end single
end do
end do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
...
do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+ &
& matmul(a(i:i+nb-1,k:k+nb-1), b(k:k+nb-1,j:j+nb-1))
end do
end do
end do
subroutine mmproduct(a, b, c)
...
!$omp parallel do
do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+ &
& matmul(a(i:i+nb-1,k:k+nb-1), b(k:k+nb-1,j:j+nb-1))
end do
end do
end do
!$omp end parallel do
end subroutine mmproduct
OpenMP MM product
subroutine mmproduct(a, b, c)
...
!$omp parallel do
do i=1, n, nb
do j=1, n, nb
do k=1, n, nb
c(i:i+nb-1,j:j+nb-1) = c(i:i+nb-1,j:j+nb-1)+a(i:i+nb-1,k:k+nb-1)*b(k:k+nb-1,j:j+nb-1)
end do
end do
end do
!$omp end parallel do
end subroutine mmproduct
do k=1, n
   a(k,k) = sqrt(a(k,k))
   do i=k+1, n
      a(i,k) = a(i,k)/a(k,k)
      do j=k+1, n
         a(i,j) = a(i,j) - a(i,k)*a(j,k)
      end do
   end do
end do

[Figure: snapshot of the lower triangular matrix during the factorization:
the l entries already belong to the factor, the ã entries still have to be
updated.]
do k=1, n, nb
   call dpotf2( a(k:k+nb-1,k:k+nb-1) )
   call dtrsm ( a(k+nb:n, k:k+nb-1), &
        & a(k:k+nb-1,k:k+nb-1) )
   call dsyrk ( a(k+nb:n,k+nb:n), &
        & a(k+nb:n, k:k+nb-1) )
end do
Blocked Cholesky: multithreading
First attempt:
!$omp parallel do
do k=1, n, nb
   call dpotf2( a(k:k+nb-1,k:k+nb-1) )
   do i=k+nb, n, nb
      call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
      do j=k+nb, i, nb
         call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
      end do
   end do
end do
!$omp end parallel do
WRONG!
This parallelization will lead to incorrect results. The steps of the
blocked factorization have to be performed in the right order.
Blocked Cholesky: multithreading
Second attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )
!$omp parallel do
do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do
end do
!$omp end parallel do
end do
WRONG!
This parallelization will lead to incorrect results. At step k, the
dpoup operation on block a(i,j) depends on the result of
the dtrsm operations on blocks a(i,k) and a(j,k).
This parallelization only respects the dependency on the first one.
Blocked Cholesky: multithreading
Third attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )
do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
!$omp parallel do
do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do
!$omp end parallel do
end do
end do
CORRECT!
This parallelization leads to correct results: at each step the order of
the dtrsm operations is respected, so once the dtrsm operation on block
a(i,k) is done, all the updates along row i can be performed
independently. It is, however, not very efficient.
Blocked Cholesky: multithreading
Fourth attempt:
do k=1, n, nb
call dpotf2( a(k:k+nb-1,k:k+nb-1) )
!$omp parallel do
do i=k+nb, n, nb
call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
end do
!$omp end parallel do
!$omp parallel do
do i=k+nb, n, nb
do j=k+nb, i, nb
call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
end do
end do
!$omp end parallel do
end do
- better scaling
Blocked Cholesky: better multithreading
[Figure: Multithreaded blocked Cholesky performance, GFlop/s vs. number of cores (up to 24), for the sequential code, the two parallel do versions (v1, v2) and a DAG-based version.]
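A sketch (not from the slides) of how a DAG-based version can be written with OpenMP tasks and DEPEND clauses (available since OpenMP 4.0); the first element of each block is used as a proxy for the whole block in the dependence lists:

!$omp parallel private(i, j, k)
!$omp master
do k=1, n, nb
   !$omp task depend(inout: a(k,k))
   call dpotf2( a(k:k+nb-1,k:k+nb-1) )
   !$omp end task
   do i=k+nb, n, nb
      !$omp task depend(in: a(k,k)) depend(inout: a(i,k))
      call dtrsm ( a(i:i+nb-1, k:k+nb-1), a(k:k+nb-1,k:k+nb-1) )
      !$omp end task
      do j=k+nb, i, nb
         !$omp task depend(in: a(i,k)) depend(in: a(j,k)) depend(inout: a(i,j))
         call dpoup ( a(i:i+nb-1,j:j+nb-1), a(i:i+nb-1, k:k+nb-1), a(j:j+nb-1, k:k+nb-1) )
         !$omp end task
      end do
   end do
end do
!$omp end master
!$omp end parallel
! the runtime executes each task as soon as its dependencies are satisfied,
! instead of synchronizing all the threads at every step k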
a = f_a()
b = f_b()
c = f_c()
x = f1(b, c)
y = f2(a, x)
return y;
a = f_a()
!$omp task
   !$omp task
   b = f_b()
   !$omp end task
   !$omp task
   c = f_c()
   !$omp end task
   !$omp taskwait
   x = f1(b, c)
!$omp end task
!$omp taskwait
y = f2(a, x)
NUMA: memory locality
Even if every core can access any memory module, data will be
transferred at different speeds depending on the distance (number
of hops).
NUMA: memory locality
If an OpenMP parallel DGEMV (dense matrix-vector product) is not
correctly coded on such an architecture, only a speedup of 1.5 can
be achieved using all the 24 cores. Why?
If all the data is stored on only one memory module, the memory
bandwidth will be low and the conflicts/contentions will be high.
When possible, it is good to partition the data, store partitions on
different memory modules and force each core to access only local
data.
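A minimal sketch of this idea (not from the slides), assuming the usual first-touch page placement policy: the thread that first writes a page gets it allocated on its local memory module. Columns are contiguous in Fortran, so each thread initializes, and later uses, the same block of columns:

program numa_dgemv
integer, parameter :: n = 8192
real(kind(1.d0)), allocatable :: a(:,:), x(:), y(:)
integer :: j
allocate(a(n,n), x(n), y(n))
x = 1.d0
y = 0.d0
!$omp parallel do schedule(static)
do j = 1, n
   a(1:n, j) = 1.d0           ! first touch: these pages are placed near the touching thread
end do
!$omp end parallel do
!$omp parallel do schedule(static) reduction(+:y)
do j = 1, n                   ! same static schedule: same thread, same columns
   y = y + a(1:n, j) * x(j)   ! mostly local memory accesses
end do
!$omp end parallel do
print *, y(1)
end program numa_dgemv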
Iterative refinement is Newton's method applied to f(x) = Ax − b:
x_{k+1} = x_k − f(x_k) / f'(x_k)
Mixed-precision arithmetic
x_0 ← A^{-1} b              O(n^3)
repeat
   r_k ← b − A x_{k-1}      O(n^2)
   z_k ← A^{-1} r_k         O(n^2)
   x_k ← x_{k-1} + z_k      O(n)
until convergence
The expensive O(n^3) factorization can be done in single precision while
the cheap O(n^2) refinement steps recover double precision accuracy.
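A minimal sketch of the scheme (not from the slides), using the LAPACK routines sgetrf/sgetrs for the single precision factorization and solves and BLAS dgemv for the double precision residual; the matrix, right-hand side, iteration limit and tolerance are illustrative:

program mixed_ir
implicit none
integer, parameter :: n = 1000
real(kind(1.e0)) :: as(n,n), zs(n)         ! single precision copies
real(kind(1.d0)) :: a(n,n), b(n), x(n), r(n)
integer :: ipiv(n), info, it
call random_number(a)
call random_number(b)
do it = 1, n
   a(it,it) = a(it,it) + n                 ! make A diagonally dominant (well conditioned)
end do
as = real(a, kind(1.e0))                   ! O(n^3) work done in single precision
call sgetrf(n, n, as, n, ipiv, info)
zs = real(b, kind(1.e0))
call sgetrs('N', n, 1, as, n, ipiv, zs, n, info)
x = dble(zs)                               ! x_0
do it = 1, 50
   r = b                                   ! r_k = b - A x_{k-1}, in double precision
   call dgemv('N', n, n, -1.d0, a, n, x, 1, 1.d0, r, 1)
   if (maxval(abs(r)) <= 1.d-12*maxval(abs(b))) exit
   zs = real(r, kind(1.e0))                ! z_k = A^{-1} r_k, in single precision
   call sgetrs('N', n, 1, as, n, ipiv, zs, n, info)
   x = x + dble(zs)                        ! x_k = x_{k-1} + z_k
end do
end program mixed_ir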
Mixed-precision arithmetic
[Figure: Gflop/s vs. problem size (up to 5000) for the full single precision, mixed precision and full double precision solvers.]
Mixed Precision Iterative Refinement: results
[Figure: Gflop/s vs. problem size (500 to 4000) for the full single precision and mixed precision solvers, compared with the double precision peak.]
Mixed Precision Iterative Refinement: results
[Figure: speedup with respect to the full double precision solver (single/double and mixed precision/double) for six test matrices.]
Appendix: routines for blocked Cholesky
- dpotf2: this LAPACK routine computes the unblocked Cholesky
  factorization of a symmetric positive definite matrix using only
  the lower or upper triangular part of the matrix
- dtrsm: this BLAS routine solves the problem A*X=B where A is a
  lower or upper triangular matrix and B is a matrix containing
  multiple right-hand sides
- dgemm: this BLAS routine performs a product of the type
  C = alpha*A*B + beta*C where alpha and beta are scalars and A, B
  and C are dense matrices
- dsyrk: this BLAS routine performs a symmetric rank-k update of
  the type A = B*B' + alpha*A where alpha is a scalar, A is a
  symmetric matrix and B a rank-k matrix, updating only the upper
  or lower triangular part of A
- dpoup: this routine (not in BLAS nor in LAPACK) calls the dgemm
  or the dsyrk routine to perform the update of an off-diagonal
  block or of a diagonal block, respectively (a possible
  implementation is sketched below)
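A possible implementation of dpoup (a sketch; the interface, with explicit block indices i and j and leading dimension lda, is an assumption made for illustration):

subroutine dpoup(i, j, nb, kb, cij, aik, ajk, lda)
integer :: i, j, nb, kb, lda
real(kind(1.d0)) :: cij(lda,*), aik(lda,*), ajk(lda,*)
if (i == j) then
   ! diagonal block: symmetric rank-kb update, only the lower triangle is touched
   call dsyrk('L', 'N', nb, kb, -1.d0, aik, lda, 1.d0, cij, lda)
else
   ! off-diagonal block: general update cij = cij - aik * ajk^T
   call dgemm('N', 'T', nb, nb, kb, -1.d0, aik, lda, ajk, lda, 1.d0, cij, lda)
end if
end subroutine dpoup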
Reference and examples
The OpenMP reference document can be found at this address:
http://www.openmp.org/mp-documents/OpenMP3.1.pdf
The file contains detailed documentation about all the OpenMP
directives (including those that were not discussed in the lectures)
and many examples.
You are warmly encouraged to study and analyze (at least) the
following examples:
- A.1 parallel loop
- A.5 parallel
- A.12 sections
- A.14 single
- A.15.5 task
- A.15.6 task
- A.18 master
- A.19 critical
- A.21 barrier
- A.22.1 atomic
- A.22.2 atomic
- A.32.1 private
- A.32.2 private
- A.36 reduction