HPC Applications and Parallel Programming
with Applications in R
Florian Schwendinger, Gregor Kastner, Stefan Theußl
October 4, 2021
Outline
Four parts:
I Introduction
I Computer Architecture
I The parallel Package
I (Cloud Computing)
Part I
Introduction
About this Course
High performance computing (HPC) refers to the use of (parallel) supercomputers and
computer clusters. Furthermore, HPC is a branch of computer science that
concentrates on developing high performance computers and software to run on
these computers.
Parallel computing is an important area of this discipline. It refers to the development of
parallel processing algorithms and software.
Parallelism is the physically simultaneous execution of multiple threads or processes with the
objective of increasing performance (this implies multiple processing elements).
[Figure: transistor counts of microprocessors, from the Intel 4004 (1971) up to the 32-core SPARC M7 and 22-core Xeon Broadwell-E5, plotted against their date of introduction (1971 to 2017) on a logarithmic scale.]
[Figure: benchmark of a serial ("normal") and an MPI-based version of an application, plotted against the number of CPUs (2 to 10).]
The best and easiest way to analyze the performance of an application is to measure its
execution time. An application can then be compared with an improved version of itself by
comparing the two execution times:
\[ \text{Speedup} = \frac{t_s}{t_e} \tag{1} \]
where
t_s denotes the execution time of the program without enhancements (serial version),
t_e denotes the execution time of the program using the enhancements (enhanced version).
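In R, elapsed execution times can be measured with system.time(); the following minimal sketch computes the speedup of Equation (1) for two hypothetical versions of the same computation (slow_version() and fast_version() are placeholder names of our own):

slow_version <- function() for (i in 1:5e6) sqrt(i)   # unenhanced (serial) version
fast_version <- function() sqrt(1:5e6)                # enhanced (vectorized) version

t_s <- system.time(slow_version())["elapsed"]
t_e <- system.time(fast_version())["elapsed"]
t_s / t_e                                             # speedup as in Equation (1)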
I A simple model says intuitively that a computation runs p times faster when split over p
processors.
I More realistically, a problem has a fraction f of its computation that can be parallelized;
the remaining fraction 1 − f is inherently sequential.
I Amdahl’s law:
\[ \text{Maximum Speedup} = \frac{1}{f/p + (1 - f)} \]
I Problems with f = 1 are called embarrassingly parallel.
I Some problems are (or seem to be) embarrassingly parallel: computing column means,
bootstrapping, etc.
[Figure: maximum speedup according to Amdahl's law versus the number of processors (1 to 10), for parallel fractions f = 1, 0.9, 0.75, and 0.5.]
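The curves in this figure follow directly from Amdahl's law; a minimal R sketch (the helper function amdahl() is our own):

amdahl <- function(p, f) 1 / (f / p + (1 - f))   # maximum speedup on p processors

p <- 1:10
plot(p, amdahl(p, f = 1), type = "b", ylim = c(1, 10),
     xlab = "number of processors", ylab = "speedup")
for (f in c(0.9, 0.75, 0.5)) lines(p, amdahl(p, f), type = "b")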
[Figure: perspective plot of a bivariate test function z = f(x, y) over x, y in [−2, 2].]
Require: option characteristics (S, X, T), the volatility σ, the risk-free yield r, the number of simulations n
1: for i = 1 : n do
2:   generate Z_i
3:   \( S_T^i = S(0)\, e^{(r - \frac{1}{2}\sigma^2) T + \sigma \sqrt{T} Z_i} \)
4:   \( C_T^i = e^{-rT} \max(S_T^i - X, 0) \)
5: end for
6: \( \hat{C}_T^n = \frac{C_T^1 + \dots + C_T^n}{n} \)
The number of trials carried out depends on the accuracy required. If n independent
simulations are run, the standard error of the estimate \( \hat{C}_T^n \) of the payoff is
\[ \frac{s}{\sqrt{n}}, \]
where s is the (estimated) standard deviation of the discounted payoff given by the simulation.
According to the central limit theorem, a 95% confidence interval for the “true” price of the
derivative is given asymptotically by
\[ \hat{C}_T^n \pm \frac{1.96\, s}{\sqrt{n}}. \]
The accuracy of the simulation is therefore inversely proportional to the square root of the
number of trials n: to double the accuracy, the number of trials has to be quadrupled.
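A serial R implementation of this estimator takes only a few lines; a minimal sketch (the function name mc_price is our own, the parameter names follow the MC_sim_par() call shown later):

mc_price <- function(S, X, T, r, sigma, n) {
  Z   <- rnorm(n)                                               # step 2: generate Z_i
  S_T <- S * exp((r - sigma^2 / 2) * T + sigma * sqrt(T) * Z)   # step 3: terminal prices
  C_T <- exp(-r * T) * pmax(S_T - X, 0)                         # step 4: discounted payoffs
  est <- mean(C_T)                                              # step 6: MC estimate
  se  <- sd(C_T) / sqrt(n)                                      # standard error s / sqrt(n)
  c(price = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
}
mc_price(S = 120, X = 130, T = 1, r = 0.05, sigma = 0.2, n = 1e6)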
I “Pseudo” parallelism by simply starting the same program with different parameters
several times.
I Implicit parallelism, e.g., via parallelizing compilers, or built-in support of packages.
I Explicit parallelism with implicit decomposition.
I Parallelism is easy to achieve using compiler directives (e.g., OpenMP).
I Sequential code can be parallelized incrementally.
I Explicit parallelism, e.g., with message passing libraries.
I R packages porting the APIs of such libraries can be used.
I Developing parallel programs is difficult, but the resulting programs deliver good performance.
Shared Memory Systems (SMS) host multiple processors which share one global main memory
(RAM), e.g., multi-core systems.
Distributed Memory Systems (DMS) consist of several units connected via an interconnection
network. Each unit has its own processor with its own memory.
DMS include:
I Beowulf Clusters are scalable performance clusters based on commodity hardware, on a
private system network, with open source software (Linux) infrastructure (e.g.,
cluster@WU).
I The Grid connects participating computers via the Internet (or other wide area networks)
to reach a common goal. Grids are more loosely coupled, heterogeneous, and
geographically dispersed.
The Cloud or cloud computing is a model for enabling convenient, on-demand network access
to a shared pool of configurable computing resources.
Processes are executions of lists of statements (i.e., of sequential programs). Processes have
their own state information, use their own address space, and interact with other
processes only via an interprocess communication mechanism, generally managed
by the operating system. (A master process may spawn subprocesses which are
logically separated from the master process.)
Threads are typically spawned from processes for a short time to achieve a certain task
and then terminate (fork/join principle). Within a process, threads share the
same state and the same memory space, and can communicate with each other
directly through shared variables.
Parallel computing involves splitting work among several processors. Shared memory parallel
computing typically has
I single process,
I single address space,
I multiple threads or light-weight processes,
I all data is shared,
I access to key data needs to be synchronized.
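In R this model is most easily approximated with forked workers; a minimal sketch using parallel::mclapply() (fork-based, hence not available on Windows; the data x is our own example):

library(parallel)
x <- rnorm(1e6)                        # data created once in the master process
## The forked workers inherit 'x'; no explicit export or copying is required.
mclapply(1:4, function(i) mean(x) + i, mc.cores = 4)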
Login server
2 Quad Core Intel Xeon X5550 @ 2.67 GHz
24 GB RAM
File server
2 Quad Core Intel Xeon X5550 @ 2.67 GHz
24 GB RAM
node.q – 44 nodes
2 Six Core Intel Xeon X5670 @ 2.93 GHz
24 GB RAM
This is a total of 528 64-bit computation cores (544 including the login and file servers) and
more than 1 terabyte of RAM.
I Sockets: everything is managed by R, thus “easy”. Socket connections run over TCP/IP and
are therefore usable on almost any system. Advantage: no additional software required.
I Message Passing Interface (MPI): essentially the definition of a communication protocol.
Several implementations exist; Open MPI (see http://www.open-mpi.org/) is the most
common and widely used.
I Parallel Virtual Machine (PVM): nowadays obsolete.
I NetWorkSpaces (NWS): is a framework for coordinating programs written in scripting
languages.
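Since socket clusters need nothing beyond R itself, they are the quickest way to try this out; a minimal sketch:

library(parallel)
cl <- makeCluster(4, type = "PSOCK")                   # 4 workers connected via TCP/IP
clusterCall(cl, function() Sys.info()[["nodename"]])   # where does each worker run?
stopCluster(cl)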
I Debian GNU/Linux
I Compiler Collections
I GNU 4.4.7 (gcc, g++, gfortran, ...), [g]
I R, some packages from CRAN
I R-g latest R-patched compiled with [g]
I R-<g>-<date> R-devel compiled at <date>
I Linear algebra libraries (BLAS, LAPACK, INTEL MKL)
I OpenMPI, PVM and friends
I various editors (emacs, vi, nano, etc.)
Son of Grid Engine (SGE) is an open source cluster resource management and scheduling
software. It is used to run cluster jobs which are user requests for resources (i.e., actual
computing instances or nodes) available in a cluster/grid.
In general, SGE has to match the available resources to the requests of the grid users. SGE
is responsible for
I accepting jobs from the outside world,
I delaying a job until it can be run,
I sending jobs from the holding area to an execution device (node),
I managing running jobs,
I logging jobs.
Useful SGE commands:
I qsub submits a job.
I qstat shows statistics of jobs running on cluster@WU.
I qdel deletes a job.
I sns shows the status of all nodes in the cluster.
Submitting SGE Jobs
1. Login,
2. create a plain text file (e.g., ’myJob.qsub’) with the job description, containing e.g.:
#!/bin/bash
## This is my first cluster job.
#$ -N MyJob
R-g --version
sleep 10
3. then type qsub myJob.qsub and hit enter.
4. Output files are provided as ’<jobname>.o<jobid>’ (standard output) and
’<jobname>.e<jobid>’ (error output), respectively.
An SGE job typically begins with commands to the grid engine. These commands are prefixed
with #$.
E.g., the following arguments can be passed to the grid engine:
-N specifying the actual jobname
-q selecting one of the available queues. Defaults to node.q.
-pe [type] [n] sets up a parallel environment of type [type] reserving [n] cores.
-t <first>-<last>:<stepsize> creates a job array (e.g., -t 1-20:1)
-o [path] redirects stdout to path
-e [path] redirects stderr to path
-j y[es] merges stdout and stderr into one file
For an extensive listing of all available arguments type qsub -help into your terminal.
I We want our processes to send us the following message: "Hello World from processor
<ID>", where <ID> is the processor ID (or rank, in MPI terminology).
I MPI uses the master-worker paradigm; thus a master process is responsible for starting
(spawning) worker processes.
I In R we can utilize the MPI library via package Rmpi.
library(Rmpi)                            # load the MPI bindings for R
mpi.is.master()                          # TRUE on the master process
mpi.get.processor.name()                 # host name of the current processor
mpi.spawn.Rslaves(nslaves = slots)       # spawn 'slots' worker processes (slots defined beforehand)
mpi.remote.exec(mpi.comm.rank())         # query the rank of each worker
hello <- function()                      # the message each worker should send
  paste("Hello World from processor", mpi.comm.rank())
mpi.bcast.Robj2slave(hello)              # ship hello() to all workers
mpi.remote.exec(hello())                 # run hello() on every worker
mpi.close.Rslaves()                      # shut the workers down again
I MPI is used via package Rmpi.
Shared memory
Function              Description                              Example
detectCores           detect the number of CPU cores           ncores <- detectCores()
mclapply              parallelized version of lapply           mclapply(1:5, runif, mc.cores = ncores)

Distributed memory
Function              Description                              Example
makeCluster           start the cluster (1)                    cl <- makeCluster(10, type = "MPI")
clusterSetRNGStream   set the RNG seed on the cluster          clusterSetRNGStream(cl, 321)
clusterExport         export variables (by name) to workers    a <- 1:10; clusterExport(cl, "a")
clusterEvalQ          evaluate expressions on the workers      clusterEvalQ(cl, {x <- 1:3; myFun <- function(x) runif(x)})
clusterCall           call a function on all workers           clusterCall(cl, function(y) 3 + y, 2)
parLapply             parallelized version of lapply           parLapply(cl, 1:100, Sys.sleep)
parLapplyLB           parLapply with load balancing            parLapplyLB(cl, 1:100, Sys.sleep)
stopCluster           stop the cluster                         stopCluster(cl)

(1) Allowed cluster types are PSOCK, FORK, SOCK, MPI, and NWS.
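A typical distributed-memory workflow combines these functions as follows; a minimal sketch using a socket cluster (the data and the toy function are our own):

library(parallel)
cl <- makeCluster(4, type = "PSOCK")       # start 4 socket workers
clusterSetRNGStream(cl, 321)               # reproducible parallel RNG streams
a <- 1:10
clusterExport(cl, "a")                     # make 'a' available on every worker
parLapply(cl, 1:8, function(i) sum(a) + runif(1) * i)
stopCluster(cl)                            # always stop the cluster again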
A Simple Multicore Example
Sequential:
fun <- function(x) {
  3 * (1 - x[1])^2 * exp(-x[1]^2 - (x[2] + 1)^2) -
    10 * (x[1]/5 - x[1]^3 - x[2]^5) * exp(-x[1]^2 - x[2]^2) -
    1/3 * exp(-(x[1] + 1)^2 - x[2]^2)
}
start <- list(c(0, 0), c(-1, -1), c(0, -1), c(0, 1))
seqt <- system.time(
  sol <- lapply(start, function(par)
    optim(par, fun, method = "Nelder-Mead", lower = -Inf, upper = Inf,
          control = list(maxit = 1000000, beta = 0.01, reltol = 1e-15)))
)["elapsed"]
seqt
Parallel:
require(parallel)
ncores <- detectCores()
part <- system.time(
  sol <- mclapply(start, function(par)
    optim(par, fun, method = "Nelder-Mead", lower = -Inf, upper = Inf,
          control = list(maxit = 1000000, beta = 0.01, reltol = 1e-15)),
    mc.cores = ncores)
)["elapsed"]
I You need to be careful when generating pseudo random numbers in parallel, especially if
you want the streams to be independent and reproducible.
I Identical streams on each node are likely, but not guaranteed.
I Parallel PRNGs usually have to be set up by the user. E.g., via clusterSetRNGStream()
in package parallel.
I The source file ’snow pprng.R’ shows how to use such parallel PRNGs.
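A minimal sketch of such a setup with package parallel (using a socket cluster of our choosing):

library(parallel)
cl <- makeCluster(4, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 123)         # independent, reproducible L'Ecuyer-CMRG streams
parSapply(cl, 1:4, function(i) runif(1))     # differs across workers, identical on every run of the script
stopCluster(cl)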
#$ -N SNOW_MC
#$ -pe mpi 4
Note: to run this example package snow has to be installed since no functionality to start MPI
clusters is provided with package parallel.
source("HPC_course.R")
## start MPI cluster and retrieve the nodes we are working on.
cl <- snow::makeMPIcluster(slots)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
## setup PRNG
clusterSetRNGStream(cl, iseed = 123)
price <- MC_sim_par(cl, sigma = 0.2, S = 120, T = 1, X = 130, r = 0.05,
                    n_per_node = sim_per_slot, nodes = slots)
price
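Once the simulation is finished, the cluster should be shut down again before the job exits; a minimal addition:

snow::stopCluster(cl)   # shut down the MPI workers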
I R-core (2013)
I Mahdi (2014)
I Jing (2010)
I McCallum and Weston (2011)
I Private cloud. The cloud infrastructure is operated solely for an organization (e.g.,
wu.cloud).
I Community cloud. The cloud infrastructure is shared by several organizations and
supports a specific community that has shared concerns.
I Public cloud. The cloud infrastructure is made available to the general public or a large
industry group and is owned by an organization selling cloud services (e.g., Amazon EC2).
I Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private,
community, or public) bound together by standardized or proprietary technology.
wu.cloud is a private cloud service and thus the following characteristics hold.
I Emulate public cloud on (existing) private resources,
I thus, provides benefits of clouds (elasticity, dynamic provisioning, multi-OS/arch
operation, etc.),
I while maintaining control of resources.
I Moreover, there is always the option to scale out to the public cloud (going hybrid).
wu.cloud is
I solely operated for WU members and projects,
I thus, network access only via Intranet/VPN (https://vpn.wu.ac.at),
I on-demand self-service,
I resource pooling via virtualization,
I extensible/elastic,
I Infrastructure as a Service (IaaS),
I Platform as a Service (PaaS).
I wu.cloud is a private cloud system based on the open source software package
Eucalyptus (see http://open.eucalyptus.com/).
I Accessible via http://cloud.wu.ac.at/.
I Consists of a frontend (website, management software) and a backend (providing
resources) system.
Backend system:
I 2x IBM X3850 X5
I 8x8 (64) core Intel Xeon CPUs 2.26 GHz
I 1 TB RAM
I EMC2 Storage Area Network: 7 TB fast + 4 TB slow disks
I SUSE Linux Enterprise Server 11 SP1
I Xen 4.0.1
I Eucalyptus backend components (cluster, storage, and node controller)
[Image: IBM X3850 X5 server, © 2010 IBM Corporation, Datasheet XSD03054-USEN-05]
Frontend System:
I Virtual (Xen) instance
I Apache Webserver
I Eucalyptus frontend components (cloud controller, walrus)
[Figure: available machine images on wu.cloud, e.g., R dev environment, R/Mathematica/Matlab, Matlab/PASW/Stata, Windows base system, GUI-based and customized systems.]
Florian Schwendinger
Institute for Statistics and Mathematics
email: [email protected]
URL: http://www.wu.ac.at/statmath/faculty_staff/faculty/fschwendinger
WU Vienna
Welthandelsplatz 1/D4/level 4
1020 Wien
Austria
L. Jing. Parallel Computing with R and How to Use it on High Performance Computing Cluster, 2010. URL
http://datamining.dongguk.ac.kr/R/paraCompR.pdf.
E. Kontoghiorghes, editor. Handbook of Parallel Computing and Statistics. Chapman & Hall, 2006.
E. Mahdi. A survey of R software for parallel computing. American Journal of Applied Mathematics and Statistics, 2(4):
224–230, 2014. ISSN 2333-4576. doi: 10.12691/ajams-2-4-9. URL http://pubs.sciepub.com/ajams/2/4/9.
Q. E. McCallum and S. Weston. Parallel R. O’Reilly Media, Inc., 2011. ISBN 1449309925, 9781449309923.
R-core. Package ’parallel’, 2013. URL
https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf.
https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.R.
A. Rossini, L. Tierney, and N. Li. Simple Parallel Statistical Computing in R. UW Biostatistics Working Paper Series,
(Working Paper 193), 2003. URL http://www.bepress.com/uwbiostat/paper193.
M. Schmidberger, M. Morgan, D. Eddelbuettel, H. Yu, L. Tierney, and U. Mansmann. State of the art in parallel
computing with R. Journal of Statistical Software, 31(1):1–27, 2009. ISSN 1548-7660. URL
http://www.jstatsoft.org/v31/i01.
S. Theußl. Applied high performance computing using R. Master’s thesis, WU Wirtschaftsuniversität Wien, 2007. URL
http://statmath.wu-wien.ac.at/~theussl/publications/thesis/Applied_HPC_Using_R-Theussl_2007.pdf.