Tufts Cluster Update: Introduction to Slurm
David Lapointe
October 23, 2014
What's New!
• Increased Hardware
  – Cisco
    • UCS compute nodes (1000 cores)
    • Dual 10-core/socket nodes
    • Nodes: alpha/omega
  – IBM Hardware
    • Decommission old HS21 and HS22 hardware (~500 cores)
    • Migrate M3 and M4 compute nodes progressively
    • Oct 15 - moved 1/3 of M3/M4 compute nodes
    • Mid Nov - next move of 1/3 of M3/M4 nodes
    • Mid Dec - last 1/3 moved
• Slurm Scheduler
  – Currently on UCS cluster
  – Replaces LSF as nodes migrate
• Final (12/2014): ~2600 cores, 12 nVidia GPUs, Slurm
More Newness
• [Link]
• [Link]
  – Transfer node: sftp/scp/rsync
  – No ssh login to xfer
• ssh to nodes
  – Only when a node is assigned via Slurm
• Private spaces during login
• Modules can be invoked within scripts
Agenda
Slurm Facts
• Began at Lawrence Livermore National Laboratory (2003)
• TOP500:
  – 6 of the top 10 sites use Slurm
  – Half of the Top500 use Slurm
• Locally, Harvard and MIT have moved to Slurm for their HPC clusters (MGHPCC)
• Developers worldwide contribute to Slurm development
• Open Source / Flexible / Extensible
http://[Link]/
Slurm
• Simple Linux Utility for Resource Management
  – Runs on other OSes also
• Three basic functions
  – Allocate access to resources (nodes)
  – Provide a framework for managing work
    • Start, execute, monitor jobs
  – Arbitrate contention for resources by managing pending work requests
Resource Manager
• Allocate Resources
  – Nodes
    • Sockets
      – Cores
        » Hyperthreads
    • Memory
    • GPU
• Launch and Manage Jobs
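A quick way to see how Slurm views these resources on a particular node is scontrol; a sketch (alpha003 is just one of the node names that appears later in this deck):

$ scontrol show node alpha003   # reports sockets, cores, threads, memory, and GPUs (Gres) for the node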
Scheduler
• When there are more jobs than resources, a scheduler is needed to manage queues/partitions
• Complex algorithm for assigning resources
  – Fair-share, reservations, preemption, prioritization
  – Resource limits (queue, user, group)
  – QOS
• Slurm is both a resource manager and a scheduler
"You have to get in line to ride"
Slurm <-> LSF

Action                         Slurm              LSF
Show running jobs              squeue             bjobs
Submit a batch job             sbatch             bsub
Submit with allocations        salloc             bsub + options + queue
Start an interactive session   srun               bsub -Ip -q int_public6 <app>
List queues/partitions         squeue             bqueues
List nodes                     sinfo              bhosts
Control running jobs           scontrol           bstop
Kill a job                     scancel            bkill
User accounting                sreport or sacct   bacct
Other functions                srun               lsrun, lsplace
Slurm Entities
• Nodes
  – Compute resources
• Partitions
  – Group nodes into defined sets
• Jobs
  – Allocations of resources with durations
• Job steps
  – Sets of tasks within a job, possibly in parallel (see the sketch below)
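As a minimal sketch of jobs vs. job steps (the script body and program names here are hypothetical), each srun inside a batch script launches one job step within the job's allocation:

#!/bin/bash
#SBATCH -N 2 -n 4        # the job: an allocation of 2 nodes / 4 tasks
srun -n 4 ./preprocess   # job step 0: 4 tasks run in parallel
srun -n 4 ./analyze      # job step 1: runs after step 0, inside the same allocation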
Partitions and Jobs
Partitions
• Batch
  – Submit serial programs for execution
  – 'sbatch {options} -o outfile [Link]'
• Interactive
  – Run an interactive program
  – Higher priority than batch, same priority as MPI
  – 'srun --x11=first --pty -c4 matlab'
• MPI (Message Passing Interface)
  – Submit parallel programs for execution
  – 'sbatch {options} -p mpi mpi_script.sh'
  – Higher priority than batch, same priority as Interactive
• Interactive and MPI can preempt batch jobs, but do not preempt each other.
Slurm Partitions
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 1-[Link] 2 idle alpha025,omega025
interactive up [Link] 2 idle alpha001,omega001
batch* up 3-[Link] 46 idle alpha[003-024],omega[001-024]
mpi up 7-[Link] 46 idle alpha[003-024],omega[001-024]
The default partition, marked with an asterisk, is batch. Most jobs will be submitted to the batch partition, in the form of a script, as serially executed jobs. (Examples follow.)

For jobs which use parallel processing, the mpi partition should be used.

The interactive partition is useful for running programs and applications which rely on user interaction or are graphically based. Development work can also be done using the interactive partition.

Applications which use GPUs can be run on the gpu partition. There are currently just 2 GPU nodes in the Cisco system, each with 2 nVidia GPUs.
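As a sketch of a GPU request (this assumes the gpu partition exposes its cards through Slurm's generic resource plugin under the name "gpu", which is not shown in the sinfo output above):

$ srun -p gpu --gres=gpu:1 --pty nvidia-smi   # ask for one GPU on the gpu partition and list it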
Slurm Options
• Most options have two formats: short/long
• The short option uses a single - with a letter
• The long option uses a double -- with a string
• E.g. -p test / --partition test
• Time format: days-hours:minutes:seconds
  • -t 2-[Link] for two days and 10 hours
  • -t [Link] for five minutes
  • -t 5 alternate style for 5 minutes
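For example, the following two submissions are equivalent (myscript.sh is a hypothetical script name):

$ sbatch -p batch -t 5 -c 4 myscript.sh                             # short options
$ sbatch --partition=batch --time=5 --cpus-per-task=4 myscript.sh   # long options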
Slurm
Op4ons
• -‐N
request
nodes
e.g.
–N
5
• -‐n
request
tasks
e.g.
–n
4
• -‐c
request
cores
e.g.
–c
8
• -‐-‐pty
for
interac4ve
• -‐-‐x11=first
add
X11
to
an
interac4ve
session
– Matlab,
mathema4ca,maple,
Rstudio,
etc
• -‐-‐mem
allocates
total
memory
-‐-‐mem=4000
(
value
in
MB
)
• -‐-‐mem-‐per-‐cpu
allocates
memory
per
core
-‐-‐
mem-‐per-‐cpu=2000
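These options combine on a single submission; a sketch (myscript.sh is hypothetical):

$ sbatch -N 1 -n 4 --mem-per-cpu=2000 -t 5 myscript.sh   # 1 node, 4 tasks, 2 GB per core, 5 minutes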
Limits
• JobPerUser: 100 max
• CoresPerUser: 256 max
• RunTimes
  – Batch: 3 days max
  – MPI: 7 days max
  – Interactive: 4 hours max
  – GPU: 1 day max
SINFO
sinfo shows the status of the partitions.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 1-[Link] 2 idle alpha025,omega025
interactive up [Link] 2 idle alpha001,omega001
batch* up 3-[Link] 46 idle alpha[003-024],omega[001-024]
mpi up 7-[Link] 46 idle alpha[003-024],omega[001-024]
Another use of sinfo gives a different breakdown:
$ sinfo -o "%P %C"
PARTITION CPUS(A/I/O/T) #allocated/idle/other/total
gpu 0/80/0/80
interactive 0/80/0/80
batch* 0/1840/0/1840
mpi 0/1840/0/1840
SQUEUE
squeue shows the status of jobs on the cluster.

$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  10615     batch     test dlapoi01  R [Link]     1 alpha002

$ squeue --help
    Displays options for squeue
$ squeue -u my_utln
    Displays job status for user my_utln
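squeue also accepts a custom output format; a sketch using a few of its standard format codes:

$ squeue -u my_utln -o "%i %P %j %t %M %N"   # job id, partition, name, state, elapsed time, node list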
JOB STATE CODES
Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of each state follows.

BF  BOOT_FAIL    Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA  CANCELLED    Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD  COMPLETED    Job has terminated all processes on all nodes.
CF  CONFIGURING  Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG  COMPLETING   Job is in the process of completing. Some processes on some nodes may still be active.
F   FAILED       Job terminated with non-zero exit code or other failure condition.
NF  NODE_FAIL    Job terminated due to failure of one or more allocated nodes.
PD  PENDING      Job is awaiting resource allocation.
PR  PREEMPTED    Job terminated due to preemption.
R   RUNNING      Job currently has an allocation.
S   SUSPENDED    Job has an allocation, but execution has been suspended.
TO  TIMEOUT      Job terminated upon reaching its time limit.
SCANCEL Example
$ cat test_job
#!/bin/bash
echo "I'm on the launching node!"
hostname
echo "These tasks are on the allocated nodes"
srun -l hostname
srun -l sleep 10
echo "All done"
$ sbatch -n2 test_job
Submitted batch job 259
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
259 batch test_job dlapoi01 R 0:10 1 alpha003
$ scancel 259
Or:
$ scancel --signal=TERM 259
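scancel can also select jobs by attribute rather than job id; for example (using the same user and job name as elsewhere in this deck):

$ scancel -u my_utln        # cancel all of my_utln's jobs
$ scancel --name=test_job   # cancel jobs named test_job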
Shell Directives
Shell directives are Slurm options that can be specified in a script, similar to the command line options. Shell directives start with #SBATCH followed by options and values.

Useful options set memory, time, and other resource allocations:
  --mem=500            MB allocated for the run
  --mem-per-cpu=250    MB allocated per core
  -t 5                 Length of time (minutes) allocated to the job
  -N 4                 Number of nodes requested for the job
  -n 16                Number of tasks requested
Example: if -N 1 -n 2 are specified, both tasks will be on the same node.
--mem=4000 allocates 4 GB in total. Alternatively, --mem-per-cpu=4000 with -c2 gives each core 4 GB, for a total of 8 GB.
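A sketch of the two memory styles as shell directives (a common convention is to keep the unused alternative as a "##SBATCH" line, which sbatch does not process):

#!/bin/bash
#SBATCH -c 2
#SBATCH --mem=4000            # 4 GB total for the job, shared by the 2 cores
##SBATCH --mem-per-cpu=2000   # disabled alternative: 2 GB per core x 2 cores = 4 GB total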
To see a complete list of directives, visit the sites on the Resources slide, or use the man pages for sbatch, salloc, and srun.
SALLOC allocates resources
salloc is used to allocate resources (nodes, cores, memory, ...) which can be used repeatedly until exit, which then relinquishes the resources.
Example
$ salloc -N5 sh   # allocate 5 nodes and start a shell locally
salloc: Granted job allocation 255
sh-4.1$ srun hostname
alpha003
alpha004
alpha005
alpha006
alpha007
sh-4.1$ exit
exit
salloc: Relinquishing job allocation 255

$ salloc -N 4 mpirun -nolocal mrbayes-3.2_p/mb [Link]
This second form can be used to run MPI jobs. Be sure to release the allocation using exit when you are finished.
Using SBATCH to run scripts
Submitting scripts using sbatch is the main way to run jobs on the cluster.

sbatch [Link]

All options can be listed in the script, for example:
#!/bin/bash
#SBATCH -c 2                 # request two cores
#SBATCH -p batch             # partition to run job on
#SBATCH --mem=1000           # total memory (MB) request for job, split over cores
#SBATCH -o [Link]
#SBATCH -e [Link]
#SBATCH --mail-type=END      # notification email sent when job finishes (END)
#SBATCH --mail-user=my_utln@[Link]
module load ##### # modules needed for script
{script….}
SBATCH Example
$ cat test_job
#!/bin/bash
echo "I'm on the launching node!"
hostname
echo "These tasks are on the allocated nodes"
srun -l hostname
srun -l sleep 10
echo "All done"
$ sbatch -n2 test_job
Submitted batch job 259
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
259 batch test_job dlapoi01 R 0:10 1 alpha003
$ cat [Link]
I'm on the launching node!
alpha003
These tasks are on the allocated nodes
0: alpha003
1: alpha003
All done
SBATCH: Wrap
Perhaps you have something like this:

prog1 params | prog2 - params2 | prog3 > output

You could write a batch script for this, but there is an easier option: --wrap=<command string>.

sbatch will wrap the specified command string in a simple "sh" shell script and submit that script to the slurm controller. When --wrap is used, a script name and arguments may not be specified on the command line; instead the sbatch-generated wrapper script is used. You might want to salloc resources first.

sbatch --wrap="prog1 params | prog2 - params2 | prog3 > output"

Don't do this with srun: the pipes won't work.

srun prog1 params | prog2 - params2 | prog3 > output
#!/bin/bash
## [Link] script
#SBATCH -c 10
#SBATCH --mem 16000
#SBATCH -p batch
#SBATCH -o [Link]
#SBATCH -e [Link]
#SBATCH --mail-type=END
#SBATCH --mail-user=[Link]@tufts.edu

module load bwa/0.7.9a
module load samtools/0.1.19

fnames=('DN3_ROI/DN3' 'HH14_ROI/HH14')
export MM10=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/BWAIndex

for i in "${fnames[@]}"; do
    SORTED=$i"_sorted"
    bwa mem -t 10 $MM10/[Link] $i.fasta > $[Link]
    samtools view -b -S $[Link] > $[Link]
    samtools sort $[Link] $SORTED
    samtools index $[Link] $[Link]
done

$ sbatch [Link]   # will run one set, then the other
SRUN
srun allows running a job on a specified partition. Unless specified, the default partition is used. srun can be used within salloc, as part of a script sent to sbatch, or by itself for an interactive session.
$ srun -l -N3 hostname
1: alpha004
0: alpha005
2: alpha003
srun is useful for running interactive jobs, where applications are graphically based, or console tools like R.
$ salloc --mem=8000 -c10
salloc: Granted job allocation 258
srun --x11=first --pty -p interactive matlab

Or:
$ srun --mem=8000 -c10 --x11=first --pty -p interactive matlab
MPI (Message Passing Interface)
MPI runs should use the mpi partition, with salloc, sbatch, or both. MPI does not work well using srun.

$ salloc -N 4 -p mpi sh   # allocates four nodes, the mpi partition, and a local shell
salloc: Granted job allocation 562
sh-4.1$ module load openmpi/1.8.2
sh-4.1$ mpirun hello_world
Hello world from processor alpha003, rank 1 out of 4 processors
Hello world from processor alpha005, rank 3 out of 4 processors
Hello world from processor alpha002, rank 0 out of 4 processors
Hello world from processor alpha004, rank 2 out of 4 processors
sh-4.1$ exit
exit
salloc: Relinquishing job allocation 562
salloc: Job allocation 562 has been revoked.
MPI (Message Passing Interface)
MPI runs should use the mpi partition, with salloc, sbatch, or both. Load the openmpi/1.8.2 module first!

$ module load openmpi/1.8.2
$ salloc -N 4 -p mpi mpirun hello_world
salloc: Granted job allocation 560
Hello world from processor alpha003, rank 1 out of 4 processors
Hello world from processor alpha005, rank 3 out of 4 processors
Hello world from processor alpha002, rank 0 out of 4 processors
Hello world from processor alpha004, rank 2 out of 4 processors
salloc: Relinquishing job allocation 560
salloc: Job allocation 560 has been revoked.
MPI (Message Passing Interface)
MPI runs should use the mpi partition, with salloc, sbatch, or both.

$ sbatch -N 4 -p mpi hello_test.sh
Submitted batch job 556
[dlapoi01@login001 mpi_hello_world]$ more slurm-[Link]
Hello world from processor alpha002, rank 0 out of 4 processors
Hello world from processor alpha004, rank 2 out of 4 processors
Hello world from processor alpha003, rank 1 out of 4 processors
Hello world from processor alpha005, rank 3 out of 4 processors
Job Arrays
• Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily
• All jobs must have the same initial options (e.g. size, time limit, etc.)
• These can be altered after the job has begun
• Job arrays are limited to batch job submissions
• This groups jobs together as one submission (job id)
Job Array Example

$ sbatch --array=1-31 arrtempl
# submits an array of 31 jobs to one node

#!/bin/bash
#SBATCH -J arrtempl          # job name
#SBATCH -o arrtempl_%A_%[Link]
#SBATCH -e arrtempl_%A_%[Link]
#SBATCH -N 1
#SBATCH -p batch
#SBATCH -t [Link]            # four hours run time
#SBATCH --mem=4000

arrprog data$SLURM_ARRAY_TASK_ID.dat

This will submit 31 jobs that use [Link] .. [Link] as input data, and produce 31 output and error files named arrtempl_jobid_{1..31}.{out,err}.
%A is the job id; %a is the array job index (1-31).
$SLURM_ARRAY_TASK_ID is an environment variable which contains the job array index and can be used in the script.
Job Dependencies
Under some circumstances jobs submitted serially need to wait for a previous job to finish, for example in a workflow pipeline. The --dependency option can be used to manage these cases.

$ sbatch [Link]
Submitted batch job 978354
$ sbatch --dependency=afterok:978354 annotate_hits.sh
Submitted batch job 978357

annotate_hits.sh is queued but won't run unless [Link] is successful (afterok:978354). There are other parameters that can be used with --dependency, and multiple jobids can be specified. See man sbatch and look for --dependency.
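For instance (the job ids and script name here are just placeholders), afterany waits for the listed jobs to end regardless of their exit status, and several job ids can be chained with colons:

$ sbatch --dependency=afterany:978354:978357 cleanup.sh   # runs after both jobs finish, whatever their status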
Resources
• [Link]/cluster - documentation and examples for Tufts HPC
  – Look under slurm for
    • This presentation
    • Command equivalents (Slurm <-> LSF options)
• [Link] - Main site
  – Videos, tutorials, FAQ, documentation
• Man pages on the cluster
  – man {slurm|srun|sbatch|salloc|..}
  – man -k slurm   # list of all man pages for slurm