New HPC Introduction
HTCondor
Kelley Ross
Covered In This Tutorial
• How to get access to the ISYE Cluster
• How to login to the cluster
• How to get your files to the ISYE Cluster
• How to access the Cluster
• What is HTCondor?
• Set up and Run a Job with HTCondor
• Submit Multiple Jobs with HTCondor
• How HTCondor Matches and Runs Jobs
• Testing and Troubleshooting
• Use Cases and HTCondor Features
• Automation
Introduction
Getting access to the cluster
• First, sign up for an ISyE account:
https://www.isye.gatech.edu/about/school/computing
• Get a terminal app:
• Windows:
• PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/ (careful if you Google!)
• SecureCRT: https://software.oit.gatech.edu
• Windows Subsystem for Linux: https://docs.microsoft.com/en-us/windows/wsl/install-win10
• Mac:
• Terminal.app (built-in)
• iTerm2: https://www.iterm2.com
• Linux: It’s built in!
• iOS: Prompt: https://panic.com/prompt
• Android: lots of options https://www.dunebook.com/10-best-android-terminal-emulator/
Getting your files on the cluster
File access via SCP/SFTP:
• Windows:
• WinSCP: https://winscp.net
• FileZilla: https://filezilla-project.org/
• Mac:
• Cyberduck: https://cyberduck.io
• FileZilla: https://filezilla-project.org
• Transmit: https://panic.com/transmit (not free)
• Linux:
• scp at the command line; GUI varies by distro
• GitHub
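For command-line transfers, scp works the same way on all of these platforms; a minimal sketch (the hostname and remote path are placeholders, not the actual ISyE host):
$ scp job.submit lp.py username@<isye-login-host>:~/jobs/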
How to login to the ISYE Cluster
Log in over SSH from any of the terminal apps above (see the sketch after the software list below). Software available on the cluster includes:
• Cplex
• Python3
• Gams
• RH
• Gurobi
• xpressmp
• Julia
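The login command itself is not shown on the extracted slide; access is over SSH from any of the terminal apps above, sketched here with a placeholder hostname:
$ ssh username@<isye-login-host>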
HTCondor Capabilities
• Even if your system has default CPU, memory, and disk requests, these may be too small!
• It is important to run test jobs and use the log file to request the right amount of resources (see the sketch after this list).
• Requesting too little causes problems for your jobs and others'; jobs might be held by HTCondor. Requesting too much means jobs will match to fewer "slots".
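A sketch of explicit resource requests in a submit file; the values here are illustrative, not recommendations; tune them to the usage your job's log file reports:
request_cpus = 1
request_memory = 2GB
request_disk = 1GB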
What is HTCondor?
• Software that schedules and runs computing tasks on computers
How It Works
[Diagram: tasks are placed in a queue on one computer and dispatched to execute on others.]
HTCondor on Many Computers
[Diagram: an Access Point holds the queue of tasks and dispatches them to multiple execute points.]
Why HTCondor?
An example submit file:
getenv = true
executable = lp.py
arguments = test0.lp
log = test.log
output = test.out
error = test.error
notification = error
notification = complete   # only the last notification line takes effect
request_memory = 1024
request_cpus = 1
queue
Submitting and Monitoring Jobs
• To submit a job/jobs: condor_submit submit_file_name (HTCondor Manual: condor_submit)
• To monitor submitted jobs, use: condor_q (HTCondor Manual: condor_q)
$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 128.
$ condor_q
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/01/19 10:35:54
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
alice ID: 128 5/9 11:09 _ _ 1 1 128.0
JobId = ClusterId.ProcId
More about condor_q
Access Point (submit_dir)/ contains: job.submit, compare_states, wi.dat, us.dat, job.log
Job Starts
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
128.0 alice 5/9 11:09 0+00:00:00 < 0 0.0 compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
[Diagram: the Access Point's (submit_dir)/ holds job.submit, compare_states, wi.dat, us.dat, job.log; the execute point's working directory holds compare_states, wi.dat, us.dat, stderr, stdout, wi.dat.out.]
Job Completes
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
128.0 alice 5/9 11:09 0+00:02:02 > 0 0.0 compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
[Diagram: output files stderr, stdout, and wi.dat.out are transferred back from the execute point to (submit_dir)/ on the Access Point.]
Job Completes (cont.)
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Access Point (submit_dir)/ now contains: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err, wi.dat.out
Log File
000 (7195807.000.000) 05/19 14:30:18 Job submitted from host: <128.105.244.191:9618 ...>
...
040 (7195807.000.000) 05/19 14:31:55 Started transferring input files
    Transferring to host: <128.105.245.85:9618 ...>
...
040 (7195807.000.000) 05/19 14:31:55 Finished transferring input files
...
001 (7195807.000.000) 05/19 14:31:56 Job executing on host: <128.105.245.85:9618 ...>
...
005 (7195807.000.000) 05/19 14:35:56 Job terminated.
    (1) Normal termination (return value 0)
...
[Diagram: your resource request occupies a slice of a whole computer.]
Numbered Input Files
job.submit:
universe = vanilla
getenv = true
executable = lp.py
arguments = file$(ProcId).lp
log = job.log
output = job.out
error = job.err
queue 3
(submit_dir)/ contains: job.submit, lp.py, file0.lp, file1.lp, file2.lp
Automatic Variables
• Each job's ClusterId and ProcId can be accessed inside the submit file using $(ClusterId) and $(ProcId).
• With queue N, all N jobs share one ClusterId and get sequential ProcIds:
ClusterId  ProcId
128        0
128        1
128        2
...        ...
Job Variation
queue 3
Submitting Multiple Jobs Using $(ProcId)
universe = vanilla
getenv = true
executable = $ENV(HOME)/gurobi/lp.py
output = $ENV(HOME)/gurobi/$(Cluster).$(ProcId).out
error = $ENV(HOME)/gurobi/$(Cluster).$(ProcId).error
notification = error
notification = complete   # only the last notification line takes effect
request_memory = 1048
request_cpus = 1
queue 3
Submit and Monitor (review)
$ condor_q
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/09/19 10:35:54
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
alice ID: 128 5/9 11:03 _ _ 3 3 128.0-2
Submitting Multiple Jobs Using Queue Statements
universe = vanilla
getenv = true
executable = lp.py
arguments = $(file)
log = $ENV(HOME)/gurobi/$(Cluster).log
output = $ENV(HOME)/gurobi/$(Cluster).$(Process).out
error = $ENV(HOME)/gurobi/$(Cluster).$(Process).error
notification = error
notification = complete   # only the last notification line takes effect
request_memory = 1048
request_cpus = 1
$ condor_q
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
alice CoolJobs 5/9 11:03 _ _ 3 3 128.0-2
Organizing Jobs
Use Sub-Directories for File Type
One Job per Directory
Separate Jobs with InitialDir
(submit_dir)/ contains: job.submit, analyze.exe, and directories job0/, job1/, job2/, each holding file.in, job.log, job.err, file.out
job.submit:
executable = analyze.exe
initialdir = job$(ProcId)
arguments = file.in file.out
transfer_input_files = file.in
log = job.log
error = job.err
queue 3
Note: the executable should be in the directory with the submit file, *not* in the individual job directories.
Other Submission Methods
Submitting Multiple Jobs
From one job:
transfer_input_files = us.dat, wi.dat
queue 1
…to many jobs, with a variable:
executable = compare_states
arguments = $(infile) us.dat $(infile).out
queue ...
Possible Queue Statements
• matching a pattern: queue infile matching *.dat
• in a list: queue infile in (wi.dat ca.dat ia.dat)
Queue Statement Comparison
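The comparison table did not survive extraction; as a sketch, the common queue forms are:
queue N                               # N copies, ProcId 0 to N-1
queue infile matching *.dat           # one job per file matching a pattern
queue infile in (wi.dat ca.dat)       # one job per listed item
queue file,option from job_list.txt   # one job per line of a file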
Using Multiple Variables
job.submit:
executable = compare_states
arguments = -y $(option) -i $(file)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = $(file)
queue file,option from job_list.txt
job_list.txt:
wi.dat, 2010
wi.dat, 2015
ca.dat, 2010
ca.dat, 2015
ia.dat, 2010
ia.dat, 2015
Other Features
(60 second) Pause
Questions so far?
Job Matching and Class Ad Attributes
The Central Manager
[Diagram: a central manager sits between the access point and the execute points.]
Class Ads
• Class ad attribute values can be a boolean, number, or string.
Job Class Ad
job.submit:
executable = compare_states
arguments = wi.dat us.dat wi.dat.out
should_transfer_files = YES
transfer_input_files = us.dat, wi.dat
when_to_transfer_output = ON_EXIT
log = job.log
output = job.out
error = job.err
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
queue 1
+ HTCondor configuration
=> resulting Job Class Ad:
RequestCpus = 1
Err = "job.err"
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/home/alice/compare_states"
JobUniverse = 5
Iwd = "/home/alice/tests/htcondor_week"
RequestDisk = 20480
NumJobStarts = 0
TransferInput = "us.dat,wi.dat"
Out = "job.out"
UserLog = "/home/alice/job.log"
RequestMemory = 20
...
Computer “Machine” Class Ad
HasFileTransfer = true
DynamicSlot = true
TotalSlotDisk = 4300218.0
TargetType = "Job"
TotalSlotMemory = 2048
Mips = 17902
Job Matching
[Diagram: the central manager matches jobs on the access point to available execute points.]
Job Execution
[Diagram: the matched job runs on an execute point and reports back to the access point.]
Class Ads for People
See later talk: What Are My Jobs Doing?
Finding Job Attributes
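The slide's example did not survive extraction; a job's complete class ad can be viewed with condor_q -l:
$ condor_q -l 128.0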
Selecting Job Attributes
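A sketch of printing only selected attributes with condor_q's autoformat (-af) option:
$ condor_q 128.0 -af Owner RequestMemory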
Other Displays
$ condor_q -all
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
alice DAG: 128 5/9 02:52 982 2 _ _ 1000 18888976.0 ...
bob DAG: 139 5/9 09:21 _ 1 89 _ 180 18910071.0 ...
alice DAG: 219 5/9 10:31 1 997 2 _ 1000 18911030.0 ...
bob DAG: 226 5/9 10:51 10 _ 1 _ 44 18913051.0
bob CMD: ce.sh 5/9 10:55 _ _ _ 2 _ 18913029.0 ...
alice CMD: sb 5/9 10:57 _ 2 998 _ _ 18913030.0-999
Class Ads for Computers
Find Machine Attributes
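The slide's example did not survive extraction; a machine's complete class ad can be viewed with condor_status -l (the slot name here is a placeholder):
$ condor_status -l slot1@<execute-host>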
Useful Machine Attributes
Display Machine Attributes
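A sketch of displaying selected machine attributes with condor_status's autoformat (-af) option:
$ condor_status -af Name Memory Cpus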
Machine Attributes
Testing and Troubleshooting
What Can Go Wrong?
“Live” Troubleshooting
$ condor_ssh_to_job 128.0
Welcome to [email protected]!
Your condor job is running with pid(s) 3954839.
Reviewing Failed Jobs
Reviewing Recent Jobs
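The slide content is missing; jobs that have left the queue can be reviewed with condor_history, for example:
$ condor_history alice -limit 5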
Held Jobs
Diagnosing Holds
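The slide content is missing; condor_q's -hold option lists held jobs along with the hold reason:
$ condor_q -hold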
Fixing Holds
• Job attributes can be edited while jobs are in the queue using: condor_qedit [U/C/J] Attribute Value
$ condor_qedit 128.0 RequestMemory 3072
Set attribute "RequestMemory".
• If a job has been fixed and can run again, release it with: condor_release [U/C/J]
$ condor_release 128.0
Job 128.0 released
Holding or Removing Jobs
$ condor_hold bob
All jobs of user "bob" have been held
$ condor_hold 128
All jobs in cluster 128 have been held
$ condor_hold 128.0
Job 128.0 held
Job States, Revisited* (*not comprehensive)
• condor_hold (or HTCondor itself) puts a job on hold: Held (H)
• condor_release returns a held job to the queue
• condor_rm removes a job from the queue: Removed (X)
• condor_history shows jobs that have left the queue
Use Cases and HTCondor Features
Interactive Jobs
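The slide content is missing; an interactive job is requested with condor_submit's -interactive flag:
$ condor_submit -interactive job.submit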
Self-Checkpointing
Self-Checkpointing How-To
• Edit the executable:
  • Regularly exit with a non-zero exit code, after saving intermediate state to a checkpoint file
  • Always check for a checkpoint file when starting
• Add HTCondor options that transfer checkpoint files back to the Access Point and then restart the executable:
checkpoint_exit_code = 85
transfer_checkpoint_files = check.point
See Todd Miller's afternoon talk: Self-Checkpointing Jobs
Job Universes
Multi-CPU and GPU Computing
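The slide content is missing; a sketch of requesting multiple CPUs or a GPU in a submit file (request_cpus and request_gpus are standard submit commands; the values are illustrative):
request_cpus = 4
request_gpus = 1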
Automation
Retries
max_retries = 5
Limiting Jobs
max_idle = 1000
Useful Job Attributes for Automation
Automatically Hold Jobs
• Problem: Your job should run in 2 hours or less, but a few jobs "hang" randomly and run for days
• Solution: Put jobs on hold if they run for over 2 hours, using a periodic_hold statement (see the sketch below)
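The exact expression did not survive extraction; a sketch of such a periodic_hold (JobStatus == 2 means the job is running; the attribute names are standard HTCondor attributes):
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > (2 * 60 * 60))
periodic_hold_reason = "Job ran longer than two hours"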
Automatically Release Jobs
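The slide content is missing; a sketch of automatically releasing held jobs, limited by how many times a job has already started:
periodic_release = (NumJobStarts < 5) && ((time() - EnteredCurrentStatus) > 600)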
Useful commands
- condor_status -master – View all compute nodes and their resources
- condor_q -all – See the whole queue (all users, all jobs)
- condor_status -compact -constraint 'TotalGpus > 0' – View available GPU systems
- condor_status -compact -constraint 'TotalGpus > 0' -af Machine TotalGpus CUDADeviceName CUDACapability – Display GPU hardware information
GETTING HELP
• In Person: Groseclose 302
• Knowledgebase: https://it.isye.gatech.edu/knowledgebase/computational-resources
• Email: [email protected]
(please don't email a specific person directly; use the helpdesk email to keep requests from getting lost!)