
An Introduction to Using HTCondor
Kelley Ross
Covered In This Tutorial
• How to get access to the ISYE Cluster
• How to login to the cluster
• How to get your files to the ISYE Cluster
• How to access the Cluster
• What is HTCondor?
• Set up and Run a Job with HTCondor
• Submit Multiple Jobs with HTCondor
• How HTCondor Matches and Runs Jobs
• Testing and Troubleshooting
• Use Cases and HTCondor Features
• Automation
Introduction
Getting access to the cluster
• First, sign up for an ISyE account:
https://www.isye.gatech.edu/about/school/computing
• Get a terminal app:
• Windows:
• PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/ (careful if you Google!)
• SecureCRT: https://software.oit.gatech.edu
• Windows Subsystem for Linux: https://docs.microsoft.com/en-us/windows/wsl/install-win10
• Mac:
• Terminal.app (built-in)
• iTerm2: https://www.iterm2.com
• Linux: It’s built in!
• iOS: Prompt: https://panic.com/prompt
• Android: lots of options https://www.dunebook.com/10-best-android-terminal-emulator/
Getting your files on the cluster
File access via SCP/SFTP:
• Windows:
  • WinSCP: https://winscp.net
  • FileZilla: https://filezilla-project.org
• Mac:
  • Cyberduck: https://cyberduck.io
  • FileZilla: https://filezilla-project.org
  • Transmit: https://panic.com/transmit (not free)
• Linux:
  • scp at the command line; GUI options vary by distro
• GitHub
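
• For example, copying a file to the cluster with scp from a Mac or Linux terminal (a minimal sketch; the username gburdell3 and the file name model.lp are placeholders for your own):

$ scp model.lp [email protected]:~/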
How to login to the ISYE Cluster

Off campus, or on GTwireless:
• Connect to VPN (https://faq.oit.gatech.edu/content/how-do-i-get-started-campus-vpn)
OR
• SSH first to castle.isye.gatech.edu or keep.isye.gatech.edu

Once connected to VPN or castle, or if you're on the wired ISyE network:
• SSH to compute01
• SCP/SFTP via castle.isye.gatech.edu (does not require VPN or an SSH connection to castle first)
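
• For example, hopping through castle from off campus without VPN (a sketch; gburdell3 is a placeholder username):

$ ssh [email protected]
$ ssh compute01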
HTCondor Documentation
• Documentation:
• ISyE website
• https://www.isye.gatech.edu/about/school/computing
• https://www2.isye.gatech.edu/apps/helpdesk/kb/browse/866-Computational-Resources
• HTCondor website (make sure you select the ‘stable’ release)
• https://research.cs.wisc.edu/htcondor/manual/
• https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html
What is HTCondor

Open source software for distributed High-Throughput Computing, developed at the University of Wisconsin in the 1980s. HTCondor has three components:
- A job scheduler
- A resource manager
- A workflow management system

High-Throughput Computing allows many computational tasks to be completed over a long period of time. It is useful for researchers and other users who are more concerned with the number of computations they can do over long spans of time than with short-burst computations.
APPLICATIONS AVAILABLE ON THE CLUSTER
• Ampl
• Cplex
• Gams
• Gurobi
• Julia
• Mathematica
• Matlab
• Openmpi
• Python3
• R
• xpressmp

Users are able to install additional software if needed.
HTCondor Capabilities

• HTCondor has built-in ways to submit multiple independent jobs with one submit file:
  - Analyze multiple data files
  - Test various parameter or input combinations
• ...without having to:
  - Start each job individually
  - Create separate submit files for each job

HTCondor Jobs

• A single computing task is called a "job"
• The three main pieces of a job are the input, executable (program), and output
• The executable must be runnable from the command line without any interactive input
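
• A quick check before submitting: run the program by hand on the access point and make sure it finishes without prompting for anything (file names as in the example submit file later in this talk):

$ ./lp.py test0.lp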
Resource request
• Jobs nearly always use a part of a computer, not the whole thing
• It is very important to request appropriate resources:
  - memory
  - cpus
  - gpus
  - disk
• Even if your system has default CPU, memory, and disk requests, these may be too small!
• It is important to run test jobs and use the log file to request the right amount of resources
• Requesting too little causes problems for your jobs and other jobs, and jobs might be held by HTCondor; requesting too much means jobs will match to fewer "slots"
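
• In a submit file, the requests look like this (a sketch; the values are illustrative, not site defaults):

request_cpus   = 1
request_memory = 2GB
request_disk   = 4GB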
What is HTCondor?

• Software that schedules and runs computing tasks on computers
How It Works

• Submit tasks to a queue (on an Access Point – compute01)
• HTCondor schedules them to run on computers (Execute Points)

[Diagram: tasks wait in the queue on the access point and are dispatched to execute points]
HTCondor on Many Computers

[Diagram: one access point's queue feeding tasks to many execute points]
Why HTCondor?

• HTCondor manages and runs work on your behalf.
• It manages shared resources among users:
  • Schedules tasks on a group* of computers (which may or may not be directly accessible to the user).
  • Schedules tasks submitted by multiple users on one or more computers.

*in HTCondor-speak, a "pool"

Set Up and Run a Job with HTCondor

Jobs

• A single computing task is called a "job"
• The three main pieces of a job are the input, executable (program), and output
• The executable must be runnable from the command line without any interactive input
Simple submit file

universe = vanilla
getenv = true
executable = lp.py
arguments = test0.lp
log = test.log
output = test.out
error = test.error
notification = complete
request_memory = 1024
request_cpus = 1
queue
Submitting and Monitoring Jobs

• To submit a job/jobs:
condor_submit submit_file_name
• To monitor submitted jobs, use:
condor_q
(HTCondor Manual: condor_submit, condor_q)

$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 128.

$ condor_q
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/01/19 10:35:54
OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  ID: 128     5/9 11:09  _     _    1     1      128.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

More about condor_q

• By default condor_q shows:
  • the user's jobs only, summarized in "batches"
• Constrain with a username, ClusterId, or full JobId, denoted [U/C/J] in the following slides.

$ condor_q
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/09/19 11:35:54
OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  ID: 128     5/9 11:09  _     _    1     1      128.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

JobId = ClusterId.ProcId
More about condor_q

• To see individual job information, use:
condor_q -nobatch

$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:00:00 I  0   0.0  compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

• We will use the -nobatch option in the following slides to see extra detail about what is happening with a job
Job Idle
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:00:00 I  0   0.0  compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Access Point (submit_dir)/:
  job.submit
  compare_states
  wi.dat
  us.dat
  job.log
Job Starts
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:00:00 <  0   0.0  compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

(The "<" state means input files are being transferred to the execute point.)

Access Point (submit_dir)/:     Execute Point (execute_dir)/:
  job.submit                      compare_states
  compare_states                  wi.dat
  wi.dat                          us.dat
  us.dat
  job.log
Job Running
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:01:08 R  0   0.0  compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Access Point (submit_dir)/:     Execute Point (execute_dir)/:
  job.submit                      compare_states
  compare_states                  wi.dat
  wi.dat                          us.dat
  us.dat                          stderr
  job.log                         stdout
                                  wi.dat.out
Job Completes
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:02:02 >  0   0.0  compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

(The ">" state means output files are being transferred back to the access point.)

Access Point (submit_dir)/:     Execute Point (execute_dir)/:
  job.submit                      compare_states
  compare_states                  wi.dat
  wi.dat                          us.dat
  us.dat                          stderr
  stderr                          stdout
  stdout                          wi.dat.out
  job.log
  wi.dat.out
Job Completes (cont.)
$ condor_q -nobatch
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
 ID     OWNER  SUBMITTED  RUN_TIME  ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Access Point (submit_dir)/:
  job.submit
  compare_states
  wi.dat
  us.dat
  job.log
  job.out
  job.err
  wi.dat.out
Log File
000 (7195807.000.000) 05/19 14:30:18 Job submitted from host:
    <128.105.244.191:9618 ...>
...
040 (7195807.000.000) 05/19 14:31:55 Started transferring input files
    Transferring to host: <128.105.245.85:9618 ...>
...
040 (7195807.000.000) 05/19 14:31:55 Finished transferring input files
...
001 (7195807.000.000) 05/19 14:31:56 Job executing on host:
    <128.105.245.85:9618? ...>
...
005 (7195807.000.000) 05/19 14:35:56 Job terminated.
    (1) Normal termination (return value 0)
...
    Partitionable Resources : Usage  Request  Allocated
       Cpus                 :     0        1          1
       Disk (KB)            :    26     1024     995252
       Memory (MB)          :     1     1024       1024
Job States

condor_submit → Idle (I) → Running (R) → Completed (C)

• Idle → Running: the executable and input are transferred to the execute point
• Running → Completed: the output is transferred back to the access point
• Idle and Running jobs are in the queue; Completed jobs are leaving the queue

Resource Request

• Jobs nearly always use a part of a computer (a slot), not the whole thing.
• It is very important to request appropriate resources (memory, cpus, disk) for a job.

[Diagram: your request is one slice of the whole computer. Photo by Evan-Amos on WikiMedia, CC-BY-SA]


Submit Multiple Jobs
with HTCondor

Why do we care?

• Run many independent jobs...
  • analyze multiple data files
  • test parameter or input combinations
  • and more!
• ...without having to:
  • start each job individually
  • create separate submit files for each job
Many Jobs, One Submit File

• HTCondor has built-in ways to submit multiple independent jobs with one submit file.

See Rachel Lombardi's talk next: Organizing and Submitting HTC Workloads
Photo by Joanna Kosinska on Unsplash
Numbered Input Files

• Goal: create 3 jobs that each analyze a different input file.

job.submit:                        (submit_dir)/:
universe = vanilla                   job.submit
getenv = true                        lp.py
executable = lp.py                   file0.lp
arguments = file$(ProcId).lp         file1.lp
log = job.log                        file2.lp
output = job.out
error = job.err
queue 3
Automatic Variables

• queue N creates N jobs in one cluster: they share a ClusterId, and ProcId runs from 0 to N-1.
• Each job's ClusterId and ProcId can be accessed inside the submit file using:
$(ClusterId)
$(ProcId)

ClusterId  ProcId
128        0
128        1
128        2
...        ...
128        N-1
Job Variation

• How to uniquely identify each job (filenames, log/out/err names)?

job.submit:                         (submit_dir)/:
executable = analyze.exe              job.submit
arguments = file0.in file0.out        analyze.exe
transfer_input_files = file0.in       file0.in
log = job.log                         file1.in
output = job.out                      file2.in
error = job.err
queue 3
Submitting Multiple Jobs Using $(ProcId)

universe = vanilla
getenv = true
executable = $ENV(HOME)/gurobi/lp.py
arguments = test$(ProcId).lp
log = $ENV(HOME)/gurobi/$(ClusterId).log
output = $ENV(HOME)/gurobi/$(ClusterId).$(ProcId).out
error = $ENV(HOME)/gurobi/$(ClusterId).$(ProcId).error
notification = complete
request_memory = 1048
request_cpus = 1
queue 3

Use the $(ClusterId) and $(ProcId) variables to provide unique values to jobs.
Submit and Monitor (review)

condor_submit submit_file_name
condor_q
(HTCondor Manual: condor_submit, condor_q)
• Jobs in the queue will be grouped in batches (in this case by cluster number)

$ condor_submit job.submit
Submitting job(s).
3 job(s) submitted to cluster 128.

$ condor_q
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/09/19 10:35:54
OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  ID: 128     5/9 11:03  _     _    3     3      128.0-2

3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended

Submitting Multiple Jobs Using Queue Statements

universe = vanilla
getenv = true
executable = lp.py
arguments = $(file)
log = $ENV(HOME)/gurobi/$(ClusterId).log
output = $ENV(HOME)/gurobi/$(ClusterId).$(ProcId).out
error = $ENV(HOME)/gurobi/$(ClusterId).$(ProcId).error
notification = complete
request_memory = 1048
request_cpus = 1
queue file from testlp.txt

Using Batches

• Alternatively, batches can be grouped manually using the JobBatchName attribute in a submit file:
JobBatchName = "CoolJobs"

$ condor_q
OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  CoolJobs    5/9 11:03  _     _    3     3      128.0-2

• To see individual jobs, use:
condor_q -nobatch

Organizing Jobs

Use Sub-Directories for File Type

• Create sub-directories* and use paths in the submit file to separate input, error, log, and output files.

job.submit:                                          (submit_dir)/:
executable = analyze.exe                               job.submit
arguments = file$(ProcId).in file$(ProcId).out         analyze.exe
transfer_input_files = input/file$(ProcId).in          file0.out file1.out file2.out
log = log/job$(ProcId).log                             input/ file0.in file1.in file2.in
queue 3                                                log/ job0.log job1.log job2.log

* must be created before the job is submitted
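
• For example, creating the sub-directories before submitting (a short sketch using the layout above):

$ mkdir -p input log
$ condor_submit job.submit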
One Job per Directory

• Change the submission directory for each job using initialdir
• Allows the user to organize job files into separate directories.
• Use the same name for all input/output files
• Useful for jobs with lots of output files

job0/  job1/  job2/  job3/  job4/
Box by Alice Design from the Noun Project
Separate Jobs with InitialDir

(submit_dir)/:
  job.submit
  analyze.exe
  job0/  file.in, job.log, job.err, file.out
  job1/  file.in, job.log, job.err, file.out
  job2/  file.in, job.log, job.err, file.out

job.submit:
executable = analyze.exe
initialdir = job$(ProcId)
arguments = file.in file.out
transfer_input_files = file.in
log = job.log
error = job.err
queue 3

The executable should be in the directory with the submit file, *not* in the individual job directories.
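
• A sketch of setting up the per-job directories before submitting (the inputs/ source path is illustrative):

$ for i in 0 1 2; do mkdir -p job$i; cp inputs/file$i.in job$i/file.in; done
$ condor_submit job.submit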
Other Submission Methods

• What if your input files/directories aren't numbered from 0 to (N-1)?
• There are other ways to submit many jobs!

Photo by Andrew Toskin on Flickr, CC-BY-SA
Submitting Multiple Jobs

Replacing single job inputs…

executable = compare_states
arguments = wi.dat us.dat wi.dat.out
transfer_input_files = us.dat, wi.dat
queue 1

…with a variable of choice:

executable = compare_states
arguments = $(infile) us.dat $(infile).out
transfer_input_files = us.dat, $(infile)
queue ...
Possible Queue Statements

matching ... pattern:  queue infile matching *.dat
in ... list:           queue infile in (wi.dat ca.dat ia.dat)
from ... file:         queue infile from state_list.txt

state_list.txt:
  wi.dat
  ca.dat
  ia.dat
Queue Statement Comparison

matching ... pattern
  + Natural nested looping, minimal programming; use the optional "files" and "dirs" keywords to match only files or only directories
  - Requires good naming conventions

in ... list
  + Supports multiple variables; all information contained in a single file; reproducible
  - Harder to automate submit file creation

from ... file
  + Supports multiple variables; highly modular (easy to use one submit file for many job batches); reproducible
  - Additional file needed
Using Multiple Variables

• The "from" syntax supports using multiple variables from a list.

job.submit:                           job_list.txt:
executable = compare_states             wi.dat, 2010
arguments = -y $(option) -i $(file)     wi.dat, 2015
should_transfer_files = YES             ca.dat, 2010
when_to_transfer_output = ON_EXIT       ca.dat, 2015
transfer_input_files = $(file)          ia.dat, 2010
queue file,option from job_list.txt     ia.dat, 2015
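
• For the first row of job_list.txt (wi.dat, 2010), the substituted job runs the equivalent of:

$ compare_states -y 2010 -i wi.dat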
Other Features

• Match existing files or directories:
queue input matching files *.dat
queue directory matching dirs job*

• Submit multiple jobs with the same input data:
queue 10 input matching files *.dat

• Use other automatic variables: $(Step)
arguments = -i $(input) -rep $(Step)
queue 10 input matching files *.dat

For example, with two matching .dat files, "queue 10" creates 20 jobs; within each input's batch of 10, $(Step) counts 0 through 9, so every job gets a unique (input, Step) pair.
(60 second)
Pause
Questions so far?
Job Matching and
Class Ad Attributes

The Central Manager

• HTCondor matches jobs with computers via a "central manager".

[Diagram: the central manager sits between the access point and the execute points. Icon by Gan Khoon Lay from the Noun Project]
Class Ads

• HTCondor stores a list of information about each job and each computer.
• This information is stored as a "Class Ad"
• Class Ads have the format:
  AttributeName = value
  where the value can be a boolean, number, or string

HTCondor Manual: Appendix A: Class Ad Attributes
Photo by Wherda Arsianto on Unsplash
Job Class Ad

Submit file:                             + HTCondor configuration  =>
executable = compare_states
arguments = wi.dat us.dat wi.dat.out     RequestCpus = 1
should_transfer_files = YES              Err = "job.err"
transfer_input_files = us.dat, wi.dat    WhenToTransferOutput = "ON_EXIT"
when_to_transfer_output = ON_EXIT        TargetType = "Machine"
log = job.log                            Cmd = "/home/alice/compare_states"
output = job.out                         JobUniverse = 5
error = job.err                          Iwd = "/home/alice/tests/htcondor_week"
request_cpus = 1                         RequestDisk = 20480
request_disk = 20MB                      NumJobStarts = 0
request_memory = 20MB                    TransferInput = "us.dat,wi.dat"
queue 1                                  Out = "job.out"
                                         UserLog = "/home/alice/job.log"
                                         RequestMemory = 20
                                         ...
Computer "Machine" Class Ad

HTCondor configuration  =>

HasFileTransfer = true
DynamicSlot = true
TotalSlotDisk = 4300218.0
TargetType = "Job"
TotalSlotMemory = 2048
Mips = 17902
Memory = 2048
UtsnameSysname = "Linux"
MAX_PREEMPT = ( 3600 * 72 )
OpSysMajorVer = 6
TotalMemory = 9889
OpSysName = "SL"
HasDocker = true
...
Job Matching

• On a regular basis, the central manager reviews Job and Machine Class Ads and matches jobs to computers.

[Diagram: the central manager comparing ads from the access point and the execute points]
Job Execution

• (Then the access and execute points communicate directly.)

[Diagram: the access point talking directly to the matched execute point]
Class Ads for People

• Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators

See later talk: What Are My Jobs Doing?
Photo by Roman Kraft on Unsplash
Finding Job Attributes

• Use the "long" option for condor_q:
condor_q -l JobId

$ condor_q -l 128.0
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/home/alice/tests/htcondor_week/compare_states"
JobUniverse = 5
Iwd = "/home/alice/tests/htcondor_week"
RequestDisk = 20480
NumJobStarts = 0
OnExitRemove = true
TransferInput = "us.dat,wi.dat"
UserLog = "/home/alice/tests/htcondor_week/job.log"
RequestMemory = 20
...
Displaying Job Attributes

• Use the "auto-format" option:
condor_q [U/C/J] -af Attribute1 Attribute2 ...

$ condor_q -af ClusterId ProcId RemoteHost MemoryUsage
1725 116 [email protected] 1709
1725 118 [email protected] 1709
1725 137 [email protected] 1709
1725 139 [email protected] 1709
1861 0 [email protected] 196
1863 0 [email protected] 269
1864 0 [email protected] 245
1865 0 [email protected] 196

Selecting Job Attributes

• Use the "constraint" option, along with an expression for which jobs you want to look at:
condor_q [U/C/J] -constraint 'Attribute >/</== value'

$ condor_q -constraint 'JobBatchName == "CoolJobs"'
OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  CoolJobs    5/9 11:03  _     _    3     3      128.0-2

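• For example, to list only your held jobs (JobStatus 5, per the status codes later in this talk):

$ condor_q -constraint 'JobStatus == 5'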
Other Displays

• See the whole queue (all users, all jobs):
condor_q -all

$ condor_q -all
-- Schedd: submit-1.chtc.wisc.edu : <128.104.101.92:9618?...
OWNER  BATCH_NAME   SUBMITTED  DONE  RUN  IDLE  HOLD  TOTAL  JOB_IDS
alice  DAG: 128     5/9 02:52  982   2    _     _     1000   18888976.0 ...
bob    DAG: 139     5/9 09:21  _     1    89    _     180    18910071.0 ...
alice  DAG: 219     5/9 10:31  1     997  2     _     1000   18911030.0 ...
bob    DAG: 226     5/9 10:51  10    _    1     _     44     18913051.0
bob    CMD: ce.sh   5/9 10:55  _     _    _     2     _      18913029.0 ...
alice  CMD: sb      5/9 10:57  _     2    998   _     _      18913030.0-999

Class Ads for Computers

• As condor_q is to jobs, condor_status is to computers (or "machines")
HTCondor Manual: condor_status

$ condor_status
Name                                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   673   25+01
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+01
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+01
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+00
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+14
[email protected]  LINUX  X86_64  Unclaimed  Idle      1.000   2693  19+19
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+04
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+01
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.010   645   25+05
[email protected]         LINUX  X86_64  Claimed    Busy      1.000   2048  0+01

                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain
X86_64/LINUX    10962  0      10340    613        0        0           0         9
X86_64/WINDOWS  2      2      0        0          0        0           0         0
Total           10964  2      10340    613        0        0           0         9

Find Machine Attributes

• Use the same options as condor_q: to get attributes for a specific machine, use:
condor_status -l Slot/Machine

$ condor_status -l [email protected]
HasFileTransfer = true
COLLECTOR_HOST_STRING = "cm.chtc.wisc.edu"
TotalTimeClaimedBusy = 43334
Mips = 17902
MAX_PREEMPT = ( 3600 * ( 72 - 68 * ( WantGlidein =?= true ) ) )
Requirements = ( START ) && ( IsValidCheckpointPlatform ) && ( WithinResourceLimits )
State = "Claimed"
OpSysMajorVer = 6
OpSysName = "SL"
...
Useful Machine Attributes

• Machine, Name: name of the server, or slot
• Cpus, Memory, Disk: resources on that server
• GPUs, GPUs_DeviceName: number and type of GPUs
• RemoteOwner: who is running a job on the slot
• CPUModel: type of CPU
• ...and more (see the manual)
Display Machine Attributes

• Use the same options as condor_q, part 2: to display attributes, use:
condor_status [Machine] -af Attribute1 Attribute2 ...

$ condor_status e000.chtc.wisc.edu -af Name Cpus Memory Disk HasCHTCStaging
[email protected] 1 768 12992383 false
[email protected] 1 80116 82285091 false
[email protected] 2 1536 1332553 false
[email protected] 1 768 12992383 false
[email protected] 2 1536 1332553 false
[email protected] 2 1536 1332553 false
[email protected] 2 1536 1332553 false
[email protected] 1 2048 2331967 false
[email protected] 2 1536 1332553 false

Machine Attributes

• To summarize, use the "-compact" option


condor_status -compact
$ condor_status -compact
Machine Platform Slots Cpus Gpus TotalGb FreCpu FreeGb CpuLoad ST
e007.chtc.wisc.edu x64/SL6 8 8 23.46 0 0.00 1.24 Cb
e008.chtc.wisc.edu x64/SL6 8 8 23.46 0 0.46 0.97 Cb
e009.chtc.wisc.edu x64/SL6 11 16 23.46 5 0.00 0.81 **
e010.chtc.wisc.edu x64/SL6 8 8 23.46 0 4.46 0.76 Cb
matlab-build-1.chtc.wisc.edu x64/SL6 1 12 23.45 11 13.45 0.00 **
matlab-build-5.chtc.wisc.edu x64/SL6 0 24 23.45 24 23.45 0.04 Ui
mem1.chtc.wisc.edu x64/SL6 24 80 1009.67 8 0.17 0.60 **

Total Owner Claimed Unclaimed Matched Preempting Backfill Drain

x64/SL6 10416 0 9984 427 0 0 0 5


x64/WinVista 2 2 0 0 0 0 0 0

Total 10418 2 9984 427 0 0 0 5

Testing and Troubleshooting

What Can Go Wrong?

• Jobs can go wrong “internally”:


• something happens after the executable begins to run
• Jobs can go wrong from HTCondor’s perspective:
• A job can’t be started at all,
• Uses too much memory,
• Has a badly formatted executable,
• And more...

"Live" Troubleshooting

• To log in to a running job on the machine where it is executing, use:
condor_ssh_to_job JobId

$ condor_ssh_to_job 128.0
Welcome to [email protected]!
Your condor job is running with pid(s) 3954839.
Reviewing Failed Jobs

• A job's log, output, and error files can provide valuable information for troubleshooting

Log:
  • when jobs were submitted, started, and stopped
  • resources used
  • exit status
  • where the job ran
  • interruption reasons
Output:
  • any "print" or "display" information from your program
Error:
  • errors captured by the operating system (stderr)
Reviewing Recent Jobs

• To review a large group of jobs at once, use condor_history [U/C/J]
• As condor_q is to the present, condor_history is to the past

$ condor_history alice
 ID       OWNER  SUBMITTED   RUN_TIME    ST  COMPLETED   CMD
189.1012  alice  5/11 09:52  0+00:07:37  C   5/11 16:00  /home/alice
189.1002  alice  5/11 09:52  0+00:08:03  C   5/11 16:00  /home/alice
189.1081  alice  5/11 09:52  0+00:03:16  C   5/11 16:00  /home/alice
189.944   alice  5/11 09:52  0+00:11:15  C   5/11 16:00  /home/alice
189.659   alice  5/11 09:52  0+00:26:56  C   5/11 16:00  /home/alice
189.1003  alice  5/11 09:52  0+00:07:38  C   5/11 15:59  /home/alice
189.962   alice  5/11 09:52  0+00:09:36  C   5/11 15:59  /home/alice
189.898   alice  5/11 09:52  0+00:13:47  C   5/11 15:59  /home/alice

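• condor_history also accepts the -af auto-format option, so you can pull specific attributes from completed jobs (a sketch; -limit caps how many records are shown):

$ condor_history alice -limit 5 -af ClusterId ProcId ExitCode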
Held Jobs

• HTCondor will put your job on hold if there's something YOU need to fix.
• A job that goes on hold is interrupted (all progress is lost) and kept from running again, but remains in the queue in the "H" state.
Photo by Tim Gouw on Unsplash

$ condor_q -nobatch
 ID     OWNER  SUBMITTED  RUN_TIME   ST PRI SIZE CMD
128.0   alice  5/9 11:09  0+00:00:00 H  0   0.0  analyze.exe

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

Diagnosing Holds

• If HTCondor puts a job on hold, it provides a hold reason, which can be viewed with:
condor_q -hold

$ condor_q -hold
 ID     OWNER  HELD_SINCE  HOLD_REASON
125.0   bob    5/09 17:12  Error from [email protected]: Job has
gone over memory limit of 2048 megabytes.
128.0   alice  5/11 12:06  Error from [email protected]: STARTER
at 128.104.101.138 failed to send file(s) to <128.104.101.92:9618>; SHADOW
at 128.104.101.92 failed to write to file /home/alice/Test_18925319_16.err:
(errno 122) Disk quota exceeded
131.0   bob    5/12 09:02  Error from [email protected]: Failed
to execute '/var/lib/condor/execute/slot1/dir_2471876/condor_exec.exe' with
arguments 2: (errno=2: 'No such file or directory')

Fixing Holds

• Job attributes can be edited while jobs are in the queue using:
condor_qedit [U/C/J] Attribute Value

$ condor_qedit 128.0 RequestMemory 3072
Set attribute "RequestMemory".

• If a job has been fixed and can run again, release it with:
condor_release [U/C/J]

$ condor_release 128.0
Job 128.0 released

Holding or Removing Jobs

• If you know your job has a problem and it hasn't yet completed, you can:
• Place it on hold yourself, with condor_hold [U/C/J]

$ condor_hold bob
All jobs of user "bob" have been held

$ condor_hold 128
All jobs in cluster 128 have been held

$ condor_hold 128.0
Job 128.0 held

HTCondor Manual: condor_hold

• Remove it from the queue, using condor_rm [U/C/J]
HTCondor Manual: condor_rm

Job States, Revisited*

condor_submit → Idle (I) → Running (R) → Completed (C)

• condor_hold, or HTCondor itself putting a job on hold, moves it to Held (H); condor_release returns it to Idle
• condor_rm moves a job to Removed (X)
• Idle, Running, and Held jobs are in the queue (see condor_q); Completed and Removed jobs are leaving the queue (see condor_history)

*not comprehensive
Use Cases and
HTCondor Features

Interactive Jobs

• An interactive job proceeds like a normal batch job, but opens a bash session into the job's execution directory instead of running an executable.
condor_submit -i submit_file

$ condor_submit -i interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 18980881.
Waiting for job to start...
Welcome to [email protected]!

• Useful for testing and troubleshooting

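• A minimal sketch of what interactive.submit might contain — with -i no executable is needed, just the resource requests (values illustrative):

universe = vanilla
getenv = true
request_cpus = 1
request_memory = 2GB
queue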
Self-Checkpointing

• By default, a job that is interrupted will start from the beginning if it is restarted.
• It is possible to implement self-checkpointing, which will allow a job to restart from a saved state if interrupted.
• Self-checkpointing is useful for:
  • very long jobs
  • running on opportunistic resources

Self-Checkpointing How-To

• Edit the executable to:
  • Regularly exit with a non-zero exit code, after saving intermediate states to a checkpoint file
  • Always check for a checkpoint file when starting
• Add HTCondor options that transfer checkpoint files back to the Access Point and then restart the executable:

checkpoint_exit_code = 85
transfer_checkpoint_files = check.point

See Todd Miller's afternoon talk: Self-Checkpointing Jobs
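
• A minimal sketch of a self-checkpointing executable in shell (the work loop, the check.point file name, and the checkpoint interval are all illustrative):

#!/bin/bash
# resume from the checkpoint file if one exists
if [ -f check.point ]; then
    start=$(cat check.point)
else
    start=0
fi
for ((i=start; i<1000000; i++)); do
    do_one_unit_of_work $i            # hypothetical function standing in for real work
    if (( (i+1) % 10000 == 0 )); then
        echo $((i+1)) > check.point   # record where to resume
        exit 85                       # matches checkpoint_exit_code above; HTCondor
    fi                                # saves check.point and restarts the executable
done
rm -f check.point                     # finished: exit 0 so the job completes normally
exit 0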
Job Universes

• HTCondor has different "universes" for running specialized job types
  • HTCondor Manual: Choosing an HTCondor Universe
• Vanilla (default)
  • good for most software
  • HTCondor Manual: Vanilla Universe
  • Set in the submit file using:
universe = vanilla

Image credit: EHT collaboration
Multi-CPU and GPU Computing

• Jobs that use multiple cores on a single computer can be run in the vanilla universe (the parallel universe is not needed):
request_cpus = 16

• If there are computers with GPUs, request them with:
request_gpus = 1

Automation

Automation

• After job submission, HTCondor manages jobs based on its configuration
• You can use options that will customize job management even further
• These options can automate when jobs are started, stopped, and removed
Photo by Mixabest on WikiMedia, CC-BY-SA

Retries

• Problem: a small number of jobs fail; if they run again, they complete successfully.
• Solution: if the job exits with an error, leave it in the queue to run again, via the max_retries option:

max_retries = 5

Limiting Jobs

• Problem: you want to submit more than a few thousand jobs to the queue at once
• Solution: use the max_idle option. This limits how many of your jobs sit in the queue at one time, while always keeping idle jobs ready to run:

max_idle = 1000

Useful Job Attributes for Automation

• CurrentTime: current time


• EnteredCurrentStatus: time of last status change
• ExitCode: the exit code from the job
• HoldReasonCode: number corresponding to a hold reason
• NumJobStarts: how many times the job has gone from idle to
running
• JobStatus: number indicating idle, running, held, etc.

HTCondor Manual: Appendix A: JobStatus and HoldReason Codes

Automatically Hold Jobs

• Problem: Your job should run in 2 hours or less, but a few jobs "hang" randomly and run for days
• Solution: Put jobs on hold if they run for over 2 hours, using a periodic_hold statement:

periodic_hold = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > (60 * 60 * 2))

• JobStatus == 2: the job is running
• CurrentTime - EnteredCurrentStatus: how long the job has been running, in seconds
• 60 * 60 * 2: 2 hours

Automatically Release Jobs

• Problem (related to previous): A few jobs are being held for running long; they will complete if they run again.
• Solution: automatically release those held jobs with a periodic_release option, up to 3 times:

periodic_release = (JobStatus == 5) && (HoldReasonCode == 3) && (NumJobStarts < 3)

• JobStatus == 5: the job is held
• HoldReasonCode == 3: the job was put on hold by periodic_hold
• NumJobStarts < 3: the job has started running fewer than 3 times
Automatically Remove Jobs

• Problem: Jobs are repetitively failing
• Solution: Remove jobs from the queue using a periodic_remove statement:

periodic_remove = (NumJobStarts > 5)

• NumJobStarts > 5: the job has started running more than 5 times

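• A sketch combining the automation options above in one submit file (the thresholds are illustrative, not recommendations):

max_retries = 5
periodic_hold = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > (60 * 60 * 2))
periodic_release = (JobStatus == 5) && (HoldReasonCode == 3) && (NumJobStarts < 3)
periodic_remove = (NumJobStarts > 5)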
Useful commands
- condor_status -master – View all compute nodes and their resources
- condor_q -ana:sum (user) – Print a summary analysis of a user's jobs
- condor_rm <jobid> – Remove a job
- condor_qedit <jobid> RequestMemory 3078 – Edit an existing job
- condor_q -all – See the whole queue (all users, all jobs)
- condor_rm -all – Remove all of your jobs
- condor_q -hold – Show held jobs and their hold reasons
- condor_q -analyze 107.5 – Analyze why a specific job is not running
- condor_status -compact -constraint 'TotalGpus > 0' – View available GPU systems
- condor_status -compact -constraint 'TotalGpus > 0' -af Machine TotalGpus CUDADeviceName CUDACapability – Display GPU hardware information
GETTING HELP
• In Person: Groseclose 302
• Knowledgebase: https://it.isye.gatech.edu/knowledgebase/computational-resources
• Email: [email protected]
  (please don't email a specific person directly; use the helpdesk email to keep your request from getting lost!)

ISyE HPC Announcements Email List
https://lists.isye.gatech.edu/mailman/listinfo/isye-hpc-announce
Final Questions?
