BME Operating Systems 2015.
UNIX process scheduling
Tamás Mészáros
http://www.mit.bme.hu/~meszaros/
Department of Measurement and Information Systems
Budapest University of Technology and Economics
UNIX scheduling 1 / 23
BME Operating Systems 2015.
Previously....
• Definition of a process (task)
– Running program (solving a particular problem)
• Relation between the process and the kernel
– execution mode (user, kernel) and context (process, kernel)
– syscall interface
– process states and transitions
two running states, zombie, suspended (stopped) states, etc.
• Administrative data of a process (u-area and proc structure)
– PID (process identifier) and PPID (parent PID)
– credentials (UID, GID)
– actual process state
– scheduling information
– etc.
UNIX scheduling 2 / 23
BME Operating Systems 2015.
Scheduling in general
• Main task of scheduling
selecting the next task to be run from the set of runnable processes
• Basic notation
– preemptive and cooperative scheduling
– priority
– static and dynamic scheduling
– measuring quality (CPU utilization, throughput, avg. wait time, etc.)
• Basic operation
– maintain a set of runnable tasks (FIFO, red-black binary tree, ...)
– choose the next task to be run
• Requirements
– small overhead, small complexity (O(1), O(N), O(log N))
– optimality (according to the selected measures)
– deterministic, fair, avoids starvation and system breakdown
– there might be special application needs (real-time, batch, multimedia, ...)
UNIX scheduling 3 / 23
BME Operating Systems 2015.
Overview of this lecture
• Traditional UNIX scheduling
– characteristics
– operation in user and kernel mode
– calculating priorities
– detailed algorithm (user mode)
• Practice
• Modern UNIX schedulers
– requirements and characteristics
– short summary of the Solaris and Linux schedulers
UNIX scheduling 4 / 23
BME Operating Systems 2015.
Overview of the classical UNIX scheduling
• UNIX scheduling is preemptive, time-sharing, and priority-based
• Priority-based
– every process has a dynamically calculated priority
– the process with the highest priority runs
• Time-sharing
– multiple processes at the same priority level can run in parallel
– each of them is given a time slice (e.g. 10 msec)
– after the time is expired the next process with the same prio will run
• Note: scheduling is different in user and kernel mode
– user mode: preemptive, time-sharing, dynamic priority
– kernel mode: non-preemptive, no time-sharing, fixed priority
(Modern UNIX schedulers are quite different.)
UNIX scheduling 5 / 23
BME Operating Systems 2015.
Scheduling (priority) levels
• The priority is between 0 and 127 (classical UNIX)
– 0 is the highest, 127 is the lowest priority level
– 0-49: kernel mode 50-127: user mode
• Processes are organized into 32 priority levels:
UNIX scheduling 6 / 23
BME Operating Systems 2015.
Scheduling in kernel mode
• Characteristics (very simple and almost zero overhead)
– fixed priority
– non preemtive
– no time-sharing (since there is no preemtion)
• The priority does not depend on
– the priority in user model
– processor usage in the past
• Calculating the kernel mode priority
– It is based on the reason why the process went to sleep.
– sleep priority: attached to reasons to be in sleep state
– After waking up a process the kernel sets its priority to the sleep priority.
– Examples for sleep priority
20 disk I/O
28 user input from the character terminal
UNIX scheduling 7 / 23
BME Operating Systems 2015.
Scheduling in user mode
• The dynamically calculated priority is the basis of scheduling
– the process with the highest priority runs
– the scheduler checks the priority levels and chooses the first process from
the highest non-empty queue (see the figure before)
• Task of the scheduler
– in every cycle (typically 100 times a second)
Is there any process at higher levels than the currently running process?
If so then switch to the highest priority process.
– after each time slice (typically 10 cycles, i.e. 10 times a second)
Is there another process at the same priority level?
If so the switch to the next process in the queue of that priority level.
Round-Robin scheduling algorithm
UNIX scheduling 8 / 23
BME Operating Systems 2015.
Scheduling data for processes
• For each process:
p_pri the actual priority of the process (kernel or user mode)
p_usrpri the user mode priority of the process
p_cpu CPU usage in the past
p_nice priority modifier given by the user
• Scheduling decisions are based on p_pri
• Entering kernel mode saves p_pri into p_usrpri
• Returning from kernel mode recalls p_pri from p_usrpri
• p_cpu shows how much CPU was given to the process.
The scheduler will „forget” past CPU usage slowly to avoid starvation.
UNIX scheduling 9 / 23
BME Operating Systems 2015.
Calculating the priority in user mode
• In every cycle p_cpu is increased for the running process
p_pcu++
• After 100 cycles p_pri is calculated in the following way
– p_cpu is „aged” by a correction factor (CF)
p_cpu = p_cpu * CF
– then the new priority is calculated according to this equation:
p_pri = P_USER + p_cpu / 4 + 2 * p_nice P_USER = 50 (constant)
• Calculating the correction factor:
– SVR3: CF = 1/2 (What is the problem with this?)
– 4.3 BSD: CF depends on the average number of processes (load_avg)
in runnable (ready to run) state:
CF = 2 * load_avg / (2 * load_avg + 1)
See the following commands: w, top and the graphical system monitor
UNIX scheduling 10 / 23
BME Operating Systems 2015.
Summary of the user mode scheduling
• In every cycle
– If there is a process at a higher priority level then switch to that process
– Increase p_cpu for the running process.
• Every 10 cycles
– Round-Robin scheduling among the processes at the same priority level
• Every 100 cycles
– calculating the correction factor based on the last 100 cycles
– „aging” p_cpu
– calculating priorities for all processes
please note that the priority does not depend on the past priority
– switching to the highest priority process
UNIX scheduling 11 / 23
BME Operating Systems 2015.
Practice: scheduling 3 processes
(see scheduling_examples.xls)
UNIX scheduling 12 / 23
BME Operating Systems 2015.
Evaluating the classical UNIX scheduling
• Pros
+ simple and efficient
+ suitable for general-purpose time-sharing systems
+ avoids starvation
+ supports processes with I/O operations
• Cons
– does not scale well
– no guarantee for processes
– users can not configure scheduling (except for nice)
– no support for multi-processor, multi-core systems
– kernel mode is non-preemptive:
A process running in kernel mode for a long time can hold up the entire
system (priority inversion)
UNIX scheduling 13 / 23
BME Operating Systems 2015.
Modern UNIX schedulers (requirements)
• New scheduling classes
– special application needs (multimedia, real-time, etc.)
– „fair share”: it is possible to plan the resource allocation
– multitasking at kernel level
– modular scheduling with and extendable framework
• Kernel preemption
– it is necessary for multiprocessor scheduling
• Performance, overhead
– scheduling became more and more complex (requirements, hardware)
– scheduling algorithms should scale well
• Threads or processes?
– modern applications use threads a lot (e.g. Java)
– schedulers should focus on threads not on processes
UNIX scheduling 14 / 23
BME Operating Systems 2015.
Scheduling in the Solaris operating system
• Characteristics
– scheduling is thread-based
– the kernel is fully preemptible
– supports multi-processor systems and virtualization
• New scheduling classes
– Time Sharing (TS): similar to the classical scheduling
– Interactive (IA): same as above but puts more emphasis on the active
window on the graphical user interface
– Fixed priority (FX)
– Fair share (FSS): allocating CPU resources to process groups
– Real-time (RT): provides the shortest resonse time
– Kernel threads (SYS)
UNIX scheduling 15 / 23
BME Operating Systems 2015.
Solaris scheduling levels
UNIX scheduling 16 / 23
BME Operating Systems 2015.
Inherited priorities (Solaris)
• The problem of priority inversion (blue: waiting, red: holding)
• Solution: increasing the priorities according to the waiting scheme
P3
pri: 60 R2
P2
pri: 80
P1 R1
pri: 100
UNIX scheduling 17 / 23
BME Operating Systems 2015.
Linux schedulers
• Before kernel V2: based on the classical UNIX scheduler
• Before V 2.4
– scheduling classes: real-time, non-preemptive, normal
– scheduling algorithm with O(N) complexity
– single runnable queue (no SMP support)
– non-preemptive kernel
• Kernel v2.6 (Ingo Molnár)
– O(1) scheduler (scales very well)
– multiple runnable queues (better SMP support)
– a heuristic algorithm to differentiate between I/O and CPU-bound tasks
• comparing running and waiting (sleeping) times (takes considerable time)
• prefers I/O-bound processes
• 2.6.23 kernel: CFS (Completely Fair Scheduler)
– designed and implemented by Ingo Molnár, some ideas from Con Kolivas
– a new data structure for runnable processes: self-balancing red-black tree
– tries to be fair by calculating a „virtual” runtime for all processes
UNIX scheduling 18 / 23
BME Operating Systems 2015.
Linux scheduling information (practice)
• Acquiring information using the /proc filesystem
/proc/cpuinfo – available CPUs
/proc/stat – CPU and scheduler properties
/proc/loadavg – average system load (past 1, 5, 15 minutes)
/proc/sys/kernel/sched* – scheduler information
/proc/<PID>/status – process state, Cpus_allowed, ...
/proc/<PID>/sched – process scheduling data
• What is happening on my computer?
– Interesting story: Peeking into Linux kernel-land using /proc filesystem
Uses ps, strace, /proc/PID/... to debug a database problem
– Other interesting things to know:
What Your Computer Does While You Wait
UNIX scheduling 19 / 23
BME Operating Systems 2015.
Linux CFS
• It replaces the previous O(1) scheduler with an O(log n) algorithm
• It uses a self-balancing red-black tree instead of simple linked lists
– this is a binary tree with O(log n) complexity search
– lower values to the left, higher to the right
– insert and delete is more simple
• Calculating the priority is based on
– number of virtually running processes (nr_running)
– virtual run time (vruntime) in the rbtree index
• Basic operations
– enqueue_task: New task arrived (nr_running++)
– dequeue_task: Task no longer ready to run (nr_running--)
– pick_next_task: who is the next to run
UNIX scheduling 20 / 23
BME Operating Systems 2015.
UNIX CRON and AT: long term scheduling
• Executing tasks at given time(s)
– e.g. simple backup, maintenance tasks, etc.
• Usage
– AT: execute a task at a given time (at now + 1 day)
– CRON: periodically execute a task (see man crontab)
minute, hour, day of month, month, day of week
0 6 * 1-6,9-12 2 /local/bin/lets_play_soccer
Send an invitation every Tuesday morning at 6am (except during summer)
*/20 * * * * /local/bin/clear_old_temp_cache
Clear temporary and cache files in every 20 minutes
• This scheduling is not performed by the kernel
– It is part of the userspace program set.
– It starts certain tasks but does not govern them while they are running.
– After started these tasks belong to short term scheduling.
UNIX scheduling 21 / 23
BME Operating Systems 2015.
Summary
• Classical UNIX scheduling
– user mode: priority-based, time-sharing, preemptive
• the process with the highest priority runs first
• round-robin time-sharing scheduling between processes at the same prio.
level
• priority is calculated based on previous CPU usage and the nice value
– kernel mode: fixed priority, non-preemptive
• sleep priority assigned to resources will be given to awaking processes
– simple, avoids starvation, handles I/O jobs very well
– no SMP support, does not scale well, no support for spec. app. needs
• Modern UNIX schedulers
– modular
– several scheduling classes according to applications' needs
– supports multi-cpu, multi-core systems (including CPU affinity)
– better resource allocation (guaranteed CPU resources)
– schedule threads
UNIX scheduling 22 / 23
BME Operating Systems 2015.
Try this at home: Linux scheduler simulator
• Install and get familiar with LinSched!
http://www.cs.unc.edu/~jmc/linsched/
• Experiment with several task setups like
– a mixture of I/O and CPU bound processes
– processes with different nice values
– a typical web server scenario (web + db + programs)
• This guide will help you on the way
http://www.ibm.com/developerworks/library/l-linux-scheduler-simulator/
UNIX scheduling 23 / 23