The Hitchhiker’s Guide to Operating Systems
Yanyan Jiang, Nanjing University
https://www.usenix.org/conference/atc23/presentation/jiang-yanyan
This paper is included in the Proceedings of the
2023 USENIX Annual Technical Conference.
July 10–12, 2023 • Boston, MA, USA
978-1-939133-35-9
Open access to the Proceedings of the
2023 USENIX Annual Technical Conference
is sponsored by
Abstract

This paper presents a principled approach to operating system teaching that complements the existing practices. Our methodology takes state transition systems as first-class citizens in operating systems teaching and demonstrates how to effectively convey non-trivial research systems to junior OS learners within this framework. This paper also presents the design and implementation of a minimal operating system model with nine system calls covering process-based isolation, thread-based concurrency, and crash consistency, with a model checker and interactive state space explorer for exhaustively examining all possible system behaviors.

1 Introduction

“Everything should be made as simple as possible, but no simpler.” – Albert Einstein

The teaching foundation of operating system design and implementation has been well-established for decades. From Tanenbaum’s “Operating Systems: Design and Implementation (1987)” [45] to Arpaci-Dusseau’s “Operating Systems: Three Easy Pieces (2018)” [3], students approach operating systems by studying the layered design of abstractions over processors, memory, and storage systems.

In parallel, researchers have observed the emergence of fast, scalable, reliable, and secure systems over the past few decades. This progress has been driven by the development of innovative system technologies, such as hardware/software co-design [24, 43], cross-stack integration [20, 23], program analysis [11, 48], and formal methods [30, 31], among others.

This paper attempts to share these exciting ideas with junior operating system learners under a unified theme by “adding a layer of indirection.” Our key insight is to view all components of a computer system–including hardware, applications, and the operating systems that connect them–as state transition systems. By analyzing these components as informal yet mathematically rigorous objects, we aim to bridge the gap between theoretical concepts and practical system implementations.

This model-driven approach is grounded in several innovative philosophies on operating systems education, which are outlined below:

Everything is a state machine (Section 2). The key idea of this paper is to consider state transition systems as the foremost concept in teaching operating systems. The state-machine abstraction is fundamental: the state of a modern multi-processor system is essentially determined by register/memory bit values, driven by the non-deterministic selection of a single CPU executing a single-step instruction¹. The same abstraction is also applicable to any multi-threaded program.

¹ Under the assumption of race-freedom that instructions are serializable.

Consequently, we argue that it is beneficial to view the operating system as both a state machine and a manager of state machines. An operating system essentially leverages application-invisible data structures (e.g., a page table) to multiplex CPUs across processes and threads. This approach provides a rigorous explanation of process management APIs: fork/execve/exit functions simply clone, reset, and destroy live state machines. This abstraction also encourages in-depth discussions about fork [4] and the initial process state after execve.

By adopting this state-machine-centric perspective, we can explain research systems with a clear and rigorous foundation. For instance, every debugger [12], trace [8], and profiler [9] essentially “observes” runtime state snapshots, facilitating discussions on interactive query debuggers [39], deterministic full-system replay [16], time-travel failure reproduction [11], snapshot-based fault tolerance [38], and state space exploration [7] in an introductory-level operating system course.

Emulate state machines with executable models (Section 3). Since a state machine is a mathematically rigorous concept, we can always emulate the execution of real state machines. Although emulation has been widely adopted in operating system teaching², this paper takes one step further by emulating a “fully functional” operating system model with processes, threads, a debug console, and block storage. The system calls are listed in Table 1. The executable model approach has the following advantages:

² We loved the emulated process scheduler, virtual memory, and file systems in the “Three Easy Pieces” [3].

Table 1: System calls in the operating system model.

    System Call     Description
    fork()          Create a clone of the current thread's heap and context
    spawn(f, xs)    Spawn a heap-sharing thread executing f(xs)
    sched()         Switch to a non-deterministic thread
    choose(xs)      Return a non-deterministic choice among xs
    write(xs)       Write strings xs to standard output
    bread(k)        Return the value of block k
    bwrite(k, v)    Write block k with value v to a buffer
    sync()          Persist all outstanding block writes to storage
    crash()         Simulate a non-deterministic system crash

First, an executable model is a foundation for exploring operating system concepts. Synchronization primitives like Peterson’s algorithm [34], condition variables, and semaphores can be implemented over shared memory. The non-trivial state copy behavior of fork() [4] can be reproduced under this model. A file system checker can be run after a simulated crash().

Second, an executable model is a behavioral specification of real operating systems; it is the gold standard for application-observable behaviors. A model facilitates discussions on the abstractions–the concrete implementation of the fork() function may employ copy-on-write, but this should remain transparent to a process. Such a model also motivates the key idea behind formally verified systems like seL4 [25] and Hyperkernel [31].

Enumeration demystifies operating systems (Section 4). We design our emulator to handle all sources of non-determinism in a coherent way: every system call (not merely choose) returns a set of possible choices as callbacks. Consequently, we can exhaustively enumerate all possible system behaviors with little implementation effort.

Such a design finally leads to our MOSAIC (Modeled Operating System And Interactive Checker) operating system model and checker. MOSAIC adds lightweight formal methods [21, 47] to operating systems teaching. MOSAIC is capable of checking fork-based process parallelism, thread-based shared memory concurrency, and crash consistency [36]. The model checker’s output can be piped to an interactive state space explorer that can be embedded in a Jupyter notebook (Figure 3); thus, all non-trivial corner cases of the operating system model can be rigorously explained.

In summary, this paper makes the following contributions:

1. We propose a new “state-machine first” approach in the breakdown of operating system teaching: (1) model systems as state machines, (2) realize models by emulation, and (3) explore models by enumeration. This approach enabled us to introduce non-trivial research systems to junior operating system learners.

2. We design and implement MOSAIC, a minimal (500 lines of code, including comments) executable operating system model and checker, which strikes a balance between understandability and functionality. MOSAIC can rigorously explain non-trivial textbook cases concerning concurrency, virtualization, and persistence. MOSAIC is available via https://github.com/jiangyy/mosaic.

3. We incorporated these ideas in a first undergraduate operating system course (Section 5). This course became one of the most popular operating system courses in China and has attracted over 2,000,000 video views since its initial release in 2020.

2 State Machines: First-class Citizens of Operating Systems

Philosophy 1: Everything is a state machine.

This paper’s key contribution is the “state-machine first” approach to operating systems. By regarding both user-level applications and kernels as state machines (Section 2.1), it becomes obvious that operating systems are state machine managers (Section 2.2). This section also discusses modern computer systems and tools under the state machine perspective (Section 2.3).

2.1 Introducing State Machines in the Operating System Class

Program as a state machine. Every program run essentially boils down to the execution of binary instructions, whose behavior is rigorously defined by a state machine in which states are register/memory values and transitions are the execution of one instruction at the program counter. We implement this idea on Linux (Figure 1) to provide a definition of system calls: a system call is a state transition (e.g., via a trap instruction or any process-kernel communication mechanism [44]) for accessing the “exterior” of the state machine, e.g., writing data to the operating system or changing the state machine’s memory address space (via mmap or mprotect) and existence (via exit). Without system calls, the program (state machine) is a “closed world” that can only perform arithmetic and logical operations over memory and register values.
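To make the state-machine view concrete, the following minimal Python sketch models a program as a transition function over a state (pc, registers, memory). The two-opcode instruction set, the `step` and `run` functions, and the `console` list standing in for the operating system are assumptions made for illustration; they are neither the code of Figure 1 nor part of MOSAIC.

```python
# A toy state machine: the state is (pc, registers, memory) and one
# transition executes the instruction at pc. The instruction set is
# hypothetical and only loosely mirrors the "Hello World" example.

def step(state, program, console):
    """Execute one instruction; return the next state (None after exit)."""
    pc, regs, mem = state
    op, *args = program[pc]
    regs = dict(regs)                  # each state is a fresh snapshot
    if op == 'mov':                    # mov value -> register
        value, reg = args
        regs[reg] = value
    elif op == 'syscall':              # the only way out of the closed world
        if regs['rax'] == 'write':
            console.append(mem[regs['rsi']])
        elif regs['rax'] == 'exit':
            return None                # the state machine ceases to exist
    return (pc + 1, regs, mem)

def run(program, mem):
    """Return the execution trace (list of states) and the console output."""
    console = []
    state = (0, {'rax': 0, 'rsi': 0}, mem)
    trace = [state]
    while (state := step(state, program, console)) is not None:
        trace.append(state)
    return trace, console

program = [
    ('mov', 'write', 'rax'),
    ('mov', 'hello', 'rsi'),
    ('syscall',),
    ('mov', 'exit', 'rax'),
    ('syscall',),
]
trace, console = run(program, {'hello': 'Hello, OS World\n'})
```

Only the two `syscall` transitions touch anything outside the tuple `(pc, regs, mem)`; every other transition is a pure function of the current state.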
    #include "sys/syscall.h"
    mov $SYS_write, %rax    // write(
    mov $1, %rdi            //   fd=1,
    mov $hello, %rsi        //   buf=hello,
    mov $16, %rdx           //   count=16
    syscall                 // );
    // "ret" here yields SIGSEGV (nondeterministic)
    mov $SYS_exit, %rax     // exit(
    mov $1, %rdi            //   status=1
    syscall                 // );
    hello: ; .ascii "Hello, OS World\n"

[State machine diagram: starting from the initial state s0 (registers rax, rsi, ...; memory holding "Hello..."), each instruction such as "mov $hello, %rsi" moves the machine to its next register/memory state.]

Figure 1: A minimal “Hello World” program and its corresponding state machine.

[Diagram: the state machines of applications A and B, the operating system (model) that interleaves their transitions and handles their system calls, and the operating system (implementation) holding registers and memory, related by a “refinement mapping”.]

Figure 2: Operating system as a state machine manager. In this example, the operating system “executes” state transitions 0 → 1 and 0 → 3 → 5 for applications A and B, respectively.
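As a rough sketch of the picture in Figure 2, the Python code below treats each application as a transition function over integer states and lets a manager hold one snapshot per application, advancing whichever one the scheduler picks. The names `Manager`, `app_a`, and `app_b` and the concrete transition rules are illustrative assumptions, chosen only to be consistent with the caption's transitions 0 → 1 and 0 → 3 → 5.

```python
import random

def app_a(s):
    """Hypothetical application A: 0 -> 1 -> 2 -> 3, then halts."""
    return s + 1 if s < 3 else None

def app_b(s):
    """Hypothetical application B: 0 -> 3 -> 5 -> 7, then halts."""
    return {0: 3, 3: 5, 5: 7}.get(s)

class Manager:
    """Keeps a snapshot of every live state machine and steps one at a time."""

    def __init__(self, apps, seed):
        self.apps = apps                         # name -> transition function
        self.snapshots = {n: 0 for n in apps}    # name -> saved current state
        self.traces = {n: [0] for n in apps}     # states each app went through
        self.rng = random.Random(seed)

    def run(self):
        while self.snapshots:
            # Non-deterministic scheduling: pick any live state machine.
            name = self.rng.choice(sorted(self.snapshots))
            nxt = self.apps[name](self.snapshots[name])
            if nxt is None:
                del self.snapshots[name]         # finished: destroy the machine
            else:
                self.snapshots[name] = nxt       # save the new snapshot
                self.traces[name].append(nxt)
        return self.traces

# Try 20 different (seeded) schedules.
results = [Manager({'A': app_a, 'B': app_b}, seed).run() for seed in range(20)]
```

Whatever interleaving the scheduler chooses, each application goes through the same sequence of states, which is the illusion of continuous execution that the manager is expected to preserve.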
Bare-metal as a state machine. The bare-metal hardware shares a similar model with binaries: a CPU essentially operates as an infinite loop of instruction execution, which is also the case for a full-system emulator [5]. In contrast to user-level programs that can perform system calls, bare-metal kernels (including operating systems) access the “external world” via port or memory-mapped I/O and can be interrupted as if a trap instruction is non-deterministically injected.

Discussions. The advantage of introducing the state machine model early in an operating system course is that it fosters a habit of rigorous thinking–state transition systems are well-defined mathematical objects. Specifically, we motivate the students to think about the mathematically precise definition of the process initial state. We explain that any process’s initial state is well-defined by its binary executable and the Application Binary Interface. We also demonstrate how to inspect the initial state of the code in Figure 1 using stepi in GDB and memory mapping files in procfs. We further encourage students to consider more involved details of process states, such as the reasons behind the inability to perform a function return (using a ret instruction) and the necessity of wrapping C main functions with a __libc_start_main.

2.2 Operating System as a State Machine Manager

Computer system stack on state machines. Virtualization is the most fundamental mechanism of modern operating systems. Each application in an operating system can be regarded as a state machine whose initial memory layout and state transitions are specified by its binary executable. The operating system should give the application the illusion that the state transition system runs continuously following its specification, even though instruction execution could be non-deterministically interrupted at any time.

The state-machine approach provides a natural “implementation” of virtualization: make state snapshots of all processes available, and schedule a process by “moving” its state machine to the CPU. The trap/interrupt handler plays such a role: it stores the state machine’s registers in the operating system’s private memory space, ensuring the system-wide invariant that all application states can be reconstructed. Subsequently, the operating system can continue processing interrupts, executing system calls, and resuming any process based on a predefined scheduling policy.

These arguments conclude our claim that “everything is a state machine” and give us a new picture of understanding operating systems, as shown in Figure 2:

1. Application code is the developer’s specification of a state machine.

2. Operating system code is the designer’s specification of a state machine manager, a “superset” state machine container of all application state machines.

3. The operating system provides system calls as services and leverages application-invisible states (e.g., page table) to give processes the illusion of continuous state machine execution.

Process APIs on state machines. Following the idea that running applications are state machines, the need for process APIs becomes obvious: an operating system must provide
mechanisms for manipulating the set of live state machines. We found that the state machine language³ precisely and concisely explains UNIX process APIs:

1. fork() makes a “full copy” of the currently running state machine. Registers and the address space should appear to be deeply copied. References to operating system objects (e.g., file descriptors and signal handlers) should also be copied, but with caution [4].

2. posix_spawn(...) creates a new state machine (always reset to the initial state of an application) with controllable state sharing with the parent.

3. execve(path, argv, envp) resets a running state machine to the initial state specified by the binary file path, with arguments argv and environment list envp placed in memory following the Application Binary Interface.

4. exit(status) removes the currently running state machine from the operating system, reclaims used resources, and notifies any waiting process with the exit status.

³ For brevity, we removed less critical mechanisms including signals, process groups, and access control in this discussion. However, all of them can be explained under the state machine perspective whenever needed.

2.3 State Machines Meet Operating Systems

We discovered that the state-machine approach is not only beneficial for clarifying operating system concepts, but it can also serve as a fundamental basis for explaining non-trivial research systems to students:

Understanding system execution. Theoretically, executing a state transition system (be it an application or an operating system kernel) results in an execution trace composed of state snapshots connected by state transitions

$$tr = s_0 \to s_i \to \cdots \to s_{i+1} \to \cdots,$$

as if we single-stepped the program in a debugger and saved a core dump after each instruction. Such a trace contains all information needed for understanding this specific program execution.

However, such a massive trace (billions of instructions executed per second and megabytes of snapshots) is impractical and unnecessary to keep for any engineering practice. Debuggers provide the break/watchpoint mechanism to efficiently stop at program points of interest (sometimes with hardware assistance like debug registers) and let the developer examine the program states interactively.

Understanding a program’s execution usually only requires a tiny fraction of the information in the full trace tr. The trade-off space of “what parts of tr to observe” leads to many important mechanisms incorporated in the engineering of modern operating systems, which are explained below.

Playing with snapshots. fork provides a verbatim copy of a program’s state $s_i$ at reasonably low cost. Holding such program state snapshots yields interesting applications. One is the Zygote process of Android [14], which copies an initialized Java virtual machine state to avoid repetitive and time-consuming bootstrap-time class loading. Another example is that one can take periodic clean-state snapshots (e.g., in the idle state of an event loop) and fall back to a snapshot when an unexpected error occurs [38].

Time-travel debugging. Developers use a debugger to interactively examine tr, which can be enhanced by a query language [39]. Debuggers can also enable time-travel debugging by recording the differences between consecutive states, essentially creating an undo log. Time-travel debugging is already implemented in GDB [12]. Observing that non-deterministic transitions are only a tiny fraction of tr, one can also keep track of their locations and choices to enable a deterministic replay [16, 33].

Trace and profiler. One can insert probes exclusively at state transitions relevant to the application logic (e.g., function calls and returns) and gather diagnostic data (e.g., call stack traces). Trace utilities such as ftrace and Kprobe in Linux [8] are widely used for debugging production failures.

The overhead associated with tracing can be further reduced through sampling, which involves periodically activating probes within a specified time interval. Such profilers generate summaries of the sampled program states and are extremely useful in diagnosing performance issues.

Runtime checkers. Runtime checkers can also be considered as functions that accept tr as input and check it against specific bug patterns. A broad spectrum of checkers operate in this manner: AddressSanitizer [40] asserts the absence of out-of-bounds and use-after-free memory accesses. ThreadSanitizer [41] confirms that there are no conflicting shared memory accesses unordered by happens-before relations. Lockdep [29] checks that the observed lock acquisition orderings do not form a cycle.

Symbolic execution and program verification. It is obvious that system calls can exhibit non-deterministic behavior. However, it is less emphasized that such non-determinism can be rigorously quantified; for instance, a read system call returns only a finite number of possibilities. Thus, we can enumerate all possible state transitions to capture all potential program behaviors; however, this approach is only feasible in a theoretical context. Even reading a 32-bit integer results in $2^{32}$ distinct states.

Using a compact representation of a vast number of states (e.g., using a symbolic value x to represent an “arbitrary”
value of variable x) and imposing constraints on symbolic values across branches results in a symbolic program verifier [7].

3 An Executable Operating System Model

Philosophy 2: Emulate state machines with executable models.

As state machines are mathematically rigorous constructs, their usefulness is not limited to merely clarifying operating system concepts. It is also feasible to develop executable state machines that accurately emulate the behavior of processes and operating systems.

Specifically, we leverage modern programming language mechanisms like coroutines for lightweight in-process context switches to implement a lightweight executable operating system model with emulated threads, processes, and devices (Section 3.1). This section also discusses how instructors could use a model to simplify non-trivial textbook cases (Section 3.2) and use models as behavioral specifications of real systems (Section 3.3).

3.1 Emulating an Operating System

State machines (processes) and system calls. We implement our operating system model in Python, a popular programming language among students. A process is emulated by a generator (stackless coroutine) object where process memory is its local variables. System calls (Table 1) are emulated by yield, in which the generator saves its local state (local variables and program counter) in a closure and transfers control to its caller⁴:

    def main(msg):  # an emulated application process
        i = 0
        while (i := i + 1):
            yield 'SYS_write', msg, i  # write(msg, i)
            yield 'SYS_sched',         # sched()

⁴ In the MOSAIC implementation, the process code is stored in a standalone Python file. Applications invoke system calls in Table 1 as ordinary function calls like x = sys_choose(['Head', 'Tail']), and MOSAIC rewrites the AST by replacing all system call nodes with yield.

Our operating system model, as a state machine manager, maintains a set of processes (continuable generators) and is an infinite loop of yield trap handler, just like any real operating system:

    import random

    class OperatingSystem:
        def __init__(self, procs):  # OS initialization
            self._procs = procs
            self._current = procs[0]

        def run(self):  # the OS main loop
            while True:
                syscall, *args = self._current.__next__()
                match syscall:
                    case 'SYS_write':  # write to debug console
                        print(*args)
                    case 'SYS_sched':  # switch to a random process
                        self._current = random.choice(self._procs)

    OperatingSystem([main('ping'), main('pong')]).run()

Process APIs. Because deep-copying a generator object is not allowed in Python, we implement fork() by creating a new OperatingSystem object and replaying all executed system calls to obtain a deep copy of the process. This requires OperatingSystem to keep track of the non-deterministic choices of all previously executed system calls. Processes have increasing IDs starting from 1,000, and the child process ID is returned on fork(). There is no exit() because returned generators are never scheduled and are considered exited. There is also no execve() because its functionality largely overlaps with spawn() and fork().

Threads and shared memory. The shared memory among threads is emulated by the global heap variable, whose value is updated before switching to a process/thread by

    globals()['heap'] = self._current.heap

Readers may notice that this heap models a “page table base register” which is changed on context switches. spawn(f, *xs) creates a new generator calling f with arguments xs and a shared heap. The replay-based fork() obtains a deep copy of the heap in the freshly allocated OperatingSystem object.

Devices. Writing to the debug console appends the message to a buffer. Reading from the debug console can be implemented by choose() from possible inputs. The emulated block device is a key-value mapping, which maps each block’s ID (any string like inode or even emojis) to its contents (any serializable data structure including strings and lists). All block device writes are first appended to a queue to simulate real disks with a volatile buffer [36]. Write-back happens only when sync() is called.

3.2 Modeling Operating System Concepts

Such a surprisingly simple model can simplify textbook cases that require non-trivial interactions across system layers and are thus challenging to debug or even reproduce–we can selectively model the essential elements of the system to minimize the complexity:

A fork() in the road [4]. Fork is no longer simple, considering it conducts a full state copy of libraries and references (handles) to operating system objects. Below is such a non-trivial case related to the buffer mechanism in the standard C library:

    for (int i = 0; i < 2; i++) {
        int pid = fork();
        printf("%d\n", pid);
    }

    (unix) $ ./a.out          (unix) $ ./a.out | wc -l
    1000                      8 # ???
    1001
    0
    0
    1002
    0

Debugging the internal implementation of libc (even with a much simpler implementation like musl [1]) to understand this case requires substantial engineering efforts. Alternatively, we first model this case by removing all low-level details of process creation and focusing on the behavior of a fork-cloned buffer:

    def main():
        heap.buf = ''
        for _ in range(2):
            pid = sys_fork()         # heap.buf is deeply copied
            sys_sched()              # non-deterministic context switch
            heap.buf += f'{pid}\n'   # or sys_write()
        sys_write(heap.buf)          # flush buffer at exit

The executable model always gives a process schedule to explain its outputs⁵. After fully understanding the model, students can examine the system call traces and debug the libc source code with less pain.

⁵ The model checker (Section 4.1) can be used to exhaustively examine all process schedules and understand the possible outputs.

Understanding synchronization. Synchronization primitives (mutexes, condition variables, semaphores, etc.) are usually informally introduced in a textbook or an operating system course. Implementing them upon our operating system model gives them a rigorous semantics specification⁶. Below is a model of the buggy producer-consumer implementation from Chapter 30 of “The Three Easy Pieces” [3], in which a consumer may erroneously wake up another consumer (instead of a producer), resulting in a deadlock:

     1  def Tworker(name, delta):
     2      for _ in range(N):
     3          while heap.mutex == ' ':          # mutex_lock()
     4              sys_sched()                   # |- spin wait
     5          heap.mutex = ' '                  # |
     6
     7          while not (0 <= heap.count + delta <= BUFSIZE):
     8              sys_sched()
     9              heap.mutex = ' '              # cond_wait()
    10              heap.cond.append(name)        # |
    11              while name in heap.cond:      # |- spin wait
    12                  sys_sched()               # |
    13              while heap.mutex == ' ':      # |- reacquire lock
    14                  sys_sched()               # |
    15              heap.mutex = ' '              # |
    16
    17          if heap.cond:                     # cond_signal()
    18              t = sys_choose(heap.cond)     # |
    19              heap.cond.remove(t)           # |- wake up anyone
    20          sys_sched()
    21
    22          heap.count += delta               # produce or consume
    23
    24          heap.mutex = ' '                  # mutex_unlock()
    25          sys_sched()
    26
    27  def main():
    28      heap.mutex = ' '                      # or
    29      heap.count = 0                        # filled buffer
    30      heap.cond = []                        # condition variable's wait list
    31      sys_spawn(Tworker, 'Tp', 1)           # delta=1, producer
    32      sys_spawn(Tworker, 'Tc1', -1)         # delta=-1, consumer
    33      sys_spawn(Tworker, 'Tc2', -1)         # delta=-1, consumer

⁶ Our model assumes that the execution of statements between consecutive sched() appears to be atomic and uninterruptible.

At first glance, this model seems to diverge from the textbook example, as all synchronization primitives are denoted by spin-wait constructs (Lines 3-4, 11-12, and 13-14). However, this is intentional: spin wait reflects the specification that the thread could not make any progress unless the synchronization condition is satisfied (e.g., a mutex is in the unlocked state or a condition variable has been signaled). Blocking wait is merely one possible implementation. Such a model also captures a detail often overlooked by students: a condition variable contains an implicit re-acquisition of its associated mutex (Lines 13–15) after being signaled. An executable model facilitates the development of rigorous concepts in operating systems.

The incorrect use of the condition variable is also non-trivial: manifesting the bug requires at least three threads (a producer and two consumers) and N ≥ 2. Such a fact can be easily verified by the model checker (Section 4.1). When this model runs under a uniform-random scheduler, there is only approximately an 8% chance of triggering the deadlock in which all three threads Tp, Tc1, and Tc2 are spinning on Line 11.

Finally, we found that emojis in the code can improve the readability of program states: an emoji stored as the mutex value intuitively indicates that a thread holds this mutex. Other cases include using an emoji in Peterson’s algorithm [34] (instead of flag[2] and integer values 0 or 1) to indicate a thread “raising hand” to enter the critical section, and emojis to denote success or failure.

File system consistency and journaling. The emulated block device enabled us to implement ideas in file systems without tedious low-level device details. Recall that the block device is conceptually a dict. Thus, we can assign blocks intuitive names like 'bitmap1' to indicate a bitmap block in the persistent storage. We can also use this dict as a file system by mapping file names (e.g., '/tmp/a.txt') to their metadata and contents (e.g., ('symlink', '/etc/passwd')) when the actual storage layout is not relevant. Below is a simplified model of xv6 [10] log commit:

    def main():
        # 1. log the write to block #B
        head = sys_bread(0)  # blocks #1, #2, ... are the log
        free = max(head.values(), default=0) + 1  # allocate log
        sys_bwrite(free, f'contents for #{B}')
        sys_sync()

        # 2. write updated log head
        head = head | {B: free}
        sys_bwrite(0, head)
        sys_sync()

        # 3. install transactions
        for k, v in head.items():
            content = sys_bread(v)
            sys_bwrite(k, content)
        sys_sync()

        # 4. clear log head
        sys_bwrite(0, {})
        sys_sync()

With the model checking feature (Section 4.1), all possible crash behaviors and potential file system inconsistencies can be exhaustively explored.

3.3 Application: Specification of Systems

An operating system model can be useful beyond explaining textbook cases. A model also provides a behavioral specification for real operating systems, like a high-level reference implementation. For example, it could be proved that the mutex model in Section 3.2 has the following two properties:

1. Safety: as long as a thread holds a mutex, any other thread’s lock acquisition never returns.

2. Liveness: a thread eventually acquires a mutex if threads with acquired locks eventually release them under a fair (random) scheduler.

Because everything is a state machine (and thus a well-defined mathematical object), it could be theoretically possible to prove that a real system’s implementation is consistent with a model by constructing a refinement mapping⁷. This is exactly the idea behind formally verified systems like seL4 [25] (with a Haskell executable model) and Hyperkernel [31] (with a Python executable model), which all modeled operating systems as state machines. Even though the technical details of the research work may be too involved for first-time operating system learners, state machines still facilitate grasping the fundamental concepts underlying them–one could always perform a “brute-force proof” by enumerating all reachable vertices on the state transition graph for finite systems.

⁷ One fundamental result of program verification is that refinement mappings between high-level and low-level specifications always exist [2].

Models are also useful as a behavioral reference for real system implementations. A more practical “refinement mapping” is to feed the same workload to both a model and a real system. Cross-checking the model and system traces validates the implementation’s correctness. For example, executing the same fork() sequence (assuming that all forks succeed) should yield identical process trees for both the model and a student’s operating system kernel. Such an approach is also known as the lightweight formal method [22] and has been widely adopted in validating practical systems [6].

4 One Model Checker to Rule Them All

Philosophy 3: Enumeration demystifies operating systems.

The executable model’s behavior can be exhaustively explored by enumerating all possible non-deterministic choices. This section presents such a model checker (Section 4.1) and its application to operating system teaching (Section 4.2), followed by short quantitative experiments in Section 4.3.

4.1 MOSAIC Model Checker Design and Implementation

Instead of executing a system call immediately, all MOSAIC system calls return a dict mapping possible choices (which can be regarded as labeled transitions in the state machine) to lambda callbacks for actually performing the system call, even if there is only one unique choice:

    def sys_sched(self):
        return {  # all possible choices
            f't{i+1}': (lambda i=i: self._switch_to(i))  # callback
                for i, th in enumerate(self._threads)
                    if is_runnable(th.context)
        }

    def sys_fork(self, *args):
        return {  # only one choice
            'fork': (lambda: self._do_fork())
        }

Such a design yields a simple replay-based state space explorer as shown in Algorithm 1. The algorithm is a straightforward breadth-first search that memorizes traversed states in S. A trace is a chronological list of each system call’s selected choice. Replaying a trace will always reach the next system

Algorithm 1: The MOSAIC model checker

    Q ← {[]};                        // the queue of traces pending checking
    S ← ∅;                           // the set of checked states
    while ¬Q.empty() do
        tr ← Q.pop();
        ⟨s, choices⟩ ← replay(tr);
        if s ∉ S then
            S ← S ∪ {s};             // add the unexplored state to S
            for c ∈ choices do
                Q.push(tr :: c);     // extend tr with c and append to Q
call’s non-deterministic choices (Line 5 of Algorithm 1), or there is no choice (choices = ∅) when all processes and threads are terminated. For finite-state models, the algorithm always terminates and produces a state transition graph whose vertices are traces in S and edges are labeled with c in Line 9 of Algorithm 1. MOSAIC serializes the state transition graph as a JSON file. Both states (generator states, heaps, debug console output, and storage state) and transitions (labeled edges) are serialized. We encourage the students to follow the UNIX philosophy and pipe the text output to different backends:

1. Simply grep stdout | sort | uniq -c for a quick (and dirty, perhaps unsound) check for all possible debug console outputs.
2. Any JSON query or viewer like jq [15] to extract fields of interest (e.g., variable values or block device contents).
3. Our interactive state explorer (Figure 3) in which one can selectively expand nodes in the state transition graph. This interactive explorer is particularly handy for class demonstration.

Figure 3: The interactive thread interleaving space explorer on MOSAIC’s results of checking a spin lock implementation. Process and thread states are plotted as vertices. Thread program counters are highlighted on the source code like a debugger. Clicking a vertex expands its children.

4.2 Model Checking for Fun and Profits

The ability to exhaustively explore the state space makes a model checker suitable for rigorously explaining non-trivial cases in operating systems. A few such cases are shown below.

Processes and TOCTTOU attack. Both UNIX and our operating system model lack a mechanism (e.g., transactions [13, 37]) to enforce atomicity across system calls and may be subject to time-of-check to time-of-use attacks. We demonstrate such a case of process-level race from [46]:

     1 def main():
     2     sys_bwrite('/etc/passwd', ('plain', 'secret...'))
     3     sys_bwrite('file', ('plain', 'data...'))
     4     pid = sys_fork()
     5     sys_sched()
     6     if pid == 0:  # attacker: symlink file -> /etc/passwd
     7         sys_bwrite('file', ('symlink', '/etc/passwd'))
     8     else:  # sendmail (root): write to plain file
     9         filetype, contents = sys_bread('file')  # for check
    10         if filetype == 'plain':
    11             sys_sched()  # TOCTTOU interval
    12             filetype, contents = sys_bread('file')  # for use
    13             match filetype:
    14                 case 'symlink': filename = contents
    15                 case 'plain': filename = 'file'
    16             sys_bwrite(filename, 'mail')
    17             sys_write(f'{filename} written')
    18         else:
    19             sys_write('rejected')

MOSAIC reveals that “/etc/passwd written” is possible and gives such a process schedule. The exhaustive search can also reveal that the two sys_sched calls in Lines 5 and 11 are essential to produce such a result.

Hardness of shared-memory concurrency. Understanding thread interleaving can be difficult. Restoring the global ordering of shared memory accesses on thread-local read/write sequences is NP-Complete [17]. One interesting case is the possible outcomes of concurrent tot++, assuming that loads and stores are atomic (i.e., a sequentially consistent memory model) and the compiler does not merge multiple tot++:

    def Tsum():
        for _ in range(N):
            tmp = heap.tot  # load(tot)
            sys_sched()
            heap.tot = tmp + 1  # store(tot)
            sys_sched()

    def main():
        heap.tot = 0
        for _ in range(T):
            sys_spawn(Tsum)

MOSAIC reveals that tot can be 2 regardless of N and T (for N, T ≥ 2) and gives such a thread schedule in which one thread “holds” a value of 2 in the last iteration of the loop and does not write it back until all other threads are terminated. We used the N = 3, T = 2 case as an exam problem, and approximately half of the students got it wrong.

Persistence and crash consistency. Upon crash(), MOSAIC automatically explores all 2^n possible crash disks, assuming that any of the n buffered block I/O requests could be lost [36]. By modeling a file operation that involves multiple block updates (inode, bitmap, and data), an instructor can clearly and rigorously illustrate potential inconsistencies in a file system upon a system crash. Below is a textbook case in Chapter 42 of “The Three Easy Pieces” [3]:

     1 def main():
     2     # initially, file has a single block #1
     3     sys_bwrite('file.inode', 'i [#1]')
     4     sys_bwrite('used', '#1')
     5     sys_bwrite('#1', '#1 (old)')
     6     sys_sync()
     7
     8     # append a block #2 to the file
     9     sys_bwrite('file.inode', 'i [#1 #2]')  # inode
    10     sys_bwrite('used', '#1 #2')  # bitmap
    11     sys_bwrite('#1', '#1 (new)')  # data block 1
    12     sys_bwrite('#2', '#2 (new)')  # data block 2
    13     sys_crash()  # system crash
    14
    15     # display file system state at crash recovery
    16     inode = sys_bread('file.inode')
    17     used = sys_bread('used')
    18     sys_write(f'{inode:10}; used: {used:5} | ')
    19     for i in [1, 2]:
    20         if f'#{i}' in inode:
    21             b = sys_bread(f'#{i}')
    22             sys_write(f'{b} ')

MOSAIC’s self-explanatory outputs verified that the one-page informal arguments in the textbook are indeed exhaustive and correctly covered all possible cases. MOSAIC can also check the journal implementation in Section 3.2 by adding crash() to the code and reveal that removing the sync() in Line 6 may result in file system inconsistency.

4.3 Experiments

We evaluate the performance of MOSAIC by checking the six representative models in Sections 3.2 and 4.2. Both experimental subjects and results are listed in Table 2. As expected, MOSAIC cannot address the state space explosion problem and has no comparable performance with a state-of-the-art software model checker with dedicated optimizations. Furthermore, programs that fork extensively are significantly slower (benchmarks fork-buf and tocttou) because our fork() is implemented by a full-system replay. Nevertheless, checking thousands of nodes per minute could be considered sufficiently useful for instructional purposes, and our design choice is to make a functional model checker minimal and elegant.

    Subject                 Parameters                # States   Memory     Time
    fork-buf (7 LOC)        n = 1 (p = 2)                   15   17.0 MB    < 0.1s
                            n = 2 (p = 4)                  557   19.8 MB    3.3s (171 st/s)
                            n = 3 (p = 8)              Timeout (> 60s)
    cond-var (34 LOC)       n = 1; t_p = 1; t_c = 1         33   17.3 MB    < 0.1s
                            n = 1; t_p = 1; t_c = 2        306   19.7 MB    0.1s (2 912 st/s)
                            n = 2; t_p = 1; t_c = 2      2 799   26.0 MB    0.8s (3 343 st/s)
                            n = 2; t_p = 2; t_c = 1      4 666   30.5 MB    1.4s (3 247 st/s)
    xv6-log (27 LOC)        n = 2                           55   17.3 MB    < 0.1s
                            n = 4                          265   19.2 MB    < 0.1s
                            n = 8                        6 157   40.2 MB    1.3s (4 810 st/s)
                            n = 10                      28 687   93.9 MB    20.7s (1 385 st/s)
    tocttou (24 LOC)        p = 2                           33   17.4 MB    < 0.1s
                            p = 3                           97   17.8 MB    0.2s (413 st/s)
                            p = 4                          367   19.4 MB    2.7s (135 st/s)
                            p = 5                        1 402   23.5 MB    30.2s (46 st/s)
    parallel-inc (11 LOC)   n = 1; t_s = 2                  40   17.2 MB    < 0.1s
                            n = 2; t_s = 2                 164   18.0 MB    < 0.1s
                            n = 2; t_s = 3               6 635   37.4 MB    1.4s (4 580 st/s)
                            n = 3; t_s = 3              52 685   139.5 MB   14.1s (3 725 st/s)
    fs-crash (25 LOC)       n = 2                           90   17.5 MB    < 0.1s
                            n = 4                          332   19.4 MB    < 0.1s
                            n = 8                        5 136   36.2 MB    2.6s (1 944 st/s)
                            n = 10                     Timeout (> 60s)

Table 2: Evaluation subjects and results. p, t, n denote the number of processes, threads, and loop iterations, respectively. All experiments were performed on an i7-6700 Linux PC with 4 GB RAM running Python 3.11. Each configuration is repeated 10 times, and the average number is reported.

5 A New Operating System Course

We design a new operating system course from scratch based on “The Three Easy Pieces” [3] and our teaching philosophies: everything is a state machine, emulate state machines with executable models, and enumeration demystifies operating systems. The course syllabus is shown in Figure 4. This section presents the impacts of the state machine perspective (Section 5.1) and model checker (Section 5.2) in the course design, followed by discussions in Section 5.3.

[Figure 4 diagram: the course starts from “Introduction” and “State machine model of programs” and then branches into three tracks: 1 - Concurrency, 2 - Virtualization, and 3 - Persistence.]

Figure 4: Major modules and their dependencies in our operating system course. The concept of state-machine is a good foundation for thread-based concurrency, and thus we introduce concurrency first in the course.

5.1 State Machines and Operating Systems

In addition to introducing the key concepts in operating systems using state machines, the state machine perspective also
brings the following advantages in establishing a high-level understanding of important concepts regarding computer systems in a natural and coherent way.

Don’t panic in hacking real systems! All students had a hard time debugging real (even minimal) systems, including but not limited to operating system kernels, even if we provided skeletal code, tool chain, and state visualization scripts.

The state-machine perspective provides a natural reflex on how to deal with bugs or unexpected behavior in real systems: All bugs in computer systems are essentially some anomaly in the state-machine’s execution trace. Given an unlimited amount of time, one just seeks the first abnormal state, and the root cause is right there. We teach students this (impractical) debugging principle and motivate students to consider clever tricks to make this procedure fast, robust, and easy.

For example, the essence of printf-debugging is to provide a high-level digest of the state-machine trace, which helps in narrowing down the scope of the initial anomalous state. One can also employ defensive programming by inserting assertions on the validity of states. These lessons are rarely taught in an operating system class but are essential for surviving hacking or implementing a large-scale system.

One classroom story is using a profiler (i.e., “frequent” state sampler) in diagnosing an unexpected 100% CPU usage on an idle workload in a production system on a specific machine. The perf tool [9] attributed the hot spot to an xhci-related function, which led us to a short-circuited USB port.

Concurrency meets state machines. The model checking community has long represented concurrent programs as state transition systems, and model checking is widely recognized as a computationally intensive technique that frequently encounters state explosion issues. Nevertheless, employing exhaustive enumeration is not the sole efficient approach to harness the capabilities of state machines.

The concept of data race, an important topic in operating system courses, refers to the simultaneous access of a shared memory location by two threads or processors (with at least one performing a write). Data races are considered harmful in systems programming.

When one checks a state machine trace against data races, it is essential to examine all types of state transitions that could lead to memory access [41]. However, two sources of memory access may be overlooked by students: (1) fetching an instruction from the program counter and (2) stack operations, including function and interrupt returns.

We let the students experience a subtle data race in an operating system kernel lab that requires students to migrate a process from one processor to another. The destination processor must not immediately schedule the process; otherwise, there would be a data race on the kernel’s interrupt stack.

Demystifying compilers. It is not obvious to students that C programs can also be represented by state transition systems. We use the example in Figure 5 (a non-recursive “Tower of Hanoi” implementation) to illustrate that the “runtime state” of C programs consists of static variables, heap memory, and a list of stack frames. State transitions are small-step expression evaluations at the top-most stack frame’s program counter.

     1 void hanoi(int n, int from, int to, int via) {
     2   if (n == 1) {
     3     printf("%d -> %d\n", from, to);
     4   } else {
     5     hanoi(n - 1, from, via, to);
     6     hanoi(1, from, to, via);
     7     hanoi(n - 1, via, to, from);
     8   }
     9 }

     1 typedef struct { int pc, n, from, to, via; } Frame;
     2 #define call(...) ({*(++top) = (Frame) {0, __VA_ARGS__};})
     3 #define ret() ({top--;})
     4 #define jmp(loc) ({f->pc = (loc) - 1;})
     5
     6 void hanoi_nr(int n, int from, int to, int via) {
     7   Frame stk[64], *top = stk - 1, *f;
     8   call(n, from, to, via);
     9   while ((f = top) >= stk) {
    10     switch (f->pc) {
    11       case 0: if (f->n == 1) {
    12                 printf("%d -> %d\n", f->from, f->to); jmp(4);
    13               } break;
    14       case 1: call(f->n - 1, f->from, f->via, f->to); break;
    15       case 2: call(1, f->from, f->to, f->via); break;
    16       case 3: call(f->n - 1, f->via, f->to, f->from); break;
    17       case 4: ret(); break;
    18       default: assert(0);
    19     }
    20     f->pc++;
    21   }
    22 }

[Diagram: two program states connected by the transition hanoi(2, 1, 2, 3) (Line 5). Each state shows the stack frames (function, PC, and the values of n, f, t, v) above the global/allocated memory. After the transition, a new frame hanoi(PC=1) with n=2, f=1, t=2, v=3 is on top of the stack.]

Figure 5: State machine perspective of C programs. hanoi_nr is also an “executable model” emulating recursions for rigorously understanding the semantics of C programs.

Compilers should always generate equivalent assembly (low-level state machine specification) from source code (high-level state machine specification). Therefore, a fundamental question is what kinds of translation are allowed for an optimizing compiler. Notably, such deliberations are frequently neglected throughout the undergraduate curriculum, including in courses specifically addressing compilers.
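The state-machine reading of Figure 5 can also be tried in Python. The following is a minimal sketch under our own assumptions, not code from the course: each frame is a list [pc, n, src, dst, via], the names hanoi, hanoi_nr, and moves are illustrative, and moves are collected in a list instead of being printed so that the two versions can be compared.

```python
def hanoi(n, src, dst, via, moves):
    """Recursive version: the Python interpreter manages the stack frames."""
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, via, dst, moves)
        hanoi(1, src, dst, via, moves)
        hanoi(n - 1, via, dst, src, moves)

def hanoi_nr(n, src, dst, via):
    """Non-recursive version: the stack frames are explicit program state."""
    moves = []
    stack = [[0, n, src, dst, via]]       # initial state: a single frame at pc 0
    while stack:
        frame = stack[-1]                 # only the top-most frame takes a step
        pc, k, a, b, c = frame
        frame[0] = pc + 1                 # default transition: advance the pc
        if pc == 0:
            if k == 1:
                moves.append((a, b))
                frame[0] = 4              # jump straight to the return
        elif pc == 1:
            stack.append([0, k - 1, a, c, b])   # call hanoi(k - 1, a, c, b)
        elif pc == 2:
            stack.append([0, 1, a, b, c])       # call hanoi(1, a, b, c)
        elif pc == 3:
            stack.append([0, k - 1, c, b, a])   # call hanoi(k - 1, c, b, a)
        elif pc == 4:
            stack.pop()                   # return: discard the frame
        else:
            raise AssertionError('invalid pc')
    return moves

recursive_moves = []
hanoi(3, 1, 3, 2, recursive_moves)
explicit_moves = hanoi_nr(3, 1, 3, 2)
```

Both versions produce the same sequence of moves, which is the sense in which the explicit-stack state machine emulates the recursion.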
With the conceptual model of state machines, the correctness of translation is essentially the equivalence between two state machines. This naturally leads to the definition of external observable equivalence between systems: given that system calls are the only way of influencing the remaining parts of the system, two programs are considered equivalent if they generate identical system call traces for the same inputs, and one program terminates if and only if the other program terminates. This principle serves as the core concept behind a verified compiler such as CompCert [27].

5.2 Modeling and Model Checking in Action

Models and emulation are everywhere. Models in an operating system course may not be limited to Python-implemented system calls. We advocate using minimal but functionally “working” models, even if they are implemented using a lower-level programming language.

One particular example is that we long had difficulties in explaining the ELF dynamic linker and loader to the students due to the unnecessarily excessive complexity of the ELF format. We identified that the problem stems from the fact that the ELF design is intended to be read exclusively by machines, rather than humans.

Therefore, we design a simplified binary format implemented using the GNU C preprocessor and assembly. The binary file contains merely a magic number, a NULL-terminated symbol table whose entries are macros like IMPORT(printf) or EXPORT(main), followed by assembly instructions. By reusing GCC and binary utilities, we implement the full toolchain of linker, loader, and an objdump equivalent in 200 lines of C code. Student and social media feedback indicate that such a model significantly flattens the learning curve of dynamic loading.

Formal method meets operating systems. We motivate the need for a model checker by making substantial (boring) efforts to draw a state transition graph to prove the safety and liveness of Peterson’s mutex algorithm [34]. It is then obvious that a program like MOSAIC can replace human labor by emulation. We received positive feedback from students on their first contact with the model-checking approach, particularly the interactive visualizer (Figure 3), which is embedded in a Jupyter notebook for in-class demonstrations. The machine-generated state transition graph is also generally more reliable than the informal arguments in popular textbooks [42].

Another advantage of a model checker over existing teaching methods is the immediate feedback when answering “what if” questions related to changes in assumptions, implementations, and other factors. We encourage students to extensively experiment with the model, e.g., to see if the system breaks with added sched() or removed sync().

The gap between models and real systems. We also teach students that models do not fully reflect the real world. Models are good at making all assumptions explicit, e.g., MOSAIC assumes the atomicity of statements between consecutive sched() calls and a sequentially consistent memory model. The discrepancies between a model and an actual system are explained by careful examination of these assumptions. Peterson’s algorithm is correct only under proper assumptions: specifically, a sequentially consistent memory model, as if context switches happened only on instruction boundaries. For Peterson’s algorithm, we provide an equivalent C implementation to illustrate how compiler and memory barriers may impact the program’s behavior.

5.3 Student Acceptance and Discussions

Student Feedback. After publicizing the course lecture notes, demonstrations, and videos on the Internet, we received a large amount of positive feedback. Comments included statements like, “It is remarkable that such a comprehensive explanation of operating system principles can be provided in an undergraduate-level course.” Students conveyed that they “gained valuable insights on overcoming the panic in hacking large-scale systems in this course.”

There are also controversial arguments on the appropriateness of incorporating state machines as a key concept in operating system courses. However, we have also received feedback from industry professionals supporting our approach by indicating that state machines are one of the most fundamental abstractions for controlling complexity in building production systems.

Since the first public release of the course in 2020, the video has received more than 2,000,000 plays on the Internet. Moreover, this course has been conferred the “Test-of-Time Teaching Award of the Department,” as chosen by alumni who evaluated all courses in their curriculum.

Usefulness of models. Modeling is a versatile technique for establishing concepts and understanding. Modeling can also control the complexity by selectively hiding low-level implementation details. Another major advantage of executable models is making operating system concepts rigorous. Concepts (e.g., mutex, condition variable, and crash consistency) can be defined by “all possible behaviors on a model.”

One may argue that any model behavior can be manifested by real workloads, and thus students should have first-hand experiences on real systems. We consider understanding the model (and thus the concepts) a critical step before students can hack real systems. Otherwise, the excessive and irrelevant implementation details can be a significant source of distraction.

Limitations. The “state-machine perspective” motivates the key insights and high-level designs of operating systems well. However, such over-simplification may lead students to overlook the challenges of implementing real systems. Therefore, we still consider the “hands-on approach,” [26] in which students implement their own operating system kernel on
emulated bare metal, an indispensable part of an operating system course (footnote 8).

Footnote 8: Students all had a hard time debugging a bare-metal kernel. Such experiences further motivate the need for debugging aids and dynamic analysis in Section 2.3.

MOSAIC only models a small fraction of an operating system. More are missing, and one has to model them explicitly: file descriptors, signals, futexes, RAID, network stack, etc. Theoretically, it could be possible to model them in MOSAIC; however, we preferred simplicity in our model design and left these mechanisms to user-level applications, as we did in Sections 3.2 and 4.2.

The implementation of MOSAIC also has limitations: main must be a Python generator (rather than a stackful coroutine). Thus, system calls are not allowed in functions being called by main. MOSAIC also assumes that the program being checked is deterministic. Non-determinism beyond system calls (e.g., random numbers) results in unsound model-checking results. Considering that MOSAIC is a pedagogical model checker and that an instructor can easily bypass these limitations, they are not a significant obstacle to adopting MOSAIC in practice.

6 Related Work

Emerging from the logic and programming language community, formal methods (mainly model checking and formal verification) have been widely adopted in the validation and verification of computer systems [6, 25, 27, 31, 47]. The key idea of formal methods is to treat specifications, models, and implementations as unambiguously-defined mathematical objects and prove properties by exhaustive search or axiomatic reasoning.

Despite a growing trend of formal method applications for computer systems, the teaching practice of “classical” operating systems remains classical on the layered abstractions of computer systems [3, 42] and the “hands-on” approach [26] in which students hack teaching operating system kernels [10, 19, 35] over emulators like QEMU [5] to fully understand all low-level implementation techniques.

There are attempts to incorporate model checking in teaching computer systems. Hamberg and Vaandrager [18] modeled textbook concurrency control algorithms using the Uppaal modeling language and checker. Michael et al. [28] target real Java programs on a message-passing model and check against all possibilities of message reorderings, drops, and duplications. Both concurrent programs and distributed systems are classical application scenarios of a model checker. To the best of our knowledge, we are the first to apply formal methods throughout an entire operating system course.

MOSAIC models a fully functional operating system by the unified treatment of non-determinism in system calls (Section 4.1). MOSAIC can check the interactions between processes, threads, and devices. Such a design resembles the EXPLODE system [47] for model checking real storage systems, in which all non-determinism and fault injection are implemented upon choose().

As a pedagogical model checker, MOSAIC’s primary use is to explain real operating system behaviors by mapping the model’s execution traces (e.g., examples in Sections 3.2 and 4.2) to real systems. Such an approach belongs to the paradigm of lightweight formal methods [21, 22], which strongly emphasizes practicability rather than the full soundness of a proof. Lightweight formal methods have been proven effective in validating excessively complex real systems [6]. Like other pedagogical model checkers [28], we intentionally trade performance for understandability. Compared with fully verified systems [31], MOSAIC is functional but with orders of magnitude less code.

Emulation is also a widely-adopted approach in operating system teaching, which helps students establish a correct and rigorous understanding of concepts. The exercises of “Three Easy Pieces” [3] are based on a substantial number of independent emulators. MOSAIC as a unified model, on the other hand, can model (and check) the interplay between different levels of system mechanisms, e.g., how file system operations and process-level race result in a TOCTTOU attack in Section 4.2.

Finally, (replicated) state machines also play a fundamental role in distributed systems [32]. Formal methods have become increasingly necessary in handling the counter-intuitive corner cases often overlooked by informal arguments. We believe that getting familiar early with such a paradigm on rigorous modeling and reasoning in a first operating system course can inspire the future generation of system researchers.

7 Conclusion

This paper presents a state-machine-first and model-based approach to teaching operating systems. By leveraging modeling and model checking, we can define operating system concepts rigorously, explore system behaviors exhaustively, and motivate non-trivial research systems intuitively under a unified framework in a first operating system course. We believe that this paper’s teaching philosophies have the potential to lead a paradigm shift in the teaching of operating systems.

Acknowledgments

We would like to thank Haibo Chen, Yubin Xia, the anonymous reviewers, and our shepherd, David Cock, for their valuable and constructive feedback on this work. This work is supported in part by National Natural Science Foundation of China (Grant #61932021, #62025202, #62272218), Fundamental Research Funds for the Central Universities (#2022300287, #020214380102), State Key Laboratory for Novel Software Technology, and the Xiaomi Foundation.
References

[1] the musl libc. https://musl.libc.org/.

[2] Martín Abadi and Leslie Lamport. The existence of refinement mappings. Theoretical Computer Science, 82(2):253–284, 1991.

[3] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 1.00 edition, August 2018.

[4] Andrew Baumann, Jonathan Appavoo, Orran Krieger, and Timothy Roscoe. A fork() in the road. In Proceedings of the Workshop on Hot Topics in Operating Systems, HotOS 19, pages 14–22, New York, NY, USA, 2019. Association for Computing Machinery.

[5] Fabrice Bellard. QEMU, a fast and portable dynamic translator. In 2005 USENIX Annual Technical Conference, USENIX ATC 05, Anaheim, CA, April 2005. USENIX Association.

[6] James Bornholt, Rajeev Joshi, Vytautas Astrauskas, Brendan Cully, Bernhard Kragl, Seth Markle, Kyle Sauri, Drew Schleit, Grant Slatton, Serdar Tasiran, Jacob Van Geffen, and Andrew Warfield. Using lightweight formal methods to validate a key-value storage node in Amazon S3. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP 21, pages 836–850, New York, NY, USA, 2021. Association for Computing Machinery.

[7] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI 08, pages 209–224, USA, 2008. USENIX Association.

[8] The Kernel Development Community. Linux tracing technologies. https://www.kernel.org/doc/html/latest/trace/index.html.

[9] The Kernel Development Community. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org.

[10] Russ Cox, Frans Kaashoek, and Robert Morris. xv6: a simple, Unix-like teaching operating system. https://pdos.csail.mit.edu/6.1810/2022/xv6/book-riscv-rev3.pdf.

[11] Weidong Cui, Xinyang Ge, Baris Kasikci, Ben Niu, Upamanyu Sharma, Ruoyu Wang, and Insu Yun. REPT: Reverse debugging of failures in deployed software. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, October 2018.

[12] The GDB Developers. GDB: The gnu project debugger. https://sourceware.org/gdb/.

[13] The Microsoft Windows Developers. The Windows Kernel Transaction Manager (KTM). https://learn.microsoft.com/en-us/windows/win32/ktm/kernel-transaction-manager-portal.

[14] Android Developer Documentation. Overview of memory management. https://developer.android.com/topic/performance/memory-overview.

[15] Stephen Dolan. jq: sed for JSON data. https://stedolan.github.io/jq/.

[16] George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, OSDI 02, pages 211–224, USA, 2002. USENIX Association.

[17] Phillip B. Gibbons and Ephraim Korach. Testing shared memories. SIAM Journal on Computing, 26(4):1208–1244, 1997.

[18] Roelof Hamberg and Frits Vaandrager. Using model checkers in an introductory course on operating systems. SIGOPS Oper. Syst. Rev., 42(6):101–111, October 2008.

[19] David A. Holland, Ada T. Lim, and Margo I. Seltzer. A new instructional operating system. In Proceedings of the 33rd SIGCSE Technical Symposium on Computer Science Education, SIGCSE 02, pages 111–115, New York, NY, USA, 2002. Association for Computing Machinery.

[20] Jack Tigar Humphries, Neel Natu, Ashwin Chaugule, Ofir Weisse, Barret Rhoden, Josh Don, Luigi Rizzo, Oleg Rombakh, Paul Turner, and Christos Kozyrakis. GhOSt: Fast & flexible user-space delegation of Linux scheduling. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP 21, pages 588–604, New York, NY, USA, 2021. Association for Computing Machinery.

[21] Daniel Jackson. Software Abstractions: Logic, Language, and Analysis. MIT Press, revised edition, 2016.

[22] Daniel Jackson and Jeannette Wing. Lightweight formal methods. Computer, 29(4):20–22, April 1996.

[23] Kostis Kaffes, Jack Tigar Humphries, David Mazières, and Christos Kozyrakis. Syrup: User-defined scheduling across the stack. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP 21, pages 605–620, New York, NY, USA, 2021. Association for Computing Machinery.
[24] Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im,
Marco Canini, Dejan Kostić, Youngjin Kwon, Simon
Peter, and Emmett Witchel. LineFS: Efficient SmartNIC
offload of a distributed file system with pipeline
parallelism. In Proceedings of the ACM SIGOPS 28th
Symposium on Operating Systems Principles, SOSP 21,
pages 756–771, New York, NY, USA, 2021. Association
for Computing Machinery.
[25] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June
Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe,
Kai Engelhardt, Rafal Kolanski, Michael Norrish,
Thomas Sewell, Harvey Tuch, and Simon Winwood.
SeL4: Formal verification of an OS kernel. In Proceedings
of the ACM SIGOPS 22nd Symposium on Operating
Systems Principles, SOSP 09, pages 207–220, New York,
NY, USA, 2009. Association for Computing Machinery.
[26] Malcolm G. Lane. Teaching operating systems and
machine architecture–more on the hands-on laboratory
approach. In Proceedings of the Twelfth SIGCSE
Technical Symposium on Computer Science Education,
SIGCSE 81, pages 28–36, New York, NY, USA, 1981.
Association for Computing Machinery.
[27] Xavier Leroy. A formally verified compiler back-end.
Journal of Automated Reasoning, 43(4):363–446, 2009.
[28] Ellis Michael, Doug Woos, Thomas Anderson,
Michael D. Ernst, and Zachary Tatlock. Teaching
rigorous distributed systems with efficient model
checking. In Proceedings of the Fourteenth EuroSys
Conference 2019, EuroSys 19, New York, NY, USA,
2019. Association for Computing Machinery.
[29] Ingo Molnar and Arjan van de Ven. Runtime locking
correctness validator.
https://www.kernel.org/doc/html/latest/locking/lockdep-design.html.
[30] Luke Nelson, James Bornholt, Ronghui Gu, Andrew
Baumann, Emina Torlak, and Xi Wang. Scaling symbolic
evaluation for automated verification of systems
code with Serval. In Proceedings of the 27th ACM
Symposium on Operating Systems Principles, SOSP 19,
pages 225–242, New York, NY, USA, 2019. Association
for Computing Machinery.
[31] Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang,
Dylan Johnson, James Bornholt, Emina Torlak, and
Xi Wang. Hyperkernel: Push-button verification of an
OS kernel. In Proceedings of the 26th Symposium on
Operating Systems Principles, SOSP 17, pages 252–269,
New York, NY, USA, 2017. Association for Computing
Machinery.
[32] Diego Ongaro and John Ousterhout. In search of an
understandable consensus algorithm. In Proceedings
of the 2014 USENIX Conference on USENIX Annual
Technical Conference, USENIX ATC 14, pages 305–320,
USA, 2014. USENIX Association.
[33] Robert O’Callahan and Kyle Huey. rr: Lightweight
recording and deterministic debugging.
https://rr-project.org/.
[34] Gary L. Peterson. Myths about the mutual exclusion
problem. Inf. Process. Lett., 12(3):115–116, 1981.
[35] Ben Pfaff, Anthony Romano, and Godmar Back. The
Pintos instructional operating system kernel. In Proceedings
of the 40th ACM Technical Symposium on Computer
Science Education, SIGCSE 09, pages 453–457,
New York, NY, USA, 2009. Association for Computing
Machinery.
[36] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram,
Ramnatthan Alagappan, Samer Al-Kiswany,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
All file systems are not created equal: On
the complexity of crafting crash-consistent applications.
In Proceedings of the 11th USENIX Conference on Operating
Systems Design and Implementation, OSDI 14,
pages 433–448, USA, 2014. USENIX Association.
[37] Donald E. Porter, Owen S. Hofmann, Christopher J.
Rossbach, Alexander Benn, and Emmett Witchel. Operating
system transactions. In Proceedings of the ACM
SIGOPS 22nd Symposium on Operating Systems Principles,
SOSP 09, pages 161–176, New York, NY, USA,
2009. Association for Computing Machinery.
[38] Feng Qin, Joseph Tucek, Jagadeesan Sundaresan, and
Yuanyuan Zhou. Rx: Treating bugs as allergies–a safe
method to survive software failures. In Proceedings of
the Twentieth ACM Symposium on Operating Systems
Principles, SOSP 05, pages 235–248, New York, NY,
USA, 2005. Association for Computing Machinery.
[39] Andrew Quinn, Jason Flinn, Michael Cafarella, and
Baris Kasikci. Debugging the OmniTable way. In 16th
USENIX Symposium on Operating Systems Design and
Implementation, OSDI 22, pages 357–373, Carlsbad,
CA, July 2022. USENIX Association.
[40] Konstantin Serebryany, Derek Bruening, Alexander
Potapenko, and Dmitry Vyukov. AddressSanitizer: A
fast address sanity checker. In Proceedings of the 2012
USENIX Conference on Annual Technical Conference,
USENIX ATC 12, pages 28–37, USA, 2012. USENIX
Association.
[41] Konstantin Serebryany and Timur Iskhodzhanov.
Threadsanitizer: Data race detection in practice. In
Proceedings of the Workshop on Binary Instrumentation
and Applications, WBIA 09, pages 62–71, New York,
NY, USA, 2009. Association for Computing Machinery.
[42] Abraham Silberschatz, Greg Gagne, and Peter B. Galvin.
Operating System Concepts. Wiley, 10th edition, 2018.
[43] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett
Witchel. GPUfs: Integrating a file system with GPUs.
ACM Transactions on Computer Systems, 32(1), February
2014.
[44] Livio Soares and Michael Stumm. Flexsc: Flexible
system call scheduling with exception-less system calls. In
Proceedings of the 9th USENIX Conference on Operating
Systems Design and Implementation, OSDI 10,
pages 33–46, USA, 2010. USENIX Association.
[45] Andrew S. Tanenbaum. Operating Systems: Design and
Implementation. Prentice-Hall, first edition, 1987.
[46] Jinpeng Wei and Calton Pu. TOCTTOU vulnerabilities
in UNIX-style file systems: An anatomical study. In
Proceedings of the 4th Conference on USENIX Conference
on File and Storage Technologies - Volume 4, FAST 05,
page 12, USA, 2005. USENIX Association.
[47] Junfeng Yang, Can Sar, and Dawson Engler. EXPLODE:
A lightweight, general system for finding serious storage
system errors. In 7th USENIX Symposium on Operating
Systems Design and Implementation, OSDI 06, Seattle,
WA, November 2006. USENIX Association.
[48] Cristian Zamfir and George Candea. Execution synthe-
sis: A technique for automated software debugging. In
Proceedings of the 5th European Conference on
Computer Systems, EuroSys 10, pages 321–334, New York,
NY, USA, 2010. Association for Computing Machinery.