Shedding Light On Static Partitioning Hypervisors
Fig. 1: Architectural overview of the assessed hypervisors: Jailhouse, Xen (Dom0-less), Bao, and seL4 CAmkES VMM.
will also provide support for per-VM user-mode VMMs¹ while promising to alleviate the performance overhead of CAmkES.

III. METHODOLOGY AND EXPERIMENTAL SETUP

A. Methodology

Selected Hypervisors. We have selected four open-source SPH (Fig. 1). Jailhouse and Bao were designed for the static partitioning use case; both are open-source and target Arm platforms. Xen Dom0-less is a novel deployment that allows directly booting multiple VMs (bypassing Dom0) and passthrough of peripherals to VMs. Finally, seL4 is a well-established open-source microkernel, which can be used as a hypervisor in combination with a user-level VMM. The seL4 CAmkES VMM is an open-source reference VMM implementation with static allocation of resources. These systems are actively maintained, adopted for commercial purposes, and there is a fair amount of information about them. We have excluded other open-source SPH that do not support Armv8-A, such as the SPH architecture pioneer Quest-V [36]–[38] and ACRN [39], as well as open-source hypervisors that do not explicitly target static partitioning (e.g., KVM [6], Xvisor [40]). We have excluded microkernels such as NOVA [41] due to the lack of an open-source reference user-space VMM, and because we believe seL4 serves as a faithful representative of the microkernel architecture. TrustZone-assisted hypervisors [42]–[44] were left out due to multicore scalability issues and lack of active maintenance. Finally, we have excluded commercial products (e.g., PikeOS, LynxSecure) as these often require licenses the authors did not have access to, and that would limit wide access to the study artifacts.

Empirical Evaluation. The evaluation focuses on performance, interrupt latency, inter-VM communication latency and bandwidth, boot time, and code size. We also assess the effect of interference and of the available mitigation mechanism (i.e., cache coloring). Although we consider virtual device performance, IO interference, and applied security techniques such as stack canaries or guards, data execution prevention, or control-flow integrity very relevant, these are out of the scope of this work. We advocate for a follow-up study as future work.

B. Experimental setup

Hardware Platform. Experiments were carried out on a Xilinx ZCU104, featuring a Zynq UltraScale+ SoC. It includes a quad-core Cortex-A53 running at 1.2 GHz, a GIC-400 (GICv2) featuring 4 list registers, and an MMU-500 (SMMUv2). Cores have private 32 KiB separate L1 instruction and data caches, and share a 1 MiB unified L2 cache. It also includes a programmable logic (PL) component (i.e., an FPGA).

Hypervisors configuration. We made an effort to use the latest versions of each SPH. Still, we applied a few patches to Jailhouse, Xen, and Bao to include features such as coloring or direct injection, which are not yet fully merged. Further, we had to make small adjustments to all SPH to enable homogeneous configurations (e.g., uniform VM memory maps), allow direct guest access to PMUs, or instrument hypervisors for specific experiments. For each SPH, we leveraged the default configuration for the target SoC, with some tweaked options such as disabling debug and logging features. There were, however, specific adjustments made on a per-hypervisor basis. For example, to remove or minimize the invocation of a scheduler in Xen, we used the null scheduler and disabled trapping of wait-for-interrupt (WFI) instructions; in seL4, since it was not possible to disable the timer tick, we configured the tick with a period of about 5 seconds. We compiled all hypervisors with GCC 11.2, with the default optimization level defined by each hypervisor's build system. All these SPH configurations and modifications are available and clearly discernible in the provided artifact [12].

VM configuration. VM configurations are as similar as possible, mainly w.r.t. the number of vCPUs and memory. For Jailhouse and seL4-VMM, where memory must be manually allocated, we set memory regions aligned to 2 MiB. The only device assigned to each VM is a UART. We evaluated two different classes of VMs: (i) large VMs running Linux (v5.14), representative of rich, Unix-like OSs; and (ii) small VMs running baremetal applications or FreeRTOS (v10.4), representative of critical workloads. When cache coloring is enabled, we assign half of the colors (four out of eight²) to the VM executing the benchmark, three colors to the interference application, and one color to the hypervisor (only supported in Bao and Xen). Note that the color assignment configuration can significantly impact the final measurements for all metrics. In real deployments, the color assignment should be carefully defined based on the profile of the final system.

¹ Only after the bulk of this work was carried out, virtualization support in seL4CP was made openly available. At the time of writing, it still appears to be in a beta stage and not as mature as CAmkES.

² We consider only eight cache colors while, in truth, the target platform allows for 16. We do this to avoid color assignment configurations that would partition the L1 cache.
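As background for the coloring configuration above, a page's cache color is determined by the physical-address bits that index the shared L2 but lie above the 4 KiB page offset. The sketch below is only illustrative: it assumes the ZCU104's 1 MiB, 16-way, 64-byte-line L2 (16 colors in total, of which this study uses eight, per footnote 2), and all names and constants are ours, not taken from any of the evaluated hypervisors.

    #include <stdint.h>

    /* Assumed Zynq UltraScale+ L2 geometry: 1 MiB, 16-way, 64-byte lines. */
    #define L2_SIZE        (1024u * 1024u)
    #define L2_WAYS        16u
    #define L2_LINE_SIZE   64u
    #define L2_SETS        (L2_SIZE / (L2_WAYS * L2_LINE_SIZE))   /* 1024 sets */
    #define PAGE_SIZE      4096u
    #define NUM_COLORS     ((L2_SETS * L2_LINE_SIZE) / PAGE_SIZE) /* 16 colors */

    /* A page's color is given by the L2 set-index bits above the 4 KiB page
     * offset; pages of the same color compete for the same L2 sets. */
    static inline unsigned page_color(uint64_t paddr)
    {
        return (unsigned)((paddr / PAGE_SIZE) % NUM_COLORS);
    }

    /* A colored allocator hands a VM only pages whose color is in its mask,
     * e.g., four of the eight usable colors for the benchmark VM above. */
    static inline int page_allowed(uint64_t paddr, uint16_t color_mask)
    {
        return (color_mask >> page_color(paddr)) & 1u;
    }

Giving the benchmark VM, the interference VM, and (where supported) the hypervisor disjoint color masks is what carves the shared L2 into private slices, at the cost of forcing 4 KiB mappings and shrinking each partition's usable cache and memory, as discussed later.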
Interference Workload. When evaluating memory hierarchy interference, we use a custom baremetal guest which continuously writes a buffer with the size of the LLC (1 MiB). Unless noted otherwise, this interference guest runs on a VM with two vCPUs. We stress that although parameterized to cause a significant level of interference, the observed effects caused by the interference workload do not necessarily reflect the worst case that could be achieved if further fine-tuned.

Measurement tools. We use the Arm PMU to collect microarchitectural events on benchmark execution. The selected events include instruction count, TLB accesses and refills, cache accesses and refills, number of exceptions taken, and number of interrupts triggered; we register the exception level at which these events occur. For the Linux VMs, we use the perf tool [45] to measure time and to collect microarchitectural events. For baremetal or RTOS VMs, we use the Arm Generic Timer, with a resolution of 10 ns, and a custom PMU driver.

C. Threats to validity

Experiments were independently conducted by two researchers. Each used a different ZCU104 platform and pre-agreed VM configurations (cross-checked). We have contacted key individuals and/or maintainers as representatives of each SPH community. We have received replies from all of them, which led to a few iterations and the repetition of some experiments. Overall, the comments and issues raised by these individuals are reflected in the presented ideas and results. Despite all efforts, these experiments may still be subject to latent inaccuracies. We will open source all artifacts to enable independent validation of the results. This study may also include limitations on the generalization to other platforms. For the hardware platform, we argue both the SoC (Zynq UltraScale+) and the Cortex-A53 are representative of others used in automotive and industrial settings (e.g., NXP i.MX8 or Renesas R-Car M3). To corroborate this, we have also carried out the performance and interrupt latency experiments for the Bao hypervisor on an NXP i.MX8QM, which features the GIC-500 (GICv3). The obtained results are fully consistent with those presented in Sections IV and V. Furthermore, we argue next-generation platforms, such as the i.MX9 featuring Cortex-A55 CPUs, implement very similar microarchitectures.

IV. SPH: PERFORMANCE

We start by assessing the performance degradation³ of a single-core Linux VM atop each SPH. The main results are depicted in Figures 2, 3, and 4. We then evaluate the system under interference to understand the effectiveness of the microarchitectural isolation mechanisms available in each SPH.

³ Performance degradation is the ratio between the total execution time of the benchmark running atop the hypervisors and native execution.

Selected Benchmark. We use the MiBench Embedded Benchmarks' Automotive and Industrial Control Suite (AICS) [46]. These benchmarks are intended to emulate the environment of embedded applications such as airbag controllers and sensor systems. Each test has two variants: small operates on a reduced input data set representing a lightweight use of the benchmark, while large operates over a considerable input data set, emulating a real-world application scenario.

Base Performance Overhead. Fig. 2 presents the relative performance degradation for the MiBench AICS. For each benchmark, below the plotted bars, we present the average absolute execution time for the native execution. The first observation is that, independently of the hypervisor, different benchmarks are affected to different degrees. Secondly, Jailhouse, Xen, and Bao incur a negligible performance penalty, i.e., less than 1% across all benchmarks. Although seL4 CAmkES-VMM also presents a small overhead for most benchmarks, the overhead can reach up to 7%.

For a virtualized system configured with a single guest VM, there are two main possible sources of overhead. The first is the increase in TLB miss penalty due to the second stage of translation, since it can, in the worst case, increase the number of memory accesses in a page-walk by a factor of four. The second is the overhead of trapping to the hypervisor and performing interrupt injection, e.g., for the timer tick interrupt. Additionally, the pollution of caches and TLBs by the hypervisor might also affect guest performance. To further understand the behavior of the benchmarks, in particular the larger overhead of the CAmkES-VMM, we have collected a number of microarchitectural events. Fig. 3 shows them normalized to the number of executed instructions. We highlight two events whose increase is highly correlated with the degradation observed: hypervisor L2 cache refills (Fig. 3a) and guest TLB misses (Fig. 3b), with Pearson correlation coefficients of up to 0.94 and 0.96, respectively.

An important hypervisor feature to minimize the impact of two-stage translation is to leverage superpages. By inspecting hypervisor code, we concluded that only CAmkES-VMM does not have support for 2 MiB superpages. This justifies the higher number of TLB misses. Notwithstanding, to corroborate this argument, we have configured the other SPH to preclude the use of superpages. As expected, we observed an increase in the performance degradation (and TLB misses) similar to CAmkES-VMM (Fig. 4). We still observed a gap of up to 2% between CAmkES-VMM and the other SPH; this is related to the aforementioned interrupt handling and injection overheads, i.e., a consequence of the microkernel design: more costly switches between VM and VMM and a high number of VMM-to-microkernel calls for managing and injecting interrupts. This is confirmed by Figures 3c and 3d, which show the hypervisor-to-guest executed instruction ratio and the number of exceptions taken by the hypervisor, respectively. For these events, seL4 has a higher ratio when compared to the other SPH. We further investigate interrupt injection in Section V.

Takeaway 1. SPH do not incur meaningful performance impacts due to: (i) modern hardware virtualization support; (ii) 1-to-1 mapping between virtual and physical CPUs; and (iii) minimal traps. However, one key aspect is that SPH must have support for / make use of superpages to minimize TLB misses and page-table walk overheads.
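For illustration, the hot loop of the interference guest described in Section III-B could look like the sketch below; the 1 MiB buffer matches the stated LLC size, but the stride, identifiers, and structure are assumptions of ours rather than the actual workload shipped with the artifact.

    #include <stdint.h>

    #define LLC_SIZE   (1024u * 1024u)  /* 1 MiB, the size of the shared L2 cache */
    #define LINE_SIZE  64u              /* Cortex-A53 cache-line size */

    /* Buffer as large as the LLC, so every pass tends to evict whatever
     * the victim VM has brought into the shared cache. */
    static volatile uint8_t buffer[LLC_SIZE] __attribute__((aligned(LINE_SIZE)));

    void interference_main(void)
    {
        /* Continuously write one byte per cache line: each store allocates a
         * line in L1/L2 and generates write-back traffic towards DRAM. */
        for (;;) {
            for (uint32_t i = 0; i < LLC_SIZE; i += LINE_SIZE) {
                buffer[i] = (uint8_t)i;
            }
        }
    }

As the authors stress, a loop of this kind is merely parameterized to generate heavy traffic through the shared memory hierarchy; it does not necessarily represent the worst-case interference pattern achievable with further fine-tuning.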
Fig. 2: Relative performance degradation for the MiBench Automotive and Industrial Control Suite. (Y-axis: % performance degradation; average native execution times shown below each benchmark, small/large variants: qsort 22.24/219.46 ms, susanc 4.74/18.40 ms, susane 5.14/33.46 ms, susans 23.45/297.08 ms, bitcount 20.72/252.75 ms, basicmath 100.47/1496.70 ms.)
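Spelling out the metric of footnote 3 that Fig. 2 plots (the percentage form is our reading of the y-axis, since values of a few percent correspond to ratios just above one):

    D \;=\; \frac{T_{\mathrm{hyp}}}{T_{\mathrm{native}}},
    \qquad
    D_{\%} \;=\; \Bigl(\frac{T_{\mathrm{hyp}}}{T_{\mathrm{native}}} - 1\Bigr) \times 100\%

For example, a 1% degradation on qsort-small (22.24 ms natively) corresponds to roughly 22.46 ms when running atop an SPH.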
Fig. 3: MiBench AICS microarchitectural events. (a) Hyp. L2 cache misses per instr.; (b) guest iTLB misses per instr.; (c) Hyp./guest instr. ratio; (d) Hyp. exceptions per instr.

Fig. 4: MiBench AICS without the use of superpages on second-stage translation. (a) % performance degradation; (b) guest iTLB misses per instr.

Performance under interference. We also evaluate inter-VM interference and the effectiveness of cache coloring at both guest and hypervisor levels. Fig. 5 plots the results under interference (+interf), with coloring enabled (+col), and with interference and coloring enabled (+interf+col). seL4 CAmkES VMM shows no results for coloring enabled as this feature is not openly available yet.

There are four conclusions to be drawn. Firstly, interference significantly affects benchmark execution over all hypervisors. As expected, this is explained by a significant increase in L2 cache misses. On Jailhouse, Xen, and Bao, performance is degraded by a similar factor, i.e., to a maximum of about 105%; seL4-VMM is more susceptible to interference, reaching up to 125% in the worst case. This pertains to the fact that, given that seL4-VMM executes a much higher number of instructions, the interference also impacts the execution of the hypervisor. Secondly, coloring, per se, significantly impacts performance (up to about 20%). This seems logical given that coloring (i) forces the hypervisor to use 4 KiB pages, reducing TLB reach, and (ii) reduces the available cache space, which for working sets larger than the LLC increases memory system pressure (i.e., L2 cache misses). Thirdly, coloring can only reduce interference but not completely mitigate it. In these experiments, the interference workload runs continuously. However, in a more realistic scenario, it might be intermittent. The improvement in predictability achieved by coloring is reflected in the difference between the base experiment results (bars in Fig. 2 and +interf in Fig. 5) and the respective variants with coloring enabled (+col in Fig. 5). The lower the difference, the higher the predictability. For example, in the case of susanc-small, we observed that without coloring, the variation can go up to 105 percentage points (pp), while when coloring is enabled, the observed overhead is around 58%, which corresponds to a variation of 38 pp compared to the configuration with coloring enabled but without interference. Nevertheless, we observed that cache misses are essentially reduced to the same level as when coloring is enabled but without interference. Clearly, the observed interference is not only due to cache-line contention. There are points of contention at deeper levels of the memory hierarchy, e.g., buses and the memory controller [47], or even in internal LLC structures [48]. Finally, results on Xen and Bao demonstrate that hypervisor coloring has no substantial benefit, as it only reduces performance degradation due to interference by at most 1% (omitted due to lack of space).

Takeaway 2. Multicore memory hierarchy interference significantly affects guests' performance. Cache partitioning via page coloring is not a silver bullet: despite fully eliminating inter-core conflict misses, it does not fully mitigate interference (up to 38 pp increase in relative overhead).

V. SPH: INTERRUPT LATENCY

As discussed in Section II-A, the existing GIC virtualization support is not ideal for MCS: hypervisors have to handle and inject all interrupts and must actively manage list registers when the number of pending interrupts is larger than the physical list registers. This is of particular importance to guarantee the correct interrupt priority order, which might be critical for an RTOS [49]. In this section, we investigate the overhead of each SPH on interrupt latency, their susceptibility to interference, and the effectiveness of cache coloring. Then, we evaluate the direct injection technique and analyze interrupt priority support as well as virtual IPI latencies.

Methodology. To measure interrupt latency, we used a custom lightweight baremetal benchmark, which measures the latency of a periodic interrupt triggered by the Arm Generic Timer. The timer is programmed in auto-reload mode, to continuously trigger an interrupt every 10 ms. The interrupt handler reads the value of the timer, i.e., it measures the time elapsed since the interrupt was triggered. Each measurement is carried out with cold L1 caches. To achieve this, after each measurement, we flush the instruction cache. During the 10 ms, we also prime the L1 data cache with useless data.

Fig. 5: Performance degradation and L2 cache misses per instruction for the MiBench AICS under interference and coloring.

Fig. 6: Base interrupt latency.

Base Latency. Fig. 6 depicts the violin plots for the custom benchmark running atop each SPH. From the baseline of about 200 ns, Bao and Jailhouse incur the smallest increase, albeit significant, to an interrupt latency of about 4x (840 ns) and 5x (1090 ns), respectively. Xen shows an increase of about 14x (2800 ns). The variance observed in these three systems is negligible. The difference observed between Jailhouse/Bao and Xen is justified by the interrupt injection path being highly optimized in the former, while more generic in Xen. We confirmed this by studying the source code and assessing the number of instructions executed by each hypervisor on the interrupt handling and injection path: while Jailhouse and Bao execute around 200 instructions, Xen executes about 1050. seL4-VMM presents the largest interrupt latency (47x, 9400 ns), an order of magnitude higher than Jailhouse and Bao. The variance of the latency is also affected. This can be explained by the interrupt handling and injection mechanism of a microkernel architecture. In the other SPH, each interrupt results in a single exception taken at EL2, where the interrupt is handled and injected in the VM; virtualization support is leveraged such that no further traps occur. In CAmkES VMM it results in four traps to the microkernel: (i) the first due to the interrupt itself, which results in forwarding it as a message to the VMM; (ii) a system call from the VMM to inject the interrupt in the VM (i.e., write the list register); (iii) another to "reply" to the exception, resuming the VM; and (iv) a final one where the VMM waits for a message signaling a new VM event or interrupt, resulting in a final context switch back to the VM. We have also concluded that seL4 does not use a GIC feature that would allow guests to directly deactivate⁴ the physical interrupt, resulting in an extra trap.

⁴ Deactivating an interrupt in the GIC means marking it as handled, enabling the distributor to forward it to the CPU when it occurs again.

Takeaway 3. Due to the lack of efficient hardware support for directly delivering interrupts to guests in Arm platforms, all SPH increase the interrupt latency by at least one order of magnitude. However, by design, SPH such as Jailhouse and Bao are able to achieve the lowest latencies as they provide an optimized path for hardware interrupt injection.

Latency Under Interference. Fig. 7 shows the results for interrupt latency under interference, including the baseline results of Fig. 6 for relative comparison as solo. Analyzing the effects of VM interference on interrupt latency (interf), we observed that Bao latency increases to an average of 7260 ns, Jailhouse to 7730 ns, Xen to 23000 ns, and seL4-VMM to 85940 ns. This corresponds to an increase of 36x, 38x, 115x, and 430x, respectively, compared to the base latency. It is also worth noting that the variance also increases. When enabling coloring (col), we measured no significant difference in interrupt latency compared to the base case. However, when enabling cache coloring in the presence of inter-VM interference (interf+col), there is a visible improvement in average latency and variance. However, note that the observed variance does not constitute a measure of predictability. As explained in Section IV, predictability is reflected in the difference between the interf and interf+col results and the respective baselines, i.e., solo and col. Finally, by applying coloring also to the hypervisor (interf+col+hypcol), Bao latency is reduced to almost no-interference levels with negligible variance. Xen latency also drops considerably to an average of 6300 ns. The observed interrupt latency under interference can be
Fig. 7: Interrupt latency under interference and cache coloring (solo, interf, col, interf+col, interf+col+hypcol).

Fig. 9: Interrupt latency with direct injection enabled (solo, interf, interf+col).
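For concreteness, the latency numbers reported in Figs. 6, 7, and 9 come from a benchmark of the shape sketched below, following the methodology described at the start of this section. The 100 MHz counter frequency matches the stated 10 ns resolution, but the re-arm logic, the helper functions (record_sample, flush_icache), and all other names are illustrative assumptions, not the authors' actual code.

    #include <stdint.h>

    #define PERIOD_MS     10u
    #define CNT_HZ        100000000u                    /* 100 MHz => 10 ns per tick */
    #define PERIOD_TICKS  (CNT_HZ / 1000u * PERIOD_MS)  /* ticks in 10 ms */

    extern void record_sample(uint64_t ns);   /* assumed: stores one measurement */
    extern void flush_icache(void);           /* assumed: invalidates the L1 I-cache */

    static inline int32_t read_cntv_tval(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, cntv_tval_el0" : "=r"(v));
        return (int32_t)v;                    /* TVAL is a signed 32-bit down-counter */
    }

    static inline void write_cntv_tval(uint32_t v)
    {
        __asm__ volatile("msr cntv_tval_el0, %0" :: "r"((uint64_t)v));
    }

    /* Invoked from the IRQ vector on every virtual-timer interrupt. */
    void timer_irq_handler(void)
    {
        /* After the timer fires, TVAL keeps counting down past zero, so the
         * value read here is minus the number of ticks elapsed since the
         * interrupt was triggered, i.e., the interrupt latency. */
        int32_t tval = read_cntv_tval();
        record_sample((uint64_t)(-tval) * 10u);   /* ticks -> nanoseconds */

        write_cntv_tval(PERIOD_TICKS);            /* re-arm for the next period */
        flush_icache();                           /* next sample starts with a cold L1I */
    }

Between interrupts, the benchmark would additionally walk a dummy data set to prime the L1 data cache with useless data, as the authors describe.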
Takeaway 6. Only Xen and Bao respect interrupt priority order. Additionally, we observe that for all SPH, if multiple interrupts are triggered simultaneously, there is a partial priority inversion as lower priority interrupts take precedence due to the need for the hypervisor to handle and inject them.

Inter-Processor Interrupts. IPIs (SGIs) are critical for multicore VM performance. For a vCPU to send an SGI, the guest must write a virtual GIC distributor register. This will trap to the hypervisor, which must emulate the access and forward the event to the target core, where the SGI is injected via list registers. We use a custom baremetal benchmark to measure IPI latency. It works by measuring the time between when the source vCPU writes the distributor register and when the final IPI handler starts executing. It also measures the overhead of the trap. We instrument the SPH to sample the time the IPI is forwarded internally; this signals the end of the emulation and captures the overhead of injecting the interrupt in the target. Figure 11 shows that IPI latency increases significantly for all SPH. While the baremetal IPI latency is around 260 ns, it reaches 2258 ns for Jailhouse, 4157 ns for Xen, 2711 ns for Bao, and 10868 ns for the CAmkES VMM. However, the costs of the register access emulation and interrupt injection are not proportional across all SPH. For example, Bao has the lowest emulation and event forwarding times, but the overall IPI latency is higher than Jailhouse's. This means that the interrupt injection path on Bao is slower than on Jailhouse. By inspecting the source of both hypervisors, we have observed that Bao immediately forwards the SGI event to the target core, performing all interrupt injection operations in the target core. Jailhouse, in turn, manages the interrupt injection structures at the source core and only then signals the target vCPU by writing the list register. Xen follows the same approach as Jailhouse, but presents higher overhead. The CAmkES VMM has the highest overhead due to the large number of system calls the VMM issues to the microkernel (in total, 7). Four are issued before the event forwarding, and the rest only after the SGI is forwarded to the target core. All in all, the access to the virtual distributor is more expensive than the IPI itself.

Takeaway 7. IPI latency reflects the same overheads as external interrupts. Future Arm platforms might reduce them with GICv4.1 [50]. In the short term, direct injection might alleviate this issue. However, both approaches fall short of achieving native latency as they still pay the price of emulating the write to the "IPI send" register.

Fig. 12: Inter-VM notification latencies.

VI. SPH: INTER-VM COMMUNICATION

For inter-VM communication, SPH typically only provide statically allocated shared memory. This is usually coupled with an asynchronous notification mechanism signaled as an interrupt. All four SPH provide such mechanisms. Next, we analyze inter-VM notification latency and transfer throughput.

Inter-VM latency. Fig. 12 shows the inter-VM notification latency, reflecting the time from when the notification is issued until the execution of the handler in the destination VM. The relative differences between the latencies for each SPH are similar to those observed for passthrough interrupts and IPIs. Jailhouse achieves the lowest latency (1500 ns), followed by Bao (1900 ns). Xen shows an intermediate value of 4600 ns, while seL4 CAmkES VMM is significantly larger than the others (an average of 18000 ns). Studying the internals of the implementations, we note that while most hypervisors synthesize and inject virtual interrupts, Jailhouse uses non-allocated physical interrupts for these notifications. Thus, to send one, Jailhouse only sets the interrupt pending in the GIC distributor. This is significantly advantageous when combined with direct injection. Note that enabling direct injection in Bao would preclude the use of this mechanism. For seL4, we highlight the impact of the microkernel architecture since, atop VM/VMM context switches, we observe additional overheads due to inter-VMM communication. Lastly, we see that interference increases all latencies accordingly and that coloring can mitigate it.

Inter-VM throughput. In Fig. 13, we evaluate the throughput of bulk data transfers via a shared memory buffer. The benchmark transmits 16 MiB of random data through a shared buffer with varying sizes. When the source VM finishes writing the buffer, it either signals the destination VM via a shared memory flag or via an asynchronous notification, and waits for a signal back to start writing the next chunk. For the polling scenario, the obtained throughput is very similar across all hypervisors; this confirms that there are no significant differences in how they allocate and map memory or configure memory attributes. Throughput is stable (1500 MiB/s) until the buffer size surpasses the LLC size (1 MiB), dropping to about 1300 MiB/s. For the asynchronous scenario, throughput is significantly impacted when using smaller buffer sizes, given the high number of synchronization points, which reflects the observed interrupt overheads. Finally, we note that interference has no significant effect as long as the buffer size is kept below about half the size of the LLC. Beyond that, throughput is reduced from 1300 to 850 MiB/s. Although not shown due to lack of space, using coloring does not prove beneficial, as the throughput illustrated in Fig. 13 remains virtually unchanged.

Takeaway 8. Inter-VM notification latencies are significant and, as is the case for hardware interrupts, very susceptible to the effects of interference. However, for bulk data transfers, interference does not seem to significantly affect throughput if the shared buffer size is chosen in a range of about one-fourth to half the LLC size (i.e., 256 KiB to 512 KiB).

Fig. 13: Inter-VM communication throughput (polling and interrupt variants, with and without interference, by buffer size).

Fig. 14: Boot time for each stage by VM image region size (FSBL, ATF, U-boot, and hypervisor stages).
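To make the polling variant of the throughput benchmark above concrete, the sketch below shows one plausible shape for the source side: a statically shared buffer plus two flag words, with the source filling a chunk and spinning on an acknowledgment before writing the next one. The channel layout, the flag protocol, and all identifiers are illustrative assumptions, not the artifact's actual benchmark code.

    #include <stdint.h>
    #include <string.h>

    #define TOTAL_BYTES   (16u * 1024u * 1024u)   /* 16 MiB transferred in total */

    /* Shared-memory region statically assigned to both VMs by the hypervisor
     * configuration; the split into data buffer and flags is an assumption. */
    struct shm_channel {
        volatile uint32_t ready;   /* set by the source when a chunk is available */
        volatile uint32_t ack;     /* set by the destination when it consumed it  */
        uint8_t data[];            /* chunk buffer of 'chunk_size' bytes          */
    };

    void source_vm_send(struct shm_channel *ch, const uint8_t *payload,
                        uint32_t chunk_size)
    {
        for (uint32_t sent = 0; sent < TOTAL_BYTES; sent += chunk_size) {
            /* Fill the shared buffer with the next chunk of random data. */
            memcpy(ch->data, payload + sent, chunk_size);

            /* Polling variant: signal via a shared-memory flag instead of an
             * asynchronous notification (interrupt). */
            __atomic_store_n(&ch->ready, 1, __ATOMIC_RELEASE);

            /* Wait for the destination VM to acknowledge before reusing the
             * buffer; throughput = TOTAL_BYTES / elapsed time. */
            while (__atomic_load_n(&ch->ack, __ATOMIC_ACQUIRE) == 0)
                ;
            __atomic_store_n(&ch->ack, 0, __ATOMIC_RELAXED);
        }
    }

In the asynchronous variant, the ready-flag store would be replaced by the hypervisor-specific notification (injected as an interrupt in the destination VM), which is why small chunk sizes expose the notification latencies discussed above.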
VII. SPH: BOOT TIME

System boot time is a crucial metric in industries such as automotive [51], [52], as critical components have strict timing requirements for becoming fully operational.

Platform's Boot Flow. The platform's boot flow [53] starts by executing ROM code, which loads the first-stage bootloader (FSBL) and enables the main cores. These initial boot stages set up the platform's basic infrastructure (e.g., clocks, DRAM) and load the TF-A and U-boot. U-boot will load the hypervisor and, except for Jailhouse, the guest images. Bao and Xen directly boot guests after initialization. Jailhouse starts with the boot of the Linux root cell, which installs the hypervisor, which then loads the guests. seL4's execution starts with an ELF loader which loads all images, initializes secondary cores, and sets up an initial set of page tables for the microkernel. The microkernel initializes and hands control to user space.

Total VM Boot Time. The hypervisor boot time is heavily dependent on the VM and how it is configured. We observed that the VM image size is one of the parameters with the highest impact on the hypervisor boot time. We measure boot time as a function of VM image size. Thus, to understand the overhead of the hypervisor in the context of the complete boot flow, in Fig. 14, we plot the cumulative time for each boot stage. Here, we can confirm that in all hypervisors but Jailhouse, the bulk of boot time is spent by U-boot. For Jailhouse, U-boot run time is constant, albeit large, as it always only loads the root cell's image. Jailhouse execution time increases steeply while loading the VM image. From this macro perspective, the other hypervisors add an almost constant offset to U-boot's boot time, the largest being seL4-VMM's. We observe this overhead is not in the microkernel, but at user level, which nevertheless heavily interacts with the microkernel to set up capabilities and kernel objects. We can conclude that VM boot time has its bottleneck in the loading of guest images to memory, not the hypervisor logic.

FreeRTOS and Linux Boot Times. We also measure the boot time of (i) a small VM running FreeRTOS with a 90 KiB image and (ii) a large VM with a Linux guest (built-in ramfs) totaling 59 MiB of image size. For Jailhouse, the Linux VM is a non-root cell. In Table I, we present results for a single-guest and a dual-guest system. For the latter, both VMs boot simultaneously; thus, we did not run experiments for dual-guest with Jailhouse, because it launches VMs sequentially. Table I presents the absolute boot time for the guest's native and virtualized execution, highlighting the relative percentage increase compared to native execution. For the single-guest FreeRTOS VM, all hypervisors but Bao cause a non-negligible increase in boot time. The same happens with the single-guest Linux VM. For the dual-guest configuration, we concluded that the small VM is heavily affected for all hypervisors. Surprisingly, we observe that although the cost of booting a single FreeRTOS in Bao is negligible, this is not true for a dual-guest configuration. Booting it alongside a Linux VM significantly increases its boot time, reaching overheads similar to those observed in Jailhouse's sequential boot.

Takeaway 9. The major bottleneck for VM boot time is caused by the bootloader, not the hypervisors. Notwithstanding, the hypervisor can significantly increase the boot time of a critical VM (small RTOS) when booting it alongside a larger VM (e.g., in a dual-OS Linux+RTOS configuration).

VIII. SPH: CODE SIZE AND TCB

In MCS, the size of the hypervisor code, measured in source lines of code (SLoC), is critical. It should be minimal as it is part of the trusted computing base (TCB) of all VMs. In this paper, we consider that a VM TCB encompasses any component with sufficient privileges that, if it is compromised or malfunctions, might be able to affect the safety and/or
security properties of the VM. As well understood in the literature, a larger TCB typically has a higher number of bugs and a wider attack surface [54], resulting in a higher probability of vulnerabilities. It is important to understand that each VM has its own TCB. Thus, the CAmkES VMM is only considered for the managed VM's TCB, not the others'. Further, large code bases are impractical for certification, both from a technical and an economic perspective. To qualify a component assigned a safety integrity level (SIL), all components on which it depends must also be qualified to the same or higher SIL [4].

TABLE I: Total boot time (ms) and relative increase compared to the baremetal case, for FreeRTOS and Linux VMs.

                                 Jailhouse             Xen                  Bao                  seL4-VMM
  FreeRTOS (baremetal 1670.89 ms)
    Single                       6242.18 / 173.58%     2338.24 / 39.94%     1716.23 / 2.71%      3496.19 / 109.24%
    Dual                         N/A                   6887.88 / 312.23%    5734.04 / 143.17%    9291.02 / 456.05%
  Linux (baremetal 7665.14 ms)
    Single                       12284.92 / 60.27%     8533.88 / 11.33%     7805.54 / 1.83%      12629.79 / 64.77%
    Dual                         N/A                   8707.15 / 13.59%     7895.95 / 3.01%      13086.86 / 70.73%

TABLE II: Hypervisor SLoC count and binary code size.

                               C (.c)    C (.h)    Asm     Total (SLoC)    .text (KiB)
  jailhouse    hypervisor       7308      2279      342       9929            79.3
               driver           2041       139      N/A       2180            20.1
  xen                          57360      8127     1765      67342           451.5
  bao                           5046      2840      537       8423            57.9
  seL4         microkernel     14569       N/A      189      14758           224.7
               CAmkES VMM      20932     19291      N/A      40223           724.3

Methodology. We measured SLoC for the target configurations using cloc [55]. The Xen build system offers a make target to assess the SLoC for a specific configuration. However, it does not count header files, which we believe must be accounted for since they provide function-like macros and inline functions. We have modified the Xen makefile to measure headers. We have also extended the Jailhouse and Bao build systems with the same functionality. For seL4, we used the fully unified and pre-processed kernel source file to assess the microkernel code base. For the CAmkES VMM, given that its source code is scattered throughout multiple seL4 project libraries, we were not able to list its source code files from the build system. Instead, we used debug information from the final executable and inspected each source to assess the included header files.

Code Size. Looking at Table II, we see Bao and Jailhouse have the smallest code bases of about 8400 and 9900 SLoC, respectively. Bao is implemented as a standalone component with no external dependencies. However, since part of Jailhouse's functionality is implemented as a Linux kernel module, we also account for that in the code base. It adds about 2180 SLoC, bringing Jailhouse's total code base to 12 KSLoC. For Xen, we use a custom config with almost all features disabled, except for a few such as coloring and static shared memory. It features the largest code base, with around 67 KSLoC. Finally, the seL4 microkernel has 14.5 KSLoC, while the CAmkES VMM can go up to 40K, i.e., almost 55 KSLoC in total. The visible difference between Bao and Jailhouse, and the seL4 microkernel and, especially, Xen, lies in the fact that the former were designed specifically for the static partitioning use case, while the latter aim at being more generic and adaptable. These differences are reflected in the binary size of each hypervisor.

TCB. The hypervisor SLoC does not directly reflect the VM TCB. Although by design SPH such as Bao have a smaller SLoC count, the seL4-VMM is vastly superior from a security perspective: the shared TCB is limited only to the formally verified microkernel, because each VM is managed by a fully isolated VMM. From a FuSa certification standpoint, however, the VMM would still need to be considered. Moreover, seL4 formal proofs are limited to a set of kernel configurations, currently not including multicore. Regarding Jailhouse, despite its small size, the root cell is a privileged component of the system. It executes part of all VM management logic, being in the critical path for booting all other VMs. It is arguably part of all VMs' TCB, increasing it significantly [54]. Analogously, Xen must depart from true Dom0-less to leverage richer features (e.g., PV drivers, dynamic VM creation). Recently, the Xen community has ignited efforts to use a smaller OS, such as Zephyr [24], as Dom0, refactor Xen to MISRA C, and provide extensive requirements and test documentation [56].

Takeaway 10. Hypervisors specifically targeting static partitioning have the smallest code bases. Despite facilitating certification, none of the evaluated SPH provide other artifacts (e.g., requirements specification, coding standards). Xen is the first to take steps in this direction; nevertheless, seL4's formal proofs provide the most comprehensive guarantees.

IX. DISCUSSION AND FUTURE DIRECTIONS

In this section, we discuss some of the open issues and potential research directions to improve the guarantees of SPH.

Interference Mitigation Techniques. Cache coloring does not fully mitigate the effects of inter-core interference. Furthermore, coloring has inherent inefficiencies such as (i) precluding the use of superpages and (ii) increasing memory pressure, which affects performance and predictability, as well as (iii) internal fragmentation (exclusively assigning 1 out of N colors implicitly allocates 1/Nth of physical memory, a portion of which may remain unused for small RTOSs or the SPH). While the latter could be solved by employing cache bleaching [57] in heterogeneous platforms, to further minimize coloring bottlenecks, we advocate for SPH to adopt other proven, widely applicable contention mitigation mechanisms, e.g., bandwidth regulation mechanisms implemented via PMU-based CPU throttling [58], [59]. We also stress the importance of including support for hardware extensions such as Arm's Memory Partitioning and Monitoring (MPAM) [11], [60], which provide flexible hardware means for partitioning cache space and memory bandwidth, and call for platform designers to include such facilities in their upcoming designs targeting MCS. Finally, we stress the need for instrumentation, analysis, and profiling tools [20], [61] that integrate with these hypervisors to help system designers understand the trade-offs and fine-tune these mechanisms (e.g., through automation).

Platform-Level Contention and Mitigation. None of the studied SPH manages traffic from peripheral DMAs. We advocate that SPH must provide contention mitigation mechanisms at
the platform level, e.g., (i) leveraging QoS hardware [20], [62] available on the bus, and (ii) controlling interference from DMA-capable devices or accelerators. Furthermore, since DMA masters still share SMMU structures (e.g., TLBs [63]), we hypothesize that bandwidth regulation techniques may fall short of efficiently mitigating interference at this level.

Interrupt Injection Optimization. Arm-based SPH's interrupt latency is mainly due to inadequate support in GICv2/3. GICv4 will provide direct interrupt injection support, but only for IPIs and MSIs. We want to raise awareness among Arm silicon makers and designers of the need for additional hardware support at the GIC level for direct injection of wired interrupts. The same holds for RISC-V [25]. Besides hardware support, we observed that simple SPH provide optimized interrupt injection paths. It is also possible to optimize this path in larger SPH (e.g., Xen) and in microkernels (e.g., by moving injection logic to the microkernel). Finally, Bao and Jailhouse implement direct interrupt injection; however, we must stress that using this technique severely hinders the ability of the SPH to manage devices or implement any functionality dependent on interrupts. A plausible research direction would be a hybrid approach, i.e., selectively enabling direct injection only in specific cores for critical guests while providing the more complex functionality in cores running non-critical guests.

Interrupt Priority Inversion Fix. As discussed in Section V, the studied SPH suffer from partial interrupt priority inversion because all currently pending interrupts are handled by the hypervisor and injected in the guest before it can service the highest-priority one. We advocate for implementing a lightweight solution by dynamically setting the interrupt priority mask based on the priority of the last injected interrupt. This approach ensures the hypervisor only receives the next interrupt once the guest has handled the highest-priority one.

Critical VM Boot Priority. Section VII highlights the issue of critical VM boot time overhead when booted under a dual-OS configuration. We advocate for the development of boot mechanisms that prioritize the boot of small critical VMs. However, as noted in Jumpstart [52], it must encompass the full boot flow and be optimized across stages and components, since the bottleneck of the boot time is in the image loading process performed by the bootloader, not the hypervisor.

Per-Partition Hypervisor Replica. Memory contention highly affects interrupt latency but can be minimized by assigning different colors to VMs and the hypervisor. Notwithstanding, coloring the hypervisor may prove wasteful and insufficient to address other interference channels internal to the hypervisor. We advocate for à la multikernel [64] implementations such as the one implemented in seL4, where the hypervisor image is replicated per cache partition [31], fully closing internal channels. For SPH with a small enough footprint, memory consumption or boot time costs should not be prohibitive.

Architecture Flexibility. Purely monolithic SPH (e.g., Jailhouse or Bao) have smaller code bases at the cost of feature richness and flexibility. The same holds for Xen, i.e., many widely-used rich features are absent when configured as an SPH (to minimize code size). On the other hand, the seL4 microkernel architecture is much more flexible as it allows for an isolated user-space VMM per guest, providing more robust isolation and customization; however, it comes at the cost of non-negligible latencies. We advocate for novel architectures that combine microkernels' flexibility and strong fault encapsulation with SPH's simplicity and minimal latencies by hosting per-partition VMMs directly at the hypervisor privilege level. Such a design could arguably be achieved by combining multikernel-like architectures [64] and per-core memory protection mechanisms (e.g., Armv9 RME's GPT [65], or RISC-V PMP [66]) statically configured by firmware.

Full IO Passthrough. Pure static partitioning supports only passthrough IO. However, as highlighted by [7], there is a critical problem in providing full IO passthrough when controls over IO resources such as clocks, resets, power, or pin-muxes cannot be securely partitioned or shared, e.g., if their MMIO registers reside on the same frame or they are configured via platform management co-processors oblivious of the SPH's VMs. Thus, SPH should provide controlled guest access to these resources by emulation or through standard interfaces such as SCMI [67]. Nevertheless, this would require including drivers in the hypervisor, increasing its code base. Again, we urge hardware designers to provide hardware primitives that enable SPH to pass through IO resource controls.

X. RELATED WORK

There are several hypervisor analyses in the context of embedded and MCSs, but none provide a cross-sectional analysis and comparison of SPH. Some works focus on a single hypervisor, while others evaluate a single metric or feature. In [40], the authors compare the performance of Xvisor with Xen and KVM. Others have evaluated the effectiveness of cache coloring and bandwidth reservations in Xvisor [59]. Similarly, in [19], the authors evaluate cache and DRAM bank coloring in Jailhouse. Other works have evaluated Jailhouse interrupt latency [68] or VM interference [69]. There are also studies about the feasibility of using Xen and KVM as real-time hypervisors [70], but mainly for x86. Little has been published regarding the new Xen Dom0-less and cache coloring features, but results can be found in [71]. Evaluation of the seL4 CAmkES VMM has also been done for performance and interrupt latency [29]. There have been works providing a qualitative analysis of MCS hypervisors, contrasting architectural approaches and highlighting future trends [72], while others lay out guidelines on how to choose such a hypervisor in industrial settings [51].

XI. CONCLUSION

We have conducted the most comprehensive empirical evaluation of open-source SPH to date, focusing on key metrics for MCS. With that, we drew a set of observations that (i) will help industrial practitioners understand the trade-offs of SPH and (ii) raise awareness in the research and open-source communities of the still open problems in SPH. We are opening all artifacts to enable independent validation of results and to encourage further exploration of SPH.
XII. ACKNOWLEDGMENTS

We would like to express our gratitude to the reviewers for their valuable feedback and suggestions, as well as to our friendly shepherd for guiding us in making final improvements. Additionally, we appreciate the time and thoughtful input from all the representatives of SPH, namely Ralf Ramsauer (Jailhouse), Stefano Stabellini (Xen), and Gernot Heiser (seL4/CAmkES-VMM). José Martins was supported by FCT grant SFRH/BD/138660/2018. This work is supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope UIDB/00319/2020, and European Union's Horizon Europe research and innovation program under grant agreement No 101070537, project CROSSCON (Cross-platform Open Security Stack for Connected Devices).

REFERENCES

[1] J. Cerrolaza et al., "Multi-Core Devices for Safety-Critical Systems: A Survey," ACM Computing Surveys, 2020.
[2] M. Staron, Contemporary Software Architectures: Federated and Centralized. Springer International Publishing, 2021.
[3] A. Burns and R. Davis, "A Survey of Research into Mixed Criticality Systems," ACM Computing Surveys, 2017.
[4] A. Esper et al., "An industrial view on the common academic understanding of mixed-criticality systems," Real-Time Systems, 2018.
[5] J. Hwang et al., "Xen on ARM: System Virtualization Using Xen Hypervisor for ARM-Based Secure Mobile Phones," in Proc. of Consumer Communications and Networking Conference, 2008.
[6] C. Dall and J. Nieh, "KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor," ACM SIGARCH Computer Architecture News, 2014.
[7] R. Ramsauer et al., "A Novel Software Architecture for Mixed Criticality Systems," in Digital Transformation in Semiconductor Manufacturing, 2020.
[8] J. Martins et al., "Bao: A Lightweight Static Partitioning Hypervisor for Modern Multi-Core Embedded Systems," in Proc. of Workshop on Next Generation Real-Time Embedded Systems (NG-RES), 2020.
[9] S. VanderLeest and D. White, "MPSoC hypervisor: The safe & secure future of avionics," in Proc. of Digital Avionics Systems Conference (DASC), 2015.
[10] P. Burgio et al., "A software stack for next-generation automotive systems on many-core heterogeneous platforms," Microprocessors and Microsystems, 2017.
[11] F. Rehm et al., "The Road towards Predictable Automotive High-Performance Platforms," in Proc. of Design, Automation and Test in Europe Conference (DATE), 2021.
[12] J. Martins, "ESRGv3/shedding-light-static-partitioning-hypervisors: v0.1.0," 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7696937
[13] Arm, "Learn the architecture: AArch64 Virtualization," https://developer.arm.com/documentation/den0125/latest, 2022.
[14] A. Gordon et al., "ELI: Bare-Metal Performance for I/O Virtualization," SIGPLAN Notices, 2012.
[15] G. Gracioli et al., "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems," ACM Computing Surveys, 2015.
[16] Arm, "Software Delegated Exception Interface (SDEI)," https://developer.arm.com/documentation/den0054/latest, 2021.
[17] R. Ramsauer et al., "Look Mum, no VM Exits! (Almost)," in Proc. of Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT), 2017.
[18] ——, "Static Hardware Partitioning on RISC-V – Shortcomings, Limitations, and Prospects," in Proc. of IEEE World Forum on Internet of Things, 2022.
[19] T. Kloda et al., "Deterministic Memory Hierarchy and Virtualization for Modern Multi-Core Embedded Systems," in Proc. of Real-Time and Embedded Technology and Applications Symposium (RTAS), 2019.
[20] P. Sohal et al., "E-WarP: A System-wide Framework for Memory Bandwidth Profiling and Management," in Proc. of Real-Time Systems Symposium (RTSS), 2020.
[21] "jailhouse-rt: BU-maintained version of the Jailhouse partitioning hypervisor with real-time features." [Online]. Available: https://github.com/rntmancuso/jailhouse-rt
[22] A. Biondi et al., "SPHERE: A Multi-SoC Architecture for Next-Generation Cyber-Physical Systems Based on Heterogeneous Platforms," IEEE Access, 2021.
[23] G. Corradi, "Xen on Arm: Real-Time Virtualization with Cache Coloring," in Proc. of Embedded World Conference, 2020.
[24] "Zephyr project," Feb 2023. [Online]. Available: https://www.zephyrproject.org/
[25] B. Sa et al., "A First Look at RISC-V Virtualization from an Embedded Systems Perspective," IEEE Transactions on Computers, 2021.
[26] G. Klein et al., "seL4: Formal Verification of an OS Kernel," in Proc. of ACM Symposium on Operating Systems Principles (SOSP), 2009.
[27] G. Heiser, "The seL4 Microkernel: An Introduction," The seL4 Foundation, 2020.
[28] G. Klein et al., "Formally Verified Software in the Real World," Communications of the ACM, 2018.
[29] J. Millwood et al., "Performance Impacts from the seL4 Hypervisor," in Proc. of the Ground Vehicle Systems Engineering and Technology Symposium, 2020.
[30] A. Lyons et al., "Scheduling-Context Capabilities: A Principled, Light-Weight Operating-System Mechanism for Managing Time," in Proc. of European Conference on Computer Systems (EuroSys), 2018.
[31] Q. Ge et al., "Time Protection: The Missing OS Abstraction," in Proc. of European Conference on Computer Systems (EuroSys), 2019.
[32] T. Murray et al., "seL4: From General Purpose to a Proof of Information Flow Enforcement," in Proc. of IEEE Symposium on Security and Privacy (S&P), 2013.
[33] G. Klein et al., "Comprehensive Formal Verification of an OS Microkernel," ACM Transactions on Computer Systems, 2014.
[34] G. Heiser et al., "Towards Provable Timing-Channel Prevention," ACM SIGOPS Operating Systems Review, 2020.
[35] ——, "Can We Put the "S" Into IoT?" in Proc. of IEEE World Forum on Internet of Things, 2022.
[36] Y. Li et al., "A Virtualized Separation Kernel for Mixed Criticality Systems," SIGPLAN Notices, 2014.
[37] R. West et al., "A Virtualized Separation Kernel for Mixed-Criticality Systems," ACM Transactions on Computer Systems, 2016.
[38] S. Sinha and R. West, "Towards an integrated vehicle management system in driveos," ACM Transactions on Computer Systems, 2021.
[39] H. Li et al., "ACRN: A Big Little Hypervisor for IoT Development," in Proc. of International Conference on Virtual Execution Environments (VEE), 2019.
[40] A. Patel et al., "Embedded Hypervisor Xvisor: A Comparative Analysis," in Proc. of Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015.
[41] U. Steinberg and B. Kauer, "NOVA: A Microhypervisor-Based Secure Virtualization Architecture," in Proc. of European Conference on Computer Systems, 2010.
[42] S. Pinto et al., "LTZVisor: TrustZone is the Key," in Proc. of Euromicro Conference on Real-Time Systems (ECRTS), 2017.
[43] J. Martins et al., "µRTZVisor: A Secure and Safe Real-Time Hypervisor," Electronics, 2017.
[44] S. Pinto and N. Santos, "Demystifying Arm TrustZone: A Comprehensive Survey," ACM Computing Surveys, 2019.
[45] "perf: Linux profiling with performance counters." [Online]. Available: https://perf.wiki.kernel.org/index.php/Main_Page
[46] M. Guthaus et al., "MiBench: A free, commercially representative embedded benchmark suite," in Proc. of International Workshop on Workload Characterization (WWC), 2001.
[47] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," in Proc. of USENIX Security Symposium, 2007.
[48] P. Valsan et al., "Taming Non-Blocking Caches to Improve Isolation in Multicore Real-Time Systems," in Proc. of Real-Time and Embedded Technology and Applications Symposium (RTAS), 2016.
[49] W. Hofer et al., "Sloth: Threads as Interrupts," in Proc. of Real-Time Systems Symposium (RTSS), 2009.
[50] Arm Ltd., "Arm Generic Interrupt Controller v3 and v4 - Virtualization," 2022.
[51] E. Hamelin et al., "Selection and evaluation of an embedded hypervisor: Application to an automotive platform," in Proc. of European Congress of Embedded Real Time Software and Systems, 2020.
[52] A. Golchin and R. West, "Jumpstart: Fast Critical Service Resumption for a Partitioning Hypervisor in Embedded Systems," in Proc. of Real-Time and Embedded Technology and Applications Symposium (RTAS), 2022.
[53] Xilinx, "Zynq UltraScale+ Device: Technical Reference Manual," https://docs.xilinx.com/v/u/en-US/ug1085-zynq-ultrascale-trm, 2020.
[54] S. Biggs et al., "The Jury Is In: Monolithic OS Design Is Flawed: Microkernel-Based Designs Improve Security," in Proc. of Asia-Pacific Workshop on Systems, 2018.
[55] A. Danial, "cloc - count lines of code," https://github.com/AlDanial/cloc.
[56] A. Mygaiev and S. Stabellini, "Xen FuSa SIG update," in Xen Project Developer and Design Summit, 2021. [Online]. Available: https://www.youtube.com/watch?v=XMNaIWZ-2sU
[57] S. Roozkhosh and R. Mancuso, "The potential of programmable logic in the middle: Cache bleaching," in Real-Time and Embedded Technology and Applications Symposium (RTAS), 2020.
[58] H. Yun et al., "MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms," in Proc. of Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013.
[59] P. Modica et al., "Supporting temporal and spatial isolation in a hypervisor for ARM multicore platforms," in Proc. of International Conference on Industrial Technology (ICIT), 2018.
[60] Arm Ltd., "Arm Architecture Reference Manual Supplement - Memory System Resource Partitioning and Monitoring (MPAM), for A-profile architecture," 2022.
[61] G. Ghaemi et al., "Governing with Insights: Towards Profile-Driven Cache Management of Black-Box Applications," in Proc. of Euromicro Conference on Real-Time Systems (ECRTS), 2021.
[62] M. Zini et al., "Profiling and controlling I/O-related memory contention in COTS heterogeneous platforms," Software: Practice and Experience, 2022.
[63] A. Panchamukhi and F. Mueller, "Providing task isolation via TLB coloring," in Real-Time and Embedded Technology and Applications Symposium (RTAS), 2015.
[64] A. Baumann et al., "The multikernel: A new OS architecture for scalable multicore systems," in Proc. of ACM Symposium on Operating Systems Principles (SOSP), 2009.
[65] X. Li et al., "Design and verification of the Arm Confidential Compute Architecture," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022.
[66] D. Lee et al., "Keystone: An open framework for architecting trusted execution environments," in Proc. of European Conference on Computer Systems (EuroSys), 2020.
[67] Arm Ltd., "Arm System Control and Management Interface - Platform Design Document, Version 3.1," 2022.
[68] I. Pavic and H. Dzapo, "Virtualization in multicore real-time embedded systems for improvement of interrupt latency," in Proc. of International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018.
[69] J. Danielsson et al., "Testing Performance-Isolation in Multi-core Systems," in Proc. of Annual Computer Software and Applications Conference (COMPSAC), 2019.
[70] L. Abeni and D. Faggioli, "Using Xen and KVM as real-time hypervisors," Journal of Systems Architecture, 2020.
[71] S. Stabellini, "Xen Cache-Coloring: Interference Free Real-Time Systems," in Open Source Summit (North America), 2020. [Online]. Available: https://www.youtube.com/watch?v=9cA0QK2CdwQ
[72] M. Cinque et al., "Virtualizing mixed-criticality systems: A survey on industrial trends and issues," Future Generation Computer Systems, 2022.
In SPH implementations like Bao and Jailhouse, where the direct injection technique is used, the impact on interrupt priority management and latency is minimized, providing near-native latencies and correct priority handling even under interference. However, latency in other SPH like Xen is higher due to a less optimized injection path. Priority management is critical, as ensuring correct priority order is essential for real-time operating systems. Limitations arise when interference increases, even with direct injection in use, indicating inherent architectural challenges in maintaining consistent latency across varied workloads.

SPH implementation affects interrupt handling predictability through the efficiency of the interrupt paths and handling mechanisms. Bao and Jailhouse have optimized paths leading to predictable interrupt latencies despite some variance under interference. In contrast, Xen experiences higher latency and variance due to a less optimized path. Interrupt latency predictability is further influenced by reductions in L2 cache misses achieved through VM- and hypervisor-level coloring, although complete mitigation is challenging.

Different SPH experience varying levels of L2 cache interference impacting interrupt latency. Bao eliminates hypervisor L2 cache misses with full coloring, enhancing predictability, whereas Xen, lacking complete coloring, suffers higher latencies. This discrepancy indicates that while VM-level coloring lowers cache interference, hypervisors require their own coloring strategies to ensure low latencies, highlighting the relationship between cache management strategies and system performance. The inability to fully mitigate hypervisor L2 cache misses in Xen suggests inherent challenges in optimizing SPH performance and latency.

Hypervisor coloring has limited benefit for Xen and Bao in terms of performance, as it only reduces performance degradation from interference by a marginal amount (at most 1% additional improvement). While it helps reduce L2 cache misses at the hypervisor level in Bao, the benefits in Xen are minimal due to its partially implemented coloring patch. This suggests that while beneficial for certain aspects, coloring alone may not be sufficient to address all interference issues effectively.

Multicore memory hierarchy interference significantly degrades guest performance, primarily due to increased L2 cache misses, affecting execution flow and reducing overall system efficiency. While page coloring can eliminate inter-core conflict misses, it does not fully resolve interference, leading to increased performance overhead in hypervisors, particularly in busy multi-threaded environments.

The direct injection technique is effective in reducing interrupt latency to near-native levels under no-interference conditions, as it avoids additional traps to the hypervisor. For example, with direct injection, average latency drops to near-native levels (243 ns for Bao and 232 ns for Jailhouse). However, latency and variance still increase under interference, albeit less than in systems without direct injection. This technique requires careful system design to maintain its effectiveness across different workloads and interference situations.

Cache coloring is effective in reducing guest L2 cache misses back to base-case values but is less effective for hypervisor L2 cache misses under interference. Hypervisor cache misses increase substantially, and only by coloring at the hypervisor level can these be minimized. On Bao, hypervisor L2 cache misses are essentially eliminated, unlike Xen, where latency does not fully return to non-interference levels due to an incomplete cache coloring implementation.

Interrupt latency correlates with the architecture and interrupt handling path of each hypervisor. Bao and Jailhouse have optimized interrupt paths, leading to lower latency increases (4x and 5x) compared to Xen (14x) and seL4-VMM (47x). This difference arises from the number of instructions executed upon an interrupt: Bao and Jailhouse execute around 200 instructions, while Xen executes about 1050. The variance in latency is minimal in optimized paths but significant in microkernel-based architectures like seL4-VMM due to multiple traps to the microkernel.

Critical factors affecting interrupt latency variance in Bao and Jailhouse during interference include the efficacy of the direct injection technique and the level of cache coloring applied. Direct injection reduces traps to the hypervisor, decreasing latency variance, yet interference still pushes latency above the 243 ns and 232 ns base values observed for Bao and Jailhouse with direct injection. Cache coloring brings guest L2 cache misses back to base levels, yet hypervisor-level misses require additional coloring to further reduce variance. These factors indicate the need to optimize both software techniques and hardware resource management to minimize latency variance effectively.

Interference significantly impacts benchmark execution across all hypervisors due to increased L2 cache misses. This leads to performance degradation, especially in seL4-VMM, which executes a higher number of instructions and is therefore more susceptible, reaching up to 125% in the worst case. Coloring, although helpful in reducing some interference, itself costs about 20% in performance because it forces smaller pages, reducing TLB reach and available cache space and increasing memory pressure. Cache partitioning via page coloring eliminates inter-core conflict misses but does not fully mitigate total interference, with up to a 38 pp increase in overhead.