
The NIC should be part of the OS.

Pengcheng Xu
ETH Zurich, Switzerland
[email protected]

Timothy Roscoe
ETH Zurich, Switzerland
[email protected]
arXiv:2501.10138v1 [cs.OS] 17 Jan 2025

ABSTRACT
The network interface adapter (NIC) is a critical component of a modern cloud server which occupies a unique position. Not only is network performance vital to the efficient operation of the machine, but unlike application-oriented compute accelerators like GPUs, the network subsystem must react to unpredictable events like the arrival of a network packet and communicate with the appropriate application end point with minimal latency.

Current approaches to server stacks navigate a trade-off between flexibility, efficiency, and performance: the fastest kernel-bypass approaches dedicate cores to applications, busy-wait on receive queues, etc., while more flexible approaches appropriate to more dynamic workload mixes incur much greater software overhead on the data path.

However, we reject this trade-off, which we ascribe to an arbitrary (and sub-optimal) split in system state between the OS and the NIC. Instead, by exploiting the properties of cache-coherent interconnects and integrating the NIC closely with the OS kernel, we can achieve something surprising: performance for RPC workloads better than the fastest kernel-bypass approaches without sacrificing the robustness and dynamic adaptation of kernel-based network subsystems.

1 INTRODUCTION
The NIC is central to the operation of a data center server, responsible for all performance-critical communication with the machine. It differs from other devices in a critical aspect: unpredictable events (packets arriving) happen to it during normal operation. This is in contrast to components like GPUs (to which the OS submits application tasks with relatively predictable behavior) or local storage devices (where the OS issues a request, assuming a response will arrive).

Networking in modern servers is notable for a clear partition of networking state between the OS, application, and NIC. This has evolved over many years, and modern NICs (including new architectures proposed in research) encode a number of implicit and explicit assumptions about trust, OS design, and applications, which we survey in section 2.

Surprisingly, these assumptions are preserved in high-performance kernel-bypass architectures which attempt to bring the NIC closer to the application, or CPU designs which integrate the NIC with the processor cores. These techniques deliver gains in performance, but at the cost of flexibility: bypass generally relies on a relatively fixed assignment of processes to cores and queues, together with busy-waiting to achieve higher speeds. This works well for fairly static workloads, but has limited applicability for more dynamic application mixes.

Our focus in this paper is on the network receive path, although it is also closely connected to the transmit path. We also focus on Remote Procedure Calls (RPCs), whether data center application calls or serverless function invocations. While some are large, the great majority of network requests and responses are small [23].

Our goal is to exploit this insight to reduce the CPU cycle overhead of a small RPC call to essentially zero for many workloads, within an architecture that nevertheless supports highly dynamic workloads with better performance than modern kernel-based stacks.

While the end-to-end latency of RPCs is dominated by propagation time, end-system latency reflects CPU cycles consumed by the invocation and is therefore a good measurable proxy for the efficiency of the software stack (unmarshaling, demultiplexing, function dispatch, etc.).

We suggest that kernel-bypass optimization is at its limit, and argue for a radically different, more OS-centric approach based on new interconnects and rich sharing of state between the NIC and the OS: the NIC should be a full, trusted component of the OS itself. We explore what this means for hardware and software designs, and describe our efforts to build a prototype to demonstrate the idea.

2 THE TRADITIONAL NIC PARADIGM
Most server network stacks, and also most NIC hardware designs, are based around the model in Figure 1: incoming packets are demultiplexed and transferred using Direct Memory Access (DMA) into one of a set of descriptor-based queues, with interrupts used for synchronization when the OS has stopped polling the queue. DMA occurs using addresses that are translated (and protected) via an I/O Memory Management Unit (IOMMU) or System Memory Management Unit (SMMU), and interrupts can be steered to cores using the demultiplexing information, or some other heuristic.

This model has evolved over decades from the very simplistic model of Ethernet interface used in the Xerox Alto, to handle 400 Gb/s network links connected to machines with 100s of cores.
[Figure 1: Architecture of a traditional PCIe DMA NIC's receive path. Packets arrive at the Ethernet MAC; a DMA engine moves them through an Rx descriptor ring into buffer queues in server DRAM; an interrupt-driven scheduler/dispatcher on a CPU core then delivers them through socket queues to RPC stub handlers in user space. The stages incur DMA, scheduling, and protocol overheads respectively.]

We discuss recent alternatives below, but observe that most kernel-bypass approaches still look like Figure 1, but move some parts from the OS kernel to application user space. In more detail, a minimal set of things that have to happen to turn a network packet into a function invocation on the host is:

(1) Read the packet contents.
(2) Perform protocol processing (checksums, etc.).
(3) Demultiplex the packet to an in-memory queue.
(4) Interrupt some CPU core to notify the OS, hypervisor, or guest OS.
(5) Perform some general protocol processing.
(6) Identify the OS process (or thread, or task) that should handle the message.
(7) Find a (perhaps different) core to execute this process.
(8) Schedule the process on the core.
(9) Context switch to the process if needed.
(10) Unmarshal/deserialize arguments and function name.
(11) Find the address of the start of the function.
(12) Jump to this instruction.

A typical NIC performs steps 1 to 4, and then hands things over to the OS or application.
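As a concrete illustration, here is how these steps map onto a conventional descriptor-ring receive path, in a deliberately simplified C sketch; the descriptor layout and all helper functions are hypothetical rather than taken from any real driver.

    #include <stdint.h>
    #include <stddef.h>

    #define RING_SIZE 256
    #define DESC_DONE 0x1

    struct rx_desc {              /* one slot in the Rx descriptor ring */
        uint64_t buf_addr;        /* IOMMU-translated DMA address       */
        uint16_t len;
        uint16_t flags;           /* NIC sets DESC_DONE after steps 1-3 */
    };

    struct task;
    extern void *dma_buf_to_virt(uint64_t buf_addr);
    extern void proto_process(void *pkt, uint16_t len);    /* step 5    */
    extern struct task *demux_to_task(void *pkt);          /* step 6    */
    extern int pick_core(struct task *t);                  /* step 7    */
    extern void enqueue_and_wake(struct task *t, int core, void *pkt);

    /* Step 4 brought us here: the NIC raised an interrupt. */
    void rx_irq_handler(struct rx_desc *ring, size_t *head)
    {
        while (ring[*head].flags & DESC_DONE) {
            void *pkt = dma_buf_to_virt(ring[*head].buf_addr);

            proto_process(pkt, ring[*head].len);    /* step 5          */
            struct task *t = demux_to_task(pkt);    /* step 6          */
            enqueue_and_wake(t, pick_core(t), pkt); /* steps 7-9       */

            ring[*head].flags = 0;                  /* recycle slot    */
            *head = (*head + 1) % RING_SIZE;
        }
        /* Steps 10-12 (unmarshal, look up, jump) then run in the
         * woken process's RPC stub. */
    }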
Kernel-bypass approaches like Arrakis [18], IX [3], and Demikernel [24] variously trade off latency and throughput against flexibility and energy efficiency by replacing step 4 with spinning or polling, and simplifying steps 5 through 9 by binding application processes to in-memory queues in advance. The data plane is moved to application space, while the control plane can be left in the OS kernel or moved to dedicated cores (Shinjuku [12], Shenango [17], Caladan [7]) or userspace processes (ghOSt [8]). Snap [14], meanwhile, dedicates a subset of the CPU cores to provide applications a uniform, yet highly configurable, abstraction of a NIC that allows rapid deployment of new network stack features.

All these approaches, however, retain the same division of labor between software and the NIC; indeed, all of them resemble Figure 1. The principal differences concern the kernel/user-space boundary (and where the different receive path stages execute) and the design of the control plane (which is implemented in software on the CPU as a separate component). To a large extent, kernel bypass turns what was OS functionality into application-level functionality, integrating the NIC more with the user application.

Other work from architecture has explored closely integrating the NIC with the CPU. nanoPU [10] delivers packets processed by P4 directly into the register file of a RISC-V core, while CC-NIC [22] uses a NUMA server to explore by emulation the implications of cache-coherent peripheral interconnects for NICs. This, again, preserves the same hardware/software boundary, while heavily optimizing hardware steps 3 and 4.

As with bypass, this works well when the workload is relatively static, can be bound to dedicated cores, and is rarely idle. However, when the workload is dynamic with many more end-points than spare cores, the up-front cost of mapping the NIC's demultiplexing to queues onto the scheduling of applications on cores quickly becomes cumbersome. Even newer Data Processing Unit (DPU) [1] and Infrastructure Processing Unit (IPU) [9] systems share these characteristics. Moreover, tightly coupling the NIC and CPU may not be desirable: networking parts do not develop in lock-step with CPUs, and different workloads have very different compute-to-network I/O ratios, so there is valuable flexibility gained by keeping the NIC as a separate component.

3 WHY THIS SPLIT?
The historical stability of the hardware/software boundary in NICs is arguably due to the state required to perform each step. For example, steps 10-12 require application-specific state: argument formats, interface signatures, and code layout. In contrast, steps 5-9 cannot be performed without reference to central OS state.

A key factor is that the OS doesn't trust the NIC. Kernel developers have been keen to limit the coupling between OS and NIC [16] due to the perception that the NIC never does quite what the OS designer wants.
Ironically, the result continues to be an increase in the complexity of device drivers, as hardware vendors adopt ad-hoc solutions to exposing functionality to users. This in turn means that the functionality that the vendors add to a NIC is limited to that which can easily be exposed to users.

Moreover, the introduction of IOMMUs and SMMUs has led to a philosophy that, as far as possible, the NIC should not be trusted as a device. This is an anomaly, given that devices like disks, CPU cores, GPUs, and DRAM are, for the most part, trusted by at least part of the OS. One reason for this is confusion about the different roles of the SMMU: on the one hand, providing a convenient memory translation function on the data path to facilitate device pass-through to virtual machines, and on the other, firewalling off a kernel running on a set of application cores from the rest of the machine.

The philosophy is compounded by protocols like RDMA, which regard the NIC as a relatively dumb device with little connection to the host OS that can nevertheless perform memory accesses on behalf of a remote peer, using an authorization framework that is naive at best in multi-tenant scenarios. A more OS-centric perspective on RDMA-like functionality views the NIC as providing to the OS additional, specialized cores close to the network interface which can execute a limited number of RPC operations.

A related factor is that architecture researchers like to ignore the OS [15]. User applications are a different matter, and so there are many proposals for accelerating subsets of steps 10-12 for memory latency benefits. Cereal [11] proposed an accelerator targeting a custom message format; the accelerator sits directly on the system interconnect inside the CPU package. Optimus Prime [19] proposed a format-agnostic transformation architecture and likewise focused on implementing an accelerator sitting on the system interconnect inside the CPU package. Cerebros [20] builds upon Optimus Prime towards a fully-offloading RPC framework. ProtoAcc [13] instead targets Protocol Buffers, with an accelerator attached to the custom RoCC [2] interface directly on the RISC-V core pipeline. Like kernel bypass, these primarily target static assignments of applications to cores and accelerate single-application performance, in part by imposing strong assumptions to minimize steps 5-9.
Underlying this apparent trade-off between performance (static assignment of cores) and flexibility (more OS involvement) is the misconception that fine-grained interaction between OS and NIC is slow: the NIC is not just untrustworthy but also hard to talk to. In cases such as Receive-Side Scaling (RSS), the goal is to provide offload (e.g. load balancing) without involving the OS at all. This assumption may hold for DMA descriptor rings, but much less so for loads/stores to device registers over modern PCI Express (PCIe), and even less when accessing the device over new, cache-coherent interconnects like CXL.mem 3.0.

4 BREAKING THE IMPASSE
We are pursuing a different approach in order to deliver performance better than current kernel bypass for relatively stable RPC and serverless workloads, while providing all the flexibility of the traditional approach with better efficiency for highly dynamic workloads. We exploit three new insights.

Firstly, new cache-coherent peripheral interconnects between devices and cores can radically change communication between CPU and NIC. Examples are CXL.mem 3.0 [6], CCIX [4], and the Enzian Coherence Interface (ECI) [5]. Crucially, this allows lightweight signaling to the device: a NIC can interpret cache operations on certain addresses as specific signals or requests, and return information back to the CPU in response or trigger other actions such as interrupts. Figure 2 shows the dramatically better interaction latency possible using even the (comparatively slow) ECI vs. DMA over PCIe on the same machine, and on a modern PC server; we anticipate comparable gains with CXL 3.0.

[Figure 2: 64-byte message round-trip latencies, comparing Enzian DMA, x86 DMA, and ECI on a 0-60 µs scale.]

For the data plane, such protocols allow packets to be transferred directly as cache lines to the destination core's L1 cache and registers [21], providing dramatically lower latency than can be achieved using DMA with descriptors. For the control plane, communication with CPU cores (running either application code or the OS kernel) is lightweight and efficient, and easily protected using conventional MMU mechanisms. It also conveys rich information: for example, a NIC can infer whether a core is polling in user mode or in kernel mode based on which address is requested from its home address space.
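As a concrete sketch of this kind of signaling (the layout below is our own illustration, not a defined Lauberhorn interface), the NIC can home separate control lines for kernel-mode and user-mode polling, so that the address of a read by itself identifies the poller:

    #include <stdint.h>

    #define CL_SIZE 128   /* cache-line size on Enzian */

    /* Hypothetical layout: the NIC homes two pairs of control lines
     * per end-point, one mapped only into the kernel, one only into
     * the owning process. Which line a core touches therefore tells
     * the NIC who is polling, with MMU mappings enforcing the split. */
    struct endpoint_ctrl {
        volatile uint8_t kernel_poll[2][CL_SIZE]; /* kernel-mode channel */
        volatile uint8_t user_poll[2][CL_SIZE];   /* user-mode channel   */
    };

    /* The load itself is the signal: the NIC observes the coherence
     * read for this address and can answer with data, defer, or treat
     * it as a doorbell. */
    static inline uint8_t poll_user(struct endpoint_ctrl *ep, int idx)
    {
        return ep->user_poll[idx][0];
    }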
Secondly, it's time to trust the NIC. The NIC is a critical part of the OS function of the machine. Unlike, e.g., the GPU, it is enabling infrastructure for the whole system. Viewing it as a potential part of the OS rather than an untrusted peripheral is the only way to fully exploit its hardware resources and unique position in the data path.

In particular, since the NIC is responsible for demultiplexing an incoming packet to an application end-point, it should have access to all the relevant OS state: which processes are currently in the run queues on which cores, which are currently executing, and which are waiting. Some of this can be inferred from the cache traffic the NIC observes, as in the example above, while any other state can be explicitly pushed to the NIC via the interconnect with negligible overhead.
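As an illustration of such an explicit push (with entirely hypothetical field choices), the kernel could publish a per-core scheduling record over the coherent interconnect on every context switch:

    #include <stdint.h>

    enum core_state {
        CORE_IDLE,
        CORE_USER_POLLING,      /* process polling its own lines   */
        CORE_KERNEL_POLLING,    /* kernel dispatcher polling       */
        CORE_BUSY               /* running non-RPC work            */
    };

    struct core_sched_info {    /* one 128 B line per core         */
        uint32_t pid;           /* process currently on this core  */
        uint32_t state;         /* one of enum core_state          */
        uint64_t waiting_mask;  /* end-points with waiting threads */
    } __attribute__((aligned(128)));

    /* Called on the kernel's context-switch path: plain stores are
     * enough, since the coherence protocol makes them visible to
     * the NIC with negligible overhead. */
    static inline void publish_switch(volatile struct core_sched_info *tbl,
                                      int core, uint32_t pid, uint32_t st)
    {
        tbl[core].pid   = pid;
        tbl[core].state = st;
    }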

Finally, adopting the previous positions allows us to fully implement RPC dispatch on the NIC. Integrating existing techniques for accelerating deserialization with rich knowledge of the OS state enables RPC dispatch with essentially zero software overhead: in the common case, it is possible to execute every step in Section 2 on the NIC, and have a stalled load on a processor core return a carefully prepared cache line with only the information needed to dispatch an RPC: just the arguments and the virtual address of the first instruction of the target function to jump to.

In addition, sharing the OS state means that this efficiency is preserved when executing highly dynamic workloads where statically associating DMA queues, cores, threads, and sockets is not practical: the NIC already has information about whether, and where, a target process is running and can notify either it or the OS accordingly. Moreover, the OS has up-to-date information from Lauberhorn about which cores are polling and where, which can guide scheduling decisions.

5 IMPLEMENTATION SKETCH
We are building a prototype, Lauberhorn, to demonstrate the feasibility of our position. Lauberhorn exploits the large FPGA, 100 Gb/s interfaces, and cache-coherent interconnect on the Enzian research computer [5]. Figure 3 shows an overview of a minimal receive path; we return to additional, non-functional issues to be addressed in this design in section 7. The receive path of the Lauberhorn hardware can be divided into RPC decoding and transferring the request to the CPU.

[Figure 3: Overview of the Lauberhorn receive path. On the smart NIC, the Ethernet MAC feeds a decoder pipeline (DECRYPT, DECOMPRESS, RPC DECODE) and a multi-level scheduler, which exchanges Poll/Dispatch signals and control/packet info with CPU cores running RPC handlers on the server.]

Lauberhorn fully decodes an incoming RPC request packet into (1) a process and communication end-point, (2) a code pointer and data pointer inside that process corresponding to the request, and (3) the call arguments, corresponding to steps 1-3, 5-6, 10, and 11 in section 2. It can do this based on information provided by the application (type signatures, etc.) and OS state provided by the kernel (scheduling information, described in detail below).

In the FPGA implementation, an Ethernet frame streams in from the Ethernet MAC IP block and passes through various streaming-mode header decoders to demultiplex the packet and remove the Ethernet, IP, and UDP headers, possibly being stored in SRAM for complex protocol layers that require buffering. At the RPC stub level, arguments are unmarshalled with Optimus Prime [19].

For each invocation, the protocol decoders on Lauberhorn create a control metadata structure containing arguments, etc. for the CPU to process. At a low level, Lauberhorn delivers this structure to the CPU using an extension of the protocol described by Ruzhanskaia et al. [21] (Figure 4). For each end-point, our Lauberhorn protocol uses two control cache lines plus multiple overflow cache lines to handle payloads which do not fit into a single cache line (128 B on Enzian). The transmit path uses a similar, disjoint set of cache lines.

[Figure 4: The Lauberhorn protocol between NIC and CPU. FPGA-homed RX and TX control cache lines sit between the NIC's scheduler queues and a CPU core: the core reads the RX control line, jumps to the returned handler, writes the result back, and reads the next line; the NIC then collects the response.]

The CPU loads from one cache line to receive a request. When the NIC has received and parsed a packet, it responds to this load with the decoded control data structure for that packet. The CPU executes the handler, writes the RPC result into the same control cache line, and loads the second cache line for the next packet. Lauberhorn sees the load for the second cache line and knows that the CPU has finished serving the first request. Before responding to the read on the second cache line, the NIC issues a clean-invalidate command to the coherence protocol for the first cache line to fetch the RPC response from the CPU and send it out to the network. Finally, when the next packet arrives and is decoded, Lauberhorn responds to the CPU's read on the second control cache line.
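Seen from the CPU, this exchange is nothing more than loads and stores on two cache lines. The following C sketch captures that pattern; the control-line layout is our guess for illustration, not the actual Lauberhorn encoding.

    #include <stdint.h>

    #define CL_SIZE 128                /* cache-line size on Enzian       */

    struct ctrl_cl {                   /* one NIC-homed control line      */
        uint64_t handler_pc;           /* virtual address to jump to      */
        uint64_t args[14];             /* unmarshalled inline arguments   */
        uint64_t result;               /* CPU writes the RPC result here  */
    };
    _Static_assert(sizeof(struct ctrl_cl) == CL_SIZE, "exactly one line");

    typedef uint64_t (*rpc_fn)(volatile uint64_t *args);

    void serve(volatile struct ctrl_cl cl[2])
    {
        for (int cur = 0; ; cur ^= 1) {
            /* This load stalls until the NIC answers with a decoded
             * request (or a TryAgain dummy, discussed below). */
            rpc_fn fn = (rpc_fn)(uintptr_t)cl[cur].handler_pc;

            /* Steps 10-12 collapse into an indirect jump. */
            cl[cur].result = fn(cl[cur].args);

            /* Looping around to load the *other* line tells the NIC we
             * are done; it clean-invalidates this line to collect the
             * result and sends the response out on the network. */
        }
    }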
Timeouts in the CPU's memory subsystem mean that Lauberhorn cannot block a cache line load indefinitely without causing an irrecoverable "bus error" exception and leaving the system in an inconsistent state. We therefore avoid this by returning TryAgain dummy messages after 15 ms, reducing the polling overhead (both bus traffic and CPU spinning) to almost zero and improving energy efficiency.

6 OS INTEGRATION
A key novelty of Lauberhorn is how it uses precise kernel scheduling state to dispatch requests. In the fast case, a request arrives directly at the correct process without kernel intervention, since Lauberhorn is aware which core is running the process and waiting for a cache line holding the request. Otherwise, Lauberhorn quickly delivers the request to the kernel, allowing it to schedule the target process and deliver the unmarshalled request in software. Figure 5 compares this approach to the traditional Linux dispatch loop.

[Figure 5: Comparison between normal task scheduling and NIC-driven scheduling of RPC isolation domains. On the left, Linux task scheduling moves between user and kernel mode via IRQs, syscalls, and schedule(), dispatching runnable tasks from the run queue; on the right, the smart NIC dispatches directly to polling RPC processes, with yields returning the core to the kernel and its critical tasks. The markers (1), (2), and (3) in the text refer to transitions in this figure.]
A CPU core starts running a regular kernel thread which uses the protocol in Section 5 to monitor a pair of control cache lines for incoming requests; Lauberhorn can dispatch a request for any process to this CPU core, whereupon the CPU switches to the corresponding process to handle the request. As it is a conventional kernel thread, it periodically calls schedule() (3) and can handle regular critical kernel operations like Read-Copy-Update.

Thereafter (1), the core remains in the same process and runs a user-mode loop on a different pair of control cache lines which Lauberhorn has dedicated to that process. At this point, dispatching requests to this service involves almost no software overhead: the load executed by the core immediately returns the address to jump to.
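A minimal sketch of the kernel-mode dispatch thread just described, assuming invented message formats and helper names, might look as follows.

    #include <stdint.h>

    enum knl_msg { KMSG_TRY_AGAIN, KMSG_DISPATCH };

    struct dispatch_msg {
        uint32_t kind;                 /* one of enum knl_msg          */
        uint32_t pid;                  /* target process               */
        uint64_t entry;                /* prepared user-mode entry     */
    };

    /* Blocking load on one of the kernel-mode control lines. */
    extern struct dispatch_msg nic_poll_kernel(int cl_idx);
    extern void switch_to_user(uint32_t pid, uint64_t entry);
    extern void schedule(void);        /* the kernel's own scheduler   */

    void lauberhorn_kthread(void)
    {
        for (int cur = 0; ; cur ^= 1) {
            struct dispatch_msg m = nic_poll_kernel(cur);

            if (m.kind == KMSG_DISPATCH)
                /* Enter the target process; it then serves requests
                 * on its own user-mode control lines. */
                switch_to_user(m.pid, m.entry);

            /* Conventional kernel thread: let RCU and the scheduler
             * make progress between requests. */
            schedule();
        }
    }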

The user-mode loop can give up the CPU in a variety of ways. Lauberhorn can request that it yield the core by returning a YieldNow message instead of an RPC request or a TryAgain message, asking the userspace process to execute a system call to return to the kernel-mode thread (2). Alternatively, the kernel and Lauberhorn can cooperate to fully preempt the user process by sending an interrupt to the core, and then resuming it (allowing it to receive the interrupt) with a subsequent YieldNow. Note that this can be initiated by the kernel scheduler, or by Lauberhorn based on what it knows about the instantaneous load on each server process, or both.
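A matching sketch of the user-mode loop and its exit paths, with the three message kinds from the text and otherwise invented names:

    #include <stdint.h>

    enum usr_msg { UMSG_TRY_AGAIN, UMSG_RPC_REQUEST, UMSG_YIELD_NOW };

    struct rpc_msg {
        uint32_t kind;                     /* one of enum usr_msg     */
        uint64_t handler_pc;               /* function to invoke      */
        volatile uint64_t *args;           /* arguments in the line   */
    };

    extern struct rpc_msg nic_poll(int cl_idx);      /* blocking load */
    extern void nic_post_result(int cl_idx, uint64_t result);
    extern void yield_syscall(void);   /* back to the kernel thread (2) */

    void user_serve_loop(void)
    {
        for (int cur = 0; ; cur ^= 1) {
            struct rpc_msg m = nic_poll(cur);

            if (m.kind == UMSG_YIELD_NOW) {      /* NIC wants the core */
                yield_syscall();
                continue;
            }
            if (m.kind == UMSG_TRY_AGAIN)        /* timeout dummy      */
                continue;

            uint64_t (*fn)(volatile uint64_t *) =
                (uint64_t (*)(volatile uint64_t *))(uintptr_t)m.handler_pc;
            nic_post_result(cur, fn(m.args));
        }
    }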
This approach also handles dynamic scaling of the cores used for RPC based on load. Many data center deployments use non-preemptive kernels for throughput. Lauberhorn provides dynamic load information to the kernel (using, again, the kernel-mode control channels) to reallocate cores between RPC services and non-RPC processes. A non-preemptive kernel thread waiting on Lauberhorn can be reallocated by sending it a Retire message from the NIC.

7 OTHER CONCERNS AND OPEN QUESTIONS
Lauberhorn as described so far will support full-featured RPC interaction with high efficiency, but is lacking some non-functional features that become important in real data center settings. While encryption can be handled with fairly standard techniques, support for tracing, debugging, and statistics presents interesting opportunities for further close integration with the OS.
For large messages, the direct, low-latency approach becomes less efficient and it is best to revert to DMA-based transfers, since throughput comes to dominate over latency. The trade-off will depend on the platform; empirically, for Enzian this happens at about 4 KiB.
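A hedged sketch of the resulting policy, treating the 4 KiB crossover as a platform-specific constant:

    #include <stddef.h>

    #define XFER_CUTOVER (4 * 1024)   /* empirical crossover on Enzian */

    enum xfer_path { PATH_CACHELINE, PATH_DMA };

    /* Small messages take the low-latency cache-line path; large
     * ones revert to descriptor-based DMA, where throughput wins. */
    static inline enum xfer_path pick_path(size_t msg_len)
    {
        return msg_len <= XFER_CUTOVER ? PATH_CACHELINE : PATH_DMA;
    }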
Nested RPCs will benefit from the ability to rapidly create a dedicated end-point for an RPC reply. Fine-grained interaction with the NIC should make creating this continuation a cheap operation with significant performance benefits.

The design of standard OS-NIC and application-NIC interfaces is an open question, one which we hope to answer through building Lauberhorn as a prototype, thus evolving the interfaces we provide.
REFERENCES
[1] Amazon Web Services. The Security Design of the AWS Nitro System, Nov. 2022. https://docs.aws.amazon.com/whitepapers/latest/security-design-of-aws-nitro-system/security-design-of-aws-nitro-system.html.
[2] Asanović, K., Avizienis, R., Bachrach, J., Beamer, S., Biancolin, D., Celio, C., Cook, H., Dabbelt, D., Hauser, J., Izraelevitz, A., Karandikar, S., Keller, B., Kim, D., and Koenig, J. The Rocket Chip Generator.
[3] Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C., and Bugnion, E. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (USA, 2014), OSDI '14, USENIX Association, pp. 49–65.
[4] CCIX Consortium and others. Cache Coherent Interconnect for Accelerators (CCIX), May 2024.
[5] Cock, D., Ramdas, A., Schwyn, D., Giardino, M., Turowski, A., He, Z., Hossle, N., Korolija, D., Licciardello, M., Martsenko, K., Achermann, R., Alonso, G., and Roscoe, T. Enzian: An open, general, CPU/FPGA platform for systems software. In ASPLOS '22: Proceedings of the Twenty-Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (Feb. 2022).
[6] CXL Consortium. Compute Express Link (CXL) version 3.0, Aug. 2022.
[7] Fried, J., Ruan, Z., Ousterhout, A., and Belay, A. Caladan: Mitigating Interference at Microsecond Timescales. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 281–297.
[8] Humphries, J. T., Natu, N., Chaugule, A., Weisse, O., Rhoden, B., Don, J., Rizzo, L., Rombakh, O., Turner, P., and Kozyrakis, C. ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany, Oct. 2021), ACM, pp. 588–604.
[9] Humphries, J. T., Natu, N., Kaffes, K., Novaković, S., Turner, P., Levy, H., Culler, D., and Kozyrakis, C. Tide: A Split OS Architecture for Control Plane Offloading, Oct. 2024. arXiv:2408.17351.
[10] Ibanez, S., Mallery, A., Arslan, S., Jepsen, T., Shahbaz, M., Kim, C., and McKeown, N. The nanoPU: A Nanosecond Network Stack for Datacenters. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) (2021), pp. 239–256.
[11] Jang, J., Jung, S. J., Jeong, S., Heo, J., Shin, H., Ham, T. J., and Lee, J. W. A Specialized Architecture for Object Serialization with Applications to Big Data Analytics. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (Valencia, Spain, May 2020), IEEE, pp. 322–334.
[12] Kaffes, K., Chong, T., Humphries, J. T., Belay, A., Mazières, D., and Kozyrakis, C. Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19) (2019), pp. 345–360.
[13] Karandikar, S., Leary, C., Kennelly, C., Zhao, J., Parimi, D., Nikolic, B., Asanovic, K., and Ranganathan, P. A Hardware Accelerator for Protocol Buffers. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece, Oct. 2021), ACM, pp. 462–478.
[14] Marty, M., de Kruijf, M., Adriaens, J., Alfeld, C., Bauer, S., Contavalli, C., Dalton, M., Dukkipati, N., Evans, W. C., Gribble, S., Kidd, N., Kononov, R., Kumar, G., Mauer, C., Musick, E., Olson, L., Rubow, E., Ryan, M., Springborn, K., Turner, P., Valancius, V., Wang, X., and Vahdat, A. Snap: A microkernel approach to host networking. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2019), SOSP '19, Association for Computing Machinery, pp. 399–413.
[15] Mogul, J., Baumann, A., Roscoe, T., and Soares, L. Mind the Gap: Reconnecting Architecture and OS Research. In Proceedings of the 13th Workshop on Hot Topics in Operating Systems (HotOS-XIII) (Napa, CA, USA, May 2011).
[16] Mogul, J. C. TCP offload is a dumb idea whose time has come. In 9th Workshop on Hot Topics in Operating Systems (HotOS IX) (Lihue, HI, May 2003), USENIX Association.
[17] Ousterhout, A., Fried, J., Behrens, J., Belay, A., and Balakrishnan, H. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19) (2019), pp. 361–378.
[18] Peter, S., Li, J., Zhang, I., Ports, D. R. K., Woos, D., Krishnamurthy, A., Anderson, T., and Roscoe, T. Arrakis: The Operating System is the Control Plane. In 11th Symposium on Operating Systems Design and Implementation (OSDI '14) (Broomfield, Colorado, USA, Oct. 2014).
[19] Pourhabibi, A., Gupta, S., Kassir, H., Sutherland, M., Tian, Z., Drumond, M. P., Falsafi, B., and Koch, C. Optimus Prime: Accelerating Data Transformation in Servers. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland, Mar. 2020), ACM, pp. 1203–1216.
[20] Pourhabibi, A., Sutherland, M., Daglis, A., and Falsafi, B. Cerebros: Evading the RPC Tax in Datacenters. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece, Oct. 2021), ACM, pp. 407–420.
[21] Ruzhanskaia, A., Xu, P., Cock, D., and Roscoe, T. Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects, Sept. 2024. arXiv:2409.08141 [cs].
[22] Schuh, H. N., Krishnamurthy, A., Culler, D., Levy, H. M., Rizzo, L., Khan, S., and Stephens, B. E. CC-NIC: A Cache-Coherent Interface to the NIC. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA, Apr. 2024), ACM, pp. 52–68.
[23] Seemakhupt, K., Stephens, B. E., Khan, S., Liu, S., Wassel, H., Yeganeh, S. H., Snoeren, A. C., Krishnamurthy, A., Culler, D. E., and Levy, H. M. A Cloud-Scale Characterization of Remote Procedure Calls. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany, Oct. 2023), ACM, pp. 498–514.
[24] Zhang, I., Raybuck, A., Patel, P., Olynyk, K., Nelson, J., Leija, O. S. N., Martinez, A., Liu, J., Simpson, A. K., Jayakar, S., Penna, P. H., Demoulin, M., Choudhury, P., and Badam, A. The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany, Oct. 2021), ACM, pp. 195–211.
