Splitwise: Efficient Generative LLM Inference Using Phase Splitting
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
for TBT, and therefore the E2E latency. Finally, there is mixed batching (Figure 2(c)) [23]. With this batching, the scheduling decisions are made at each forward pass, and the prompt and token phases can run together. This reduces the impact on TBT, but does not eliminate it, since token phases scheduled with prompt phases will experience a longer runtime. In the rest of the paper, we use mixed batching unless stated otherwise.

E. Model parallelism

Model parallelism can be used to divide a model onto multiple GPUs, and even multiple machines, for higher efficiency and memory capacity. LLM inference typically uses pipeline and tensor parallelism. Pipeline parallelism (PP) divides the layers of the model among the GPUs, while keeping all the operators and tensors within a layer on the same GPU. Tensor parallelism (TP) divides the tensors across the GPUs, while replicating all the layers on each GPU. Pipeline parallelism requires less communication across the participating GPUs, while tensor parallelism requires high-bandwidth communication for each layer. In general, tensor parallelism performs better for GPUs within the same machine, connected with high-bandwidth interconnects like NVLink [15]. In the rest of the paper, we use tensor parallelism across 8 GPUs for the best latency.

F. GPU clusters and interconnects

With the recent rise of LLM use cases, several cloud service providers have expanded their GPU-based offerings, leading to large GPU cluster deployments [5], [56], [57]. Each machine in these AI clusters generally comprises 8 flagship NVIDIA GPUs (A100 or H100). Each GPU is connected to all the other GPUs in the cluster with a high-bandwidth Mellanox InfiniBand interconnect [10], [13], forming a high-bandwidth data plane network. The InfiniBand bandwidth offered in the cloud today ranges from 25 to 50 GBps per GPU pair [7], [10].

III. CHARACTERIZATION

In this section, we explore the performance and utilization characteristics of LLM inference and draw key insights to guide the design of Splitwise.

Production traces. We use production traces taken from two Azure LLM inference services on November 11th, 2023. Our traces represent the most common scenarios in LLM inference today: coding and conversation. We have released a subset of our traces at https://github.com/Azure/AzurePublicDataset [4]. The traces we use for characterization are 20 minutes long and include the arrival time, input size (number of prompt tokens), and output size (number of output tokens). Due to customer privacy requirements (e.g., GDPR), we do not have visibility into the content of the prompts. We instead use the production traces to guide the input and output sizes: we send an input prompt with the required number of tokens and force the model to generate the corresponding number of output tokens for each request. Note that the text of the input prompts does not impact the performance metrics that we benchmark, since they depend only on the input and output sizes. For this characterization, we do not reuse the KV-cache between requests, to emulate a cloud service with security guarantees.

Fig. 3: Distribution for prompt and generated tokens. (a) Prompt input tokens. (b) Generated output tokens.

Models. Table III shows the models that we evaluate. Both BLOOM [69] and Llama2 [71] are state-of-the-art open-source LLMs. Both models are decoder-only, transformer-based models. We use the version of each model with the most parameters, since these versions are the most representative of production-class accuracy. Unless stated otherwise, we run BLOOM-176B and Llama-70B on vLLM [51] on a machine with 8 H100 [16] GPUs.

A. Number of prompt and generated tokens

To better understand our traces, we examine the distribution of the number of prompt input and generated output tokens. Figure 3a shows the distribution of the number of prompt tokens. Since the coding LLM inference service is generally used to generate completions as the user is writing code, its input prompt can include large chunks of the code written so far. Thus, it has a large median prompt size of 1500 tokens. On the other hand, the conversation service has a wider range of input prompt tokens, since it depends on the user. The median number of prompt tokens for this trace is 1020 tokens.

Figure 3b shows the distribution of the number of generated tokens. Since the coding service typically only generates the next few words in the program as the user types, its median number of output tokens is 13. On the other hand, the conversation service has an almost bimodal distribution, with a median of 129 generated tokens.

Insight I: Different inference services may have widely different prompt and token distributions.
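A minimal sketch of how such size-only traces can drive a benchmark, sending a filler prompt of the required length and forcing the output length; all data, field names, and function names below are synthetic, not the released trace schema:

```python
import statistics
from dataclasses import dataclass

# Synthetic stand-in for a trace row; the real traces carry arrival time,
# prompt size, and output size (the values below are made up).
@dataclass
class Request:
    arrival_s: float
    prompt_tokens: int
    output_tokens: int

trace = [Request(0.0, 1500, 13),    # coding-like: long prompt, short output
         Request(0.4, 980, 120),    # conversation-like
         Request(0.9, 1100, 140)]

def replay(req, generate):
    # Only sizes matter for the benchmarked metrics, so any filler prompt of
    # the right length works, and generation is forced to the traced length.
    prompt = ["tok"] * req.prompt_tokens
    return generate(prompt, force_len=req.output_tokens)

out = replay(trace[0], lambda p, force_len: ["x"] * force_len)
print(len(out))  # 13 forced output tokens
print(statistics.median(r.prompt_tokens for r in trace))  # 1100
```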
Fig. 4: Cumulative distribution of time spent with various active batched tokens.

Fig. 6: Impact of batching on the throughput for the 2 LLMs. (a) Prompt phase. (b) Token generation phase.
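One iteration of the mixed batching described in Section II-D can be sketched as follows; the scheduling policy, names, and token budget are our own simplification of Figure 2(c), not the paper's scheduler:

```python
from collections import deque

def mixed_batching_step(waiting_prompts, decoding, token_budget=2048):
    # One scheduling decision per forward pass: running decodes contribute one
    # active token each; queued prompts join with all their tokens while the
    # batch's token budget allows.
    batch, active = [], 0
    for req in decoding:
        batch.append((req, 1))
        active += 1
    while waiting_prompts and active + waiting_prompts[0][1] <= token_budget:
        req, n = waiting_prompts.popleft()
        batch.append((req, n))
        active += n
    return batch, active

prompts = deque([("r3", 1500), ("r4", 700)])   # (request id, prompt tokens)
batch, active = mixed_batching_step(prompts, ["r1", "r2"])
print(active)         # 1502: two decode tokens plus one 1500-token prompt
print(list(prompts))  # [('r4', 700)]: r4 waits for the next iteration
```

This also illustrates the active-token accounting used in the characterization below: a prompt counts all of its tokens, while a decoding request counts as one.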
B. Batch utilization

To understand how much these requests can be batched, we measure how often machines run at a given batch size. We use mixed continuous batching as shown in Figure 2. To fit into a single machine, we run a scaled-down version of the coding and conversation traces with 2 requests per second.

Figure 4 shows the distribution of the time spent by the machine running various numbers of active tokens in a batch. Note that if a prompt of 100 tokens is running in its prompt phase, we count the active tokens as 100. However, once the request is in the token phase, we count it as one active token, since the tokens are generated one at a time (assuming a beam search size of one [51]). We find that most of the time (60–70%) for conversation is spent running only 20 tokens or fewer. Since the coding service has very few output tokens, it experiences even worse batching in the token phase and runs with a single token for more than 20% of the time. Both LLMs show very similar trends.

Insight II: Mixed continuous batching spends most of the time with very few active tokens batched.

C. Latency

TTFT. Figure 5a shows the impact of the number of prompt tokens on TTFT. The range of sizes was chosen based on the coding and conversation traces. We find that TTFT for both models grows almost linearly as the prompt size increases. This behavior is due to the prompt phase having high GPU utilization and being computationally bound.

TBT. Figure 5b shows the impact of forcefully batching the output tokens of different requests together on the TBT. We observe very little impact on TBT as the batch size grows. Even with a batch size of 64, there is only a 2× impact on TBT.

E2E. Figure 5c shows various percentiles of E2E latency for both models, with no batching. The variability between the request input and output sizes is apparent. Furthermore, we see that most of the E2E time is spent running the token phase. This holds true even for the coding trace, where prompt sizes are large and generated tokens few. In fact, we find that for BLOOM-176B, a prompt phase with 1500 input tokens takes the same time as a token phase with only 6 output tokens.

Insight III: For most requests, the majority of the E2E time is spent in the token generation phase.

D. Throughput

Figure 6 shows the impact of batching on the throughput (measured as tokens per second). For the prompt phase, we define the throughput as the number of prompt input tokens that are processed per second. We see that the throughput decreases after 2048 prompt tokens, which corresponds to a batch size of less than 2 for the median prompt sizes from the traces. On the other hand, Figure 6b shows that the throughput in the token phase keeps increasing with batching until a batch size of 64, at which point the machine runs out of memory.

Insight IV: The prompt phase batch size should be limited to ensure good performance. In contrast, batching the token generation phase yields high throughput without any downside.

E. Memory utilization

During an LLM inference, the GPU memory is used to host the model weights and activations, as well as the KV caches (Section II-B). As the number of tokens in a batch increases, the memory capacity required for the KV cache also increases. Figure 7 shows the memory capacity utilization during each phase as the number of tokens in the batch increases. During the prompt phase, the input prompt tokens generate the KV
TABLE IV: P50 request metrics on A100 vs. H100 without batching on Llama-70B.

                   Coding                      Conversation
          A100      H100      Ratio     A100      H100      Ratio
TTFT      185 ms    95 ms     0.51×     155 ms    84 ms     0.54×
TBT       52 ms     31 ms     0.70×     40 ms     28 ms     0.70×
E2E       856 ms    493 ms    0.58×     4957 ms   3387 ms   0.68×
Cost [5]  $0.42     $0.52     1.24×     $2.4      $3.6      1.5×
Energy    1.37 Whr  1.37 Whr  1×        7.9 Whr   9.4 Whr   1.2×

Fig. 8: Maximum and mean power utilization varying the batching size. (a) Prompt phase. (b) Token generation phase.

the two from running on different hardware. Table I shows the specifications for DGX-A100 [15] and DGX-H100 [16]. The memory-to-compute ratio favors A100 over H100. Table IV
(Figure: Cluster-Level Scheduler (CLS) with prompt machines; KV-cache transfer per iteration vs. per layer.)
123
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
can be generated in the token generation phase. This directly
impacts the maximum TBT and end-to-end latency of inference.
The time required for the transfer depends on the size of
the KV cache (which is directly proportional to the number
of prompt tokens) and on the bandwidth of the interconnect
between the prompt and the token machines. Even when using
fast InfiniBand links, the transfer overhead for large prompt
sizes could become a significant fraction of the TBT.
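To see why, a back-of-the-envelope estimate of the KV-cache transfer time; the model dimensions and link speed below are illustrative assumptions (roughly Llama-2-70B-like with grouped-query attention), not numbers from the paper:

```python
def kv_cache_gb(prompt_tokens, num_layers=80, num_kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    # Bytes per token per layer for K and V (hence the factor 2);
    # the default dimensions are illustrative, not the paper's.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return prompt_tokens * per_token / 1e9

def transfer_ms(prompt_tokens, bandwidth_gbps=200):
    # A 200 Gbps InfiniBand link moves about 25 GB/s.
    return kv_cache_gb(prompt_tokens) / (bandwidth_gbps / 8) * 1e3

print(round(kv_cache_gb(2048), 2))  # ~0.67 GB for a 2048-token prompt
print(round(transfer_ms(2048), 1))  # ~26.8 ms: comparable to a whole TBT
```

Under these assumptions, a serialized transfer for a long prompt costs on the order of one token-generation step, which motivates overlapping the transfer with prompt computation.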
In Splitwise, we optimize the KV-cache transfer by overlapping it with the computation in the prompt phase. As each layer
in the LLM gets calculated in the prompt machine, the KV
cache corresponding to that layer is also generated. At the end
of each layer, we trigger an asynchronous transfer of the KV-
cache for that layer while the prompt computation continues
to the next layer. Figure 11b shows this asynchronous transfer
which reduces the transfer overheads. Layer-wise transfer also
enables other optimizations, such as earlier start of the token
phase in the token machines, as well as earlier release of KV-cache memory on the prompt machines.

Layer-wise KV-cache transfer happens in parallel with the prompt computation for the next layer. This requires fine-grained synchronization per layer for correctness. Thus, it is possible to incur performance interference and increase the TTFT, especially for smaller prompts. However, for small prompts the total KV-cache size is small and does not need the layer-wise transfer to hide the latency. Since the number of tokens in a batch is already known at the start of computation, Splitwise picks the best technique for KV-cache transfer: it uses serialized KV-cache transfer for smaller prompts and layer-wise transfer for larger prompts. We show that the overall transfer and interference overheads are relatively small in Section VI-A.

D. Provisioning with Splitwise

We leverage Splitwise to optimize LLM inference cluster deployments for power, cost, and throughput.

Type of machines. We propose four main variants of Splitwise-based systems: Splitwise-AA, Splitwise-HH, Splitwise-HA, and Splitwise-HHcap. The nomenclature is simply drawn from the first letter representing the prompt machine type and the second letter representing the token machine type. "A" represents a DGX-A100 machine, "H" represents a DGX-H100 machine, and "Hcap" represents a power-capped DGX-H100 machine. Table V shows a summary of the cost, power, and hardware in each of our evaluated systems.

Splitwise-AA uses DGX-A100 for both the prompt and token pools, while Splitwise-HH uses DGX-H100 for both. These two variants represent the setups commonly available from providers, where machines are homogeneous and interchangeable.

Splitwise-HA uses DGX-H100 for the prompt pool and DGX-A100 for the token pool. We choose this configuration based on Table IV and Insight VII (i.e., A100s can be more cost- and power-efficient for the token phase).

Splitwise-HHcap uses DGX-H100 machines for both the prompt and token pools. However, we power cap the token machines down to 70% of their rated power, with each GPU capped by 50% of the power. We propose this design based on Figure 9 and Insight VII (i.e., the prompt phase is impacted by power caps, while the token phase sees no performance impact with a 50% lower power cap per GPU).

Fig. 12: Design space for provisioning a Splitwise-HH cluster. The cluster configuration targets a peak throughput of 70 RPS. The cost-optimal Splitwise-HH configuration is marked with ⋆ (27 prompt and 3 token machines).

Number of machines. The LLM inference cluster deployment must be sized with the appropriate number of prompt and token machines. Our methodology involves searching the design space using our event-driven cluster simulator, which is described in detail in Section V. We need to provide as input: (1) the target cluster design (e.g., Splitwise-HA or Splitwise-HHcap), (2) an LLM-specific performance model that can estimate the TTFT and TBT at various input, output, and batch sizes, (3) a short trace derived from the target prompt and token size distributions for the service (e.g., Figure 3), (4) the SLOs (e.g., Table VI), (5) the constraints (e.g., throughput), and (6) the optimization goal (e.g., minimize cost). Using this information, our provisioning framework searches the space for the desired optimal point. For example, searching with a throughput constraint and a cost-minimization goal gives us iso-throughput cost-optimized clusters across different designs.

Search space. Figure 12 shows an example of the two-dimensional search space for the number of prompt and token machines under Splitwise-HH for the coding workload (using a 2-minute trace). The simulator outputs the various percentiles of TTFT, TBT, and E2E latencies. We then select the clusters that meet the SLOs for each of these metrics and optimize our target function. For example, Figure 12 shows a ⋆ for the setup with 27 prompt and 3 token machines, which has the lowest cost that achieves 70 RPS. We call this setup iso-throughput cost-optimized.

Optimization. We can use three optimization goals: throughput, cost, and power. Throughput optimization is important for both
                 Prompt Machine               Token Machine                Prompt-Token
                 Type       Cost    Power     Type       Cost    Power     Interconnect Bandwidth
Splitwise-AA     DGX-A100   1×      1×        DGX-A100   1×      1×        1×
Splitwise-HH     DGX-H100   2.35×   1.75×     DGX-H100   2.5×    1.75×     2×
Splitwise-HHcap  DGX-H100   2.35×   1.75×     DGX-H100   2.5×    1.23×     2×
Splitwise-HA     DGX-H100   2.35×   1.75×     DGX-A100   1×      1×        1×
the cloud service provider (CSP) and the user. Cost optimization has different importance levels for the CSP and the user. For the CSP, a higher cost for the same throughput might be acceptable if there are gains in the power and space requirements for the cluster. However, for the end-user, a higher cost at the same throughput is generally unacceptable. Finally, power optimization is attractive for a CSP, since it enables more GPUs to be deployed in the same datacenter [62], [63], but it may not be as important to the user. We only consider the provisioned power, and not the dynamic power utilization, in our study.
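A toy version of this SLO-constrained, cost-minimizing search over prompt/token machine counts might look as follows; the simulator, SLO check, and cost model below are invented stand-ins, not the paper's event-driven simulator:

```python
from itertools import product

def provision(simulate, meets_slos, cost, max_machines=40):
    # Exhaustive 2-D search over (prompt, token) machine counts; return the
    # cheapest configuration that meets every SLO at the target load.
    best = None
    for p, t in product(range(1, max_machines), repeat=2):
        if meets_slos(simulate(p, t)) and (best is None or cost(p, t) < cost(*best)):
            best = (p, t)
    return best

# Toy stand-ins: TTFT slowdown shrinks with more prompt machines, TBT with
# more token machines; the caps loosely mimic P99 entries of Table VI.
simulate = lambda p, t: {"ttft_x": 80 / p, "tbt_x": 12 / t}
meets_slos = lambda m: m["ttft_x"] <= 6 and m["tbt_x"] <= 5
cost = lambda p, t: 2.5 * (p + t)   # identical per-machine price, for simplicity

print(provision(simulate, meets_slos, cost))  # (14, 3)
```

With a real performance model in place of `simulate`, the same loop yields the iso-throughput cost-optimized configurations discussed above.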
Fig. 13: Overview of the design of the Splitwise simulator.

E. Practical Considerations
Accuracy impact. Splitwise does not impact accuracy, since it uses lossless KV-cache transfer and does not add any randomization. It executes inference with the same parameters and state as on a single machine.

Scalability. Since LLM requests are much longer than typical ML requests [37], [38], they incur lower scheduling overhead for similar cluster sizes. However, the CLS may become a scalability bottleneck for large clusters. Insights from prior work on partitioned or replicated scheduling could help improve scalability [27], [61], [72] and are orthogonal to Splitwise.

Reliability and fault tolerance. If the prompt or the token machine fails, Splitwise simply restarts requests from scratch, similar to today's LLM serving systems [44], [51]. Alternatively, Splitwise could checkpoint the KV-cache generated after prompt computation into an in-memory database. To recover, Splitwise can use this cache to skip prompt recomputation and start right away with the token phase. The KV-cache could also be checkpointed periodically during the token phase. Designing safe and efficient failure recovery is out of scope for our paper.

V. METHODOLOGY

A. Experimental setup

To evaluate our proposal on real hardware, we implement Splitwise's KV-cache transfer mechanism on top of vLLM [51]. Our implementation is open source [1]. We run this modified vLLM on two DGX-A100 and two DGX-H100 virtual machines (VMs) on Microsoft Azure with specifications from Table I. These are the VMs used to collect the characterization data in Section III. These machines are connected with InfiniBand, and the DGX-H100s have double the bandwidth (i.e., 400 Gbps). Since vanilla vLLM only supports continuous batching with token preemption, which can lead to much higher TBT, we implement state-of-the-art mixed continuous batching [81] as discussed earlier in Figure 2(c).

Our implementation of the Splitwise technique assigns machines either a prompt role or a token role. As the prompt machine generates the first token, it transfers the KV-cache to the token machine using the technique described in Section IV-C. We use MSCCL++ [11], an optimized GPU-driven communication library, to implement the naive and layer-wise KV-cache transfers.

In our implementation, the prompt machine uses the zero-copy one-sided put primitive of MSCCL++ to send KV-cache data over InfiniBand as soon as it is ready, without requiring the token machine to issue any receive instructions. Once we have issued a put for all layers, the prompt machine signals a semaphore that the token machine waits on. The synchronization done with the help of semaphores uses the same InfiniBand connection used to send the KV-cache data. When processing a batch of prompts, each request is assigned a different semaphore, since it may be routed to different token machines. We ship the KV-caches block-by-block in vLLM. To minimize the number of transfers, we also consider the contiguity of KV blocks as long as they use the same semaphore.

B. Simulator setup

We build a simulator to explore cluster designs and evaluate Splitwise at scale. The simulator code is open source [20]. Figure 13 shows the design of our simulator. The simulator is event-driven and faithfully models the Splitwise machine pools, schedulers, machine-level memory and queues, and KV-cache transfer. We first profile the LLM on the target hardware with various input/output sizes (1). Based on the characterization profiles, we build a performance model. The simulator takes as input the request traces, SLOs, the performance model, and the configurations for cluster and scheduler (2). For our
TABLE VI: SLO expressed as slowdown compared to a request running on DGX-A100 under no contention.

       P50     P90    P99
TTFT   2×      3×     6×
TBT    1.25×   1.5×   5×
E2E    1.25×   1.5×   5×
evaluation, we use the prompt and token size distributions from the production traces in Section III. We tune the Poisson arrival rate to increase and decrease the load (requests per second) for cluster sizing. The simulator provides the achieved metrics per request (TTFT, TBT, E2E) and the machine utilization levels (3). We cross-validated the performance model with hardware experiments to ensure accuracy; we also validated the simulator end-to-end using production load with over 50K iterations to ensure fidelity (4).

Fig. 14: Overhead of the KV-cache transfer as the prompt size increases on A100s and H100s.

Fig. 15: Overhead of KV-cache transfer on TTFT and E2E latency for the coding trace on A100s and H100s. (Legend: per-layer, serialized, and 1-machine baseline, for both E2E and TTFT.)

Performance model. We build a piecewise-linear performance model using performance profiles at various batch sizes, input sizes, and output sizes, in the required parallelism configuration, on A100 and H100 machines from Section III. We validate that our performance model has high accuracy; it incurs a mean absolute percentage error (MAPE) of less than 3% when evaluated with an 80:20 train:test dataset split.
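Such a piecewise-linear model amounts to interpolating between profiled operating points. A minimal sketch, with invented profile numbers (the real model is fitted per hardware and parallelism configuration):

```python
from bisect import bisect_right

# Profiled (batched tokens -> prompt-phase latency in ms) pairs; the values
# here are invented for illustration, not measured.
profile = [(128, 35.0), (512, 90.0), (1024, 175.0), (2048, 380.0)]

def predict_ms(tokens):
    # Linear interpolation between the two neighboring profiled points,
    # clamped to the profiled range at either end.
    xs = [x for x, _ in profile]
    ys = [y for _, y in profile]
    if tokens <= xs[0]:
        return ys[0]
    if tokens >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, tokens)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)

print(predict_ms(768))  # midway between 512 and 1024 tokens -> 132.5
```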
Communication model. In our evaluation, KV-cache transfers cause inter-machine communication, whereas tensor parallelism only causes intra-machine communication. We model inter-machine communication overheads by benchmarking our KV-cache transfer implementation over InfiniBand in Section VI-A.

SLOs. To determine the maximum throughput that can be supported by a given cluster design, we use P50, P90, and P99 SLOs for the TTFT, TBT, and E2E latency metrics. Table VI shows our SLO definition using DGX-A100 as a reference. We require all nine SLOs to be met. The SLOs on TTFT are slightly looser, since TTFT has a much smaller impact on the E2E latency.

Baselines. We compare our Splitwise designs against Baseline-A100 and Baseline-H100. The clusters in these baselines consist of just DGX-A100s and DGX-H100s, respectively. Both baselines use the same mixed continuous batching that Splitwise uses for mixed-pool machines (described in Section IV-A).

VI. EVALUATION

A. Experimental results

KV-cache transfer latency. We first measure the latency to transfer the KV-cache as the prompt size grows. Figure 14 shows the visible transfer latency on both the A100 and H100 setups with the naive and optimized transfer designs, as discussed in Figure 11. Compared to the prompt computation time, the overhead is minimal (< 7%). The time for serialized transfers increases linearly with the prompt size, since the size of the KV-cache also increases. The optimized per-layer transfer, on the other hand, hides much of the latency. For these transfers, we see a constant non-overlapped transfer time of around 8 ms for the A100 setup and around 5 ms for the H100 setup. The H100 setup has double the bandwidth of the A100 setup (i.e., 200 vs. 400 Gbps), and the impact of this can be clearly seen, with transfers in the H100 setup happening about twice as fast as those in the A100 setup.

As discussed in Section IV-C, for small prompt sizes (< 512 on H100), Splitwise uses the serialized KV-cache transfer, and for larger prompts it uses per-layer transfers.

End-to-end impact. Next, we run the coding trace on the 2-machine Splitwise setups without batching, and compare the observed latency metrics to a 1-machine baseline setup with no batching. Figure 15 shows our results. The latency impact of serially transferring the KV-cache grows up to 3% of the E2E with large prompts. However, Splitwise only incurs 0.8% of the E2E. In a user-facing inference, the only visible impact of the KV-cache transfer overhead is the latency of the second token. Splitwise adds 16.5% latency to the second token, compared to the 64% overhead from a serialized transfer. Overall, the transfer impact in Splitwise is hardly perceivable, even in a user-facing inference.

B. Iso-power throughput-optimized clusters

Cluster provisioning. We provision clusters using the methodology described in Section IV-D. We target a specific workload (e.g., conversation) at a peak load with the same power (i.e., iso-power) for each cluster design. For the baseline, we use the power for 40 DGX-H100 machines as our target peak power. For the A100 baseline, we can fit 70 DGX-A100 machines under the same power budget. We denote these two designs as 40P/T and 70P/T respectively, since they both use mixed batching in all machines.
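The "all nine SLOs" requirement from Section V can be sketched as a table-driven check; the slowdown caps below come from Table VI, while the reference latencies are partly from Table IV and partly invented:

```python
# SLO grid from Table VI: allowed slowdown vs. an uncontended DGX-A100 request.
SLO = {"TTFT": {"P50": 2.0,  "P90": 3.0, "P99": 6.0},
       "TBT":  {"P50": 1.25, "P90": 1.5, "P99": 5.0},
       "E2E":  {"P50": 1.25, "P90": 1.5, "P99": 5.0}}

def meets_all_slos(measured_ms, reference_ms):
    # measured_ms / reference_ms: {metric: {percentile: latency in ms}}.
    return all(measured_ms[m][p] <= SLO[m][p] * reference_ms[m][p]
               for m in SLO for p in SLO[m])

# P50 references loosely follow Table IV (conversation on A100); the P90/P99
# values are invented for illustration.
ref = {"TTFT": {"P50": 155,  "P90": 310,  "P99": 620},
       "TBT":  {"P50": 40,   "P90": 55,   "P99": 75},
       "E2E":  {"P50": 4957, "P90": 7500, "P99": 9900}}
measured = {m: {p: 1.2 * v for p, v in d.items()} for m, d in ref.items()}

print(meets_all_slos(measured, ref))  # True: a uniform 1.2x slowdown passes all nine
```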
For Splitwise cluster designs under the coding trace, Splitwise-AA provisions 55 prompt machines and 15 for the token pool, denoted as (55P, 15T). Note that like Baseline-A100, Splitwise-AA also provisions 75% more machines than Baseline-H100. The legends in Figure 16 show the different provisioning choices under the coding and conversation workloads. The request size distributions are reflected in the machine pool sizing. For example, we provision more prompt machines under Splitwise-HH (35P, 5T) for the coding trace, while we provision more token machines (25P, 15T) for the conversation trace.

Latency and throughput. Figure 16 shows a deep dive into all the latency metrics at different input loads for each cluster design with the same power (i.e., iso-power). For the coding trace (Figure 16a), Splitwise-HH, Splitwise-HHcap, and Splitwise-AA all perform better than Baseline-H100. As the load increases, Baseline-H100 suffers from high TBT due to mixed batching with large prompt sizes. Although Splitwise-AA can support higher throughput, its TTFT is consistently higher than that of most designs. Splitwise-HA clearly bridges the
Fig. 17: Cumulative distribution of time spent at various batched token sizes for the iso-power throughput-optimized design. (a) Low load (70 RPS). (b) High load (130 RPS).

Fig. 19: Summary of iso-throughput cluster designs. (a) Power-optimized. (b) Cost-optimized.

(Figure legend: Baseline-A100 (70P/T), Baseline-H100 (40P/T), Splitwise-AA (55P, 15T), Splitwise-HH (35P, 5T), Splitwise-HA (35P, 8T), Splitwise-HHcap (35P, 7T); axes show normalized p90 TTFT, TBT, and E2E versus load in RPS.)
current and upcoming LLMs, as long as the auto-regressive nature of the workload requires these two phases. Note that, as shown in Section VI-D, clusters provisioned with Splitwise for one model can also efficiently serve other models.

Alternative compute hardware. In this work, we use NVIDIA H100 and A100 GPUs, since they are commonly used for LLM inference in datacenters today [17]. Smaller datacenter GPUs like the NVIDIA T4 lack enough memory to run modern LLMs efficiently. In general, our methodology is applicable to any hardware (including CPUs, FPGAs, and ASICs [33]) that aligns with the computational requirements of the prompt and token phases. Our characterization suggests that prompt phases need high compute capability and memory bandwidth with low memory capacity, whereas token phases need moderate compute capability with high memory capacity and bandwidth. Thus, GPUs like the AMD MI-250 [2] and CPUs like Intel Sapphire Rapids (with HBM) [9] could be effective token machines. Since we do not have access to such hardware and/or optimized LLM implementations, we leave this to future work.

Interconnect between prompt and token machines. In this work, we assume an InfiniBand connection between the prompt and token machines in all the designs (albeit with lower bandwidth when A100s are involved). Although this is common for homogeneous machines, Splitwise-HA is not readily available with an InfiniBand connection between H100s and A100s, even though it is technically feasible. The alternative could be HPC clouds, with InfiniBand connections through the CPU [3], or Ethernet using RoCE [58]. Given our optimized KV-cache transfer that helps reduce critical latency, an interconnect with 10× lower bandwidth would likely still be beneficial. To further reduce our bandwidth utilization, we could also compress the KV-cache before transferring it across the network [55].

Heterogeneous prompt/token machines. Although Splitwise is robust to varied models and input traces, we recognize that fragmenting a datacenter with different types of GPUs (e.g., Splitwise-HA) may bring its own challenges for the CSP.

Conversation back and forth. Chat APIs for LLMs today require the user to send the complete context of the conversation so far [18]. However, in the future, services may have enough GPU capacity to cache the context and avoid recomputation. This could sway the memory utilization pattern of the prompt

online monitoring with metrics like request length or hardware performance counters to identify workload phases and allocate them appropriately on heterogeneous processors. However, they do not consider the complexities of batching. Distributed dataflow systems orchestrate large-scale computational graphs and aim to provide general-purpose programmability [34], [46], [75], [82]. LLM inference under Splitwise can be viewed as a static computational graph with two stages, so it could be implemented using distributed frameworks that provide efficient GPU abstractions [59]. Splitwise differs from these works since it uses a specialized two-phase design for generative LLM inference and leverages phase-aware resource management with efficient batching.

Model serving systems. LLM inference serving is a rapidly developing field, with several recent works optimizing batching [23], [25], [51], [53], [81], scheduling [22], [42], [51], [66], [73], [79], and memory usage [32], [35], [51], [74]. Prior work has also proposed using CPUs and lower compute capability devices for LLM serving [8], [12]. These approaches use the same machine for both the prompt and token phases. With Splitwise, they could improve throughput and latency by splitting phases.

Prior work on video and ML serving focuses on scheduling model chains with data dependencies under latency constraints [24], [31], [43], [49], [68]. Such schedulers rely on model profiling to make efficient allocation decisions and manage requests across machines. Recommendation system inference exhibits compute/memory heterogeneity both within and across models. Prior work exploits such heterogeneity to selectively schedule requests between CPUs and accelerators [38], [52], colocate models with complementary memory usage [30], and partition compute/memory on heterogeneous hardware resources [45], [48]. Similarly, Splitwise exploits the heterogeneity within LLM inference requests. However, it uses different optimizations due to the differences in LLM workload characteristics and requirements.

IX. CONCLUSION

We extensively characterized the prompt computation and token generation phases of LLM inference to draw out differences in their system utilization patterns. Based on our insights, we designed Splitwise to separate these phases
phase from our characterization. Furthermore, it may require onto different machines and enable phase-specific resource
transferring the KV-cache back to a prompt machine to be management. Using Splitwise, we explored cluster designs
ready for the next conversation request. optimized for throughput, cost, and power, and showed that
they perform well even as workloads change. Splitwise clusters
VIII. R ELATED W ORK under performance SLOs achieve 1.76× better throughput with
Heterogeneous scheduling and dataflow systems. Prior 15% lower power at the same cost, or 2.35× better throughput
work has studied heterogeneous scheduling for a variety of with same the cost and power than existing designs.
interactive services [65], [68], [83]. These works exploit
ACKNOWLEDGEMENTS
hardware heterogeneity to strike a balance between different
objectives such as cost, energy, and performance. However, We thank the reviewers for their helpful feedback. We
they run the entire workload on the same machine. Research thank Chetan Bansal, Srikant Bhardwaj, Suriya Kalivardhan,
on heterogeneous multiprocessor CPU scheduling attempts to Ankur Mallick, Deepak Narayanan, and Amar Phanishayee for
match workload heterogeneity to hardware heterogeneity [29], insightful discussions. Pratyush Patel was partially supported
[40], [41], [50], [76], [80]. These works use profiling or by NSF CNS-2104548 and a research grant from VMware.
129
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
R EFERENCES and D. Amodei, “Language Models are Few-Shot Learners,” arXiv
preprint arXiv:2005.14165, 2020.
[1] Add Splitwise Implementation to vLLM. GitHub. [Online]. Available: [29] J. Chen and L. K. John, “Efficient Program Scheduling for Heterogeneous
https://github.com/vllm-project/vllm/pull/2809 Multi-core Processors,” in DAC, 2009.
[2] AMD Instinct™ MI250 Accelerator. [Online]. Available: https: [30] Y. Choi, J. Kim, and M. Rhu, “Hera: A Heterogeneity-Aware Multi-Tenant
//www.amd.com/en/products/server-accelerators/instinct-mi250 Inference Server for Personalized Recommendations,” arXiv preprint
[3] Azure InfiniBand HPC VMs. [Online]. Available: https://learn.microsoft. arXiv:2302.11750, 2023.
com/en-us/azure/virtual-machines/overview-hb-hc [31] D. Crankshaw, G.-E. Sela, X. Mo, C. Zumar, I. Stoica, J. Gonzalez, and
[4] Azure Public Dataset: Azure LLM Inference Trace 2023. GitHub. A. Tumanov, “InferLine: Latency-aware Provisioning and Scaling for
[Online]. Available: https://github.com/Azure/AzurePublicDataset/blob/ Prediction Serving Pipelines,” in SoCC, 2020.
master/AzureLLMInferenceDataset2023.md [32] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast
[5] CoreWeave - Specialized Cloud Provider. [Online]. Available: and Memory-efficient Exact Attention with IO-Awareness,” in NeurIPS,
https://www.coreweave.com 2022.
[6] Google Assistant with Bard. [Online]. Available: https://blog.google/ [33] David Patterson. Domain Specific Architectures for Deep Neural
products/assistant/google-assistant-bard-generative-ai/ Networks: Three Generations of Tensor Processing Units (TPUs).
[7] HPC Interconnect on CoreWeave Cloud. [Online]. Available: https: Allen School Distinguished Lecture. [Online]. Available: https:
//docs.coreweave.com/networking/hpc-interconnect //www.youtube.com/watch?v=VCScWh966u4
[8] Intel BigDL-LLM. [Online]. Available: https://github.com/intel-analytics/ [34] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on
BigDL Large Clusters,” Communications of the ACM, 2008.
[9] Intel Sapphire Rapids with HBM. [Online]. [35] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8():
Available: https://www.anandtech.com/show/17422/ 8-bit Matrix Multiplication for Transformers at Scale,” arXiv preprint
intel-showcases-sapphire-rapids-plus-hbm-xeon-performance-isc-2022 arXiv:2208.07339, 2022.
[10] Microsoft Azure ND A100 v4-series . [Online]. Available: https: [36] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training
//learn.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series of Deep Bidirectional Transformers for Language Understanding,” in
NAACL, 2019.
[11] MSCCL++: A GPU-driven communication stack for scalable AI
[37] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson,
applications. [Online]. Available: https://github.com/microsoft/mscclpp
and J. Mace, “Serving DNNs like Clockwork: Performance Predictability
[12] Numenta Inference on CPUs. [On-
from the Bottom Up,” in OSDI, 2020.
line]. Available: https://www.servethehome.com/
[38] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S.
numenta-has-the-secret-to-ai-inference-on-cpus-like-the-intel-xeon-max/
Lee, D. Brooks, and C.-J. Wu, “DeepRecSys: A System for Optimizing
[13] NVIDIA Accelerated InfiniBand Solutions. [Online]. Available:
End-to-end At-scale Neural Recommendation Inference,” in ISCA, 2020.
https://www.nvidia.com/en-us/networking/products/infiniband/
[39] V. Gupta, M. Harchol Balter, K. Sigman, and W. Whitt, “Analysis of
[14] NVIDIA Chip Shortage. [On- Join-the-Shortest-Queue Routing for Web Server Farms,” Performance
line]. Available: https://www.wired.com/story/ Evaluation, 2007.
nvidia-chip-shortages-leave-ai-startups-scrambling-for-computing-power/
[40] M. E. Haque, Y. H. Eom, Y. He, S. Elnikety, R. Bianchini, and K. S.
[15] NVIDIA DGX A100: Universal System for AI Infrastructure. [Online]. McKinley, “Few-to-Many: Incremental Parallelism for Reducing Tail
Available: https://resources.nvidia.com/en-us-dgx-systems/dgx-ai Latency in Interactive Services,” ACM SIGPLAN Notices, 2015.
[16] NVIDIA DGX H100. [Online]. Available: https://www.nvidia.com/en-us/ [41] M. E. Haque, Y. He, S. Elnikety, T. D. Nguyen, R. Bianchini, and
data-center/dgx-h100/ K. S. McKinley, “Exploiting Heterogeneity for Tail Latency and Energy
[17] NVIDIA Hopper GPUs Expand Reach as Demand for AI Efficiency,” in MICRO, 2017.
Grows. [Online]. Available: https://nvidianews.nvidia.com/news/ [42] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, H. Dong, and
nvidia-hopper-gpus-expand-reach-as-demand-for-ai-grows Y. Wang, “FlashDecoding++: Faster Large Language Model Inference
[18] OpenAI ChatGPT APIs. [Online]. Available: https://openai.com/blog/ on GPUs,” arXiv preprint arXiv:2311.01282, 2023.
introducing-chatgpt-and-whisper-apis [43] Y. Hu, R. Ghosh, and R. Govindan, “Scrooge: A Cost-effective Deep
[19] Power Availability Stymies Datacenter Growth. [On- Learning Inference System,” in SoCC, 2021.
line]. Available: https://www.networkworld.com/article/972483/ [44] Huggingface. Text Generation Inference (TGI). [Online]. Available:
power-availability-stymies-data-center-growth. https://github.com/huggingface/text-generation-inference
[20] SplitwiseSim: LLM Serving Cluster Simulator. GitHub. [Online]. [45] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A Chiplet-based,
Available: https://github.com/Mutinifni/splitwise-sim Hybrid Sparse-Dense Accelerator for Personalized Recommendations,”
[21] The New Bing. [Online]. Available: https://www.microsoft.com/en-us/ in ISCA, 2020.
edge/features/the-new-bing?form=MT00D8 [46] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed
[22] TurboMind Inference Server. [Online]. Available: https://github.com/ Data-Parallel Programs from Sequential Building Blocks,” in EuroSys,
InternLM/lmdeploy 2007.
[23] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and [47] M. Javaheripi and S. Bubeck, “Phi-2: The Surprising Power of Small
R. Ramjee, “SARATHI: Efficient LLM Inference by Piggybacking Language Models,” Microsoft Research Blog, 2023.
Decodes with Chunked Prefills,” arXiv preprint arXiv:2308.16369, 2023. [48] W. Jiang, Z. He, S. Zhang, K. Zeng, L. Feng, J. Zhang, T. Liu, Y. Li,
[24] H. Albahar, S. Dongare, Y. Du, N. Zhao, A. K. Paul, and A. R. Butt, J. Zhou, C. Zhang et al., “FleetRec: Large-scale Recommendation
“SchedTune: A heterogeneity-aware GPU Scheduler for Deep Learning,” Inference on Hybrid GPU-FPGA Clusters,” in KDD, 2021.
in CCGrid, 2022. [49] R. S. Kannan, L. Subramanian, A. Raju, J. Ahn, J. Mars, and L. Tang,
[25] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, “GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution
O. Ruwase, S. Smith, M. Zhang, J. Rasley, and Y. He, “DeepSpeed- Frameworks,” in EuroSys, 2019.
Inference: Enabling Efficient Inference of Transformer Models at [50] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen,
Unprecedented Scale,” in SC, 2022. “Single-ISA Heterogeneous Multi-core Architectures: The Potential for
[26] L. A. Barroso, U. Hölzle, and P. Ranganathan, “The Datacenter as a Processor Power Reduction,” in MICRO, 2003.
Computer: Designing Warehouse-Scale Machines,” Synthesis Lectures [51] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez,
on Computer Architecture, 2018. H. Zhang, and I. Stoica, “Efficient Memory Management for Large
[27] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and Language Model Serving with PagedAttention,” in SOSP, 2023.
L. Zhou, “Apollo: Scalable and Coordinated Scheduling for Cloud-scale [52] Y. Kwon, Y. Lee, and M. Rhu, “TensorDIMM: A Practical Near-Memory
Computing,” in OSDI, 2014. Processing Architecture for Embeddings and Tensor Operations in Deep
[28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, Learning,” in MICRO, 2019.
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- [53] Z. Li, L. Zheng, Y. Zhong, V. Liu, Y. Sheng, X. Jin, Y. Huang,
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, Z. Chen, H. Zhang, J. E. Gonzalez, and I. Stoica, “AlpaServe: Statistical
J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, Multiplexing with Model Parallelism for Deep Learning Serving,” in
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, OSDI, 2023.
130
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
[54] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, [78] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac,
L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. v. Platen,
BERT Pretraining Approach,” arXiv preprint arXiv:1907.11692, 2019. C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame,
[55] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art Natural
C. Zhang, Y. Tian, C. Re, and B. Chen, “Deja Vu: Contextual Sparsity Language Processing,” in EMNLP, 2020.
for Efficient LLMs at Inference Time,” in ICML, 2023. [79] B. Wu, Y. Zhong, Z. Zhang, G. Huang, X. Liu, and X. Jin, “Fast
[56] Meta. Introducing the AI Research SuperCluster — Meta’s Cutting- Distributed Inference Serving for Large Language Models,” arXiv preprint
Edge AI Supercomputer for AI Research. [Online]. Available: arXiv:2305.05920, 2023.
https://ai.facebook.com/blog/ai-rsc/ [80] H. Yang, Q. Chen, M. Riaz, Z. Luan, L. Tang, and J. Mars, “PowerChief:
[57] “Azure OpenAI Service,” Microsoft Azure, 2022. [Online]. Available: Intelligent Power Allocation for Multi-stage Applications to Improve
https://azure.microsoft.com/en-us/products/ai-services/openai-service Responsiveness on Power Constrained CMP,” in ISCA, 2017.
[58] R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Rat- [81] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A
nasamy, and S. Shenker, “Revisiting Network Support for RDMA,” arXiv Distributed Serving System for Transformer-Based Generative Models,”
preprint arXiv:1806.08159, 2018. in OSDI, 2022.
[59] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, [82] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J.
M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., “Ray: A Distributed Franklin, S. Shenker, and I. Stoica, “Resilient Distributed Datasets: A
Framework for Emerging AI Applications,” in OSDI, 2018. Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in NSDI,
[60] OpenAI. Scaling Kubernetes to 7,500 Nodes. [Online]. Available: 2012.
https://openai.com/research/scaling-kubernetes-to-7500-nodes [83] C. Zhang, M. Yu, W. Wang, and F. Yan, “MArk: Exploiting Cloud
[61] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, “Sparrow: Services for Cost-Effective, SLO-Aware Machine Learning Inference
Distributed, Low Latency Scheduling,” in SOSP, 2013. Serving,” in USENIX ATC, 2019.
[62] P. Patel, E. Choukse, C. Zhang, Í. Goiri, B. Warrier, N. Mahalingam, [84] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan,
and R. Bianchini, “POLCA: Power Oversubscription in LLM Cloud M. Diab, X. Li, X. V. Lin et al., “OPT: Open Pre-trained Transformer
Providers,” arXiv preprint arXiv:2308.12908, 2023. Language Models,” arXiv preprint arXiv:2205.01068, 2022.
[63] P. Patel, E. Choukse, C. Zhang, Í. Goiri, B. Warrier, N. Mahalingam, [85] W. Zhu, “Analysis of JSQ Policy on Soft Real-time Scheduling in Cluster,”
and R. Bianchini, “Characterizing Power Management Opportunities for in HPCAsia, 2000.
LLMs in the Cloud,” in ASPLOS, 2024.
[64] P. Patel, Z. Gong, S. Rizvi, E. Choukse, P. Misra, T. Anderson, and A PPENDIX
A. Sriraman, “Towards Improved Power Management in Cloud GPUs,”
in IEEE CAL, 2023. A. Abstract
[65] P. Patel, K. Lim, K. Jhunjhunwalla, A. Martinez, M. Demoulin, J. Nelson,
I. Zhang, and T. Anderson, “Hybrid Computing for Interactive Datacenter We open source critical components needed to evaluate
Applications,” arXiv preprint arXiv:2304.04488, 2023. Splitwise; these could be repurposed to also evaluate future
[66] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek,
K. Xiao, S. Agrawal, and J. Dean, “Efficiently Scaling Transformer LLM inference serving systems. Our artifact includes:
Inference,” in MLSys, 2023. ● Production traces from two LLM inference services at
[67] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language Models are Unsupervised Multitask Learners,” OpenAI blog, Microsoft Azure.
2019. ● A prototype implementation of Splitwise’s KV-cache
[68] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “INFaaS: transfer mechanism in vLLM [51].
Automated Model-less Inference Serving,” in USENIX ATC, 2021.
[69] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné,
● SplitwiseSim, a discrete event simulator to evaluate model
A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, serving in LLM inference clusters.
A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff,
A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-
Artifact functionality was only tested for the traces and
Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, SplitwiseSim due to limited hardware availability.
H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, and C. Raffel, “BLOOM:
A 176B-Parameter Open-access Multilingual Language Model,” arXiv B. Artifact check-list (meta-information)
preprint arXiv:2211.05100, 2022.
[70] P. Schmid. Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging ● Data set: Production traces available as a part of the artifact.
Face Transformers. [Online]. Available: https://www.philschmid.de/ ● Run-time environment: Linux / Ubuntu.
fine-tune-flan-t5-deepspeed ● Hardware: Two machines connected over GPU Infiniband
[71] P. Schmid, O. Sanseviero, P. Cuenca, and L. Tunstall. Llama for the vLLM prototype (e.g. NVIDIA DGX-A100, NVIDIA
2 is here - Get it on Hugging Face. [Online]. Available: DGX-H100). x86-64 CPU machine for SplitwiseSim.
https://huggingface.co/blog/llama2
● Publicly available?: Yes.
[72] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes,
“Omega: Flexible, Scalable Schedulers for Large Compute Clusters,”
● Code licenses (if publicly available)?: MIT.
in EuroSys, 2013. ● Data licenses (if publicly available)?: CC-BY.
[73] Y. Sheng, S. Cao, D. Li, B. Zhu, Z. Li, D. Zhuo, J. E. Gonzalez, and ● Archived (provide DOI)?: 10.5281/zenodo.11003049.
I. Stoica, “Fairness in Serving Large Language Models,” arXiv preprint
arXiv:2401.00588, 2023. C. Description
[74] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang,
C. Ré, I. Stoica, and C. Zhang, “FlexGen: High-Throughput Generative
Inference of Large Language Models with a Single GPU,” in ICML,
How to access. The entire artifact is available as an archive on
2023. Zenodo: https://doi.org/10.5281/zenodo.11003049. Individual
[75] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop components are also available online as follows:
Distributed File System,” in MSST, 2010.
[76] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, ● The production traces can be downloaded from the Azure
“Scheduling Heterogeneous Multi-cores Through Performance Impact Public Dataset GitHub repository [4].
Estimation (PIE),” ACM SIGARCH Computer Architecture News, 2012. ● The KV-cache transfer prototype can be downloaded from
[77] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” in NeurIPS, the vLLM GitHub repository, currently available as a pull
2017. request [1].
131
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
● SplitwiseSim, and the associated experiment and plotting Data sets. Coding and conversation traces from Microsoft
scripts, can be downloaded from a separate GitHub Azure are available online as a part of the artifact release [4].
repository [20].
D. Installation and Experiment Workflow
Hardware dependencies. The KV-cache transfer prototype Please refer to the README files within the artifact for
requires two GPU machines connected over Infiniband, such as installation and usage instructions.
NVIDIA DGX-A100s or NVIDIA DGX-H100s. SplitwiseSim
requires a standard x86-64 CPU machine; multiple machines E. Methodology
may be used to parallelize simulation runs. Submission, reviewing and badging methodology:
Software dependencies. The KV-cache transfer prototype is ● https://www.acm.org/publications/policies/
built on top of vLLM [51] and MSCCL++ [11]. SplitwiseSim artifact-review-and-badging-current
depends on a small set of publicly available Python packages, ● http://cTuning.org/ae/submission-20201122.html
which can be installed via the included requirements.txt. ● http://cTuning.org/ae/reviewing-20201122.html
132
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.