
Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management


Tuowei Wang* (Tsinghua University, Beijing, China), Ruwen Fan* (Tsinghua University, Beijing, China), Minxing Huang (Tianjin University, Tianjin, China),
Zixu Hao (Tsinghua University, Beijing, China), Kun Li (Microsoft Research, Beijing, China), Ting Cao (Microsoft Research, Beijing, China),
Youyou Lu (Tsinghua University, Beijing, China), Yaoxue Zhang (Tsinghua University, Beijing, China), Ju Ren† (Tsinghua University, Beijing, China)

arXiv:2410.19274v2 [cs.LG] 29 Oct 2024

* Both authors contributed equally to this research.
† Corresponding author: Ju Ren ([email protected]).

Abstract

Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe IOPS constraints.

In this paper, we propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Ripple leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and optimize data transfer efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Ripple achieves up to 5.93× improvements in I/O latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Ripple explores a new optimization space at the intersection of sparsity-driven algorithms and storage-level system co-design in LLM inference.

1 Introduction

Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of applications [15, 30, 44, 45, 52, 64]. Comprising millions or even billions of parameters [5, 8, 14, 28, 54, 56, 76], these models demand substantial computational and memory resources, typically available only in state-of-the-art data centers. Nonetheless, there is an increasing demand for deploying LLMs on resource-constrained devices, such as smartphones [33, 55, 61, 62, 71, 74]. On one hand, stringent privacy regulations necessitate local data processing to protect user information. On the other hand, LLMs on smartphones facilitate customization based on user habits, enabling enhanced personalization.

Given the limited DRAM capacity of devices, LLMs on smartphones are typically constrained to models specially designed for mobile deployment [1, 54, 72]. Although these models are lightweight, the reduction in parameters inevitably leads to a compromise in their capabilities [29]. As an alternative, many recent works [4, 37, 40, 50, 69, 81] explore the exploitation of inherent sparsity within LLMs to address memory limitations. Specifically, rather than pruning model parameters, these methods selectively activate a subset of model parameters based on the input while maintaining the original performance. By transferring only the activated parameters to DRAM for computation, larger and more powerful LLMs can be stored in external flash memory, effectively surpassing the DRAM limitations of smartphones.

However, the efficiency of this LLM inference paradigm is significantly hindered by I/O overheads. Since different inference requests generally activate distinct sets of model parameters, frequent I/O operations are generated to swap parameters between DRAM and flash memory. As shown in Table 1, even when only half of the model parameters reside in flash memory, 71.9%-97.7% of the inference latency arises from I/O operations. More critically, the scattered activation of model parameters induces numerous small-grained read accesses, limiting transfer efficiency due to constraints in Input/Output Operations Per Second (IOPS) [38]. As depicted in Figure 1, this IOPS bottleneck severely restricts on-device bandwidth utilization across various LLMs.

Table 1. Breakdown of average inference latency per token when offloading 50% of model parameters to flash memory.

Model       Compute   Load       Total      Load Ratio
OPT-350M    34 ms     87 ms      121 ms     71.9%
OPT-1.3B    84 ms     273 ms     357 ms     76.5%
OPT-6.7B    387 ms    1883 ms    2270 ms    82.9%
Llama2-7B   450 ms    10982 ms   11432 ms   96.1%
Mistral-7B  355 ms    15126 ms   15481 ms   97.7%
Figure 1. Bandwidth utilization on smartphones is heavily constrained by IOPS. Ripple alleviates this bottleneck and boosts bandwidth with neuron co-activation linking.

Building on these insights, this paper proposes Ripple, a novel approach to accelerating LLM inference on smartphones through I/O optimizations. While previous works [37, 50] primarily focus on computation efficiency under activation sparsity, they tend to exacerbate the existing I/O overhead bottlenecks. Fewer studies [4, 69] explore mitigating I/O overhead through enhanced caching strategies to minimize data loading. However, without directly improving bandwidth utilization, overall efficiency remains suboptimal. Orthogonal to these methods, Ripple addresses the primary bottleneck in LLM inference by maximizing bandwidth utilization via the effective reduction of I/O operations.

The design of Ripple is rooted in Neuron Co-Activation, a property prevalent in activation sparsity yet underexplored in current works. Specifically, neurons in LLMs exhibit strong correlations in their activation patterns. When processing real-world datasets, the activation of an individual neuron is consistently linked to the activation of a stable group of others. Given the efficiency of continuous reads, which enable the retrieval of larger data blocks with a single request, Ripple introduces a key insight: Why not establish links between neurons that are frequently co-activated in flash memory, facilitating continuous read access to reduce IOPS?

However, this is not a low-hanging fruit, as both neuron co-activation patterns and storage hardware characteristics exhibit inherent complexity, complicating their effective alignment. Our comprehensive analysis identifies three critical technical challenges that must be tackled:

(1) Extensive Search Space. The vast number of neurons in LLMs leads to an exponentially large space of possible neuron linking combinations. Identifying the optimized neuron linking that maximizes global benefits is exceedingly difficult and infeasible through brute-force enumeration alone.

(2) Random Activation Variation. Owing to varying model inputs, the activation of model parameters exhibits intrinsic randomness. Although optimized placement strategies can spatially co-locate activated neurons, access to these neurons remains hindered by discontinuities caused by randomness.

(3) Misaligned Cache Strategy. Storing frequently activated neurons in memory is critical for minimizing transfer workload. However, storing neurons individually leads to fragmentation in their placement within flash memory, potentially disrupting continuous access.

To this end, Ripple employs a two-stage solution that performs hierarchical optimizations both offline and online. (1) In the Offline Phase, Ripple clusters neurons exhibiting high co-activation correlation and reorganizes their placement in flash memory. To address Challenge (1), we abstract the problem into a complete graph, reformulating it as the discovery of the globally optimal Hamiltonian Path. By leveraging graph-theoretic techniques, we propose a greedy algorithm that efficiently searches for optimized placement based on observed neuron co-activation patterns. (2) In the Online Phase, Ripple performs fine-grained refinements based on the optimized neuron placement, further enhancing access continuity. To tackle Challenge (2), we devise an IOPS-friendly access collapse technique. By strategically incorporating additional neurons between two separate neuron links, we improve read access continuity with negligible overhead. In response to Challenge (3), we design a linking-aligned in-memory caching policy. Rather than individually caching the hottest neurons, we account for their interlinking relationships, ensuring efficient access patterns.

We evaluate Ripple on three smartphones with distinct hardware configurations, benchmarking a diverse range of LLMs varying in structures and scales. The results demonstrate that Ripple significantly boosts on-device bandwidth, achieving improvements of up to 4.32×. Moreover, this bandwidth optimization yields substantial reductions in I/O latency during inference, offering speedups of up to 5.93× when compared to state-of-the-art solutions.

To the best of our knowledge, Ripple is the first to accelerate LLM inference on smartphones by enhancing I/O bandwidth through optimized neuron placement in flash memory. Ripple effectively bridges the performance gap between flash memory and DRAM, enabling LLM inference to exceed DRAM limitations on smartphones. Our contributions can be summarized as follows:

• We identify the primary bottleneck in LLM inference on smartphones as IOPS, attributing it to the inherent misalignment between scattered activation patterns and storage hardware characteristics.
• We notably exploit neuron co-activation to mitigate the IOPS bottleneck, pioneering the optimization of neuron placement in flash memory for enhancing bandwidth efficiency on smartphones.

• We conduct extensive evaluations on various representative LLMs and hardware, achieving substantial improvements over state-of-the-art solutions.

2 Background and Motivation

2.1 Activation Sparsity in LLM Inference

Numerous studies [34, 41, 49, 51, 77] have shown that LLMs exhibit considerable Activation Sparsity, allowing a substantial portion of activations to be disregarded without impacting the final outputs. This characteristic greatly reduces resource consumption, as only a subset of parameters participates in the computation. Importantly, since no parameters are pruned, the full capacity of the LLMs remains intact.

Although prevalent across transformer-based LLMs [37, 59], this activation sparsity is particularly pronounced when ReLU-family [2, 47, 78] functions are employed. As depicted in Figure 2, the ReLU function zeros out all negative values in the activations A, leading to the exclusion of the corresponding neurons (e.g., rows of the up-projection matrix U and columns of the down-projection matrix D) from the computation without any loss. Consequently, recent research efforts [41, 49–51, 69] explore replacing activation functions with ReLU across popular LLMs, achieving high sparsity while maintaining comparable model performance.

Figure 2. Activation sparsity introduced by ReLU. All non-activated (uncolored) model parameters can be excluded from computation without impacting the model outputs.
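To make the exclusion in Figure 2 concrete, the following minimal NumPy sketch (toy sizes and variable names are ours, not taken from any particular model) checks that dropping the rows of the up-projection and the columns of the down-projection tied to zeroed activations leaves the FFN output unchanged:

import numpy as np

# Toy FFN block with a ReLU between up- and down-projection, mirroring Figure 2.
d_model, d_ffn = 8, 32
U = np.random.randn(d_ffn, d_model)   # up-projection: rows are neurons
D = np.random.randn(d_model, d_ffn)   # down-projection: columns are neurons
x = np.random.randn(d_model)

# Dense computation.
y = U @ x
a = np.maximum(y, 0.0)                # ReLU zeroes out negative pre-activations
o_dense = D @ a

# Sparse computation: only rows of U / columns of D with non-zero activation matter.
active = a > 0
o_sparse = D[:, active] @ np.maximum(U[active] @ x, 0.0)

assert np.allclose(o_dense, o_sparse)
print(f"active neurons: {active.sum()}/{d_ffn}")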
Compared to data centers, this property becomes even more critical when deploying LLMs on resource-constrained devices like smartphones. On one hand, smartphones typically offer limited DRAM capacity, ranging from 10GB to 20GB. On the other hand, a substantial portion of this DRAM is allocated by the operating system and other active applications, leaving even less available for any single application. In contrast, merely storing an LLM with 7 billion parameters in half precision requires at least 14GB of DRAM.

Figure 3 presents a typical procedure for deploying LLMs on smartphones with activation sparsity [4, 69]. Rather than relying solely on limited DRAM, larger flash memory can be used to store model parameters. The process begins by effectively predicting the activated neurons using a neural-network-based predictor. Next, the activated neurons are selectively loaded into DRAM for final computation. This approach enables the execution of models that surpass the available DRAM size. However, the I/O bandwidth between flash memory and DRAM emerges as the new bottleneck.

Figure 3. A three-step procedure for deploying LLMs on smartphones while leveraging activation sparsity. Instead of memory limitations, the communication between external storage and memory becomes the new bottleneck.
2.2 Universal Flash Storage on Smartphones

Mobile devices, such as smartphones, predominantly utilize Universal Flash Storage (UFS) [27] as the storage protocol. By leveraging NAND flash, UFS offers significantly larger storage capacity than the space available in DRAM, with scalability reaching terabyte (TB) levels. Furthermore, the introduction of the command queue in UFS markedly improves the efficiency of data transfer between flash and DRAM. In the latest version (UFS 4.0), the sustained read speed per lane can reach up to 2.9 GB/s. This combination of extensive storage capacity and relatively high read speed forms the foundation for the execution of LLMs on mobile devices.

However, unlike server-side external storage (such as NVMe), UFS on smartphones typically features a shallow command queue, supporting only 32 entries. This limitation significantly restricts the IOPS for flash reads and can even hinder full utilization of the available bandwidth. As depicted in Figure 4, the read bandwidth increases with the continuous I/O size, since continuous reads can be issued by a single read operation. Specifically, when the continuous I/O size is less than 24KB, the bandwidth scales almost linearly with the I/O size, indicating that these reads are primarily IOPS-bound. Consequently, the key to fully exploiting UFS bandwidth lies in maximizing the continuity of read accesses.

Figure 4. Bandwidth at varying continuous I/O sizes. The near-linear relationship indicates that the bottleneck lies in IOPS, rather than the bandwidth capacity.
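A back-of-the-envelope model captures this IOPS-versus-bandwidth regime; the IOPS constant below is a placeholder chosen only so that the crossover lands near the 24KB point described above, not a measured UFS figure:

# Simple model: each read of io_size bytes costs one I/O operation, so throughput is
# capped both by the IOPS limit and by the link bandwidth.
MAX_IOPS = 120_000      # placeholder, not a measured value
PEAK_BW = 2.9e9         # bytes/s, the per-lane UFS 4.0 figure quoted above

def effective_bandwidth(io_size_bytes):
    return min(PEAK_BW, MAX_IOPS * io_size_bytes)

for kb in (4, 8, 16, 24, 64):
    bw = effective_bandwidth(kb * 1024)
    regime = "IOPS-bound" if bw < PEAK_BW else "bandwidth-bound"
    print(f"{kb:>3} KB continuous reads -> {bw / 1e9:.2f} GB/s ({regime})")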
2.3 Analysis: IOPS as the Bottleneck

By storing the full model parameters in flash and selectively transferring only the activated parameters to DRAM for computation, mobile devices can accommodate larger and more powerful models while maintaining the resource demands of running smaller models. However, this approach is severely constrained by the data transfer overhead. As shown in Table 1, I/O operations between flash and DRAM account for the majority of the inference latency. Consequently, the efficiency of I/O operations emerges as the pivotal determinant of the smooth execution of this process.

The root cause of this I/O overhead lies in the dynamic nature of activation sparsity, where the specific subset of neurons that is activated changes with the inputs. As a result, each inference request necessitates loading a unique set of neurons from flash into DRAM, leading to considerable data transfer overhead. More critically, overlapping these data transfers with computation proves challenging, as the prediction of activated neurons is contingent on the inputs from the current or adjacent layers. The amount of computation available is insufficient to hide the substantial I/O latency.

To address this, prior work predominantly seeks to mitigate the volume of data loaded through optimized data management techniques. However, due to the minimal overlap in activated neurons across different inference requests, the efficiency remains suboptimal. Our findings reveal that the bottleneck in I/O operations stems not from the volume of data transferred but from the low effective bandwidth utilization. We evaluate the inference latency of OPT-350M with different activation sparsity ratios, as depicted in Figure 5. Despite the reduced data transfer, the inference latency with activation sparsity approaches, or even surpasses, that of dense models. This is because the scattered nature of neuron activation in conventional model-structure-based neuron placement results in numerous small-grained read accesses. This causes the device to become heavily IOPS-bound, preventing the full exploitation of the available bandwidth.

Figure 5. Inference latency and achieved bandwidth of data loading from flash under varying activation sparsity ratios.

Drawing from these observations, we derive a crucial insight: the conventional neuron placement, guided by model structure, is misaligned with the dynamic activation sparsity utilized in on-device LLM inference. As a result, the key to addressing the I/O bottleneck lies in ensuring that read accesses to flash are as continuous as possible, thereby pushing the bandwidth towards full exploitation.

3 Ripple Overview

We propose Ripple, an efficient approach for accelerating LLM inference on smartphones through advanced I/O optimizations. While previous studies primarily focus on the efficiency of either computation or memory management, Ripple tackles the I/O bottleneck by directly improving the neuron transfer bandwidth between flash and DRAM.

The design of Ripple is rooted in Neuron Co-Activation, a fundamental property inherent in LLM activation sparsity. As illustrated in Figure 6, neurons within LLMs exhibit strongly correlated activation patterns across various model structures and datasets. Although similar observations have been validated in prior studies [4, 69], this property remains largely underexplored due to its intrinsic complexity. By employing both algorithmic and system-level optimizations, Ripple is the first to leverage neuron co-activation for optimizing neuron placement in flash, effectively mitigating the I/O bottleneck during LLM inference on smartphones. Figure 7 presents an overview of Ripple.

Figure 6. Visualization of neuron co-activation across various LLMs and datasets. Brighter colors denote high values.
Figure 7. Overview of Ripple.

Offline Correlation-Aware Clustering (§4). Ripple begins with the identification of an optimized neuron placement in flash by clustering co-activated neurons. Specifically, the process consists of three steps. ❶ Pattern Extraction. We develop distinct strategies to extract co-activation patterns in transformer-based LLMs efficiently. These extracted patterns quantify the strength of co-activation correlations among neurons, forming the foundation for subsequent neuron rearrangement. ❷ Problem Abstraction. We model the process of identifying optimal neuron placement as a graph representation. Leveraging this abstraction, we reformulate the problem into a Hamiltonian Pathfinding task, enabling a more efficient solution through graph-theoretic techniques. ❸ Hamilton Pathfinding. Given the NP-hard nature of the optimized problem, we devise a heuristic algorithm that employs a greedy approach. We prove strictly that our algorithm searches for a locally optimal solution with a time complexity of O(n^2 log n).

Online Continuity-Centric Processing (§5). To efficiently align with the optimized neuron placement, Ripple adopts customized data access and DRAM management techniques, specially designed to facilitate more continuous read access. ❹ Access Collapse. Despite optimized neuron placement, the inherent randomness in neuron activation leads to persistent discontinuous read access. To address this, we propose strategically merging nearby discontinuous read accesses by loading additional neurons between them. This merging approach, with minimal overhead, substantially reduces IOPS, thereby enhancing overall bandwidth efficiency. ❺ Cache Policy. Retaining the most frequently activated neurons in DRAM effectively reduces repeated neuron transfers. However, this approach alone risks disrupting the continuity of optimized neuron placement in flash. To mitigate this, we propose caching neurons in DRAM at the granularity of neuron segments, rather than individual neurons. This strategy helps prevent fragmentation within the flash, ultimately enhancing overall bandwidth efficiency.

4 Offline Correlation-Aware Clustering

4.1 Step 1: Parameter-Efficient Pattern Extraction

LLMs [8, 28, 56, 76] are typically based on transformer architectures, with two primary components: the Multi-Head Attention (MHA) block and the Feed Forward Network (FFN) block. To enhance inference efficiency, an increasing number of LLMs are adopting the Group Query Attention [3] mechanism, which significantly minimizes the parameter overhead of the MHA block. As a result, in Ripple, we focus primarily on offloading the parameters of the FFN block to flash memory, while prefetching all parameters within the MHA block. Nonetheless, this approach can similarly be exploited to optimize the offloading of the MHA block itself.

To extract the neuron co-activation patterns, we initially utilize an Adjacency Matrix to record the activation frequencies of neurons within LLMs. This step is performed only once, prior to inference, utilizing a dataset associated with the upcoming tasks. By interpreting frequency f as a probability, we compute the probability of the activation of neuron n_i, denoted as P(i), and the probability of the co-activation of neurons n_i and n_j, denoted as P(ij), as follows:

P(i) = \frac{f(n_i)}{\sum_{k=1}^{N} f(n_k)}    (1)

P(ij) = \frac{f(n_i, n_j)}{\sum_{k=1}^{N} \sum_{l=1}^{N} f(n_k, n_l)}    (2)

Here, N denotes the number of neurons in a weight matrix. When computing these statistics, Ripple accounts for the binding relationships between neurons across different weight matrices within the same FFN block. For instance, in OPT [76], the columns of the up-projection matrix are bound to the corresponding rows of the down-projection matrix, as their activations rely on whether the same intermediate values are zero or not. A similar binding relationship exists among the gate, up, and down projection matrices in Llama2 [56].
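As a sketch of this bookkeeping, the snippet below (our own simplification: it assumes binary activation masks have already been traced per FFN block, with bound neurons across matrices collapsed into a single index) computes the frequencies and the probabilities of Equations (1) and (2):

import numpy as np

def coactivation_stats(masks):
    """masks: (num_samples, N) boolean array; entry [s, i] is True when neuron i
    (one bound group of FFN rows/columns) was activated for sample s."""
    m = masks.astype(np.float64)
    f_single = m.sum(axis=0)               # f(n_i): activation counts
    f_pair = m.T @ m                        # f(n_i, n_j): co-activation counts
    P = f_single / f_single.sum()           # Equation (1)
    P_pair = f_pair / f_pair.sum()          # Equation (2)
    return P, P_pair

# Example with random masks standing in for traced activations of one FFN block.
rng = np.random.default_rng(0)
masks = rng.random((1000, 64)) < 0.1        # 1000 samples, 64 neurons, ~10% density
P, P_pair = coactivation_stats(masks)
dist = 1.0 - P_pair                         # used as the pairwise distance in Step 2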
4.2 Step 2: Graph-Based Problem Abstraction

Following the extraction of the neuron co-activation patterns, the subsequent step involves determining an optimized neuron placement within the flash. To enable more continuous read access, neurons that frequently co-activate should ideally be positioned in close proximity. Given the immense number of potential neuron placements, a brute-force enumeration approach is infeasible. Innovatively, we reformulate this problem as a graph representation, allowing for a more efficient solution leveraging graph-theoretic techniques.

Graph Abstraction. We abstract the neuron co-activation relationships into a Complete Graph. In this graph, each node represents a neuron, and each edge captures the co-activation between a pair of neurons. Specifically, the value assigned to each edge reflects the strength of their co-activation correlation, referred to as the Distance Between Two Neurons. We prove that the optimal definition of the distance between neuron n_i and neuron n_j is given as follows:

dist(n_i, n_j) := 1 - P(ij)    (3)

Proof. The objective of Ripple is to minimize the expected number of I/O operations during a single inference request. Initially, consider that all neurons are activated individually. Therefore, the expected number of I/O operations, \hat{N}_{indiv}, can be expressed as:

\hat{N}_{indiv} = \sum_{i=1}^{N} P(i)    (4)

Next, we account for the co-activation of neurons. When neurons n_i and n_j are co-activated, both can be accessed with a single I/O operation. Therefore, the expected number of I/O operations, \hat{N}_{coact}, is given by:

\hat{N}_{coact} = \sum_{i=1}^{N} P(i) - \sum_{i=1}^{N} \sum_{j=1}^{N} P(ij)    (5)

Given that the first term remains constant, minimizing \hat{N}_{coact} is equivalent to maximizing the second term, \sum_{i=1}^{N} \sum_{j=1}^{N} P(ij). By defining the distance between neuron n_i and neuron n_j as 1 - P(ij), this problem can be formulated as identifying the shortest Hamiltonian path [48] in a complete graph. ∎

Hamilton Pathfinding. The Hamiltonian path ensures that all neurons stored within the flash are involved, while minimizing the path length maximizes the likelihood of clustering co-activated neurons together. To find the optimal neuron placement, we have to search for the globally shortest Hamiltonian path (i.e., the shortest such path over all pairs of endpoint nodes). We find this problem can be further reduced to the classical Traveling Salesman Problem (TSP) [31].
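A tiny worked instance makes the objective concrete (the numbers below are ours, and the second term of Equation (5) is read as ranging over the neuron pairs that end up adjacent in the placement):

P(1) = 0.5, \quad P(2) = 0.4, \quad P(3) = 0.1, \qquad P(12) = 0.30, \quad P(13) = 0.10, \quad P(23) = 0.05

\hat{N}_{indiv} = 0.5 + 0.4 + 0.1 = 1.0

\text{placement } (n_1, n_2, n_3): \quad \hat{N}_{coact} = 1.0 - (0.30 + 0.05) = 0.65
\text{placement } (n_2, n_1, n_3): \quad \hat{N}_{coact} = 1.0 - (0.30 + 0.10) = 0.60

The second placement corresponds to the shorter Hamiltonian path (total edge distance 2 - 0.40 = 1.60 versus 2 - 0.35 = 1.65) and therefore yields fewer expected I/O operations per request.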
4.3 Step 3: Heuristic-Driven Greedy Algorithm

Although the problem can be reduced to the classical TSP, which is known to be NP-hard [57], finding an optimal solution in polynomial time is generally infeasible. We propose a heuristic algorithm that searches for the optimal neuron placement greedily, as shown in Figure 8.

Figure 8. A search algorithm example. Based on co-activation patterns, neuron pairs are ranked by correlation and sequentially retrieved to form neuron linkings. During construction, the root of each linking (R) and the neighbor counts of each neuron (N) are recorded. By greedily merging the nearest linkings, a complete linking that encompasses all neurons is obtained.

Algorithm Details. The core idea of the algorithm is to initially treat each neuron as an individual link. By iteratively merging the nearest links in a greedy manner, new links are formed until a single link encompassing all neurons remains. To achieve this, analogous to the Distance Between Two Neurons defined in Equation 3, we define the Distance Between Two Neuron Links, l_i and l_j, as follows:

dist(l_i, l_j) := \min\{dist(l_i(h), l_j(h)), dist(l_i(h), l_j(t)), dist(l_i(t), l_j(h)), dist(l_i(t), l_j(t))\}    (6)

Here, l_i(h) and l_i(t) denote the head and tail neurons of a neuron link l_i, respectively, and dist(l_i(h), l_j(h)) represents the distance between these two neurons.

Algorithm 1 outlines the process in pseudocode. The algorithm begins by taking a set of neurons N and a distance function dist(l_i, l_j), as defined in Equation 6 (Line 1). Initially, each neuron in N is treated as an individual link (Line 2). The algorithm then proceeds iteratively, searching for the nearest pair of links to merge. For each pair of links, the distance is computed using dist(l_i, l_j) (Line 7), and the pair with the smallest distance is selected for merging (Lines 8-10). This process repeats until only a single link remains, which contains all the neurons (Lines 3-11). The final merged link is then returned as the output of the algorithm (Line 12).

Algorithm 1 Neuron Placement Search Algorithm
 1: Input: Neuron set N, Distance function dist(n_i, n_j)
 2: Output: Optimized neuron placement P
 3: function GreedySearch(N)
 4:     Initialize NbrCnt[n] ← 0 for all n ∈ N
 5:     Initialize disjoint sets S(n) for all n ∈ N
 6:     Initialize priority queue Q ← ∅
 7:     for each pair (n_i, n_j) ∈ N × N, n_i ≠ n_j do
 8:         Q.push((n_i, n_j), dist(n_i, n_j))
 9:     while Q ≠ ∅ do
10:         (n_x, n_y) ← Q.pop()
11:         if NbrCnt[n_x] = 2 or NbrCnt[n_y] = 2 then
12:             continue                                  ⊲ Skip if either neuron is inside a link
13:         root_x ← Find(n_x), root_y ← Find(n_y)
14:         if root_x ≠ root_y then
15:             NbrCnt[n_x]++
16:             NbrCnt[n_y]++
17:             Union(root_x, root_y)
18:             Link(n_x, n_y)                            ⊲ Update neuron linkings
19:     P ← []
20:     c ← Select first neuron from {n ∈ N | NbrCnt[n] = 1}  ⊲ Set current neuron to starting point
21:     while c ≠ NIL do                                  ⊲ Traverse the final link to its other end
22:         P.append(c)                                   ⊲ Add c to the optimized placement
23:         c ← NextNeuron(c)                             ⊲ Move to next neuron linked to c
24:     return P

Complexity Analysis. Our implementation leverages the union-find and priority queue data structures to optimize the time complexity of the algorithm. The union-find structure efficiently manages the connections between elements, ensuring that elements belonging to the same set are part of the same link. Both the insertion and search operations in the union-find structure have a time complexity of O(1). Meanwhile, the priority queue is employed to identify the nearest neuron link, with a time complexity of O(n^2 log n). The n^2 factor arises from the pairwise enumeration of neuron links l_i and l_j. Consequently, the overall time complexity of the algorithm is O(n^2 log n).
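For reference, here is a compact Python sketch of this greedy search (a simplification under our own assumptions: it enumerates neuron pairs directly rather than maintaining link-level distances, and reconstructs the final chain by walking the recorded links; it is not the authors' implementation):

import heapq

def greedy_placement(num_neurons, dist):
    """Greedy approximation of the shortest Hamiltonian path (cf. Algorithm 1).
    dist(i, j) should return 1 - P(ij) for neurons i and j."""
    parent = list(range(num_neurons))              # union-find over partial links

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path halving
            x = parent[x]
        return x

    # Rank all neuron pairs by distance (smaller distance = stronger co-activation).
    heap = [(dist(i, j), i, j)
            for i in range(num_neurons) for j in range(i + 1, num_neurons)]
    heapq.heapify(heap)

    nbr_cnt = [0] * num_neurons                    # interior neurons of a link reach 2
    links = {i: [] for i in range(num_neurons)}
    merged = 0
    while heap and merged < num_neurons - 1:
        _, x, y = heapq.heappop(heap)
        if nbr_cnt[x] == 2 or nbr_cnt[y] == 2:
            continue                               # endpoint already consumed
        rx, ry = find(x), find(y)
        if rx == ry:
            continue                               # joining them would close a cycle
        parent[rx] = ry
        nbr_cnt[x] += 1
        nbr_cnt[y] += 1
        links[x].append(y)
        links[y].append(x)
        merged += 1

    # Walk the single remaining chain from one endpoint to obtain the placement.
    start = next(i for i in range(num_neurons) if nbr_cnt[i] <= 1)
    placement, prev, cur = [], None, start
    while cur is not None:
        placement.append(cur)
        nxt = next((n for n in links[cur] if n != prev), None)
        prev, cur = cur, nxt
    return placement

Given the P_pair matrix from Section 4.1, it can be invoked as greedy_placement(N, lambda i, j: 1.0 - P_pair[i, j]).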
5 Online Continuity-Centric Processing

Through offline correlation-aware clustering, neurons that are frequently co-activated are strategically placed contiguously in flash memory. However, the dynamic and intricate nature of neuron co-activation complicates static neuron placement, making it inadequate to entirely alleviate IOPS limitations. To fully exploit the flash and DRAM resources that serve neuron read requests, we design specific online serving techniques, aiming to address the two primary categories of these challenges, manifesting in data access and caching.

The first challenge arises from Random Activation Variation. Due to the stochastic nature of neuron activation, it is infeasible to consistently follow the extracted co-activation patterns. Although neurons that are frequently co-activated are placed in close positions, minor variances induced by randomness can still lead to discontinuous read access.

The second challenge stems from a Misaligned Cache Strategy. Conventional cache mechanisms often fail to adapt to co-activation scenarios. Simply caching hot neurons loses the information on neuron placement, breaking the continuity of neurons in flash, which works against our optimizations for I/O operation reduction. However, directly caching all co-activated neurons contiguously may take up too much cache space, wasting cache efficiency.

5.1 IOPS-Friendly Access Collapse

In Ripple, we introduce an innovative online technique that strategically combines nearby read accesses. The fundamental insight driving this approach is that while co-activated neurons cannot always be placed contiguously, they are likely to be positioned in close proximity following offline correlation-aware clustering. As illustrated in Figure 9, consider a scenario where neurons n1, n2, n3, and n4 are stored contiguously, but occasionally only n1, n2, and n4 are activated, necessitating two distinct read operations. However, when IOPS is limited, the inclusion of more neurons per read operation yields superior overall performance. Capitalizing on this observation, when two disjoint but proximate neuron groups are co-activated, we speculatively read the intervening neurons. This strategy effectively coalesces the two separate neuron groups into a single, contiguous read access, thereby substantially enhancing overall efficiency.

Figure 9. An example of access collapse. By strategically combining nearby neurons, Ripple improves overall efficiency by maximizing bandwidth utilization.

The execution of this IOPS-friendly access collapse is governed by two key factors during runtime. (1) Extra bandwidth cost. Introducing additional neurons for merging involves a trade-off between increasing the data transfer size and decreasing I/O operations, aiming to enhance bandwidth utilization. We employ a threshold-based approach: if the number of neurons between two neuron groups falls below a predefined threshold, the collapse is performed; otherwise, it is skipped. This threshold is dynamically adjusted during runtime to balance the overhead and efficiency gains. (2) Storage Bottleneck. While merging can reduce I/O operations, it only improves bandwidth efficiency if the storage is IOPS-bound rather than bandwidth-bound. To handle this, we implement an online bottleneck detector that periodically checks whether the achieved bandwidth has reached the hardware's maximum capacity. If the bandwidth is fully utilized, the system defaults to the original read strategy.
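The gap-threshold rule can be sketched in a few lines (our own simplified formulation over sorted neuron positions within one contiguous region of flash; the real system additionally adapts the threshold and consults the bottleneck detector):

def collapse_accesses(activated, threshold):
    """Merge reads to nearby activated neuron positions (sorted flash offsets)
    into larger continuous reads when the gap does not exceed threshold.
    Returns a list of (start, end) ranges, end inclusive."""
    if not activated:
        return []
    positions = sorted(activated)
    ranges = [[positions[0], positions[0]]]
    for pos in positions[1:]:
        if pos - ranges[-1][1] - 1 <= threshold:
            ranges[-1][1] = pos        # extend: the skipped neurons are loaded as well
        else:
            ranges.append([pos, pos])  # gap too large, issue a separate read
    return [tuple(r) for r in ranges]

# Toy example: neurons 0, 1 and 3 activated; with threshold 1 the gap at 2 is read speculatively.
print(collapse_accesses([0, 1, 3], threshold=1))   # -> [(0, 3)]: one read
print(collapse_accesses([0, 1, 3], threshold=0))   # -> [(0, 1), (3, 3)]: two reads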
5.2 Linking-Aligned Cache Policy

It is natural to store the most frequently activated neurons in DRAM to reduce redundant data transfer between flash and DRAM. However, directly applying an existing cache policy is inefficient, since all these policies operate at the level of individual neurons, ignoring the co-activation pattern and the neuron placement in flash. Therefore, we add a layer of access management on top of the existing cache to coordinate with Access Collapse and further improve system efficiency. For example, suppose neurons A, B, C, and D are stored together and often co-activate together; if B is hotter than the others, B has a higher probability of being cached, which may cause discontinuous read operations. A further idea is to cache neurons that are stored contiguously in flash together, reducing the occurrence of this situation, but doing so occupies a large amount of cache space at once, which is not worth the loss.

In Ripple, activated neurons are divided into two categories: sporadic neurons and continuous segments. Sporadic neurons, as the name suggests, refer to neurons that are co-activated with few surrounding neurons. Continuous segments consist of a series of neurons that are activated together in succession. Ripple caches sporadic neurons as usual, but caches continuous segments with a lower probability than sporadic neurons. This is mainly because caching continuous segments requires more memory resources while bringing more limited benefits: if some neurons in a continuous segment are evicted while others remain in the cache, subsequent accesses lead to discontinuous reads from flash. Although the waste of IOPS is alleviated by Access Collapse, DRAM resources are still wasted. Our cache policy also cooperates well with state-of-the-art cache designs, since we only control the cache admission policy and leave the rest unchanged.
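One way to realize this admission rule is sketched below; the run-length threshold and the probabilistic admission are our own assumptions, since the text only specifies that continuous segments are admitted with a lower probability than sporadic neurons, on top of an unmodified eviction policy such as S3-FIFO:

import random

def admit(run_length, segment_threshold=4, segment_admit_prob=0.1):
    """Cache-admission filter layered on top of an existing cache (e.g. S3-FIFO).
    Neurons read as part of a short run ("sporadic") are admitted as usual, while
    neurons belonging to a long continuous segment are admitted only rarely, so a
    segment is unlikely to end up partially cached and break read continuity."""
    if run_length < segment_threshold:
        return True                                   # sporadic neuron
    return random.random() < segment_admit_prob       # continuous segment

# A single activated neuron is always admitted; a 32-neuron segment only ~10% of the time.
print(admit(1), admit(32))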
6 Evaluation

6.1 Evaluation Setup

Hardware. We conduct evaluations across a diverse set of smartphones, as detailed in Table 2. This broad spectrum underscores the wide applicability and robustness of Ripple across different hardware platforms.

Table 2. Smartphone hardware configurations.

Device         SoC                  DRAM   Flash Storage
OnePlus 12     Snapdragon 8 Gen 3   24GB   1TB UFS4.0
OnePlus Ace 3  Snapdragon 8 Gen 2   16GB   512GB UFS4.0
OnePlus Ace 2  Snapdragon 8+ Gen 1  16GB   512GB UFS3.1

Models. We choose models from three widely adopted LLM families [28, 56, 76] for evaluation, as outlined in Table 3. For Llama2 and Mistral, we utilize their ReLU variants [49, 51], which offer an effective balance between sparsity and performance. These LLMs exhibit diverse model architectures, parameter sizes, and sparsity ratios, thereby offering comprehensive benchmarks for Ripple.

Table 3. Model configurations.

Model    # Params  # Layers  # Neurons+  Neuron Dim  Sparsity
OPT      350M      24        8192        1024        9.49%
OPT      1.3B      24        16384       2048        4.09%
OPT      6.7B      32        32768       4096        3.28%
Llama2   7B        32        33024       4096        13.88%
Mistral  7B        32        43008       4096        60.52%
+ Neurons per FFN block, with 2 linear layers in OPTs and 3 in the others.

Datasets. We evaluate Ripple using three real-world datasets, each representing a diverse set of content and linguistic structures. Alpaca [53] offers task-specific instruction data, OpenWebText [21] captures web-scale information, and WikiText [39] provides formal, encyclopedic content. For each dataset, we collect 1,000 sentences to extract the neuron co-activation patterns during the offline stage.

Baselines. We benchmark Ripple against two state-of-the-art LLM inference frameworks for smartphone deployment. The first, Llama.cpp [19], is the most widely used LLM inference framework and currently the fastest, supporting offloading of model parameters to flash storage. The second baseline is LLM in a Flash (LLMFlash) [4], representative of current methods for on-device LLM inference. Although it is not open-source, we port it into Llama.cpp by integrating its key I/O optimizations, such as row-column bundling. While our evaluation primarily focuses on I/O efficiency between flash and DRAM, we integrate a high-performance cache, S3-FIFO [70], into all baselines and maintain a DRAM cache ratio of 0.1 during the comparison.

Metrics. Latency remains the most critical concern in mobile scenarios. Consequently, our primary performance metric is the I/O latency during inference, which constitutes the majority of the overall end-to-end latency. For a more granular analysis, we also consider metrics such as IOPS and bandwidth. Notably, bandwidth here refers to the effective bandwidth, which only considers the activated neurons. We normalize the metric values when large discrepancies occur for clarity. All metrics are averaged over 100 token generations and repeated across 10 trials.

Figure 10. Overall Performance (latency and bandwidth) of Ripple across various LLMs and datasets on OnePlus 12.

6.2 Overall Performance

Latency. As depicted in Figure 10(a), we evaluate the I/O latency per token and the corresponding speedup of Ripple on OnePlus 12. The results indicate that Ripple effectively mitigates the I/O bottleneck during LLM inference on smartphones, yielding speedups of up to 5.93× over Llama.cpp and 3.23× over LLMFlash. For OPT models, which exhibit high sparsity, Ripple achieves an average speedup of 2.23× over LLMFlash across all model sizes and datasets. For denser models like Llama2-7B and Mistral-7B, optimizing I/O operations becomes much more challenging. However, with IOPS-oriented techniques, Ripple still achieves speedups of 13.8% and 10.2% over LLMFlash.

Effective Bandwidth. As shown in Figure 10(b), and consistent with the observed latency results, Ripple demonstrates a marked enhancement in bandwidth, achieving improvements of up to 4.32× and 2.13× over the two baselines, respectively. These gains primarily come from a substantial reduction in I/O operations due to more continuous access. By relieving the device of IOPS constraints, the overall bandwidth utilization is boosted.

6.3 Performance Breakdown

Figure 11 presents a detailed performance analysis of Ripple. The evaluation is conducted on two LLM families, OPT and Llama2, with OPT model sizes ranging from 350M to 6.7B. Using the strongest baseline, LLMFlash, as the starting point, the results demonstrate that the offline and online stages of Ripple yield average performance improvements of 1.30× and 1.26×, respectively, underscoring the effectiveness of both stages. By closely integrating both stages, Ripple achieves cumulative speedups of 1.68× on average across all five models.

Figure 11. Performance breakdown on OnePlus 12.

6.4 Offline Ablation Study

Continuous Access. The core insight of Ripple lies in optimizing bandwidth utilization by significantly maximizing continuous read access to flash. As depicted in Figure 12, we evaluate the lengths of read accesses in both Ripple and LLMFlash. Prior to optimization, the read access lengths remain below 10 neurons, with averages of 1.05 and 1.10 across the two models. In contrast, Ripple exhibits a marked improvement, with read access length increasing by 213% and 160%, respectively. Remarkably, the maximum continuous read access length reaches up to 620 in OPT and 344 in Llama. This considerable improvement in continuous read access facilitates full utilization of the available bandwidth.

Overhead Analysis. We measure the time cost of executing the offline search algorithm in Ripple. Since the time complexity is primarily determined by the number of activated neurons, we perform evaluations across various datasets and model sizes. To expedite the search process, we implement parallel computation by exploiting the independence of different model layers. Table 4 indicates that all search processes complete within a few minutes, even for the largest 13B model. Although the theoretical time complexity is O(n^2 log n), the growth of the time cost is modest. Given that this search process is required only once, the overhead is negligible compared to the subsequent inference process.
Table 4. Time cost of the offline search algorithm across different datasets and models (in seconds).

Dataset      OPT-350M  OPT-1.3B  OPT-6.7B  Llama2-7B  Mistral-7B
Alpaca       5.47      19.59     97.96     57.12      104.95
OpenWebText  5.32      19.74     98.26     56.36      103.04
WikiText     5.59      21.63     104.04    56.23      104.76

Figure 12. Continuous access length in Ripple and LLMFlash. A length of 1 corresponds to a neuron bundle, comprising two neurons in OPT and three neurons in Llama2.

6.5 Online Ablation Study

Access Collapse. Figure 13 shows the effectiveness of Access Collapse. For both OPT-6.7B and Llama2-7B, the effective bandwidth for neuron transfer increases due to the optimized trade-off between data volume and I/O operations. In the OPT-6.7B and Llama2-7B models, the Access Collapse strategy brings an effective bandwidth improvement of 1.21× and 1.09×, respectively. This optimization successfully shifts the bottleneck from IOPS to bandwidth on smartphones, resulting in enhanced overall performance.

Figure 13. Data transfer volume, IOPS, and bandwidth of activated neurons before and after applying access merging.

Cache Ratio. We compare the baseline with different cache ratios to show the memory savings of Ripple. Figure 14 presents the latency comparison when caching various ratios of neurons in DRAM. The results indicate that, although we focus primarily on I/O optimization from flash, Ripple remains highly efficient when interfacing with caching systems, achieving DRAM caching space savings of up to 1.50× and 1.36× on the two models, respectively.

Figure 14. Per-token latency on varying DRAM cache ratios.

6.6 Sensitivity Analysis

In this section, we examine the sensitivity of Ripple to several key factors, including inputs, hardware, and precision.

Sensitivity on Inputs. In the offline stage, Ripple optimizes neuron placements in flash based on co-activation patterns extracted from the preprocessed dataset. Figure 15 illustrates the I/O performance of Ripple when inference inputs are sourced from a different dataset. The results reveal that the neuron placements determined offline remain effective when inputs are biased. This suggests that neuron co-activation patterns may be an intrinsic property of the model itself, with input variations exerting limited influence.

Figure 15. I/O performance of Ripple during inference requests from various datasets (columns), using optimized neuron placements generated from a specific dataset (x-axis).

Sensitivity on Hardware. Figure 16 shows the I/O performance of Ripple on smartphones with varying hardware configurations. Compared to the OnePlus 12 (OP 12), the OnePlus Ace 3 (OP Ace3) shares the same storage but features a less powerful SoC, while the OnePlus Ace 2 (OP Ace2) has both weaker UFS storage and a weaker SoC. The results show that the performance of OP 12 and OP Ace3 is comparable, indicating that storage is a more critical factor than the SoC. In contrast, OP Ace2 exhibits roughly half the performance of the other two, aligning with the hardware bandwidth limitations shown in Figure 4.

Figure 16. Per-token latency on varying smartphones.

Sensitivity on Precision. For each model, the neuron dimension is fixed, and lower precision results in a smaller neuron size. Figure 17 presents the per-token latency across varying floating-point precisions. The results demonstrate that Ripple scales efficiently with data format precision, maintaining consistent performance across the three models. Although lower precision amplifies the extent of scattered read access, Ripple still achieves an average speedup of 1.65× from 16-bit to 8-bit.
Figure 17. Per-token latency on varying model precision.

7 Related Works

Model Parameter Reduction. To alleviate the memory and computational burdens of LLM execution, substantial efforts have focused on the reduction of model parameters, with two principal approaches emerging. The first, Model Pruning [35, 36], seeks to reduce the number of model parameters while ensuring minimal performance degradation. Several works [23, 24, 26, 32, 75] have explored static pruning, where parameters are pruned offline, prior to inference. In contrast, dynamic sparsity methods [7, 12, 22, 25, 46, 67] determine which parameters to prune during runtime, enabling seamless integration with training or inference. Different from these pruning techniques, Ripple exploits the Activation Sparsity inherent in LLMs, retaining all model parameters but enhancing resource efficiency by selectively activating only a subset. This approach preserves the model's generalization ability, which is particularly critical for LLMs. The second is Model Quantization [11, 20], which reduces the precision of model parameters by optimizing the utilization of available bits to encode model information more efficiently. Numerous studies have driven precision progressively lower, with efforts ranging from 8-bit [13, 66] to 4-bit [16, 73], 2-bit [10], and even 1-bit [58, 68]. However, as precision decreases, the resulting data access patterns become increasingly fine-grained, leading to more scattered access. This, in turn, heightens the significance of Ripple.

Sparse Computation Optimization. Sparse linear algebra often falls short in performance compared to its dense counterparts, primarily due to its inherently irregular computation patterns and scattered memory accesses. Many works have focused on optimizing computation under sparsity patterns. Several compiler-based techniques, such as SparTA [80] and SparseRT [60], are tailored for static sparsity patterns, while others, including Sputnik [17], cuSPARSE [42], PiT [79], and Flash-LLM [65], provide support for more general sparsity patterns. Recently, an increasing number of hardware solutions [6, 9, 18, 63] have been specially designed to accelerate sparse computations, including NVIDIA's Sparse Tensor Core [43]. Although these advancements significantly enhance sparse computation efficiency, the I/O bottleneck in on-device LLM inference has become increasingly pronounced. Complementary to these efforts, Ripple addresses this I/O bottleneck with neuron co-activation linking.

Activation Sparsity Application. Many recent works have begun leveraging activation sparsity to reduce the resource demands of LLM inference. For instance, Deja Vu [37] pioneered a predictor-based approach for sparsity-based LLM inference, greatly reducing inference latency. Building upon this, PowerInfer [50] exploits this property to enable LLM execution on consumer-grade GPUs by offloading model parameters to the CPU. Particularly in mobile scenarios, LLM in a Flash [4] first proposed using flash on smartphones for model offloading. PowerInfer-2 [69] extends this approach further, serving a 47B LLM on a smartphone. However, these methods primarily concentrate on optimizing DRAM management and overlapping computation with data transfers, achieving only limited bandwidth improvements. Ripple complements these efforts by directly enhancing the neuron transfer bandwidth, an optimization that can integrate with existing techniques to accelerate LLM inference on smartphones.
8 Conclusion [15] Mohammad Fraiwan and Natheer Khasawneh. A review of chatgpt ap-
plications in education, marketing, software engineering, and health-
We propose Ripple, an efficient approach to accelerating care: Benefits, drawbacks, and research directions. arXiv preprint
LLM inference on smartphones through I/O optimizations. arXiv:2305.00237, 2023.
Leveraging neuron co-activation, Ripple notably reorganizes [16] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh.
neuron placement within flash to facilitate more continu- Gptq: Accurate post-training quantization for generative pre-trained
transformers. arXiv preprint arXiv:2210.17323, 2022.
ous read access, shifting the primary performance bottleneck [17] Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. Sparse GPU
from IOPS to bandwidth. This work unveils a novel optimiza- kernels for deep learning. In Proceedings of the International Conference
tion space at the intersection of sparsity-driven algorithm for High Performance Computing, Networking, Storage and Analysis,
and storage-level system co-design in LLM inference. SC 2020, 2020.
[18] Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. Sparse gpu
References kernels for deep learning. In SC20: International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–14.
[1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, IEEE, 2020.
Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash [19] Georgi Gerganov. ggerganov/llama.cpp: Port of facebook’s llama
Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable model in c/c++. https://github.com/ggerganov/llama.cpp, 2024.
language model locally on your phone. arXiv preprint arXiv:2404.14219, [20] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W
2024. Mahoney, and Kurt Keutzer. A survey of quantization methods for
[2] AF Agarap. Deep learning using rectified linear units (relu). arXiv efficient neural network inference. In Low-Power Computer Vision,
preprint arXiv:1803.08375, 2018. pages 291–326. Chapman and Hall/CRC, 2022.
[3] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, [21] Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie
Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
[4] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. arXiv preprint arXiv:2312.11514, 2023.
[5] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[6] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on cuda. Technical report, Nvidia Technical Report NVR-2008-004, Nvidia Corporation, 2008.
[7] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[9] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 233–244, 2009.
[10] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024.
[11] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
[12] Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.
[13] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
[14] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
[22] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. Advances in neural information processing systems, 29, 2016.
[23] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[24] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in neural information processing systems, 1, 1988.
[25] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.
[26] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
[27] JEDEC. Jedec announces publication of universal flash storage (ufs) standard. https://www.jedec.org, February 2021. Accessed: 2024-10-02.
[28] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[29] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[30] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023.
[31] Gilbert Laporte. The traveling salesman problem: An overview of exact and approximate algorithms. European Journal of Operational Research, 59(2):231–247, 1992.
[32] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2554–2564, 2016.
[33] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024.
[34] Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. arXiv preprint arXiv:2210.06313, 2022.
[35] Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
[36] Jiayi Liu, Samarth Tripathi, Unmesh Kurup, and Mohak Shah. Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey. arXiv preprint arXiv:2005.04275, 2020.
[37] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023.
[38] Scott Lowe. Calculate iops in a storage array. TechRepublic, 2010. https://www.techrepublic.com/blog/the-enterprise-cloud/calculate-iops-in-a-storage-array. Accessed: 2020-02-27.
[39] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
[40] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. Towards efficient generative large language model serving: A survey from algorithms to systems. arXiv preprint arXiv:2312.15234, 2023.
[41] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.
[42] Maxim Naumov, L Chien, Philippe Vandermersch, and Ujval Kapasi. Cusparse library. In GPU Technology Conference, 2010.
[43] NVIDIA. Accelerating inference with sparsity using the nvidia ampere architecture and nvidia tensorrt, 2021. Accessed: 2024-10-18.
[44] OpenAI. ChatGPT: Get instant answers, find creative inspiration, learn something new. https://openai.com/chatgpt, 2022.
[45] OpenAI. GPT-4 Technical Report. Technical report, 2023.
[46] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33:20378–20389, 2020.
[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[48] Michael Sipser. Introduction to the theory of computation. ACM Sigact News, 27(1):27–29, 1996.
[49] Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, et al. Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models. arXiv preprint arXiv:2402.13516, 2024.
[50] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456, 2023.
[51] Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen. Turbo sparse: Achieving llm sota performance with minimal activated parameters. arXiv preprint arXiv:2406.05955, 2024.
[52] Jiahong Su and Weipeng Yang. Unlocking the power of chatgpt: A framework for applying generative ai in education. ECNU Review of Education, 6(3):355–366, 2023.
[53] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[54] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[55] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[56] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[57] Jan Van Leeuwen. Handbook of theoretical computer science (vol. A): algorithms and complexity. MIT Press, 1991.
[58] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.
[59] Hongyu Wang, Shuming Ma, Ruiping Wang, and Furu Wei. Q-sparse: All large language models can be fully sparsely-activated. arXiv preprint arXiv:2407.10969, 2024.
[60] Ziheng Wang. Sparsert: Accelerating unstructured sparsity on gpus for deep learning inference. In Proceedings of the ACM international conference on parallel architectures and compilation techniques, pages 31–42, 2020.
[61] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272, 2023.
[62] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Autodroid: Llm-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 543–557, 2024.
[63] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, 2007.
[64] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
[65] Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023.
[66] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
[67] Dongkuan Xu, Ian EH Yen, Jinxi Zhao, and Zhibin Xiao. Rethinking network pruning under the pre-train and fine-tune paradigm. arXiv preprint arXiv:2104.08682, 2021.
[68] Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. arXiv preprint arXiv:2402.11295, 2024.
[69] Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen. Powerinfer-2: Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282, 2024.
[70] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, and Rashmi Vinayak. Fifo queues are all you need for cache eviction. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 130–149, 2023.
[71] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, page 100211, 2024.
[72] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
[73] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022.
[74] Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. Llm as a system service on mobile devices. arXiv preprint arXiv:2403.11805, 2024.
[75] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.
[76] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[77] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Moefication: Transformer feed-forward layers are mixtures of experts. arXiv preprint arXiv:2110.01786, 2021.
[78] Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. Relu2 wins: Discovering efficient activation functions for sparse llms. arXiv preprint arXiv:2402.03804, 2024.
[79] Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, et al. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023.
[80] Ningxin Zheng, Bin Lin, Quanlu Zhang, Lingxiao Ma, Yuqing Yang, Fan Yang, Yang Wang, Mao Yang, and Lidong Zhou. SparTA: Deep-Learning Model sparsity via Tensor-with-Sparsity-Attribute. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 213–232, 2022.
[81] Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, et al. Large language model (llm) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities. arXiv preprint arXiv:2405.10825, 2024.