Mobile Foundation Model as Firmware
ABSTRACT

In the current AI era, mobile devices such as smartphones are tasked with executing a myriad of deep neural networks (DNNs) locally. This presents a complex landscape, as these models are highly fragmented in terms of architecture, operators, and implementations. Such fragmentation poses significant challenges to the co-optimization of hardware, systems, and algorithms for efficient and scalable mobile AI.

Inspired by the recent groundbreaking progress in large foundation models, this work introduces a novel paradigm for mobile AI, where the mobile OS and hardware jointly manage a foundation model that is capable of serving a wide array of mobile AI tasks. This foundation model functions akin to firmware: unmodifiable by apps or the OS, it is exposed as a system service to apps, which invoke it through a small, offline fine-tuned "adapter" for various downstream tasks. We propose a tangible design of this vision called M4 and prototype it from publicly available pre-trained models. To assess its capability, we also build a comprehensive benchmark consisting of 38 mobile AI tasks and 50 datasets, spanning 5 multimodal input types. Extensive experiments demonstrate M4's remarkable results: it achieves comparable accuracy in 85% of tasks, offers enhanced scalability regarding storage and memory, and relies on much simpler operations. In broader terms, this work paves a new way towards efficient and scalable mobile AI in the post-LLM era.

CCS CONCEPTS

• Human-centered computing → Ubiquitous and mobile computing systems and tools.

KEYWORDS

Mobile computing, multimodal foundation model, efficient and scalable mobile AI

ACM Reference Format:
Jinliang Yuan†, Chen Yang†, Dongqi Cai†, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu*. 2024. Mobile Foundation Model as Firmware: The Way Towards a Unified Mobile AI Landscape. In International Conference On Mobile Computing And Networking (ACM MobiCom '24), September 30–October 4, 2024, Washington D.C., DC, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3636534.3649361

†Authors contributed equally to this research. *Corresponding author.

1 INTRODUCTION

Machine learning is revolutionizing mobile applications by facilitating a more automated, intelligent, and efficient interaction between users and devices. These advancements enable humans to enjoy the convenience provided by deep models at all times and locations, from voice assistants [9, 61] and image editors [112, 124, 129] to augmented reality [100, 148]. As reported in [39, 130], the number of deep models incorporated within individual devices is growing rapidly, making mobile devices a primary vehicle for AI.

Executing deep models on devices offers benefits in data privacy and service availability, but it also demands significant resources such as memory, energy, and time. Efficient and scalable on-device execution of these models calls for a comprehensive co-design approach that integrates hardware, systems, and algorithms. However, this task is challenged by the fragmented ecosystem of mobile deep models: they significantly differ in architecture, operators, and implementations [46, 56, 68, 77, 113, 135]. This fragmentation, which often results in ad-hoc optimization efforts [40, 99, 107], seems unavoidable. It originates from the complex nature of mobile AI tasks (CV/NLP/TTS/HAR/...), multimodal data from various sensors (camera, screen, microphone, etc.), and diverse application demands (high accuracy, low latency, etc.).
Figure 1: An overview of M4. [Figure: multimodal encoders (TXT-Enc, IMU-Enc, AUD-Enc, …) feed a Projector and a pretrained LLM backbone (LLaMA, Vicuna, etc.); decoders (TTS-Dec, CLS-Dec, …) produce outputs for tasks such as style transfer, OCR, vehicle ReID, crowd counting, Q&A, word prediction, document summarization, ASR, SLU, HAR, TTS, VQA, and image captioning (more in Table 3). The three stages are labeled Multimodal Embedding, Understanding & Reasoning, and Multimodal Generator.]

Such fragmentation fundamentally undermines the efficiency of constructing an efficient and scalable mobile AI stack, notably in the following three aspects:
• Hardware aspect: It complicates the design of ASIC-based accelerators (NPUs) by forcing difficult trade-offs between generality and performance. §2.2 shows that mobile NPUs can achieve up to a 22× speedup over multi-core CPUs on the Qualcomm Snapdragon 8+ Gen 1 platform. However, this advantage only extends to a small fraction (around 8%) of deep models due to the lack of operator support.
• OS aspect: It hampers the system-wise sharing of weights and computations across different applications. Mobile apps often perform similar meta-ML tasks (e.g., object detection for augmented reality, image enhancement, and OCR apps), and there exist temporal, spatial, and semantic correlations among the input data [62]. However, exploiting such similarities to reduce memory or computation via cache-reuse is currently impractical, due to the model fragmentation and OSes' lack of visibility into models managed at the application level.
• Software aspect: It makes library-level optimizations ad-hoc. As noted in [144], there is a wide array of frameworks available to developers, but their performance can vary significantly across different models and devices. No single solution excels universally, often leaving developers struggling to differentiate between them.

Mobile foundation model as firmware. To fundamentally tackle the aforementioned issues, we propose a novel paradigm for mobile AI in which the OS and hardware co-manage a foundation model that is capable of addressing most, if not all, mobile AI tasks. This model, akin to firmware, is exposed as a system service to applications, similar to the unified ML interface NNAPI [60] in Android. It remains unaltered by apps or the OS. To utilize it, each application embeds a lightweight "adapter" that is fine-tuned offline for its downstream tasks. This approach could greatly simplify NPU design and allow the OS to take control of AI computing across applications, thereby facilitating the sharing of weights, computations, and task scheduling. This vision becomes feasible thanks to recent advancements in the ML community, specifically: (1) the establishment of extensive knowledge from vast Internet data; (2) the development of algorithms to accurately align multimodal data input [57, 114]; and (3) the demonstration of parameter-efficient fine-tuning (PEFT) methods like LoRA [65, 89] that efficiently adapt pre-trained models to diverse downstream tasks.
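To make the paradigm above concrete, the following sketch illustrates how an app could consume such an OS-managed foundation model through a per-task adapter. It is purely illustrative: every name in it (FoundationModelService, load_adapter, infer) is hypothetical and does not correspond to any existing Android or NNAPI interface.

    # Hypothetical sketch of the "foundation model as firmware" usage pattern.
    # All names below are invented for illustration; this is NOT an existing OS API.

    class FoundationModelService:
        """OS-managed, read-only foundation model exposed as a system service."""

        def load_adapter(self, adapter_path: str) -> int:
            # Register an app-supplied, offline fine-tuned adapter; return a handle.
            print(f"registered adapter: {adapter_path}")
            return 0

        def infer(self, adapter_handle: int, inputs: dict) -> dict:
            # Run the shared foundation model with the given adapter activated.
            return {"text": "<stub output>"}

    # App side: the app ships only a small adapter (KBs-MBs), never a full model.
    svc = FoundationModelService()                        # bind to the system service
    ocr_adapter = svc.load_adapter("assets/ocr_adapter.bin")
    result = svc.infer(ocr_adapter, {"image": "receipt.png"})
    print(result["text"])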
While the vision is intriguing, there are two key missing pieces to turn it into reality. (1) How to build such a one-size-fits-all foundation model to ubiquitously handle highly diversified, multimodal mobile AI tasks? While research on multimodal foundation models has achieved impressive progress in recent years, they are still not adequate in our case: most of them [80, 91, 101] handle only a small, fixed number of input/output modalities (e.g., text-image) and cannot be flexibly adapted to more; CoDi [114], an effort concurrent with this work, enables any-to-any generation across three modalities (image-text-audio), but requires more than 34GB of on-device storage/memory. (2) How to properly evaluate the performance of the proposed foundation model? To the best of our knowledge, there has been no comprehensive benchmark or set of standard metrics for mobile AI tasks.

M4: a composable mobile foundation model (§3). We introduce M4, the first architectural design and implementation of a multimodal mobile foundation model, as illustrated in Figure 1. Unlike prior approaches like CoDi that directly use (N×) heavy encoders to align multimodal input data and (M×) heavy decoders to generate specific data formats, M4 adds a backbone module in between (a "narrow waist") that comprehends and reasons for each downstream task. Through such an "N-1-M" design, M4 is able to achieve better accuracy-to-parameter efficiency compared to the traditional "N-M" architecture. Moreover, M4 can be partially activated by various tasks based on their characteristics (input/output modality, the need for complex comprehension, etc.). We have fully prototyped M4 with only pre-trained models publicly available from HuggingFace [126], which guarantees the reproducibility of M4 and also demonstrates its compatibility with the existing LLM ecosystem. Overall, M4 contains 9.2B parameters and demands 7.5GB of peak memory footprint. Such a size is only affordable on high-end mobile devices nowadays, but we deem it to be soon feasible for more common devices, whose memory/storage capacity is increasing significantly every year.
eAIBench: a comprehensive edge-oriented AI benchmark (§4.1). To assess M4 and future endeavors, we have constructed the first comprehensive benchmark for diverse mobile AI tasks, named eAIBench. Built through an extensive examination of real-world mobile AI and publications in mobile venues, eAIBench presently includes 38 important mobile AI tasks and 50 classic datasets. The tasks cover five different input/output data modalities (vision, text, audio, IMU, and mix). Each task is also linked with a task-specific model, representative of the DNNs of the pre-LLM era (e.g., ResNet-152 for image classification [63] and LSTM for input token prediction [67]). We also standardize a set of key metrics to quantify the capability of a foundation model.

Key results (§4). We then conduct extensive experiments to evaluate M4 using eAIBench on three kinds of hardware platforms: an NVIDIA A100 GPU, an NVIDIA Jetson Orin NX, and a Pixel 7 Pro smartphone. We summarize our major results below.
• Ubiquity – M4 effectively supports most tasks and datasets in eAIBench. Compared with the models tailored for each task, M4 shows comparable accuracy on 85% of the 50 datasets and a significant improvement on 4 of them (including image captioning and text-to-image retrieval). In only six instances does M4 experience nontrivial accuracy degradation, marked by a greater than 10% gap. The system also demonstrates promising zero-shot and few-shot capabilities, achieving usable accuracy on certain tasks without any fine-tuning. Moreover, quantization minimally affects the performance of M4: when reduced to 8 bits on two tested tasks, accuracy degradation ranges only between 0.2% and 0.8%. To be noted, the backbone LLM used in the current prototype of M4, i.e., LLaMA (Feb. 2023), has been surpassed by many other open-source LLMs since its release, such as LLaMA-2 (July 2023) and Mistral-7B [71] (Oct. 2023). We expect the performance of M4 to improve substantially as well by using such more powerful backbone LLMs. This is also confirmed by our preliminary experiments replacing LLaMA with LLaMA-2 on two tasks, as will be discussed in §4.2.
• Scalability – Despite the M4 foundation model's heavier footprint, its adaptation to downstream mobile tasks is lightweight and therefore more scalable. The current implementation of M4 encompasses ∼10 billion parameters, in contrast to the mere 1 million to 500 million parameters found in task-specific models. Nevertheless, the "adapters" of M4 require only 1,000 to 10 million parameters, which enhances scalability across various mobile AI tasks, given that the foundation model is shared (a back-of-envelope estimate follows after this list). For example, on a device with 12GB of memory, M4 (4-bit quantized) with all 50 adapters can be hosted in memory, eliminating cold-start latency, whereas only 20 of the 50 task-specific models would fit within the same memory constraint.
• Velocity – M4 is much slower than task-specific models, yet the gap might be mitigated through a highly optimized NPU. On a high-end autonomous-driving board, the Jetson Orin NX (16GB memory), M4 runs 18× slower on average. We also test the performance of M4 on smartphone CPUs (currently, M4 cannot run on COTS smartphone GPUs/NPUs due to the lack of operator support), which shows that the prediction delay could be too high, i.e., 2.1 secs to classify an image or 240 msecs to generate a token in QA. However, such a performance degradation might be addressed by running M4 on a highly optimized NPU, since existing NPUs already offer up to a 22-fold speedup over CPUs, as mentioned in §2.2.
• Simplicity – M4 requires fewer operators for execution, greatly simplifying hardware design. In the ONNX format, M4 utilizes a mere 39 different mathematical operators, in contrast to the cumulative 156 operators required by the 50 task-specific models. More impressively, M4 can expand its capabilities using the same number of operators. The traditional approach, on the other hand, continuously introduces new operators [139, 149], thereby complicating NPU design.
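As a rough sanity check of the scalability claim above, the back-of-envelope arithmetic below uses only figures quoted in this section (9.2B total parameters, 4-bit quantization, 1K–10M parameters per adapter, 1M–500M parameters per task-specific model); the bytes-per-parameter values are assumptions for illustration and ignore activations and runtime buffers.

    # Back-of-envelope memory arithmetic behind the scalability bullet above.
    # Assumptions: 4-bit weights = 0.5 byte/param for M4; fp16 = 2 bytes/param
    # for adapters and task-specific (TS) models; activations are ignored.
    GB = 1024 ** 3

    m4_params       = 9.2e9      # M4 in total (from Section 1)
    adapter_params  = 10e6       # upper bound per adapter (1K - 10M)
    ts_model_params = 500e6      # upper bound per TS model (1M - 500M)
    num_tasks       = 50

    m4_4bit   = m4_params * 0.5 / GB                  # ~4.3 GB
    adapters  = num_tasks * adapter_params * 2 / GB   # <~1 GB for all 50 adapters
    ts_models = num_tasks * ts_model_params * 2 / GB  # ~47 GB if all were resident

    print(f"M4 (4-bit) + 50 adapters ~= {m4_4bit + adapters:.1f} GB")  # fits in 12 GB
    print(f"50 TS models (fp16)      ~= {ts_models:.1f} GB")           # does not fit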
In addition to conventional mobile AI tasks, M4 also enables more complex and innovative mobile applications, e.g., a sophisticated assistant capable of processing multimodal input data, understanding user intentions, and responding with precision, as demonstrated in §4.7.

Contributions. Our major contributions are summarized below.
• We delineate a vision for a mobile foundation model, harnessing cutting-edge machine learning techniques to consolidate the mobile AI ecosystem and foster integrated hardware-system co-design.
• We design and prototype the first mobile foundation model with public, pre-trained LLMs.
• We have constructed the first comprehensive edge-oriented AI benchmark, through which our prototype demonstrates significant potential in catering to widespread mobile AI tasks, while exhibiting strong scalability, flexibility, and velocity in its performance.
Open-source M4 and eAIBench are publicly available at https://github.com/UbiquitousLearning/MobileFM.

2 BACKGROUND AND MOTIVATION

2.1 Mobile AI Characteristics

Mobile AI is pervasive. An important trend of AI deployment is the migration of deep learning inference tasks from data centers to smartphones, aiming to minimize user-perceived latency and better preserve data privacy [45, 49]. For instance, it is reported that Android apps embedded with on-device DNNs on the Google Play market experienced a remarkable 60% growth from February 2020 to April 2021 [39]; such DL-enhanced apps have been downloaded by users billions of times. Unsurprisingly, mobile devices like smartphones and laptops have become major carriers of intelligence, where DNN inference happens frequently, anywhere and anytime, even without users being aware of it.

Mobile DNNs are fragmented. Unlike cloud AI, where each computing unit (e.g., an NVIDIA GPU) only serves one model for user requests [55, 90, 106], a mobile device needs to handle highly diversified mobile AI tasks by itself. Such diversification is inevitable since mobile AI tasks could
[Figure: (a) Increased operators over time — number of operators in TensorFlow (CPU), TFLite (CPU), TFLite (NPU-v1.0), and TFLite (NPU-v2.0), 2019–2023; the number of NPU operators is limited and its growth is slow. (b) Latency reduction and (c) Energy saving — latency (s) and energy (W·s) of ResNet-152 and Bert-base on CPU, GPU, and NPU; the NPU improves latency and energy consumption significantly.]
that require CPU-NPU co-running, the inference speed is not even as good as running on the mobile GPU. Figure 3(c) further digs into the reason for this phenomenon: the NPU-incompatible DNNs need to be split into many sub-models to be scheduled between the CPU and NPU (e.g., a median number of 50); the resulting data movement and format exchange can therefore severely delay the inference.

2.3 Emergence of Foundation Models

Foundation models are renovating the AI ecosystem; the model design is converging. In recent years, significant advancements have been made in large-scale neural networks for language, image, and audio understanding and generation. GPT-3 [42] exemplifies this progress with impressive performance across various tasks, revolutionizing human-computer interaction and intelligent assistance. In the visual domain, Meta's SAM [75] demonstrates exceptional zero-shot proficiency. Additionally, models like Kosmos-1 [66] and PaLM-E [47] handle inputs from multiple modalities, enabling diverse task capabilities. These models share the transformer architecture [119], differing mainly in layer configurations or input modality processing. This convergence trend in AI model design is expected to continue in the future.

However, there has been no effort in building one model to fit highly diversified mobile AI tasks. None of the aforementioned foundation models is capable of (not even close to) solving all mobile AI tasks. A single-modality model (such as GPT for NLP) cannot comprehend or generate data in other modalities. Existing multimodal models (such as CLIP for CV-NLP) can only deal with very limited multimodal AI tasks. One might seek to include a foundation model for each <input: M1, output: M2> pair to solve the above issue, but: (1) it is not parameter-efficient, as the comprehension and conversion between different modality data share inherent common sense [80, 101]; (2) it cannot support AI tasks that take multimodal input or output, such as visual question answering [146]. There have been ad-hoc approaches to deal with those issues [111], yet we are not aware of any systematic strategy to build a one-size-fits-all foundation model for diversified mobile AI tasks.

3 M4 DESIGN AND PROTOTYPING

3.1 Overview

Design principles. M4 is a one-size-fits-all foundation model for diversified mobile AI tasks. It is designed with the following principles: (1) unified: instead of building independent foundation models for different possible modalities, M4 provides a unified architecture that maximizes capability sharing across different modalities, thus being more resource-efficient and extensible; (2) elastic: M4 can be easily scaled out to more modalities (either for input or output), e.g., for new types of sensor/app data; (3) multimodal: M4 can take multimodal input or generate multimodal output as needed, e.g., for advanced mobile applications like visual question answering or audio captioning.

Model architecture. Figure 1 illustrates the overall architecture of M4, which consists of three major components (a simplified end-to-end code sketch is given at the end of this subsection):
• Multimodal Embedding aligns the contents of different modalities by converting multimodal input data into a unified representation (i.e., a vector). It is typically implemented as a set of transformer encoders [57], one per modality, except that audio has two independent encoders to differentiate context information (e.g., background noise, speaker emotions) from spoken language (e.g., automatic speech recognition).
• Foundation Backbone (i.e., the pre-trained LLM backbone) comprehends and reasons about the input data. It encapsulates abundant knowledge to understand complex embedded data, performs task-specific inference, and generates easily intelligible output for the generator. It uses a decoder-based architecture trained on a huge amount of textual data, since language has been acknowledged as the most representative type of data [50, 92, 117]. The backbone is the heaviest part of M4.
• Multimodal Generator adapts the output of the foundation backbone to the task-specific data format. For classification tasks, it is simply an MLP with a softmax layer; for image tasks, it is a stable-diffusion model [104]; etc.

Trainable parameters. M4 contains three trainable parts to be fine-tuned for downstream mobile AI tasks: two PEFT modules inserted into the multimodal embedding and the foundation backbone, respectively, and one MLP projection layer that adapts the output of the multimodal embedding to the representation required by the foundation backbone. In later experiments, we use LoRA as the default PEFT method, but also report results for other PEFT methods. As will be demonstrated in §4, the trainable parameter size is trivial compared to the pre-trained part of M4 and is also much smaller than traditional state-of-the-art DNNs.
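To make the three-stage pipeline and its trainable parts concrete, the sketch below shows the N-1-M data flow in PyTorch-style code. The module names and dimensions are illustrative placeholders, not the exact classes or sizes used in M4; the point is that N modality encoders and M task decoders meet at one shared, frozen backbone, with only a small projector (and, not shown, per-task LoRA adapters) trained per task.

    import torch
    import torch.nn as nn

    # Illustrative sketch of M4's N-1-M data flow (not the actual M4 code).
    # N modality encoders -> projector -> shared LLM backbone -> M task decoders.
    D_ENC, D_LLM = 512, 1024          # embedding widths, placeholder values

    class Encoder(nn.Module):         # stands in for the IMG/TXT/IMU/AUD encoders
        def __init__(self, d_in):
            super().__init__()
            self.proj = nn.Linear(d_in, D_ENC)
        def forward(self, x):
            return self.proj(x)

    class Backbone(nn.Module):        # stands in for the pre-trained LLM backbone
        def __init__(self):
            super().__init__()
            layer = nn.TransformerEncoderLayer(D_LLM, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        def forward(self, h):
            return self.blocks(h)

    encoders = nn.ModuleDict({        # the "N" side
        "image": Encoder(768), "text": Encoder(300), "imu": Encoder(6),
    })
    projector = nn.Linear(D_ENC, D_LLM)   # small trainable projection layer
    backbone  = Backbone()                # frozen, shared across all tasks
    decoders  = nn.ModuleDict({           # the "M" side
        "cls": nn.Linear(D_LLM, 10),      # e.g., a classification head
        "tts": nn.Linear(D_LLM, 80),      # e.g., mel-spectrogram frames
    })
    for p in backbone.parameters():       # the backbone behaves like firmware
        p.requires_grad = False

    def run(modality, task, x):
        h = encoders[modality](x)     # 1) align the modality into a shared space
        h = projector(h)              # 2) adapt to the backbone's width
        h = backbone(h)               # 3) comprehend / reason (shared backbone)
        return decoders[task](h)      # 4) emit the task-specific output

    out = run("image", "cls", torch.randn(1, 16, 768))   # 16 toy image tokens
    print(out.shape)                                     # torch.Size([1, 16, 10])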
3.2 Prototyping with Off-the-Shelf LLMs

We have fully prototyped M4 with only pre-trained, off-the-shelf models publicly available from HuggingFace [126]. This guarantees the reproducibility of M4 and also demonstrates its compatibility with the existing LLM ecosystem.
• Multimodal Embedding. The multimodal embedding is composed of five parallel modules with a transformer encoder-only architecture: Image (IMG_enc), Text (TXT_enc), Inertial Measurement Unit (IMU_enc), Audio-Background (AUD-B_enc), and Audio-Intent (AUD-I_enc). The IMG_enc employs the Vision Transformer (ViT) architecture and is utilized
memory results are measured using TFLite's benchmark tools [116], while power consumption data is extracted from Android's virtual file system (e.g., /sys and /proc); a minimal sketch of this power readout is shown below.
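The snippet below is one simplified way to sample battery power through the standard Linux/Android power-supply sysfs nodes. Node names, units, and sign conventions vary across devices, so the paths and the µA/µV scaling here are assumptions rather than a universal recipe; energy in W·s is then the mean power multiplied by the elapsed time.

    import time

    # Minimal sketch of sampling battery power on Android via sysfs.
    # Assumed (device-dependent): current_now in microamps, voltage_now in microvolts.
    BATT = "/sys/class/power_supply/battery"

    def read_int(node):
        with open(f"{BATT}/{node}") as f:
            return int(f.read().strip())

    def sample_power_w():
        current_a = read_int("current_now") / 1e6    # uA -> A
        voltage_v = read_int("voltage_now") / 1e6    # uV -> V
        return abs(current_a * voltage_v)            # instantaneous power in W

    samples = []
    for _ in range(10):                              # ~1 s of sampling
        samples.append(sample_power_w())
        time.sleep(0.1)
    print(f"mean power: {sum(samples) / len(samples):.2f} W")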
4.2 Overall Accuracy

M4 can well support most mobile AI tasks and datasets. Figure 5 illustrates M4's overall performance improvement (or degradation) compared to TS-models. As observed, M4 can achieve comparable performance across 85% of tasks, with over 50% of these tasks showcasing considerable performance improvement. Note that the vertical axis, normalized in Table 3, reflects variations in accuracy across the 50 tasks. Our main focus is on evaluating whether M4 demonstrates superior or inferior accuracy compared to respective tasks,
Figure 5: Normalized accuracy comparison of M4 and TS-models on 50 popular mobile tasks and datasets.
Figure 11: M4's scalability analysis of storage and peak memory, measured on Jetson Orin. (a) Storage; (b) Peak memory.
Figure 12: M4's runtime cost of latency and energy, measured on Jetson Orin (GPU). (a) Average latency; (b) Average energy.
Figure 14: Simplicity analysis of M4's operators.

[Figure 15: screenshot of the multimodal chat demo built on M4 — a multi-turn conversation about a red panda image, mixing image, text, and voice input.]

Tasks | Path | Latency on CPU (s) | Latency on NPU* (s)
Image classification | Path-3 | IMG_enc: 2.10 | 0.11
Audio classification | Path-3 | AUD-I_enc: 0.28 | 0.014
Question answering | Path-2 | First token: 6.34; subsequent tokens: 0.24/token | 0.32; 0.012/token
Visual question answering | Path-1 | First token: 6.47; subsequent tokens: 0.25/token | 0.32; 0.013/token
Text-to-speech | Path-4 | TTS_dec: 0.82 | 0.041
Table 5: An in-depth what-if cost analysis of latency when running M4 on Pixel 7 Pro. NPU*: M4's estimated latency based on the NPU acceleration rate of the TS-model, if it could be deployed on the NPU.
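The NPU* column of Table 5 is a what-if estimate rather than a measurement. The sketch below shows the extrapolation: M4's measured CPU latency is divided by the NPU-over-CPU speedup observed for the task-specific model of the same task; the speedup values used here are illustrative assumptions in the roughly 19–22× range reported in §2.2.

    # What-if estimate behind Table 5's NPU* column (a sketch, not a measurement).
    # CPU latencies are from Table 5; the per-stage speedups are assumed values.
    measured_cpu_s     = {"IMG_enc": 2.10, "AUD-I_enc": 0.28, "first_token": 6.34}
    assumed_ts_speedup = {"IMG_enc": 19.0, "AUD-I_enc": 20.0, "first_token": 20.0}

    for stage, cpu_s in measured_cpu_s.items():
        est_npu_s = cpu_s / assumed_ts_speedup[stage]
        print(f"{stage:12s}: CPU {cpu_s:5.2f} s -> estimated NPU {est_npu_s:5.3f} s")
    # e.g., 2.10 s / 19x ~= 0.11 s, matching the IMG_enc row of Table 5.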
Most of the end-to-end latency lies in the time taken by the IMG-encoder and the generation of the first token by the backbone, which take approximately 2.1s and 6.3s, respectively. These components account for 31% and 93% of M4's average latency (6.8s in Figure 12(a)). However, the other components could exhibit near real-time inference (less than 100ms) if deployed on the NPU. These results demonstrate that, if M4 can be accelerated on the NPU, it could achieve on-par execution speed and energy consumption with TS-models. The aim of comparing M4's NPU performance with TS-models on the CPU is not to claim superiority, but to showcase how future NPU support can enhance efficiency for mobile foundation models.

4.6 Model Architecture Simplicity

M4's architectural design is much simpler and cleaner in terms of NN operators, and could therefore greatly simplify accelerator design. Figure 14(a) shows that the number of operators in the TS-models increases rapidly with the growth in the number of tasks. Notably, as the task spectrum broadens to encompass 50 tasks, the number of operator types culminates at 156. In contrast, M4 engages a mere 39 operator types, encompassing both the foundation model and the task-specific "adapters". Furthermore, Figure 14(b) undertakes a granular exploration of NPU-supported operators for both M4 and TS-models. It underscores that only 51 out of the 156 operators in TS-models are supported by the NPU, leaving more than 2/3 of the operators unable to fully run on the NPU; for M4, in contrast, NPU-supported operators account for 64%.
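The operator counts above can be reproduced over the exported ONNX graphs with a few lines of Python using the standard onnx package, as sketched below; the model file names are placeholders, and operators nested inside control-flow subgraphs are ignored for brevity.

    import onnx

    # Count distinct ONNX operator types used by a set of models (a sketch of the
    # analysis behind the 39-vs-156 comparison; file names are placeholders).
    def op_types(path):
        model = onnx.load(path)
        return {node.op_type for node in model.graph.node}

    m4_ops = op_types("m4.onnx")
    ts_ops = set()
    for path in ["resnet152.onnx", "bert_base.onnx"]:   # ... and the other TS models
        ts_ops |= op_types(path)

    print(f"M4 uses {len(m4_ops)} operator types")        # the paper reports 39
    print(f"TS models use {len(ts_ops)} operator types")  # 156 across all 50 models
    print(f"operator types shared by both: {len(m4_ops & ts_ops)}")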
4.7 Novel Application with M4

M4 enables complex mobile applications that were previously unattainable. Based on our proposed M4, we build a demo of a multimodal chat use case, as shown in Figure 15. Users engage in multi-turn chats with the M4 client using multimodal inputs such as images, text, and audio, thereby obtaining precise and tailored answers that meet their requirements. This multimodal computing capability is also crucial for the recently popular mobile agents [82]. We build this prototype system of M4 based on the architecture depicted in Figure 1. It first aligns the contents of image, text, and audio by converting the multimodal input data into a unified representation. Then, it encapsulates abundant knowledge to understand the complex embedded data, perform task-specific inference, and generate the required information. This innate capability for multimodal processing harbors the potential to significantly enrich the landscape of mobile applications.

5 RELATED WORK

Foundation models. Building one-for-all foundation models to serve generic AI tasks has been a primary research goal of the machine learning community. The recent advancements of LLMs [92, 117, 118, 143], multimodal alignment [57, 91, 111, 114], and parameter-efficient training methods [89, 136, 142] have shed light on this challenging goal. For instance, ImageBind [57] and CoDi [114] focus on how to align the embeddings of each modality, and PandaGPT [111] further attempts to compose semantics from different modalities naturally based on LLaMA [117]. However, there have been no efforts like M4 that try to fit extremely diversified AI
tasks into one model. Meanwhile, M4 leverages the most state-of-the-art pre-trained LLMs to reuse the wisdom as well as the investments of the ML community and industry. The concurrent work NExT-GPT [127] shares a similar architecture with M4. Nonetheless, M4 introduces two distinctive contributions: (1) it marks the inaugural proposal of a transformer-based N-1-M architecture, aiming to curtail resource costs in any-to-any modal generation; (2) its innovative multi-path execution design is tailored to enhance compatibility with highly diversified mobile AI tasks.

Hardware-system-algorithm co-design for mobile AI. AI workloads are highly compute-intensive and exhibit analogous patterns, and are therefore best accelerated by domain-specific accelerators (e.g., NPUs). For instance, SpAtten [122] and Sanger [88] focus on how efficient algorithm-architecture co-designs can reduce sparse attention computation and memory access. Besides, QNAS [83] and NAAS [84] focus on composing highly matched neural-hardware architectures by jointly searching for the neural network architecture, accelerator architecture, and compiler mapping. However, all prior literature makes trade-offs between the ubiquity of operator support and performance, instead of targeting a foundation model that can serve generic AI tasks itself. The vision of the mobile foundation model could open a new research domain for cross-layer co-design of mobile AI. There have been preliminary attempts [131, 132, 137] to alleviate the huge resource cost of large foundation models on devices. Those works are orthogonal to this work.

Managing AI as a mobile system service. AI has become a ubiquitous workload on mobile devices, and managing it at the system level (instead of in individual apps) could facilitate OS-wise runtime scheduling and software deployment. Some early studies [52, 123, 133, 144, 147] attempt to mitigate the severe fragmentation across different libraries in the mobile DL ecosystem. Google introduced the unified ML interface NNAPI [60] into Android in 2017 to bridge the gap between heterogeneous mobile processors and user-defined ML frameworks. Compared to the above work, M4 takes another giant step further: mobile devices shall manage a foundation model for each ML task and expose it as firmware.

6 LIMITATIONS AND FUTURE WORK

This study has several potential limitations. (1) eAIBench's results are evaluated on a datacenter GPU (NVIDIA A100) and an edge GPU (Jetson Orin), lacking assessment on mobile devices. This is mainly due to the highly diverse code implementations of the baseline models and the huge time span of evaluating M4 on large test datasets. There might exist performance gaps between different hardware architectures. Yet, the comparison is fair, as both the baseline models and M4 are evaluated on the same hardware. In fact, due to the simpler and cleaner architecture of M4, it would be much easier to design an accelerator to support M4 with high precision. (2) M4 underperforms baseline models on certain ML tasks. This unveils the limitation of existing pre-trained foundation models, e.g., on translation. On the one hand, we do not expect M4 to be able to solve all mobile AI tasks in the near future; it could co-exist with traditional DNNs that run on the mobile CPU. On the other hand, LLM capacity is still evolving fast: from LLaMA-1/2 used in this study to Mistral-7B [71], which ranks higher even than LLaMA-13B. Such continuous improvement endows our vision with much confidence.

To be noted, M4 is the very first step towards the vision of a mobile foundation model. We believe it could potentially revolutionize the mobile AI landscape and open a new research domain. However, to fully realize the vision, there are a few key designs to be explored. For instance: (1) Foundation model design: as a preliminary prototype, M4 is currently built atop off-the-shelf, separately pre-trained LLMs from the Internet instead of being tailored for mobile devices. Therefore, it is still highly inefficient in terms of accuracy and model parameter size. With enough resources (GPUs and data), hardware vendors can build a more compact mobile foundation model that is expected to deliver significantly higher accuracy with lower runtime cost than M4. (2) Accelerator design: fine-tuning for downstream tasks generates small "adapters" that are inserted into the mobile foundation model. The NPU should ideally have the flexibility to run those adapters as well; otherwise, the inference must involve CPU/GPU computation and data movement overhead. Fortunately, the adapters have a simple structure (e.g., linear matrix operations) and very few weights. (3) FM upgrading: the foundation model capacity could evolve with better architectures/weights, as shown in §4.2. Yet the adapters trained for the old foundation model cannot work with the new one. We therefore need a unified interface between LLMs and adapters to allow them to evolve independently without interfering with each other.
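One possible shape of such a unified interface — given purely as a speculative illustration, with no field or check below defined by M4 — is for each adapter to ship a small manifest declaring the backbone interface it was trained against, so the OS can detect (and reject or re-map) incompatible adapters when the foundation model is upgraded.

    from dataclasses import dataclass

    # Speculative sketch of a versioned adapter manifest (not part of M4).
    @dataclass
    class AdapterManifest:
        task: str                  # e.g., "ocr"
        backbone_api: str          # interface family, e.g., "m4-backbone/v1"
        hidden_dim: int            # embedding width the adapter was trained against
        insertion_points: tuple    # which backbone projections the adapter attaches to

    INSTALLED_BACKBONE = {"api": "m4-backbone/v1", "hidden_dim": 4096}

    def compatible(m):
        return (m.backbone_api == INSTALLED_BACKBONE["api"]
                and m.hidden_dim == INSTALLED_BACKBONE["hidden_dim"])

    ocr = AdapterManifest("ocr", "m4-backbone/v1", 4096, ("q_proj", "v_proj"))
    print("adapter accepted" if compatible(ocr) else "adapter needs re-tuning")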
7 CONCLUSIONS

We envision a mobile hardware-OS co-managed multimodal foundation model that is exposed to mobile applications as a system service to serve almost every on-device AI task. We design and prototype the first such model using off-the-shelf LLMs. Evaluated on a comprehensive benchmark consisting of 50 representative mobile AI tasks, M4 shows good accuracy, better scalability, and reduced runtime cost.

8 ACKNOWLEDGMENTS

The authors thank the anonymous reviewers and the shepherd for their insightful feedback. This work was supported by NSFC (61921003, 62102045).
video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11915–11925, 2021.
[87] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[88] Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 977–991, 2021.
[89] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[90] Harini Muthukrishnan, David Nellans, Daniel Lustig, Jeffrey A Fessler, and Thomas F Wenisch. Efficient multi-gpu shared memory via automatic optimization of fine-grained transfers. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 139–152. IEEE, 2021.
[91] Ivona Najdenkoska, Xiantong Zhen, and Marcel Worring. Meta learning to bridge vision and language models for multimodal few-shot learning. arXiv preprint arXiv:2302.14794, 2023.
[92] OpenAI. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf, 2023.
[93] Myle Ott, Sergey Edunov, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
[94] paperswithcode. Paper with code. https://paperswithcode.com/, 2023.
[95] Imagebind parameters. Imagebind parameters. https://github.com/facebookresearch/ImageBind, 2023.
[96] LLaMa parameters. Llama parameters. https://github.com/facebookresearch/llama, 2023.
[97] Seonghoon Park, Yeonwoo Cho, Hyungchol Jun, Jeho Lee, and Hojung Cha. Omnilive: Super-resolution enhanced 360° video live streaming for mobile devices. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, pages 261–274, 2023.
[98] Adam Paszke, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[99] Suchita Pati, Shaizeen Aga, Nuwan Jayasena, and Matthew D Sinclair. Demystifying bert: System design implications. In 2022 IEEE International Symposium on Workload Characterization (IISWC), pages 296–309. IEEE, 2022.
[100] Siddhant Prakash, Alireza Bahremand, Linda D Nguyen, and Robert LiKamWa. Gleam: An illumination estimation framework for real-time photorealistic augmented reality on mobile devices. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, pages 142–154, 2019.
[101] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[102] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
[103] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. Advances in neural information processing systems, 34:980–993, 2021.
[104] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[105] Teven Le Scao, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[106] Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. Nexus: A gpu cluster engine for accelerating dnn-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 322–337, 2019.
[107] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing cnn accelerator efficiency through resource partitioning. ACM SIGARCH Computer Architecture News, 45(2):535–547, 2017.
[108] Cong Shi, Xiangyu Xu, Tianfang Zhang, Payton Walker, Yi Wu, Jian Liu, Nitesh Saxena, Yingying Chen, and Jiadi Yu. Face-mic: inferring live speech and speaker identity via subtle facial dynamics captured by ar/vr motion sensors. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 478–490, 2021.
[109] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
[110] Lingyun Song, Jun Liu, Buyue Qian, and Yihe Chen. Connecting language to images: A progressive attention-guided network for simultaneous image captioning and language grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8885–8892, 2019.
[111] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
[112] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Semantic image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11120–11129, 2021.
[113] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
[114] Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846, 2023.
[115] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd international conference on pattern recognition (ICPR), pages 2464–2469. IEEE, 2016.
[116] TFLite tools. Tflite tools. https://www.tensorflow.org/lite/performance/measurement, 2023.
[117] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[118] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[119] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[120] Dan Wang, Haibo Lei, Haozhi Dong, Yunshu Wang, Yongpan Zou, and Kaishun Wu. What you wear know how you feel: An emotion inference system with multi-modal wearable devices. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–3, 2020.
[121] Hanrui Wang, Yongshan Ding, Jiaqi Gu, Yujun Lin, David Z Pan, Frederic T Chong, and Song Han. Quantumnas: Noise-adaptive search for robust quantum circuits. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 692–708. IEEE, 2022.
[122] Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
[123] Haoyu Wang, Hao Li, and Yao Guo. Understanding the evolution of mobile app ecosystems: A longitudinal measurement study of google play. In The World Wide Web Conference, pages 1988–1999, 2019.
[124] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
[125] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[126] Thomas Wolf, Lysandre Debut, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[127] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[128] Yusheng Xiang and Marcus Geimer. Optimization of operation strategy for primary torque based hydrostatic drivetrain using artificial intelligence. arXiv preprint arXiv:2003.10011, 2020.
[129] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
[130] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. A first look at deep learning apps on smartphones. In The World Wide Web Conference, pages 2125–2136, 2019.
[131] Mengwei Xu, Yaozong Wu, Dongqi Cai, Xiang Li, and Shangguang Wang. Federated fine-tuning of billion-sized language models across mobile devices. arXiv preprint arXiv:2308.13894, 2023.
[132] Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, et al. A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024.
[133] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xuanzhe Liu. Deepcache: Principled cache for mobile deep vision. In Proceedings of the 24th annual international conference on mobile computing and networking, pages 129–144, 2018.
[134] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[135] Zheng Yang, Lirong Jian, Chenshu Wu, and Yunhao Liu. Beyond triangle inequality: Sifting noisy and outlier distance measurements for localization. ACM Transactions on Sensor Networks (TOSN), 9(2):1–20, 2013.
[136] Zheng Yang, Yi Zhao, Fan Dang, Xiaowu He, Jiahang Wu, Hao Cao, Zeyu Wang, and Yunhao Liu. Caas: Enabling control-as-a-service for time-sensitive networking. In IEEE INFOCOM, 2023.
[137] Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint arXiv:2308.14352, 2023.
[138] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
[139] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
[140] Koh Jing Yu. Model zoo. https://modelzoo.co/, 2023.
[141] Mu Yuan, Lan Zhang, Fengxiang He, and Xiang-Yang Li. Infi: end-to-end learnable input filter for resource-efficient mobile-centric inference. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, pages 228–241, 2022.
[142] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[143] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[144] Qiyang Zhang, Xiang Li, Xiangying Che, Xiao Ma, Ao Zhou, Mengwei Xu, Shangguang Wang, Yun Ma, and Xuanzhe Liu. A comprehensive benchmark of deep learning libraries on mobile devices. In Proceedings of the ACM Web Conference, pages 3298–3307, 2022.
[145] Wuyang Zhang, Zhezhi He, Zhenhua Jia, Yunxin Liu, Marco Gruteser, Dipankar Raychaudhuri, and Yanyong Zhang. Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 201–214, 2021.
[146] Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023.
[147] Yi Zhao, Zheng Yang, Xiaowu He, Jiahang Wu, Hao Cao, Liang Dong, Fan Dang, and Yunhao Liu. E-tsn: Enabling event-triggered critical traffic in time-sensitive networking for industrial applications. In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pages 691–701. IEEE, 2022.
[148] Yiqin Zhao and Tian Guo. Xihe: a 3d vision-based lighting estimation framework for mobile augmented reality. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 28–40, 2021.
[149] Quan Zhou, Haiquan Wang, Xiaoyan Yu, Cheng Li, Youhui Bai, Feng Yan, and Yinlong Xu. Mpress: Democratizing billion-scale model training on multi-gpu servers via memory-saving inter-operator parallelism. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 556–569. IEEE, 2023.
[150] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330–18341, 2020.
[151] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In European Conference on Computer Vision, pages 350–368. Springer, 2022.
[152] Yongchao Zhou, Andrei Ioan Muresanu, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.