Mobile Foundation Model as Firmware
ABSTRACT

In the current AI era, mobile devices such as smartphones are tasked with executing a myriad of deep neural networks (DNNs) locally. This presents a complex landscape, as these models are highly fragmented in terms of architecture, operators, and implementations. Such fragmentation poses significant challenges to the co-optimization of hardware, systems, and algorithms for efficient and scalable mobile AI.

Inspired by the recent groundbreaking progress in large foundation models, this work introduces a novel paradigm for mobile AI, where the mobile OS and hardware jointly manage a foundation model that is capable of serving a wide array of mobile AI tasks. This foundation model functions akin to firmware: unmodifiable by apps or the OS, it is exposed as a system service to apps, which invoke it through a small, offline fine-tuned "adapter" for various downstream tasks. We propose a tangible design of this vision called M4 and prototype it from publicly available pre-trained models. To assess its capability, we also build a comprehensive benchmark consisting of 38 mobile AI tasks and 50 datasets, spanning 5 multimodal input types. Extensive experiments demonstrate M4's remarkable results: it achieves comparable accuracy in 85% of tasks, offers enhanced scalability regarding storage and memory, and relies on much simpler operations. In broader terms, this work paves a new way towards efficient and scalable mobile AI in the post-LLM era.

CCS CONCEPTS

• Human-centered computing → Ubiquitous and mobile computing systems and tools.

KEYWORDS

Mobile computing, multimodal foundation model, efficient and scalable mobile AI

ACM Reference Format:
Jinliang Yuan†, Chen Yang†, Dongqi Cai†, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu*. 2024. Mobile Foundation Model as Firmware: The Way Towards a Unified Mobile AI Landscape. In International Conference On Mobile Computing And Networking (ACM MobiCom '24), September 30–October 4, 2024, Washington D.C., DC, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3636534.3649361

†Authors contributed equally to this research. *Corresponding author.

1 INTRODUCTION

Machine learning is revolutionizing mobile applications by facilitating a more automated, intelligent, and efficient interaction between users and devices. These advancements enable humans to enjoy the convenience provided by deep models at all times and locations, from voice assistants [9, 61] and image editors [112, 124, 129] to augmented reality [100, 148]. As reported in [39, 130], the number of deep models incorporated within individual devices is growing rapidly, making mobile devices a primary vehicle for AI.

Executing deep models on devices offers benefits in data privacy and service availability, but it also demands significant resources such as memory, energy, and time. Efficient and scalable on-device execution of these models calls for a comprehensive co-design approach that integrates hardware, systems, and algorithms. However, this task is challenged by the fragmented ecosystem of mobile deep models: they significantly differ in architecture, operators, and implementations [46, 56, 68, 77, 113, 135]. This fragmentation, which often results in ad-hoc optimization efforts [40, 99, 107], seems unavoidable. It originates from the complex nature of mobile AI tasks (CV/NLP/TTS/HAR/...), multimodal data from various sensors (camera, screen, microphone, etc.), and diverse application demands (high accuracy, low latency, etc.).
Figure 1: An overview of M4. [Figure: multimodal encoders (TXT-Enc, IMU-Enc, AUD-Enc, …) feed a Projector and a pretrained LLM backbone (LLaMA, Vicuna, etc.); decoders (TTS-Dec, CLS-Dec, …) produce outputs for tasks such as style transfer, OCR, vehicle ReID, crowd counting, Q&A, word prediction, document summarization, ASR, SLU, HAR, TTS, VQA, and image captioning (more in Table 3). The three stages are labeled Multimodal Embedding, Understanding & Reasoning, and Multimodal Generator.]

Such fragmentation fundamentally undermines the efficiency of constructing an efficient and scalable mobile AI stack, notably in the following three aspects:
• Hardware aspect: It complicates the design of ASIC-based accelerators (NPUs) by forcing difficult trade-offs between generality and performance. §2.2 shows that mobile NPUs can achieve up to a 22× speedup over multi-core CPUs on the Qualcomm Snapdragon 8+ Gen 1 platform. However, this advantage only extends to a small fraction (around 8%) of deep models due to the lack of operator support.
• OS aspect: It hampers the system-wise sharing of weights and computations across different applications. Mobile apps often perform similar meta-ML tasks (e.g., object detection for augmented reality, image enhancement, and OCR apps), and there exist temporal, spatial, and semantic correlations among the input data [62]. However, exploiting such similarities to reduce memory or computation via cache-reuse is currently impractical, due to the model fragmentation and OSes' lack of visibility into models managed at the application level.
• Software aspect: It makes library-level optimizations ad-hoc. As noted in [144], there is a wide array of frameworks available to developers, but their performance can vary significantly across different models and devices. No single solution excels universally, often leaving developers struggling to differentiate between them.

Mobile foundation model as firmware. To fundamentally tackle the aforementioned issues, we propose a novel paradigm for mobile AI in which the OS and hardware co-manage a foundation model that is capable of addressing most, if not all, mobile AI tasks. This model, akin to firmware, is exposed as a system service to applications, similar to the unified ML interface NNAPI [60] in Android. It remains unaltered by apps or the OS. To utilize it, each application embeds a lightweight "adapter" that is fine-tuned offline for its downstream tasks. This approach could greatly simplify NPU design and allow the OS to take control of AI computing across applications, thereby facilitating the sharing of weights, computations, and task scheduling. This vision becomes feasible thanks to recent advancements in the ML community, specifically: (1) the establishment of extensive knowledge from vast Internet data; (2) the development of algorithms to accurately align multimodal data input [57, 114]; and (3) the demonstration of parameter-efficient fine-tuning (PEFT) methods like LoRA [65, 89] that efficiently adapt pre-trained models to diverse downstream tasks.
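To make the paradigm above concrete, the following sketch illustrates how an app could consume such an OS-managed foundation model through a per-task adapter. It is purely illustrative: every name in it (FoundationModelService, load_adapter, infer) is hypothetical and does not correspond to any existing Android or NNAPI interface.

    # Hypothetical sketch of the "foundation model as firmware" usage pattern.
    # All names below are invented for illustration; this is NOT an existing OS API.

    class FoundationModelService:
        """OS-managed, read-only foundation model exposed as a system service."""

        def load_adapter(self, adapter_path: str) -> int:
            # Register an app-supplied, offline fine-tuned adapter; return a handle.
            print(f"registered adapter: {adapter_path}")
            return 0

        def infer(self, adapter_handle: int, inputs: dict) -> dict:
            # Run the shared foundation model with the given adapter activated.
            return {"text": "<stub output>"}

    # App side: the app ships only a small adapter (KBs-MBs), never a full model.
    svc = FoundationModelService()                        # bind to the system service
    ocr_adapter = svc.load_adapter("assets/ocr_adapter.bin")
    result = svc.infer(ocr_adapter, {"image": "receipt.png"})
    print(result["text"])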
While the vision is intriguing, there are two key missing pieces to turn it into reality. (1) How to build such a one-size-fits-all foundation model to ubiquitously handle highly diversified, multimodal mobile AI tasks? While research on multimodal foundation models has achieved impressive progress in recent years, they are still not adequate in our case: most of them [80, 91, 101] handle only a small, fixed number of input/output modalities (e.g., text-image) and cannot be flexibly adapted to more; CoDi [114], an effort concurrent with this work, enables any-to-any generation across three modalities (image-text-audio), but requires more than 34GB of on-device storage/memory. (2) How to properly evaluate the performance of the proposed foundation model? To the best of our knowledge, there has been no comprehensive benchmark or set of standard metrics for mobile AI tasks.

M4: a composable mobile foundation model (§3). We introduce M4, the first architectural design and implementation of a multimodal mobile foundation model, as illustrated in Figure 1. Unlike prior approaches like CoDi that directly use (N×) heavy encoders to align multimodal input data and (M×) heavy decoders to generate specific data formats, M4 adds a backbone module in between (a "narrow waist") that comprehends and reasons for each downstream task. Through such an "N-1-M" design, M4 is able to achieve better accuracy-to-parameter efficiency compared to the traditional "N-M" architecture. Moreover, M4 can be partially activated by various tasks based on their characteristics (input/output modality, the need for complex comprehension, etc.). We have fully prototyped M4 with only pre-trained models publicly available from HuggingFace [126], which guarantees the reproducibility of M4 and also demonstrates its compatibility with the existing LLM ecosystem. Overall, M4 contains 9.2B parameters and demands 7.5GB of peak memory footprint. Such a size is only affordable on high-end mobile devices nowadays, but we deem it to be soon feasible for more common devices, whose memory/storage capacity is increasing significantly every year.
eAIBench: a comprehensive edge-oriented AI benchmark (§4.1). To assess M4 and future endeavors, we have constructed the first comprehensive benchmark for diverse mobile AI tasks, named eAIBench. Built through an extensive examination of real-world mobile AI and publications in mobile venues, eAIBench presently includes 38 important mobile AI tasks and 50 classic datasets. The tasks cover five different input/output data modalities (vision, text, audio, IMU, and mix). Each task is also linked with a task-specific model, representative of the DNNs of the pre-LLM era (e.g., ResNet-152 for image classification [63] and LSTM for input token prediction [67]). We also standardize a set of key metrics to quantify the capability of a foundation model.

Key results (§4). We then conduct extensive experiments to evaluate M4 using eAIBench on three kinds of hardware platforms: an NVIDIA A100 GPU, an NVIDIA Jetson Orin NX, and a Pixel 7 Pro smartphone. We summarize our major results below.
• Ubiquity – M4 effectively supports most tasks and datasets in eAIBench. Compared with the models tailored for each task, M4 shows comparable accuracy on 85% of the 50 datasets and a significant improvement on 4 of them (including image captioning and text-to-image retrieval). In only six instances does M4 experience nontrivial accuracy degradation, marked by a greater than 10% gap. The system also demonstrates promising zero-shot and few-shot capabilities, achieving usable accuracy on certain tasks without any fine-tuning. Moreover, quantization minimally affects the performance of M4: when reduced to 8 bits on two tested tasks, accuracy degradation ranges only between 0.2% and 0.8%. To be noted, the backbone LLM used in the current prototype of M4, i.e., LLaMA (Feb. 2023), has been surpassed by many other open-source LLMs since its release, such as LLaMA-2 (July 2023) and Mistral-7B [71] (Oct. 2023). We expect the performance of M4 to improve substantially as well by using such more powerful backbone LLMs. This is also confirmed by our preliminary experiments replacing LLaMA with LLaMA-2 on two tasks, as will be discussed in §4.2.
• Scalability – Despite the M4 foundation model's heavier footprint, its adaptation to downstream mobile tasks is lightweight and therefore more scalable. The current implementation of M4 encompasses ∼10 billion parameters, in contrast to the mere 1 million to 500 million parameters found in task-specific models. Nevertheless, the "adapters" of M4 require only 1,000 to 10 million parameters, which enhances scalability across various mobile AI tasks, given that the foundation model is shared (a back-of-envelope estimate follows after this list). For example, on a device with 12GB of memory, M4 (4-bit quantized) with all 50 adapters can be hosted in memory, eliminating cold-start latency, whereas only 20 of the 50 task-specific models would fit within the same memory constraint.
• Velocity – M4 is much slower than task-specific models, yet the gap might be mitigated through a highly optimized NPU. On a high-end autonomous-driving board, the Jetson Orin NX (16GB memory), M4 runs 18× slower on average. We also test the performance of M4 on smartphone CPUs (currently, M4 cannot run on COTS smartphone GPUs/NPUs due to the lack of operator support), which shows that the prediction delay could be too high, i.e., 2.1 secs to classify an image or 240 msecs to generate a token in QA. However, such a performance degradation might be addressed by running M4 on a highly optimized NPU, since existing NPUs already offer up to a 22-fold speedup over CPUs, as mentioned in §2.2.
• Simplicity – M4 requires fewer operators for execution, greatly simplifying hardware design. In the ONNX format, M4 utilizes a mere 39 different mathematical operators, in contrast to the cumulative 156 operators required by the 50 task-specific models. More impressively, M4 can expand its capabilities using the same number of operators. The traditional approach, on the other hand, continuously introduces new operators [139, 149], thereby complicating NPU design.
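As a rough sanity check of the scalability claim above, the back-of-envelope arithmetic below uses only figures quoted in this section (9.2B total parameters, 4-bit quantization, 1K–10M parameters per adapter, 1M–500M parameters per task-specific model); the bytes-per-parameter values are assumptions for illustration and ignore activations and runtime buffers.

    # Back-of-envelope memory arithmetic behind the scalability bullet above.
    # Assumptions: 4-bit weights = 0.5 byte/param for M4; fp16 = 2 bytes/param
    # for adapters and task-specific (TS) models; activations are ignored.
    GB = 1024 ** 3

    m4_params       = 9.2e9      # M4 in total (from Section 1)
    adapter_params  = 10e6       # upper bound per adapter (1K - 10M)
    ts_model_params = 500e6      # upper bound per TS model (1M - 500M)
    num_tasks       = 50

    m4_4bit   = m4_params * 0.5 / GB                  # ~4.3 GB
    adapters  = num_tasks * adapter_params * 2 / GB   # <~1 GB for all 50 adapters
    ts_models = num_tasks * ts_model_params * 2 / GB  # ~47 GB if all were resident

    print(f"M4 (4-bit) + 50 adapters ~= {m4_4bit + adapters:.1f} GB")  # fits in 12 GB
    print(f"50 TS models (fp16)      ~= {ts_models:.1f} GB")           # does not fit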
In addition to conventional mobile AI tasks, M4 also enables more complex and innovative mobile applications, e.g., a sophisticated assistant capable of processing multimodal input data, understanding user intentions, and responding with precision, as demonstrated in §4.7.

Contributions. Our major contributions are summarized below.
• We delineate a vision for a mobile foundation model, harnessing cutting-edge machine learning techniques to consolidate the mobile AI ecosystem and foster integrated hardware-system co-design.
• We design and prototype the first mobile foundation model with public, pre-trained LLMs.
• We have constructed the first comprehensive edge-oriented AI benchmark, through which our prototype demonstrates significant potential in catering to widespread mobile AI tasks, while exhibiting strong scalability, flexibility, and velocity in its performance.
Open-source M4 and eAIBench are publicly available at https://github.com/UbiquitousLearning/MobileFM.

2 BACKGROUND AND MOTIVATION

2.1 Mobile AI Characteristics

Mobile AI is pervasive. An important trend of AI deployment is the migration of deep learning inference tasks from data centers to smartphones, aiming to minimize user-perceived latency and better preserve data privacy [45, 49]. For instance, it is reported that Android apps embedded with on-device DNNs on the Google Play market experienced a remarkable 60% growth from February 2020 to April 2021 [39]; such DL-enhanced apps have been downloaded by users billions of times. Unsurprisingly, mobile devices like smartphones and laptops have become major carriers of intelligence, where DNN inference happens frequently, anywhere and anytime, even without users being aware of it.

Mobile DNNs are fragmented. Unlike cloud AI, where each computing unit (e.g., an NVIDIA GPU) only serves one model for user requests [55, 90, 106], a mobile device needs to handle highly diversified mobile AI tasks by itself. Such diversification is inevitable since mobile AI tasks could
[Figure: (a) Increased operators over time — number of operators in TensorFlow (CPU), TFLite (CPU), TFLite (NPU-v1.0), and TFLite (NPU-v2.0), 2019–2023; the number of NPU operators is limited and its growth is slow. (b) Latency reduction and (c) Energy saving — latency (s) and energy (W·s) of ResNet-152 and Bert-base on CPU, GPU, and NPU; the NPU improves latency and energy consumption significantly.]
that require CPU-NPU co-running, the inference speed is not even as good as running on the mobile GPU. Figure 3(c) further digs into the reason for this phenomenon: the NPU-incompatible DNNs need to be split into many sub-models to be scheduled between the CPU and NPU (e.g., a median number of 50); the resulting data movement and format exchange can therefore severely delay the inference.

2.3 Emergence of Foundation Models

Foundation models are renovating the AI ecosystem; the model design is converging. In recent years, significant advancements have been made in large-scale neural networks for language, image, and audio understanding and generation. GPT-3 [42] exemplifies this progress with impressive performance across various tasks, revolutionizing human-computer interaction and intelligent assistance. In the visual domain, Meta's SAM [75] demonstrates exceptional zero-shot proficiency. Additionally, models like Kosmos-1 [66] and PaLM-E [47] handle inputs from multiple modalities, enabling diverse task capabilities. These models share the transformer architecture [119], differing mainly in layer configurations or input modality processing. This convergence trend in AI model design is expected to continue in the future.

However, there has been no effort in building one model to fit highly diversified mobile AI tasks. None of the aforementioned foundation models is capable of (not even close to) solving all mobile AI tasks. A single-modality model (such as GPT for NLP) cannot comprehend or generate data in other modalities. Existing multimodal models (such as CLIP for CV-NLP) can only deal with very limited multimodal AI tasks. One might seek to include a foundation model for each <input: M1, output: M2> pair to solve the above issue, but: (1) it is not parameter-efficient, as the comprehension and conversion between different modality data share inherent common sense [80, 101]; (2) it cannot support AI tasks that take multimodal input or output, such as visual question answering [146]. There have been ad-hoc approaches to deal with those issues [111], yet we are not aware of any systematic strategy to build a one-size-fits-all foundation model for diversified mobile AI tasks.

3 M4 DESIGN AND PROTOTYPING

3.1 Overview

Design principles. M4 is a one-size-fits-all foundation model for diversified mobile AI tasks. It is designed with the following principles: (1) unified: instead of building independent foundation models for different possible modalities, M4 provides a unified architecture that maximizes capability sharing across different modalities, thus being more resource-efficient and extensible; (2) elastic: M4 can be easily scaled out to more modalities (either for input or output), e.g., for new types of sensor/app data; (3) multimodal: M4 can take multimodal input or generate multimodal output as needed, e.g., for advanced mobile applications like visual question answering or audio captioning.

Model architecture. Figure 1 illustrates the overall architecture of M4, which consists of three major components (a simplified end-to-end code sketch is given at the end of this subsection):
• Multimodal Embedding aligns the contents of different modalities by converting multimodal input data into a unified representation (i.e., a vector). It is typically implemented as a set of transformer encoders [57], one per modality, except that audio has two independent encoders to differentiate context information (e.g., background noise, speaker emotions) from spoken language (e.g., automatic speech recognition).
• Foundation Backbone (i.e., the pre-trained LLM backbone) comprehends and reasons about the input data. It encapsulates abundant knowledge to understand complex embedded data, performs task-specific inference, and generates easily intelligible output for the generator. It uses a decoder-based architecture trained on a huge amount of textual data, since language has been acknowledged as the most representative type of data [50, 92, 117]. The backbone is the heaviest part of M4.
• Multimodal Generator adapts the output of the foundation backbone to the task-specific data format. For classification tasks, it is simply an MLP with a softmax layer; for image tasks, it is a stable-diffusion model [104]; etc.

Trainable parameters. M4 contains three trainable parts to be fine-tuned for downstream mobile AI tasks: two PEFT modules inserted into the multimodal embedding and the foundation backbone, respectively, and one MLP projection layer that adapts the output of the multimodal embedding to the representation required by the foundation backbone. In later experiments, we use LoRA as the default PEFT method, but also report results for other PEFT methods. As will be demonstrated in §4, the trainable parameter size is trivial compared to the pre-trained part of M4 and is also much smaller than traditional state-of-the-art DNNs.
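To make the three-stage pipeline and its trainable parts concrete, the sketch below shows the N-1-M data flow in PyTorch-style code. The module names and dimensions are illustrative placeholders, not the exact classes or sizes used in M4; the point is that N modality encoders and M task decoders meet at one shared, frozen backbone, with only a small projector (and, not shown, per-task LoRA adapters) trained per task.

    import torch
    import torch.nn as nn

    # Illustrative sketch of M4's N-1-M data flow (not the actual M4 code).
    # N modality encoders -> projector -> shared LLM backbone -> M task decoders.
    D_ENC, D_LLM = 512, 1024          # embedding widths, placeholder values

    class Encoder(nn.Module):         # stands in for the IMG/TXT/IMU/AUD encoders
        def __init__(self, d_in):
            super().__init__()
            self.proj = nn.Linear(d_in, D_ENC)
        def forward(self, x):
            return self.proj(x)

    class Backbone(nn.Module):        # stands in for the pre-trained LLM backbone
        def __init__(self):
            super().__init__()
            layer = nn.TransformerEncoderLayer(D_LLM, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        def forward(self, h):
            return self.blocks(h)

    encoders = nn.ModuleDict({        # the "N" side
        "image": Encoder(768), "text": Encoder(300), "imu": Encoder(6),
    })
    projector = nn.Linear(D_ENC, D_LLM)   # small trainable projection layer
    backbone  = Backbone()                # frozen, shared across all tasks
    decoders  = nn.ModuleDict({           # the "M" side
        "cls": nn.Linear(D_LLM, 10),      # e.g., a classification head
        "tts": nn.Linear(D_LLM, 80),      # e.g., mel-spectrogram frames
    })
    for p in backbone.parameters():       # the backbone behaves like firmware
        p.requires_grad = False

    def run(modality, task, x):
        h = encoders[modality](x)     # 1) align the modality into a shared space
        h = projector(h)              # 2) adapt to the backbone's width
        h = backbone(h)               # 3) comprehend / reason (shared backbone)
        return decoders[task](h)      # 4) emit the task-specific output

    out = run("image", "cls", torch.randn(1, 16, 768))   # 16 toy image tokens
    print(out.shape)                                     # torch.Size([1, 16, 10])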
3.2 Prototyping with Off-the-Shelf LLMs

We have fully prototyped M4 with only pre-trained, off-the-shelf models publicly available from HuggingFace [126]. This guarantees the reproducibility of M4 and also demonstrates its compatibility with the existing LLM ecosystem.
• Multimodal Embedding. The multimodal embedding is composed of five parallel modules with a transformer encoder-only architecture: Image (IMG_enc), Text (TXT_enc), Inertial Measurement Unit (IMU_enc), Audio-Background (AUD-B_enc), and Audio-Intent (AUD-I_enc). The IMG_enc employs the Vision Transformer (ViT) architecture and is utilized
memory results are measured using TFLite's benchmark tools [116], while power consumption data is extracted from Android's virtual file system (e.g., /sys and /proc); a minimal sketch of this power readout is shown below.
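The snippet below is one simplified way to sample battery power through the standard Linux/Android power-supply sysfs nodes. Node names, units, and sign conventions vary across devices, so the paths and the µA/µV scaling here are assumptions rather than a universal recipe; energy in W·s is then the mean power multiplied by the elapsed time.

    import time

    # Minimal sketch of sampling battery power on Android via sysfs.
    # Assumed (device-dependent): current_now in microamps, voltage_now in microvolts.
    BATT = "/sys/class/power_supply/battery"

    def read_int(node):
        with open(f"{BATT}/{node}") as f:
            return int(f.read().strip())

    def sample_power_w():
        current_a = read_int("current_now") / 1e6    # uA -> A
        voltage_v = read_int("voltage_now") / 1e6    # uV -> V
        return abs(current_a * voltage_v)            # instantaneous power in W

    samples = []
    for _ in range(10):                              # ~1 s of sampling
        samples.append(sample_power_w())
        time.sleep(0.1)
    print(f"mean power: {sum(samples) / len(samples):.2f} W")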
4.2 Overall Accuracy

M4 can well support most mobile AI tasks and datasets. Figure 5 illustrates M4's overall performance improvement (or degradation) compared to TS-models. As observed, M4 can achieve comparable performance across 85% of tasks, with over 50% of these tasks showcasing considerable performance improvement. Note that the vertical axis, normalized in Table 3, reflects variations in accuracy across the 50 tasks. Our main focus is on evaluating whether M4 demonstrates superior or inferior accuracy compared to respective tasks,
Figure 5: Normalized accuracy comparison of M4 and TS-models on 50 popular mobile tasks and datasets.
Figure 11: M4's scalability analysis of storage and peak memory, measured on Jetson Orin. (a) Storage; (b) Peak memory.
Figure 12: M4's runtime cost of latency and energy, measured on Jetson Orin (GPU). (a) Average latency; (b) Average energy.
Figure 14: Simplicity analysis of M4's operators.

[Figure 15: screenshot of the multimodal chat demo built on M4 — a multi-turn conversation about a red panda image, mixing image, text, and voice input.]

Tasks | Path | Latency on CPU (s) | Latency on NPU* (s)
Image classification | Path-3 | IMG_enc: 2.10 | 0.11
Audio classification | Path-3 | AUD-I_enc: 0.28 | 0.014
Question answering | Path-2 | First token: 6.34; subsequent tokens: 0.24/token | 0.32; 0.012/token
Visual question answering | Path-1 | First token: 6.47; subsequent tokens: 0.25/token | 0.32; 0.013/token
Text-to-speech | Path-4 | TTS_dec: 0.82 | 0.041
Table 5: An in-depth what-if cost analysis of latency when running M4 on Pixel 7 Pro. NPU*: M4's estimated latency based on the NPU acceleration rate of the TS-model, if it could be deployed on the NPU.
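The NPU* column of Table 5 is a what-if estimate rather than a measurement. The sketch below shows the extrapolation: M4's measured CPU latency is divided by the NPU-over-CPU speedup observed for the task-specific model of the same task; the speedup values used here are illustrative assumptions in the roughly 19–22× range reported in §2.2.

    # What-if estimate behind Table 5's NPU* column (a sketch, not a measurement).
    # CPU latencies are from Table 5; the per-stage speedups are assumed values.
    measured_cpu_s     = {"IMG_enc": 2.10, "AUD-I_enc": 0.28, "first_token": 6.34}
    assumed_ts_speedup = {"IMG_enc": 19.0, "AUD-I_enc": 20.0, "first_token": 20.0}

    for stage, cpu_s in measured_cpu_s.items():
        est_npu_s = cpu_s / assumed_ts_speedup[stage]
        print(f"{stage:12s}: CPU {cpu_s:5.2f} s -> estimated NPU {est_npu_s:5.3f} s")
    # e.g., 2.10 s / 19x ~= 0.11 s, matching the IMG_enc row of Table 5.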
Most of the end-to-end latency lies in the time taken by the IMG-encoder and the generation of the first token by the backbone, which take approximately 2.1s and 6.3s, respectively. These components account for 31% and 93% of M4's average latency (6.8s in Figure 12(a)). However, the other components could exhibit near real-time inference (less than 100ms) if deployed on the NPU. These results demonstrate that, if M4 can be accelerated on the NPU, it could achieve on-par execution speed and energy consumption with TS-models. The aim of comparing M4's NPU performance with TS-models on the CPU is not to claim superiority, but to showcase how future NPU support can enhance efficiency for mobile foundation models.

4.6 Model Architecture Simplicity

M4's architectural design is much simpler and cleaner in terms of NN operators, and could therefore greatly simplify accelerator design. Figure 14(a) shows that the number of operators in the TS-models increases rapidly with the growth in the number of tasks. Notably, as the task spectrum broadens to encompass 50 tasks, the number of operator types culminates at 156. In contrast, M4 engages a mere 39 operator types, encompassing both the foundation model and the task-specific "adapters". Furthermore, Figure 14(b) undertakes a granular exploration of NPU-supported operators for both M4 and TS-models. It underscores that only 51 out of the 156 operators in TS-models are supported by the NPU, leaving more than 2/3 of the operators unable to fully run on the NPU; for M4, in contrast, NPU-supported operators account for 64%.
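The operator counts above can be reproduced over the exported ONNX graphs with a few lines of Python using the standard onnx package, as sketched below; the model file names are placeholders, and operators nested inside control-flow subgraphs are ignored for brevity.

    import onnx

    # Count distinct ONNX operator types used by a set of models (a sketch of the
    # analysis behind the 39-vs-156 comparison; file names are placeholders).
    def op_types(path):
        model = onnx.load(path)
        return {node.op_type for node in model.graph.node}

    m4_ops = op_types("m4.onnx")
    ts_ops = set()
    for path in ["resnet152.onnx", "bert_base.onnx"]:   # ... and the other TS models
        ts_ops |= op_types(path)

    print(f"M4 uses {len(m4_ops)} operator types")        # the paper reports 39
    print(f"TS models use {len(ts_ops)} operator types")  # 156 across all 50 models
    print(f"operator types shared by both: {len(m4_ops & ts_ops)}")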
4.7 Novel Application with M4

M4 enables complex mobile applications that were previously unattainable. Based on our proposed M4, we build a demo of a multimodal chat use case, as shown in Figure 15. Users engage in multi-turn chats with the M4 client using multimodal inputs such as images, text, and audio, thereby obtaining precise and tailored answers that meet their requirements. This multimodal computing capability is also crucial for the recently popular mobile agents [82]. We build this prototype system of M4 based on the architecture depicted in Figure 1. It first aligns the contents of image, text, and audio by converting the multimodal input data into a unified representation. Then, it encapsulates abundant knowledge to understand the complex embedded data, perform task-specific inference, and generate the required information. This innate capability for multimodal processing harbors the potential to significantly enrich the landscape of mobile applications.

5 RELATED WORK

Foundation models. Building one-for-all foundation models to serve generic AI tasks has been a primary research goal of the machine learning community. The recent advancements of LLMs [92, 117, 118, 143], multimodal alignment [57, 91, 111, 114], and parameter-efficient training methods [89, 136, 142] have shed light on this challenging goal. For instance, ImageBind [57] and CoDi [114] focus on how to align the embeddings of each modality, and PandaGPT [111] further attempts to compose semantics from different modalities naturally based on LLaMA [117]. However, there have been no efforts like M4 that try to fit extremely diversified AI
tasks into one model. Meanwhile, M4 leverages the most state-of-the-art pre-trained LLMs to reuse the wisdom as well as the investments of the ML community and industry. The concurrent work NExT-GPT [127] shares a similar architecture with M4. Nonetheless, M4 introduces two distinctive contributions: (1) it marks the inaugural proposal of a transformer-based N-1-M architecture, aiming to curtail resource costs in any-to-any modal generation; (2) its innovative multi-path execution design is tailored to enhance compatibility with highly diversified mobile AI tasks.

Hardware-system-algorithm co-design for mobile AI. AI workloads are highly compute-intensive and exhibit analogous patterns, and are therefore best accelerated by domain-specific accelerators (e.g., NPUs). For instance, SpAtten [122] and Sanger [88] focus on how efficient algorithm-architecture co-designs can reduce sparse attention computation and memory access. Besides, QNAS [83] and NAAS [84] focus on composing highly matched neural-hardware architectures by jointly searching for the neural network architecture, accelerator architecture, and compiler mapping. However, all prior literature makes trade-offs between the ubiquity of operator support and performance, instead of targeting a foundation model that can serve generic AI tasks itself. The vision of the mobile foundation model could open a new research domain for cross-layer co-design of mobile AI. There have been preliminary attempts [131, 132, 137] to alleviate the huge resource cost of large foundation models on devices. Those works are orthogonal to this work.

Managing AI as a mobile system service. AI has become a ubiquitous workload on mobile devices, and managing it at the system level (instead of in individual apps) could facilitate OS-wise runtime scheduling and software deployment. Some early studies [52, 123, 133, 144, 147] attempt to mitigate the severe fragmentation across different libraries in the mobile DL ecosystem. Google introduced the unified ML interface NNAPI [60] into Android in 2017 to bridge the gap between heterogeneous mobile processors and user-defined ML frameworks. Compared to the above work, M4 takes another giant step further: mobile devices shall manage a foundation model for each ML task and expose it as firmware.

6 LIMITATIONS AND FUTURE WORK

This study has several potential limitations. (1) eAIBench's results are evaluated on a datacenter GPU (NVIDIA A100) and an edge GPU (Jetson Orin), lacking assessment on mobile devices. This is mainly due to the highly diverse code implementations of the baseline models and the huge time span of evaluating M4 on large test datasets. There might exist performance gaps between different hardware architectures. Yet, the comparison is fair, as both the baseline models and M4 are evaluated on the same hardware. In fact, due to the simpler and cleaner architecture of M4, it would be much easier to design an accelerator to support M4 with high precision. (2) M4 underperforms baseline models on certain ML tasks. This unveils the limitation of existing pre-trained foundation models, e.g., on translation. On the one hand, we do not expect M4 to be able to solve all mobile AI tasks in the near future; it could co-exist with traditional DNNs that run on the mobile CPU. On the other hand, LLM capacity is still evolving fast: from LLaMA-1/2 used in this study to Mistral-7B [71], which ranks higher even than LLaMA-13B. Such continuous improvement endows our vision with much confidence.

To be noted, M4 is the very first step towards the vision of a mobile foundation model. We believe it could potentially revolutionize the mobile AI landscape and open a new research domain. However, to fully realize the vision, there are a few key designs to be explored. For instance: (1) Foundation model design: as a preliminary prototype, M4 is currently built atop off-the-shelf, separately pre-trained LLMs from the Internet instead of being tailored for mobile devices. Therefore, it is still highly inefficient in terms of accuracy and model parameter size. With enough resources (GPUs and data), hardware vendors can build a more compact mobile foundation model that is expected to deliver significantly higher accuracy with lower runtime cost than M4. (2) Accelerator design: fine-tuning for downstream tasks generates small "adapters" that are inserted into the mobile foundation model. The NPU should ideally have the flexibility to run those adapters as well; otherwise, the inference must involve CPU/GPU computation and data movement overhead. Fortunately, the adapters have a simple structure (e.g., linear matrix operations) and very few weights. (3) FM upgrading: the foundation model capacity could evolve with better architectures/weights, as shown in §4.2. Yet the adapters trained for the old foundation model cannot work with the new one. We therefore need a unified interface between LLMs and adapters to allow them to evolve independently without interfering with each other.
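One possible shape of such a unified interface — given purely as a speculative illustration, with no field or check below defined by M4 — is for each adapter to ship a small manifest declaring the backbone interface it was trained against, so the OS can detect (and reject or re-map) incompatible adapters when the foundation model is upgraded.

    from dataclasses import dataclass

    # Speculative sketch of a versioned adapter manifest (not part of M4).
    @dataclass
    class AdapterManifest:
        task: str                  # e.g., "ocr"
        backbone_api: str          # interface family, e.g., "m4-backbone/v1"
        hidden_dim: int            # embedding width the adapter was trained against
        insertion_points: tuple    # which backbone projections the adapter attaches to

    INSTALLED_BACKBONE = {"api": "m4-backbone/v1", "hidden_dim": 4096}

    def compatible(m):
        return (m.backbone_api == INSTALLED_BACKBONE["api"]
                and m.hidden_dim == INSTALLED_BACKBONE["hidden_dim"])

    ocr = AdapterManifest("ocr", "m4-backbone/v1", 4096, ("q_proj", "v_proj"))
    print("adapter accepted" if compatible(ocr) else "adapter needs re-tuning")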
7 CONCLUSIONS

We envision a mobile hardware-OS co-managed multimodal foundation model that is exposed to mobile applications as a system service to serve almost every on-device AI task. We design and prototype the first such model using off-the-shelf LLMs. Evaluated on a comprehensive benchmark consisting of 50 representative mobile AI tasks, M4 shows good accuracy, better scalability, and reduced runtime cost.

8 ACKNOWLEDGMENTS

The authors thank the anonymous reviewers and the shepherd for their insightful feedback. This work was supported by NSFC (61921003, 62102045).
video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11915–11925, 2021.
[87] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[88] Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 977–991, 2021.
[89] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[90] Harini Muthukrishnan, David Nellans, Daniel Lustig, Jeffrey A Fessler, and Thomas F Wenisch. Efficient multi-gpu shared memory via automatic optimization of fine-grained transfers. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 139–152. IEEE, 2021.
[91] Ivona Najdenkoska, Xiantong Zhen, and Marcel Worring. Meta learning to bridge vision and language models for multimodal few-shot learning. arXiv preprint arXiv:2302.14794, 2023.
[92] OpenAI. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf, 2023.
[93] Myle Ott, Sergey Edunov, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
[94] paperswithcode. Paper with code. https://paperswithcode.com/, 2023.
[95] Imagebind parameters. Imagebind parameters. https://github.com/facebookresearch/ImageBind, 2023.
[96] LLaMa parameters. Llama parameters. https://github.com/facebookresearch/llama, 2023.
[97] Seonghoon Park, Yeonwoo Cho, Hyungchol Jun, Jeho Lee, and Hojung Cha. Omnilive: Super-resolution enhanced 360° video live streaming for mobile devices. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, pages 261–274, 2023.
[98] Adam Paszke, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[99] Suchita Pati, Shaizeen Aga, Nuwan Jayasena, and Matthew D Sinclair. Demystifying bert: System design implications. In 2022 IEEE International Symposium on Workload Characterization (IISWC), pages 296–309. IEEE, 2022.
[100] Siddhant Prakash, Alireza Bahremand, Linda D Nguyen, and Robert LiKamWa. Gleam: An illumination estimation framework for real-time photorealistic augmented reality on mobile devices. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, pages 142–154, 2019.
[101] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[102] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
[103] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. Advances in neural information processing systems, 34:980–993, 2021.
[104] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[105] Teven Le Scao, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[106] Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. Nexus: A gpu cluster engine for accelerating dnn-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 322–337, 2019.
[107] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing cnn accelerator efficiency through resource partitioning. ACM SIGARCH Computer Architecture News, 45(2):535–547, 2017.
[108] Cong Shi, Xiangyu Xu, Tianfang Zhang, Payton Walker, Yi Wu, Jian Liu, Nitesh Saxena, Yingying Chen, and Jiadi Yu. Face-mic: inferring live speech and speaker identity via subtle facial dynamics captured by ar/vr motion sensors. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 478–490, 2021.
[109] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
[110] Lingyun Song, Jun Liu, Buyue Qian, and Yihe Chen. Connecting language to images: A progressive attention-guided network for simultaneous image captioning and language grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8885–8892, 2019.
[111] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
[112] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Semantic image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11120–11129, 2021.
[113] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
[114] Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846, 2023.
[115] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd international conference on pattern recognition (ICPR), pages 2464–2469. IEEE, 2016.
[116] TFLite tools. Tflite tools. https://www.tensorflow.org/lite/performance/measurement, 2023.
[117] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[118] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[119] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[120] Dan Wang, Haibo Lei, Haozhi Dong, Yunshu Wang, Yongpan Zou, and Kaishun Wu. What you wear know how you feel: An emotion inference system with multi-modal wearable devices. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–3, 2020.
[121] Hanrui Wang, Yongshan Ding, Jiaqi Gu, Yujun Lin, David Z Pan, Frederic T Chong, and Song Han. Quantumnas: Noise-adaptive search for robust quantum circuits. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 692–708. IEEE, 2022.
[122] Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
[123] Haoyu Wang, Hao Li, and Yao Guo. Understanding the evolution of mobile app ecosystems: A longitudinal measurement study of google play. In The World Wide Web Conference, pages 1988–1999, 2019.
[124] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
[125] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[126] Thomas Wolf, Lysandre Debut, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[127] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[128] Yusheng Xiang and Marcus Geimer. Optimization of operation strategy for primary torque based hydrostatic drivetrain using artificial intelligence. arXiv preprint arXiv:2003.10011, 2020.
[129] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
[130] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. A first look at deep learning apps on smartphones. In The World Wide Web Conference, pages 2125–2136, 2019.
[131] Mengwei Xu, Yaozong Wu, Dongqi Cai, Xiang Li, and Shangguang Wang. Federated fine-tuning of billion-sized language models across mobile devices. arXiv preprint arXiv:2308.13894, 2023.
[132] Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, et al. A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024.
[133] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xuanzhe Liu. Deepcache: Principled cache for mobile deep vision. In Proceedings of the 24th annual international conference on mobile computing and networking, pages 129–144, 2018.
[134] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[135] Zheng Yang, Lirong Jian, Chenshu Wu, and Yunhao Liu. Beyond triangle inequality: Sifting noisy and outlier distance measurements for localization. ACM Transactions on Sensor Networks (TOSN), 9(2):1–20, 2013.
[136] Zheng Yang, Yi Zhao, Fan Dang, Xiaowu He, Jiahang Wu, Hao Cao, Zeyu Wang, and Yunhao Liu. Caas: Enabling control-as-a-service for time-sensitive networking. In IEEE INFOCOM, 2023.
[137] Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint arXiv:2308.14352, 2023.
[138] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
[139] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
[140] Koh Jing Yu. Model zoo. https://modelzoo.co/, 2023.
[141] Mu Yuan, Lan Zhang, Fengxiang He, and Xiang-Yang Li. Infi: end-to-end learnable input filter for resource-efficient mobile-centric inference. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, pages 228–241, 2022.
[142] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[143] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[144] Qiyang Zhang, Xiang Li, Xiangying Che, Xiao Ma, Ao Zhou, Mengwei Xu, Shangguang Wang, Yun Ma, and Xuanzhe Liu. A comprehensive benchmark of deep learning libraries on mobile devices. In Proceedings of the ACM Web Conference, pages 3298–3307, 2022.
[145] Wuyang Zhang, Zhezhi He, Zhenhua Jia, Yunxin Liu, Marco Gruteser, Dipankar Raychaudhuri, and Yanyong Zhang. Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 201–214, 2021.
[146] Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023.
[147] Yi Zhao, Zheng Yang, Xiaowu He, Jiahang Wu, Hao Cao, Liang Dong, Fan Dang, and Yunhao Liu. E-tsn: Enabling event-triggered critical traffic in time-sensitive networking for industrial applications. In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pages 691–701. IEEE, 2022.
[148] Yiqin Zhao and Tian Guo. Xihe: a 3d vision-based lighting estimation framework for mobile augmented reality. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 28–40, 2021.
[149] Quan Zhou, Haiquan Wang, Xiaoyan Yu, Cheng Li, Youhui Bai, Feng Yan, and Yinlong Xu. Mpress: Democratizing billion-scale model training on multi-gpu servers via memory-saving inter-operator parallelism. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 556–569. IEEE, 2023.
[150] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330–18341, 2020.
[151] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In European Conference on Computer Vision, pages 350–368. Springer, 2022.
[152] Yongchao Zhou, Andrei Ioan Muresanu, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.