On-Device Personalization For Human Activity Recognition On STM32
Abstract—Human activity recognition (HAR) is one of the most interesting applications for machine learning models running on low-cost and low-power devices, such as microcontrollers (MCUs). As a matter of fact, MCUs are often dedicated to performing inference on their own acquired data, and any form of model training and update is delegated to external resources. We consider this mainstream paradigm a severe limitation, especially when privacy concerns prevent data sharing and, thus, model personalization, which is universally recognized as beneficial in HAR. In this letter, we present our HAR solution, where MCUs can directly fine-tune a deep learning model using locally acquired data. In particular, we enable training functionalities for 1-D convolutional neural networks (CNNs) on STM32 microcontrollers and provide a software tool to estimate the memory and computational resources required to accomplish model personalization.

Index Terms—Human activity recognition (HAR), microcontrollers, model personalization, on-device learning (ODL), STM32.

I. INTRODUCTION

Recently, we have witnessed a broad diffusion of Internet of Things (IoT) devices equipped with tiny microcontroller units (MCUs) and an increasing interest in tiny machine learning (TinyML [1]) research to leverage machine learning models on low-power devices. Human activity recognition (HAR) is among the most frequently addressed problems in the TinyML domain [2]. The mainstream paradigm in HAR consists in classifying [e.g., by a 1-D convolutional neural network (CNN)] segments of inertial measurement unit (IMU) recordings, which are easy to gather in wearable devices mounting accelerometers and gyroscopes. Model training is exclusively performed on a server using many annotated data and large computational resources. Once trained, the model is possibly optimized, e.g., by distillation, quantization, or pruning [2], and then deployed to the MCU, which is only in charge of inference. Not surprisingly, the vast majority of TinyML solutions support only inference on MCUs. Examples include TensorFlow Lite Micro [3] from Google and X-CUBE-AI [4] from STMicroelectronics.

On-device learning (ODL) solutions for MCUs are scarce in the literature, and the few existing frameworks expose training functionalities only for dense layers [5]. This is somehow in contrast with modern CNNs, which favor convolutional ones. Moreover, TinyTL [6] and Train++ [7] are purely algorithmic studies and do not provide any implementation to be used in HAR. The only framework that specifically addresses CNN training on MCUs is [8], which introduces an algorithm-system co-design combining quantization and sparse updating techniques to enable the training of an image classifier on STM32 MCUs with minimal memory requirements (256 KB). However, none of these frameworks, including [8], has been used to investigate model personalization in HAR by fine-tuning all the layers of a CNN directly on the MCU. We believe this is a relevant problem for HAR, where model personalization is often key to compensate for the subjects' heterogeneity, which results in very different signals for the same activity. Moreover, performing model personalization directly on the device is key to prevent the sharing of confidential data.

In this letter, we perform for the first time model personalization for HAR directly on an STM32 microcontroller. To this purpose, we implemented both a software framework that enables fine-tuning all the layers of a 1-D CNN on the MCU and a tool to estimate the memory footprint and the computational resources required for the personalization. Our framework can also be used to address other problems in HAR, namely: 1) enabling continuous learning to counteract concept drift and 2) enabling federated learning (FL) mechanisms [9] in a fully distributed manner.

Our experiments, performed on real-world HAR datasets, confirm that model personalization in HAR is very beneficial and that it is possible to retrain 1-D CNNs satisfying the strict computational and memory constraints of the STM32L496ZG MCU. In addition, we demonstrate a tradeoff between model accuracy and computational requirements: a full retraining of the model is beneficial in terms of F1-score, but it has a greater impact on memory and energy consumption. This analysis provides insights on when to schedule model personalization on the device.
II. PROBLEM DEFINITION

We address HAR as a multiclass classification problem, where the input $s \in \mathbb{R}^{n \times z}$ is a segment of z samples from n time series acquired by inertial sensors, and the output y is a label corresponding to a human activity. We assume a general training set TR of data from different users is provided to train a classifier C, which associates each s to a label ŷ = C(s). State-of-the-art classifiers for HAR [2] are 1-D CNNs.
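As a concrete illustration of this input format, the following Python sketch segments a raw tri-axial accelerometer stream into fixed-length windows. It is our own illustration, not code from the letter; the window length and overlap are example values (a 2 s window at 20 Hz matches the (40, 3) input size discussed later).

```python
import numpy as np

def segment(recording, z, step):
    """Split a (T, n) inertial recording into windows of z samples.

    recording: array of shape (T, n), with n time series (e.g., 3 accelerometer axes).
    Returns an array of shape (num_windows, z, n), the (samples, channels) layout
    expected by a 1-D CNN.
    """
    T = recording.shape[0]
    starts = range(0, T - z + 1, step)
    return np.stack([recording[s:s + z] for s in starts])

# Example: 20 Hz tri-axial stream, 2 s windows with 50% overlap (illustrative values).
stream = np.random.randn(600, 3)           # 30 s of synthetic data
segments = segment(stream, z=40, step=20)  # shape (29, 40, 3)
```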
TABLE I
Forward and backward expressions for layers implemented in the training module. N and M are the width of the previous and the current layer, respectively, F is the number of filters of the current layer, K is the kernel size, C is the number of input channels, and p is the stride of the AvgPool1D layer. The backward passes of weights and biases are computed only for Dense and Conv1D, since the other considered layers do not train these parameters.
in TensorFlow, a customary practice when filters are being learned. According to the chain rule, we have

$$\frac{\partial L}{\partial x} = \sum_{j=1}^{F} \sum_{m=1}^{M} \frac{\partial L}{\partial a_{jm}} \, \frac{\partial f_{jm}}{\partial x}. \qquad (3)$$

In particular, since (∂L/∂a) is passed from the next layer during backpropagation, we just have to compute the local gradients of the layer function with respect to the input

$$\frac{\partial L}{\partial x_{in}} = \sum_{j=1}^{F} \sum_{k=1}^{K} \frac{\partial L}{\partial a_{j,\,n-k+1}} \, w_{jik}. \qquad (4)$$

In (4), the activation index n − k + 1 ranges from 2 − K to N, for a total of N + K − 1 terms for each channel j, whereas (∂L/∂a) has size F × (N − K + 1). If K > 1, we apply a zero padding to (∂L/∂a) by adding F · (K − 1) zeros to both sides of (∂L/∂a) along its second dimension. The index k in (4) has opposite signs in the two terms of the convolution (−k in (∂L/∂a) and +k in w), thus we obtain a flipped kernel. The final result is expressed as

$$\frac{\partial L}{\partial x} = \mathrm{conv}\!\left(\mathrm{pad}\!\left(\frac{\partial L}{\partial a}\right), \, \mathrm{flip}(w)\right) \qquad (5)$$

where conv still represents cross-correlation.
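To make (4) and (5) concrete, the following NumPy sketch computes the input gradient of a 1-D convolution by cross-correlating the zero-padded upstream error with the flipped kernel and checks it against the explicit double sum in (4). This is our own illustration of the technique, not the letter's STM32 firmware code.

```python
import numpy as np

def conv1d_input_grad(dL_da, w):
    """Input gradient as in (5): cross-correlate the zero-padded upstream error
    dL_da (shape (F, N-K+1)) with the kernel w (shape (F, C, K)) flipped along K."""
    F, C, K = w.shape
    padded = np.pad(dL_da, ((0, 0), (K - 1, K - 1)))  # add K-1 zeros on both sides
    flipped = w[:, :, ::-1]
    N = padded.shape[1] - K + 1                       # recovers the input width
    dL_dx = np.zeros((C, N))
    for i in range(C):
        for n in range(N):
            dL_dx[i, n] = np.sum(padded[:, n:n + K] * flipped[:, i, :])
    return dL_dx

# Numerical check against the explicit sum in (4), written with 0-based indices.
rng = np.random.default_rng(0)
x, w = rng.standard_normal((3, 20)), rng.standard_normal((4, 3, 5))
dL_da = rng.standard_normal((4, 20 - 5 + 1))
ref = np.zeros_like(x)
for i in range(3):
    for n in range(20):
        for j in range(4):
            for k in range(5):
                if 0 <= n - k < dL_da.shape[1]:
                    ref[i, n] += dL_da[j, n - k] * w[j, i, k]
assert np.allclose(conv1d_input_grad(dL_da, w), ref)
```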
Finally, to compute the gradient of a batch of input segments, we resort to gradient accumulation, which keeps the memory footprint fixed during training regardless of the batch size.
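A minimal sketch of this accumulation scheme is shown below, in plain Python/NumPy. It is our illustration of the idea (the actual firmware operates on the C arrays of the deployed network): segments are backpropagated one at a time and a single SGD step is applied per batch, so only one set of per-sample gradients plus the running sums is ever resident in memory.

```python
import numpy as np

def sgd_step_with_accumulation(params, batch, grad_fn, lr=0.01):
    """Accumulate per-sample gradients and apply one SGD update per batch.

    params:  list of weight arrays.
    batch:   list of (segment, label) pairs.
    grad_fn: returns the gradients of the loss w.r.t. params for one sample.
    Memory is constant in the batch size: only the running sums are kept.
    """
    acc = [np.zeros_like(p) for p in params]
    for segment, label in batch:
        grads = grad_fn(params, segment, label)  # backprop on a single segment
        for a, g in zip(acc, grads):
            a += g
    for p, a in zip(params, acc):
        p -= lr * a / len(batch)                 # average gradient, one update per batch
    return params
```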
B. Estimating Resources

Our framework is equipped with a tool that estimates: 1) the memory footprint and 2) the CPU load required to personalize a HAR classifier on an STM32 MCU. This tool is very valuable during prototyping to properly size embedded ML applications to the MCU capabilities.

Our tool computes the memory footprint as the amount of memory required by model personalization. Model training requires storing training samples, network parameters, activations, gradients, and errors computed at each layer during backpropagation. In particular, all the activations and the downstream errors (∂L/∂a) of the layers that we want to train must be stored in memory to be used during the backward pass to compute gradients. Our tool can estimate a priori the memory usage of training by multiplying the total amount of saved variables by their bit precision. The tool also estimates the CPU load by counting the total number of operations performed by the backpropagation algorithm in the forward and backward passes, as well as in the parameters' update phase. The number and the type of operations performed depend on the type of layers and are derived from an analysis of the expressions in Table I.
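The estimate amounts to a per-layer bookkeeping pass. The sketch below is a rough simplification of this idea, entirely our own: the 32-bit precision, the layer widths after pooling, and the operation-count formulas are assumptions for illustration, not the tool's exact accounting.

```python
BYTES = 4  # assuming 32-bit floats

def conv1d_cost(N, C, F, K):
    """Rough RAM (bytes) and MAC count for one trainable Conv1D layer."""
    M = N - K + 1                    # output width (valid convolution, stride 1)
    params = F * C * K + F           # weights + biases
    buffers = M * F + M * F + params # activations + downstream errors + param gradients
    macs_fwd = M * F * C * K
    macs_bwd = 2 * macs_fwd + params # input/weight gradients + parameter update
    return (buffers + params) * BYTES, macs_fwd + macs_bwd

def dense_cost(N, M):
    """Rough RAM (bytes) and MAC count for one trainable Dense layer."""
    params = N * M + M
    buffers = M + M + params
    macs = 3 * N * M + params
    return (buffers + params) * BYTES, macs

# Example: the two Conv1D and two Dense layers of Section IV-A for a (40, 3) input
# (the intermediate widths 19 and 64 are approximate, assuming pooling of size 2).
layer_costs = [conv1d_cost(40, 3, 32, 3), conv1d_cost(19, 32, 64, 3),
               dense_cost(64, 50), dense_cost(50, 6)]
ram = sum(m for m, _ in layer_costs)
macs = sum(c for _, c in layer_costs)
print(f"~{ram / 1024:.1f} KB of training buffers, ~{macs} MACs per segment")
```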
IV. EXPERIMENTS AND RESULTS

Our experiments are meant to assess the benefits of model personalization in HAR. In particular, in the considered settings we have that: 1) training on user-specific data improves the accuracy of a pretrained model and 2) personalization of a pretrained model outperforms a classifier trained only on the target user. Finally, we show that enabling the full retraining of the classifier on the MCU is beneficial with respect to transfer learning (TL) in terms of accuracy, although it requires more computational resources.

Our target MCU is the STM32L496ZG, an ultralow-power MCU produced by STMicroelectronics with an ARM Cortex-M4 core and 320 KB of SRAM. We flash the firmware using the NUCLEO-L496ZG development board, which is equipped with such an MCU.

A. Datasets and Architecture

The general dataset TR we consider is the wireless sensor data mining (WISDM) dataset [10], which is used to pretrain a general classifier C. The local training sets TRi are from the ST dataset, which was collected in STM facilities using a SensorTile.box [11]. Their characteristics are:
1) WISDM Dataset: 36 users, six activities (walking, jogging, ascending stairs, descending stairs, sitting, and standing), and sampling frequency 20 Hz;
2) ST Dataset: three users, three activities (walking, ascending stairs, and descending stairs), and sampling frequency 27 Hz.
We notice that the activities in the ST dataset are a subset of those in WISDM. Both datasets have been collected using tri-axial accelerometers, but with different sensors; thus, the ST dataset has been resampled to 20 Hz. We underline that, after the resampling, the two datasets have approximately the same number of samples for each user (around 30k). The classifier takes as input from 1 to 5 s of recording, which correspond to the following input sizes: (20, 3), (40, 3), (60, 3), (80, 3), and (100, 3), where 3 is the number of axes of the accelerometer.
We adopt a 1-D CNN made of four blocks: 1) Conv1D with F = 32 filters and kernel size K = 3, with ReLU and AvgPool1D; 2) Conv1D with F = 64 filters and kernel size K = 3, with ReLU, AvgPool1D, and GlobalAvgPool1D; 3) Dense with M = 50 units and ReLU; and 4) Dense with M = 6 units and Softmax. The total number of parameters θ to train is around 10000. We select the SGD optimizer with a learning rate of 0.01 and a batch size of 32. Facing a multiclass classification task, we adopt the categorical cross-entropy as loss function.
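For reference, an equivalent model can be expressed in a few lines of Keras. The snippet below is our own sketch of the described architecture, not the authors' released code; the pooling size of 2 is an assumption, chosen so that the trainable parameter count lands near the reported 10000.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_har_cnn(input_size=40, channels=3, classes=6):
    """1-D CNN with two Conv1D blocks and two Dense layers, as in Section IV-A."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_size, channels)),
        layers.Conv1D(32, 3, activation="relu"),
        layers.AveragePooling1D(2),
        layers.Conv1D(64, 3, activation="relu"),
        layers.AveragePooling1D(2),
        layers.GlobalAveragePooling1D(),
        layers.Dense(50, activation="relu"),
        layers.Dense(classes, activation="softmax"),
    ])

model = build_har_cnn()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # roughly 10k trainable parameters
```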
B. Model Personalization

First, we show that, in the considered settings, model personalization improves the performance of a pretrained model. The first experiment is entirely conducted on the WISDM dataset with a leave-one-subject-out (LOSO) approach. For each user i = 1, . . . , 36, we define a training (TRi) and a test (TSi) set and pretrain a general classifier C on the other 35 users. We personalize each local classifier Ci by retraining all the layers of C (Full personalization) using TRi. As a comparison, we consider TL, which can be pursued by standard TinyML frameworks [5] and retrains only the last two dense layers of C. For each user i, we assess Ci on TSi, and we show in Table II the F1-score averaged over all the users. We also consider No Pers. as the performance of the classifier C. Both personalization approaches improve the performance of C, and Full personalization always reaches the highest F1-score, for all the input sizes. This confirms that enabling the retraining of all the network layers is highly beneficial in the HAR scenario.
We also assess the benefits of model personalization over each user of the ST dataset, for both TL and Full personalization. As a customary procedure for improving convergence in fine-tuning, we first retrain for two epochs the last dense layer, freezing all the others. We then retrain all the network's layers (Full) or the last two dense layers only (TL). Moreover, we train user-specific classifiers from the data of each user (denoted as No Pretrain), starting from a random initialization. Table II reports the F1-scores averaged on the three users of the ST dataset. We note that, while TL achieves lower or comparable performance w.r.t. the No Pretrain classifier, Full personalization always achieves the highest F1-score, independently from the chosen input size. This confirms that enabling the retraining of all the network's layers is highly beneficial even when personalization is performed on data from a different dataset, which is common in HAR scenarios.
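The two-stage fine-tuning schedule above is straightforward to express in Keras. The snippet below is our own sketch of the procedure (reusing a model such as the one built earlier), not the on-device implementation; the number of second-stage epochs is not specified in the letter and is a placeholder here.

```python
import tensorflow as tf

def personalize(model, tr_x, tr_y, full=True, epochs_stage2=10):
    """Two-stage fine-tuning: warm up the last Dense layer, then unfreeze."""
    # Stage 1: train only the last dense layer for two epochs, freezing the others.
    for layer in model.layers[:-1]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.SGD(0.01),
                  loss="categorical_crossentropy")
    model.fit(tr_x, tr_y, batch_size=32, epochs=2, verbose=0)

    # Stage 2: Full personalization retrains every layer; TL only the two Dense layers.
    for layer in (model.layers if full else model.layers[-2:]):
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.SGD(0.01),
                  loss="categorical_crossentropy")
    model.fit(tr_x, tr_y, batch_size=32, epochs=epochs_stage2, verbose=0)
    return model
```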
TABLE II
F1-score on the WISDM and ST datasets.

TABLE III
Computational resources required by Full and TL personalization for different input sizes.

This analysis is very valuable as it gives an estimate of the computational resources required during training, allowing to schedule model personalization depending on the power availability. For example, Full personalization could be performed only when the device battery is recharging, while TL could be run when the device relies on its own battery. Finally, our framework can be used to adjust the input size and select the number of training epochs for the chosen MCU, to reach the best tradeoff between the accuracy of the model and the usage of resources.
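Such a power-aware schedule could be as simple as the following sketch, which is entirely our illustration; the letter does not prescribe a specific scheduling policy or API, and the battery-level threshold is an arbitrary example.

```python
def choose_personalization(battery_charging: bool, battery_level: float) -> str:
    """Pick the update strategy from the device power state (illustrative policy)."""
    if battery_charging:
        return "full"  # full retraining: best F1-score, highest energy cost
    if battery_level > 0.5:
        return "tl"    # cheaper transfer learning on the last two dense layers
    return "skip"      # defer personalization until more energy is available
```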
V. CONCLUSION

We present a HAR solution to fine-tune and personalize a deep learning model directly on an STM32 MCU using locally acquired data. In particular, we develop a framework to retrain 1-D CNNs satisfying the strict computational and memory constraints of the STM32L496ZG MCU. Our experiments show that the full personalization of the CNN achieves a better accuracy than TL, which is what existing frameworks allow, although it requires more energy.

Future work concerns extending our framework to support more layers and different optimization strategies. We will also adapt the framework to be used on microprocessors, like those of the MP1 series from STMicroelectronics, which can be equipped with a small GPU.

ACKNOWLEDGMENT

The authors would like to thank Filippo Augusti and Marco Longoni for their invaluable support in the setup of the X-NUCLEO-LPM01A for the power consumption measurements.