On-Device Personalization For Human Activity Recognition On STM32

Embedded Machine Learning - STM32


106 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 16, NO. 2, JUNE 2024

On-Device Personalization for Human Activity Recognition on STM32

Michele Craighero, Davide Quarantiello, Beatrice Rossi, Diego Carrera, Pasqualina Fragneto, and Giacomo Boracchi, Member, IEEE

Abstract: Human activity recognition (HAR) is one of the most interesting applications for machine learning models running on low-cost and low-power devices, such as microcontrollers (MCUs). As a matter of fact, MCUs are often dedicated to performing inference on their own acquired data, and any form of model training and update is delegated to external resources. We consider this mainstream paradigm a severe limitation, especially when privacy concerns prevent data sharing, and thus model personalization, which is universally recognized as beneficial in HAR. In this letter, we present our HAR solution where MCUs can directly fine-tune a deep learning model using locally acquired data. In particular, we enable training functionalities for 1-D convolutional neural networks (CNNs) on STM32 microcontrollers and provide a software tool to estimate the memory and computational resources required to accomplish model personalization.

Index Terms: Human activity recognition (HAR), microcontrollers, model personalization, on-device learning (ODL), STM32.

Manuscript received 1 April 2023; revised 23 May 2023; accepted 2 July 2023. Date of publication 11 July 2023; date of current version 30 May 2024. This manuscript was recommended for publication by G. Bhat. (Corresponding author: Michele Craighero.)
Michele Craighero and Giacomo Boracchi are with DEIB, Politecnico di Milano, 20133 Milan, Italy (e-mail: [email protected]).
Davide Quarantiello, Beatrice Rossi, Diego Carrera, and Pasqualina Fragneto are with the System Research and Applications, STMicroelectronics SRL, 20864 Agrate Brianza, Italy.
Digital Object Identifier 10.1109/LES.2023.3293458
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

Recently, we have witnessed a broad diffusion of Internet of Things (IoT) devices equipped with tiny microcontroller units (MCUs) and an increasing interest in tiny machine learning (TinyML [1]) research to leverage machine learning models on low-power devices. Human activity recognition (HAR) is among the most frequently addressed problems in the TinyML domain [2]. The mainstream paradigm in HAR consists in classifying [e.g., by a 1D-convolutional neural network (CNN)] segments of inertial measurement unit (IMU) recordings, which are easy to gather in wearable devices mounting accelerometers and gyroscopes. Model training is exclusively performed on a server using large amounts of annotated data and large computational resources. Once trained, the model is possibly optimized, e.g., by distillation, quantization, or pruning [2], and then deployed to the MCU, which is only in charge of inference. Not surprisingly, the vast majority of TinyML solutions support only inference on MCUs. Examples include TensorFlow Lite Micro [3] from Google and X-CUBE AI [4] from STMicroelectronics.

On-device learning (ODL) solutions for MCUs are scarce in the literature, and a few frameworks expose training functionalities only for dense layers [5]. This is somewhat in contrast with modern CNNs, which favor convolutional layers. Moreover, TinyTL [6] and Train++ [7] are purely algorithmic studies and do not provide any implementation to be used in HAR. The only framework that specifically addresses CNN training on MCUs is [8], which introduces an algorithm-system co-design that combines quantization and sparse-update techniques to enable the training of an image classifier on STM32 MCUs with minimal memory requirements (256 KB). However, none of these frameworks, including [8], were used to investigate model personalization in HAR by fine-tuning all the layers of a CNN directly on the MCU. We believe this is a relevant problem for HAR, where model personalization is often key to compensate for subjects' heterogeneity, which results in very different signals for the same activity. Moreover, performing model personalization directly on the device is key to prevent sharing of confidential data.

In this letter we perform, for the first time, model personalization for HAR directly on an STM32 microcontroller. To this purpose, we implemented both a software framework that enables fine-tuning all the layers of a 1-D CNN on the MCU, and a tool to estimate the memory footprint and the computational resources required for the personalization. Our framework can also be used for addressing other problems in HAR, namely: 1) enabling continuous learning to counteract concept drift and 2) enabling federated learning (FL) mechanisms [9] in a fully distributed manner.

Our experiments, performed on real-world HAR datasets, confirm that model personalization in HAR is very beneficial and that it is possible to retrain 1-D CNNs while satisfying the strict computational and memory constraints of the STM32L496ZG MCU. In addition, we demonstrate a tradeoff between model accuracy and computational requirements: performing a full retraining of the model is beneficial in terms of F1-score, but at the same time it has a greater impact on memory and energy consumption. This analysis provides insights on when to schedule model personalization on the device.

II. PROBLEM DEFINITION

We address HAR as a multiclass classification problem, where the input $s \in \mathbb{R}^{n \times z}$ is a segment of $z$ samples from $n$ time series acquired by inertial sensors, and the output $y$ is a label corresponding to a human activity. We assume a general training set $TR$ of data from different users is provided to train a classifier $C$, which associates each $s$ to a label $\hat{y} = C(s)$. State-of-the-art classifiers for HAR [2] are 1D-CNNs.
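To make the shape of a segment s from the problem definition concrete, here is a minimal C sketch of how a sample stream could be cut into n-by-z segments. It assumes n = 3 axes and a 20 Hz sampling rate (the values reported in Section IV-A), non-overlapping windows, and a sample-by-sample stream layout; the names `segment_length` and `extract_segment` are hypothetical and are not taken from the authors' framework.

```c
#include <assert.h>
#include <stddef.h>

#define N_AXES 3 /* n: one time series per accelerometer axis (assumed) */

/* Number of samples z in a segment lasting `seconds` at `rate_hz`. */
static size_t segment_length(size_t seconds, size_t rate_hz)
{
    return seconds * rate_hz;
}

/* Copy the idx-th non-overlapping segment from a stream stored sample by
 * sample (stream[t * N_AXES + axis]) into seg, laid out as an n-by-z
 * matrix (seg[axis * z + t]). stream_len is the number of samples per axis.
 * Returns 0 on success, -1 if the stream is too short for that segment. */
static int extract_segment(const float *stream, size_t stream_len,
                           size_t z, size_t idx, float *seg)
{
    assert(z > 0);
    size_t start = idx * z;
    if (start + z > stream_len)
        return -1;
    for (size_t t = 0; t < z; ++t)
        for (size_t axis = 0; axis < N_AXES; ++axis)
            seg[axis * z + t] = stream[(start + t) * N_AXES + axis];
    return 0;
}
```

Under these assumptions, a 1 s window gives z = 20 and a 5 s window gives z = 100, matching the smallest and largest input sizes used in the experiments.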
CRAIGHERO et al.: ON-DEVICE PERSONALIZATION FOR HAR ON STM32 107

The general classifier $C$ can poorly recognize activities from users not present in $TR$. Therefore, model personalization for a user $i$ consists in fine-tuning the parameters $\theta_0$ of the general classifier $C$ using a local training set $TR_i$ containing data from user $i$. In particular, our goal is to train a local classifier $C_i$ with parameters $\theta_i$ directly on an MCU, using $\theta_0$ as initialization and $TR_i$ as training set.

III. FRAMEWORK FOR HAR PERSONALIZATION

Fig. 1 illustrates our framework to personalize HAR models on STM32 MCUs. The framework is developed in the C programming language and is composed of three modules.

Fig. 1. Proposed framework for HAR personalization.

1) Network: Instantiates the local classifier $C_i$ from the architecture specifications and the initial parameters $\theta_0$, which can be imported from a pretrained HAR classifier $C$ or randomly set. The output of this module is the classifier itself, which can be fed into the evaluation or training modules.

2) Training: HAR personalization is performed by a few iterations of backpropagation. The goal of the backpropagation algorithm is to find parameters $\theta^*$ that minimize a loss function $L$ on the local training set $TR_i$ (more details are given in Section III-A). This module implements backpropagation via gradient descent through three submodules: the Orchestrator, which governs the iterative training procedure by alternately invoking the Forward and the Backward submodules. The Orchestrator allows specifying training hyperparameters, such as the number of epochs, the learning rate, and the batch size. The Forward module performs the forward pass of the backpropagation by implementing the forward expressions reported in Table I for the most popular layers of 1D-CNNs. During the forward pass, the values of the activations in the neurons of $C_i$ need to be stored. The Backward module performs the backward pass of the backpropagation, implementing the backward expressions reported in Table I.

3) Evaluation: This module performs inference and, during training, computes $L(\theta)$, namely, the value of the loss function corresponding to parameters $\theta$. We also use this module to assess the effectiveness of personalization, by comparing the accuracy of $C$ and $C_i$.

TABLE I
FORWARD AND BACKWARD EXPRESSIONS FOR LAYERS IMPLEMENTED IN THE TRAINING MODULE. N AND M ARE THE WIDTH OF THE PREVIOUS AND THE CURRENT LAYER, RESPECTIVELY, F IS THE NUMBER OF FILTERS OF THE CURRENT LAYER, K IS THE KERNEL SIZE, C IS THE NUMBER OF INPUT CHANNELS, AND p IS THE STRIDE OF THE AVGPOOL1D LAYER. THE BACKWARD PASSES OF WEIGHTS AND BIASES ARE COMPUTED ONLY FOR Dense AND Conv1D SINCE THE OTHER CONSIDERED LAYERS DO NOT TRAIN THESE PARAMETERS

A. Gradients Computation

We illustrate the implementation of backpropagation in the training module. As in the TensorFlow API, we decompose the backpropagation step (a forward pass followed by a backward pass) into a sequence of operations performed layer by layer and combined using the chain rule of derivatives. Each layer is associated with a layer function $f$ and parameters $\theta = [w, b]$. Starting from the layer's input $x$ and the parameters $\theta$, the function $f$ computes the value of the output activation $a$, which will become the input $x$ of the subsequent layer.

During the backward pass, starting from the last layer, we compute the gradient of the loss $L$ with respect to the network parameters $w$, $b$, and the so-called downstream error $\partial L/\partial x$ as

$$\frac{\partial L}{\partial w} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i} \frac{\partial f_i}{\partial w}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i} \frac{\partial f_i}{\partial b}, \qquad \frac{\partial L}{\partial x} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i} \frac{\partial f_i}{\partial x} \tag{1}$$

where $f_i$ and $a_i$ are, respectively, the $i$th component of the layer function $f$ and of the unrolled $a$ tensor, and $M$ is the number of units of the layer. The downstream error is then passed back to the previous layer as $\partial L/\partial a$. Table I reports the explicit expressions for the forward and backward pass of the most popular layers of a CNN, namely, Conv1D, Dense, AvgPool1D, GlobalAvgPool1D, and Flatten. By combining those expressions, it is possible to derive the overall forward pass and to compute the loss $L$ and its derivative.

As an example, we illustrate the computation of a Conv1D layer with $C$ channels, $N$ input units, $F$ filters, and kernel size $K$. The $(j, m)$th element of the output $a$ is defined as

$$a_{j,m} = \sum_{c=1}^{C} \sum_{k=1}^{K} x_{c,m+k-1} \, w_{j,c,k} + b_j \tag{2}$$

where $j \in \{1, \dots, F\}$, $m \in \{1, \dots, M\}$, and $M = N - K + 1$. Note that in (2) we treat convolution as cross-correlation as
in TensorFlow, a customary practice when filters are being learned. According to the chain rule, we have

$$\frac{\partial L}{\partial x} = \sum_{j=1}^{F} \sum_{m=1}^{M} \frac{\partial L}{\partial a_{jm}} \frac{\partial f_{jm}}{\partial x}. \tag{3}$$

In particular, since $\partial L/\partial a$ is passed from the next layer during backpropagation, we just have to compute the local gradients of the layer function with respect to the input

$$\frac{\partial L}{\partial x_{in}} = \sum_{j=1}^{F} \sum_{k=1}^{K} \frac{\partial L}{\partial a_{j,n-k+1}} \, w_{jik} \tag{4}$$

where $i$ indexes the input channel and $n$ the position along the input. In (4), the activation index $n - k + 1$ ranges from $2 - K$ to $N$, for a total of $N + K - 1$ terms for each filter $j$, whereas $\partial L/\partial a$ has size $F \times (N - K + 1)$. If $K > 1$, we apply zero padding to $\partial L/\partial a$ by adding $F \cdot (K - 1)$ zeros to both sides of $\partial L/\partial a$ along its second dimension. The index $k$ in (4) has opposite signs in the two terms of the convolution ($-k$ in $\partial L/\partial a$ and $+k$ in $w$), thus we obtain a flipped kernel. The final result is expressed as

$$\frac{\partial L}{\partial x} = \mathrm{conv}\left(\mathrm{pad}\left(\frac{\partial L}{\partial a}\right), \mathrm{flip}(w)\right) \tag{5}$$

where conv still represents cross-correlation.

Finally, to compute the gradient of a batch of input segments we resort to gradient accumulation. This keeps the memory footprint fixed during training regardless of the batch size.

B. Estimating Resources

Our framework is equipped with a tool that estimates: 1) the memory footprint and 2) the CPU load required to personalize a HAR classifier on an STM32 MCU. This tool is very valuable during prototyping to properly size embedded ML applications to the MCU capabilities.

Our tool computes the memory footprint as the amount of memory required by model personalization. Model training requires storing training samples, network parameters, activations, gradients, and errors computed at each layer during backpropagation. In particular, all the activations and the downstream errors $\partial L/\partial a$ of the layers that we want to train must be stored in memory to be used during the backward pass to compute gradients. Our tool can estimate a priori the memory usage of training by multiplying the total number of saved variables by their bit precision. The tool also estimates the CPU load by counting the total number of operations performed by the backpropagation algorithm in the forward and backward passes, as well as in the parameter update phase. The number and the type of operations performed depend on the type of layers and are derived from an analysis of the expressions in Table I.

IV. EXPERIMENTS AND RESULTS

Our experiments are meant to assess the benefits of model personalization in HAR. In particular, in the considered settings we have that: 1) training on user-specific data improves the accuracy of a pretrained model and 2) personalization of a pretrained model outperforms a classifier trained only on the target user. Finally, we show that enabling the full retraining of the classifier on the MCU is beneficial with respect to transfer learning (TL) in terms of accuracy, although it requires more computational resources.

Our target MCU is the STM32L496ZG, an ultralow-power MCU produced by STMicroelectronics with an ARM Cortex-M4 core and 320 KB of SRAM. We flash the firmware using the NUCLEO-L496ZG development board, which is equipped with this MCU.

A. Datasets and Architecture

The general dataset $TR$ we consider is the wireless data mining (WISDM) dataset [10], which is used to pretrain a general classifier $C$. The local training sets $TR_i$ come from the ST dataset, which was collected in STM facilities using a SensorTile.box [11]. Their characteristics are:
1) WISDM Dataset: 36 users, six activities (walking, jogging, ascending stairs, descending stairs, sitting, and standing), and sampling frequency 20 Hz;
2) ST Dataset: Three users, three activities (walking, ascending stairs, and descending stairs), and sampling frequency 27 Hz.

We notice that the activities in the ST dataset are a subset of those in WISDM. Both datasets have been collected using tri-axial accelerometers, but with different sensors, thus the ST dataset has been resampled to 20 Hz. We underline that, after the resampling, the two datasets have approximately the same number of samples for each user (around 30k). The classifier takes as input from 1 to 5 s of recording, which correspond to the following input sizes: (20, 3), (40, 3), (60, 3), (80, 3), and (100, 3), where 3 is the number of axes of the accelerometers.

We adopt a 1D-CNN made of four blocks: 1) Conv1D with F = 32 filters and kernel size K = 3 with ReLU and AvgPool1D; 2) Conv1D with F = 64 filters and kernel size K = 3 with ReLU, AvgPool1D, and GlobalAvgPool1D; 3) Dense with M = 50 units and ReLU; and 4) Dense with M = 6 units and Softmax. The total number of parameters $\theta$ to train is around 10000. We select the SGD optimizer with a learning rate of 0.01 and a batch size of 32. Facing a multiclass classification task, we adopt the Categorical Cross-Entropy as loss function.

B. Model Personalization

First, we show that in the considered settings, model personalization improves the performance of a pretrained model. The first experiment is entirely conducted on the WISDM dataset by a leave-one-subject-out (LOSO) approach. For each user $i = 1, \dots, 36$ we define a training set ($TR_i$) and a test set ($TS_i$), and pretrain a general classifier $C$ from the other 35 users. We personalize each local classifier $C_i$ by retraining all the layers (Full personalization) of $C$ using $TR_i$. As a comparison, we consider TL, which can be pursued by standard TinyML frameworks [5] and retrains only the last two dense layers of $C$. For each user $i$ we assess $C_i$ on $TS_i$, and we show in Table II the F1-score averaged over all the users. We also report No Pers., the performance of the general classifier $C$. Both personalization approaches improve the performance of $C$, and Full personalization always reaches the highest F1-score, for all the input sizes. This confirms that enabling the retraining of all the network layers is highly beneficial in the HAR scenario.

We also assess the benefits of model personalization over each user of the ST dataset for both TL and Full personalization. As a customary procedure for improving convergence in fine-tuning, we first retrain the last dense layer for two epochs while freezing all the others. We then retrain all the network's layers
(Full) or train only the last two dense layers (TL). Moreover, we train user-specific classifiers from the data of each user (denoted as No Pretrain) starting from a random initialization. Table II reports the F1-scores averaged over the three users of the ST dataset. We note that, while TL achieves lower or comparable performance with respect to the No Pretrain classifier, Full personalization always achieves the highest F1-score, independently of the chosen input size. This confirms that enabling the retraining of all the network's layers is highly beneficial even when personalization is performed on data from a different dataset, which is common in HAR scenarios.

TABLE II
F1 SCORE ON WISDM AND ST DATASETS

C. Computational Resources' Evaluation

Here, we assess the computational resources required to perform model personalization on the MCU. First, we use our tool to estimate the memory footprint of both TL and Full personalization. As reported in Table III, the input size has a great impact on the memory footprint: the smallest input size (20, 3) requires about half of the memory with respect to the largest size (100, 3). However, all the tested cases are within the memory limitations of our selected device, since they use less than 320 KB.

TABLE III
COMPUTATIONAL RESOURCES REQUIRED BY FULL AND TL PERSONALIZATION FOR DIFFERENT INPUT SIZES

Table III reports the time required to process a batch of 32 segments for each input size for both Full and TL. Also in this case we observe that increasing the input size results in a larger execution time. In particular, (100, 3) requires more than 5 times the time required by an input size of (20, 3). However, as shown in Table II, an input size of (20, 3) is enough for reaching a very high accuracy.

Finally, we use the X-NUCLEO-LPM01A power shield to measure the average power required to process a batch of 32 samples. The power shield is attached to the development board to measure the current absorbed during the execution of the training. Since the voltage provided is equal to 3.3 V, we can easily derive the power consumed. We note that the power is the same for both TL and Full personalization procedures for any input size. However, the energy (power × time) required to process a single batch is higher for Full personalization, since it scales linearly with the time.

This analysis is very valuable as it gives an estimate of the computational resources required during the training, making it possible to schedule model personalization depending on the power availability. For example, Full personalization could be performed only when the device battery is recharging, while TL could be run when the device relies on its own battery. Finally, our framework can be used to adjust the input size and select the number of training epochs for the chosen MCU, to reach the best tradeoff between the accuracy of the model and the usage of resources.

V. CONCLUSION

We present a HAR solution to fine-tune and personalize a deep learning model directly on an STM32 MCU using locally acquired data. In particular, we develop a framework to retrain 1-D CNNs while satisfying the strict computational and memory constraints of the STM32L496ZG MCU. Our experiments show that the full personalization of the CNN achieves a better accuracy than TL, which is what existing frameworks allow, although it requires more energy.

Future work concerns extending our framework to support more layers and different optimization strategies. We will also adapt the framework to be used on microprocessors like those of the MP1 series from STMicroelectronics, which can be equipped with a small GPU.

ACKNOWLEDGMENT

The authors would like to thank Filippo Augusti and Marco Longoni for their invaluable support in the setup of the X-NUCLEO-LPM01A for the power consumption measurements.

REFERENCES

[1] “TinyML.” Accessed: Mar. 10, 2023. [Online]. Available: https://www.tinyml.org
[2] M.-K. Yi and S. O. Hwang, “Smartphone based human activity recognition using 1d lightweight convolutional neural network,” in Proc. Int. Conf. Electron. Inf. Commun., 2022, pp. 1–3.
[3] R. David et al., “TensorFlow lite micro: Embedded machine learning on TinyML systems,” 2020, arXiv:2010.08678.
[4] “Artificial intelligence (AI) software expansion for STM32Cube.” STMicroelectronics. Feb. 2023. [Online]. Available: https://www.st.com/en/embedded-software/x-cube-ai.html
[5] K. Kopparapu, E. Lin, J. G. Breslin, and B. Sudharsan, “TinyFedTL: Federated transfer learning on ubiquitous Tiny IoT devices,” in Proc. IEEE Int. Conf. Pervasive Comput. Commun. Workshops Other Affiliated Events, 2022, pp. 79–81.
[6] H. Cai, C. Gan, L. Zhu, and S. Han, “TinyTL: Reduce memory, not parameters for efficient on-device learning,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 11285–11297.
[7] B. Sudharsan, P. Yadav, J. G. Breslin, and M. I. Ali, “Train++: An incremental ML model training algorithm to create self-learning IoT devices,” in Proc. 18th IEEE Int. Conf. UIC, 2021, pp. 97–106.
[8] J. Lin, L. Zhu, W.-M. Chen, W.-C. Wang, C. Gan, and S. Han, “On-device training under 256kb memory,” 2022, arXiv:2206.15472.
[9] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Stat., 2017, pp. 1273–1282.
[10] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recognition using cell phone accelerometers,” in Proc. 4th Int. Workshop Knowl. Discov. Sens. Data, 2010, pp. 10–18.
[11] “Human activity recognition using CNN in Keras for sensortile.” Accessed: Feb. 1, 2023. [Online]. Available: https://github.com/ausilianapoli/HAR-CNN-Keras-STM32/blob/master/Dataset.csv
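
As a numerical companion to the Conv1D expressions (2) and (4) in Section III-A and to the gradient accumulation described there, the following C sketch may help. It is an illustration under stated assumptions, not the authors' implementation: tensors are assumed to be flat, row-major float arrays, indices are 0-based, and the function names are hypothetical. The weight and bias gradients are derived here by applying (1) to (2).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layouts: x[c*N + n], w[(j*C + c)*K + k], a[j*M + m], M = N-K+1. */

/* Forward pass, eq. (2): x_{c,m+k-1} becomes x[c][m+k] with 0-based indices. */
static void conv1d_forward(const float *x, const float *w, const float *b,
                           float *a, size_t C, size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    size_t M = N - K + 1;
    for (size_t j = 0; j < F; ++j)
        for (size_t m = 0; m < M; ++m) {
            float acc = b[j];
            for (size_t c = 0; c < C; ++c)
                for (size_t k = 0; k < K; ++k)
                    acc += x[c * N + m + k] * w[(j * C + c) * K + k];
            a[j * M + m] = acc;
        }
}

/* Downstream error, eq. (4): dx[i][n] = sum_j sum_k da[j][n-k] * w[j][i][k].
 * Terms whose index n-k falls outside [0, M) are skipped, which has the
 * same effect as zero-padding da with K-1 entries on each side. */
static void conv1d_backward_input(const float *da, const float *w, float *dx,
                                  size_t C, size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    size_t M = N - K + 1;
    for (size_t i = 0; i < C; ++i)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t j = 0; j < F; ++j)
                for (size_t k = 0; k < K; ++k) {
                    if (n < k || n - k >= M)
                        continue; /* position covered by the zero padding */
                    acc += da[j * M + (n - k)] * w[(j * C + i) * K + k];
                }
            dx[i * N + n] = acc;
        }
}

/* Parameter gradients for one segment, added into dw and db so that a
 * batch can be processed one segment at a time (gradient accumulation).
 * The caller is expected to zero dw and db before each batch. */
static void conv1d_accumulate_param_grads(const float *da, const float *x,
                                          float *dw, float *db, size_t C,
                                          size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    size_t M = N - K + 1;
    for (size_t j = 0; j < F; ++j)
        for (size_t m = 0; m < M; ++m) {
            float g = da[j * M + m];
            db[j] += g;
            for (size_t c = 0; c < C; ++c)
                for (size_t k = 0; k < K; ++k)
                    dw[(j * C + c) * K + k] += g * x[c * N + m + k];
        }
}
```

Because the accumulators `dw` and `db` have the size of the parameters and not of the batch, the working memory in this sketch does not grow with the batch size, which is the property the text attributes to gradient accumulation.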

Open Access funding provided by ’Politecnico di Milano’ within the CRUI CARE Agreement
