Abstract. The problem of classifying body gesture and motion, and of predicting states of action or behaviour during physical activity, is referred to as Human Activity Recognition (HAR). Inertial Measurement Units (IMUs) prevail as the key technique for measuring range of motion, speed, velocity and magnetic field orientation during these physical activities. On-body inertial sensors generate body motion and vital signs recordings from which models can successfully learn to classify physical activities accurately. In this paper, we compare Extreme Gradient Boosting (XGBoost), a Multilayer Perceptron (MLP), a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), a CNN + LSTM hybrid (ConvLSTM) and an Autoencoder followed by a Random Forest (AE w/ RF) for classifying human activities on the MHEALTH dataset. All six of our classification models use raw, unstructured data obtained from four inertial on-body sensors. We examine multiple physical activities and on-body inertial sensors, showing how body motion and vital signs recordings can be transformed for input to machine learning models with diverse network architectures. We also compare the performance of the machine learning models to determine which model best suits multisensory fusion analysis. Our experimental results on the MHEALTH dataset, which consists of 12 physical activities collected from 10 subjects using four different inertial sensors, are highly encouraging and consistently outperform existing baseline models. MLP and XGBoost attain the highest performance measures, with accuracy (90.55%, 89.97%), precision (91.66%, 90.09%), recall (90.55%, 89.97%) and F1 score (90.7%, 89.78%) respectively.
Keywords: human activity recognition, deep learning, classification,
extreme gradient boosting, neural networks
1 Introduction
1.1 Motivation
2 Related Work
3 Experiments
The purpose of this paper is to compare the performance of deep learning algorithms on the MHEALTH dataset. We aim to identify the deep learning algorithm best suited to the MHEALTH dataset, using on-body inertial sensor data, with respect to the activity classification task.
3.1 Dataset
We analyse a dataset collected by Oresti Banos, Rafael Garcia and Alejandro Saez that is freely available from the UCI Machine Learning Repository [6]. The MHEALTH dataset consists of body motion and vital signs recordings from ten subjects, each with different characteristics [6][7]. Each subject's task was to perform 12 different types of activity. An accelerometer, gyroscope and magnetometer placed on the subject's body measure acceleration, rate of turn and magnetic field orientation, capturing the range of motion experienced by each subject's body parts. The collected dataset comprises body motion and vital signs recordings of the ten subjects performing the physical activities stated above. Shimmer2 wearable sensors were used for the recordings, attached by elastic straps to the subjects' chest, right wrist and left ankle.
Figure 1 Three subjects performing three different activities: 'lying down', 'cycling' and 'waist bends forward'. The Shimmer2 wearable sensors, attached by elastic straps, are clearly visible on the subjects' chest, right wrist and left ankle.
3.2 Approach
Regarding input adaptation, the streaming signals were fed into the neural networks using a model-driven approach. The methodology consisted of seven steps: data preparation, feature extraction, one-hot encoding, training/testing split, hyperparameter setting, model compilation and model evaluation. The MHEALTH dataset is static: it does not change after being recorded and is essentially a fixed dataset. Step 1, data preparation, covers the work of the steps that follow: extracting features, encoding labels into one-hot form, converting the raw data into the right shape for input to the model, normalising the data and finally splitting the data into training and testing sets. Step 2 is feature extraction. The MHEALTH dataset consists of 10 log files, one per subject; a feature extraction method is used to extract the features (signal attributes) and labels (activities) from each subject's log file. Step 3 encodes the labels into one-hot form. Step 4 splits the data into training and testing sets in the ratio 80:20. In step 5, hyperparameters such as batch size, number of epochs, learning rate, number of hidden layers, type of hidden layers, shape of input, shape of output and number of parameters are set. Step 6 compiles the model, ensuring it is ready to be fitted; each model must be structured into organised layers, and once the hyperparameters are tuned as outlined in step 5, compilation can proceed. The compiled model is then fitted to the training data in order to classify the volunteers' activities. The final step is model training and evaluation: once the model is compiled and fitted on the training data, it is evaluated against both the training data and the testing data, and the model's predicted output is compared with the true output.
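As a minimal sketch of steps 1-4, the following Python listing loads the per-subject log files, extracts features and labels, normalises, one-hot encodes and performs the 80:20 split. The file naming (mHealth_subject1.log to mHealth_subject10.log) and the column layout (23 signal attributes followed by one activity label) follow the MHEALTH documentation; the tooling (pandas, scikit-learn, Keras) and helper names are our own illustrative choices, not the authors' original code.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from tensorflow.keras.utils import to_categorical

    def load_mhealth(data_dir="MHEALTHDATASET", n_subjects=10):
        # Step 2: extract features (signal attributes) and labels
        # (activities) from each subject's log file and concatenate.
        frames = [pd.read_csv(f"{data_dir}/mHealth_subject{i}.log",
                              sep=r"\s+", header=None)
                  for i in range(1, n_subjects + 1)]
        data = pd.concat(frames, ignore_index=True)
        X = data.iloc[:, :23].values              # 23 signal attributes
        y = data.iloc[:, 23].values.astype(int)   # activity label (0-12)
        return X, y

    X, y = load_mhealth()
    X = StandardScaler().fit_transform(X)          # step 1: normalise inputs
    y_onehot = to_categorical(y, num_classes=13)   # step 3: one-hot encoding
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_onehot, test_size=0.2, random_state=42)  # step 4: 80:20 split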
We built each model around five aspects: identifying the network architecture, identifying the network layers, choosing an optimiser, choosing the loss function and setting the hyperparameters. Each network model utilises the data values given for each of the 23 signals recorded from the four sensors in order to classify our class variable, the movement that each subject performs. Fine-tuning the hyperparameters allows for beneficial development of the training process outcome.
Table 1 MLP Architecture: The MLP model contains 706,317 trainable parameters. The first hidden layer contains 128 units, the second hidden layer contains 256 units, the third hidden layer contains 512 units and the fourth hidden layer contains 1024 units.

Model:            MLP
Layers:           Input layer; 4 hidden layers; 2 dropout layers; output layer
Loss:             Categorical crossentropy
Optimiser:        Adam, learning rate set to 0.0001
Hyperparameters:  Batch size: 32; number of epochs: 20
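As an illustration, a Keras sketch of the Table 1 configuration is shown below. The position of the two dropout layers is not stated in the paper and is assumed here; with a 23-feature input and 13 output classes, this stack reproduces the 706,317 trainable parameters quoted above.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.optimizers import Adam

    mlp = Sequential([
        Dense(128, activation="relu", input_shape=(23,)),
        Dense(256, activation="relu"),
        Dense(512, activation="relu"),
        Dropout(0.4),            # dropout placement is an assumption
        Dense(1024, activation="relu"),
        Dropout(0.4),
        Dense(13, activation="softmax"),   # 12 activities plus NULL class
    ])
    mlp.compile(optimizer=Adam(learning_rate=1e-4),
                loss="categorical_crossentropy", metrics=["accuracy"])
    # mlp.fit(X_train, y_train, batch_size=32, epochs=20,
    #         validation_data=(X_test, y_test))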
Table 2 CNN Architecture: The CNN model contains 245,584 trainable parameters. The first hidden layer has 128 neurons, the second hidden layer has 256 neurons and the third hidden layer has 512 neurons.

Model:            CNN
Layers:           Input layer; 2 1D convolution layers; 2 MaxPooling1D layers; 2 dropout layers; 3 hidden layers; output layer
Loss:             Categorical crossentropy
Optimiser:        Adam, learning rate set to 0.0005
Hyperparameters:  Batch size: 32; number of epochs: 20
Table 3 ConvLSTM Architecture: The ConvLSTM model contains 191,376 trainable parameters. The first hidden layer contains 128 units, the second hidden layer contains 256 units and the third hidden layer contains 512 units.

Model:            ConvLSTM
Layers:           Input layer; 2 1D convolution layers; 2 MaxPooling1D layers; 1 LSTM layer; 2 dropout layers; 3 hidden layers; output layer
Loss:             Categorical crossentropy
Optimiser:        Adam, learning rate set to 0.001
Hyperparameters:  Batch size: 32; number of epochs: 20
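A hedged Keras sketch of the Table 3 hybrid follows. The paper does not give filter counts, kernel sizes or the exact layer ordering, so the 64 filters, kernel size 3, the (23, 1) input reshape and the dropout placement are all assumptions for illustration only.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense, Dropout
    from tensorflow.keras.optimizers import Adam

    convlstm = Sequential([
        # 23 signal attributes treated as a length-23 sequence of one channel
        Conv1D(64, kernel_size=3, activation="relu", input_shape=(23, 1)),
        MaxPooling1D(pool_size=2),
        Conv1D(64, kernel_size=3, activation="relu"),
        MaxPooling1D(pool_size=2),
        LSTM(128),
        Dropout(0.4),
        Dense(128, activation="relu"),
        Dense(256, activation="relu"),
        Dropout(0.4),
        Dense(512, activation="relu"),
        Dense(13, activation="softmax"),
    ])
    convlstm.compile(optimizer=Adam(learning_rate=1e-3),
                     loss="categorical_crossentropy", metrics=["accuracy"])
    # Inputs must be reshaped accordingly, e.g. X_train.reshape(-1, 23, 1).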
Table 4 AE w/ RF Architecture.

Model:            AE w/ RF
Layers:           Input layer; encoding layer; 4 hidden layers; output layer
Loss:             Categorical crossentropy
Optimiser:        Adam, learning rate set to 0.0005
Hyperparameters:  Batch size: 32; number of epochs: 20
Table 5 LSTM Architecture: The LSTM model contains 175,373 trainable parameters. The first hidden layer contains 128 units, the second hidden layer contains 256 units and the third hidden layer contains 512 units.

Model:            LSTM
Layers:           Input layer; 2 LSTM layers; 2 dropout layers; 3 hidden layers; output layer
Loss:             Categorical crossentropy
Optimiser:        Adam, learning rate set to 0.0001
Hyperparameters:  Batch size: 32; number of epochs: 20
The full architectural structure of each model is presented in Tables 1-6. Setting up each model involved dropout regularisation, normalising inputs, limiting vanishing and exploding gradients, and weight initialisation. Dropout regularisation randomly 'dropped' 40% (a rate of 0.4) of the units in selected hidden layers as each epoch was initialised, leading each model to learn fine-grained details about the data while updating weights during gradient descent. Normalising inputs enhanced performance by reducing the time each model takes to learn the data, accelerating the training phase. The ReLU activation function reduced vanishing and exploding gradients and significantly enhanced speed, accuracy and precision. Adam was chosen as the optimiser because its hyperparameters require little or no tuning; the learning rate was then fine-tuned for each model (between 0.0001 and 0.001, see Tables 1-5) to speed up the learning process for each neuron.
We set the following hyperparameters for each model: learning rate, number of hidden layers, number of hidden units per layer, batch size and number of epochs. The number of hidden layers and the number of hidden units per layer varied across the models and were tuned to keep results conclusive, relevant and maximised. The batch size for each model was set to 32 and the number of epochs to 20. Batch normalisation played a key role: by reparameterising the activations after each layer, it ensured that updates propagated successfully across the layers of each model, supporting accurate activity predictions.
Table 6 Hyperparameter settings applied before implementation of the XGBoost architecture.

Max depth:                   10
Number of parallel threads:  4
Number of classes:           13
Evaluation metric:           merror
Objective:                   multi:softmax
Trainable parameters:        161,959
Number of rounds:            10
Table 6 outlines the parameters set for the implementation of the XGBoost model. The maximum depth of each tree is set to 10; it is vital that the model does not become too complex and overfit. The number of parallel threads used to run XGBoost in this instance is 4. The number of classes is set to 13. The evaluation metric is 'merror', the multiclass classification error rate. The objective is set to multi:softmax, as this is a multiclass classification task.
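A minimal sketch of training with these settings via the native xgboost API is given below. The use of integer class labels (recovered from the one-hot vectors of the earlier split) and the variable names are our assumptions, not the authors' original code.

    import xgboost as xgb

    params = {
        "max_depth": 10,            # cap tree depth to limit overfitting
        "nthread": 4,               # four parallel threads
        "num_class": 13,            # 12 activities plus the NULL class
        "eval_metric": "merror",    # multiclass classification error rate
        "objective": "multi:softmax",
    }
    dtrain = xgb.DMatrix(X_train, label=y_train.argmax(axis=1))
    dtest = xgb.DMatrix(X_test, label=y_test.argmax(axis=1))
    booster = xgb.train(params, dtrain, num_boost_round=10,
                        evals=[(dtest, "test")])
    y_pred_xgb = booster.predict(dtest)   # predicted class indices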
In conclusion of the hyperparameter evaluation, we showed that: 1. regularisation is excellent at minimising overfitting on the MHEALTH dataset; 2. Adam is the optimisation algorithm best suited to this data; and 3. fine-tuning the hyperparameters to suit the subject data yields excellent, insightful results while speeding up model training.
Human activity recognition systems contain a vast amount of streaming data, of which only a certain percentage is significant for the performance of a HAR system. The imbalance between significant and insignificant data leads some activities to be easily confused with activities that have similar range-of-motion patterns yet are irrelevant to predicting the activity in question. For example, jogging is often mistaken for running, and cycling is often mistaken for running upstairs. These easily confused, irrelevant segments form the so-called NULL class. Detecting, monitoring and modelling the NULL class is a tough task, and the NULL class often represents a massive portion of a dataset: in [8], it represents 72.28% of the whole dataset. It is good practice to remove the NULL class if there is a skewed pattern in the dataset. If the dataset's attribute information and labels differ substantially from the correctly classified activities, the NULL class problem can be identified and appropriate action or precautions taken. The NULL class is not a huge problem; at most, it leads to minor confusion when classifying activities in HAR systems. As seen in [9], applying self-learning can even reap benefits from the NULL class: the studies in [9] present a performance comparison of self-taught activity spotters and report an increase of 15% in performance, showing that the NULL class, if managed accordingly, can support strong model performance.
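For illustration, removing the NULL class before training could be as simple as the following sketch, assuming X and y from the loading step above and that label 0 marks the NULL class, as in the MHEALTH logs.

    mask = y != 0                          # keep only labelled activity samples
    X_active, y_active = X[mask], y[mask]
    print(f"NULL class fraction removed: {1.0 - mask.mean():.2%}")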
4 Experimental results
Figure 2 The XGBoost confusion matrix outlines the accuracy for correctly classifying each
activity. The XGBoost approach achieved an accuracy of 89.97%.
Table 7 The total number of misclassified instances for each approach is presented in the
following table.
Table 8 The following table presents a performance comparison of our approaches. Each architecture is compared in terms of accuracy, precision, recall and F1 score. Upon comparison of each architecture across the evaluation metrics and total misclassified instances, XGBoost is the top performing model due to its performance, speed and scalability.
Table 8 compares the accuracy, precision, recall and F1 score of the proposed machine learning approaches. MLP attains the highest value on all four performance measures, achieving 90% or greater. XGBoost falls slightly short of the top spot but still achieves excellent results of 89% or greater. ConvLSTM, CNN and AE w/ RF achieve satisfactory results, with LSTM being the poorest performing model.
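As a sketch of how the Table 8 figures can be computed, the snippet below uses scikit-learn. The weighted averaging mode is an assumption, since the paper does not state how the multiclass precision, recall and F1 scores were averaged; the MLP from the earlier sketch stands in as the example model.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    y_true = y_test.argmax(axis=1)               # back from one-hot to indices
    y_pred = mlp.predict(X_test).argmax(axis=1)  # e.g. the MLP sketched earlier

    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average="weighted")
    rec = recall_score(y_true, y_pred, average="weighted")
    f1 = f1_score(y_true, y_pred, average="weighted")
    cm = confusion_matrix(y_true, y_pred)           # basis of Figure 2
    misclassified = int((y_true != y_pred).sum())   # Table 7 counts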
5 Discussion
The main conclusions from the comparison of MLP, XGBoost, CNN, LSTM, ConvLSTM (CNN+LSTM) and AE w/ RF on the MHEALTH dataset are that MLP and XGBoost reach the highest accuracy (90.55%, 89.97%), precision (91.66%, 90.09%), recall (90.55%, 89.97%) and F1 score (90.7%, 89.78%) respectively, and that MLP and XGBoost are significantly superior in their ability to distinguish between similar activities (e.g., 'jogging/running' and 'climbing stairs/knees bending (crouching)'). To the authors' knowledge, XGBoost has not previously been applied to the MHEALTH dataset to classify each subject's activity. Our findings suggest that XGBoost can be successfully applied to the MHEALTH dataset with results comparable to existing state-of-the-art baselines.
These conclusions reinforce the hypothesis that an XGBoost model created and implemented on the MHEALTH dataset to predict human activities has significant power to learn temporal feature activation dynamics and make decisive predictions when classifying a subject's activity. The XGBoost architecture offers much better analysis characteristics than the other five classification models, including regularisation, tree pruning, tree depth and sparse feature handling. XGBoost identifies the vital signs and range of motion of the activities in question more accurately. All of the findings in this discussion reiterate the hypothesis that XGBoost is the best performing model and is highly suited to analysing MHEALTH data.
Although MLP outperformed XGBoost in terms of accuracy, precision, recall and F1 score, MLP misclassified 471 instances while XGBoost misclassified only 281. CNN, ConvLSTM and LSTM misclassified 1341, 2533 and 2742 instances respectively.
6 Conclusion
In this paper, we presented a comparative study of deep learning algorithms for the HAR problem. We focused our research on the MHEALTH dataset, which contains a diverse set of activities as well as sensor data extracted from four different wearable electronic sensors. Our aim was to examine the classification proficiency of each individual deep learning model. Our experimental results show that Extreme Gradient Boosting (XGBoost) achieved the highest classification capability, upon analysing its accuracy (89.97%), precision (90.09%), recall (89.97%), F1 score (89.78%), confusion matrix and total number of misclassified instances (281). XGBoost can undoubtedly address the problem of human activity recognition in the context of MHEALTH data.
Future work on the application of XGBoost to real-world data, particularly around HAR in the healthcare domain, is recommended. In particular, conducting analysis on 100+ subjects would be interesting, in order to validate the classification capabilities on a broader range of subjects, which could lead to more insightful conclusions about why and how the model behaves as it does on certain subjects. Long-term monitoring is another possibility for future work. A further possibility is to compare the neural network models' performance metrics when using data from individual sensors or subsets of the MHEALTH dataset, which would increase practicality in producing a real-world HAR solution.
An important addition to this project would be to focus further on the XGBoost implementation, given the performance it achieved. XGBoost lends itself well to model interpretability, which is a major concern in machine and deep learning today. Due to time constraints, analysing the XGBoost model's Shapley values was not feasible. Shapley values allow a feature set to be analysed to identify each feature's marginal contribution to the overall classification prediction; they provide a detailed account of which features most influenced the model, offering transparency as well as global approximations. LIME (Local Interpretable Model-agnostic Explanations) is another technique that would have benefited this research greatly: it also offers a detailed account of model interpretability, detailing the most influential features. LIME offers local approximations while Shapley values offer global approximations. Upon extending this research, a comparison of both measures would greatly benefit model interpretability across the whole study.
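As a pointer for this extension, a minimal sketch of computing Shapley values for a trained XGBoost booster with the shap library follows; this is illustrative only and was not run in the present study.

    import shap

    # TreeExplainer computes exact Shapley values for tree ensembles.
    explainer = shap.TreeExplainer(booster)
    shap_values = explainer.shap_values(X_test)
    # Global view: mean absolute Shapley value per feature across the test set.
    shap.summary_plot(shap_values, X_test)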
References
1. Wiener, J.M., Hanley, R.J., Clark, R. and Van Nostrand, J.F., 1990. Measuring the
activities of daily living: Comparisons across national surveys. Journal of
gerontology, 45(6), pp.S229-S237.
2. Chamberlain, R.S., Sond, J., Mahendraraj, K., Lau, C.S. and Siracuse, B.L., 2018.
Determining 30-day readmission risk for heart failure patients: the Readmission After
Heart Failure scale. International journal of general medicine, 11, p.127.
3. Nguyen, T.T., Fernandez, D., Nguyen, Q.T. and Bagheri, E., 2017, November. Location-
aware human activity recognition. In International Conference on Advanced Data
Mining and Applications (pp. 821-835). Springer, Cham.
4. Mo, L., Li, F., Zhu, Y. and Huang, A., 2016, May. Human physical activity recognition
based on computer vision with deep learning model. In 2016 IEEE International
Instrumentation and Measurement Technology Conference Proceedings (pp. 1-6). IEEE.
5. re3data. 2019. Cornell Activity Datasets: CAD-60 & CAD-120. [online] Available
at: https://www.re3data.org/repository/r3d100012216.
6. Banos, O., Villalonga, C., Garcia, R., Saez, A., Damas, M., Holgado-Terriza, J.A., Lee,
S., Pomares, H. and Rojas, I., 2015. Design, implementation and validation of a novel
open framework for agile development of mobile health applications. Biomedical
engineering online, 14(2), p.S6.
7. Banos, O., Garcia, R., Holgado-Terriza, J.A., Damas, M., Pomares, H., Rojas, I., Saez,
A. and Villalonga, C., 2014, December. mHealthDroid: a novel framework for agile
development of mobile health applications. In International workshop on ambient
assisted living (pp. 91-98). Springer, Cham.
8. Li, F., Shirahama, K., Nisar, M., Köping, L. and Grzegorzek, M., 2018. Comparison of
feature learning methods for human activity recognition using wearable
sensors. Sensors, 18(2), p.679.
9. Amft, O., 2011, June. Self-taught learning for activity spotting in on-body motion sensor data. In 2011 15th Annual International Symposium on Wearable Computers (pp. 83-86). IEEE.