Deep Learning for Sensor-based Human Activity

Recognition: Overview, Challenges, and Opportunities

KAIXUAN CHEN and DALIN ZHANG, Aalborg University, Denmark

LINA YAO, University of New South Wales, Australia
BIN GUO and ZHIWEN YU, Northwestern Polytechnical University, China
YUNHAO LIU, Michigan State University, USA

The vast proliferation of sensor devices and Internet of Things enables the applications of sensor-based ac-
tivity recognition. However, there exist substantial challenges that could influence the performance of the
recognition system in practical scenarios. Recently, as deep learning has demonstrated its effectiveness in
many areas, plenty of deep methods have been investigated to address the challenges in activity recognition.
In this study, we present a survey of the state-of-the-art deep learning methods for sensor-based human activ-
ity recognition. We first introduce the multi-modality of the sensory data and provide information for public
datasets that can be used for evaluation in different challenge tasks. We then propose a new taxonomy to
structure the deep methods by challenges. Challenges and challenge-related deep methods are summarized
and analyzed to form an overview of the current research progress. At the end of this work, we discuss the
open issues and provide some insights for future directions.
CCS Concepts: • General and reference → Surveys and overviews; • Hardware → Sensor devices and
platforms; • Computer systems organization → Neural networks;
Additional Key Words and Phrases: Activity recognition, deep learning, sensors
ACM Reference format:
Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, and Yunhao Liu. 2021. Deep Learning for Sensor-
based Human Activity Recognition: Overview, Challenges, and Opportunities. ACM Comput. Surv. 54, 4, Ar-
ticle 77 (May 2021), 40 pages.

1 INTRODUCTION
Recent advances in human activity recognition have enabled myriad applications such as smart
homes [61], healthcare [79], and enhanced manufacturing [46]. Activity recognition is essential
to humanity, since it records people’s behaviors with data that allows computing systems to

Kaixuan Chen and Dalin Zhang contributed equally to this research.

Authors’ addresses: K. Chen and D. Zhang (corresponding author), Department of Computer Science, Aalborg University,
Fredrik Bajers Vej 7K, 9220 Aalborg, Denmark; emails: {kchen, dalinz}@[Link]; L. Yao, School of Computer Science and
Engineering, University of New South Wales, Sydney, NSW, 2052, Australia; email: [Link]@[Link]; B. Guo and Z. Yu,
School of Computer Science, Northwestern Polytechnical University, 127 West Youyi Road, Beilin District, Xi’an Shaanxi,
710072, China; emails: [Link]@[Link], zhiwenyu@[Link]; Y. Liu, Department of Computer Science and
Engineering, Michigan State University, East Lansing, MI, 48824; email: yunhao@[Link].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be
honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
0360-0300/2021/05-ART77 $15.00

ACM Computing Surveys, Vol. 54, No. 4, Article 77. Publication date: May 2021. 77

Fig. 1. Categories of deep learning in sensor-based human activity recognition.

monitor, analyze, and assist their daily life. There are two mainstreams of human activity recogni-
tion systems: video-based systems and sensor-based systems. Video-based systems use cameras to
take images or videos to recognize people’s behaviors [9]. Sensor-based systems utilize on-body
or ambient sensors to dead-reckon people’s motion details or log their activity tracks. Consid-
ering the privacy issues of installing cameras in our personal space, sensor-based systems have
dominated the applications of monitoring our daily activities. Besides, sensors take advantage of
pervasiveness. Thanks to the proliferation of smart devices and Internet of Things, sensors can
be embedded in portable devices such as phones, watches, and nonportable objects such as cars,
walls, and furniture. Sensors are widely embedded around us, uninterruptedly and non-intrusively
logging humans’ motion information.

1.1 Challenges in Human Activity Recognition

Many machine learning methods have been employed in human activity recognition. However,
this field still faces many technical challenges. Some of the challenges are shared with other pattern
recognition fields such as computer vision and natural language processing, while some are unique
to sensor-based activity recognition and require dedicated methods for real-life applications. Here
we list the categories of challenges that the activity recognition community should respond to.
The taxonomy is shown in Figure 1.
• The first challenge is feature extraction. Activity recognition is a classification task, so it
shares a common challenge with other classification problems, which is feature extraction.
For sensor-based activity recognition, feature extraction is more difficult, because there is
inter-activity similarity [22]. Different activities may have similar characteristics (e.g., walk-
ing and running). Therefore, it is difficult to produce distinguishable features to represent
activities uniquely.
• Training and evaluation of learning techniques require large annotated data samples. How-
ever, it is expensive and time-consuming to collect and annotate sensory activity data.
Therefore, annotation scarcity is a remarkable challenge for sensor-based activity recog-
nition. Besides, data for some emergent or unexpected activities (e.g., accidental fall) is es-
pecially hard to obtain, which leads to another challenge called class imbalance.
• Human activity recognition involves three factors: users, time, and sensors. First, activity
patterns are person-dependent. Different users may have diverse activity styles. Second,
activity concepts vary over time. The assumption that users’ activity patterns remain un-
changed for a long period is impractical. Moreover, novel activities are likely to emerge
when in use. Third, diverse sensor devices are opportunistically configured on human bod-
ies or in environments. The composition and the layouts of sensors dramatically influence
the data stimulated by activities. All the three factors lead to distribution discrepancy
between the training data and test data and need to be mitigated urgently.
• The complexity of data association is another reason that makes recognition challenging.
Data association refers to how many users and how many activities the data is associated
with. There are many specific challenges in activity recognition that are driven by sophis-
ticated data association. The first challenge can be seen in composite activities. Most ac-
tivity recognition tasks are based on simple activities, like walking and sitting. However,
more meaningful ways to log human daily routines are composite activities that comprise a
sequence of atomic activities. For example, “washing hands” can be represented as {turning
on the tap, soaping, rubbing hands, turning off the tap}. The second challenge, which stems from
composite activities, is data segmentation. A composite activity can be defined as a sequence of
activities. Therefore, accurate recognition relies heavily on precise data segmentation techniques.
Concurrent activities present the third challenge. Concurrent activities occur when
a user participates in more than one activity simultaneously, such as answering a phone call
while watching TV. Multi-occupant activities are also associated with the complexity of
data association. Recognition is arduous when multiple users engage in a set of activities,
which usually happens in multi-resident scenarios.
• Another factor to consider is the feasibility of the human activity recognition
system. Because human activity recognition is so close to people’s daily life, effort is needed
to make the system acceptable to a vast number of users. The concerns are
twofold. First, the system should be resource-efficient so that it fits portable devices and can
give an instant response. Thus, the computational cost issue should be addressed.
Second, as the recognition system records users’ lives continuously, there are risks of personal
information disclosure, which leads to the privacy issue.
• Unlike images or texts, raw sensory data is not directly readable by humans. Moreover, sensory
data inevitably includes a lot of noise on account of the inherent imperfections of sensors. So,
reliable recognition solutions should offer interpretability of sensory data and the capability
of understanding which parts of the data facilitate recognition and which parts degrade
it.

1.2 Deep Learning in Human Activity Recognition

Numerous previous works adopted machine learning methods in human activity recognition [75].
They highly rely on feature extraction techniques including time-frequency transformation [60],
statistical approaches [22], and symbolic representation [82]. However, the features extracted are
carefully engineered and heuristic. There were no universal or systematical feature extraction
approaches to effectively capture distinguishable features for human activities.
In recent years, deep learning has achieved conspicuous success in modeling high-level
abstractions from intricate data [108] in many areas such as computer vision, natural language
processing, and speech processing. After early works, including References [54, 73, 153], examined
the effectiveness of deep learning in human activity recognition, related studies sprang up in this
area. As deep learning has matured in human activity recognition, recent
works have been undertaken to address the specific challenges. However, deep learning still meets
with reluctant acceptance by researchers owing to its abrupt success, bustling innovation, and lack
of theoretical support. Therefore, it is necessary to demonstrate the reasons behind the feasibility
and success of deep learning in human activity recognition despite the challenges.

• The most attractive characteristic of deep learning is “deep.” Layer-by-layer structures
allow deep models to learn features scalably, from simple to abstract. Also, advanced computing
resources like GPUs provide deep models with a powerful ability to learn descriptive
features from complex data. The outstanding learning ability also enables the activity
recognition system to deeply analyze multimodal sensory data for accurate recognition.
• Diverse structures of deep neural networks encode features from multiple perspectives. For
example, convolutional neural networks (CNNs) are competent in capturing the local
connections of multimodal sensory data, and the translational invariance introduced by
locality leads to accurate recognition [56]. Recurrent neural networks (RNNs) extract
the temporal dependencies and incrementally learn information through time intervals so
are appropriate for streaming sensory data in human activity recognition.
• Deep neural networks are modular and can be flexibly composed into unified networks
with one overall optimization function, which accommodates miscellaneous deep
learning techniques including deep transfer learning [3], deep active learning [49], deep attention
mechanisms [96], and other less systematic but equally effective solutions [62, 89]. Works
that adopt these techniques cater to various challenges in deep learning.

1.3 Key Contributions

Unlike the existing surveys related to deep learning in human activity recognition, we focus dis-
tinctly on the challenges of human activity recognition and how motivated deep learning models
and techniques are developed to be challenge-specific. Specifically, Wang et al. [145] surveyed
a number of deep learning methods for sensor-based human activity recognition in the view of
model structures. Nweke et al. [99] presented a survey only on mobile and wearable sensor-based
activity recognition and categorized the deep learning methods into generative, discriminative, and
hybrid methods. Li et al. [78] introduced different deep neural networks for radar-based activity
recognition. These surveys only discuss the deep models that can be used for activity recognition
(e.g., CNNs and RNNs) while we expand the scope to the techniques that can be well merged with
deep learning to tackle specific challenges (e.g., deep transfer learning, multimodal fusion).
Compared with the existing surveys, the key contributions of this work are as follows:

• We conduct a comprehensive survey of deep learning approaches for sensor-based human
activity recognition. Our work provides a panorama of current progress and an in-depth
analysis of the reviewed methods to serve both novices and experienced researchers.
• We propose a new taxonomy of deep learning methods in the view of challenges of activity
recognition. Challenges stimulated by different reasons are presented for the readers to scan
which research direction is of interest.
• We summarize the state-of-the-art and how specific deep networks or deep techniques can
be applied to address the challenges with comprehensive analysis. We compare different
solutions for the same challenges and list the pros and cons. The challenge-method-analysis
format aims to build a problem-solution structure with a hope to suggest a rough guideline
when readers are selecting their research topics or developing their approaches.
• Moreover, we provide information on available public datasets and their potential extension
to evaluate specific challenges.
• We discuss some open issues in this field and point out potential future research directions.

2 SENSOR MODALITY AND DATASETS

2.1 Sensor Modality
The performance of an activity recognition system depends crucially on the sensor modality used.
In this section, we classify the sensor modalities into four categories: wearable sensors, ambient
sensors, object sensors, and other modalities. Owing to the page limitation, the details of each
category are presented in the Supplementary Material.

2.2 Datasets
There are several publicly available human activity recognition datasets. We summarize some of
the most popular ones in Table 1, which contains the data acquisition context, number of subjects,
number of activities, sensor types, and potential challenge tasks they can be used in. In the data ac-
quisition context, “daily living” refers to subjects performing common daily living activities under
instructions. The challenges are explained in detail in Section 3.

3 CHALLENGES AND TECHNIQUES

3.1 Feature Extraction
While progress has been made, human activity recognition remains a challenging task. This is
partly due to the broad range of human activities and the rich variation in how a given activity
can be performed. Using features that clearly separate activities is crucial. Feature extraction is one
of the key steps in activity recognition, since it can capture relevant information to differentiate
various activities. The accuracy of activity recognition approaches dramatically depends on the
features extracted from raw signals. Supervised, semi-supervised, and unsupervised approaches
all contribute substantially to human activity recognition. After supervised learning proved to be
effective in extracting features from activity data [61, 65], a wealth of works on supervised learning
have been proposed, considering that supervised approaches are more amenable to end-to-end training.
For better organization, in this survey, we focus only on supervised learning methods in the context
of feature extraction. Unsupervised and semi-supervised learning methods are mainly introduced
in the context of annotation scarcity. We group feature extraction methods for activity recognition
into temporal features, multimodal features, and statistical features.
3.1.1 Temporal Feature Extraction. Typically, a human activity is a combination of several continuous
basic movements and can last from a few seconds to several minutes. Therefore,
considering the relatively high sensing frequency (tens to hundreds of Hz), the data of human activity
is represented by time-series signals. In this context, the basic streaming movements are more
likely to exhibit a smooth fluctuation, while, in contrast, the transitions between consecutive ba-
sic movements may induce substantial changes. To capture such signal characteristics of human
activities, it is essential to extract temporal features of both within and between successive basic
movements.

Table 1. Public Datasets for Human Activity Recognition

Dataset | Context | # Subjects | # Activities | Sensor Types | Challenges
WISDM Activity Prediction [71] | Daily Living | 29 | 6 | Wearable | Class Imbalance
UCI HAR [8] | Daily Living | 30 | 6 | Wearable | Multimodal
OPPORTUNITY [24, 120] | Daily Living | 4 | 9 | Wearable, Object, Ambient | Multimodal, Composite Activity
Skoda Checkpoint [160] | Car Maintenance | 1 | 10 | Wearable | Simple
Daphnet Freezing of Gait [11] | Patients of Parkinson’s Disease | 10 | 3 | Wearable | Simple
Berkeley MHAD | Daily Living | 12 | 11 | Wearable, Ambient | Multimodal
PAMAP2 [117] | Daily Living | 9 | 18 | Wearable | Multimodal
SHO [130] | Daily Living | 10 | 7 | Wearable | Simple
UCI HAPT [118] | Daily Living with activity transition | 30 | 6 | Wearable | Multimodal
UTD-MHAD [25] | Controlled Conditions | 8 | 27 | Wearable | Multimodal
HHAR [134] | Daily Living | 9 | 6 | Wearable | Multimodal, Distribution Discrepancy
ARAS [6] | Real-world Home Living | 2 | 27 | Ambient, Object | Multimodal, Multi-occupant
Ambient Kitchen [104] | Food Preparation | 20 | 11 | Object | Simple
USC-HAD [172] | Daily Living | 14 | 12 | Wearable | Multimodal
MHEALTH [16] | Real-world Home Living | 10 | 12 | Wearable | Multimodal
BIDMC Congestive Heart Failure [15] | Heart Failure | 15 | 2 | Wearable | Class Imbalance
DSADS [17] | Daily Living and Sports | 8 | 19 | Wearable | Multimodal
CASAS-4 [131] | Real-world Home Living | 2 | 15 | Object, Ambient | Multi-occupant, Composite Activity, Multimodal
Smartwatch/Notch/Farseeing [91] | Daily Living & Fall Detection | 7 | 4 ADL & 4 Fall | Wearable | Class Imbalance
Darmstadt Daily Routines [59] | Real-world Routines | 1 | 35 | Wearable | Class Imbalance
MotionSense [88] | Daily Living | 24 | 6 | Wearable | Simple
MobiAct/MobiFall [141] | Daily Living & Fall Detection | 66 | 12 ADL & 4 Fall | Wearable | Multimodal
VanKasteren benchmark [139] | Real-world Home Living | 3 | 9 | Object | Simple
ActiveMiles | Real-world Routines | 10 | 7 | Wearable | Multimodal
ActRecTut [22] | Hand Gesture & Playing Tennis | 2 | 12 | Wearable | Multimodal

Fig. 2. Example structures for temporal feature extraction.

Some researchers manage to adopt traditional methods to extract temporal features and use
deep learning techniques for the following activity recognition. Basic signal statistics and wave-
form traits such as mean and variance of time-series signals are commonly applied handcrafted
features for early-stage deep learning activity recognition [142]. This kind of feature is coarse and
lacks scalability. A more advanced temporal feature extraction approach is to exploit the spectral
power changes as time evolves by converting the time series from the time domain to the frequency
domain. A general example structure is shown in Figure 2(a), where a 2D-CNN is usually used to
process the spectral features. In Reference [65], Jiang and Yin applied the Short-time Discrete
Fourier Transform (STDFT) to time-series signals and constructed a time-frequency-spectral
image. A CNN is then used to process the image for recognizing simple daily activities such as
walking and standing. More recently, Laput and Harrison [74] developed a fine-grained hand ac-
tivity sensing system through the combination of the time-frequency-spectral features and CNNs.
They demonstrated 95.2% classification accuracy over 25 atomic hand activities of 12 people. The
spectral features can not only be used for the wearable sensor activity recognition but also be used
for the device-free activity recognition. Fan et al. [42] proposed to develop time-angle spectrum
frames for representing the spectral power variations along time in different spatial angles of the
RFID signals.
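The time-to-frequency conversion described above can be illustrated with a minimal sketch. It is not the pipeline of Reference [65]: the function name `stdft_spectrogram`, the Hann window, the frame length, the hop size, and the synthetic 50 Hz signal are all assumptions chosen for illustration, and the naive DFT loop stands in for the FFT routine that a real system would use.

```python
import cmath
import math


def stdft_spectrogram(signal, frame_len, hop):
    """Return one magnitude spectrum (bins 0..frame_len//2) per windowed frame."""
    # A Hann window tapers each frame to reduce spectral leakage (assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Synthetic single-axis trace sampled at 50 Hz: 2 s at 2 Hz, then 2 s at 5 Hz.
fs = 50
slow = [math.sin(2 * math.pi * 2 * t / fs) for t in range(100)]
fast = [math.sin(2 * math.pi * 5 * t / fs) for t in range(100)]
spec = stdft_spectrogram(slow + fast, frame_len=50, hop=25)

# Index of the strongest frequency bin in each frame (1 Hz per bin here).
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

Each row of `spec` is one time step of the time-frequency image; stacking the rows gives the 2D input that a 2D-CNN such as the one in Figure 2(a) would consume. The shift of the dominant bin between early and late frames is the kind of spectral power change the text refers to.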
Since one of the greatest advantages of deep learning is its impressive
power of automatic feature learning, extracting temporal features with a neural network makes it
possible to construct an end-to-end deep learning model. End-to-end learning simplifies the
training procedure and lets the feature learning and recognition processes reinforce each other. Various
deep learning approaches have been applied for temporal information extraction, including RNN,
temporal CNN, and their variants. RNN is a widely applied deep temporal feature extraction approach
in many fields [92, 169]. Traditional RNN cells suffer from vanishing/exploding gradient
problems, which limits their application to activity recognition. The Long Short-Term Memory (LSTM)
units, which mitigate this issue, are usually used to build an RNN for temporal feature extraction
[45]. The depth of an effective LSTM-based RNN needs to be at least two when processing
sequential data [67]. As the sensor signals are continuous streaming data, a sliding window is gen-
erally used to segment the raw data to individual pieces, each of which is the input of an RNN cell
[32]. A typical LSTM-based structure for temporal feature extraction is illustrated in Figure 2(b).
The length and moving step of the sliding window are hyper-parameters that need to be care-
fully tuned for achieving satisfying performance. Besides the early application of the basic LSTM
network, continuing research of diverse RNN variants is also being investigated in the human ac-
tivity recognition field. The Bidirectional LSTM (Bi-LSTM) structure that has two conventional
LSTM layers for extracting temporal dynamics from both forward and backward directions is an
important variant of the RNN in various domains including human activity recognition [61]. In ad-
dition, Guan and Plötz [48] proposed an ensemble approach of multiple deep LSTM networks and
demonstrated superior performance to individual networks on three benchmark datasets. Aside
from the variants of the RNN structure, some researchers also studied different RNN cells. For
example, Yao et al. [158] leveraged the Gated Recurrent Units (GRUs) instead of LSTM cells
to construct an RNN and applied it to activity recognition. However, some studies revealed that
the other sorts of RNN cells could not provide notably superior performance to the conventional
LSTM cell concerning classification accuracy [45]. Nevertheless, owing to their computational
efficiency, GRUs are more suitable for mobile devices where the computation resources are limited.
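The sliding-window segmentation mentioned in this paragraph, with its two hyper-parameters (window length and moving step), can be sketched as follows. The helper name `sliding_windows` and the toy tri-axial stream are hypothetical; the window length of 4 and step of 2 are arbitrary illustrative values, not recommendations from the surveyed works.

```python
def sliding_windows(samples, length, step):
    """Split a stream of samples into fixed-length, possibly overlapping windows."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    # Only full windows are kept; a trailing partial window is dropped.
    return [samples[i:i + length] for i in range(0, len(samples) - length + 1, step)]


# Ten synthetic tri-axial readings (x, y, z).
stream = [(t, 2 * t, 3 * t) for t in range(10)]

# A step of half the window length gives 50% overlap between neighbors.
windows = sliding_windows(stream, length=4, step=2)
```

Each element of `windows` would then be fed to the recurrent network as one input segment. A smaller step yields more, and more strongly overlapping, training segments at the price of extra computation, which is one reason both values need tuning.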
CNN is another favorable deep learning architecture for temporal feature extraction. Unlike
RNN, a temporal CNN does not need a sliding window for segmenting streaming data. The convo-
lution operations with small kernels are directly applied along the temporal dimension of sensor
signals so local temporal dependencies can be captured. Some works employed one-dimensional
(1D) convolutions on the individual univariate time series signals for temporal feature extraction
[13, 39, 46, 122, 123, 153]. When there were multiple sensors or multiple axes, multivariate time
series would be yielded, thus requiring the 1D convolutions to be applied separately. Figure 2(c)
presents a typical 1D-CNN structure for temporal feature handling. Conventional 1D CNNs usually
have a fixed kernel size, and thus can only discover the signal fluctuations within a fixed temporal
range. Considering this gap, Lee et al. [76] combined multiple CNN structures of different kernel
sizes to obtain the temporal features from different time scales. However, the multi-kernel CNN
structure would consume more computational resources, and the temporal scale that a pure CNN
could explore is inadequate as well. Furthermore, if a large time scale is desirable, then a pooling
operation would be commonly used between two CNN layers, which would cause information
loss. Xi et al. [149] applied a deep dilated CNN to time series for solving the issues. The dilated
CNN uses dilated convolution kernels instead of the standard convolutional kernels to expand the
convolution receptive field (i.e., time length) with no loss of resolution. Because the dilated ker-
nel only adds empty elements between the elements of the conventional convolution kernel, it
does not require an extra computational cost. In addition to the consideration of various temporal
scales, the temporal disparity of different sensing modalities (e.g., different sensors, axes, or chan-
nels) is also a critical concern, since commonly used CNN treats different modalities in the same
way. To resolve this concern, Ha and Choi [53] presented a new CNN structure that had specific
1D CNNs for different modalities for learning modality-specific temporal characteristics. With the
development of the CNNs, other kinds of CNN variants are also considered for effectively embed-
ding temporal features. Shen et al. [129] utilized the gated CNN for daily activity recognition from
audio signals and showed superior accuracy to the naive CNN. Long et al. adopted residual blocks
to build a two-stream CNN structure dealing with different time scales.
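To make the receptive-field argument for dilated convolutions concrete, here is a minimal sketch in plain Python. It is not the network of Reference [149]: the function names, the three-tap kernel, and the dilation schedule (1, 2, 4) are assumptions for illustration only.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D cross-correlation with a dilated kernel."""
    # Number of time steps spanned by one placement of the dilated kernel.
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(w * signal[t + j * dilation] for j, w in enumerate(kernel))
        for t in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = list(range(16))

# Three weights, but the taps are 4 steps apart, so one output sees 9 time steps.
out = dilated_conv1d(signal, kernel=[1, 0, -1], dilation=4)

# Three stacked layers with kernel size 3: standard versus dilated.
rf_standard = receptive_field(3, [1, 1, 1])
rf_dilated = receptive_field(3, [1, 2, 4])
```

The dilated kernel still performs three multiplications per output, the same as the undilated one, yet the stacked dilated layers cover roughly twice the time span of the standard stack without any pooling between layers.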
Developing a deep hybrid model to explore different views of temporal dynamics is another
attractive trend in the human activity recognition community. In light of the advantages of CNN
and RNN, Ordóñez and Roggen [101] proposed to combine CNNs and LSTMs for both local and
global temporal feature extraction. Wang et al. [147] developed a classifier with a CNN and an
LSTM to automatically extract complicated features from the acoustic data and perform gesture
recognition. Xu et al. [151] adopted the advanced Inception CNN structure for different scales of
local temporal feature extraction and took the GRUs for efficient global temporal representations.
Yuki et al. [159] employed a dual-stream ConvLSTM network with one stream handling smaller
time length and the other one handling longer time length to analyze more complex temporal
hierarchies. Zou et al. [180] introduced an Autoencoder to first enhance feature extraction and
then applied a cascaded CNN-LSTM to extract local and global features for WiFi-based activity
recognition. Alternatively, Gumaei et al. [50] proposed a hybrid model of different types of recurrent
units (SRUs and GRUs) for handling different aspects of temporal information.

Fig. 3. Multi-modality fusion strategies.

Fig. 4. Various strategies for feature fusion.

3.1.2 Multimodal Feature Extraction. Current human activity recognition research
usually relies on multiple different sensors, such as accelerometers, gyroscopes, and magnetometers.
Some research has further demonstrated that the combination of diverse sensing modalities
can obtain better results than one particular sensor only [51]. As a result, learning the inter-
modality correlations along with the intra-modality information is a major challenge in the field of
deep learning-based human activity recognition. The sensing modality fusion can be performed
following two strategies: Feature Fusion (Figure 3(a)), which combines different modalities to
produce single feature vectors for classification; and Classifier Ensemble (Figure 3(b)), in which
outputs of classifiers operating only on features of one modality are blended together.
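The difference between the two strategies can be sketched with toy linear classifiers. Everything below is assumed for illustration: the two modalities, the feature values, the weight matrices, and in particular the choice of averaging class probabilities as the blending rule, since the text does not prescribe how classifier outputs are blended.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(features, weights):
    """One score per class: dot product of the features with that class's weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def feature_fusion(acc_feat, gyro_feat, fused_weights):
    # Concatenate modality features, then classify the single fused vector.
    return softmax(linear_scores(acc_feat + gyro_feat, fused_weights))


def classifier_ensemble(acc_feat, gyro_feat, acc_weights, gyro_weights):
    # Classify each modality on its own, then average the class probabilities
    # (averaging is an assumed blending rule).
    p_acc = softmax(linear_scores(acc_feat, acc_weights))
    p_gyro = softmax(linear_scores(gyro_feat, gyro_weights))
    return [(a + g) / 2 for a, g in zip(p_acc, p_gyro)]


# Toy per-modality features and weights (2 classes, 2 features per modality).
acc_feat, gyro_feat = [0.9, 0.1], [0.2, 0.7]
acc_w = [[1.0, 0.0], [0.0, 1.0]]
gyro_w = [[0.5, 0.0], [0.0, 0.5]]
fused_w = [a + g for a, g in zip(acc_w, gyro_w)]  # 2 classes x 4 fused features

p_fused = feature_fusion(acc_feat, gyro_feat, fused_w)
p_ens = classifier_ensemble(acc_feat, gyro_feat, acc_w, gyro_w)
```

In feature fusion a single classifier sees all modalities at once and can therefore weigh cross-modality combinations, whereas in the ensemble each classifier sees only its own modality and the interaction happens only at the level of the output probabilities.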
Münzner et al. [95] investigated the feature fusion manner of deep neural networks for multi-
modal activity recognition. They organized the fusion manners into four categories according to
different fusion stages within a network. However, their study focuses on CNN-based architectures
only. Here, we extend their definitions of feature fusion manners to all deep learning architectures
and manage to reveal more insights and specific considerations.
Early Fusion (EF) (Figure 4(a)). This manner fuses the data of all sources at the beginning, irre-
spective of sensing modalities. It is attractive in terms of simplicity as a strategy, though it is at risk
of missing detailed correlations. A simple fusion approach in Reference [76] transformed the raw x,
y, and z acceleration data into a magnitude vector by calculating the Euclidean norm of x, y, and z
values. Gu et al. [47] stacked the time-series signals of different modalities horizontally into a single
1D vector and utilized a denoising autoencoder to learn robust representations. The output of the
intermediate layer was used to feed the final softmax classifier. In contrast, Ha et al. [54] proposed
to vertically stack all signal sequences to form a 2D matrix and directly applied 2D-CNNs to
simultaneously capture local dependencies over time and spatial dependencies over
modalities. In Reference [52], the authors converted the raw signal sequence of a single modality
into a 2D format by simple reorganization and stacked all modalities along the depth dimension
to obtain 3D data matrices. Afterwards, they applied a 3D-CNN to exploit the inter- and
intra-modality features. However, conventional CNN is restricted to explore the correlations of
neighboring arranged modalities and thus misses the relations between the nonadjacent modali-
ties. To solve this issue, unlike naturally organizing various data sources, Jiang and Yin [65] assem-
bled signal sequences of different modalities into a novel arrangement where every signal sequence
has the chance to be adjacent to every other sequence. This organization facilitates the DCNN to
extract elaborated correlations of individual sensing axes. Dilated convolution is another solution
to exploiting nonadjacent modalities without information loss and extra computational expenses
[150]. In addition to wearable sensors, RFID-based activity recognition requires the fusion of
multiple RFID signals as well, and CNNs are also commonly used for the early fusion manner [80].
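The three early-fusion arrangements described above (magnitude vector, horizontal stacking, and vertical stacking) can be sketched in a few lines. This is a minimal illustration, not code from any cited work; the toy readings, the dictionary layout, and all function names are assumptions made for the example.

```python
import math

# Hypothetical toy window: three time steps from a tri-axial accelerometer
# and a tri-axial gyroscope (values are made up for illustration).
acc = {"x": [3.0, 0.0, 1.0], "y": [4.0, 0.0, 2.0], "z": [0.0, 5.0, 2.0]}
gyr = {"x": [0.1, 0.2, 0.3], "y": [0.0, 0.1, 0.0], "z": [0.5, 0.4, 0.3]}

def magnitude(sensor):
    """Collapse x, y, z into one series via the Euclidean norm per time step."""
    return [math.sqrt(x * x + y * y + z * z)
            for x, y, z in zip(sensor["x"], sensor["y"], sensor["z"])]

def stack_horizontally(*sensors):
    """Concatenate every axis of every sensor into a single 1D vector."""
    return [v for s in sensors for axis in ("x", "y", "z") for v in s[axis]]

def stack_vertically(*sensors):
    """Place each axis on its own row, giving a (num_axes x time) 2D matrix."""
    return [list(s[axis]) for s in sensors for axis in ("x", "y", "z")]

acc_mag = magnitude(acc)              # [5.0, 5.0, 3.0]
flat = stack_horizontally(acc, gyr)   # 6 axes * 3 steps = 18 values
grid = stack_vertically(acc, gyr)     # 6 rows, 3 columns
```

The 1D vector would suit a fully connected model such as an autoencoder, whereas the 2D matrix is the kind of input a 2D-CNN expects.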
Sensor-based Fusion (SF) (Figure 4(b)). In contrast to EF, SF first considers each modality in-
dividually and then fuses different modalities afterwards. Such an architecture not only extracts
modality-specific information from various sensors but also allows flexible complexity distribu-
tion, since the structures of the modality-specific branches can be different. In References [113,
114], Radu et al. proposed a fully connected deep neural network (DNN) architecture to facili-
tate the intra-modality learning. Independent DNN branches are assigned to each sensor modality,
and a unifying cross-sensor layer merges all the branches to uncover the inter-modality informa-
tion. Yao et al. [158] vertically stacked all axes of a sensor to form 2D matrices and designed
individual CNNs for each 2D matrix to learn the intra-modality relations. The sensor-specific fea-
tures of different sensors are then flattened and stacked into a new 2D matrix before being fed into
a merge CNN that further extracts the interactions among different sensors. A more advanced
fusion approach was proposed by Choi et al. [35] to efficiently fuse different modalities by
regulating the contribution of each sensor. The authors designed a confidence calculation layer
that automatically determines the confidence score of a sensing modality; the confidence
score is then normalized and multiplied with the pre-processed features before the features are fused
by addition. Instead of fusing sensor-specific features only at the late stage, Ha and Choi [53]
proposed to also create a vector of the different modalities at the early stage and to extract the
characteristics common across modalities along with the sensor-specific characteristics; both kinds of
features are then fused in the later part of the model.
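The confidence-weighted fusion idea can be illustrated with a small sketch. It is not the implementation of Reference [35]: there the confidence scores come from a learned layer, whereas here they are given as inputs, and the softmax normalization is an assumption (the text only states that the scores are normalized). All names are hypothetical.

```python
import math

def normalize_confidences(scores):
    """Softmax normalization (an assumption; the source only says 'normalized')."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_by_weighted_addition(features, scores):
    """Scale each sensor's feature vector by its confidence, then add them."""
    weights = normalize_confidences(scores)
    length = len(features[0])
    fused = [0.0] * length
    for vec, w in zip(features, weights):
        for i in range(length):
            fused[i] += w * vec[i]
    return fused, weights

# Hypothetical features from two sensor branches with equal raw confidence.
features = [[1.0, 2.0], [3.0, 6.0]]
fused, weights = fuse_by_weighted_addition(features, scores=[0.0, 0.0])
# Equal scores give weights [0.5, 0.5], so fused == [2.0, 4.0].

# A sensor with a much lower score contributes almost nothing.
fused_skewed, _ = fuse_by_weighted_addition(features, scores=[10.0, -10.0])
```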
Axis-based Fusion (AF) (Figure 4(c)). This manner treats signal sources in more detail by
handling each sensor axis separately. In this way, interference between different sensor
axes is avoided. Reference [95] referred to this manner as Channel-based late fusion (CB-LF).
Nevertheless, the sensor channel may be confused with the “channel” in CNNs, so we use the
term “axis” instead in this article. A commonly used AF strategy is to design a specific neural
network for each univariate time series of each sensing channel [163, 177]. The information
representations from all channels are finally concatenated and fed into a final classification
network. 1D-CNNs are widely used as the feature learning network of each sensing channel.
Dong and Han [38] proposed to use separable convolution operations to extract the specific
temporal features of each axis and concatenate all the features before feeding a fully connected
layer. In studies that apply deep learning to hand-crafted features, axis-specific processing is
required. For instance, in Reference [62], temporal features of acceleration and gyro signals
are first represented by FFT spectrogram images and then vertically combined into a larger image,
from which a DCNN learns inter-modality features. Furthermore, some research combined
the spectrogram images along the depth dimension to establish a 3D format [74], which could be
easily handled by 2D CNNs with the depth dimension.
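The basic AF pattern (one dedicated feature extractor per axis, followed by concatenation) can be sketched as follows. A single fixed 1D filter stands in for each axis-specific network, which is a strong simplification; the data and filter values are made up.

```python
def conv1d_valid(series, kernel):
    """Plain 1D cross-correlation without padding ('valid' mode)."""
    k = len(kernel)
    return [sum(series[t + j] * kernel[j] for j in range(k))
            for t in range(len(series) - k + 1)]

def axis_based_features(axes, kernels):
    """Apply a dedicated filter to each axis, then concatenate the outputs."""
    fused = []
    for series, kernel in zip(axes, kernels):
        fused.extend(conv1d_valid(series, kernel))
    return fused

axes = [[1.0, 2.0, 3.0, 4.0],   # hypothetical accelerometer x axis
        [0.0, 1.0, 0.0, 1.0]]   # hypothetical accelerometer y axis
kernels = [[1.0, -1.0],         # filter owned by axis 0
           [0.5, 0.5]]          # filter owned by axis 1
features = axis_based_features(axes, kernels)
# features == [-1.0, -1.0, -1.0, 0.5, 0.5, 0.5]
```

In a real model the concatenated vector would be passed to the final classification network.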
Shared-filter Fusion (SFF) (Figure 4(d)). Like the AF approach, this manner processes the
univariate time-series data of each sensor axis independently. However, the same filter is applied to
all time sequences. Therefore, the filters are influenced by all input members. Compared to the AF
manner, SFF is simpler and contains fewer trainable parameters. The most popular SFF approach
is to organize the raw sensing sequences into a 2D matrix by stacking them along the modality
dimension and then to apply a 2D-CNN with 1D filters to the 2D matrix [39, 153, 161]. As a result,
the architecture is equivalent to applying identical 1D-CNNs to the different univariate time series.
Although the features of all sensing modalities are not merged explicitly, they communicate through
the shared 1D filters.
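The equivalence stated above can be checked numerically: a 2D convolution whose kernel has height 1 produces the same output as applying the same 1D filter to every row of the stacked matrix. The sketch below uses made-up data and hypothetical helper names.

```python
def conv1d_valid(series, kernel):
    """1D cross-correlation without padding."""
    k = len(kernel)
    return [sum(series[t + j] * kernel[j] for j in range(k))
            for t in range(len(series) - k + 1)]

def conv2d_valid(matrix, kernel2d):
    """2D cross-correlation without padding."""
    kh, kw = len(kernel2d), len(kernel2d[0])
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(matrix[r + i][c + j] * kernel2d[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(cols - kw + 1)]
            for r in range(rows - kh + 1)]

# Three sensor axes stacked along the modality dimension (rows).
stacked = [[1.0, 2.0, 3.0, 4.0],
           [0.0, 1.0, 0.0, 1.0],
           [2.0, 2.0, 2.0, 2.0]]
shared = [1.0, 0.0, -1.0]                        # one filter shared by all axes

via_2d = conv2d_valid(stacked, [shared])         # 2D kernel of height 1
via_1d = [conv1d_valid(row, shared) for row in stacked]
# Both routes give the same feature map.
```

Only one filter is trained regardless of the number of axes, which is why SFF has fewer parameters than AF.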
Classifier Ensemble. In addition to fusing features before inference, the integration of multiple
modalities can also be done by blending the recognition results from each modality. A
range of ensemble approaches has been developed for fusing recognition results to yield an overall
inference. For example, Guo et al. [51] proposed to use MLPs to create a base classifier for each
sensing modality and to combine all classifiers by assigning ensemble weights at the classifier
level. When building the base classifiers, the authors not only considered the recognition accuracy
but also emphasized the diversity of the base classifiers by introducing diversity measures. Thus, the
diversity of different modalities is preserved, which is critical to overcoming overfitting and
to improving the overall generalization ability. Besides the conventional classifier ensemble, Khan
et al. [69] targeted the fall detection problem and introduced an ensemble of the reconstruction
errors from the autoencoders of each sensing modality.
The most attractive benefit of the classifier ensemble method is its scalability to additional
sensors. A well-developed model of a specific sensing modality can be easily merged into an existing
system by configuring the ensemble part only. Conversely, when a sensor is removed from a system,
the recognition model can be readily adapted to this hardware change. Nevertheless, an intrinsic
shortcoming of ensemble fusion is that the inter-modality correlations may be underestimated
due to the late fusion stage.
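A weighted ensemble of per-modality predictions, and the ease of dropping a sensor, can be sketched as below. The weights and probabilities are invented for illustration, and weighted averaging is only one of several possible ensemble rules; it is not claimed to be the scheme of Reference [51].

```python
def ensemble_predict(prob_by_modality, weights):
    """Weighted average of per-modality class probabilities.

    Only modalities present in `prob_by_modality` are used, and their
    weights are renormalized, so a removed sensor needs no retraining.
    """
    active = [m for m in prob_by_modality if m in weights]
    total = sum(weights[m] for m in active)
    num_classes = len(prob_by_modality[active[0]])
    fused = [sum(weights[m] * prob_by_modality[m][c] for m in active) / total
             for c in range(num_classes)]
    return fused.index(max(fused)), fused

weights = {"acc": 0.5, "gyr": 0.3, "mag": 0.2}   # hypothetical ensemble weights
probs = {"acc": [0.6, 0.4], "gyr": [0.2, 0.8], "mag": [0.9, 0.1]}

label_all, fused_all = ensemble_predict(probs, weights)

# Simulate removing the magnetometer from the system.
probs_without_mag = {m: p for m, p in probs.items() if m != "mag"}
label_without_mag, fused_without_mag = ensemble_predict(probs_without_mag, weights)
```

In this toy case the decision flips from class 0 to class 1 once the magnetometer is removed, which shows how strongly a single modality can influence the fused result.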
3.1.3 Statistical Feature Extraction. Different from deep learning-based feature extraction,
feature engineering-based methods are able to extract meaningful features, such as statistical
information. However, domain knowledge is usually required to design such features manually.
In Reference [110], a kernel embedding-based solution is proposed to extract all statistical
information of the activity data. However, spatial and temporal information is not considered
in their model. Recently, Qian et al. [111] developed a Distribution-Embedded Deep
Neural Network (DDNN) to integrate the statistical features with spatial and temporal
information in an end-to-end deep learning framework for activity recognition. It encodes the idea of
kernel embedding of distributions into a deep architecture, such that all orders of statistical
moments can be extracted as features to represent each segment of sensor readings and further
combined with conventional spatial and temporal deep features for activity classification in an
end-to-end training manner. The authors utilized an autoencoder to guarantee the injectivity of
the feature mapping. They also introduced an extra loss function based on the maximum mean
discrepancy (MMD) to force the autoencoder to learn good feature representations of the inputs.
Extensive experiments on four datasets demonstrated the effectiveness of the statistical feature
extraction methods. Although extracting statistical features has been explored in a deep-learning-based
way, reasonable and meaningful explanations of the extracted features are still lacking.
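To make the MMD idea concrete, the following sketch computes the standard biased estimate of the squared MMD between two one-dimensional samples with a Gaussian (RBF) kernel. It illustrates the general quantity only and is not the loss function of Reference [111]; the kernel choice, the bandwidth parameter gamma, and the sample values are assumptions.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (a - b) ** 2)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two 1D samples."""
    def mean_kernel(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

segment_a = [0.10, 0.20, 0.15, 0.25]      # hypothetical sensor readings
segment_near = [0.12, 0.22, 0.18, 0.30]   # similar distribution
segment_far = [2.10, 2.20, 2.15, 2.25]    # clearly shifted distribution

same = mmd_squared(segment_a, segment_a)       # zero for identical samples
near = mmd_squared(segment_a, segment_near)    # small
far = mmd_squared(segment_a, segment_far)      # large
```

Because the kernel mean embedding captures all moments of a distribution, a small MMD indicates that two segments agree in more than just mean and variance.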
The technologies for feature extraction have their strengths and weaknesses. A summary of the
advantages and limitations of different technologies is presented in Table 2.

3.2 Annotation Scarcity

Section 3.1 surveys recent supervised deep learning methods for extracting distinguishable
features from sensory data. One main characteristic of supervised learning methods is the need
for a large amount of labeled data to train discriminative models. However, some readings may be
missing due to hardware issues, which makes the sensor data temporally sparse and requires a specific
neural network structure to resolve [2]. Furthermore, assigning labels to a large amount of data is
even more challenging. First, the annotation process is expensive, time-consuming, and very
tedious. Second, labels are subject to various sources of noise, such as sensor noise, segmentation
issues, and the variation of activities across different people, which makes the annotation process
error-prone. Therefore, researchers have begun to investigate unsupervised learning and
semi-supervised learning approaches to reduce the dependence on massive annotated data.

Table 2. Advantages and Limitations of Different Works for Feature Extraction Approaches

Feature extraction | Approach | References | Advantages | Limitations
-------------------------------------------------------------------------------------------------
Temporal feature | mean/variance | [142] | simple | coarse; unsatisfactory performance
Temporal feature | time-frequency | [42, 65, 74] | capture frequency features | experience dependent
Temporal feature | temporal CNN | [39, 46, 53, 76, 122, 123, 129, 149, 153, 13] | capture local temporal features | limited in extracting global temporal features
Temporal feature | RNN | [32, 45, 48, 61, 158] | capture global temporal features | pre-slicing required
Temporal feature | deep hybrid | [50, 101, 151, 159, 180, 147] | capture local and global temporal features | complex structure; high computation cost
Multimodal feature | early fusion | [47, 52, 54, 65, 76, 80, 150] | simple | coarse; unstable performance
Multimodal feature | sensor-based fusion | [35, 53, 65, 114, 158] | capture sensor variance; hierarchical features | limited in capturing intra-sensor variance
Multimodal feature | axis-based fusion | [35, 53, 163, 177] | capture axis variance; hierarchical features | complex structure; high computation cost
Multimodal feature | shared-filter fusion | [39, 153, 161] | relatively simple; hierarchical features | limited in handling complex axis diversity
Multimodal feature | classifier ensemble | [51, 69] | high scalability | non end-to-end manner; complex structure and training
Statistical feature | - | [111] | good interpretability | domain knowledge required

3.2.1 Unsupervised Learning. Unsupervised learning is mainly used for exploratory data analysis
to discover patterns among data. In Reference [77], the authors examined the feasibility of
incorporating unsupervised learning methods in activity recognition, but the activity recognition
community still needs more effective methods to deal with high-dimensional and heterogeneous
sensory data.
Recently, deep generative models including Deep Belief Networks (DBNs) and autoencoders
have become dominant for unsupervised learning. DBNs and autoencoders are composed of multiple
layers of hidden units. They are useful in extracting features and finding patterns in massive
data. In addition, deep generative models are more robust against overfitting than
discriminative models [93]. Researchers therefore tend to use them for feature extraction to exploit
unlabeled data, as it is easy and cheap to collect unlabeled activity datasets. According to Erhan et al.
[41], generative pretraining of a deep model guides the discriminative training to
better generalization solutions. Pretraining a deep network on large-scale unlabeled datasets in an
unsupervised fashion has thus become very common. The whole recognition process can be divided
into two parts. First, the input data are fed to feature extractors, which are usually deep generative
models, for pretraining to extract features. Second, a top layer or another classifier is added and then
trained with labeled data in a supervised fashion for classification. During the supervised training,
the weights in the feature extractor may be fine-tuned. For example, DBN-based activity recognition
models are implemented in Reference [7]. The unsupervised pretraining is followed by fine-tuning
the learned weights in an up-down manner with available labeled samples. In Reference [55], the
same pretraining process is conducted, but Restricted Boltzmann Machines (RBMs) are applied
to learn a generative model of the input features. In another work [107], Plötz et al. proposed
to use autoencoders for unsupervised feature learning as an alternative to Principal Component
Analysis (PCA) for activity recognition in ubiquitous computing. The authors of References
[34, 47, 164] employed variants of autoencoders, namely stacked autoencoders [34], stacked
denoising autoencoders [47], and CNN autoencoders [164], to combine automatic feature learning
and dimensionality reduction in one integrated neural network for activity recognition. In a recent
work [14], Bai et al. proposed a method called Motion2Vector to convert a time period of activity
data into a movement vector embedding within a multidimensional space. To fit the context
of activity recognition, they used a bidirectional LSTM to encode the input blocks of the temporal
wrist-sensing data.
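The two-stage pipeline (unsupervised pretraining of a feature extractor, then a supervised classifier trained on a few labels) can be sketched with a deliberately tiny model. The feature extractor below is a one-unit linear autoencoder with tied weights, and the classifier is a nearest-centroid rule; both are stand-ins chosen for brevity rather than the DBNs or deep autoencoders used in the cited works, and the synthetic data and labels are hypothetical.

```python
import random

def pretrain_tied_autoencoder(data, epochs=300, lr=0.02):
    """Fit a one-unit linear autoencoder (encode h = w.x, decode h*w) without labels."""
    dim = len(data[0])
    w = [0.1] * dim
    losses = []
    for _ in range(epochs):
        grad = [0.0] * dim
        loss = 0.0
        for x in data:
            h = sum(wi * xi for wi, xi in zip(w, x))
            r = [xi - h * wi for xi, wi in zip(x, w)]   # reconstruction residual
            rw = sum(ri * wi for ri, wi in zip(r, w))
            loss += sum(ri * ri for ri in r)
            for k in range(dim):
                grad[k] += -2.0 * (rw * x[k] + h * r[k])
        losses.append(loss / len(data))
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w, losses

def encode(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_centroids(w, labeled):
    """Supervised stage: one centroid per class in the learned 1D feature space."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + encode(w, x)
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(w, centroids, x):
    h = encode(w, x)
    return min(centroids, key=lambda y: abs(centroids[y] - h))

rng = random.Random(0)
unlabeled = []
for _ in range(40):                       # plentiful unlabeled windows
    s = rng.choice([-1.0, 1.0])
    unlabeled.append([s + rng.gauss(0, 0.1), s + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

w, losses = pretrain_tied_autoencoder(unlabeled)            # stage 1: no labels
few_labels = [([1.0, 1.0, 0.0], "walk"), ([-1.0, -1.0, 0.0], "sit")]
centroids = fit_centroids(w, few_labels)                     # stage 2: two labels
guess = predict(w, centroids, [0.9, 1.1, 0.0])
```

This sketch keeps the pretrained weights frozen; fine-tuning them during the supervised stage, as mentioned above, is omitted.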
Despite the success of deep generative models in unsupervised learning for human activity
recognition, unsupervised learning still cannot undertake the activity recognition tasks indepen-
dently, since unsupervised learning is not capable of identifying the true labels of activities without
any labeled samples presenting the ground truth. Therefore, the aforementioned methods can be
considered as semi-supervised learning, in which both labeled data and unlabeled data are lever-
aged for training the neural networks.

3.2.2 Semi-supervised Learning. Semi-supervised learning has shown a growing trend in activity
recognition because of the difficulty in obtaining labeled data [156]. A semi-supervised method
requires only a small amount of labeled data together with massive unlabeled data for training. How to
utilize unlabeled data to reinforce the recognition system has become a point of interest. Some works
have explored applying classic semi-supervised learning methods, such as manifold learning, to
activity recognition [86, 112]. Recently, as deep learning is powerful in capturing patterns from data,
various semi-supervised learning schemes have been incorporated for activity recognition, such as
co-training, active learning, and data augmentation.
Co-training was proposed by Blum and Mitchell in 1998 [20] as an extension of self-learning.
In self-learning approaches, a weak classifier is first trained with a small amount of
labeled data. This classifier is then used to classify the unlabeled samples. The samples classified
with high confidence are labeled and added to the labeled set for re-training the classifier. In
co-training, multiple classifiers are employed, each of which is trained with one individual view of
the training data. Likewise, the classifiers select unlabeled samples to add to the labeled set by
confidence score or majority voting. The whole process of co-training can be seen in Figure 5(a). With
the training set augmented, the classifiers are enhanced. Blum and Mitchell [20] suggested that
co-training is fully effective under three conditions: (a) multiple views of training data are not
strongly correlated, (b) each view contains sufficient information for learning a weak classifier,
(c) the views are mutually redundant. Co-training is well suited to sensor-based human activity
recognition, because multiple modalities can be regarded as multiple views. Chen et al. [29] applied
co-training with multiple classifiers on different modalities of the data. Three classifiers are trained
on acceleration, angular velocity, and magnetism, respectively. The learned classifiers are used for
predicting the unlabeled data after each training round. If most of the classifiers agree on the
prediction for an unlabeled sample, then this sample is labeled and moved to the labeled set
for the next training round. The training flow is repeated until no confident samples can be
labeled or the unlabeled set is empty. Then a new classifier is trained on the final labeled set with
all modalities.

Fig. 5. Co-training and active learning for Annotation Scarcity.
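The co-training loop with one classifier per modality can be sketched as follows. Each sample is reduced to one scalar per view and each view classifier is a nearest-centroid rule; these are simplifications chosen for illustration and not the models of Reference [29]. The `min_votes` parameter is hypothetical: a value of 2 corresponds to a majority of three classifiers, while the first call uses 3 (unanimity) so that one ambiguous sample remains unlabeled.

```python
from collections import Counter

def fit_centroids(samples, labels, view):
    """Per-view classifier: class centroids of one feature (one modality)."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x[view]
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, value):
    return min(centroids, key=lambda y: abs(centroids[y] - value))

def co_train(labeled, labels, unlabeled, num_views, min_votes):
    """Move unlabeled samples to the labeled set when enough views agree."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    while pool:
        models = [fit_centroids(labeled, labels, v) for v in range(num_views)]
        remaining, added = [], 0
        for x in pool:
            votes = Counter(predict(m, x[v]) for v, m in enumerate(models))
            label, count = votes.most_common(1)[0]
            if count >= min_votes:           # enough classifiers agree
                labeled.append(x)
                labels.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:                       # no confident sample left
            break
    return labeled, labels, pool

# Views: (acceleration, angular velocity, magnetism), one toy scalar each.
seed_x = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
seed_y = ["sit", "walk"]
unlabeled = [(0.1, 0.2, 0.1), (0.9, 0.8, 0.95), (0.2, 0.9, 0.4)]

x_strict, y_strict, left_strict = co_train(seed_x, seed_y, unlabeled, 3, min_votes=3)
x_major, y_major, left_major = co_train(seed_x, seed_y, unlabeled, 3, min_votes=2)
```

The final step described above, training a new classifier on the enlarged labeled set with all modalities, is not shown.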
Co-training is like human learning. People can learn new knowledge from existing experience,
and new knowledge can be used to summarize and accumulate experience. Experience and knowl-
edge constantly interact with each other. Similarly, co-training uses current models to select new
samples that they can learn from, and the samples help to train the models for the next selection.
However, automatic labeling may introduce errors. Acquiring correct labels can improve accuracy.
Active learning is another category of semi-supervised learning. Different from self-learning
and co-training, which label the unlabeled samples automatically, active learning requires
annotators, who are usually experts or users, to label the data manually. To lighten the burden of
labeling, the goal of active learning is to select the most informative unlabeled instances for
annotators to label and to improve the classifiers with these data, so that minimal human supervision
is needed. Here, the most informative instances are those that would have the greatest impact on the
model if their labels were available. A general framework of active learning can be seen in Figure 5(b).
It includes a classifier, a query strategy, and an annotator. The classifier learns from a small amount
of labeled data, selects one or a set of the most useful unlabeled samples via the query strategy, asks
the annotator for the true labels, and utilizes the new labels for further training and the next query.
The active learning process is also a loop, which stops when a stopping criterion is met. There are two
common criteria for selecting the most profitable samples: uncertainty and diversity.
Uncertainty can be measured by information entropy. Larger entropy means higher uncertainty
and greater informativeness. Diversity means that the queried samples should be comprehensive,
and that the information they provide is non-repetitive and non-redundant. In Reference [133],
the authors applied two query strategies. One of them selects the samples with the lowest prediction
confidence, and the other one borrows the idea of co-training but, conversely, selects samples
with high disagreement among classifiers.
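An entropy-based uncertainty query can be sketched in a few lines. The predicted distributions are invented, and `budget` (the number of samples sent to the annotator per round) is a hypothetical parameter.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def query_most_uncertain(pool_probs, budget):
    """Return indices of the `budget` pool samples with the largest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:budget]

pool_probs = [[0.98, 0.01, 0.01],   # confident prediction
              [0.34, 0.33, 0.33],   # nearly uniform: most uncertain
              [0.70, 0.20, 0.10]]
to_label = query_most_uncertain(pool_probs, budget=2)   # [1, 2]
```

A diversity criterion would additionally discourage selecting samples that are close to one another; it is not covered by this sketch.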
Deep active learning approaches have been deployed in activity recognition [57, 58]. Hossain et al. [57]
observed that traditional active learning methods merely choose the most informative samples,
which occupy only a small fraction of the available data. In this way, a large number of samples are
discarded. Although the selected samples are vital for training, the discarded samples are also of
value on account of their substantial amount. Therefore, they proposed a new method to combine
active learning and deep learning in which not only the most informative unlabeled samples are
queried but the less informative samples are also leveraged. The data are first clustered with K-means
clustering. While the intuitive idea is to query the optimal samples such as the centroids of the
clusters, in this work, the neighboring samples are also queried. The experiments show that the
proposed method can achieve the optimal results by labeling 10% of the data.
Hossain and Roy [58] further investigated two problems of deep active learning in human
activity recognition. The first problem is that outliers can be easily mistaken for important samples.
When entropy is calculated for selection, larger entropy may indicate not only informativeness but
also outliers, because outliers belong to none of the classes. Therefore, a joint loss function was
proposed in Reference [58] to address this problem. Cross-entropy loss and information loss are
jointly minimized to reduce the entropy of outliers. The second problem considered in this work is
how to reduce the workload of annotators, as annotators are required to master domain knowledge
to provide accurate labels. Multiple annotators are employed in this work. They are selected from
people close to the users. The annotator selection is made by a reinforcement learning algorithm
according to the discrepancy and the relations of users. Contextual similarity is used to measure
the relations among users and annotators. The experimental results show that this work achieves an
8% improvement in accuracy and a higher convergence rate.
Co-training and active learning are based on the same idea of rebuilding the model upon newly
acquired labels of unlabeled data. Data augmentation by synthesizing new activity data is another
option when data collection is challenging, such as in resource-limited or high-risk scenarios.
Data augmentation by synthesis means generating massive fake data from a small
amount of real data so that the fake data can help to train the models. One popular tool is the
Generative Adversarial Network (GAN). GAN was first introduced in Reference [44]. GAN is powerful
in synthesizing data that follow the distribution of the training data. A GAN is composed of two parts,
a generator and a discriminator. The generator creates synthetic data and the discriminator
evaluates them for authenticity. The goal of the generator is to generate data that are genuine enough
to fool the discriminator, while the goal of the discriminator is to identify the data generated by
the generator as fake. The training proceeds in an adversarial way, which is based on a min-max
game. During training, the generator and the discriminator mutually improve their performance in
generation and discrimination. Variants of GANs have been applied to different fields, such as
language generation [109] and image generation [179].
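For reference, the min-max game between the two networks is commonly written as the following value function, in the form introduced in Reference [44]. The symbols are introduced here for illustration: D(x) is the discriminator's estimate that x is real, G(z) is the generator output for a noise input z, and p_data and p_z denote the data and noise distributions.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes V by scoring real data high and generated data low, while the generator minimizes V by producing samples that the discriminator scores as real.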
The first work on data augmentation by synthesizing sensory data for activity recognition
is SensoryGANs [144]. As sensory data are heterogeneous, a unified GAN may not
be enough to depict the complex distribution of different activities. Wang et al. therefore employed
three activity-specific GANs for three activities. After generation, the synthetic data are fed into
classifiers together with the original data for prediction. We should note that although this work uses
deep generative networks, the generation process depends on labels, so the process is not unsupervised.
Zhang et al. [174] proposed to use a semi-supervised GAN for activity recognition. Different from a
regular GAN, the discriminator in a semi-supervised GAN performs a (K + 1)-class classification that
includes activity classification and fake data identification. To make the distribution of the generated
data approach the authentic distribution, a prearranged distribution is provided as input by Variational
AutoEncoders (VAEs) instead of Gaussian noise. The aim of the VAEs is to provide distributions that
represent the distributions of the input data. Moreover, VAE++ was proposed to guarantee that the
inputs are exclusive for each training sample. Overall, the unified framework combining VAE++
and semi-supervised GAN proves to be effective in activity recognition.
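How a (K + 1)-way discriminator output can be read is sketched below: the first K units score the activity classes and the extra unit scores the "fake" class. The logits, the activity names, and the convention that the fake class is the last unit are assumptions for illustration, not details taken from Reference [174].

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret_discriminator(logits, activities):
    """Split K + 1 outputs into an activity decision and a fake probability."""
    assert len(logits) == len(activities) + 1
    probs = softmax(logits)
    p_fake = probs[-1]                       # last unit is the extra 'fake' class
    real_probs = probs[:-1]
    best = max(range(len(activities)), key=lambda k: real_probs[k])
    return activities[best], p_fake

activities = ["walk", "sit", "run"]          # K = 3 hypothetical classes
label, p_fake = interpret_discriminator([2.0, 0.5, 0.1, -1.0], activities)
```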
Table 3 summarizes recent deep learning works for annotation scarcity in activity recognition
and their advantages and disadvantages.

3.3 Class Imbalance

The primary contributor to the success of deep learning techniques is the availability of a large
volume of training data due to modern information technology. Most existing research on human
activity recognition follows a supervised learning manner, which requires a significant amount
of labeled data to train a deep model. However, some sensor data of specific activities are
challenging to obtain, such as those related to falls of elderly people. In addition, raw data recorded
under unconstrained conditions are naturally class-imbalanced. When trained on an imbalanced dataset,
conventional models tend to predict the class with the majority of training samples while
ignoring the classes with few available training samples. Therefore, it is urgent to address the
class imbalance issue to develop an effective activity recognition model. Methods of dealing
with class imbalance can be divided into two groups.

Table 3. Advantages and Limitations of Different Works for Annotation Scarcity

Training scheme | Approach | References | Advantages | Limitations
-------------------------------------------------------------------------------------------------
Unsupervised | pretraining | [7, 14, 34, 47, 55, 107, 164] | feature learning without labels | rely on ground truth for training activity classifiers
Semi-supervised | co-training | [29] | use both labeled and unlabeled data; assign labels to unlabeled data automatically | at least two data modalities required; need training multiple classifiers each iteration
Semi-supervised | active learning | [57, 58] | high labeling efficiency and accuracy | human labeling required
Semi-supervised | data augmentation | [144, 174] | enhance model generalization | make less use of unlabeled data

3.3.1 Data Level. The most intuitive way to tackle the imbalance problem is to re-sample, that is,
down-sample, the class with the largest number of samples [5]. However, such a method risks reducing the
total amount of training samples and omitting some critical samples with distinctive characteristics.
In contrast, adding augmented samples to the class with the fewest samples not
only keeps all original samples but also enhances the models’ robustness. Grzeszick et al. [46] utilized
two augmentation methods, Gaussian noise perturbation and interpolation, to tackle the problem
of class imbalance. The augmentation approaches preserve the coarse structure of the data
while simulating a random time jitter in the sensor’s sampling process. They created a larger number
of samples for the under-represented classes and ensured that each class has at least a certain
percentage of data in the training set.
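The augmentation-based balancing described above can be sketched as follows. The concrete forms of the two augmentations (additive Gaussian noise and midpoint interpolation that shifts a window by half a sample) are assumptions made for illustration and may differ from those in Reference [46]; `min_fraction`, `sigma`, and the toy data are hypothetical.

```python
import random

def jitter(window, sigma, rng):
    """Perturb every reading with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in window]

def interpolate(window):
    """Shift the window by half a sample using midpoints of neighboring readings."""
    return [(a + b) / 2.0 for a, b in zip(window, window[1:])] + [window[-1]]

def oversample_minority(windows, labels, min_fraction, rng, sigma=0.05):
    """Add augmented copies until each class reaches about `min_fraction` of the data."""
    windows, labels = list(windows), list(labels)
    for cls in sorted(set(labels)):
        originals = [w for w, y in zip(windows, labels) if y == cls]
        while labels.count(cls) / len(labels) < min_fraction:
            source = rng.choice(originals)
            if rng.random() < 0.5:
                augmented = jitter(source, sigma, rng)
            else:
                augmented = interpolate(source)
            windows.append(augmented)
            labels.append(cls)
    return windows, labels

rng = random.Random(0)
windows = [[0.0, 0.1, 0.2, 0.3]] * 8 + [[1.0, 0.2, 0.9, 0.1]] * 2
labels = ["walk"] * 8 + ["fall"] * 2        # "fall" is under-represented

balanced_x, balanced_y = oversample_minority(windows, labels, 0.3, rng)
```

Classes are processed one after another, so the final fractions are approximate when several classes need augmentation.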

3.3.2 Algorithmic Level. Another direction for solving the imbalance concern is to modify the
model-building strategy instead of directly balancing the training dataset. In Reference [48], Guan
and Plötz utilized the F1-score rather than the conventional cross-entropy as the loss function to
address the imbalance problem. Because the F1-score considers both recall and precision,
classes with different numbers of training samples are equally taken into account. Beyond
the class imbalance of original datasets, imbalance is also a non-negligible problem for a
semi-supervised framework, as the process of gradually labeling unlabeled samples may create uneven
numbers of new labels across different classes. Chen et al. [29] addressed class imbalance in small
labeled datasets. They leveraged a semi-supervised framework, co-training, to enrich the labeled set in
cyclic training rounds. To balance the training samples across classes while simultaneously
maintaining the distributions of the samples, a pattern-preserving strategy was proposed before the
training phase of the co-training framework. K-means clustering was first adopted to mine the latent
activity patterns of each activity. Then, sampling is applied to each pattern. The main goal
is to guarantee that all patterns of all activities contain the same number of samples. A summary of the
advantages and limitations of different works for resolving class imbalance is presented in Table 4.

Table 4. Advantages and Limitations of Different Works for Class Imbalance

Balancing scheme | Approach | References | Advantages | Limitations
-------------------------------------------------------------------------------------------------
Data level | re-sampling | [5] | simple balancing process; free of noises | decrease the amount of sample; may miss featured samples
Data level | augmentation | [46] | enhance model robustness; keep all recording samples | may induce unexpected noises
Algorithmic level | - | [29, 48] | free of data preprocess; keep all recording samples | not generic; careful parameter tuning required
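The effect of an F1-based loss on imbalanced data, as mentioned in Section 3.3.2, can be illustrated with a "soft" F1 computed from predicted class probabilities. The macro-averaging over classes and the smoothing constant `eps` are assumptions of this sketch, which is not the exact loss of Reference [48].

```python
def soft_f1_loss(probs, targets, eps=1e-8):
    """1 - macro-averaged soft F1, computed from class probabilities."""
    num_classes = len(probs[0])
    f1_sum = 0.0
    for c in range(num_classes):
        # Soft counts: probabilities replace hard 0/1 decisions.
        tp = sum(p[c] for p, y in zip(probs, targets) if y == c)
        fp = sum(p[c] for p, y in zip(probs, targets) if y != c)
        fn = sum(1.0 - p[c] for p, y in zip(probs, targets) if y == c)
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes

# Imbalanced toy set: nine samples of class 0, one sample of class 1.
targets = [0] * 9 + [1]

# A model that always predicts the majority class is 90% accurate ...
majority_probs = [[1.0, 0.0]] * 10
loss_majority = soft_f1_loss(majority_probs, targets)

# ... whereas a perfect model drives the loss to (almost) zero.
perfect_probs = [[1.0, 0.0]] * 9 + [[0.0, 1.0]]
loss_perfect = soft_f1_loss(perfect_probs, targets)
```

The majority-only model is penalized heavily despite its high accuracy, because the minority class contributes an F1 of zero to the average.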

3.4 Distribution Discrepancy

Many state-of-the-art approaches for human activity recognition assume that the training data and
the test data are independent and identically distributed (i.i.d.). However, this assumption is
impractical, since there is a distribution discrepancy between training data and test data in activity
recognition. The distribution discrepancy in sensory data can be divided into three categories by cause.
The first is the discrepancy between users, which stems from different motion patterns when
activities are performed by different people. The second is the discrepancy over time. In a dynamic
streaming environment, the data distributions of activities change over time, and new activities
may also emerge. The third is the discrepancy between sensors. Sensors used for human activity
recognition are usually sensitive. A small variation in sensors can cause a significant disturbance
in the sensory data. The factors that may bring about discrepancy between sensors include
sensor instances, types, positions, and layouts in the environment. We can also categorize
the discrepancy into homogeneous discrepancy and heterogeneous discrepancy by nature [36].
In homogeneous discrepancy, training data and test data have the same attributes and the same
feature spaces. In heterogeneous discrepancy, the feature spaces of training data and test data may
differ in dimensions or attributes. Typically, the discrepancy among users and over time belongs to
homogeneous discrepancy, while the discrepancy in the number of sensor instances, sensor types,
and sensor layouts is heterogeneous, as these factors may change the attributes and dimensions.
The following sections summarize the literature by cause (i.e., users, time, and sensors),
but the perspective of homogeneous and heterogeneous discrepancy is also inspiring.
Before taking a closer look at the factors that cause distribution discrepancy in sensory data, we
briefly introduce transfer learning [102]. Transfer learning is a common machine learning technique
that transfers the classification ability of a learning model from one predefined setting to
a dynamic setting. Transfer learning is particularly effective in solving distribution discrepancy
problems. It avoids the decline in the performance of learning models when the training data and
the test data follow different distributions. In the activity recognition context, this problem appears
when activity recognition models are deployed in a configuration different from the one in
which they were trained. In transfer learning, the source domain refers to a domain that contains massive
annotated data and knowledge, and the goal is to leverage the information from the source domain
to annotate the samples in the target domain. Regarding activity recognition, the source domain
corresponds to the original configuration, and the target domain denotes the new deployment that
the system has never encountered (e.g., new activities, new users, new sensors). In the following
sections, we introduce the three categories of discrepancy in detail and how the state-of-the-art
approaches manage to mitigate the discrepancy. Most of them are based on transfer learning.
3.4.1 Distribution Discrepancy with Users. Owing to biological and environmental factors, the
same activity can be performed differently by different individuals. For example, some people walk
slowly, while others prefer to walk faster and more dynamically. Since people have diverse behavior
patterns, data from different users follow different distributions. Usually, if a model is trained and
tested with data collected from one specific user, the accuracy can be rather high. However,
this setting is impractical. In practical human activity recognition scenarios, while data from a
certain number of participants can be collected and annotated for training, the target users are usually
unseen by the system. The distribution divergence between the training data and the test data
therefore becomes a challenge in human activity recognition, and the performance of the models falls
dramatically across users. Research on personalized models for specific users is thus important.
Recently, personalized deep learning models for distribution discrepancy among users in activity
recognition have been explored. Woo et al. [148] proposed an approach to build an RNN model
for each individual. Learning Hidden Unit Contributions (LHUC) was applied in Reference
[90], where a special layer with few parameters is inserted between every two hidden layers of
a CNN, and these parameters are trained using a small amount of data. Rokni et al. [121] proposed to
personalize their models with transfer learning. In the training phase, a CNN is first trained with
data collected from a few participants (source domain). In the test phase, only the top layers of
the CNN are fine-tuned with a small amount of data from the target users (target domain), so
annotation for target users is required. GANs are also useful for addressing distribution discrepancy
among users. In Reference [132], the authors generated data of the target domain directly from
the source domain with GANs to enhance the training of the classifier. Chen et al. [27] further
defined person-specific discrepancy and task-specific consistency for people-centric sensing ap-
plications. Person-specific discrepancy means the distribution divergence of data collected from
different people, and task-specific consistency denotes the inherent similarity of the same activity.
They proved that reducing person-specific discrepancy and preserving task-specific consistency
guarantee the recognition accuracy after transferring. Reference [30] combines activity recogni-
tion and user recognition with a multi-task model. The proposed method shares parameters be-
tween the activity module and the user module so the activity recognition performance can be
boosted by features learned from the user recognition module. To transfer important knowledge
between the two modules, a mutual attention mechanism is deployed.
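The top-layer fine-tuning idea described above for Reference [121] can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `frozen_features` is a hypothetical stand-in for the lower CNN layers trained on source users, the top layer is reduced to a single logistic unit, and the data values are made up.

```python
import math

def frozen_features(x):
    """Hypothetical stand-in for the lower layers trained on source users (never updated)."""
    return [x[0] + x[1], x[0] - x[1]]

def predict(weights, bias, feats):
    """Top layer: a single logistic unit on top of the frozen features."""
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, data):
    """Mean binary cross-entropy over (input, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, frozen_features(x))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fine_tune_top_layer(weights, bias, target_data, lr=0.1, epochs=200):
    """Gradient descent on the top layer only, using a few labeled target samples."""
    w, b = list(weights), bias
    for _ in range(epochs):
        for x, y in target_data:
            feats = frozen_features(x)
            err = predict(w, b, feats) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
            b -= lr * err
    return w, b

# Top-layer parameters learned on source users (made-up values)
source_weights, source_bias = [1.0, 0.0], 0.0
# A handful of annotated windows from the target user (made-up values)
target_data = [([0.2, 1.0], 0), ([0.1, 0.9], 0), ([1.0, 0.1], 1), ([0.9, 0.2], 1)]

before = log_loss(source_weights, source_bias, target_data)
new_w, new_b = fine_tune_top_layer(source_weights, source_bias, target_data)
after = log_loss(new_w, new_b, target_data)
```

Only the top-layer parameters change, which is why a small amount of annotated target data can suffice; the cost is that every new user still needs some labels.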
3.4.2 Distribution Discrepancy with Time. Human activity recognition systems collect dynamic
and streaming data that log people’s motions. In a real-world recognition system, the initial training
data that portrays a set of activities is collected to train an original model, and the model is then
deployed for future activity recognition. In long-term systems that run for months or
even years, a natural concern is that the streaming sensory data changes
over time. Three problems arise from the distribution discrepancy with time, depending on
the extent of the change and on how precisely the new concepts in the data must be recognized. They are
the concept drift problem, the concept evolution problem, and the open-set problem.
Concept Drift. Figure 6(a) shows the first problem of distribution discrepancy with time in ac-
tivity recognition called concept drift [127]. It denotes the distribution shift between the source
domain and the target domain. Concept drift can be abrupt or gradual [1]. To accommodate the
drift, deep learning models should incorporate incremental training to continuously learn new
concepts of human activities from newly arriving data. For example, an ensemble classifier termed
multi-column bi-directional LSTM was proposed in Reference [136]. The model leverages new


Fig. 6. Distribution discrepancy with time.

training samples gradually via incremental learning. Active learning is a special type of incremental
learning. In streaming data systems, active learning queries the ground truth for samples when
a change is detected, and it favors the most informative samples for updating the models to the
new concepts. Active learning can therefore help deep learning models mitigate the discrepancy
over time in streaming sensory data [49, 126]. For example, Gudur et al. [49] proposed
a deep Bayesian CNN with dropout to estimate the uncertainty of the model and to select the most
informative data points to be queried according to an uncertainty-based query strategy. Owing to
active learning, the model can be updated continuously and capture the changes in the data over
time.
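An uncertainty-based query strategy of the kind mentioned above can be sketched as follows. The sketch assumes that several stochastic forward passes (e.g., with dropout kept active) have already produced class probabilities for each unlabeled window, and it uses predictive entropy as the acquisition score. Both the entropy criterion and all names and numbers here are illustrative assumptions, not details taken from Reference [49].

```python
import math

def predictive_entropy(mc_probs):
    """Entropy of the mean class distribution over several stochastic forward passes."""
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    return -sum(m * math.log(m) for m in mean if m > 0.0)

def select_queries(pool, budget):
    """Return the ids of the `budget` most uncertain samples in the unlabeled pool."""
    ranked = sorted(pool, key=lambda sid: predictive_entropy(pool[sid]), reverse=True)
    return ranked[:budget]

# Hypothetical pool: window id -> class probabilities from two stochastic passes
pool = {
    "window_a": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "window_b": [[0.40, 0.35, 0.25], [0.30, 0.40, 0.30]],
    "window_c": [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],
}
queries = select_queries(pool, budget=2)
print(queries)  # ['window_b', 'window_c']
```

The confidently classified `window_a` is not queried, so the annotation budget is spent on the windows where the model is least certain.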
Concept Evolution. Figure 6(b) illustrates the distribution under concept evolution, which
denotes the emergence of new activities in the streaming data. Concept evolution arises
because collecting labeled data for all kinds of activities in the initial learning phase
is impractical. First, despite the effort, the initial training set in an activity recognition system is
only able to contain a limited number of activities. Second, people can perform new activities that
they never did before the initial training of the activity recognition system (e.g., learning to play
guitar for the first time). Third, it is difficult to collect data for certain activities, such as people falling
down. However, these activities may still appear in the test or application phase. Thus, in the
application phase, the concepts of the new activities still need to be learned. It is essential to study
activity recognition systems that can recognize new activities in streaming data settings. Nevertheless,
this is difficult due to the restricted access to annotated data in the application phase.
One approach is to decompose activities into mid-level features such as arm up, arm down, leg up,
and leg down. This method requires experts to define the mid-level attributes for further training,
and its capability is limited when new activities composed of new attributes appear [97]. Other
deep learning methods for activity concept evolution remain little explored, so some researchers
take a step back and study the open-set problem.
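The mid-level decomposition idea can be illustrated with a small sketch. It assumes that attribute detectors (not shown) have already produced a binary vector over the four attributes named above, and that experts have written an attribute signature for each activity, including one with no training data. The activity names, signatures, and the Hamming-distance matching rule are hypothetical and are not taken from Reference [97].

```python
def hamming(a, b):
    """Number of positions at which two attribute vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def recognize(attribute_vector, signatures):
    """Return the activity whose expert-defined signature is closest to the detected attributes."""
    return min(signatures, key=lambda name: hamming(attribute_vector, signatures[name]))

# Attribute order: (arm_up, arm_down, leg_up, leg_down)
signatures = {
    "arm_raise": (1, 0, 0, 0),
    "squat": (0, 0, 0, 1),
    "jumping_jack": (1, 0, 1, 0),  # new activity: defined by experts, no training data
}

detected = (1, 0, 1, 0)  # output of the (assumed) attribute detectors for one window
label = recognize(detected, signatures)
print(label)  # jumping_jack
```

The sketch also makes the stated limitation visible: an activity that needs an attribute outside the four predefined ones cannot be expressed in `signatures` at all.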
Open-Set. The open-set problem is currently a trending topic. Most earlier state-of-the-art
works address “closed-set” problems, where the training set and the test set contain the
same set of activities. The open-set problem also originates from the fact that we can never collect sufficient
kinds of activities in the initial training phase. But compared with concept evolution problems,
solutions to open-set problems only need to identify whether the test samples belong to the
target activities, rather than recognize the exact activities. Figure 6(c) illustrates the distribution
of open-set problems, where the shaded area denotes the space where new activities may emerge. An
intuitive solution to open-set problems is to build a negative set so that they can be handled in a
closed-set way. A deep model based on GAN is proposed in Reference [154]. The authors generate
fake samples with GAN to construct the negative set, and the discriminator of the GAN can be
seamlessly used as the open-set classifier.


3.4.3 Distribution Discrepancy with Sensors. Due to the sensitivity of sensors, a tiny variation
in the sensors may lead to substantial changes in the data collected or transmitted by the sensors.
The influential factors of sensors include the instances, types, positions, and layouts in the envi-
ronment. To illustrate, instances of sensors may have different parameters such as the sampling
rate; different types of sensors collect totally different types of data with varying shapes, frequen-
cies, and scales; wearable sensors attached to positions of human body only record motions in the
corresponding body parts; environmental layouts of device-free sensors influence the propagation
of signals. All of these factors may cause drops in the recognition accuracy when the classifiers are
not trained for specific device deployments. Therefore, seamless deep learning models for activity
recognition in the wild are necessary. Reference [94] proves that features learned by deep learning
models are transferable across sensor types and sensor deployments for activity recognition.
Sensor Instances. Even when data is collected in the same setting and only the sensor instances
differ (for example, when a person replaces a smartphone with a new one), the recognition
accuracy still declines. Both the hardware and the software are responsible. In fact, owing to
the imperfections in the production process, sensor chips show variation in the same conditions
[37]. Also, the performance of devices differs in different software platforms [21]. For example,
APIs, resolutions, and other factors are all influential to the performance of sensors. There have
been a few works developing deep learning models to address distribution discrepancy problems
caused by different sensor instances. One notable work is data augmentation with GANs [89]. Data
augmentation enriches training sets so that both their size and their quality
meet the requirements of training a powerful deep learning model. A discrepancy generator
that synthesizes heterogeneous data from different sensor instances under various degrees of dis-
turbance is developed in Reference [89]. The aim is to replenish the training set with sufficient
discrepancy. Moreover, the authors deploy a discrepancy pipeline with two parameters that con-
trol the discrepancy of the training set.
Sensor Types and Positions. In this section, we introduce the distribution discrepancy of sensory
data caused by different sensor types and positions on human bodies, because these two factors
usually appear together. Thanks to the pervasiveness of wearable sensors and IoT equipment,
people can wear more than one smart device to assist their daily life. It is also common for
users to replace their smart devices or buy new electronic products. Since some devices are based on
the same platforms (e.g., iPhone and Apple Watch), people prefer the activity recognition system
to seamlessly recognize activities that are observed by the new device with models trained with
the old devices. In terms of positions, the devices should be attached to different body positions
according to the types. For example, a smartwatch should be attached to the user’s wrist while
a smartphone can be put in a trouser or shirt pocket. Devices at different
body positions produce markedly different signals, because the signals are
driven by the motions of the corresponding body parts. Two issues raised by
such changes therefore need to be considered to address the distribution discrepancy with
sensor types and positions. First, massive data from the new sensors or new positions is required
so that the new distribution can be estimated rather completely. Second, most of the existing works
still simply characterize the old data and the new data with the same features, which is
impractical when sensor types and positions are not fixed. For instance, KL divergence is minimized
between the parameters of CNNs that are trained on the old data and the new data, respectively, in
Reference [68]. To address this issue, Akbari and Jafari [3] designed stochastic features
that are not only discriminative for classification but also able to preserve the inherent structures
of the sensory data. The stochastic feature extraction model is based on a generative autoencoder.
Wang et al. [146] further posed a question about how to select the best source positions for
transfer when there are multiple source positions available. This question is pragmatic, since
smart devices can be placed in diverse positions such as on the wrist, in a pocket, or on the nose (e.g.,
goggles), and an inappropriate selection may lead to negative transfer. Reference [43] proves that the
similarity between domains is decisive in transfer learning. Reference [146] suggests that
higher similarity indicates better transfer performance between two domains. Therefore, Chen
et al. [31] assumed that data samples of the same activities are aggregated in the distribution
space even when they are from different sensors. They proposed a class-wise stratified distance
to measure the distances between domains. Wang et al. [146] proposed a semantic distance
and a kinetic distance to measure domain distances, where the semantic distance involves spatial
relationships between data collected from two positions and the kinetic distance concerns the
relationships of motion kinetic energy between two domains.
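Source selection by domain similarity can be sketched as follows. This is a deliberately simplified stand-in: it measures the average Euclidean distance between per-class feature means, which is neither the stratified distance of Reference [31] nor the semantic and kinetic distances of Reference [146]. It also assumes that a few labeled (or pseudo-labeled) target samples are available; all names and values are hypothetical.

```python
import math

def class_means(samples):
    """Mean feature vector per activity label; samples are (features, label) pairs."""
    sums, counts = {}, {}
    for feats, label in samples:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classwise_distance(source, target):
    """Average Euclidean distance between per-class means over the shared activities."""
    s_means, t_means = class_means(source), class_means(target)
    shared = sorted(set(s_means) & set(t_means))
    if not shared:
        raise ValueError("source and target share no activity labels")
    dists = [math.dist(s_means[c], t_means[c]) for c in shared]
    return sum(dists) / len(dists)

def pick_source(sources, target):
    """Choose the source position whose class-wise distance to the target is smallest."""
    return min(sources, key=lambda name: classwise_distance(sources[name], target))

# Hypothetical 2-D features from two candidate source positions and one target position
sources = {
    "wrist": [([1.0, 0.0], "walk"), ([0.0, 1.0], "sit")],
    "pocket": [([3.0, 0.0], "walk"), ([0.0, 3.0], "sit")],
}
target = [([1.2, 0.1], "walk"), ([0.1, 0.9], "sit")]

best = pick_source(sources, target)
print(best)  # wrist
```

Comparing class by class rather than pooling all samples reflects the assumption stated above that samples of the same activity cluster together even across sensors.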
Sensor Layouts and Environments. Sensor layouts concern device-free sensors such
as WiFi and RFID. The signals collected by the receivers are usually considerably influenced by the
layouts and the environments. The reason is that, during transmission, the signals are
inevitably reflected, refracted, and diffracted by media and barriers such as air, glass, and walls. The
spatial positions of the receivers also play a role. Despite the maturity in building classification
models for device-free activity recognition, very few works focus on how to get equally accurate
recognition performance when sensors are configured in the wild. One example is Reference [64],
where an adversarial network is incorporated with deep feature extraction models to remove the
environment-specific information and extract the environment-independent features.
It should be noted that all the aforementioned methods need either labeled or unlabeled data
from the target domain to update their models. In the real world, a one-fits-all model that requires only
one-time training and is general enough to fit all scenarios is highly desirable. Zheng et al. [178]
defined Body-coordinate Velocity Profile (BVP) to capture domain-independent features. The
features represent power distributions over different velocities of body parts and are unique to
individual activities. The experimental results show that BVP is advantageous in cross-domain
learning, and it fits all kinds of domain factors including users, sensor types, and sensor layouts.
One-fits-all is a new direction for researchers to mitigate the distribution discrepancy problem in
activity recognition.
In conclusion, we review three categories of distribution discrepancy in activity recognition.
They are caused by different users, time streaming, and sensor deployments. They are further
categorized according to the extent of change or the main reason for changes. Table 5 summarizes
the advantages and limitations of different works for resolving distribution discrepancy in activity
recognition.

3.5 Composite Activity

Despite the success of applying a variety of deep learning models to recognizing human activi-
ties, the majority of existing research focuses on simple activities such as walking, standing, and
jogging, which are usually characterized by repeated actions or single body posture. The simple
activities are basic and thus possess lower-level semantics. In contrast, more composite activities
may contain a sequence of simple actions and have higher-level semantics, e.g., working, having
dinner, and preparing coffee, which can better reflect people’s daily life. As a result, it is desirable to
recognize more complicated and high-level human activities for most practical human-computer
interaction scenarios. Since not only human body movements but also context information of sur-
rounding environments are required for composite activity recognition, it is a more challenging
task compared to recognizing simple activities. In addition, designing effective experiments for collecting
sensor data for composite activities is also challenging, as it requires rich experience
with diverse kinds of sensors and with planning human-computer interaction applications. Therefore,
composite activity recognition is far less explored than simple activity recognition.


Table 5. Advantages and Limitations of Different Works for Distribution Discrepancy

| Discrepancy Type | Approach | References | Advantages | Limitations |
|---|---|---|---|---|
| User | user-specific models | [148] | the discrepancy issue can be fully resolved | long training time and a large amount of training data required for new users |
| User | data augmentation | [132] | can be directly applied to new users | the diversity of the synthetic data is limited and not guaranteed |
| User | transfer learning | [90, 121, 27, 12, 30] | less data is required for retrain; common information of different users is preserved | retrain is required for each new user |
| Time | incremental learning | [136, 49, 126] | continuously update models to resolve the concept drift issue | few works on handling new class |
| Time | mid-level feature decompose | [97] | able to figure out the new class comprised with existing features | human efforts required to define mid-level features; unable to handle new features |
| Time | synthetic data | [154] | support open-set recognition without using real out-of-set data | out-of-set data can only be recognized as one class |
| Sensor | data augmentation | [89] | can be directly applied to new sensor deployment | the diversity of the synthetic data is limited and not guaranteed |
| Sensor | how to transfer | [68, 3] | less data is required for retrain; common information of different users is preserved | retrain is required for each new user |
| Sensor | what to transfer | [43, 31, 146] | select suitable source to transfer | only feasible when multiple sources are available |
| Sensor | domain-independent features | [178] | directly applied to new settings | only applicable to WiFi signals |

3.5.1 Unified Models. Existing studies on composite activity recognition can be categorized
into two streams. The first one mixes complex and simple activities and tries to create a unified
model to recognize both kinds of activities. In Reference [142], there are 22 simple and composite
activities grouped into four categories: (1) Locomotive (e.g., walk indoor, run indoor); (2) Semantic
(e.g., clean utensil and cooking); (3) Transitional (e.g., indoor to outdoor and walk upstairs); and
(4) Postural/relatively Stationary (e.g., standing and lying on bed). A simple multi-layer feedforward
neural network was created to recognize all the activities with a high average test accuracy
of 90%. However, the results are obtained with the subject-dependent setting, where training and
test samples are from the same subject, which limits the proposed method’s adaptability.
3.5.2 Separated Models. The second strategy is to consider composite activities separately
from simple ones and to further regard a composite activity as the combination of a series
of simple activities. This hierarchical manner is more intuitive and attracts stronger research


Table 6. Advantages and Limitations of Different Works for Composite Activity Recognition

| Treatment | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Unified | - | [142] | simple data collection settings | weak generalization ability; proper signal segmentation required |
| Separated | joint learning | [103] | simultaneously recognizing simple and composite activity; mutual performance enhancement | prior knowledge required; poor adaptability |
| Separated | action to activity | [33] | intuitive; favorable adaptability; mutual performance enhancement | complex training scheme and inference process |

interest. However, the application of deep learning techniques to this area is still underexplored. One of
the few deep learning works is Reference [103], where the authors developed a multi-task learning
approach to recognize both simple and composite activities simultaneously. Concretely, the
authors divided a composite activity into multiple simple activities that were represented by a
series of sequential sensor signal segments. The signal segments are first input into CNNs to
extract representations of low-level activities, which are then fed into a softmax classifier for
recognizing simple activities. At the same time, the CNN-extracted features of all segments are
fed into an LSTM network to exploit their correlations and consequently produce a high-level
semantic activity classification. In this way, the prior knowledge that simple activities are the
components of a composite activity is exploited by the shared deep feature extractor. Unlike this joint
learning approach, Reference [33] inferred a sequence of simple activities and its corresponding composite
activity by using two conditional probabilistic models alternately. The authors used an estimated
action sequence to infer the composite activity, where the temporal correlations of simple actions
are extracted for the composite activity classification. Conversely, the predicted composite activity
is used to help derive the simple activity sequence at the next timestep. As a result, the
predictions of the sequence of simple activities and of the composite activity are mutually updated
based on each other during inference. Deep learning was used for feature
extraction from raw signals. The experimental results showed increasing accuracy as a composite
activity evolved. Even though these works have demonstrated promising solutions to recognizing
composite activities, a major concern remains: properly cutting a raw time-series signal
into segments of individual simple actions is the basis for success. A summary of the advantages
and limitations of different works on composite activity recognition is presented in Table 6.

3.6 Data Segmentation

As original sensor data is represented by continuously streaming signals, a fixed-size window is
commonly used to partition raw sensor data sequences into segments that serve as input to a model for
activity recognition. This is essential because the sample of a single timestep cannot
provide adequate information about an activity. Ideally, one partitioned data segment contains
only one activity, and thus a model predicts a single label for all the samples within a single window.
However, the samples in one window may not always share the same label when an activity
transition occurs in the middle of the window. Therefore, an optimal segmentation approach is
critical to increasing activity recognition accuracy.
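Fixed-size windowing and the transition problem it creates can be made concrete with a short sketch. The function name, the majority-vote labeling rule, and the toy stream are illustrative assumptions rather than a prescribed procedure.

```python
from collections import Counter

def sliding_windows(samples, labels, size, step):
    """Yield (segment, majority_label, is_pure) for each full window of the stream."""
    for start in range(0, len(samples) - size + 1, step):
        segment = samples[start:start + size]
        window_labels = labels[start:start + size]
        majority, _ = Counter(window_labels).most_common(1)[0]
        # A window is "pure" when every timestep in it carries the same label
        yield segment, majority, len(set(window_labels)) == 1

# Toy stream: ten timesteps with a walk-to-sit transition after the sixth sample
samples = list(range(10))
labels = ["walk"] * 6 + ["sit"] * 4

windows = list(sliding_windows(samples, labels, size=4, step=2))
purity = [is_pure for _, _, is_pure in windows]
print(purity)  # [True, True, False, True]
```

The third window straddles the transition, so any single label assigned to it is wrong for half of its samples, which is exactly the case the segmentation methods in this section try to handle.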


Table 7. Advantages and Limitations of Different Works for Data Segmentation

| Treatment | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Explicit segmentation | hierarchical narrow-down | [4] | able to deal with a transition within a window; able to capture long range information | limited generalization ability; multiple classifiers required; limited in capturing transitions |
| Explicit segmentation | timestep wise | [157, 176] | able to deal with a transition within a window; able to capture long range information; fine-grained segmentation | difficult to define exact transition periods for ground truth |
| Implicit segmentation | multi-label | [140] | simple structure and training scheme; able to capture long range information | relatively coarse; not able to capture transitions; not able to identify activity sequence within a window |

3.6.1 Explicit Segmentation. An intuitive approach is to try various fixed window sizes empirically.
Nevertheless, although a larger window size provides richer information, it increases the
possibility that a transition occurs in the middle of a window. Conversely, a smaller window
size cannot provide enough information. In light of this issue, Reference [4] reported a hierarchical
signal segmentation method, which initially used a large window size and gradually narrowed
down the segmentation until only one activity remained in a sub-window. The narrow-down criterion is
that two consecutive windows have different labels or the classification confidence is less than a
threshold. Different from the hierarchical framework, some researchers explored directly assigning
a label to each timestep instead of predicting a window as a whole [157, 176]. Inspired by
semantic segmentation in the computer vision community, the authors employed fully convolutional
networks (FCNs) [83] to achieve this goal. Data from a large window is input, and a 1D
CNN layer replaces the final softmax layer, where the length of each feature map equals the
number of timesteps and the number of feature maps equals the number of activity classes, to predict a
label for each timestep. Therefore, the FCNs can use not only the information of the corresponding
timestep itself but also the information of its neighboring timesteps.
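The hierarchical narrow-down idea can be sketched as a recursive split. The sketch implements only the confidence criterion (a window is halved while the classifier's confidence stays below a threshold) and omits the label-disagreement criterion; `toy_classifier`, the threshold, and the minimum window size are hypothetical choices, not those of Reference [4].

```python
def toy_classifier(segment):
    """Hypothetical classifier: label by sign majority, confidence = share of the majority."""
    positives = sum(1 for v in segment if v > 0)
    share = positives / len(segment)
    return ("active", share) if share >= 0.5 else ("rest", 1.0 - share)

def narrow_down(signal, start, end, classify, threshold=0.9, min_size=2):
    """Recursively halve [start, end) until the classifier is confident or the window is minimal."""
    label, confidence = classify(signal[start:end])
    if confidence >= threshold or end - start <= min_size:
        return [(start, end, label)]
    mid = (start + end) // 2
    return (narrow_down(signal, start, mid, classify, threshold, min_size)
            + narrow_down(signal, mid, end, classify, threshold, min_size))

# Toy signal: "active" (positive values) for six timesteps, then "rest"
signal = [1, 1, 1, 1, 1, 1, -1, -1]
segments = narrow_down(signal, 0, len(signal), toy_classifier)
print(segments)  # [(0, 4, 'active'), (4, 6, 'active'), (6, 8, 'rest')]
```

The large window is kept wherever the classifier is confident, and only the part containing the transition is refined, which is how the method retains long-range information while still isolating the transition.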

3.6.2 Implicit Segmentation. Explicit segmentation for activity recognition is not practical,
since users perform activities for varying durations. In Reference [140], Varamin et al. defined
unsegmented activity recognition as a set prediction problem. They designed a multi-label architecture
to simultaneously predict the number of ongoing activities and the probability of occurrence of
each candidate activity without explicit segmentation. Table 7 summarizes the advantages and
limitations of different methods for data segmentation.

3.7 Concurrent Activity

In real-world scenarios, in addition to performing each activity one after another in a sequential
fashion, a person may carry out more than one activity at the same time, which is called concurrent
activities. For instance, one may make a phone call while watching TV. From the perspective of sensor
signals, a piece of data may correspond to multiple ground truth labels. Therefore, concurrent


Table 8. Advantages and Limitations of Different Works for Concurrent Activity Recognition

| Treatment | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Individually | multi-label | [81, 175] | simple architecture | limited adaptability to new activities |
| Concurrently | multi-layer LSTM and high dimensional tensor | [100] | achieve results directly | computational cost increases exponentially with the number of activity increases; limited adaptability to new activities |

activity recognition can be abstracted as a multi-label task. Note that the concurrent activity is
executed by a single subject.
3.7.1 Recognize Individually. A concurrent activity can be considered as several individual ac-
tivities. Zhang et al. [175] designed an individual fully connected network for each candidate ac-
tivity on top of shared multimodal fusion features. The final decision-making layer classified each
activity independently by independent softmax layers. A key drawback of this kind of structure
is that the computational cost increases considerably as the number of activities rises. To
resolve this issue, the authors further proposed to use a single neuron with the sigmoid activation
to make a binary classification (performed or not) for each activity [81].
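The per-activity sigmoid output can be sketched as follows. The logits are assumed to come from a shared feature extractor that is not shown; the activity names, logit values, and the 0.5 decision threshold are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def decode_concurrent(logits, activities, threshold=0.5):
    """Return every activity whose independent sigmoid probability exceeds the threshold."""
    return [name for name, z in zip(activities, logits) if sigmoid(z) > threshold]

activities = ["watch_tv", "phone_call", "cooking"]
logits = [2.1, 0.7, -3.0]  # one output neuron per activity (made-up values)

active = decode_concurrent(logits, activities)
print(active)  # ['watch_tv', 'phone_call']
```

Because each neuron decides independently, several activities can be reported for the same window, and adding an activity adds only one output unit.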
3.7.2 Recognize Concurrently. In contrast, Okita and Inoue [100] also targeted concurrent
activities but directly considered the possibility of different activities occurring concurrently.
They suggested a multi-layer LSTM framework that outputs the probability of every possible
activity combination. The main limitation of this work is that the output dimension explodes
exponentially as the number of concurrent activities increases. The pace of exploring deep learning
methods for concurrent activity recognition is still slow, and there is much room for improvement. A
summary of the advantages and limitations of different approaches for concurrent activity recognition
is given in Table 8.
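The difference in output size between the two strategies can be checked with simple arithmetic. The sketch assumes that a combination-based model needs one output per subset of activities (including the empty set), which is one plausible reading of "every possible activity combination" and not a detail confirmed by Reference [100].

```python
def output_sizes(n_activities):
    """Output units needed by per-activity and per-combination formulations."""
    per_activity = n_activities          # one sigmoid unit per activity
    per_combination = 2 ** n_activities  # one output per subset of activities (assumed)
    return per_activity, per_combination

for n in (3, 10, 20):
    print(n, output_sizes(n))
# 3 (3, 8)
# 10 (10, 1024)
# 20 (20, 1048576)
```

With 20 candidate activities the combination-based output already exceeds one million entries, whereas the per-activity formulation still needs 20.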

3.8 Multi-occupant Activity

Most of the state-of-the-art works focus on monitoring and assisting people in single-occupant
settings. Nevertheless, living and working spaces are usually occupied by multiple subjects; hence,
designing solutions for multi-occupant scenarios is of notable practical significance. There are
mainly two types of multi-occupant activities: parallel activity, where occupants perform activities
individually, such as one occupant eating while another is watching TV; and collaborative
activity, where multiple occupants collaborate to perform the same activity, such as two
subjects playing table tennis [19]. For parallel activity recognition, when only wearable sensors
are used, the task can be divided into multiple single-occupant activity recognition tasks and solved by
conventional solutions; when ambient or object sensors are used, data association, i.e., mapping
sensed signals to the occupant who actually caused them, becomes the major
challenge, which grows more serious as the number of occupants in the space increases. Data
association is crucial in the multi-occupant scenario, since without it the data would be
useless and could even endanger the lives of residents in telehealth applications. For collaborative
activity, human interactions and instruments are generally involved; thus, context and object-use
information play vital roles in designing recognition solutions. Although multi-occupant
activity recognition is of great significance, deep learning-based research on it is still limited.


Table 9. Advantages and Limitations of Different Works for Multi-occupant Activity Recognition

| Targeting scenario | Sensors | References | Advantages | Limitations |
|---|---|---|---|---|
| Collaborative activity | ambient and wearable | [124] | nearly 100% recognition accuracy | occupants are constrained to perform the same activity together |
| Parallel activity | ambient | [138] | no constraints to occupants | unable to associate activities to occupants |

3.8.1 Collaborative Activity. In Reference [124], both wearable and ambient sensors were used
to recognize group activities of two occupants. The ambient sensors were leveraged for extracting
context information, which is represented by disparate functional indoor areas. The sensor data
of different occupants was input into different RBMs separately and then merged into a sequential
network, a DBN and an MLP, for the inference of the group activity. A high accuracy of nearly
100% was achieved. However, most of the targeted scenarios are constrained to two occupants
performing the same activity together.
3.8.2 Parallel Activity. In contrast, Tran et al. [138] did not assume that the occupants act
together. They aimed at recognizing activities for occupants individually. A multi-label RNN was
created with each cell corresponding to the activity of an occupant. Nevertheless, the authors only
used ambient sensors and did not propose a specific solution to data association. Table 9 summa-
rizes the advantages and limitations of different methods for multi-occupant activity recognition.

3.9 Computation Cost

Although deep learning models have shown dominant accuracy in the sensor-based human activity
recognition community, they are typically resource-intensive. For example, the early DCNN
architecture AlexNet [70], which has five convolutional layers and three fully connected layers, has
61M parameters (249 MB of memory) and performs 1.5B high-precision operations to make a
prediction. For non-portable applications, Graphics Processing Units (GPUs) are usually leveraged
to accelerate computation. However, GPUs are expensive and power-hungry, and are therefore not
suitable for real-time applications on mobile devices. Moreover, current research has demonstrated
that making a neural network deeper by introducing additional layers and nodes is a critical
approach to improving model performance, which inevitably increases computational complexity.
Therefore, it is essential and challenging to resolve the issue of high computation cost to realize
real-time and reliable human activity recognition on mobile devices with deep learning models.
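The parameter count quoted for AlexNet can be reproduced with simple arithmetic. The survey does not list the layer shapes, so the shapes below are an assumption: they are the commonly cited configuration with grouped convolutions in the second, fourth, and fifth convolutional layers. The sketch counts weights and biases and converts the total to megabytes at 4 bytes per parameter.

```python
# (out_channels, in_channels_per_group, kernel_h, kernel_w) per conv layer.
# Assumed shapes: the commonly cited AlexNet configuration.
conv_layers = [
    (96, 3, 11, 11),
    (256, 48, 5, 5),
    (384, 256, 3, 3),
    (384, 192, 3, 3),
    (256, 192, 3, 3),
]
# (inputs, outputs) per fully connected layer.
fc_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]

def count_parameters(conv_layers, fc_layers):
    """Sum weights and biases over all layers."""
    total = 0
    for out_c, in_c, kh, kw in conv_layers:
        total += out_c * in_c * kh * kw + out_c
    for n_in, n_out in fc_layers:
        total += n_in * n_out + n_out
    return total

n_params = count_parameters(conv_layers, fc_layers)
size_mb = n_params * 4 / 1e6  # 32-bit floats, decimal megabytes
print(n_params, round(size_mb, 1))
```

Under these assumptions the total is about 61M parameters and roughly 244 MB, the same order as the 249 MB quoted above; the exact memory figure depends on how storage is accounted for. Almost all of the parameters sit in the three fully connected layers.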
3.9.1 Layer Reduction. Because deep neural networks extract features more effectively than shallow
ones but at a higher computation cost, combining hand-crafted features with the features of a shallow
network is a potential way to lower computation cost. In Reference [116], the authors incorporated
spectrogram features with only one CNN layer and two fully connected layers for human activity
recognition. The hybrid architecture showed recognition accuracy comparable to state-of-the-art
methods on four benchmark datasets. To validate the feasibility of real-time usage, the authors
implemented the proposed method on three different mobile platforms, including two smartphones
and one on-node unit. The results showed a computation time of milliseconds to tens of milliseconds
per prediction, suggesting the possibility of real-time applications. Reference [106] also
demonstrates that combining hand-crafted features with a neural network is a potential route to
real-time activity recognition on a mobile device. In addition to the cascade structure of
hand-crafted features and deep learning features, Reference [115] proposed arranging the deep
learning features and hand-crafted features in parallel before feeding them into a fully connected
classifier. This structure could increase recognition accuracy with only a small increase in
computational consumption.

Table 10. Advantages and Limitations of Different Works for Computation Cost

| Solution scheme | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Layer reduction | combination of hand-crafted features and deep features | [106, 115, 116] | simple structure; incorporate features of different aspects | domain knowledge required for hand-crafted features; complex preprocessing |
| Network optimization | optimizing basic block | [143, 115] | end-to-end manner | limited computation cost reducing capability |
| Network optimization | network quantization | [155, 40] | powerful computation cost reducing capability; suitable for FPGAs and ASICs | risk of performance degradation |
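The parallel arrangement described in Section 3.9.1 can be sketched as follows: a handful of statistical features and the output of a single 1D convolutional layer are concatenated and passed to one fully connected classifier. This is only an illustration under assumed choices (per-channel mean, standard deviation, minimum and maximum as the hand-crafted features, global max pooling, random untrained weights), not the network of Reference [115].

```python
import numpy as np

def handcrafted_features(window):
    """window: (T, C). Per-channel mean, std, min, max -> (4 * C,)."""
    return np.concatenate(
        [window.mean(0), window.std(0), window.min(0), window.max(0)]
    )

def conv1d_features(window, kernels):
    """One 1D conv layer (valid padding), ReLU, global max pooling.
    kernels: (K, width, C). Returns (K,)."""
    T, _ = window.shape
    K, width, _ = kernels.shape
    feats = np.empty(K)
    for k in range(K):
        responses = [
            np.sum(window[t:t + width] * kernels[k])
            for t in range(T - width + 1)
        ]
        # ReLU followed by max pooling equals max(0, max response).
        feats[k] = max(0.0, max(responses))
    return feats

def classify(window, kernels, W, b):
    """Concatenate both feature sets and apply one softmax layer."""
    fused = np.concatenate(
        [handcrafted_features(window), conv1d_features(window, kernels)]
    )
    logits = fused @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical window of 50 samples from a 3-axis accelerometer, 6 classes.
rng = np.random.default_rng(0)
window = rng.normal(size=(50, 3))
kernels = rng.normal(size=(4, 5, 3))
n_features = 4 * 3 + 4
W = rng.normal(scale=0.1, size=(n_features, 6))
b = np.zeros(6)
probs = classify(window, kernels, W, b)
```

The hand-crafted branch adds only a few reductions over the window, which is why the extra cost of the parallel design stays small compared with adding further convolutional layers.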
3.9.2 Network Optimization. Optimizing basic neural network cells and structures is another
intuitive scheme for decreasing computation complexity. In Reference [143], Vu et al. used a
self-gated recurrent neural network (SGRNN) cell to reduce the complexity of a standard LSTM
and prevent gradient vanishing. Their experiments showed superior computation efficiency to
LSTM and GRU in terms of running time and model size. However, the running time was still
on the order of hundreds of milliseconds, and no real-world evaluation on mobile devices was carried
out to show a possible real-time implementation. For CNN-based methods, reducing filter size is
an effective means to optimize the memory consumption and the number of computation
operations. For example, Reference [115] utilized 1D-CNNs instead of 2D-CNNs to control the model
size. A strategy that deals with both the storage and computational problems is
the quantization of the network [40]. This scheme constrains the weights and the outputs of
activation functions to two discrete values (e.g., −1, +1) instead of continuous numbers. There are three
major benefits of network quantization: (1) the memory usage and model size are greatly reduced
compared to full-precision networks; (2) bitwise operations are considerably more
efficient than conventional floating- or fixed-point arithmetic; (3) if bitwise operations are used,
then most multiply-accumulate operations (requiring hundreds of logic gates at least) can be
replaced by popcount-XNOR operations (that only require a single logic gate), which are especially
well suited for FPGAs and ASICs [155]. In Reference [155], Yang et al. explored a 2-bit CNN with
weights and activations constrained to {−0.5, 0, 0.5} for efficient activity recognition. Table 10
summarizes the advantages and limitations of different methods for reducing computation cost.
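The replacement of multiply-accumulate by popcount-XNOR can be checked on a small example. For two vectors with entries in {−1, +1} of length n, the dot product equals 2m − n, where m is the number of positions in which the signs agree. The sketch below packs the signs into the bits of an integer and counts agreements with XNOR and popcount; the helper names are hypothetical and the code illustrates the arithmetic identity only, not an optimized kernel.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1} by sign (zero is mapped to +1)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def pack_bits(v):
    """Encode a {-1, +1} vector as an int: bit i is 1 where v[i] == +1."""
    bits = 0
    for i, s in enumerate(v):
        if s == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # popcount
    return 2 * matches - n

rng = np.random.default_rng(0)
a = binarize(rng.normal(size=64))
b = binarize(rng.normal(size=64))
fast = xnor_popcount_dot(pack_bits(a), pack_bits(b), 64)
reference = int(np.dot(a.astype(int), b.astype(int)))
```

Here `fast` and `reference` agree, and each weight needs one bit of storage instead of 32, which is where the memory saving of benefit (1) comes from.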

3.10 Privacy
The main application of human activity recognition is to monitor human behaviors, so the sensors
capture the activities of a user continuously. Since the way an activity is performed varies among
users, it is possible for an adversary to infer user-sensitive information such as age from the
time-series sensor data. In particular, the black-box nature of deep learning carries the risk of
unintentionally revealing user-discriminative features. In Reference [63], the
authors investigated the privacy issue of using CNN features for human activity recognition. Their
empirical studies revealed that although the CNN was trained with a cross-entropy loss targeting only
activity classification, the obtained CNN features still showed powerful user-discriminative ability.
A simple logistic regressor could achieve a high user-classification accuracy of 84.7% when using
the CNN features extracted for activity recognition, while the same classifier could only obtain 35.2%
user-classification accuracy on raw sensor data. Therefore, it is essential to address the potential
privacy leakage of a deep learning model originally used for human activity recognition.

3.10.1 Transformation. To address this concern, some researchers explored using an adversarial
loss function to minimize the discriminative accuracy of specific privacy information during
the training process. For example, Iwasawa et al. [63] proposed to integrate an adversarial loss with
the standard activity classification loss to minimize the user identification accuracy. The authors
of References [88] and [87] adopted a similar idea to prevent privacy leakage. Their experimental
results show an effective reduction in the inference accuracy for sensitive information. However,
an adversarial loss function can only protect one kind of private information at a time, such
as user identity or gender. In addition, the adversarial loss works against the end-to-end training
process, which makes it hard to converge stably. To close this gap, Reference [166] borrowed
the idea of image style transformation from the computer vision community to protect all private
information at once. The authors viewed raw sensor signals from two aspects: the “style”
aspect, which describes how a user performs an activity and is influenced by the user’s identity
information, such as age, weight, gender, and height; and the “content” aspect, which describes
what activity a user performs. They proposed to transform raw sensor data so that the “content”
remains unchanged while the “style” resembles random noise. Therefore, the method has the potential to
protect all sensitive information at once.
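The combination of an activity loss with an adversarial loss can be written down in a few lines. The sketch below uses a generic formulation in which the feature extractor minimizes the activity loss minus a weighted user-identification loss, while a separate adversary minimizes the user-identification loss. The weighting factor `lam` and the function names are assumptions; the exact objectives of References [63], [87], and [88] may differ.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label."""
    return -np.log(probs[label])

def encoder_objective(activity_probs, activity_label,
                      user_probs, user_label, lam=0.1):
    """Loss seen by the feature extractor: classify the activity well
    while making the adversary's user prediction as poor as possible."""
    l_act = cross_entropy(activity_probs, activity_label)
    l_user = cross_entropy(user_probs, user_label)
    return l_act - lam * l_user

def adversary_objective(user_probs, user_label):
    """Loss seen by the adversary: identify the user from the features."""
    return cross_entropy(user_probs, user_label)

activity_probs = np.array([0.8, 0.2])
leaky = encoder_objective(activity_probs, 0, np.array([0.9, 0.1]), 0)
private = encoder_objective(activity_probs, 0, np.array([0.5, 0.5]), 0)
```

With the same activity prediction, `private` is lower than `leaky`, so the extractor is rewarded when the adversary is reduced to guessing. The two objectives pull in opposite directions, which is one way to see why such training can be hard to stabilize, and each protected attribute needs its own adversary and labels.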

3.10.2 Perturbation. Besides data transformation, data perturbation is another way to resolve
the privacy issue. For example, Lyu et al. [84] proposed to tailor two kinds of data perturbation
mechanisms, Random Projection and repeated Gompertz, to achieve a better tradeoff between privacy
and recognition accuracy. Recently, differential privacy has gained increased research attention
due to its strong theoretical privacy guarantee. Phan et al. [105] proposed to perturb the
objective functions of the traditional deep auto-encoder to enforce ϵ-differential privacy. In
addition to the privacy preservation in feature extraction layers, an ϵ-differential
privacy-preserving softmax layer was also developed for either classification or prediction. Unlike the
above approaches, this method provides theoretical privacy guarantees and error bounds. The
advantages and limitations of different methods for protecting user privacy in activity recognition
are summarized in Table 11.
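Two generic building blocks behind these ideas can be sketched briefly. The first projects feature vectors to a lower dimension with a random Gaussian matrix, and the second is the standard Laplace mechanism, which adds noise with scale sensitivity/ϵ to a numeric quantity. Both are textbook versions given for intuition only: the tailored mechanisms of Reference [84] and the objective-function perturbation of Reference [105] are more involved, and the function names and sizes here are assumptions.

```python
import numpy as np

def random_projection(X, k, rng):
    """Project the rows of X (n, d) to k < d dimensions
    with a random Gaussian matrix."""
    d = X.shape[1]
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
    return X @ R

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(value))
    return value + noise

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 9))           # hypothetical 9-dimensional features
Z = random_projection(X, 3, rng)       # perturbed 3-dimensional view
noisy_mean = laplace_mechanism(X.mean(axis=0), sensitivity=1.0,
                               epsilon=0.5, rng=rng)
```

A smaller ϵ gives a larger noise scale and therefore stronger privacy at the price of accuracy, which is the tradeoff discussed above.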

3.11 Interpretability
Raw sensory data of human activities is not human-readable. A data sample may include diverse
modalities (e.g., acceleration, angular velocity) from multiple positions (e.g., wrist, ankle) in a time
window. However, only a few modalities from specific positions contribute to identifying certain
activities [72]. For example, lying is distinguishable when people are horizontal (magnetism), and ascending
stairs can be recognized by the forward and upward acceleration of a person’s ankle. Unrelated
modalities can introduce noise and deteriorate the recognition performance. Moreover, the
significance of each modality changes over time. For instance, in a Parkinson’s disease detection system,
anomalies appear in gait only during a short period rather than over the entire time window [162].
Intuitively, a modality is more significant when the corresponding body part is actively
moving.
Despite the success of deep learning in activity recognition, the inner mechanisms of deep
learning networks remain opaque. Considering the varying salience of modalities and time
intervals, it is necessary to interpret the neural networks to explore the factors behind the models’
decisions. For example, when a deep learning model identifies that a user is walking, we would like to
know which modality in which time interval is the determinant. Therefore, the interpretability
of deep learning methods has become a new trend in the human activity recognition community.

Table 11. Advantages and Limitations of Different Works for Privacy Protection

| Protection scheme | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Transformation | adversarial training | [63, 87, 88] | simple network structure | unstable training; sensitive labels required; new structure needed for new privacy information |
| Transformation | style transfer | [166] | protect all privacy information at one transformation; free of sensitive information for training | complex structure and training strategy |
| Perturbation | direct noise insertion | [84] | simple | limited ability to retain activity information |
| Perturbation | differential privacy | [105] | theoretical privacy guarantees and error bounds | only validated on fully connected layers |

3.11.1 Feature Visualization. The basic idea of interpretable deep learning is to automatically
decide the importance of each part of the input data and to achieve high accuracy by omitting the
unimportant parts and focusing on the salient parts. In fact, standard fully connected layers
already possess such capacity, as they automatically reduce the weights of less important neurons
during training, but the features still need to be visualized for interpretation. Some researchers [22,
152] visualized the features extracted by neural networks. In Reference [152], salient features are sent
to the subsequent models after the authors identify their relationships to the activities from the
visualization. Nutter et al. [98] transformed sensory data into images so that visualization tools can be
applied to the sensory data for more direct interpretability.
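The survey does not describe how Reference [98] encodes sensory data as images, so the following is only one simple possibility, given as an assumption: each sensor channel becomes one image row, scaled independently to the 0 to 255 grayscale range. The function name `window_to_image` is hypothetical.

```python
import numpy as np

def window_to_image(window):
    """window: (T, C) float array of T samples from C channels.
    Returns a (C, T) uint8 image with each row min-max scaled to 0..255."""
    w = window.T.astype(float)
    lo = w.min(axis=1, keepdims=True)
    span = w.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # constant channels map to an all-zero row
    return np.round(255 * (w - lo) / span).astype(np.uint8)

# Hypothetical 2-second window at 50 Hz from a 3-axis accelerometer.
rng = np.random.default_rng(0)
image = window_to_image(rng.normal(size=(100, 3)))
```

Once the window is an image, standard image tooling (heat maps, saliency methods from computer vision) can be applied to it, although per-channel scaling discards the absolute magnitudes of the signals.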

3.11.2 Attentive Selection. The attention mechanism, recently popular in deep learning, is
originally a concept from biology and psychology that describes how we restrict our attention to
something crucial for better cognitive results. Inspired by this, researchers apply neural attention
mechanisms to deep learning to give neural networks the capability of concentrating on the subset
of inputs that really matters. Since the principle of deep attention models is to weigh input
components, components with higher weights are assumed to be more tightly related to the recognition
task and to have greater influence over the models’ decisions [128]. Some works employed the attention
mechanism to interpret deep model behaviors [165, 168, 170]. In human activity recognition, the
attention mechanism not only highlights the most distinguishable modalities and time intervals
but also informs us of the modalities and body parts that contribute most to specific activities. Deep
attention approaches can be categorized into soft attention and hard attention based on their
differentiability.

Table 12. Advantages and Limitations of Different Works for Model Interpretability

| Interpretation scheme | Approaches | References | Advantages | Limitations |
|---|---|---|---|---|
| Feature visualization | - | [22, 98, 152] | adopt current tools of computer vision; simple and intuitive | unable to interpret hidden layers; limited power compared to visualize images as raw signals are unreadable |
| Attentive selection | soft attention | [85, 87, 96, 162] | fully differentiable; applied to both temporal and modality selection interpretation | high cost when input is large |
| Attentive selection | hard attention | [26, 29, 28, 173] | less calculation during test | complex training procedure; applied only to modality selection interpretation |

Soft Attention. In machine learning, “soft” means differentiable. Soft attention assigns a weight
between 0 and 1 to each element of the inputs, which decides how much attention to pay to each element.
Soft attention uses softmax functions in the attention layers to compute the weights, so the whole
model is fully differentiable and gradients can be propagated to other parts of the network
[167]. Attention layers can be inserted into sequence-to-sequence LSTMs for feature extraction
[135]. Attention layers can also be inserted into the neural networks to tune the weights of all
samples [96] in sliding windows, since samples at different time points contribute differently
to activity recognition. Shen et al. [129] further considered the temporal context. They designed
a segment-level attention approach to decide which time segment contains more information.
Combined with a gated CNN, the segment-level attention better extracts temporal dependencies.
Zeng et al. [162] developed attention mechanisms from two perspectives. They first proposed sensor
attention on the inputs to extract the salient sensory modalities and then applied temporal attention
to an LSTM to filter out the inactive data segments. Spatial and temporal attention mechanisms
are employed in Reference [85]. In particular, the spatial dependencies are extracted by fusing the
modalities with self-attention.
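A minimal sketch of soft temporal attention is given here for illustration: each time step of a sequence of hidden states receives a softmax weight, and the weighted sum forms the summary used for classification. The linear scoring vector `w` is an assumed, simplest-possible scoring function; the cited works use richer ones.

```python
import numpy as np

def soft_temporal_attention(H, w):
    """H: (T, D) hidden states, one row per time step.
    w: (D,) scoring vector (learned in a real model).
    Returns the attended summary (D,) and the weights (T,)."""
    scores = H @ w
    scores = scores - scores.max()              # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H                         # weighted sum over time
    return context, alpha

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
w = np.array([1.0, 1.0])
context, alpha = soft_temporal_attention(H, w)
```

Every weight lies strictly between 0 and 1 and the weights sum to 1, so the operation is differentiable end to end. The vector `alpha` is also what gets visualized for interpretation: here the third time step receives the largest weight. Because every time step is scored, the cost grows with the input length.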
Hard Attention. Hard attention determines whether or not to attend to a part of the inputs. The
weight assigned to an input part is either 0 or 1, so the problem is non-differentiable. The process
involves making a sequence of selections about which part to attend to. The selection can be output by
a neural network. However, since there is no ground truth indicating the correct selection policy,
hard attention should be represented as a stochastic process. This is where deep reinforcement
learning comes in. Deep reinforcement learning tackles the selection problems in deep learning
and allows the models to propagate gradients in the space of selection policies.
Different reinforcement learning techniques can be applied to hard attention mechanisms in
human activity recognition. Zhang et al. [173] used dueling deep Q networks as the core of hard
attention to focus on the salient parts of multimodal sensory data. Chen et al. [26, 29] mined
important modalities and elided undesirable features with policy gradient. The attention is embedded
into an LSTM to make selections step by step, because an LSTM incrementally learns information in
an episode. Chen et al. [28] further considered the intrinsic relations between activities and
sub-motions of human body parts. They employed multiple agents to concentrate on modalities that
are related to sub-motions; the agents coordinate to portray the activities. The visualization
of the selected modalities and body parts validates that the attention mechanism provides insights
into how sensory data elements affect the models’ prediction of activities. The advantages and
limitations of different methods for model interpretability are listed in Table 12.
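To illustrate how a non-differentiable 0/1 selection can still be trained, the sketch below uses the generic REINFORCE policy-gradient rule on a Bernoulli selection policy over modalities. It is not the dueling deep Q network of Reference [173] nor the models of References [26, 28, 29]; the toy reward, learning rate, and function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_mask(theta, rng):
    """Sample a binary keep/drop mask over modalities.
    Modality i is kept with probability sigmoid(theta[i])."""
    p = sigmoid(theta)
    mask = (rng.random(theta.shape) < p).astype(float)
    return mask, p

def reinforce_update(theta, mask, p, reward, baseline, lr=0.1):
    """One policy-gradient step. For a Bernoulli policy with logit theta,
    the gradient of log pi(mask) with respect to theta is (mask - p)."""
    return theta + lr * (reward - baseline) * (mask - p)

rng = np.random.default_rng(0)
theta = np.zeros(3)      # logits for three hypothetical modalities
baseline = 0.0
for _ in range(500):
    mask, p = sample_mask(theta, rng)
    # Toy reward: modality 0 is informative, the others only add cost.
    reward = mask[0] - 0.2 * mask[1:].sum()
    theta = reinforce_update(theta, mask, p, reward, baseline)
    baseline = 0.9 * baseline + 0.1 * reward   # running-average baseline
```

In this toy setting the logit of modality 0 tends to grow while the others tend to shrink, so the policy learns to keep only the useful modality. At test time the dropped modalities need not be processed at all, which is the source of the reduced computation noted for hard attention.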

4 FUTURE RESEARCH DIRECTION

To develop the full potential of deep learning in human activity recognition, some future research
directions are worthy of further investigation. Future directions can be stimulated by the challenges
summarized in this work. Despite the effort devoted to these challenges, some of them are still not
fully explored, such as class imbalance, composite activities, and concurrent activities.
Although current research still lacks comprehensive and reliable solutions for these challenges,
it lays concrete foundations and provides guidance for future directions.
Moreover, there are other research directions that have rarely been explored before. We outline
several key research directions that urgently need to be explored as follows:

• Independent unsupervised methods. Human activity recognition needs a sufficient
amount of annotated samples to train the deep learning models. Unsupervised learning can
help mitigate this requirement. So far, deep unsupervised models for human activity
recognition are mainly used for extracting features but are not able to identify activities,
because there is no ground truth. Therefore, one potential way for unsupervised learning
to infer true labels is to seek other knowledge, which leads us to a popular method, deep
unsupervised transfer learning [18]. Another way is to resort to knowledge-driven methods such as
ontology [119].
• Identifying new activities. Identifying novel activities that have never been seen by the
models is a big challenge in human activity recognition. A reliable model should be able
to learn new knowledge online and achieve accurate recognition without any ground
truth. A promising way is to learn features that are scalable to diverse activities. While
Reference [97] shows that mid-level attributes can be used to depict activities with
a set of characteristics, disentangled features [137] may be another useful solution for
representing novel activities.
• Future activity prediction. Future activity prediction is an extension of activity
recognition. Unlike activity recognition, an activity prediction system can forecast users’
behaviors in advance. Such a system is useful in detecting human intention, so it can be
applied to smart services, crime detection, and driver behavior prediction. In some
common behavior tasks, the activities usually occur in a certain order. Therefore, modeling the
temporal dependencies across activities is beneficial for predicting future activities. LSTMs
[10] are suitable for such tasks. But for long-span activities, LSTMs cannot retain such
long dependencies. In this case, intention recognition based on brain signals [171] can
help inspire activity prediction.
• A standardization of the state-of-the-art. While hundreds of works have investigated
deep learning for sensor-based human activity recognition, the field lacks a
standardization of the state-of-the-art for a fair comparison. The experiment settings and
evaluation metrics for assessing the performance of activity recognition vary from paper
to paper. Since deep learning heavily relies on the training data, the division of
training/test/validation sets influences the recognition results. Other factors, including data
processing and the implementation platforms, also lead to skewed comparisons. Therefore,
a mature standardization for all researchers is pressing. It is noteworthy that such an
issue is absent in other areas. For example, the ImageNet Challenge [125] meticulously defines
details of the experiment setting to ensure impartial comparison. Jordao et al. [66]
implemented and evaluated a set of existing works with standardized settings, but there is still
no rigorous and well-recognized standardization in the field of human activity recognition.
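As an illustration of what a fixed evaluation protocol could specify, the sketch below combines a leave-one-subject-out split with the macro-averaged F1 score. This is one possible protocol chosen here as an assumption, not a standard proposed by the works cited above; the function names are hypothetical.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) so that
    no subject appears in both the training and the test set."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == s)
        train = np.flatnonzero(subject_ids != s)
        yield s, train, test

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1, so rare activities count
    as much as frequent ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

folds = list(leave_one_subject_out([1, 1, 2, 2, 3]))
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

Splitting by subject rather than by random window avoids placing overlapping windows of the same person in both sets, which is one of the ways in which the division of the data can change reported results.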

5 CONCLUSION
This work aims to offer a rough guideline for novices and experienced researchers who
are interested in deep learning methods for sensor-based human activity recognition. We present a
comprehensive survey to summarize the current deep learning methods for sensor-based human
activity recognition. We first introduce the multi-modality of the sensory data and the available public
datasets and their extensive utilization in different challenges. We then summarize the challenges
in human activity recognition according to their causes and analyze how existing deep methods are
adopted to address the challenges. At the end of this work, we discuss the open issues and provide
some insights for future directions.

REFERENCES
[1] Zahraa S. Abdallah, Mohamed Medhat Gaber, Bala Srinivasan, and Shonali Krishnaswamy. 2018. Activity recogni-
tion with evolving data streams: A review. Comput. Surv. 51, 4 (2018), 71.
[2] Alireza Abedin, Seyed Hamid Rezatofighi, Qinfeng Shi, and Damith Chinthana Ranasinghe. 2019. SparseSense: Hu-
man activity recognition from highly sparse sensor data-streams using set-based neural networks. In Proceedings of
the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 5780–5786.
[3] Ali Akbari and Roozbeh Jafari. 2019. Transferring activity recognition models for new wearable sensors with deep
generative domain adaptation. In Proceedings of the 18th International Conference on Information Processing in Sensor
Networks. ACM, 85–96.
[4] Ali Akbari, Jian Wu, Reese Grimsley, and Roozbeh Jafari. 2018. Hierarchical signal segmentation and classification
for accurate activity recognition. In Proceedings of the ACM International Joint Conference and International Sympo-
sium on Pervasive and Ubiquitous Computing and Wearable Computers. ACM, 1596–1605.
[5] Ali A. Alani, Georgina Cosma, and Aboozar Taherkhani. 2020. Classifying imbalanced multi-modal sensor data for
human activity recognition in a smart home using deep learning. In Proceedings of the International Joint Conference
on Neural Networks (IJCNN’20). IEEE, 1–8.
[6] Hande Alemdar, Halil Ertan, Ozlem Durmaz Incel, and Cem Ersoy. 2013. ARAS human activity datasets in multiple
homes with multiple residents. In Proceedings of the 7th International Conference on Pervasive Computing Technologies
for Healthcare. ICST, 232–235.
[7] Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. 2016. Deep
activity recognition models with triaxial accelerometers. In Proceedings of the Workshops at the 30th AAAI Conference
on Artificial Intelligence.
[8] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A public domain
dataset for human activity recognition using smartphones. In Proceedings of the European Symposium on Artificial
Neural Networks.
[9] Sina Mokhtarzadeh Azar, Mina Ghadimi Atigh, Ahmad Nickabadi, and Alexandre Alahi. 2019. Convolutional rela-
tional machine for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 7892–7901.
[10] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2010. Action classifica-
tion in soccer videos with long short-term memory recurrent neural networks. In Proceedings of the International
Conference on Artificial Neural Networks. Springer, 154–159.
[11] Marc Bachlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M. Hausdorff, Nir Giladi, and Gerhard Troster.
2010. Wearable assistant for Parkinson’s disease patients with the freezing of gait symptom. IEEE Trans. Inf. Technol.
Biomed. 14, 2 (2010), 436–446.
[12] Lei Bai, Lina Yao, Xianzhi Wang, Salil S. Kanhere, Bin Guo, and Zhiwen Yu. 2020. Adversarial multi-view networks
for activity recognition. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 4, 2 (2020), 1–22.
[13] Lei Bai, Lina Yao, Xianzhi Wang, Salil S. Kanhere, and Yang Xiao. 2020. Prototype similarity learning for activity
recognition. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 649–
661.
[14] Lu Bai, Chris Yeung, Christos Efstratiou, and Moyra Chikomo. 2019. Motion2Vector: Unsupervised learning in hu-
man activity recognition using wrist-sensing data. In Proceedings of the ACM International Joint Conference on Perva-
sive and Ubiquitous Computing and Proceedings of the ACM International Symposium on Wearable Computers. ACM,
537–542.
[15] Donald S. Baim, Wilson S. Colucci, E. Scott Monrad, Harton S. Smith, Richard F. Wright, Alyce Lanoue, Diane
F. Gauthier, Bernard J. Ransil, William Grossman, and Eugene Braunwald. 1986. Survival of patients with severe
congestive heart failure treated with oral milrinone. J. Amer. Coll. Cardiol. 7, 3 (1986), 661–670.
[16] Oresti Banos, Rafael Garcia, Juan A. Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez,
and Claudia Villalonga. 2014. mHealthDroid: A novel framework for agile development of mobile health applica-
tions. In Proceedings of the International Workshop on Ambient Assisted Living. Springer, 91–98.
[17] Billur Barshan and Murat Cihan Yüksek. 2014. Recognizing daily and sports activities in two open source machine
learning environments using body-worn sensor units. Comput. J. 57, 11 (2014), 1649–1667.
[18] Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of
ICML Workshop on Unsupervised and Transfer Learning. 17–36.


[19] Asma Benmansour, Abdelhamid Bouchachia, and Mohammed Feham. 2015. Multioccupant activity recognition in
pervasive smart home environments. Comput. Surv. 48, 3 (2015), 1–36.
[20] Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with Co-Training. In Proceedings
of the Eleventh Annual Conference on Computational Learning Theory, COLT 1998, Madison, Wisconsin, USA, July
24–26, 1998. ACM, 92–100.
[21] Henrik Blunck, Niels Olof Bouvin, Tobias Franke, Kaj Grønbæk, Mikkel B. Kjaergaard, Paul Lukowicz, and Markus
Wüstenberg. 2013. On heterogeneity in mobile sensing applications aiming at representative data collection. In
Proceedings of the ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication. ACM, 1087–1098.
[22] Eoin Brophy, José Juan Dominguez Veiga, Zhengwei Wang, Alan F. Smeaton, and Tomas E. Ward. 2018. An inter-
pretable machine vision approach to human activity recognition using photoplethysmograph sensor data. arXiv
preprint arXiv:1812.00668 (2018).
[23] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A tutorial on human activity recognition using body-worn
inertial sensors. Comput. Surv. 46, 3 (2014), 33:1–33:33.
[24] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R.
Millán, and Daniel Roggen. 2013. The opportunity challenge: A benchmark database for on-body sensor-based ac-
tivity recognition. Pattern Recog. Lett. 34, 15 (2013), 2033–2042.
[25] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2015. UTD-MHAD: A multimodal dataset for human action
recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the IEEE International Confer-
ence on Image Processing (ICIP’15). IEEE, 168–172.
[26] Kaixuan Chen, Lina Yao, Xianzhi Wang, Dalin Zhang, Tao Gu, Zhiwen Yu, and Zheng Yang. 2018. Interpretable par-
allel recurrent neural networks with convolutional attentions for multi-modality activity modeling. In Proceedings
of the International Joint Conference on Neural Networks. IEEE, 1–8.
[27] Kaixuan Chen, Lina Yao, Dalin Zhang, Xiaojun Chang, Guodong Long, and Sen Wang. 2019. Distributionally ro-
bust semi-supervised learning for people-centric sensing. In Proceedings of the 33rd AAAI Conference on Artificial
Intelligence (AAAI’19). 3321–3328.
[28] Kaixuan Chen, Lina Yao, Dalin Zhang, Bin Guo, and Zhiwen Yu. 2019. Multi-agent attentional activity recognition.
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 1344–1350.
[29] Kaixuan Chen, Lina Yao, Dalin Zhang, Xianzhi Wang, Xiaojun Chang, and Feiping Nie. 2020. A semisupervised
recurrent convolutional attention model for human activity recognition. IEEE Trans. Neural Networks Learn. Syst. 31, 5
(2020), 1747–1756.
[30] Ling Chen, Yi Zhang, and Liangying Peng. 2020. METIER: A deep multi-task learning based activity and user recog-
nition model using wearable sensors. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 4, 1 (2020), 1–18.
[31] Yiqiang Chen, Jindong Wang, Meiyu Huang, and Han Yu. 2019. Cross-position activity recognition with stratified
transfer learning. Pervas. Mob. Comput. 57 (2019), 1–13.
[32] Yuwen Chen, Kunhua Zhong, Ju Zhang, Qilong Sun, and Xueliang Zhao. 2016. LSTM networks for mobile human
activity recognition. In Proceedings of the International Conference on Artificial Intelligence: Technologies and Appli-
cations. Atlantis Press.
[33] Weihao Cheng, Sarah M. Erfani, Rui Zhang, and Ramamohanarao Kotagiri. 2018. Predicting complex activities from
ongoing multivariate time series. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.
3322–3328.
[34] Belkacem Chikhaoui and Frank Gouineau. 2017. Towards automatic feature extraction for activity recognition from
wearable sensors: A deep learning approach. In Proceedings of the IEEE 17th International Conference on Data Mining
Workshops (ICDMW’17). IEEE, 693–702.
[35] Jun-Ho Choi and Jong-Seok Lee. 2018. Confidence-based deep multimodal fusion for activity recognition. In Proceed-
ings of the ACM International Joint Conference and International Symposium on Pervasive and Ubiquitous Computing
and Wearable Computers. ACM, 1548–1556.
[36] Oscar Day and Taghi M. Khoshgoftaar. 2017. A survey on heterogeneous transfer learning. J. Big Data 4, 1 (2017),
29.
[37] Sanorita Dey, Nirupam Roy, Wenyuan Xu, Romit Roy Choudhury, and Srihari Nelakuditi. 2014. AccelPrint: Im-
perfections of accelerometers make smartphones trackable. In Proceedings of the Network and Distributed System
Security Symposium (NDSS’14).
[38] Mingtao Dong, Jindong Han, Yuan He, and Xiaojun Jing. 2018. HAR-Net: Fusing deep representation and hand-
crafted features for human activity recognition. In Proceedings of the International Conference on Signal and Infor-
mation Processing, Networking and Computers. Springer, 32–40.
[39] Stefan Duffner, Samuel Berlemont, Grégoire Lefebvre, and Christophe Garcia. 2014. 3D gesture classification with
convolutional neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Pro-
cessing. IEEE, 5432–5436.

ACM Computing Surveys, Vol. 54, No. 4, Article 77. Publication date: May 2021.
[40] Marcus Edel and Enrico Köppe. 2016. Binarized-BLSTM-RNN based human activity recognition. In Proceedings of
the International Conference on Indoor Positioning and Indoor Navigation (IPIN’16). IEEE, 1–7.
[41] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010.
Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11, Feb. (2010), 625–660.
[42] Xiaoyi Fan, Wei Gong, and Jiangchuan Liu. 2018. TagFree activity identification with RFIDs. Proc. ACM Interact.,
Mob., Wear. Ubiq. Technol. 2, 1 (2018), 7.
[43] Martin Gjoreski, Stefan Kalabakov, Mitja Luštrek, and Hristijan Gjoreski. 2019. Cross-dataset deep transfer learn-
ing for activity recognition. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous
Computing and Proceedings of the ACM International Symposium on Wearable Computers. ACM, 714–718.
[44] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the International Conference on Advances in
Neural Information Processing Systems. 2672–2680.
[45] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search
space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28, 10 (2016), 2222–2232.
[46] Rene Grzeszick, Jan Marius Lenk, Fernando Moya Rueda, Gernot A. Fink, Sascha Feldhorst, and Michael ten Hompel.
2017. Deep neural network based human activity recognition for the order picking process. In Proceedings of the 4th
International Workshop on Sensor-based Activity Recognition and Interaction. ACM, 14.
[47] Fuqiang Gu, Kourosh Khoshelham, Shahrokh Valaee, Jianga Shang, and Rui Zhang. 2018. Locomotion activity recog-
nition using stacked denoising autoencoders. IEEE Internet Things J. 5, 3 (2018), 2085–2093.
[48] Yu Guan and Thomas Plötz. 2017. Ensembles of deep LSTM learners for activity recognition using wearables. Proc.
ACM Interact., Mob., Wear. Ubiq. Technol. 1, 2 (2017), 11.
[49] Gautham Krishna Gudur, Prahalathan Sundaramoorthy, and Venkatesh Umaashankar. 2019. ActiveHARNet: To-
wards on-device deep Bayesian active learning for human activity recognition. arXiv preprint arXiv:1906.00108
(2019).
[50] Abdu Gumaei, Mohammad Mehedi Hassan, Abdulhameed Alelaiwi, and Hussain Alsalman. 2019. A hybrid deep
learning model for human activity recognition using multimodal body sensing data. IEEE Access 7 (2019), 99152–
99160.
[51] Haodong Guo, Ling Chen, Liangying Peng, and Gencai Chen. 2016. Wearable sensor based multimodal human ac-
tivity recognition exploiting the diversity of classifier ensemble. In Proceedings of the ACM International Joint Con-
ference on Pervasive and Ubiquitous Computing. ACM, 1112–1123.
[52] Quang-Do Ha and Minh-Triet Tran. 2017. Activity recognition from inertial sensors with convolutional neural net-
works. In Proceedings of the International Conference on Future Data and Security Engineering. Springer, 285–298.
[53] Sojeong Ha and Seungjin Choi. 2016. Convolutional neural networks for human activity recognition using multiple
accelerometer and gyroscope sensors. In Proceedings of the International Joint Conference on Neural Networks. IEEE,
381–388.
[54] Sojeong Ha, Jeong-Min Yun, and Seungjin Choi. 2015. Multi-modal convolutional neural networks for activity recog-
nition. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 3017–3022.
[55] Nils Yannick Hammerla, James Fisher, Peter Andras, Lynn Rochester, Richard Walker, and Thomas Plötz. 2015. PD
disease state assessment in naturalistic environments using deep learning. In Proceedings of the 29th AAAI Conference
on Artificial Intelligence.
[56] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. 2016. Deep, convolutional, and recurrent models for human
activity recognition using wearables. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
1533–1540.
[57] H. M. Hossain, MD Al Hafiz Khan, and Nirmalya Roy. 2018. DeActive: Scaling activity recognition with active deep
learning. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 2, 2 (2018), 66.
[58] H. M. Hossain and Nirmalya Roy. 2019. Active deep learning for activity recognition with context aware annotator
selection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
ACM, 1862–1870.
[59] Tâm Huynh, Mario Fritz, and Bernt Schiele. 2008. Discovery of activity patterns using topic models. In Proceedings
of the Conference on Ubiquitous Computing (UbiComp’08), Vol. 8. 10–19.
[60] Tâm Huynh and Bernt Schiele. 2005. Analyzing features for activity recognition. In Proceedings of the Joint Confer-
ence on Smart Objects and Ambient Intelligence: Innovative Context-aware Services: Usages and Technologies. ACM,
159–163.
[61] Shoya Ishimaru, Kensuke Hoshika, Kai Kunze, Koichi Kise, and Andreas Dengel. 2017. Towards reading trackers in
the wild: Detecting reading activities by EOG glasses and deep neural networks. In Proceedings of the ACM Interna-
tional Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the ACM International Symposium
on Wearable Computers. ACM, 704–711.

[62] Chihiro Ito, Xin Cao, Masaki Shuzo, and Eisaku Maeda. 2018. Application of CNN for human activity recognition
with FFT spectrogram of acceleration and gyro sensors. In Proceedings of the ACM International Joint Conference and
International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers. ACM, 1503–1510.
[63] Yusuke Iwasawa, Kotaro Nakayama, Ikuko Yairi, and Yutaka Matsuo. 2017. Privacy issues regarding the applica-
tion of DNNs to activity-recognition using wearables and its countermeasures by use of adversarial training. In
Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1930–1936.
[64] Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shuochao Yao, Yaqing Wang, Ye Yuan, Hongfei Xue, Chen Song, Xin
Ma, Dimitrios Koutsonikolas, et al. 2018. Towards environment independent device free human activity recognition.
In Proceedings of the 24th International Conference on Mobile Computing and Networking. ACM, 289–304.
[65] Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional
neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 1307–1310.
[66] Artur Jordao, Antonio C. Nazare Jr, Jessica Sena, and William Robson Schwartz. 2018. Human activity recognition
based on wearable sensor data: A standardization of the state-of-the-art. arXiv preprint arXiv:1806.05226 (2018).
[67] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2016. Visualizing and understanding recurrent networks. In Pro-
ceedings of the 4th International Conference on Learning Representations Workshop.
[68] Md Abdullah Al Hafiz Khan, Nirmalya Roy, and Archan Misra. 2018. Scaling human activity recognition via deep
learning-based domain adaptation. In Proceedings of the International Conference on Pervasive Computing and Com-
munications. IEEE, 1–9.
[69] Shehroz S. Khan and Babak Taati. 2017. Detecting unseen falls from wearable devices using channel-wise ensemble
of autoencoders. Exp. Syst. Applic. 87 (2017), 280–290.
[70] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional
neural networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems.
1097–1105.
[71] Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. 2011. Activity recognition using cell phone accelerome-
ters. ACM SIGKDD Explor. Newslett. 12, 2 (2011), 74–82.
[72] Yongjin Kwon, Kyuchang Kang, and Changseok Bae. 2015. Analysis and evaluation of smartphone-based human
activity recognition using a neural network approach. In Proceedings of the International Joint Conference on Neural
Networks. IEEE, 1–5.
[73] Nicholas D. Lane and Petko Georgiev. 2015. Can deep learning revolutionize mobile sensing? In Proceedings of the
16th International Workshop on Mobile Computing Systems and Applications. ACM, 117–122.
[74] Gierad Laput and Chris Harrison. 2019. Sensing fine-grained hand activity with smartwatches. In Proceedings of the
CHI Conference on Human Factors in Computing Systems. ACM, 338.
[75] Oscar D. Lara and Miguel A. Labrador. 2013. A survey on human activity recognition using wearable sensors. IEEE
Commun. Surv. Tutor. 15, 3 (2013), 1192–1209.
[76] Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data using
Convolutional Neural Network. In Proceedings of the IEEE International Conference on Big Data and Smart Computing
(BigComp’17). IEEE, 131–134.
[77] Fei Li and Schahram Dustdar. 2011. Incorporating unsupervised learning in activity recognition. In Proceedings of
the Workshops at the 25th AAAI Conference on Artificial Intelligence.
[78] Xinyu Li, Yuan He, and Xiaojun Jing. 2019. A survey of deep learning-based human activity recognition in radar.
Remote Sens. 11, 9 (2019), 1068.
[79] Xinyu Li, Yanyi Zhang, Mengzhu Li, Ivan Marsic, JaeWon Yang, and Randall S. Burd. 2016. Deep neural network
for RFID-based activity recognition. In Proceedings of the 8th Wireless of the Students, by the Students, and for the
Students Workshop (S3@MobiCom’16). ACM, 24–26.
[80] Xinyu Li, Yanyi Zhang, Ivan Marsic, Aleksandra Sarcevic, and Randall S. Burd. 2016. Deep learning for RFID-based
activity recognition. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM. ACM,
164–175.
[81] Xinyu Li, Yanyi Zhang, Jianyu Zhang, Shuhong Chen, Ivan Marsic, Richard A. Farneth, and Randall S. Burd. 2017.
Concurrent activity recognition with multimodal CNN-LSTM structure. arXiv preprint arXiv:1702.01638 (2017).
[82] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with
implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data
Mining and Knowledge Discovery. ACM, 2–11.
[83] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[84] Lingjuan Lyu, Xuanli He, Yee Wei Law, and Marimuthu Palaniswami. 2017. Privacy-preserving collaborative deep
learning with application to human activity recognition. In Proceedings of the ACM on Conference on Information
and Knowledge Management. ACM, 1219–1228.

[85] Haojie Ma, Wenzhong Li, Xiao Zhang, Songcheng Gao, and Sanglu Lu. 2019. AttnSense: Multi-level attention mecha-
nism for multimodal human activity recognition. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI’19). 3109–3115.
[86] Yuchao Ma and Hassan Ghasemzadeh. 2019. LabelForest: Non-parametric semi-supervised learning for activity
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4520–4527.
[87] Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2018. Protecting sensory data
against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems. ACM, 2.
[88] Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2019. Mobile sensor data
anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation. 49–58.
[89] Akhil Mathur, Tianlin Zhang, Sourav Bhattacharya, Petar Veličković, Leonid Joffe, Nicholas D. Lane, Fahim Kawsar,
and Pietro Liò. 2018. Using deep data augmentation training to address software and hardware heterogeneities in
wearable and smartphone sensing devices. In Proceedings of the 17th ACM/IEEE International Conference on Infor-
mation Processing in Sensor Networks. IEEE Press, 200–211.
[90] Shinya Matsui, Nakamasa Inoue, Yuko Akagi, Goshu Nagino, and Koichi Shinoda. 2017. User adaptation of con-
volutional neural network for human activity recognition. In Proceedings of the 25th European Signal Processing
Conference. IEEE, 753–757.
[91] Taylor Mauldin, Marc Canby, Vangelis Metsis, Anne Ngu, and Coralys Rivera. 2018. SmartFall: A smartwatch-based
fall detection system using deep learning. Sensors 18, 10 (2018), 3363.
[92] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural
network based language model. In Proceedings of the 11th Conference of the International Speech Communication
Association.
[93] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. 2011. Acoustic modeling using deep belief networks.
IEEE Trans. Audio, Speech, Lang. Proc. 20, 1 (2011), 14–22.
[94] Francisco Javier Ordóñez Morales and Daniel Roggen. 2016. Deep convolutional feature transfer across mobile ac-
tivity recognition domains, sensor modalities and locations. In Proceedings of the ACM International Symposium on
Wearable Computers. ACM, 92–99.
[95] Sebastian Münzner, Philip Schmidt, Attila Reiss, Michael Hanselmann, Rainer Stiefelhagen, and Robert Dürichen.
2017. CNN-based sensor fusion techniques for multimodal human activity recognition. In Proceedings of the ACM
International Symposium on Wearable Computers. ACM, 158–165.
[96] Vishvak S. Murahari and Thomas Plötz. 2018. On attention models for human activity recognition. In Proceedings of
the ACM International Symposium on Wearable Computers. ACM, 100–103.
[97] Harideep Nair, Cathy Tan, Ming Zeng, Ole J. Mengshoel, and John Paul Shen. 2019. AttriNet: Learning mid-level
features for human activity recognition with deep belief networks. In Proceedings of the ACM International Joint
Conference on Pervasive and Ubiquitous Computing and Proceedings of the ACM International Symposium on Wearable
Computers. ACM, 510–517.
[98] Mark Nutter, Catherine H. Crawford, and Jorge Ortiz. 2018. Design of novel deep learning models for real-time hu-
man activity recognition with mobile phones. In Proceedings of the International Joint Conference on Neural Networks.
IEEE, 1–8.
[99] Henry Friday Nweke, Ying Wah Teh, Mohammed Ali Al-Garadi, and Uzoma Rita Alo. 2018. Deep learning algorithms
for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges.
Exp. Syst. Applic. 105 (2018), 233–261.
[100] Tsuyoshi Okita and Sozo Inoue. 2017. Recognition of multiple overlapping activities using compositional CNN-
LSTM model. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing and
Proceedings of the ACM International Symposium on Wearable Computers. ACM, 165–168.
[101] Francisco Ordóñez and Daniel Roggen. 2016. Deep convolutional and LSTM recurrent neural networks for multi-
modal wearable activity recognition. Sensors 16, 1 (2016), 115.
[102] Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 10 (2009),
1345–1359.
[103] Liangying Peng, Ling Chen, Zhenan Ye, and Yi Zhang. 2018. AROMA: A deep multi-task learning based simple and
complex human activity recognition method using wearable sensors. Proc. ACM Interact., Mob., Wear. Ubiq. Technol.
2, 2 (2018), 74.
[104] Cuong Pham and Patrick Olivier. 2009. Slice&dice: Recognizing food preparation activities using embedded ac-
celerometers. In Proceedings of the European Conference on Ambient Intelligence. Springer, 34–43.
[105] NhatHai Phan, Yue Wang, Xintao Wu, and Dejing Dou. 2016. Differential privacy preservation for deep auto-
encoders: An application of human behavior prediction. In Proceedings of the 30th AAAI Conference on Artificial
Intelligence.

[106] Ivan Miguel Pires, Nuno Pombo, Nuno M. Garcia, and Francisco Flórez-Revuelta. 2018. Multi-sensor mobile plat-
form for the recognition of activities of daily living and their environments based on artificial neural networks. In
Proceedings of the 27th International Joint Conference on Artificial Intelligence. 5850–5852.
[107] Thomas Plötz, Nils Y. Hammerla, and Patrick L. Olivier. 2011. Feature learning for activity recognition in ubiquitous
computing. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence.
[108] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching
Chen, and S. S. Iyengar. 2018. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv.
51, 5 (2018), 92.
[109] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent gener-
ative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399 (2017).
[110] Hangwei Qian, Sinno Pan, and Chunyan Miao. 2018. Sensor-based activity recognition via learning from distribu-
tions. In Proceedings of the AAAI Conference on Artificial Intelligence.
[111] Hangwei Qian, Sinno Jialin Pan, Bingshui Da, and Chunyan Miao. 2019. A novel distribution-embedded neural
network for sensor-based activity recognition. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI’19). 5614–5620.
[112] Hangwei Qian, Sinno Jialin Pan, and Chunyan Miao. 2019. Distribution-based semi-supervised learning for activity
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7699–7706.
[113] Valentin Radu, Nicholas D. Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K. Marina, and Fahim Kawsar. 2016.
Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the ACM International
Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, 185–188.
[114] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D. Lane, Cecilia Mascolo, Mahesh K. Marina, and
Fahim Kawsar. 2018. Multimodal deep learning for activity and context recognition. Proc. ACM Interact., Mob., Wear.
Ubiq. Technol. 1, 4 (2018), 157.
[115] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. A deep learning approach to on-node sensor
data analytics for mobile or wearable devices. IEEE J. Biomed. Health Inform. 21, 1 (2016), 56–64.
[116] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. Deep learning for human activity recog-
nition: A resource efficient implementation on low-power devices. In Proceedings of the IEEE 13th International
Conference on Wearable and Implantable Body Sensor Networks (BSN’16). IEEE, 71–76.
[117] Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Proceedings
of the 16th International Symposium on Wearable Computers. IEEE, 108–109.
[118] Jorge-L. Reyes-Ortiz, Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita. 2016. Transition-aware human
activity recognition using smartphones. Neurocomputing 171 (2016), 754–767.
[119] Daniele Riboni, Linda Pareschi, Laura Radaelli, and Claudio Bettini. 2011. Is ontology-based activity recognition
really effective? In Proceedings of the IEEE International Conference on Pervasive Computing and Communications
Workshops. IEEE, 427–431.
[120] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz,
David Bannach, Gerald Pirkl, Alois Ferscha, et al. 2010. Collecting complex activity datasets in highly rich networked
sensor environments. In Proceedings of the 7th International Conference on Networked Sensing Systems (INSS’10). IEEE,
233–240.
[121] Seyed Ali Rokni, Marjan Nourollahi, and Hassan Ghasemzadeh. 2018. Personalized human activity recognition using
convolutional neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[122] Charissa Ann Ronao and Sung-Bae Cho. 2015. Deep convolutional neural networks for human activity recognition
with smartphone sensors. In Proceedings of the International Conference on Neural Information Processing. Springer,
46–53.
[123] Charissa Ann Ronao and Sung-Bae Cho. 2016. Human activity recognition with smartphone sensors using deep
learning neural networks. Exp. Syst. Applic. 59 (2016), 235–244.
[124] Silvia Rossi, Roberto Capasso, Giovanni Acampora, and Mariacarla Staffa. 2018. A multimodal deep learning network
for group activity recognition. In Proceedings of the International Joint Conference on Neural Networks. IEEE, 1–6.
[125] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput.
Vis. 115, 3 (2015), 211–252.
[126] Ramyar Saeedi, Skyler Norgaard, and Assefaw H. Gebremedhin. 2017. A closed-loop deep learning architecture for
robust activity recognition using wearable sensors. In Proceedings of the IEEE International Conference on Big Data.
IEEE, 473–479.
[127] Jeffrey C. Schlimmer and Richard H. Granger. 1986. Incremental learning from noisy data. Mach. Learn. 1, 3 (1986),
317–354.

[128] Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Conference of the Asso-
ciation for Computational Linguistics (ACL’19). 2931–2951.
[129] Yu-Han Shen, Ke-Xin He, and Wei-Qiang Zhang. 2018. SAM-GCNN: A gated convolutional neural network with
segment-level attention mechanism for home activity monitoring. In Proceedings of the IEEE International Symposium
on Signal Processing and Information Technology (ISSPIT’18). IEEE, 679–684.
[130] Muhammad Shoaib, Stephan Bosch, Ozlem Incel, Hans Scholten, and Paul Havinga. 2014. Fusion of smartphone
motion sensors for physical activity recognition. Sensors 14, 6 (2014), 10146–10176.
[131] Geetika Singla, Diane J. Cook, and Maureen Schmitter-Edgecombe. 2010. Recognizing independent and joint activ-
ities among multiple residents in smart environments. J. Amb. Intell. Human. Comput. 1, 1 (2010), 57–63.
[132] Elnaz Soleimani and Ehsan Nazerfard. 2019. Cross-subject transfer learning in human activity recognition systems
using generative adversarial networks. arXiv preprint arXiv:1903.12489 (2019).
[133] Maja Stikic, Kristof Van Laerhoven, and Bernt Schiele. 2008. Exploring semi-supervised and active learning for
activity recognition. In Proceedings of the 12th IEEE International Symposium on Wearable Computers. IEEE, 81–88.
[134] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias
Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing hetero-
geneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems.
ACM, 127–140.
[135] Yujin Tang, Jianfeng Xu, Kazunori Matsumoto, and Chihiro Ono. 2016. Sequence-to-sequence model with attention
for time series classification. In Proceedings of the 16th International Conference on Data Mining Workshops. IEEE,
503–510.
[136] Dapeng Tao, Yonggang Wen, and Richang Hong. 2016. Multicolumn bidirectional long short-term memory for mo-
bile devices-based human activity recognition. IEEE Internet Things J. 3, 6 (2016), 1124–1134.
[137] Luan Tran, Xi Yin, and Xiaoming Liu. 2017. Disentangled representation learning GAN for pose-invariant face
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1415–1424.
[138] Son N. Tran, Qing Zhang, Vanessa Smallbon, and Mohan Karunanithi. 2018. Multi-resident activity monitoring in
smart homes: A case study. In Proceedings of the IEEE International Conference on Pervasive Computing and Commu-
nications Workshops (PerCom Workshops’18). IEEE, 698–703.
[139] Tim L. M. van Kasteren, Gwenn Englebienne, and Ben J. A. Kröse. 2011. Human activity recognition from wire-
less sensor network data: Benchmark and software. In Activity Recognition in Pervasive Intelligent Environments.
Springer, 165–186.
[140] Alireza Abedin Varamin, Ehsan Abbasnejad, Qinfeng Shi, Damith C. Ranasinghe, and Hamid Rezatofighi. 2018. Deep
auto-set: A deep auto-encoder-set network for activity recognition using wearables. In Proceedings of the 15th EAI
International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. ACM, 246–253.
[141] George Vavoulas, Charikleia Chatzaki, Thodoris Malliotakis, Matthew Pediaditis, and Manolis Tsiknakis. 2016. The
MobiAct dataset: Recognition of activities of daily living using smartphones. In Proceedings of the International
Conference on Information and Communication Technologies for Ageing Well and e-Health (ICT4AgeingWell’16). 143–
151.
[142] Praneeth Vepakomma, Debraj De, Sajal K. Das, and Shekhar Bhansali. 2015. A-Wristocracy: Deep learning on wrist-
worn sensing for recognition of user complex activities. In Proceedings of the IEEE 12th International Conference on
Wearable and Implantable Body Sensor Networks (BSN’15). 1–6.
[143] Toan H. Vu, An Dang, Le Dung, and Jia-Ching Wang. 2017. Self-gated recurrent neural networks for human activity
recognition on wearable devices. In Proceedings of the Thematic Workshops of ACM Multimedia. ACM, 179–185.
[144] Jiwei Wang, Yiqiang Chen, Yang Gu, Yunlong Xiao, and Haonan Pan. 2018. SensoryGANs: An effective genera-
tive adversarial framework for sensor-based human activity recognition. In Proceedings of the International Joint
Conference on Neural Networks. IEEE, 1–8.
[145] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. 2019. Deep learning for sensor-based activity
recognition: A survey. Pattern Recog. Lett. 119 (2019), 3–11.
[146] Jindong Wang, Vincent W. Zheng, Yiqiang Chen, and Meiyu Huang. 2018. Deep transfer learning for cross-domain
activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering. ACM, 16.
[147] Yanwen Wang, Jiaxing Shen, and Yuanqing Zheng. 2020. Push the limit of acoustic gesture recognition. In Proceedings of the 39th IEEE
Conference on Computer Communications (INFOCOM’20). IEEE, 566–575.
[148] Sungpil Woo, Jaewook Byun, Seonghoon Kim, Hoang Minh Nguyen, Janggwan Im, and Daeyoung Kim. 2016. RNN-
based personalized activity recognition in multi-person environment using RFID. In Proceedings of the IEEE Inter-
national Conference on Computer and Information Technology (CIT’16). IEEE, 708–715.
[149] Rui Xi, Mengshu Hou, Mingsheng Fu, Hong Qu, and Daibo Liu. 2018. Deep dilated convolution on multimodality
time series for human activity recognition. In Proceedings of the International Joint Conference on Neural Networks.
IEEE, 1–8.

[150] Rui Xi, Ming Li, Mengshu Hou, Mingsheng Fu, Hong Qu, Daibo Liu, and Charles R. Haruna. 2018. Deep dilation on
multimodality time series for human activity recognition. IEEE Access 6 (2018), 53381–53396.
[151] Cheng Xu, Duo Chai, Jie He, Xiaotong Zhang, and Shihong Duan. 2019. InnoHAR: A deep neural network for
complex human activity recognition. IEEE Access 7 (2019), 9893–9902.
[152] Li Xue, Si Xiandong, Nie Lanshun, Li Jiazhen, Ding Renjie, Zhan Dechen, and Chu Dianhui. 2018. Understanding
and improving deep neural network for activity recognition. arXiv preprint arXiv:1805.07020 (2018).
[153] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional
neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International
Joint Conference on Artificial Intelligence.
[154] Yang Yang, Chunping Hou, Yue Lang, Dai Guan, Danyang Huang, and Jinchen Xu. 2019. Open-set human activity
recognition based on micro-Doppler signatures. Pattern Recog. 85 (2019), 60–69.
[155] Zhan Yang, Osolo Ian Raymond, Chengyuan Zhang, Ying Wan, and Jun Long. 2018. DFTerNet: Towards 2-bit dy-
namic fusion networks for accurate human activity recognition. IEEE Access 6 (2018), 56750–56764.
[156] Lina Yao, Feiping Nie, Quan Z. Sheng, Tao Gu, Xue Li, and Sen Wang. 2016. Learning from less for better: Semi-
supervised activity recognition via shared structure discovery. In Proceedings of the ACM International Joint Confer-
ence on Pervasive and Ubiquitous Computing. 13–24.
[157] Rui Yao, Guosheng Lin, Qinfeng Shi, and Damith C. Ranasinghe. 2018. Efficient dense labelling of human activity
sequences from wearables using fully convolutional networks. Pattern Recog. 78 (2018), 252–266.
[158] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. 2017. DeepSense: A unified deep learning
framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on
World Wide Web. International World Wide Web Conferences Steering Committee, 351–360.
[159] Yuta Yuki, Junto Nozaki, Kei Hiroi, Katsuhiko Kaji, and Nobuo Kawaguchi. 2018. Activity recognition using dual-
ConvLSTM extracting local and global features for SHL recognition challenge. In Proceedings of the ACM Interna-
tional Joint Conference and International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers.
ACM, 1643–1651.
[160] Piero Zappi, Clemens Lombriser, Thomas Stiefmeier, Elisabetta Farella, Daniel Roggen, Luca Benini, and Gerhard
Tröster. 2008. Activity recognition from on-body sensors: Accuracy-power trade-off by dynamic sensor selection.
In Proceedings of the European Conference on Wireless Sensor Networks. Springer, 17–33.
[161] Tahmina Zebin, Patricia J. Scully, and Krikor B. Ozanyan. 2016. Human activity recognition with inertial sensors
using a deep learning approach. In Proceedings of the IEEE Conference on Sensors (SENSORS’16). IEEE, 1–3.
[162] Ming Zeng, Haoxiang Gao, Tong Yu, Ole J. Mengshoel, Helge Langseth, Ian Lane, and Xiaobing Liu. 2018. Under-
standing and improving recurrent networks for human activity recognition by continuous attention. In Proceedings
of the ACM International Symposium on Wearable Computers. ACM, 56–63.
[163] Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neural
networks for human activity recognition using mobile sensors. In Proceedings of the 6th International Conference on
Mobile Computing, Applications and Services. IEEE, 197–205.
[164] Ming Zeng, Tong Yu, Xiao Wang, Le T. Nguyen, Ole J. Mengshoel, and Ian Lane. 2017. Semi-supervised convolutional
neural networks for human activity recognition. In Proceedings of the IEEE International Conference on Big Data. IEEE,
522–529.
[165] Dalin Zhang, Kaixuan Chen, Debao Jian, and Lina Yao. 2020. Motor imagery classification via temporal attention
cues of graph embedded EEG signals. IEEE J. Biomed. Health Inform. 24, 9 (2020), 2570–2579.
[166] Dalin Zhang, Lina Yao, Kaixuan Chen, Guodong Long, and Sen Wang. 2019. Collective protection: Preventing sen-
sitive inferences via integrative transformation. In Proceedings of the 19th IEEE International Conference on Data
Mining. IEEE, 1–6.
[167] Dalin Zhang, Lina Yao, Kaixuan Chen, and Jessica Monaghan. 2019. A convolutional recurrent attention model for
subject-independent EEG signal analysis. IEEE Sig. Proc. Lett. 26, 5 (2019), 715–719.
[168] Dalin Zhang, Lina Yao, Kaixuan Chen, and Sen Wang. 2018. Ready for use: Subject-independent movement
intention recognition via a convolutional attention model. In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management. ACM, 1763–1766.
[169] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Xiaojun Chang, and Yunhao Liu. 2019. Making sense of
spatio-temporal preserving representations for EEG-based human intention recognition. IEEE Trans. Cyber. 50, 7 (2019),
3033–3044.
[170] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Pari Delir Haghighi, and Caley Sullivan. 2019. A graph-based
hierarchical attention model for movement intention detection from EEG signals. IEEE Trans. Neural Syst. Rehab.
Eng. 27, 11 (2019), 2247–2253.
[171] Dalin Zhang, Lina Yao, Xiang Zhang, Sen Wang, Weitong Chen, Robert Boots, and Boualem Benatallah. 2018.
Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer
interface. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.

ACM Computing Surveys, Vol. 54, No. 4, Article 77. Publication date: May 2021.
[172] Mi Zhang and Alexander A. Sawchuk. 2012. USC-HAD: A daily activity dataset for ubiquitous activity recognition
using wearable sensors. In Proceedings of the ACM Conference on Ubiquitous Computing. ACM, 1036–1043.
[173] Xiang Zhang, Lina Yao, Chaoran Huang, Sen Wang, Mingkui Tan, Guodong Long, and Can Wang. 2018.
Multi-modality sensor data classification with selective attention. In Proceedings of the 27th International Joint Conference
on Artificial Intelligence.
[174] Xiang Zhang, Lina Yao, and Feng Yuan. 2019. Adversarial variational embedding for robust semi-supervised learning.
In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 139–147.
[175] Yanyi Zhang, Xinyu Li, Jianyu Zhang, Shuhong Chen, Moliang Zhou, Richard A. Farneth, Ivan Marsic, and Randall S.
Burd. 2017. Car—A deep learning structure for concurrent activity recognition. In Proceedings of the 16th ACM/IEEE
International Conference on Information Processing in Sensor Networks (IPSN’17). IEEE, 299–300.
[176] Yong Zhang, Yu Zhang, Zhao Zhang, Jie Bao, and Yunpeng Song. 2018. Human activity recognition based on time
series analysis using U-Net. arXiv preprint arXiv:1809.08113 (2018).
[177] Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J. Leon Zhao. 2014. Time series classification using multi-channels deep
convolutional neural networks. In Proceedings of the International Conference on Web-age Information Management.
Springer, 298–310.
[178] Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2019. Zero-effort
cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th International Conference on Mobile Systems,
Applications, and Services. ACM, 313–325.
[179] Jun-Yan Zhu and Jim Foley. 2019. Learning to synthesize and manipulate natural images. IEEE Comput. Graph. Applic.
39, 2 (2019), 14–23.
[180] Han Zou, Yuxun Zhou, Jianfei Yang, Hao Jiang, Lihua Xie, and Costas J. Spanos. 2018. DeepSense: Device-free human
activity recognition via autoencoder long-term recurrent convolutional network. In Proceedings of the International
Conference on Communications (ICC’18). IEEE, 1–6.

Received January 2020; revised November 2020; accepted January 2021

