Deep Learning in Sensor-based Activity Recognition
The vast proliferation of sensor devices and Internet of Things enables the applications of sensor-based ac-
tivity recognition. However, there exist substantial challenges that could influence the performance of the
recognition system in practical scenarios. Recently, as deep learning has demonstrated its effectiveness in
many areas, plenty of deep methods have been investigated to address the challenges in activity recognition.
In this study, we present a survey of the state-of-the-art deep learning methods for sensor-based human activ-
ity recognition. We first introduce the multi-modality of the sensory data and provide information for public
datasets that can be used for evaluation in different challenge tasks. We then propose a new taxonomy to
structure the deep methods by challenges. Challenges and challenge-related deep methods are summarized
and analyzed to form an overview of the current research progress. At the end of this work, we discuss the
open issues and provide some insights for future directions.
CCS Concepts: • General and reference → Surveys and overviews; • Hardware → Sensor devices and
platforms; • Computer systems organization → Neural networks;
Additional Key Words and Phrases: Activity recognition, deep learning, sensors
ACM Reference format:
Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, and Yunhao Liu. 2021. Deep Learning for Sensor-
based Human Activity Recognition: Overview, Challenges, and Opportunities. ACM Comput. Surv. 54, 4, Ar-
ticle 77 (May 2021), 40 pages.
1 INTRODUCTION
Recent advances in human activity recognition have enabled myriad applications such as smart
homes [61], healthcare [79], and enhanced manufacturing [46]. Activity recognition is essential
to humanity, since it records people’s behaviors with data that allows computing systems to
ACM Computing Surveys, Vol. 54, No. 4, Article 77. Publication date: May 2021. 77
monitor, analyze, and assist their daily life. There are two main streams of human activity
recognition systems: video-based systems and sensor-based systems. Video-based systems use cameras to
take images or videos to recognize people’s behaviors [9]. Sensor-based systems use on-body
or ambient sensors to dead-reckon people’s motion details or log their activity tracks.
Considering the privacy issues of installing cameras in our personal space, sensor-based systems have
dominated the applications of monitoring our daily activities. Sensors also have the advantage of
pervasiveness. Thanks to the proliferation of smart devices and the Internet of Things, sensors can
be embedded in portable devices such as phones and watches, and in nonportable objects such as cars,
walls, and furniture. Sensors are widely embedded around us, uninterruptedly and non-intrusively
logging humans’ motion information.
inter-activity similarity [22]. Different activities may have similar characteristics (e.g., walk-
ing and running). Therefore, it is difficult to produce distinguishable features to represent
activities uniquely.
• Training and evaluation of learning techniques require large annotated data samples. How-
ever, it is expensive and time-consuming to collect and annotate sensory activity data.
Therefore, annotation scarcity is a remarkable challenge for sensor-based activity recog-
nition. Besides, data for some emergent or unexpected activities (e.g., accidental fall) is es-
pecially hard to obtain, which leads to another challenge called class imbalance.
• Human activity recognition involves three factors: users, time, and sensors. First, activity
patterns are person-dependent. Different users may have diverse activity styles. Second,
activity concepts vary over time. The assumption that users’ activity patterns remain un-
changed for a long period is impractical. Moreover, novel activities are likely to emerge
when in use. Third, diverse sensor devices are opportunistically configured on human bod-
ies or in environments. The composition and the layouts of sensors dramatically influence
the data stimulated by activities. All the three factors lead to distribution discrepancy
between the training data and test data and need to be mitigated urgently.
• The complexity of data association is another reason that makes recognition challenging.
Data association refers to how many users and how many activities the data is associated
with. There are many specific challenges in activity recognition that are driven by sophis-
ticated data association. The first challenge can be seen in composite activities. Most ac-
tivity recognition tasks are based on simple activities, like walking and sitting. However,
more meaningful ways to log human daily routines are composite activities that comprise a
sequence of atomic activities. For example, “washing hands” can be represented as {turning
on the tap, soaping, rubbing hands, turning off the tap}. A second challenge, raised by
composite activities, is data segmentation. Because a composite activity is defined as a sequence of
activities, accurate recognition relies heavily on precise data segmentation techniques.
Concurrent activities pose the third challenge. Concurrent activities occur when
a user participates in more than one activity simultaneously, such as answering a phone call
while watching TV. Multi-occupant activities are also associated with the complexity of
data association. Recognition is arduous when multiple users engage in a set of activities,
which usually happens in multi-resident scenarios.
• Another factor of concern is the feasibility of the human activity recognition
system. Because human activity recognition is so close to daily life, effort must be
devoted to making the system acceptable to a vast number of users. This concern is
twofold. First, the system should be resource-efficient so that it fits portable devices and
can give an instant response. Thus, the computational cost issue should be addressed.
Second, as the recognition system records users’ lives continuously, there is a risk of personal
information disclosure, which leads to the privacy issue.
• Unlike images or text, raw sensory data is not human-readable. Moreover, sensory data inevitably
includes considerable noise owing to the inherent imperfections of sensors. Reliable
recognition solutions should therefore offer interpretability of sensory data and the
capability of understanding which parts of the data facilitate recognition and which parts
degrade it.
carefully engineered and heuristic. There were no universal or systematic feature extraction
approaches to effectively capture distinguishable features for human activities.
In recent years, deep learning has achieved conspicuous success in modeling high-level
abstractions from intricate data [108] in many areas such as computer vision, natural language
processing, and speech processing. After early works, including References [54, 73, 153], examined
the effectiveness of deep learning in human activity recognition, related studies sprang up in this
area. Along with the development of deep learning in human activity recognition, the latest
works address the specific challenges. However, deep learning still meets
with reluctant acceptance by researchers owing to its abrupt success, bustling innovation, and lack
of theoretical support. Therefore, it is necessary to demonstrate the reasons behind the feasibility
and success of deep learning in human activity recognition despite the challenges.
• We summarize the state-of-the-art and how specific deep networks or deep techniques can
be applied to address the challenges with comprehensive analysis. We compare different
solutions for the same challenges and list the pros and cons. The challenge-method-analysis
format aims to build a problem-solution structure in the hope of suggesting a rough guideline
for readers who are selecting research topics or developing their approaches.
• Moreover, we provide information on available public datasets and their potential extension
to evaluate specific challenges.
• We discuss some open issues in this field and point out potential future research directions.
2.2 Datasets
There are several publicly available human activity recognition datasets. We summarize some of
the most popular ones in Table 1, which contains the data acquisition context, number of subjects,
number of activities, sensor types, and potential challenge tasks they can be used in. In the data ac-
quisition context, “daily living” refers to subjects performing common daily living activities under
instructions. The challenges are explained in detail in Section 3.
Table 1. Public Datasets for Human Activity Recognition
Dataset | Context | Subjects | Activities | Sensor Types | Challenge Tasks
OPPORTUNITY [24, 120] | Daily Living | 4 | 9 | Wearable, Object, Ambient | Multimodal, Composite Activity
CASAS-4 [131] | Real-world Home Living | 2 | 15 | Object, Ambient | Multi-occupant, Composite Activity, Multimodal
Smartwatch/Notch/Farseeing [91] | Daily Living & Fall Detection | 7 | 4 ADL & 4 Fall | Wearable | Class Imbalance
Darmstadt Daily Routines [59] | Real-world Routines | 1 | 35 | Wearable | Class Imbalance
MotionSense [88] | Daily Living | 24 | 6 | Wearable | Simple
MobiAct/MobiFall [141] | Daily Living & Fall Detection | 66 | 12 ADL & 4 Fall | Wearable | Multimodal
VanKasteren benchmark [139] | Real-world Home Living | 3 | 9 | Object | Simple
ActiveMiles (1) | Real-world Routines | 10 | 7 | Wearable | Multimodal
ActRecTut [22] | Hand Gesture & Playing Tennis | 2 | 12 | Wearable | Multimodal
(1) [Link]
Some researchers adopt traditional methods to extract temporal features and then use
deep learning techniques for the subsequent activity recognition. Basic signal statistics and
waveform traits, such as the mean and variance of time-series signals, are commonly applied handcrafted
features in early-stage deep learning activity recognition [142]. Features of this kind are coarse and
lack scalability. A more advanced temporal feature extraction approach exploits the changes in spectral
power over time by converting the time series from the time domain to the frequency
domain. A general example structure is shown in Figure 2(a), where a 2D-CNN is usually used to
process the spectral features. In Reference [65], Jiang and Yin applied the Short-time Discrete
Fourier Transform (STDFT) to time-series signals and constructed a time-frequency-spectral
image. A CNN then processes the image to recognize simple daily activities such as
walking and standing. More recently, Laput and Harrison [74] developed a fine-grained hand
activity sensing system by combining time-frequency-spectral features with CNNs.
They demonstrated 95.2% classification accuracy over 25 atomic hand activities of 12 people.
Spectral features can be used not only for wearable sensor activity recognition but also
for device-free activity recognition. Fan et al. [42] proposed time-angle spectrum
frames to represent the spectral power variations over time at different spatial angles of the
RFID signals.
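The time-to-frequency conversion described above can be illustrated with a minimal sketch. It is not the implementation of any cited work: the rectangular window, the naive DFT, the function name short_time_dft, and the synthetic sinusoid are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math

def short_time_dft(signal, window_len, hop):
    """Return a time-frequency magnitude image: one row per window, one column per bin."""
    image = []
    for start in range(0, len(signal) - window_len + 1, hop):
        frame = signal[start:start + window_len]
        row = []
        # A real-valued signal only needs the non-redundant bins 0..window_len//2.
        for k in range(window_len // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / window_len)
                for n, x in enumerate(frame)
            )
            row.append(abs(coeff))
        image.append(row)
    return image

# Synthetic single-axis signal (assumption): 4 cycles per 32-sample window.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spectrogram = short_time_dft(signal, window_len=32, hop=16)
```

Each row of spectrogram is the spectrum of one window, so the rows stacked over time form the kind of 2D image that a 2D-CNN can consume. In practice a tapered window and an FFT routine would normally replace the naive loops used here.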
Since one of the most attractive advantages of deep learning is its
power of automatic feature learning, extracting temporal features with a neural network is a natural
way to construct an end-to-end deep learning model. End-to-end learning simplifies the
training procedure and lets the feature learning and recognition processes reinforce each other. Various
deep learning approaches have been applied to temporal information extraction, including RNNs,
temporal CNNs, and their variants. The RNN is a widely applied deep temporal feature extraction
approach in many fields [92, 169]. Traditional RNN cells suffer from vanishing/exploding gradient
problems, which limits their application. Long Short-Term Memory (LSTM)
units, which overcome this issue, are usually used to build an RNN for temporal feature
extraction [45]. The depth of an effective LSTM-based RNN needs to be at least two when processing
sequential data [67]. As sensor signals are continuous streaming data, a sliding window is
generally used to segment the raw data into individual pieces, each of which is the input of an RNN cell
[32]. A typical LSTM-based structure for temporal feature extraction is illustrated in Figure 2(b).
The length and moving step of the sliding window are hyper-parameters that need to be
carefully tuned to achieve satisfactory performance. Beyond the early application of the basic LSTM
network, diverse RNN variants are also being investigated in the human
activity recognition field. The Bidirectional LSTM (Bi-LSTM) structure, which has two conventional
LSTM layers for extracting temporal dynamics in both the forward and backward directions, is an
important RNN variant in various domains including human activity recognition [61]. In
addition, Guan and Plötz [48] proposed an ensemble approach of multiple deep LSTM networks and
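The sliding-window segmentation used to feed recurrent models in temporal feature extraction can be sketched as follows. The function name, the window length, the step, and the toy tri-axial stream are illustrative assumptions rather than settings taken from any cited work.

```python
def sliding_windows(samples, window_len, step):
    """Split a stream of per-timestep readings into fixed-length, possibly overlapping windows."""
    return [
        samples[start:start + window_len]
        for start in range(0, len(samples) - window_len + 1, step)
    ]

# Ten timesteps of a hypothetical tri-axial accelerometer (x, y, z).
stream = [(t, t + 0.5, -t) for t in range(10)]
windows = sliding_windows(stream, window_len=4, step=2)
```

With window_len=4 and step=2, consecutive windows overlap by half. Both values are the hyper-parameters that the text notes must be tuned, and trailing samples that do not fill a whole window are dropped in this sketch.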
3.1.2 Multimodal Feature Extraction. Current research on human activity recognition
usually relies on multiple different sensors, such as accelerometers, gyroscopes, and
magnetometers. Some research has further demonstrated that combining diverse sensing
modalities can achieve better results than using a single sensor [51]. As a result, learning the
inter-modality correlations along with the intra-modality information is a major challenge in the field of
deep learning-based human activity recognition. Sensing modality fusion can follow
two strategies: Feature Fusion (Figure 3(a)), which combines different modalities to
produce single feature vectors for classification; and Classifier Ensemble (Figure 3(b)), in which
the outputs of classifiers that each operate on the features of one modality are blended together.
Münzner et al. [95] investigated the feature fusion manners of deep neural networks for
multimodal activity recognition. They organized the fusion manners into four categories according to
the fusion stage within a network. However, their study focuses on CNN-based architectures
only. Here, we extend their definitions of feature fusion manners to all deep learning architectures
and aim to reveal more insights and specific considerations.
Early Fusion (EF) (Figure 4(a)). This manner fuses the data of all sources at the beginning,
irrespective of sensing modality. It is attractive for its simplicity, though it risks
missing detailed correlations. A simple fusion approach in Reference [76] transformed the raw x,
y, and z acceleration data into a magnitude vector by calculating the Euclidean norm of the x, y, and z
values. Gu et al. [47] stacked the time-series signals of different modalities horizontally into a single
1D vector and used a denoising autoencoder to learn robust representations. The output of the
intermediate layer was fed to the final softmax classifier. In contrast, Ha et al. [54] proposed
to vertically stack all signal sequences to form a 2D matrix and directly applied 2D-CNNs to
simultaneously capture both local dependencies over time and spatial dependencies over
modalities. In Reference [52], the authors preprocessed the raw signal sequence of a single modality
into a 2D format by simple reorganization and stacked all modalities along the depth dimension
to obtain 3D data matrices. Afterwards, they applied a 3D-CNN to exploit the inter- and
intra-modality features. However, a conventional CNN can only explore the correlations of
modalities arranged next to each other and thus misses the relations between nonadjacent
modalities. To solve this issue, rather than keeping the natural order of the data sources, Jiang and Yin [65]
assembled signal sequences of different modalities into a novel arrangement in which every signal sequence
has the chance to be adjacent to every other sequence. This organization helps the DCNN
extract elaborate correlations of individual sensing axes. Dilated convolution is another solution
for exploiting nonadjacent modalities without information loss or extra computational expense
[150]. In addition to wearable sensors, RFID-based activity recognition requires the fusion of
multiple RFID signals as well, and CNNs are also commonly used for early fusion there [80].
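Two of the early-fusion arrangements described above, the per-timestep Euclidean norm and the vertical stacking of sequences into a 2D matrix, can be sketched as below. The helper names and the toy readings are assumptions for illustration only.

```python
import math

def magnitude(x, y, z):
    """Fuse three acceleration axes into one sequence via the per-timestep Euclidean norm."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]

def stack_vertically(*sequences):
    """Stack equal-length signal sequences as rows of a 2D matrix (axes by time)."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [list(seq) for seq in sequences]

# Toy readings (assumption): two timesteps of acceleration plus one gyroscope axis.
acc_x, acc_y, acc_z = [3.0, 0.0], [4.0, 0.0], [0.0, 2.0]
gyro_x = [0.1, 0.2]

fused_1d = magnitude(acc_x, acc_y, acc_z)                 # one value per timestep
fused_2d = stack_vertically(acc_x, acc_y, acc_z, gyro_x)  # 4 rows, 2 timesteps
```

The magnitude vector discards direction, which illustrates how early fusion can lose detailed correlations, whereas the stacked matrix keeps every axis but makes row adjacency matter for a 2D convolution.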
Sensor-based Fusion (SF) (Figure 4(b)). In contrast to EF, SF first considers each modality in-
dividually and then fuses different modalities afterwards. Such an architecture not only extracts
modality-specific information from various sensors but also allows flexible complexity distribu-
tion, since the structures of the modality-specific branches can be different. In References [113,
114], Radu et al. proposed a fully connected deep neural network (DNN) architecture to facili-
tate the intra-modality learning. Independent DNN branches are assigned to each sensor modality,
and a unifying cross-sensor layer merges all the branches to uncover the inter-modality informa-
tion. Yao et al. [158] vertically stacked all axes of a sensor to form 2D matrices and designed
individual CNNs for each 2D matrix to learn the intra-modality relations. The sensor-specific fea-
tures of different sensors are then flattened and stacked into a new 2D matrix before being fed into
a merge CNN that further extracts the interactions among different sensors. A more advanced
fusion approach was proposed by Choi et al. [35] to efficiently fuse different modalities by
regulating the level of contribution of each sensor. The authors designed a confidence calculation layer
that automatically determines the confidence score of a sensing modality; the confidence
score is then normalized and multiplied with the pre-processed features before the features are fused
by addition. Instead of fusing sensor-specific features only at the late stage, Ha and Choi [53]
proposed to also create a vector of different modalities at the early stage and to extract the common
characteristics across modalities along with the sensor-specific characteristics; both kinds of
features are then fused in the later part of the model.
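The confidence-regulated fusion described above (normalize the per-modality confidence scores, scale each modality's features, then add) can be sketched as follows. This is a hedged illustration, not the cited architecture: the softmax normalization and the function name fuse_by_confidence are assumptions, since the text only states that the scores are normalized.

```python
import math

def fuse_by_confidence(features, scores):
    """Weight each modality's feature vector by its normalized confidence, then sum."""
    # Softmax normalization is an assumption; subtracting the peak keeps exp() stable.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights

# Two hypothetical modalities with equal confidence scores.
fused, weights = fuse_by_confidence([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0])
```

A modality with a low score contributes little to the sum, which is how such a layer can suppress an unreliable sensor without removing its branch.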
Axis-based Fusion (AF) (Figure 4(c)). This manner treats signal sources in more detail by
handling each sensor axis separately, which eliminates the interference between different sensor
axes. Reference [95] referred to this manner as Channel-based late fusion (CB-LF).
Nevertheless, the sensor channel may be confused with the “channel” in CNNs, so we use the
term “axis” instead in this article. A commonly used AF strategy is to design a specific neural
network for each univariate time series of each sensing channel [163, 177]. The information
representations from all channels are finally concatenated as input to a final classification
network. 1D-CNNs are widely used as the feature learning network of each sensing channel.
Dong and Han [38] proposed to use separable convolution operations to extract the specific
temporal features of each axis and to concatenate all the features before feeding a fully connected
layer. In studies that apply deep learning to hand-crafted features, axis-specific processing is
a requirement. For instance, in Reference [62], temporal features of acceleration and gyro signals
are first represented by FFT spectrogram images and then vertically combined into a larger image
for the following DCNN to learn inter-modality features. Furthermore, some research combined
the spectrogram images along the depth dimension to establish a 3D format [74], which can be
easily handled by 2D CNNs with the depth dimension.
Shared-filter Fusion (SFF) (Figure 4(d)). Like the AF approach, this manner processes the
univariate time-series data of each sensor axis independently. However, the same filter is applied to
all time sequences. Therefore, the filters are influenced by all input members. Compared to the AF
manner, SFF is simpler and contains fewer trainable parameters. The most popular SFF approach
is to organize the raw sensing sequences into a 2D matrix by stacking them along the modality
dimension and then to apply a 2D-CNN with 1D filters to the 2D matrix [39, 153, 161]. As a result,
the architecture is equivalent to applying identical 1D-CNNs to different univariate time series.
Although the features of all sensing modalities are not merged explicitly, they communicate through
the shared 1D filters.
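The equivalence noted above, a 2D convolution with 1D filters acting as the same 1D filter applied to every axis, can be made concrete with a small sketch. The function names, the difference kernel, and the toy signals are illustrative assumptions.

```python
def conv1d_valid(sequence, kernel):
    """Valid-mode 1D cross-correlation, the operation CNN layers compute."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def shared_filter(matrix, kernel):
    """Apply one 1D filter to every row (sensor axis) of an axes-by-time matrix."""
    return [conv1d_valid(row, kernel) for row in matrix]

signals = [
    [1.0, 2.0, 3.0, 4.0],   # e.g., one accelerometer axis
    [0.0, 1.0, 0.0, 1.0],   # e.g., one gyroscope axis
]
feature_maps = shared_filter(signals, kernel=[1.0, -1.0])
```

Only the two kernel weights are trainable here regardless of how many axes are stacked, which is the parameter saving over axis-based fusion; during training, gradients from every row would update the same kernel.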
Classifier Ensemble. In addition to fusing features before inference, multiple modalities can
also be integrated by blending the recognition results from each modality. A
range of ensemble approaches has been developed for fusing recognition results to yield an
overall inference. For example, Guo et al. [51] proposed to use MLPs to create a base classifier for each
sensing modality and to combine all classifiers by assigning ensemble weights at the classifier
level. When building the base classifiers, the authors not only considered the recognition accuracy
but also emphasized the diversity of the base classifiers by introducing diversity measures. Thus, the
diversity of different modalities is preserved, which is critical to overcoming overfitting and
to improving the overall generalization ability. Besides the conventional classifier ensemble, Khan
et al. [69] targeted the fall detection problem and introduced an ensemble of the reconstruction
errors from the autoencoder of each sensing modality.
The most attractive benefit of the classifier ensemble method is its scalability to additional
sensors. A well-developed model of a specific sensing modality can be easily merged into an existing
system by configuring the ensemble part only. Conversely, when a sensor is removed from a system,
the recognition model can be readily adapted to this hardware change. Nevertheless, an intrinsic
shortcoming of ensemble fusion is that the inter-modality correlations may be underestimated
due to the late fusion stage.
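A weighted blend of per-modality predictions, and the ease of dropping a sensor, can be sketched as follows. The weighted-average rule, the function name ensemble_predict, and the probability values are assumptions for illustration; the cited works may combine classifiers differently.

```python
def ensemble_predict(probabilities, weights):
    """Blend per-modality class probabilities; return (predicted class index, blended scores)."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    blended = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=blended.__getitem__), blended

# Hypothetical outputs of three base classifiers over two activity classes.
acc, gyro, mag = [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]

label_all, scores_all = ensemble_predict([acc, gyro, mag], [1.0, 1.0, 1.0])
# Removing the gyroscope only changes the inputs to the ensemble step.
label_no_gyro, scores_no_gyro = ensemble_predict([acc, mag], [1.0, 1.0])
```

No base classifier is retrained when the sensor set changes, but each classifier sees only its own modality, so cross-modality correlations never reach the model.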
3.1.3 Statistical Feature Extraction. Unlike deep learning-based feature extraction,
feature engineering-based methods are able to extract meaningful features, such as statistical
information. However, domain knowledge is usually required to design such
features manually. In Reference [110], a kernel embedding-based solution is proposed to extract all
statistical information of the activity data. However, spatial and temporal information is not considered
in their model. Recently, Qian et al. [111] developed a Distribution-Embedded Deep
Neural Network (DDNN) to integrate the statistical features with spatial and temporal
information in an end-to-end deep learning framework for activity recognition. It encodes the idea of
kernel embedding of distributions into a deep architecture, such that all orders of statistical
moments can be extracted as features to represent each segment of sensor readings and then
combined with conventional spatial and temporal deep features for activity classification in an
end-to-end training manner. The authors used an autoencoder to guarantee the injectivity of
the feature mapping. They also introduced an extra loss function based on the maximum mean
discrepancy (MMD) to force the autoencoder to learn good feature representations of inputs. Extensive
experiments on four datasets demonstrated the effectiveness of the statistical feature extraction methods.
Although extracting statistical features has been explored in a deep-learning-based way, more reasonable and
meaningful explanations of the extracted features are still undeveloped.
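As a rough illustration of the MMD quantity mentioned above, the sketch below computes the standard biased empirical estimate of the squared MMD between two samples. The Gaussian (RBF) kernel, the gamma value, and the function names are assumptions; how the cited work uses the distance inside its loss is not reproduced here.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased empirical estimate of squared MMD between two samples of vectors."""
    def mean_kernel(ps, qs):
        return sum(rbf(p, q, gamma) for p in ps for q in qs) / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy 1D feature samples (assumption): identical versus shifted distributions.
same = mmd_squared([(0.0,), (1.0,)], [(0.0,), (1.0,)])
shifted = mmd_squared([(0.0,), (1.0,)], [(5.0,), (6.0,)])
```

The estimate is zero when the two samples coincide and grows as they separate, which is why it can serve as a penalty that pulls two feature distributions together.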
The technologies for feature extraction have their strengths and weaknesses. A summary of the
advantages and limitations of different technologies is presented in Table 2.
Table 2. Advantages and Limitations of Different Works for Feature Extraction Approaches
readings due to hardware issues, making the sensor data temporally sparse, which requires a specific
neural network structure to resolve [2]. Furthermore, it is more challenging to assign labels
to a large amount of data. First, the annotation process is expensive, time-consuming, and very
tedious. Second, labels are subject to various sources of noise, such as sensor noise, segmentation
issues, and the variation of activities across different people, which makes the annotation process
error-prone. Therefore, researchers have begun to investigate unsupervised learning and semi-
supervised learning approaches to reduce the dependence on massive annotated data.
3.2.1 Unsupervised Learning. Unsupervised learning is mainly used for exploratory data
analysis to discover patterns among data. In Reference [77], the authors examined the feasibility of
incorporating unsupervised learning methods in activity recognition, but the activity
recognition community still needs more effective methods to deal with high-dimensional and
heterogeneous sensory data.
Recently, deep generative models including Deep Belief Networks (DBNs) and autoencoders
have become dominant for unsupervised learning. DBNs and autoencoders are composed of mul-
tiple layers of hidden units. They are useful in extracting features and finding patterns in massive
data. Also, deep generative models are more robust against overfitting problems as compared to
discriminative models [93]. So, researchers tend to use them for feature extraction to exploit unla-
beled data, as it is easy and cheap to collect unlabeled activity datasets. According to Erhan et al.
in Reference [41], a generative pretraining of a deep model guides the discriminative training to
3.2.2 Semi-supervised Learning. Semi-supervised learning has shown a growing trend in
activity recognition because of the difficulty in obtaining labeled data [156]. A semi-supervised method
requires little labeled data but massive unlabeled data for training. How to utilize unlabeled data
to reinforce the recognition system has become a point of interest. Some works have
applied classic semi-supervised learning methods, such as manifold learning, to activity
recognition [86, 112]. Recently, as deep learning is powerful in capturing patterns from data, various
semi-supervised learning schemes have been incorporated for activity recognition, such as co-training, active
learning, and data augmentation.
Co-training was proposed by Blum and Mitchell in 1998 [20]. It was an extension of self-
learning. In self-learning approaches, a weak classifier is first trained with a small amount of
labeled data. This classifier is used for classifying the unlabeled samples. The samples with high
confidence can be labeled and added to the labeled set for re-training the classifier. In co-training,
multiple classifiers are employed, each of which is trained with one individual view of training
data. Likewise, the classifiers select unlabeled samples to add to the labeled set by confidence score
or majority voting. The whole process of co-training can be seen in Figure 5(a). With the training
set augmented, the classifiers are enhanced. Blum and Mitchell [20] suggested that co-training is
fully effective under three conditions: (a) multiple views of training data are not strongly corre-
lated, (b) each view contains sufficient information for learning a weak classifier, (c) the views are
mutually redundant. With respect to sensor-based human activity recognition, co-training is a
natural fit, because multiple modalities can be regarded as multiple views. Chen et al. [29] applied
co-training with multiple classifiers on different modalities of the data. Three classifiers are trained
on acceleration, angular velocity, and magnetism, respectively. The learned classifiers are used for
predicting the unlabeled data after each training round. If most of the classifiers reach an agree-
ment on predicting an unlabeled sample, then this sample is labeled and moved to the labeled set
for the next training round. The training flow is repeated until no confident samples can be la-
beled or the unlabeled set is empty. Then a new classifier is trained on the final labeled set with
all modalities.
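The co-training loop described above (one classifier per modality, majority agreement to pseudo-label, repeat until nothing more can be labeled) can be sketched as follows. The nearest-centroid classifiers, the one-scalar-feature-per-view samples, and all names are simplifying assumptions standing in for the deep models of the cited work; the final retraining on all modalities is omitted.

```python
from collections import Counter

def fit_centroids(values, labels):
    """Nearest-centroid stand-in classifier for one view: mean feature value per class."""
    sums, counts = {}, Counter(labels)
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, value):
    return min(centroids, key=lambda y: abs(centroids[y] - value))

def co_train(labeled, labels, unlabeled, max_rounds=10):
    """Samples are tuples holding one scalar feature per view (modality)."""
    labeled, labels, unlabeled = list(labeled), list(labels), list(unlabeled)
    n_views = len(labeled[0])
    for _ in range(max_rounds):
        if not unlabeled:
            break
        # Train one classifier per view on the current labeled set.
        models = [
            fit_centroids([s[v] for s in labeled], labels) for v in range(n_views)
        ]
        remaining = []
        for sample in unlabeled:
            votes = Counter(predict(m, sample[v]) for v, m in enumerate(models))
            label, count = votes.most_common(1)[0]
            if count > n_views / 2:   # most classifiers agree: accept the pseudo-label
                labeled.append(sample)
                labels.append(label)
            else:
                remaining.append(sample)
        if len(remaining) == len(unlabeled):   # no confident sample this round
            break
        unlabeled = remaining
    return labeled, labels, unlabeled

# Three views, e.g., acceleration, angular velocity, magnetism (toy values).
seed = [(0.0, 0.1, 0.0), (1.0, 0.9, 1.0)]
seed_labels = ["sit", "walk"]
pool = [(0.1, 0.2, 0.9), (0.9, 0.8, 0.2)]
final_set, final_labels, leftover = co_train(seed, seed_labels, pool)
```

In the toy pool the third view disagrees with the other two for both samples, yet the majority vote still assigns a label, which also shows how a wrong majority could introduce label errors.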
Co-training is like human learning. People can learn new knowledge from existing experience,
and new knowledge can be used to summarize and accumulate experience. Experience and knowl-
edge constantly interact with each other. Similarly, co-training uses current models to select new
samples that they can learn from, and the samples help to train the models for the next selection.
However, automatic labeling may introduce errors. Acquiring correct labels can improve accuracy.
Active learning is another category of semi-supervised learning. Unlike self-learning
and co-training, which label the unlabeled samples automatically, active learning requires
annotators, usually experts or users, to label the data manually. To lighten the burden of labeling,
active learning aims to select the most informative unlabeled instances for annotators to
label and to improve the classifiers with these data, so that minimal human supervision is needed. Here,
the most informative instances are those that would have the greatest impact on the
model if their labels were available. A general framework of active learning can be seen in Figure 5(b).
It includes a classifier, a query strategy, and an annotator. The classifier learns from a small amount
of labeled data, selects one or a set of the most useful unlabeled samples via the query strategy, asks the
annotator for true labels, and uses the new labels for further training and the next query. The
active learning process is also a loop, which stops when a stopping criterion is met. There are two common
query strategies for selecting the most profitable samples: uncertainty and diversity.
Uncertainty can be measured by information entropy. Larger entropy means higher uncertainty
and greater informativeness. Diversity means that the queried samples should be comprehensive,
and that the information they provide is non-repetitive and non-redundant. In Reference [133],
the authors applied two query strategies. One selects the samples with the lowest prediction
confidence; the other borrows the idea of co-training but, conversely, selects the samples
with high disagreement among classifiers.
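The entropy-based uncertainty query described above can be sketched in a few lines. The function names, the labeling budget, and the predicted distributions are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def query_most_uncertain(predictions, budget):
    """Return indices of the `budget` unlabeled samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical classifier outputs for three unlabeled samples.
predictions = [
    [0.98, 0.01, 0.01],   # confident prediction
    [0.34, 0.33, 0.33],   # nearly uniform, hence most uncertain
    [0.70, 0.20, 0.10],
]
to_annotate = query_most_uncertain(predictions, budget=2)
```

The selected indices would be sent to the annotator, and the returned labels added to the training set before the next query round. A diversity criterion would have to be added separately, since entropy alone can pick near-duplicate samples.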
Deep active learning approaches have been deployed in activity recognition [57, 58]. Hossain et al. [57]
observed that traditional active learning methods merely choose the most informative samples,
which occupy only a small fraction of the available data. In this way, a large number of samples are
discarded. Although the selected samples are vital for training, the discarded samples are also
valuable because of their substantial number. Therefore, they proposed a new method to combine
active learning and deep learning in which not only are the most informative unlabeled samples
queried but the less informative samples are also leveraged. The data is first clustered with K-means
clustering. While the intuitive idea is to query the optimal samples such as the centroids of the
clusters, in this work, the neighboring samples are also queried. The experiments show that the
proposed method can achieve the optimal results by labeling 10% of the data.
Hossain and Roy [58] further investigated two problems of deep active learning in human
activity recognition. The first problem is that outliers can easily be mistaken for informative samples.
When entropy is used for selection, a large entropy may indicate an outlier rather than an
informative sample, because outliers belong to none of the classes. Therefore, a joint loss function was
proposed in Reference [58] to address this problem: cross-entropy loss and information loss are
jointly minimized to reduce the entropy of outliers. The second problem considered in this work is
how to reduce the workload of annotators, as annotators need domain knowledge
to provide accurate labels. Multiple annotators are employed in this work, selected from
people close to the users. The annotator selection is made by a reinforcement learning algorithm
according to the discrepancy and the relations of users. Contextual similarity is used to measure
the relations among users and annotators. The experimental results show that this work achieves an
8% improvement in accuracy and a higher convergence rate.
Co-training and active learning are based on the same idea of rebuilding the model using labels
obtained for unlabeled data. Data augmentation by synthesizing new activity data is an alternative when
data collection is challenging, for example in resource-limited or high-risk scenarios.
It generates a large amount of synthetic data from a small
amount of real data so that the synthetic data can help train the models. One popular tool is the Generative
Adversarial Network (GAN), first introduced in Reference [44]. GANs are powerful
in synthesizing data that follow the distribution of the training data. A GAN is composed of two parts,
a generator and a discriminator. The generator creates synthetic data, and the discriminator evaluates
them for authenticity. The goal of the generator is to generate data that are realistic enough
to fool the discriminator, while the goal of the discriminator is to identify data generated by
the generator as fake. The two are trained adversarially, in a minimax game.
During training, the generator and the discriminator mutually improve their performance in
generation and discrimination. Variants of GANs have been applied to different fields, such as
language generation [109] and image generation [179].
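For reference, the adversarial training objective of the original GAN formulation is commonly written as the value function below. The symbols are the conventional ones and are not defined elsewhere in this survey: G is the generator, D the discriminator, p_data the distribution of real data, and p_z the noise prior from which the generator input z is drawn.

```latex
\min_{G}\max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes V by producing samples that the discriminator scores as real.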
The first work on data augmentation by synthesizing sensory data for activity recognition
is SensoryGANs [144]. As sensory data is heterogeneous, a unified GAN may not
be enough to depict the complex distribution of different activities, so Wang et al. employed three
activity-specific GANs for three activities. After generation, the synthetic data are fed into classifiers
together with the original data. We should note that although this work uses deep generative
networks, the generation process depends on labels, so the process is not unsupervised. Zhang et al.
[174] proposed to use a semi-supervised GAN for activity recognition. Different from a regular GAN,
the discriminator in a semi-supervised GAN performs a (K + 1)-class classification that covers both activity
classification and fake data identification. To drive the distribution of the generated data
toward the real distribution, Variational AutoEncoders (VAEs) provide a prearranged distribution
as the generator input instead of Gaussian noise. The aim of the VAEs is to provide distributions that
represent the distributions of the input data. Moreover, VAE++ was proposed to guarantee that the
inputs are exclusive for each training sample. Overall, the unified framework combining VAE++
and semi-supervised GAN proves to be effective in activity recognition.
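To make the (K + 1)-class discriminator output concrete, the sketch below shows one way such an output could be interpreted. It assumes, purely for illustration, that the first K logits correspond to activity classes and the last logit to the "fake" class; the function name and the toy logits are hypothetical and do not reproduce the architecture of Reference [174].

```python
import numpy as np

def split_discriminator_output(logits):
    """Interpret (N, K+1) logits: columns 0..K-1 are activities, column K is 'fake'."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_fake = probs[:, -1]                            # probability that the sample is synthetic
    activity = np.argmax(probs[:, :-1], axis=1)      # most likely activity among the K real classes
    return activity, p_fake

# Hypothetical logits for K = 3 activities plus one 'fake' class.
logits = np.array([
    [2.0, 0.1, 0.1, -1.0],  # looks like a real sample of activity 0
    [0.0, 0.0, 0.0, 3.0],   # looks synthetic
])
activity, p_fake = split_discriminator_output(logits)
print(activity, p_fake)
```

A single softmax thus serves both purposes: the activity prediction is read from the first K entries, and real-versus-fake identification from the last one.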
Table 3 summarizes recent deep learning works for annotation scarcity in activity recognition
and their advantages and disadvantages.
activity recognition follows a supervised learning manner, which requires a significant amount
of labeled data to train a deep model. However, some sensor data of specific activities are chal-
lenging to obtain, such as those related to falls of elderly people. In addition, raw data recorded
from unconstrained conditions is naturally class-imbalanced. When trained on an imbalanced dataset,
conventional models tend to predict the majority class while
ignoring classes with few available training samples. Therefore, it is urgent to address the
class imbalance issue to develop an effective activity recognition model. Methods for dealing
with class imbalance can be divided into two groups.
3.3.1 Data Level. The most intuitive way to tackle the imbalance problem is to down-sample the
class with the largest number of samples [5]. However, such a method risks reducing the
total amount of training data and omitting critical samples with distinctive characteristics.
In contrast, adding augmented samples to the minority classes not
only keeps all original samples but also enhances the models’ robustness. Grzeszick et al. [46] utilized
two augmentation methods, Gaussian noise perturbation and interpolation, to tackle the problem
of class imbalance. These augmentations preserve the coarse structure of the data
while simulating a random time jitter in the sensor’s sampling process. They created a larger number
of samples for the under-represented classes and ensured that each class has at least a certain
percentage of the data in the training set.
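A data-level augmentation of this kind can be sketched as follows. This is an illustrative sketch only, not the procedure of Reference [46]: the function names, the noise level `sigma`, the time shift `max_shift`, and the rule that every class should reach `min_fraction` of the original set size are all assumptions.

```python
import numpy as np

def jitter(window, sigma, rng):
    """Add zero-mean Gaussian noise to every value of a (time, channels) window."""
    return window + rng.normal(0.0, sigma, size=window.shape)

def interpolate_shift(window, max_shift, rng):
    """Resample the window at slightly shifted time points via linear interpolation."""
    t = np.arange(window.shape[0], dtype=float)
    shifted = np.clip(t + rng.uniform(-max_shift, max_shift, size=t.shape), 0, t[-1])
    return np.stack(
        [np.interp(shifted, t, window[:, c]) for c in range(window.shape[1])], axis=1
    )

def oversample_minority(X, y, min_fraction, sigma=0.05, max_shift=0.5, seed=0):
    """Augment classes until each has at least min_fraction * len(y) samples."""
    rng = np.random.default_rng(seed)
    target = int(np.ceil(min_fraction * len(y)))
    new_X, new_y = [X], [y]
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        missing = target - len(idx)
        if missing <= 0:
            continue  # class is already large enough
        picks = rng.choice(idx, size=missing, replace=True)
        augmented = []
        for k, i in enumerate(picks):
            # Alternate between the two augmentation methods.
            if k % 2 == 0:
                augmented.append(jitter(X[i], sigma, rng))
            else:
                augmented.append(interpolate_shift(X[i], max_shift, rng))
        new_X.append(np.stack(augmented))
        new_y.append(np.full(missing, label))
    return np.concatenate(new_X), np.concatenate(new_y)

# Hypothetical dataset: 100 windows of 50 timesteps and 3 channels, 90/10 imbalance.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50, 3))
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = oversample_minority(X, y, min_fraction=0.3)
print(X_bal.shape, np.bincount(y_bal))
```

All original windows are kept; only the under-represented class receives perturbed copies, which is the property the text contrasts with down-sampling.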
3.3.2 Algorithmic Level. Another direction for addressing class imbalance is to modify the
model-building strategy instead of directly balancing the training dataset. In Reference [48], Guan
and Plötz utilized the F1-score rather than the conventional cross-entropy as the loss function to
address the imbalance problem. Because the F1-score considers both recall and precision,
classes with different numbers of training samples are taken into account equally. Besides
affecting the original datasets, class imbalance is also a non-negligible problem for semi-supervised
frameworks, as the process of gradually labeling unlabeled samples may produce uneven numbers
of new labels across classes. Chen et al. [29] addressed class imbalance in small labeled
datasets. They leveraged a semi-supervised framework, co-training, to enrich the labeled set in
cyclic training rounds. To balance the training samples across classes while maintaining
the distributions of the samples, a pattern-preserving strategy was applied before the
training phase of the co-training framework. K-means clustering was first adopted to mine the latent
activity patterns of each activity. Then, sampling was applied to each pattern. The main goal
is to guarantee that all patterns of all activities contain equal numbers of samples. A summary of the
advantages and limitations of different works for resolving class imbalance is presented in Table 4.
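The idea of an F1-based loss at the algorithmic level can be illustrated with a differentiable ("soft") macro F1 computed from predicted probabilities. This relaxation is a commonly used surrogate and is an assumption here; it is not necessarily the exact formulation of Reference [48], and the function name and toy data are hypothetical.

```python
import numpy as np

def soft_macro_f1_loss(probs, labels, eps=1e-8):
    """Return 1 - macro-averaged soft F1.

    probs:  (N, K) predicted class probabilities.
    labels: (N,) integer class ids.
    """
    onehot = np.eye(probs.shape[1])[labels]
    tp = np.sum(probs * onehot, axis=0)        # soft true positives per class
    fp = np.sum(probs * (1 - onehot), axis=0)  # soft false positives per class
    fn = np.sum((1 - probs) * onehot, axis=0)  # soft false negatives per class
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1.mean()                     # every class contributes equally

# Hypothetical imbalanced batch: nine samples of class 0, one of class 1.
labels = np.array([0] * 9 + [1])
always_majority = np.tile([1.0, 0.0], (10, 1))  # classifier that ignores class 1
perfect = np.eye(2)[labels]

loss_majority = soft_macro_f1_loss(always_majority, labels)
loss_perfect = soft_macro_f1_loss(perfect, labels)
print(loss_majority, loss_perfect)
```

Although always predicting the majority class reaches 90% accuracy on this batch, its macro F1 loss stays above 0.5 because the minority class receives the same weight as the majority class.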
the system has never encountered (e.g., new activities, new users, new sensors). In the following
sections, we introduce in detail three categories of discrepancy and how the state-of-the-art
approaches mitigate them. Most of these approaches are based on transfer learning.
3.4.1 Distribution Discrepancy with Users. Owing to biological and environmental factors, the
same activity can be performed differently by different individuals. For example, some people walk
slowly, and some prefer to walk faster and more dynamically. Since people have diverse behavior
patterns, data from different users follow different distributions. Usually, if the models are trained and
tested with data collected from the same user, the accuracy can be rather high. However,
this setting is impractical. In practical human activity recognition scenarios, while a certain number
of participants’ data can be collected and annotated for training, the target users are usually
unseen by the systems. The distribution divergence between the training data and the test data
is therefore a challenge in human activity recognition, and the performance of the models falls
dramatically across users. Research on personalized models for specific users is thus important.
Recently, personalized deep learning models for distribution discrepancy among users in activity
recognition have been explored. Woo et al. [148] proposed an approach that builds an RNN model
for each individual. Learning Hidden Unit Contributions (LHUC) was applied in Reference
[90], where a particular layer with few parameters is inserted between every two hidden layers of
a CNN, and these parameters are trained using a small amount of data. Rokni et al. [121] proposed to
personalize their models with transfer learning. In the training phase, a CNN is first trained with
data collected from a few participants (source domain). In the test phase, only the top layers of
the CNN are fine-tuned with a small amount of data from the target users (target domain), so
annotation for target users is required. GANs are also useful for addressing distribution discrepancy
among users. In Reference [132], the authors generated data of the target domain directly from
the source domain with GANs to enhance the training of the classifier. Chen et al. [27] further
defined person-specific discrepancy and task-specific consistency for people-centric sensing
applications. Person-specific discrepancy means the distribution divergence of data collected from
different people, and task-specific consistency denotes the inherent similarity of the same activity.
They proved that reducing person-specific discrepancy while preserving task-specific consistency
guarantees the recognition accuracy after transfer. Reference [30] combines activity recognition
and user recognition in a multi-task model. The proposed method shares parameters between
the activity module and the user module so that the activity recognition performance can be
boosted by features learned from the user recognition module. To transfer important knowledge
between the two modules, a mutual attention mechanism is deployed.
3.4.2 Distribution Discrepancy with Time. Human activity recognition systems collect dynamic,
streaming data that logs people’s motions. In a real-world recognition system, initial training
data that portrays a set of activities is collected to train an original model, and the model is then
deployed for future activity recognition. In long-term systems that run for months or
even years, a natural issue is that the streaming sensory data changes
over time. Three problems arise from the distribution discrepancy with time, depending on
the extent of the change and on how far the new concepts in the data need to be recognized. They are
the concept drift problem, the concept evolution problem, and the open-set problem.
Concept Drift. Figure 6(a) shows the first problem of distribution discrepancy with time in ac-
tivity recognition called concept drift [127]. It denotes the distribution shift between the source
domain and the target domain. Concept drift can be abrupt or gradual [1]. To accommodate the
drift, deep learning models should incorporate incremental training to continuously learn new
concepts of human activities from newly arriving data. For example, an ensemble classifier termed
multi-column bi-directional LSTM was proposed in Reference [136]. The model leverages new
training samples gradually via incremental learning. Active learning is a special type of incremental
learning. In streaming data systems, active learning queries the ground truth for samples when a
change is detected, and it selects the most useful samples to update the models for the
new concepts. Active learning can therefore help deep learning models mitigate the
discrepancy with time of the streaming sensory data [49, 126]. Along this line, Gudur et al. [49] proposed
a deep Bayesian CNN with dropout to obtain the uncertainties of the model and select the most
informative data points to be queried according to the uncertainty query strategy. Owing to
active learning, the model can be updated continuously and captures the changes of data over
time.
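Estimating model uncertainty with dropout can be sketched as below. The sketch uses a tiny fully connected network with random weights as a stand-in, which is an assumption made to keep the example self-contained; Reference [49] uses a Bayesian CNN, and the layer sizes, dropout rate, number of stochastic passes, and query budget here are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mc_dropout_predict(x, W1, W2, drop_rate, n_passes, rng):
    """Average softmax outputs over stochastic forward passes with dropout kept on."""
    outputs = []
    for _ in range(n_passes):
        h = np.maximum(x @ W1, 0.0)              # hidden ReLU layer
        mask = rng.random(h.shape) >= drop_rate  # fresh Bernoulli dropout mask per pass
        h = h * mask / (1.0 - drop_rate)         # inverted dropout scaling
        outputs.append(softmax(h @ W2))
    return np.mean(outputs, axis=0)

def most_uncertain(mean_probs, budget):
    """Indices of the samples whose averaged prediction has the highest entropy."""
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:budget]

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))   # 20 unlabeled windows with 8 hypothetical features
W1 = rng.normal(size=(8, 16))  # stand-in for trained weights
W2 = rng.normal(size=(16, 4))  # 4 hypothetical activity classes
mean_probs = mc_dropout_predict(x, W1, W2, drop_rate=0.5, n_passes=30, rng=rng)
query = most_uncertain(mean_probs, budget=5)
print(query)
```

In a streaming setting, the queried windows would be labeled and used to update the model incrementally, so that it follows the drift in the data.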
Concept Evolution. Figure 6(b) represents the distribution of concept evolution. Concept evolution
denotes the emergence of new activities in the streaming data. It
arises because collecting labeled data for all kinds of activities in the initial learning phase
is impractical. First, despite the effort, the initial training set in an activity recognition system can
only contain a limited number of activities. Second, people can perform new activities that
they never did before the initial training of the activity recognition system (e.g., learning to play
guitar for the first time). Third, it is difficult to collect data for certain activities, such as people falling
down. However, these activities may still appear in the test or the application phase. Thus, in the
application phase, the concepts of the new activities still need to be learned, and it is essential to study
activity recognition systems that can recognize new activities in streaming data settings. Nevertheless,
this is difficult due to the restricted access to annotated data in the application phase.
One approach is to decompose activities into mid-level attributes such as arm up, arm down, leg up,
and leg down. This method requires experts to define the mid-level attributes for further training,
and its capability is limited when new activities composed of new attributes appear [97]. Other
deep learning methods for activity concept evolution remain underexplored, so some researchers
take a step back and study the open-set problem.
Open-Set. The open-set problem is currently a trending topic. Previously, most state-of-the-art
works addressed “closed-set” problems, where the training set and the test set contain the
same set of activities. The open-set problem also originates from the fact that we can never collect all
kinds of activities in the initial training phase. But compared with concept evolution problems,
the solutions to open-set problems only need to identify whether the test samples belong to the
target activities, rather than recognize the exact activities. Figure 6(c) represents the distribution
of open-set problems, where the shaded area denotes the space where new activities may emerge. An
intuitive solution to open-set problems is to build a negative set so they can be handled in a
closed-set way. A deep model based on GAN is proposed in Reference [154]. The authors generate
fake samples with the GAN to construct the negative set, and the discriminator of the GAN can be
seamlessly used as the open-set classifier.
3.4.3 Distribution Discrepancy with Sensors. Due to the sensitivity of sensors, a tiny variation
in the sensors may lead to substantial changes in the data they collect or transmit.
The influential factors include sensor instances, types, positions, and layouts in the environment.
To illustrate, instances of sensors may have different parameters such as the sampling
rate; different types of sensors collect totally different types of data with varying shapes, frequencies,
and scales; wearable sensors attached to different positions on the human body only record the motions of the
corresponding body parts; and environmental layouts of device-free sensors influence the propagation
of signals. All of these factors may cause drops in the recognition accuracy when the classifiers are
not trained for the specific device deployment. Therefore, seamless deep learning models for activity
recognition in the wild are necessary. Reference [94] shows that features learned by deep learning
models are transferable across sensor types and sensor deployments for activity recognition.
Sensor Instances. Even when data is collected in the same setting and only the sensor instances
differ (for example, when a person replaces a smartphone with a new one), the recognition accuracy
still declines. Both the hardware and the software are responsible. Owing to
imperfections in the production process, sensor chips show variation under the same conditions
[37]. Also, the performance of devices differs across software platforms [21]. For example,
APIs, resolutions, and other factors all influence the performance of sensors. There have
been a few works developing deep learning models to address distribution discrepancy problems
caused by different sensor instances. One notable work is data augmentation with GANs [89]. Data
augmentation enriches training sets so that both their size and their quality
meet the requirements of training a powerful deep learning model. A discrepancy generator
that synthesizes heterogeneous data from different sensor instances under various degrees of disturbance
is developed in Reference [89]. The aim is to replenish the training set with sufficient
discrepancy. Moreover, the authors deploy a discrepancy pipeline with two parameters that control
the discrepancy of the training set.
Sensor Types and Positions. In this section, we introduce the distribution discrepancy of sensory
data caused by different sensor types and positions on human bodies, because these two factors
usually appear together. Thanks to the pervasiveness of wearable sensors and IoT equipment,
people can wear more than one smart device to assist their daily life. It is also common that
users replace their smart devices or buy new electronic products. Since some devices are based on
the same platforms (e.g., iPhone and Apple Watch), people prefer the activity recognition system
to seamlessly recognize activities observed by the new device with models trained on
the old devices. In terms of positions, devices are attached to different body positions
according to their types. For example, a smartwatch is worn on the user’s wrist, while
a smartphone can be put in a trouser or shirt pocket. Devices at different
body positions collect very different signals, because the signals are
generated by the motions of the corresponding body parts. Two issues raised by
such changes need to be considered to address the distribution discrepancy with
sensor types and positions. First, massive data from the new sensors or new positions is required
so that the new distribution can be estimated rather completely. Second, most of the existing works
still characterize the old data and the new data with the same features, which is impractical
when sensor types and positions are not fixed. For instance, KL divergence is minimized
between the parameters of CNNs that are trained by the old data and the new data, respectively, in
Reference [68]. To address this issue, Akbari and Jafari [3] designed stochastic features
that are not only discriminative for classification but also able to preserve the inherent structures
of the sensory data. The stochastic feature extraction model is based on a generative autoencoder.
Wang et al. [146] further posed a question about how to select the best source positions for
transfer when there are multiple source positions available. This question is pragmatic, since the
smart devices can be placed in diverse positions such as on the wrist, in a pocket, or on the nose (e.g.,
goggles), and an inappropriate selection may lead to negative transfer. Reference [43] shows that the
similarity between domains is decisive in transfer learning. Reference [146] suggests that
higher similarity indicates better transfer performance between two domains. Accordingly, Chen
et al. [31] assumed that data samples of the same activities are aggregated in the distribution
space even when they are from different sensors. They proposed a class-wise stratified distance
to measure the distances between domains. Wang et al. [146] proposed a semantic distance
and a kinetic distance to measure domain distances, where the semantic distance involves spatial
relationships between data collected from two positions and the kinetic distance concerns the
relationships of motion kinetic energy between two domains.
Sensor Layouts and Environments. Sensor layouts concern device-free sensors such
as WiFi and RFID. The signals collected by the receivers are usually considerably influenced by the
layouts and the environments. The reason is that, during transmission, the signals are
inevitably reflected, refracted, and diffracted by media and barriers such as air, glass, and walls. The
spatial positions of the receivers also play a role. Despite the maturity in building classification
models for device-free activity recognition, very few works focus on how to achieve equally accurate
recognition when sensors are deployed in the wild. One example is Reference [64],
where an adversarial network is incorporated with deep feature extraction models to remove the
environment-specific information and extract the environment-independent features.
It should be noted that all the aforementioned methods need either labeled or unlabeled data
from the target domain to update their models. In the real world, a one-fits-all model that requires only
one-time training and is general enough to fit all scenarios is indispensable. Zheng et al. [178]
defined the Body-coordinate Velocity Profile (BVP) to capture domain-independent features. The
features represent power distributions over different velocities of body parts and are unique to
individual activities. The experimental results show that BVP is advantageous in cross-domain
learning and that it fits all kinds of domain factors, including users, sensor types, and sensor layouts.
One-fits-all is a new direction for researchers to mitigate the distribution discrepancy problem in
activity recognition.
In conclusion, we review three categories of distribution discrepancy in activity recognition.
They are caused by different users, time streaming, and sensor deployments. They are further
categorized according to the extent of change or the main reason for changes. Table 5 summarizes
the advantages and limitations of different works for resolving distribution discrepancy in activity
recognition.
3.5.1 Unified Models. Existing studies on composite activity recognition can be categorized
into two streams. The first one mixes composite and simple activities and tries to create a unified
model to recognize both kinds of activities. In Reference [142], 22 simple and composite
activities are grouped into four categories: (1) Locomotive (e.g., walk indoor, run indoor); (2) Semantic
(e.g., clean utensil and cooking); (3) Transitional (e.g., indoor to outdoor and walk upstairs); and
(4) Postural/relatively Stationary (e.g., standing and lying on bed). A simple multi-layer feedforward
neural network was created to recognize all the activities with a high average test accuracy
of 90%. However, the results were obtained in the subject-dependent setting, where training and
test samples are from the same subject, which limits the proposed method’s adaptability.
Table 6. Advantages and Limitations of Different Works for Composite Activity Recognition
3.5.2 Separated Models. The second strategy is to consider composite activities separately
from simple ones and to further regard a composite activity as the combination of a series
of simple activities. This hierarchical manner is more intuitive and attracts stronger research
interests. However, applying deep learning techniques to this area is still underexplored. One of
the few deep learning works is Reference [103], where the authors developed a multi-task learning
approach to recognize both simple and composite activities simultaneously. Concretely, the
authors divided a composite activity into multiple simple activities that were represented by a
series of sequential sensor signal segments. The signal segments are first input into CNNs to
extract representations of low-level activities, which are then fed into a softmax classifier for
recognizing simple activities. At the same time, the CNN-extracted features of all segments are
fed into an LSTM network to exploit their correlations and produce a high-level
semantic activity classification. In this way, the prior knowledge that simple activities are the components of a
composite activity is exploited by the shared deep feature extractor. Different from this joint learning
manner, Reference [33] inferred a sequence of simple activities and its corresponding composite
activity by using two conditional probabilistic models alternately. The authors used an estimated
action sequence to infer the composite activity, where the temporal correlations of simple actions
are extracted for the composite activity classification. Conversely, the predicted composite activity
is utilized to help derive the simple activity sequence at the next timestep. As a result, the
predictions of the sequence of simple activities and of the composite activity are mutually updated
based on each other during inference. Deep learning was used for feature
extraction from raw signals. The experimental results showed increasing accuracy as a composite
activity evolved. Even though these works have demonstrated promising solutions for recognizing
composite activities, a major concern remains: their success depends on properly cutting a raw
time-series signal into segments of individual simple actions. A summary of the advantages
and limitations of different works on composite activity recognition is presented in Table 6.
3.6.1 Explicit Segmentation. An intuitive approach is to try various fixed window sizes empirically.
However, although a larger window size provides richer information, it increases the
possibility that a transition occurs in the middle of a window. Conversely, a smaller window
size cannot provide enough information. In light of this issue, Reference [4] reported a hierarchical
signal segmentation method, which initially used a large window size and gradually narrowed
down the segmentation until only one activity remained in a sub-window. The narrow-down criterion is
that two consecutive windows have different labels or that the classification confidence is less than a
threshold. Different from the hierarchical framework, some researchers explored directly assigning
a label to each timestep instead of predicting a window as a whole [157, 176]. Inspired by
semantic segmentation in the computer vision community, the authors employed fully convolutional
networks (FCNs) [83] to achieve such a goal. Data from a large window is input, and a 1D
CNN layer is used to replace the final softmax layer, where the length of the feature map equals the number of
timesteps and the number of feature maps equals the number of activity classes, to predict a
label for each timestep. Therefore, the FCNs can use not only the information of the corresponding
timestep itself but also the information of its neighboring timesteps.
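The trade-off between window sizes in fixed-window segmentation can be illustrated with a short sketch. The signal, the per-timestep labels, and the window sizes below are hypothetical, and the function is a generic sliding-window routine rather than the method of any cited work.

```python
import numpy as np

def sliding_windows(signal, labels, window, step):
    """Cut a (time, channels) signal into fixed-size windows.

    Returns the windows, the majority label of each window, and a boolean
    flag marking windows that contain more than one activity.
    """
    segments, window_labels, mixed = [], [], []
    for start in range(0, len(signal) - window + 1, step):
        seg_labels = labels[start:start + window]
        values, counts = np.unique(seg_labels, return_counts=True)
        segments.append(signal[start:start + window])
        window_labels.append(values[np.argmax(counts)])  # majority vote
        mixed.append(len(values) > 1)                    # a transition occurs inside
    return np.stack(segments), np.array(window_labels), np.array(mixed)

# Hypothetical recording: 600 timesteps, 3 channels, activity changes at t = 250.
rng = np.random.default_rng(0)
signal = rng.normal(size=(600, 3))
labels = np.array([0] * 250 + [1] * 350)

_, _, mixed_small = sliding_windows(signal, labels, window=100, step=100)
_, _, mixed_large = sliding_windows(signal, labels, window=300, step=300)
print(mixed_small.mean(), mixed_large.mean())  # fraction of windows with a transition
```

With the small window, one of six windows straddles the transition; with the large window, one of two does, so a larger share of the training windows carries a mixed label.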
3.6.2 Implicit Segmentation. Explicit segmentation for activity recognition is not practical,
since users perform activities for varying durations. In Reference [140], Varamin et al. defined unsegmented
activity recognition as a set prediction problem. They designed a multi-label architecture
to simultaneously predict the number of ongoing activities and the probability of occurrence of
each candidate activity without explicit segmentation. Table 7 summarizes the advantages and
limitations of different methods for data segmentation.
Table 8. Advantages and Limitations of Different Works for Concurrent Activity Recognition
activity recognition can be abstracted as a multi-label task. Note that the concurrent activity is
executed by a single subject.
3.7.1 Recognize Individually. A concurrent activity can be considered as several individual activities.
Zhang et al. [175] designed an individual fully connected network for each candidate activity
on top of shared multimodal fusion features. The final decision-making layer classified each
activity independently with separate softmax layers. A key drawback of this kind of structure
is that the computational cost increases considerably as the number of activities rises. To
resolve this issue, the authors further proposed to use a single neuron with the sigmoid activation
to make a binary classification (performed or not) for each activity [81].
3.7.2 Recognize Concurrently. In contrast, Okita and Inoue [100] also targeted concurrent
activities but directly considered the possibility of different activities occurring concurrently.
They suggested a multi-layer LSTM framework that outputs the probability of every possible
activity combination. The main limitation of this work is that the output dimension grows
exponentially as the number of concurrent activities increases. The pace of exploring deep learning
methods for concurrent activity recognition is still slow, and there is considerable room for improvement. A
summary of the advantages and limitations of different approaches for concurrent activity recognition
is presented in Table 8.
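The difference in output size between the two strategies in Sections 3.7.1 and 3.7.2 can be made concrete with a small sketch. The activity names, logits, and the 0.5 decision threshold are hypothetical, and the count of 2^K combinations assumes that every subset of activities is treated as its own class, which may be stricter than the setup in Reference [100].

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

activities = ["walking", "talking", "drinking"]  # hypothetical label set, K = 3

# (a) One sigmoid unit per activity: K outputs, each thresholded independently.
logits = np.array([2.0, -1.0, 0.5])              # hypothetical network outputs
active = [a for a, p in zip(activities, sigmoid(logits)) if p >= 0.5]

# (b) One class per combination of activities: 2**K outputs.
combinations = list(product([0, 1], repeat=len(activities)))

print(active)
print(len(activities), "sigmoid outputs vs", len(combinations), "combination classes")
```

With K = 3 the difference is small, but with 10 candidate activities the per-activity design still needs 10 outputs, whereas enumerating all combinations would need 1024.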
Table 9. Advantages and Limitations of Different Works for Multi-occupant Activity Recognition
3.8.1 Collaborative Activity. In Reference [124], both wearable and ambient sensors were used
to recognize group activities of two occupants. The ambient sensors were leveraged for extracting
context information, which is represented by disparate functional indoor areas. The sensor data
of different occupants was input into different RBMs separately and then merged into a sequential
network, a DBN and an MLP, for the inference of the group activity. A high accuracy of nearly
100% was achieved. However, most of the target scenarios are restricted to two occupants
performing the same activity together.
3.8.2 Parallel Activity. In contrast, Tran et al. [138] did not assume that the occupants act
together. They aimed at recognizing the activities of each occupant individually. A multi-label RNN
was created with each cell responding to the activity of one occupant. Nevertheless, the authors only
used ambient sensors and did not propose a specific solution to data association. Table 9 summarizes
the advantages and limitations of different methods for multi-occupant activity recognition.
Table 10. Advantages and Limitations of Different Works for Computation Cost
learning features and hand-crafted features in parallel before being fed into a fully connected
classifier. This structure could increase recognition accuracy with only a small increase in
computational cost.
3.9.2 Network Optimization. Optimizing basic neural network cells and structures is another
intuitive way to decrease computational complexity. In Reference [143], Vu et al. used a
self-gated recurrent neural network (SGRNN) cell to reduce the complexity of a standard LSTM
and prevent gradient vanishing. Their experiments showed superior computational efficiency to
LSTM and GRU in terms of running time and model size. However, the running time was still
on the order of hundreds of milliseconds, and no real-world evaluation on mobile devices was
carried out to demonstrate a possible real-time implementation. For CNN-based methods, reducing
filter size is an effective means to optimize memory consumption and the number of computation
operations. For example, Reference [115] utilized 1D-CNNs instead of 2D-CNNs to control the model
size. A more insightful strategy for dealing with both the storage and computational problems is
network quantization [40]. This scheme constrains the weights and the outputs of activation
functions to two discrete values (e.g., −1, +1) instead of continuous numbers. There are three
major benefits of network quantization: (1) memory usage and model size are greatly reduced
compared to full-precision networks; (2) bitwise operations are considerably more
efficient than conventional floating-point or fixed-point arithmetic; (3) if bitwise operations
are used, then most multiply-accumulate operations (requiring hundreds of logic gates at least)
can be replaced by popcount-XNOR operations (that only require a single logic gate), which are
especially well suited for FPGAs and ASICs [155]. In Reference [155], Yang et al. explored a 2-bit
CNN with weights and activations constrained to {−0.5, 0, 0.5} for efficient activity recognition.
Table 10 summarizes the advantages and limitations of different methods for reducing computation cost.
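The replacement of multiply-accumulate by popcount-XNOR can be sketched as follows. The example assumes that weights and activations are binarized to −1 and +1 and packed into integer bit words; the helper names and the bit layout are illustrative choices, not the implementation of any cited work.

```python
def pack_bits(values):
    """Pack a vector of -1/+1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed -1/+1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    # Each match contributes +1 and each mismatch -1 to the dot product.
    return 2 * matches - n

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))    # ordinary multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, fast)  # 2 2
```

The identity used here is that a dot product of sign vectors equals the number of matching positions minus the number of mismatching positions, so no multiplications are needed.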
3.10 Privacy
The main application of human activity recognition is to monitor human behaviors, so the sensors
capture the activities of a user continuously. Since the way an activity is performed varies among
users, an adversary may infer user-sensitive information such as age from the
time-series sensor data. In particular, the black-box nature of deep learning
risks revealing user-discriminative features unintentionally. In Reference [63], the
authors investigated the privacy issue of using CNN features for human activity recognition. Their
empirical studies revealed that although the CNN was trained with a cross-entropy loss targeting only
activity classification, the obtained CNN features still showed powerful user-discriminative ability.
A simple logistic regressor could achieve a high user-classification accuracy of 84.7% when using
the CNN features extracted for activity recognition, while the same classifier could only obtain 35.2%
user-classification accuracy on raw sensor data. Therefore, it is essential to address the privacy
leakage potential of deep learning models originally designed for human activity recognition.
3.10.1 Transformation. To address this concern, some researchers have used an adversarial
loss function to minimize the discriminative accuracy for specific private information during
the training process. For example, Iwasawa et al. [63] proposed integrating an adversarial loss with
the standard activity classification loss to minimize the user identification accuracy. The authors
of References [88] and [87] adopted a similar idea to prevent privacy leakage. Their experimental
results show an effective reduction in inference accuracy for sensitive information. However,
an adversarial loss function can only be used for protecting one kind of private information, such
as user identity or gender. In addition, the adversarial loss works against the end-to-end training
objective, which makes stable convergence difficult. Considering this gap, Reference [166] borrowed
the idea of image style transformation from the computer vision community to protect all private
information at once. The authors creatively viewed raw sensor signals from two aspects: the “style”
aspect, which describes how a user performs an activity and is influenced by the user’s identity-related
information, such as age, weight, gender, height, and so on; and the “content” aspect, which describes
what activity a user performs. They proposed transforming raw sensor data so that the “content”
is unchanged while the “style” resembles random noise. Therefore, the method has the potential to
protect all sensitive information at once.
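The combination of an activity loss with an adversarial loss can be sketched generically. The sketch below assumes a feature extractor that is trained to minimize the activity cross-entropy while maximizing the cross-entropy of an adversary that predicts the user; the weighting factor `lam`, the function names, and the toy probabilities are hypothetical, and the exact formulations in the cited works may differ.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    return -math.log(probs[label])

def extractor_objective(activity_probs, activity_label,
                        user_probs, user_label, lam=0.1):
    """Objective minimized by the feature extractor.

    The activity loss is minimized as usual, while the adversary's
    user-identification loss enters with a negative sign, so features
    that make user identification easy are penalized.
    """
    activity_loss = cross_entropy(activity_probs, activity_label)
    user_loss = cross_entropy(user_probs, user_label)
    return activity_loss - lam * user_loss

activity_probs = [0.7, 0.2, 0.1]
# Features that leak identity: the adversary is confident about the user.
leaky = extractor_objective(activity_probs, 0, [0.9, 0.05, 0.05], 0)
# Features that hide identity: the adversary can only guess uniformly.
private = extractor_objective(activity_probs, 0, [1 / 3, 1 / 3, 1 / 3], 0)
print(private < leaky)  # True: the privacy-preserving features score better
```

Because the adversary models one attribute at a time, such an objective protects only the attribute it is trained to predict, which matches the limitation noted above.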
3.10.2 Perturbation. Besides data transformation, data perturbation is another way to resolve
the privacy issue. For example, Lyu et al. [84] tailored two kinds of data perturbation
mechanisms, Random Projection and repeated Gompertz, to achieve a better tradeoff between privacy
and recognition accuracy. Recently, differential privacy has gained increased research attention
due to its strong theoretical privacy guarantee. Phan et al. [105] proposed perturbing the
objective functions of the traditional deep auto-encoder to enforce ϵ-differential privacy. In
addition to the privacy preservation in feature extraction layers, an ϵ-differential
privacy-preserving softmax layer was also developed for either classification or prediction. Unlike
the above approaches, this method provides theoretical privacy guarantees and error bounds. The
advantages and limitations of different methods for protecting user privacy in activity recognition
are summarized in Table 11.
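For reference, the standard definition of ϵ-differential privacy can be stated as follows, where the randomized mechanism is written as \mathcal{M}, and D and D' are neighboring datasets that differ in one record. The Laplace mechanism in the second formula is the textbook way to satisfy the definition for a numeric query f; it is given here only as an illustration and is not the objective-perturbation scheme of Reference [105].

```latex
% epsilon-differential privacy: for all neighboring datasets D, D'
% and every set S of possible outputs,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S]

% Laplace mechanism for a numeric query f with L1 sensitivity \Delta f
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right),
\qquad
\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_{1}
```

A smaller ϵ forces the output distributions on neighboring datasets to be closer, which means more noise and stronger privacy at the cost of accuracy.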
3.11 Interpretability
Sensory data for human activity is not human-readable. A data sample may include diverse modalities
(e.g., acceleration, angular velocity) from multiple positions (e.g., wrist, ankle) in a time window.
However, only a few modalities from specific positions contribute to identifying certain activities
[72]. For example, lying is distinguishable when people are horizontal (magnetism), and ascending
stairs can be recognized by the forward and upward acceleration of people’s ankles. Unrelated
modalities can introduce noise and deteriorate the recognition performance. Moreover, the
significance of each modality changes over time. For instance, in a Parkinson’s disease detection
system, gait anomalies appear only in a short period rather than across the entire time window [162].
Intuitively, a modality shows greater significance when the corresponding body part is actively
moving.
Despite the success of deep learning in activity recognition, the inner mechanisms of deep
learning networks remain opaque. Considering the varying salience of modalities and time
intervals, it is necessary to interpret the neural networks to explore the factors behind the models’
decisions. For example, when a deep learning model identifies that a user is walking, we want to
know which modality from which time interval is the determinant. Therefore, the interpretability
of deep learning methods has become a new trend in the human activity recognition community.
Table 11. Advantages and Limitations of Different Works for Privacy Protection
3.11.1 Feature Visualization. The basic idea of interpretable deep learning is to automatically
decide the importance of each part of the input data and to achieve high accuracy by omitting the
unimportant parts and focusing on the salient parts. In fact, the standard fully connected layers
already possess such capacity, as they automatically reduce the weights of less important neurons
during training, but we still need to visualize the features for interpretation. Some researchers [21,
152] visualized the features extracted by neural networks. Salient features are sent to the subse-
quent models after the authors find out their relationships to the activities from the visualization
[152]. Nutter et al. [98] transformed sensory data into images so that visualization tools can be
applied to the sensory data for more direct interpretability.
3.11.2 Attentive Selection. The attention mechanism, recently popular in deep learning, is
originally a concept from biology and psychology that describes how we restrict our attention to
something crucial for better cognitive results. Inspired by this, researchers apply neural attention
mechanisms to deep learning to give neural networks the capability of concentrating on the subset
of inputs that really matters. Since the principle of deep attention models is to weigh input
components, components with higher weights are assumed to be more tightly related to the recognition
task and to have greater influence over the models’ decisions [128]. Some works employed attention
mechanisms to interpret deep model behaviors [165, 168, 170]. In human activity recognition,
the attention mechanism not only highlights the most distinguishable modalities and time intervals
but also informs us of the modalities and body parts that contribute most to specific activities.
Deep attention approaches can be categorized into soft attention and hard attention based on their
differentiability.
Table 12. Advantages and Limitations of Different Works for Model Interpretability
Soft Attention. In machine learning, “soft” means differentiable. Soft attention assigns a weight
between 0 and 1 to each element of the inputs, deciding how much attention to focus on each element.
Soft attention uses softmax functions in the attention layers to compute the weights, so the whole
model is fully differentiable and gradients can be propagated to other parts of the network
[167]. Attention layers can be inserted into sequence-to-sequence LSTMs for feature extraction
[135]. Attention layers can also be inserted into neural networks to tune the weights of all
samples [96] in sliding windows, since samples at different time points have varying contributions
to activity recognition. Shen et al. [129] further considered the temporal context. They designed
a segment-level attention approach to decide which time segment contains more information.
Combined with a gated CNN, the segment-level attention better extracts temporal dependencies.
Zeng et al. [162] developed attention mechanisms from two perspectives. They first proposed sensor
attention on the inputs to extract the salient sensory modalities and then applied temporal attention
to an LSTM to filter out the inactive data segments. Spatial and temporal attention mechanisms
are employed in Reference [85]. In particular, the spatial dependencies are extracted by fusing the
modalities with self-attention.
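The weighting step of soft attention can be sketched in a few lines. The sketch assumes that a relevance score per modality (or per time step) is already available; in an actual model these scores would come from a learned layer, and all names and numbers below are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs lie in (0, 1) and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """Weight each feature vector by its attention weight and sum them.

    features: one feature vector per modality or time step (equal lengths)
    scores:   one unnormalized relevance score per feature vector
    Returns the attention weights and the attended (context) vector.
    """
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Three hypothetical modalities, e.g., wrist, ankle, and chest features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 2.0, 0.3]
weights, context = soft_attention(features, scores)
print(max(range(3), key=lambda i: weights[i]))  # 1: the second modality dominates
```

Every input keeps a nonzero weight, which is why the operation stays differentiable; inspecting `weights` is what gives the interpretability discussed above.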
Hard Attention. Hard attention determines whether to attend to a part of inputs or not. The
weight assigned to an input part is either 0 or 1 so the problem is non-differentiable. The process
involves making a sequence of selections about which part to attend. The selection can be output by
a neural network. However, since there is no ground truth indicating the correct selection policy,
hard attention should be represented as a stochastic process. This is where deep reinforcement
learning comes in. Deep reinforcement learning tackles the selection problems in deep learning
and allows the models to propagate gradients in the space of selection policies.
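The stochastic selection and its training signal can be illustrated with a minimal REINFORCE-style sketch. It assumes a categorical policy over input parts and a scalar reward for each selection; the toy reward function, the hyperparameters, and the function names are hypothetical and do not reproduce any of the cited methods.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over policy logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_hard_attention(reward_fn, num_parts, steps=300, lr=0.5, seed=0):
    """Learn which input part to attend to with a policy-gradient update."""
    rng = random.Random(seed)
    logits = [0.0] * num_parts
    for _ in range(steps):
        probs = softmax(logits)
        # Hard (0/1) selection: sample exactly one part to attend to.
        chosen = rng.choices(range(num_parts), weights=probs)[0]
        reward = reward_fn(chosen)
        # The gradient of log pi(chosen) w.r.t. the logits is one_hot(chosen) - probs.
        for i in range(num_parts):
            grad = (1.0 if i == chosen else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)

# Hypothetical toy reward: only part 1 (say, an ankle sensor) helps recognition.
probs = train_hard_attention(lambda part: 1.0 if part == 1 else 0.0, num_parts=3)
print(max(range(3), key=lambda i: probs[i]))  # 1
```

The sampled choice itself has no gradient, so the update instead adjusts the selection probabilities in proportion to the reward, which is the sense in which gradients are propagated in the space of selection policies.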
Different reinforcement learning techniques can be applied to hard attention mechanisms in
human activity recognition. Zhang et al. [173] used dueling deep Q networks as the core of hard
attention to focus on the salient parts of multimodal sensory data. Chen et al. [26, 29] mined
important modalities and elided undesirable features with policy gradient. The attention is embedded
into an LSTM to make selections step by step, because an LSTM incrementally learns information in
an episode. Chen et al. [28] further considered the intrinsic relations between activities and
sub-motions from human body parts. They employed multiple agents to concentrate on modalities that
are related to sub-motions. The agents coordinate to portray the activities. The visualization
of the selected modalities and body parts validates that the attention mechanism provides insights
into how sensory data elements affect the models’ prediction of activities. The advantages and
limitations of different methods for model interpretability are listed in Table 12.
fully explored, such as class imbalance, composite activities, concurrent activities, and so on. Al-
though current research works still lack comprehensive and reliable solutions for the challenges,
they lay concrete foundations and show guidance for future directions.
Moreover, there are other research directions that have rarely been explored before. We outline
several key research directions that urgently need to be exploited as follows:
5 CONCLUSION
This work aims at suggesting a rough guideline for novices and experienced researchers who
have interest in deep learning methods for sensor-based human activity recognition. We present a
comprehensive survey to summarize the current deep learning methods for sensor-based human
activity recognition. We first introduce the multi-modality of the sensory data and available public
datasets and their extensive utilization in different challenges. We then summarize the challenges
in human activity recognition based on their reasons and analyze how existing deep methods are
adopted to address the challenges. At the end of this work, we discuss the open issues and provide
some insights for future directions.
REFERENCES
[1] Zahraa S. Abdallah, Mohamed Medhat Gaber, Bala Srinivasan, and Shonali Krishnaswamy. 2018. Activity recogni-
tion with evolving data streams: A review. Comput. Surv. 51, 4 (2018), 71.
[2] Alireza Abedin, Seyed Hamid Rezatofighi, Qinfeng Shi, and Damith Chinthana Ranasinghe. 2019. SparseSense: Hu-
man activity recognition from highly sparse sensor data-streams using set-based neural networks. In Proceedings of
the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 5780–5786.
[3] Ali Akbari and Roozbeh Jafari. 2019. Transferring activity recognition models for new wearable sensors with deep
generative domain adaptation. In Proceedings of the 18th International Conference on Information Processing in Sensor
Networks. ACM, 85–96.
[4] Ali Akbari, Jian Wu, Reese Grimsley, and Roozbeh Jafari. 2018. Hierarchical signal segmentation and classification
for accurate activity recognition. In Proceedings of the ACM International Joint Conference and International Sympo-
sium on Pervasive and Ubiquitous Computing and Wearable Computers. ACM, 1596–1605.
[5] Ali A. Alani, Georgina Cosma, and Aboozar Taherkhani. 2020. Classifying imbalanced multi-modal sensor data for
human activity recognition in a smart home using deep learning. In Proceedings of the International Joint Conference
on Neural Networks (IJCNN’20). IEEE, 1–8.
[6] Hande Alemdar, Halil Ertan, Ozlem Durmaz Incel, and Cem Ersoy. 2013. ARAS human activity datasets in multiple
homes with multiple residents. In Proceedings of the 7th International Conference on Pervasive Computing Technologies
for Healthcare. ICST, 232–235.
[7] Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. 2016. Deep
activity recognition models with triaxial accelerometers. In Proceedings of the Workshops at the 30th AAAI Conference
on Artificial Intelligence.
[8] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A public domain
dataset for human activity recognition using smartphones. In Proceedings of the European Symposium on Artificial
Neural Networks.
[9] Sina Mokhtarzadeh Azar, Mina Ghadimi Atigh, Ahmad Nickabadi, and Alexandre Alahi. 2019. Convolutional rela-
tional machine for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 7892–7901.
[10] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2010. Action classifica-
tion in soccer videos with long short-term memory recurrent neural networks. In Proceedings of the International
Conference on Artificial Neural Networks. Springer, 154–159.
[11] Marc Bachlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M. Hausdorff, Nir Giladi, and Gerhard Troster.
2010. Wearable assistant for Parkinson’s disease patients with the freezing of gait symptom. IEEE Trans. Inf. Technol.
Biomed. 14, 2 (2010), 436–446.
[12] Lei Bai, Lina Yao, Xianzhi Wang, Salil S. Kanhere, Bin Guo, and Zhiwen Yu. 2020. Adversarial multi-view networks
for activity recognition. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 4, 2 (2020), 1–22.
[13] Lei Bai, Lina Yao, Xianzhi Wang, Salil S. Kanhere, and Yang Xiao. 2020. Prototype similarity learning for activity
recognition. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 649–
661.
[14] Lu Bai, Chris Yeung, Christos Efstratiou, and Moyra Chikomo. 2019. Motion2Vector: Unsupervised learning in hu-
man activity recognition using wrist-sensing data. In Proceedings of the ACM International Joint Conference on Perva-
sive and Ubiquitous Computing and Proceedings of the ACM International Symposium on Wearable Computers. ACM,
537–542.
[15] Donald S. Baim, Wilson S. Colucci, E. Scott Monrad, Harton S. Smith, Richard F. Wright, Alyce Lanoue, Diane
F. Gauthier, Bernard J. Ransil, William Grossman, and Eugene Braunwald. 1986. Survival of patients with severe
congestive heart failure treated with oral milrinone. J. Amer. Coll. Cardiol. 7, 3 (1986), 661–670.
[16] Oresti Banos, Rafael Garcia, Juan A. Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez,
and Claudia Villalonga. 2014. mHealthDroid: A novel framework for agile development of mobile health applica-
tions. In Proceedings of the International Workshop on Ambient Assisted Living. Springer, 91–98.
[17] Billur Barshan and Murat Cihan Yüksek. 2014. Recognizing daily and sports activities in two open source machine
learning environments using body-worn sensor units. Comput. J. 57, 11 (2014), 1649–1667.
[18] Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of
ICML Workshop on Unsupervised and Transfer Learning. 17–36.
[19] Asma Benmansour, Abdelhamid Bouchachia, and Mohammed Feham. 2015. Multioccupant activity recognition in
pervasive smart home environments. Comput. Surv. 48, 3 (2015), 1–36.
[20] Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with Co-Training. In Proceedings
of the Eleventh Annual Conference on Computational Learning Theory, COLT 1998, Madison, Wisconsin, USA, July
24–26, 1998. ACM, 92–100.
[21] Henrik Blunck, Niels Olof Bouvin, Tobias Franke, Kaj Grønbæk, Mikkel B. Kjaergaard, Paul Lukowicz, and Markus
Wüstenberg. 2013. On heterogeneity in mobile sensing applications aiming at representative data collection. In
Proceedings of the ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication. ACM, 1087–1098.
[22] Eoin Brophy, José Juan Dominguez Veiga, Zhengwei Wang, Alan F. Smeaton, and Tomas E. Ward. 2018. An inter-
pretable machine vision approach to human activity recognition using photoplethysmograph sensor data. arXiv
preprint arXiv:1812.00668 (2018).
[23] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A tutorial on human activity recognition using body-worn
inertial sensors. Comput. Surv. 46, 3 (2014), 33:1–33:33. DOI:[Link]
[24] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R.
Millán, and Daniel Roggen. 2013. The opportunity challenge: A benchmark database for on-body sensor-based ac-
tivity recognition. Pattern Recog. Lett. 34, 15 (2013), 2033–2042.
[25] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2015. UTD-MHAD: A multimodal dataset for human action
recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the IEEE International Confer-
ence on Image Processing (ICIP’15). IEEE, 168–172.
[26] Kaixuan Chen, Lina Yao, Xianzhi Wang, Dalin Zhang, Tao Gu, Zhiwen Yu, and Zheng Yang. 2018. Interpretable par-
allel recurrent neural networks with convolutional attentions for multi-modality activity modeling. In Proceedings
of the International Joint Conference on Neural Networks. IEEE, 1–8.
[27] Kaixuan Chen, Lina Yao, Dalin Zhang, Xiaojun Chang, Guodong Long, and Sen Wang. 2019. Distributionally ro-
bust semi-supervised learning for people-centric sensing. In Proceedings of the 33rd AAAI Conference on Artificial
Intelligence (AAAI’19). 3321–3328.
[28] Kaixuan Chen, Lina Yao, Dalin Zhang, Bin Guo, and Zhiwen Yu. 2019. Multi-agent attentional activity recognition.
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 1344–1350.
[29] Kaixuan Chen, Lina Yao, Dalin Zhang, Xianzhi Wang, Xiaojun Chang, and Feiping Nie. 2020. A semisupervised
recurrent convolutional attention model for human activity recognition. IEEE Trans. Neural Networks Learn. Syst. 31, 5
(2020), 1747–1756.
[30] Ling Chen, Yi Zhang, and Liangying Peng. 2020. METIER: A deep multi-task learning based activity and user recog-
nition model using wearable sensors. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 4, 1 (2020), 1–18.
[31] Yiqiang Chen, Jindong Wang, Meiyu Huang, and Han Yu. 2019. Cross-position activity recognition with stratified
transfer learning. Pervas. Mob. Comput. 57 (2019), 1–13.
[32] Yuwen Chen, Kunhua Zhong, Ju Zhang, Qilong Sun, and Xueliang Zhao. 2016. LSTM networks for mobile human
activity recognition. In Proceedings of the International Conference on Artificial Intelligence: Technologies and Appli-
cations. Atlantis Press.
[33] Weihao Cheng, Sarah M. Erfani, Rui Zhang, and Ramamohanarao Kotagiri. 2018. Predicting complex activities from
ongoing multivariate time series. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.
3322–3328.
[34] Belkacem Chikhaoui and Frank Gouineau. 2017. Towards automatic feature extraction for activity recognition from
wearable sensors: A deep learning approach. In Proceedings of the IEEE 17th International Conference on Data Mining
Workshops (ICDMW’17). IEEE, 693–702.
[35] Jun-Ho Choi and Jong-Seok Lee. 2018. Confidence-based deep multimodal fusion for activity recognition. In Proceed-
ings of the ACM International Joint Conference and International Symposium on Pervasive and Ubiquitous Computing
and Wearable Computers. ACM, 1548–1556.
[36] Oscar Day and Taghi M. Khoshgoftaar. 2017. A survey on heterogeneous transfer learning. J. Big Data 4, 1 (2017),
29.
[37] Sanorita Dey, Nirupam Roy, Wenyuan Xu, Romit Roy Choudhury, and Srihari Nelakuditi. 2014. AccelPrint: Im-
perfections of accelerometers make smartphones trackable. In Proceedings of the Network and Distributed System
Security Symposium (NDSS’14).
[38] Mingtao Dong, Jindong Han, Yuan He, and Xiaojun Jing. 2018. HAR-Net: Fusing deep representation and hand-
crafted features for human activity recognition. In Proceedings of the International Conference on Signal and Infor-
mation Processing, Networking and Computers. Springer, 32–40.
[39] Stefan Duffner, Samuel Berlemont, Grégoire Lefebvre, and Christophe Garcia. 2014. 3D gesture classification with
convolutional neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Pro-
cessing. IEEE, 5432–5436.
[40] Marcus Edel and Enrico Köppe. 2016. Binarized-BLSTM-RNN based human activity recognition. In Proceedings of
the International Conference on Indoor Positioning and Indoor Navigation (IPIN’16). IEEE, 1–7.
[41] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010.
Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11, Feb. (2010), 625–660.
[42] Xiaoyi Fan, Wei Gong, and Jiangchuan Liu. 2018. TagFree activity identification with RFIDs. Proc. ACM Interact.,
Mob., Wear. Ubiq. Technol. 2, 1 (2018), 7.
[43] Martin Gjoreski, Stefan Kalabakov, Mitja Luštrek, and Hristijan Gjoreski. 2019. Cross-dataset deep transfer learn-
ing for activity recognition. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous
Computing and Proceedings of the ACM International Symposium on Wearable Computers. ACM, 714–718.
[44] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the International Conference on Advances in
Neural Information Processing Systems. 2672–2680.
[45] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search
space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28, 10 (2016), 2222–2232.
[46] Rene Grzeszick, Jan Marius Lenk, Fernando Moya Rueda, Gernot A. Fink, Sascha Feldhorst, and Michael ten Hompel.
2017. Deep neural network based human activity recognition for the order picking process. In Proceedings of the 4th
International Workshop on Sensor-based Activity Recognition and Interaction. ACM, 14.
[47] Fuqiang Gu, Kourosh Khoshelham, Shahrokh Valaee, Jianga Shang, and Rui Zhang. 2018. Locomotion activity recog-
nition using stacked denoising autoencoders. IEEE Internet Things J. 5, 3 (2018), 2085–2093.
[48] Yu Guan and Thomas Plötz. 2017. Ensembles of deep LSTM learners for activity recognition using wearables. Proc.
ACM Interact., Mob., Wear. Ubiq. Technol. 1, 2 (2017), 11.
[49] Gautham Krishna Gudur, Prahalathan Sundaramoorthy, and Venkatesh Umaashankar. 2019. ActiveHARNet: To-
wards on-device deep Bayesian active learning for human activity recognition. arXiv preprint arXiv:1906.00108
(2019).
[50] Abdu Gumaei, Mohammad Mehedi Hassan, Abdulhameed Alelaiwi, and Hussain Alsalman. 2019. A hybrid deep
learning model for human activity recognition using multimodal body sensing data. IEEE Access 7 (2019), 99152–
99160.
[51] Haodong Guo, Ling Chen, Liangying Peng, and Gencai Chen. 2016. Wearable sensor based multimodal human ac-
tivity recognition exploiting the diversity of classifier ensemble. In Proceedings of the ACM International Joint Con-
ference on Pervasive and Ubiquitous Computing. ACM, 1112–1123.
[52] Quang-Do Ha and Minh-Triet Tran. 2017. Activity recognition from inertial sensors with convolutional neural net-
works. In Proceedings of the International Conference on Future Data and Security Engineering. Springer, 285–298.
[53] Sojeong Ha and Seungjin Choi. 2016. Convolutional neural networks for human activity recognition using multiple
accelerometer and gyroscope sensors. In Proceedings of the International Joint Conference on Neural Networks. IEEE,
381–388.
[54] Sojeong Ha, Jeong-Min Yun, and Seungjin Choi. 2015. Multi-modal convolutional neural networks for activity recog-
nition. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 3017–3022.
[55] Nils Yannick Hammerla, James Fisher, Peter Andras, Lynn Rochester, Richard Walker, and Thomas Plötz. 2015. PD
disease state assessment in naturalistic environments using deep learning. In Proceedings of the 29th AAAI Conference
on Artificial Intelligence.
[56] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. 2016. Deep, convolutional, and recurrent models for human
activity recognition using wearables. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
1533–1540.
[57] H. M. Hossain, MD Al Haiz Khan, and Nirmalya Roy. 2018. DeActive: Scaling activity recognition with active deep
learning. Proc. ACM Interact., Mob., Wear. Ubiq. Technol. 2, 2 (2018), 66.
[58] H. M. Hossain and Nirmalya Roy. 2019. Active deep learning for activity recognition with context aware annotator
selection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
ACM, 1862–1870.
[59] Tâm Huynh, Mario Fritz, and Bernt Schiele. 2008. Discovery of activity patterns using topic models. In Proceedings
of the Conference on Ubiquitous Computing (UbiComp’08), Vol. 8. 10–19.
[60] Tâm Huynh and Bernt Schiele. 2005. Analyzing features for activity recognition. In Proceedings of the Joint Confer-
ence on Smart Objects and Ambient Intelligence: Innovative Context-aware Services: Usages and Technologies. ACM,
159–163.
[61] Shoya Ishimaru, Kensuke Hoshika, Kai Kunze, Koichi Kise, and Andreas Dengel. 2017. Towards reading trackers in
the wild: Detecting reading activities by EOG glasses and deep neural networks. In Proceedings of the ACM Interna-
tional Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the ACM International Symposium
on Wearable Computers. ACM, 704–711.
[62] Chihiro Ito, Xin Cao, Masaki Shuzo, and Eisaku Maeda. 2018. Application of CNN for human activity recognition
with FFT spectrogram of acceleration and gyro sensors. In Proceedings of the ACM International Joint Conference and
International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers. ACM, 1503–1510.
[63] Yusuke Iwasawa, Kotaro Nakayama, Ikuko Yairi, and Yutaka Matsuo. 2017. Privacy issues regarding the applica-
tion of DNNs to activity-recognition using wearables and its countermeasures by use of adversarial training. In
Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1930–1936.
[64] Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shuochao Yao, Yaqing Wang, Ye Yuan, Hongfei Xue, Chen Song, Xin
Ma, Dimitrios Koutsonikolas, et al. 2018. Towards environment independent device free human activity recognition.
In Proceedings of the 24th International Conference on Mobile Computing and Networking. ACM, 289–304.
[65] Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional
neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 1307–1310.
[66] Artur Jordao, Antonio C. Nazare Jr, Jessica Sena, and William Robson Schwartz. 2018. Human activity recognition
based on wearable sensor data: A standardization of the state-of-the-art. arXiv preprint arXiv:1806.05226 (2018).
[67] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2016. Visualizing and understanding recurrent networks. In Proceedings
of the 4th International Conference on Learning Representations Workshop.
[68] Md Abdullah Al Hafiz Khan, Nirmalya Roy, and Archan Misra. 2018. Scaling human activity recognition via deep
learning-based domain adaptation. In Proceedings of the International Conference on Pervasive Computing and Communications.
IEEE, 1–9.
[69] Shehroz S. Khan and Babak Taati. 2017. Detecting unseen falls from wearable devices using channel-wise ensemble
of autoencoders. Exp. Syst. Applic. 87 (2017), 280–290.
[70] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional
neural networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems.
1097–1105.
[71] Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. 2011. Activity recognition using cell phone accelerometers.
ACM SIGKDD Explor. Newslett. 12, 2 (2011), 74–82.
[72] Yongjin Kwon, Kyuchang Kang, and Changseok Bae. 2015. Analysis and evaluation of smartphone-based human
activity recognition using a neural network approach. In Proceedings of the International Joint Conference on Neural
Networks. IEEE, 1–5.
[73] Nicholas D. Lane and Petko Georgiev. 2015. Can deep learning revolutionize mobile sensing? In Proceedings of the
16th International Workshop on Mobile Computing Systems and Applications. ACM, 117–122.
[74] Gierad Laput and Chris Harrison. 2019. Sensing fine-grained hand activity with smartwatches. In Proceedings of the
CHI Conference on Human Factors in Computing Systems. ACM, 338.
[75] Oscar D. Lara and Miguel A. Labrador. 2013. A survey on human activity recognition using wearable sensors. IEEE
Commun. Surv. Tutor. 15, 3 (2013), 1192–1209.
[76] Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data using
Convolutional Neural Network. In Proceedings of the IEEE International Conference on Big Data and Smart Computing
(BigComp’17). IEEE, 131–134.
[77] Fei Li and Schahram Dustdar. 2011. Incorporating unsupervised learning in activity recognition. In Proceedings of
the Workshops at the 25th AAAI Conference on Artificial Intelligence.
[78] Xinyu Li, Yuan He, and Xiaojun Jing. 2019. A survey of deep learning-based human activity recognition in radar.
Remote Sens. 11, 9 (2019), 1068.
[79] Xinyu Li, Yanyi Zhang, Mengzhu Li, Ivan Marsic, JaeWon Yang, and Randall S. Burd. 2016. Deep neural network
for RFID-based activity recognition. In Proceedings of the 8th Wireless of the Students, by the Students, and for the
Students Workshop (S3@MobiCom’16). ACM, 24–26.
[80] Xinyu Li, Yanyi Zhang, Ivan Marsic, Aleksandra Sarcevic, and Randall S. Burd. 2016. Deep learning for RFID-based
activity recognition. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM. ACM,
164–175.
[81] Xinyu Li, Yanyi Zhang, Jianyu Zhang, Shuhong Chen, Ivan Marsic, Richard A. Farneth, and Randall S. Burd. 2017.
Concurrent activity recognition with multimodal CNN-LSTM structure. arXiv preprint arXiv:1702.01638 (2017).
[82] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with
implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data
Mining and Knowledge Discovery. ACM, 2–11.
[83] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[84] Lingjuan Lyu, Xuanli He, Yee Wei Law, and Marimuthu Palaniswami. 2017. Privacy-preserving collaborative deep
learning with application to human activity recognition. In Proceedings of the ACM on Conference on Information
and Knowledge Management. ACM, 1219–1228.
[85] Haojie Ma, Wenzhong Li, Xiao Zhang, Songcheng Gao, and Sanglu Lu. 2019. AttnSense: Multi-level attention mechanism
for multimodal human activity recognition. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI’19). 3109–3115.
[86] Yuchao Ma and Hassan Ghasemzadeh. 2019. LabelForest: Non-parametric semi-supervised learning for activity
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4520–4527.
[87] Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2018. Protecting sensory data
against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems. ACM, 2.
[88] Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2019. Mobile sensor data
anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation. 49–58.
[89] Akhil Mathur, Tianlin Zhang, Sourav Bhattacharya, Petar Veličković, Leonid Joffe, Nicholas D. Lane, Fahim Kawsar,
and Pietro Lió. 2018. Using deep data augmentation training to address software and hardware heterogeneities in
wearable and smartphone sensing devices. In Proceedings of the 17th ACM/IEEE International Conference on Information
Processing in Sensor Networks. IEEE Press, 200–211.
[90] Shinya Matsui, Nakamasa Inoue, Yuko Akagi, Goshu Nagino, and Koichi Shinoda. 2017. User adaptation of convolutional
neural network for human activity recognition. In Proceedings of the 25th European Signal Processing
Conference. IEEE, 753–757.
[91] Taylor Mauldin, Marc Canby, Vangelis Metsis, Anne Ngu, and Coralys Rivera. 2018. SmartFall: A smartwatch-based
fall detection system using deep learning. Sensors 18, 10 (2018), 3363.
[92] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural
network based language model. In Proceedings of the 11th Conference of the International Speech Communication
Association.
[93] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. 2011. Acoustic modeling using deep belief networks.
IEEE Trans. Audio, Speech, Lang. Proc. 20, 1 (2011), 14–22.
[94] Francisco Javier Ordóñez Morales and Daniel Roggen. 2016. Deep convolutional feature transfer across mobile activity
recognition domains, sensor modalities and locations. In Proceedings of the ACM International Symposium on
Wearable Computers. ACM, 92–99.
[95] Sebastian Münzner, Philip Schmidt, Attila Reiss, Michael Hanselmann, Rainer Stiefelhagen, and Robert Dürichen.
2017. CNN-based sensor fusion techniques for multimodal human activity recognition. In Proceedings of the ACM
International Symposium on Wearable Computers. ACM, 158–165.
[96] Vishvak S. Murahari and Thomas Plötz. 2018. On attention models for human activity recognition. In Proceedings of
the ACM International Symposium on Wearable Computers. ACM, 100–103.
[97] Harideep Nair, Cathy Tan, Ming Zeng, Ole J. Mengshoel, and John Paul Shen. 2019. AttriNet: Learning mid-level
features for human activity recognition with deep belief networks. In Proceedings of the ACM International Joint
Conference on Pervasive and Ubiquitous Computing and Proceedings of the ACM International Symposium on Wearable
Computers. ACM, 510–517.
[98] Mark Nutter, Catherine H. Crawford, and Jorge Ortiz. 2018. Design of novel deep learning models for real-time human
activity recognition with mobile phones. In Proceedings of the International Joint Conference on Neural Networks.
IEEE, 1–8.
[99] Henry Friday Nweke, Ying Wah Teh, Mohammed Ali Al-Garadi, and Uzoma Rita Alo. 2018. Deep learning algorithms
for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges.
Exp. Syst. Applic. 105 (2018), 233–261.
[100] Tsuyoshi Okita and Sozo Inoue. 2017. Recognition of multiple overlapping activities using compositional CNN-LSTM
model. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing and
Proceedings of the ACM International Symposium on Wearable Computers. ACM, 165–168.
[101] Francisco Ordóñez and Daniel Roggen. 2016. Deep convolutional and LSTM recurrent neural networks for multimodal
wearable activity recognition. Sensors 16, 1 (2016), 115.
[102] Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 10 (2009),
1345–1359.
[103] Liangying Peng, Ling Chen, Zhenan Ye, and Yi Zhang. 2018. AROMA: A deep multi-task learning based simple and
complex human activity recognition method using wearable sensors. Proc. ACM Interact., Mob., Wear. Ubiq. Technol.
2, 2 (2018), 74.
[104] Cuong Pham and Patrick Olivier. 2009. Slice&dice: Recognizing food preparation activities using embedded accelerometers.
In Proceedings of the European Conference on Ambient Intelligence. Springer, 34–43.
[105] NhatHai Phan, Yue Wang, Xintao Wu, and Dejing Dou. 2016. Differential privacy preservation for deep auto-encoders:
An application of human behavior prediction. In Proceedings of the 30th AAAI Conference on Artificial
Intelligence.
[106] Ivan Miguel Pires, Nuno Pombo, Nuno M. Garcia, and Francisco Flórez-Revuelta. 2018. Multi-sensor mobile platform
for the recognition of activities of daily living and their environments based on artificial neural networks. In
Proceedings of the 27th International Joint Conference on Artificial Intelligence. 5850–5852.
[107] Thomas Plötz, Nils Y. Hammerla, and Patrick L. Olivier. 2011. Feature learning for activity recognition in ubiquitous
computing. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence.
[108] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching
Chen, and S. S. Iyengar. 2018. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv.
51, 5 (2018), 92.
[109] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative
adversarial networks without pre-training. arXiv preprint arXiv:1706.01399 (2017).
[110] Hangwei Qian, Sinno Pan, and Chunyan Miao. 2018. Sensor-based activity recognition via learning from distributions.
In Proceedings of the AAAI Conference on Artificial Intelligence.
[111] Hangwei Qian, Sinno Jialin Pan, Bingshui Da, and Chunyan Miao. 2019. A novel distribution-embedded neural
network for sensor-based activity recognition. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI’19). 5614–5620.
[112] Hangwei Qian, Sinno Jialin Pan, and Chunyan Miao. 2019. Distribution-based semi-supervised learning for activity
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7699–7706.
[113] Valentin Radu, Nicholas D. Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K. Marina, and Fahim Kawsar. 2016.
Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the ACM International
Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, 185–188.
[114] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D. Lane, Cecilia Mascolo, Mahesh K. Marina, and
Fahim Kawsar. 2018. Multimodal deep learning for activity and context recognition. Proc. ACM Interact., Mob., Wear.
Ubiq. Technol. 1, 4 (2018), 157.
[115] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. A deep learning approach to on-node sensor
data analytics for mobile or wearable devices. IEEE J. Biomed. Health Inform. 21, 1 (2016), 56–64.
[116] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. Deep learning for human activity recognition:
A resource efficient implementation on low-power devices. In Proceedings of the IEEE 13th International
Conference on Wearable and Implantable Body Sensor Networks (BSN’16). IEEE, 71–76.
[117] Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Proceedings
of the 16th International Symposium on Wearable Computers. IEEE, 108–109.
[118] Jorge-L. Reyes-Ortiz, Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita. 2016. Transition-aware human
activity recognition using smartphones. Neurocomputing 171 (2016), 754–767.
[119] Daniele Riboni, Linda Pareschi, Laura Radaelli, and Claudio Bettini. 2011. Is ontology-based activity recognition
really effective? In Proceedings of the IEEE International Conference on Pervasive Computing and Communications
Workshops. IEEE, 427–431.
[120] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz,
David Bannach, Gerald Pirkl, Alois Ferscha, et al. 2010. Collecting complex activity datasets in highly rich networked
sensor environments. In Proceedings of the 7th International Conference on Networked Sensing Systems (INSS’10). IEEE,
233–240.
[121] Seyed Ali Rokni, Marjan Nourollahi, and Hassan Ghasemzadeh. 2018. Personalized human activity recognition using
convolutional neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[122] Charissa Ann Ronao and Sung-Bae Cho. 2015. Deep convolutional neural networks for human activity recognition
with smartphone sensors. In Proceedings of the International Conference on Neural Information Processing. Springer,
46–53.
[123] Charissa Ann Ronao and Sung-Bae Cho. 2016. Human activity recognition with smartphone sensors using deep
learning neural networks. Exp. Syst. Applic. 59 (2016), 235–244.
[124] Silvia Rossi, Roberto Capasso, Giovanni Acampora, and Mariacarla Staffa. 2018. A multimodal deep learning network
for group activity recognition. In Proceedings of the International Joint Conference on Neural Networks. IEEE, 1–6.
[125] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput.
Vis. 115, 3 (2015), 211–252.
[126] Ramyar Saeedi, Skyler Norgaard, and Assefaw H. Gebremedhin. 2017. A closed-loop deep learning architecture for
robust activity recognition using wearable sensors. In Proceedings of the IEEE International Conference on Big Data.
IEEE, 473–479.
[127] Jeffrey C. Schlimmer and Richard H. Granger. 1986. Incremental learning from noisy data. Mach. Learn. 1, 3 (1986),
317–354.
[128] Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Conference of the Association
for Computational Linguistics (ACL’19). 2931–2951.
[129] Yu-Han Shen, Ke-Xin He, and Wei-Qiang Zhang. 2018. SAM-GCNN: A gated convolutional neural network with
segment-level attention mechanism for home activity monitoring. In Proceedings of the IEEE International Symposium
on Signal Processing and Information Technology (ISSPIT’18). IEEE, 679–684.
[130] Muhammad Shoaib, Stephan Bosch, Ozlem Incel, Hans Scholten, and Paul Havinga. 2014. Fusion of smartphone
motion sensors for physical activity recognition. Sensors 14, 6 (2014), 10146–10176.
[131] Geetika Singla, Diane J. Cook, and Maureen Schmitter-Edgecombe. 2010. Recognizing independent and joint activities
among multiple residents in smart environments. J. Amb. Intell. Human. Comput. 1, 1 (2010), 57–63.
[132] Elnaz Soleimani and Ehsan Nazerfard. 2019. Cross-subject transfer learning in human activity recognition systems
using generative adversarial networks. arXiv preprint arXiv:1903.12489 (2019).
[133] Maja Stikic, Kristof Van Laerhoven, and Bernt Schiele. 2008. Exploring semi-supervised and active learning for
activity recognition. In Proceedings of the 12th IEEE International Symposium on Wearable Computers. IEEE, 81–88.
[134] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias
Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities
for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems.
ACM, 127–140.
[135] Yujin Tang, Jianfeng Xu, Kazunori Matsumoto, and Chihiro Ono. 2016. Sequence-to-sequence model with attention
for time series classification. In Proceedings of the 16th International Conference on Data Mining Workshops. IEEE,
503–510.
[136] Dapeng Tao, Yonggang Wen, and Richang Hong. 2016. Multicolumn bidirectional long short-term memory for mobile
devices-based human activity recognition. IEEE Internet Things J. 3, 6 (2016), 1124–1134.
[137] Luan Tran, Xi Yin, and Xiaoming Liu. 2017. Disentangled representation learning GAN for pose-invariant face
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1415–1424.
[138] Son N. Tran, Qing Zhang, Vanessa Smallbon, and Mohan Karunanithi. 2018. Multi-resident activity monitoring in
smart homes: A case study. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications
Workshops (PerCom Workshops’18). IEEE, 698–703.
[139] Tim L. M. van Kasteren, Gwenn Englebienne, and Ben J. A. Kröse. 2011. Human activity recognition from wireless
sensor network data: Benchmark and software. In Activity Recognition in Pervasive Intelligent Environments.
Springer, 165–186.
[140] Alireza Abedin Varamin, Ehsan Abbasnejad, Qinfeng Shi, Damith C. Ranasinghe, and Hamid Rezatofighi. 2018. Deep
auto-set: A deep auto-encoder-set network for activity recognition using wearables. In Proceedings of the 15th EAI
International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. ACM, 246–253.
[141] George Vavoulas, Charikleia Chatzaki, Thodoris Malliotakis, Matthew Pediaditis, and Manolis Tsiknakis. 2016. The
MobiAct dataset: Recognition of activities of daily living using smartphones. In Proceedings of the International
Conference on Information and Communication Technologies for Ageing Well and e-Health (ICT4AgeingWell’16). 143–151.
[142] Praneeth Vepakomma, Debraj De, Sajal K. Das, and Shekhar Bhansali. 2015. A-Wristocracy: Deep learning on wrist-worn
sensing for recognition of user complex activities. In Proceedings of the IEEE 12th International Conference on
Wearable and Implantable Body Sensor Networks (BSN’15). 1–6.
[143] Toan H. Vu, An Dang, Le Dung, and Jia-Ching Wang. 2017. Self-gated recurrent neural networks for human activity
recognition on wearable devices. In Proceedings of the Thematic Workshops of ACM Multimedia. ACM, 179–185.
[144] Jiwei Wang, Yiqiang Chen, Yang Gu, Yunlong Xiao, and Haonan Pan. 2018. SensoryGANs: An effective generative
adversarial framework for sensor-based human activity recognition. In Proceedings of the International Joint
Conference on Neural Networks. IEEE, 1–8.
[145] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. 2019. Deep learning for sensor-based activity
recognition: A survey. Pattern Recog. Lett. 119 (2019), 3–11.
[146] Jindong Wang, Vincent W. Zheng, Yiqiang Chen, and Meiyu Huang. 2018. Deep transfer learning for cross-domain
activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering. ACM, 16.
[147] Yanwen Wang, Jiaxing Shen, and Yuanqing Zheng. 2020. Push the limit of acoustic gesture recognition. In 39th IEEE
Conference on Computer Communications (INFOCOM’20). IEEE, 566–575.
[148] Sungpil Woo, Jaewook Byun, Seonghoon Kim, Hoang Minh Nguyen, Janggwan Im, and Daeyoung Kim. 2016. RNN-based
personalized activity recognition in multi-person environment using RFID. In Proceedings of the IEEE International
Conference on Computer and Information Technology (CIT’16). IEEE, 708–715.
[149] Rui Xi, Mengshu Hou, Mingsheng Fu, Hong Qu, and Daibo Liu. 2018. Deep dilated convolution on multimodality
time series for human activity recognition. In Proceedings of the International Joint Conference on Neural Networks.
IEEE, 1–8.
[150] Rui Xi, Ming Li, Mengshu Hou, Mingsheng Fu, Hong Qu, Daibo Liu, and Charles R. Haruna. 2018. Deep dilation on
multimodality time series for human activity recognition. IEEE Access 6 (2018), 53381–53396.
[151] Cheng Xu, Duo Chai, Jie He, Xiaotong Zhang, and Shihong Duan. 2019. InnoHAR: A deep neural network for
complex human activity recognition. IEEE Access 7 (2019), 9893–9902.
[152] Li Xue, Si Xiandong, Nie Lanshun, Li Jiazhen, Ding Renjie, Zhan Dechen, and Chu Dianhui. 2018. Understanding
and improving deep neural network for activity recognition. arXiv preprint arXiv:1805.07020 (2018).
[153] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional
neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International
Joint Conference on Artificial Intelligence.
[154] Yang Yang, Chunping Hou, Yue Lang, Dai Guan, Danyang Huang, and Jinchen Xu. 2019. Open-set human activity
recognition based on micro-Doppler signatures. Pattern Recog. 85 (2019), 60–69.
[155] Zhan Yang, Osolo Ian Raymond, Chengyuan Zhang, Ying Wan, and Jun Long. 2018. DFTerNet: Towards 2-bit dynamic
fusion networks for accurate human activity recognition. IEEE Access 6 (2018), 56750–56764.
[156] Lina Yao, Feiping Nie, Quan Z. Sheng, Tao Gu, Xue Li, and Sen Wang. 2016. Learning from less for better: Semi-supervised
activity recognition via shared structure discovery. In Proceedings of the ACM International Joint Conference
on Pervasive and Ubiquitous Computing. 13–24.
[157] Rui Yao, Guosheng Lin, Qinfeng Shi, and Damith C. Ranasinghe. 2018. Efficient dense labelling of human activity
sequences from wearables using fully convolutional networks. Pattern Recog. 78 (2018), 252–266.
[158] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. 2017. DeepSense: A unified deep learning
framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on
World Wide Web. International World Wide Web Conferences Steering Committee, 351–360.
[159] Yuta Yuki, Junto Nozaki, Kei Hiroi, Katsuhiko Kaji, and Nobuo Kawaguchi. 2018. Activity recognition using dual-ConvLSTM
extracting local and global features for SHL recognition challenge. In Proceedings of the ACM International
Joint Conference and International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers.
ACM, 1643–1651.
[160] Piero Zappi, Clemens Lombriser, Thomas Stiefmeier, Elisabetta Farella, Daniel Roggen, Luca Benini, and Gerhard
Tröster. 2008. Activity recognition from on-body sensors: Accuracy-power trade-off by dynamic sensor selection.
In Proceedings of the European Conference on Wireless Sensor Networks. Springer, 17–33.
[161] Tahmina Zebin, Patricia J. Scully, and Krikor B. Ozanyan. 2016. Human activity recognition with inertial sensors
using a deep learning approach. In Proceedings of the IEEE Conference on Sensors (SENSORS’16). IEEE, 1–3.
[162] Ming Zeng, Haoxiang Gao, Tong Yu, Ole J. Mengshoel, Helge Langseth, Ian Lane, and Xiaobing Liu. 2018. Understanding
and improving recurrent networks for human activity recognition by continuous attention. In Proceedings
of the ACM International Symposium on Wearable Computers. ACM, 56–63.
[163] Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neural
networks for human activity recognition using mobile sensors. In Proceedings of the 6th International Conference on
Mobile Computing, Applications and Services. IEEE, 197–205.
[164] Ming Zeng, Tong Yu, Xiao Wang, Le T. Nguyen, Ole J. Mengshoel, and Ian Lane. 2017. Semi-supervised convolutional
neural networks for human activity recognition. In Proceedings of the IEEE International Conference on Big Data. IEEE,
522–529.
[165] Dalin Zhang, Kaixuan Chen, Debao Jian, and Lina Yao. 2020. Motor imagery classification via temporal attention
cues of graph embedded EEG signals. IEEE J. Biomed. Health Inform. 24, 9 (2020), 2570–2579.
[166] Dalin Zhang, Lina Yao, Kaixuan Chen, Guodong Long, and Sen Wang. 2019. Collective protection: Preventing sensitive
inferences via integrative transformation. In Proceedings of the 19th IEEE International Conference on Data
Mining. IEEE, 1–6.
[167] Dalin Zhang, Lina Yao, Kaixuan Chen, and Jessica Monaghan. 2019. A convolutional recurrent attention model for
subject-independent EEG signal analysis. IEEE Sig. Proc. Lett. 26, 5 (2019), 715–719.
[168] Dalin Zhang, Lina Yao, Kaixuan Chen, and Sen Wang. 2018. Ready for use: Subject-independent movement intention
recognition via a convolutional attention model. In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management. ACM, 1763–1766.
[169] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Xiaojun Chang, and Yunhao Liu. 2019. Making sense of spatio-temporal
preserving representations for EEG-based human intention recognition. IEEE Trans. Cyber. 50, 7 (2019),
3033–3044.
[170] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Pari Delir Haghighi, and Caley Sullivan. 2019. A graph-based
hierarchical attention model for movement intention detection from EEG signals. IEEE Trans. Neural Syst. Rehab.
Eng. 27, 11 (2019), 2247–2253.
[171] Dalin Zhang, Lina Yao, Xiang Zhang, Sen Wang, Weitong Chen, Robert Boots, and Boualem Benatallah. 2018. Cascade
and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer
interface. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[172] Mi Zhang and Alexander A. Sawchuk. 2012. USC-HAD: A daily activity dataset for ubiquitous activity recognition
using wearable sensors. In Proceedings of the ACM Conference on Ubiquitous Computing. ACM, 1036–1043.
[173] Xiang Zhang, Lina Yao, Chaoran Huang, Sen Wang, Mingkui Tan, Guodong Long, and Can Wang. 2018. Multi-modality
sensor data classification with selective attention. In Proceedings of the 27th International Joint Conference
on Artificial Intelligence.
[174] Xiang Zhang, Lina Yao, and Feng Yuan. 2019. Adversarial variational embedding for robust semi-supervised learning.
In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 139–147.
[175] Yanyi Zhang, Xinyu Li, Jianyu Zhang, Shuhong Chen, Moliang Zhou, Richard A. Farneth, Ivan Marsic, and Randall S.
Burd. 2017. Car—A deep learning structure for concurrent activity recognition. In Proceedings of the 16th ACM/IEEE
International Conference on Information Processing in Sensor Networks (IPSN’17). IEEE, 299–300.
[176] Yong Zhang, Yu Zhang, Zhao Zhang, Jie Bao, and Yunpeng Song. 2018. Human activity recognition based on time
series analysis using U-Net. arXiv preprint arXiv:1809.08113 (2018).
[177] Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J. Leon Zhao. 2014. Time series classification using multi-channels deep
convolutional neural networks. In Proceedings of the International Conference on Web-age Information Management.
Springer, 298–310.
[178] Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2019. Zero-effort
cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th International Conference on Mobile Systems,
Applications, and Services. ACM, 313–325.
[179] Jun-Yan Zhu and Jim Foley. 2019. Learning to synthesize and manipulate natural images. IEEE Comput. Graph. Applic.
39, 2 (2019), 14–23.
[180] Han Zou, Yuxun Zhou, Jianfei Yang, Hao Jiang, Lihua Xie, and Costas J. Spanos. 2018. DeepSense: Device-free human
activity recognition via autoencoder long-term recurrent convolutional network. In Proceedings of the International
Conference on Communications (ICC’18). IEEE, 1–6.