
Privacy-aware environmental sound classification for indoor human activity recognition

Wei Wang, University of Twente, Enschede, The Netherlands ([email protected])
Fatjon Seraj, University of Twente, Enschede, The Netherlands ([email protected])
Nirvana Meratnia, University of Twente, Enschede, The Netherlands ([email protected])
Paul J.M. Havinga, University of Twente, Enschede, The Netherlands ([email protected])
ABSTRACT
This paper presents a comparative study on different feature extraction and machine learning techniques for indoor environmental sound classification. Compared to outdoor environmental sound classification systems, indoor systems need to pay special attention to power consumption and privacy. We consider feature calculation complexity, classification accuracy and privacy as evaluation metrics. To ensure privacy, we strip the voice bands from the sound input to make human conversations unrecognizable. With 5 classes of 2500 indoor audio events as input, our experimental results show that 78% classification accuracy can be reached using an SVM model with the LPCC feature. Furthermore, the performance is improved to more than 85% by combining several simple features and dropping unreliable predictions, which only slightly increases the complexity.

CCS CONCEPTS
• Computing methodologies → Feature selection; • Applied computing → Sound and music computing; • Human-centered computing → Ubiquitous and mobile computing theory, concepts and paradigms.

KEYWORDS
Smart Buildings, Privacy-aware Environmental Sound Recognition, Voice Bands Stripping, Internet of Things, Computational Efficiency, Web Crawling, Mel Frequency Cepstral Coefficients, Linear Predictive Cepstral Coefficients, Support Vector Machine

ACM Reference Format:
Wei Wang, Fatjon Seraj, Nirvana Meratnia, and Paul J.M. Havinga. 2019. Privacy-aware environmental sound classification for indoor human activity recognition. In Proceedings of PETRA '19 (Conference'19). ACM, New York, NY, USA, 9 pages. https://doi.org//10.1145/3316782.3321521

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'19, June 5-7, 2019, Rhodes, Greece
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6232-0/19/06...$15.00
https://doi.org//10.1145/3316782.3321521

1 INTRODUCTION
Recent studies show that nowadays people spend on average 80%-90% of their time indoors, and that indoor environments have a large effect on the comfort, health and productivity of the occupants [4]. A new concept of "smart buildings" has therefore emerged, referring to building automation systems (BAS) with automatic centralized control of heating, ventilation and air conditioning, lighting, etc. Smart buildings are designed to improve occupant comfort, enable efficient operation of building systems, reduce energy consumption and operating costs, and improve the life cycle of utilities [10]. Energy efficiency positively impacts the portfolio of both building owners and utility companies, with benefits ranging from reductions in utility bills to grid stability and sustainability [10]. In order to achieve "the smartness", the building first needs sensors to collect wide-ranging and relevant ambient information. These data then need a fully connected network to be gathered. Finally, an intelligent control system learns from these data and reacts to environmental change automatically. Among all the needed information, data about occupants is the key component, as human comfort is the reason these systems exist in the first place. Thus, knowing the human activity, location and mobility patterns inside the building can substantially increase the effectiveness of such solutions. This is particularly important because, knowing the activity of a person, more information can be provided to the building management systems on how to operate more efficiently.

Numerous technologies are used to automatically detect human presence and activities. Such technologies include camera systems that recognize human presence and actions from images, wearable devices that track users' movement, and PIR sensors that give binary information about people's presence [21][18][25]. As many papers have shown, these sensors have pros and cons in different aspects. For instance, PIR sensors are cheap and non-intrusive but only detect the presence of stationary people [16]. A wearable device such as a badge needs careful maintenance and is often felt to be troublesome by users [2]. Camera or infra-red camera systems can recognize human activities in fine grain, but they are expensive, need line-of-sight and, more importantly, many people worry that their privacy is exposed [14][26].

Another type of sensor worth mentioning is the acoustic sensor. Compared to the aforementioned sensors, acoustic sensors offer many advantages, such as being pervasive and a rich information medium


while being cheap to deploy. Besides the rich information that sound contains, there are also obvious challenges such as noise interference and high attenuation. When applied indoors, sound sensors also introduce some critical privacy concerns: people can feel stalked and spied on when microphones are around them. Although many research initiatives have focused on environmental sound recognition for general purposes, very few are specific to human activity detection. There is still a gap between general sound classification techniques and human-activity-specific techniques inside buildings, e.g. how privacy concerns can be addressed and how the model can be made lightweight enough to fit on IoT devices. In this paper, we fill this gap by exploring the possibilities of accurately using sound sensors to recognize human activity while preserving privacy.

Our idea for preserving privacy is to strip the voice bands from the input audio stream, as the human voice largely falls into the range of 80 Hz - 3 kHz while environmental sound events cover a much wider range. We achieve this by using a hardware band-pass acoustic sensor that omits human voice frequencies at the device layer. As this is a challenging task, we conduct comparative experiments to find the most suitable model and features that provide presence information in the absence of the human voice bands. We make use of machine learning techniques to classify different human activity classes based on sound. Both the classification accuracy and the calculation density results are given in the evaluation. We also identify the highest-rated model that can be used by IoT devices with sound-capable functions to automatically detect ambient human behaviors in real time.

The remainder of the paper is organized as follows: Section 2 gives an overview of the related work on environmental sound recognition. Section 3 describes the methodology used to define the characteristics, extract features and model the candidate sound event classifiers. Section 4 describes the experimental and evaluation phase of this work. We conclude this paper with an open discussion in Section 5.

2 RELATED WORK
Classification of human activities based on sound falls under the category of 'Environmental Sound Recognition' (ESR). ESR aims to automatically detect and identify audio events from a captured audio signal. Compared to other audio areas such as speech or music recognition, general environmental sound recognition has received less attention and is less effective, because the sound input is unstructured and arbitrary in size and pattern.

Eronen et al. [11] developed an environmental context classification system based on audio. With low-order hidden Markov models and MFCC (Mel Frequency Cepstral Coefficient) features, they achieved an accuracy of 82% for 6 classes, which dropped to 58% when the number of classes was increased to 24. Cai et al. [6] proposed a key audio effect detection framework for context inference, where movies are used as the dataset. The advantage of their model over others is that it first infers the context from carefully picked and distinguishable key sound events (e.g., explosions and gunshots indicate an action movie). However, these 'key events' need to be chosen manually and carefully, which is hardly scalable. Chu et al. [7] proposed an environmental sound recognition model to classify context, with continuous audio recordings from 14 different scenarios. Their innovation was a matching pursuit algorithm to extract features from the time-frequency domain to complement MFCC. Compared to previous works, the rich features used helped improve the classification accuracy. Heittola et al. [17] developed a context-dependent sound event detection system, in which a context label provides prior probabilities to an HMM model to improve event detection accuracy. The system also adopted an iterative algorithm based on the original HMM to model overlapping sound events.

The aforementioned papers mainly focus on classifying the sound context rather than the events. Later works used the features and models from these studies to solve another problem, where the focus is on shorter audio durations, the so-called events. From a data processing perspective, the algorithms presented in these papers normally cut the audio stream into fixed-length frames (millisecond level) and build models from the feature arrays extracted from them. This is because of the time-varying characteristic of audio signals: short frame-based features can approximate time-invariant functions and represent the details well.

Cowling et al. [8] compared several feature extraction and classification techniques typically used in speech recognition for general environmental sound recognition. In particular, tempo-frequency features such as the STFT (Short Time Fourier Transform) were found to perform better than stationary MFCC features. Mitrović et al. [23] compiled a survey on audio feature extraction techniques for audio retrieval systems. They proposed a taxonomy of audio features together with the relevant application domains. Temporal domain, frequency domain, mel-frequency and harmonic features were the most popular features in environmental sound recognition. Tran et al. [30] proposed a probabilistic distance SVM (Support Vector Machine) for online sound event classification. This method has comparably good performance with very short latency and low computation cost compared to other mentioned approaches. STE (Short Time Energy) was the only feature used in their model; since in real online applications the sound power is highly related to the distance of the sound source, using this feature alone could be highly biased for the sound recognition task. Piczak [24] applied a CNN (Convolutional Neural Network), a method usually applied to image processing, to classify environmental sound and achieved results comparable with traditional methods (64.5% accuracy for 10 classes). Adavanne et al. [3] used multi-channel microphones and RNNs (Recurrent Neural Networks) for sound event detection, which improved the accuracy of detecting overlapping sound events. Lopatka et al. [20] proposed an acoustic surveillance system to recognize acoustic events indicating possible threats like gunshots and explosions. The accuracy was not encouraging compared to the state of the art, but the advantages are the low-cost model and real-time processing capacity. Dong et al. [9] used sensor fusion on PIR, CO2 and temperature data combined with acoustic sensors to detect and predict user occupancy patterns. They used only the sound volume, with no further exploration of other sound-based features. Zigel et al. [33] used a combination of sound and floor vibrations to automatically detect fall events in order to monitor elderly people living alone. Sound event length, energy, and MFCC features were extracted and used for event classification. Table 1 shows a summary of the related works.


Table 1: Features and Methods in Related Works

References | Scenario/Data Source | Event/Context | Platform | Online/Offline | Features | Models
[11] | General | C | None | Offline | MFCC | HMM
[6] | Movie | C | None | Offline | Spectral Flux, Harmonicity, MFCC | SVM, BN
[7] | General | C | None | Offline | MFCC, temporal signatures, time-frequency | SVM, BN
[17] | Transport, Shop, Open Space | C&E | None | Offline | MFCC | GMM, HMM
[3] | General | E | None | Offline | MFCC, Pitch, TDOA | RNN
[8] | General | E | None | Offline | FFT subband, STFT subband, MFCC | ANN, GMM
[24] | General Urban | E | None | Offline | MFCC | CNN
[9] | Conference Room | E | SN | Online | Volume | SMM
[33] | People Falling | E | None | Online | Length, STE, MFCC | HMM
[29] | General | E | None | Online | STE | SVM

C = Context, E = Event, SN = Sensor Network, BN = Bayes Network, SMM = Semi-Markov Model, GMM = Gaussian Mixture Model

Figure 1. Audio recognition system

In general, many models have provided good results on the sound event recognition problem; deep learning based models in particular have shown great potential in this field [3][24]. However, to use sound in smart building applications, there are still several drawbacks. The biggest problem is that sound may expose more private information about occupants than cameras do. Another drawback of deep learning based models is the lack of a huge training dataset: compared to the popular speech recognition problem, the environmental sound recognition problem is less attractive to researchers, while the data sources are more diverse than speech. Moreover, the models also need to be lightweight in order to run on IoT devices. Our work fills these gaps and provides solutions to these problems. In our work, we first strip the voice bands from the audio stream, as human conversations are the most critical privacy concern. Our training data is crawled from multiple free audio websites, so that the data sources are diverse enough for model training. Regarding the models, we mainly investigate low-cost classifiers and efficient feature combinations to improve the performance at low computation complexity.

3 METHODOLOGY
Our overall aim is to build an algorithm capable of classifying human activities based on environmental sound in real time. The methodology is shown in Figure 1. The raw audio stream is first segmented into short events, from which important features are extracted, and finally each event is classified into an activity class using a classification model.

3.1 Preprocessing
3.1.1 Segmentation. In this step, the durations of events are extracted from the continuous audio stream against the background noise (or silence). The algorithm that extracts these events is called segmentation. Although segmentation is as important as the other steps in our algorithm, it is not the focus of this paper, since numerous general approaches to tackle the problem already exist. Here we provide one simple segmentation approach, which works as follows: (1) the audio stream is smoothed in the time domain and cut into fixed short frames (20 ms), while the power is calculated for each frame; (2) the frames with power higher than a threshold are labeled as 'active', where the threshold can be a preset static value or dynamically adjusted; (3) adjacent 'active' frames are combined to form an event; (4) events shorter than a given duration are dropped, while long ones are truncated, such that the events are between 1 and 3 s long. The reason for choosing this duration range comes from practical experience, as humans can identify sound segments of this length quite well. Figure 2 shows an example of the segmentation results.

3.1.2 Voice bands stripping. The purpose of stripping the human voice is to preserve privacy, as human conversations are one of the most critical privacy concerns in indoor environments. Here we implement a software band-stop filter to eliminate the human voice, while in a real-world implementation this can be realized physically by the acoustic sensor, so that privacy is protected at the device layer. A typical band-stop filter achieves this function: it passes only the bands from zero up to its lower cut-off frequency F_low and the bands above its upper cut-off frequency F_high. This results in a rejection of the frequencies in the band F_low:F_high. In our case F_low = 300 Hz and F_high = 3 kHz; this range is often referred to as the voice bands [31]. By filtering out this band, the human speech content becomes unrecognizable.
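To make the preprocessing concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of the two steps described in Section 3.1: a 300 Hz - 3 kHz Butterworth band-stop filter for voice stripping, followed by a simple frame-energy segmentation. The filter order, the median-based dynamic threshold and the helper names are assumptions made for illustration only.

```python
# Sketch of Section 3.1 preprocessing: voice-band stripping + energy segmentation.
# Parameter choices (filter order, threshold rule) are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def strip_voice_bands(x, sr, f_low=300.0, f_high=3000.0, order=6):
    """Remove the voice bands (f_low..f_high) with a Butterworth band-stop filter."""
    sos = butter(order, [f_low, f_high], btype="bandstop", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

def segment_events(x, sr, frame_len=0.02, min_len=1.0, max_len=3.0, k=3.0):
    """Label 20 ms frames whose power exceeds a threshold as 'active', merge
    adjacent active frames into events, drop short events, truncate long ones."""
    hop = int(frame_len * sr)
    n_frames = len(x) // hop
    power = np.array([np.mean(x[i * hop:(i + 1) * hop] ** 2) for i in range(n_frames)])
    threshold = k * np.median(power)          # assumed dynamic threshold rule
    active = power > threshold

    events, start = [], None
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i
        elif not a and start is not None:
            t0, t1 = start * frame_len, i * frame_len
            if t1 - t0 >= min_len:                         # drop too-short events
                events.append((t0, min(t1, t0 + max_len))) # truncate to max_len
            start = None
    return events  # list of (start_s, end_s) pairs
```

In a deployed system the band-stop step would happen in hardware at the sensor, as the paper notes; the software filter here only emulates that behaviour for offline experiments.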


Figure 2. An example of audio segmentation: (a) an audio stream with silence; (b) the event is located by thresholding (between the two vertical lines).

Figure 3 shows the signal before and after voice-band truncation in the time and frequency domains. Obviously, much information is removed to preserve privacy, resulting in a more challenging classification problem.

3.2 Feature Extraction
Audio data is a continuous stream of high-sampling-rate information. This stream of continuous data can be transformed into a reduced set of features which contain the most important and most relevant information for the classification task. To provide an intuitive impression, we choose a representative sample from each class and plot the signal in different domains, as shown in Figure 6 [23].

A single feature from a single domain only represents limited information and is thus hard to classify on. However, by combining multiple features, the class characterization becomes conspicuous. In this paper, we first cut the audio stream into smaller frames of fixed length that partly overlap (e.g. 20 ms frame length with 10 ms overlap). We then calculate the statistics (mean and variance) of all frame features as the representation of the whole event. There are several benefits to this aggregation of short frames: firstly, the features extracted from any event have the same length, irrespective of the event duration; secondly, the statistics act as a global representation which treats the event as a whole. Figure 7 shows the feature extraction flow.
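As an illustration of this framing-and-aggregation scheme (a sketch under assumed parameter values, not the paper's code), the snippet below windows an event into 20 ms frames with 10 ms overlap, computes two simple per-frame features (STE and ZCR), and aggregates them into a fixed-length mean/variance vector.

```python
# Sketch of the per-frame feature extraction and mean/variance aggregation of
# Section 3.2 (frame/hop sizes and the two example features are assumptions).
import numpy as np

def frame_signal(x, sr, frame_len=0.02, hop_len=0.01):
    """Split an event into partly overlapping frames (one frame per row)."""
    n, h = int(frame_len * sr), int(hop_len * sr)
    n_frames = 1 + max(0, (len(x) - n) // h)
    return np.stack([x[i * h:i * h + n] for i in range(n_frames)])

def short_time_energy(frames):
    return np.mean(frames ** 2, axis=1)                           # STE per frame

def zero_crossing_rate(frames):
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)    # ZCR per frame

def event_feature_vector(x, sr):
    """Fixed-length representation of a whole event: mean and variance of every
    per-frame feature, independent of the event duration."""
    frames = frame_signal(x, sr)
    per_frame = np.stack([short_time_energy(frames),
                          zero_crossing_rate(frames)], axis=0)
    return np.concatenate([per_frame.mean(axis=1), per_frame.var(axis=1)])
```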
Figures 4 and 5 show the box-plot of the per-frame values and the scatter-plot of the statistics of the feature 'spectral spread'. At first glance, basically no conclusion can be drawn from a single frame feature value, but the statistical feature of an event shows a much clearer pattern and is easier to classify.

3.2.1 Feature list. While there are numerous audio features, we mainly select features that have proven to be highly efficient in audio recognition tasks. We also choose features from different domains and with different characters, since machine learning works better when combining features with low correlation. The features used in our paper are listed in Table 2, including temporal-domain, frequency-domain, mel-frequency and other audio features.

3.2.2 Temporal domain features. The temporal domain is the native domain of audio signals. The features are extracted directly from the raw signal without any pre-processing. Consequently, the computational complexity and delay tend to be low. We select the following features from the temporal domain to describe different aspects of the signal:
(i) Short-time energy (STE) is a widely used feature in audio analysis which describes the energy of the signal, calculated as the mean square of the signal per frame [23].
(ii) Zero crossing rate (ZCR) is the rate at which the signal changes from positive to negative or vice versa. This feature is one of the simplest and most widely used in speech recognition, and characterizes the dominant frequency of the signal [19].
(iii) Temporal entropy (TE) is the entropy of the time domain per frame, which characterizes the dispersal of acoustic energy [28].

3.2.3 Frequency domain features.
(i) Spectral centroid is calculated as the weighted mean of the frequencies, with the magnitudes as weights. It indicates where the "center of mass" of the spectrum is located. Sometimes the median is used as the "center" rather than the mean [23].
(ii) Spectral spread is the magnitude-weighted average of the differences between the spectral components and the spectral centroid; together they describe how dispersed and wide the frequency bands are [23].
(iii) Spectral entropy is calculated as the entropy of the spectrum; it reflects the flatness of the spectrum [28].
(iv) Spectral flux also describes the flatness in the spectral domain, but across frames. It is calculated as the 2-norm (Euclidean) distance between the power spectra of adjacent frames [28].
(v) Spectral rolloff represents the frequency below which N% of the power is concentrated. Spectral rolloff is extensively used in music information retrieval [23].

3.2.4 Other features.
(i) Mel-frequency cepstral coefficients (MFCC) are the cepstral representation of the mel-frequency spectrum. Compared to the original linear frequency bands, mel-frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response. MFCC describes the spectral envelope and is commonly used in speech recognition [7].
(ii) Linear Predictive Coding (LPC) is an auto-regression model in which a predictor estimates a sample as a linear combination of previous sample values, as in Equation (1), where s is the signal sequence and a_1 to a_p are the coefficients. LPC can be used in audio compression and speech recognition to represent the spectral envelope. However, LPC is prone to disturbance: small changes in the LPC values can cause large deviations in the spectrum [23].

Figure 3. Voice bands truncation: (a) original signal in the time domain; (b) truncated signal in the time domain; (c) original signal in the frequency domain; (d) truncated signal in the frequency domain (0 Hz to 8 kHz).

Figure 4. Feature 'frequency-spread' on a frame basis. Figure 5. Feature 'frequency-spread' statistics on an event basis.

Figure 6. An example from each event class (waveform, PSD spectrogram, log (mel) filterbank energies, and mel-frequency cepstrum).

Figure 7. Feature extraction flow: raw signal → framing/windowing → domain transformation → per-frame feature extraction (ft[i] = Feature_fn(frame[i])) → event feature array FT = [mean(ft), cov(ft)].

Table 2: Used feature list

Nr. | Temporal | Frequency | Mel & Linear predictive
1   | ZCR      | centroid  | MFCC
2   | STE      | median    | LPCC
3   | TE       | spread    | LSF
4   |          | entropy   | Chromagram
5   |          | flux      |
6   |          | rolloff   |

s[n] ≈ a_1·s[n-1] + a_2·s[n-2] + ... + a_p·s[n-p]    (1)

(iii) Linear predictive cepstral coefficients (LPCC) are the cepstral representation of LPC [23].
(iv) Line Spectrum Frequencies (LSF) are the roots of the two polynomials decomposed from the LPC polynomial [23].
(v) Chromagram is a well-established tool for processing and analyzing music data which captures harmonic and pitch characteristics of sound. Apart from music applications, chroma features are also powerful mid-level feature representations in content-based audio retrieval or audio matching [23].

Because both LPCC and LSF are derived from LPC, they are alternative representations of LPC and thus contain the same amount of information. However, for LPC a small disturbance in the input may incur a big difference in the output, while LPCC and LSF are more robust in this respect, which makes them better suited to the classification task.
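As a rough illustration of how an LPC-based feature of this family can be computed (a sketch only; the paper does not publish its implementation, and the model order, FFT size and the FFT-based cepstrum of the LPC envelope used as an LPCC-like stand-in are our assumptions):

```python
# Sketch: per-frame LPC coefficients via the autocorrelation (Yule-Walker) method,
# and cepstral coefficients of the resulting LPC envelope as LPCC-like features.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, p=12):
    """Solve the Yule-Walker equations R a = r for the predictor coefficients
    a_1..a_p of s[n] ~ a_1 s[n-1] + ... + a_p s[n-p] (Equation (1))."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:p], r[1:p + 1])

def lpc_cepstrum(a, n_ceps=12, n_fft=512):
    """Real cepstrum of the LPC spectral envelope 1/|A(e^jw)|; one numerical way
    to obtain cepstral coefficients of the LPC model."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)   # prediction-error filter
    log_env = np.log(1.0 / (np.abs(A) + 1e-12))           # log spectral envelope
    return np.fft.irfft(log_env, n_fft)[:n_ceps]
```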
3.3 Classification
After the features have been extracted, a classifier should be learned so that newly arriving data can be classified. This classification function, also called a representation, has a different form for each model, e.g. a search tree in a decision tree or a hyperplane in an SVM.


Learning is therefore a process performed by searching through the representation space to find the hypothesis that best fits the solution.

Classification algorithms can be categorized into two types [5]: (i) stateless algorithms, in which the events to be classified are treated as unrelated to each other, and (ii) stateful algorithms, in which the events are treated as related to each other and put into a context while a memory is updated. A stateful model works best in scenarios where the output is decided not only by the current input but also by previous inputs on the timeline (the state). Stateless models such as GMM, SVM, Random Forest and Neural Networks are widely used in sound event classification, while for context classification, stateful models such as Bayes Networks, HMM and RNN are preferred. In our task, we think there is no need to remember past information, since it is enough for a human to tell what is happening when a sound event is heard, even without much context.

While there might be a plethora of machine learning algorithm variations with different types of representation space, their effectiveness for different scenarios can only be confirmed empirically. Hence, we select several commonly used algorithms to conduct an empirical comparison and find the best candidate for the problem. The chosen models in this paper are listed below; all of them are stateless models:
(1) Decision tree
(2) Random Forest
(3) Mixed Gaussian
(4) Naive Bayes
(5) SVM (linear & RBF kernel)
(6) Artificial Neural Network
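A minimal sketch of how such an empirical comparison can be set up with scikit-learn is shown below (our own illustration, not the authors' code). The RBF kernel and the 50-neuron hidden layer follow values quoted in Section 4.3; everything else uses library defaults, and the "Mixed Gaussian" model is omitted because it needs a small per-class GaussianMixture wrapper.

```python
# Sketch: 5-fold cross-validation over the stateless classifiers on the event
# feature vectors X (one row per event) and labels y.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

models = {
    "Decision tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "ANN (1x50)": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000),
}

def compare_models(X, y):
    """Rank the candidate classifiers by mean 5-fold cross-validation accuracy."""
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```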

4 EXPERIMENTAL EVALUATION
In this section, we present the experimental results on our labeled audio dataset. Discussions are given to provide instructive insights into how to build real-time indoor environmental sound recognition applications. The entire dataset is split into two parts: a training set and a test set. On the training set, 5-fold cross-validation is used to build the best-fitting model, which is subsequently applied to the test set for validation.

4.1 Dataset Preparation
In order to build a general human activity classification model using sound, we first prepare datasets from different indoor environments. The audio data must concern indoor sound events and be highly relevant to human activities and location. Unfortunately, we could not find databases that perfectly match these objectives. Most well-known environmental sound datasets serve very general purposes, for instance the urban sound dataset from NYU [27]. Part of our data comes from TUT [22], which contains labeled sound events recorded in streets and residential houses. The other part of our data comes from online sources (www.freesound.org, www.freesfx.co.uk) [12][1], as shown in Table 3. As these audio streams have different formats and sampling rates, we first need to convert them into a unified form, i.e. mono, 44.1 kHz, '.wav' format with the same average volume. Through the segmentation algorithm, these long audio streams were further segmented into short sound events. Initially we selected 9 classes of sound events, including: Speech, Crowd Chatting, Cheering, Walking, Applause, Door slamming, Chair moving, Elevator, and Trolley. However, after looking into the dataset, we merged some classes, resulting in 5 classes, mainly for two reasons: (I) some objects make very similar sounds which are hard to differentiate even for humans; for example, a door and a chair can sometimes make a similarly crisp and sharp sound; (II) some sounds always appear simultaneously or very close together in certain scenarios; for example, at a party cheering is always accompanied by applause. Since these sounds are hard to separate, we merge them together into one class. After merging into 5 classes, the data sample distribution is as shown in Figure 8; 150 samples from each class are chosen for training, for balance, and the rest are used for testing.

Table 3: Data Source

Ref  | Scenario/Data source     | Format         | SR (kHz) | Length (hour)
[22] | residential house        | wav            | 44.1     | 24
[12] | random sound recordings  | wav, mp3, aiff | 22, 44.1 | 20
[1]  | random sound effects     | wav, mp3       | 22, 44.1 | 5

Figure 8. Data sample count for each class.

4.2 Feature Comparison
In this subsection, we compare different feature extraction techniques from the performance and complexity perspectives. The performance is represented by the overall classification accuracy and F1-score, and the complexity is calculated as the ratio of feature extraction time to audio duration, which should be as small as possible for real-time processing. Throughout the comparative experiments, we choose the SVM (with RBF kernel) model together with the MFCC feature as the baseline. This baseline has been used in both environmental sound classification and sound context classification and has achieved good results [23][30].


Figure 9. Performance and complexity of single features: (a) training with non-stripped data; (b) training with voice-stripped data (MFCC + SVM as baseline).
Figure 10. Greedy search for feature combination.

Figure 9 shows the performance of each single feature with and without voice bands. The MFCC of the baseline is the best before voice-band truncation, while LPCC is the best after truncation. These results show a significant performance drop for MFCC after voice-band truncation (from 82% to 77%), which is reasonable because MFCC is designed for human speech recognition and as such provides better resolution in the lower frequency bands. LPCC, in contrast, portrays the smoothed spectral envelope of the entire frequency band without discrimination, so it works nearly as well with or without the voice bands.

In machine learning, multiple features can be combined to reach better performance, as more features provide more information. However, with a fixed number of training samples, this does not mean that more features are always better: the predictive power normally first increases as the number of features grows and then decreases [13], mainly because duplicated and irrelevant features only introduce noise into the model. In offline audio classification, this is normally handled by techniques like PCA (Principal Component Analysis) to first trim redundant information. However, in real-time applications, which need to be efficient, the selection of features should be decided proactively.

Since brute-forcing all combinations is too time-consuming, we use a heuristic greedy search to find a local optimum instead. It starts by selecting the best single feature; then, at each iteration, one more feature is added depending on the classification accuracy. Figure 10 shows the results of each iteration of the greedy algorithm, both with and without voice truncation, together with the all-features result for comparison. In both cases, the combination of features improves the performance significantly. Looking at the voice-bands-truncated case, the best combination contains 4 features: LPCC, spectral flux, STE and temporal entropy. This result matches our expectations, as these features portray different perspectives of the audio signal. Not surprisingly, using all features as model input is not preferred, as high-dimensional data causes overfitting and is much less efficient. The calculation complexity of a feature combination is smaller than the simple sum of its parts, since the features share some steps, such as the FFT for all frequency domain features.
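A compact sketch of this greedy forward selection is given below (our illustration; `score(features)` is an assumed helper that trains the classifier and returns cross-validated accuracy for a given feature subset):

```python
# Sketch of the greedy forward search over feature groups described in Section 4.2.
def greedy_feature_selection(all_features, score, max_features=None):
    selected, best_so_far = [], -1.0
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature and keep the one that helps most.
        best_feat, best_acc = None, best_so_far
        for feat in remaining:
            acc = score(selected + [feat])
            if acc > best_acc:
                best_feat, best_acc = feat, acc
        if best_feat is None:          # no single addition improves accuracy
            break
        selected.append(best_feat)
        remaining.remove(best_feat)
        best_so_far = best_acc
    return selected, best_so_far

# Example call (feature names are illustrative):
# best_set, acc = greedy_feature_selection(
#     ["LPCC", "MFCC", "STE", "ZCR", "TE", "flux", "centroid"], score)
```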


Figure 11. Frame overlapping test.
Figure 12. Performance of classifiers.
Figure 13. Confidence of prediction.

In speech recognition, common techniques such as the STFT are used to extract the spectrum from overlapping frames, normally with half or three quarters of the frame length overlapped. This is because audio signals, and especially speech, are highly time-varying and need more sophisticated analysis to reflect the details. However, using overlapping frames also doubles or quadruples the complexity. We compared the results for different overlap lengths, using the best feature combination found above. Figure 11 shows the results with different overlapping frame lengths. The results show that it is best to use a half-frame-length overlap, which leads to approximately the same result as 3/4 overlap but is much more efficient.

4.3 Comparison of classification models
We also compare multiple classification models with the same input features. In total we adopted 6 stateless models, using the LPCC feature as input.

The performance of the classifiers is shown in Figure 12. The results show that SVM (RBF kernel) is the best, and the second best is the ANN. We used one hidden layer with 50 neurons for the ANN model, based on the criterion in [15]. It is likely that our training data was not large enough, which led to slightly worse results compared to the SVM.

4.4 Confidence of classification results
In our application, knowing the classification result is not enough; we also need to score how reliable the result is. The high-score classification results are kept, while the low-score ones are discarded. This is because there is a lot of noise and there are many irrelevant events in the environment which are hard to filter out beforehand. However, such events are likely to get a low score in the classification model, since these sounds carry very different features. On the other hand, it is acceptable if some events of interest are discarded by mistake, as knowing part of the events is enough for detecting human behaviours. Classification algorithms such as Naive Bayes give both the classification result and a probability, i.e. a degree of certainty about the result, which corresponds exactly to the score we need. However, the SVM algorithm does not provide the prediction probability directly; it only gives the decision margins with respect to the support vectors, where a larger margin means more confidence. Hence, we need to calibrate these margins into a prediction probability, i.e. the score.
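One way to obtain such calibrated scores (a sketch, not necessarily the authors' exact setup) is scikit-learn's SVC with probability=True, which internally applies Platt-style calibration extended to the multiclass case via the pairwise coupling of Wu et al. [32]; events whose best class probability falls below a threshold can then be dropped. The 0.9 threshold below is illustrative only.

```python
# Sketch: calibrated SVM scores used to drop unreliable predictions.
import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel="rbf", probability=True)   # Platt/Wu-style probability estimates

def predict_with_confidence(clf, X, threshold=0.9):
    """Return (label, probability) per event, or (None, p) when the best class
    probability is below the threshold and the event is discarded."""
    proba = clf.predict_proba(X)
    labels = clf.classes_[np.argmax(proba, axis=1)]
    best = proba.max(axis=1)
    return [(lab, p) if p >= threshold else (None, p)
            for lab, p in zip(labels, best)]

# Usage sketch:
# clf.fit(X_train, y_train)
# results = predict_with_confidence(clf, X_test)
```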
We use the algorithm from Wu et al. [32] to calibrate the SVM outputs into prediction probabilities. This algorithm is the multiclass version of Platt scaling: logistic regression is applied to the SVM's scores and fitted by an additional cross-validation on disjoint training data. However, this method normally needs a large dataset (1000 samples per class) to work well; otherwise the estimated probability may be inconsistent with the classification result. Figure 13 shows the real accuracy versus the calibrated probability of our classification results, where the green bars show the sample distribution. The real accuracy and the prediction probability are roughly proportional, which means the higher the score, the more accurate the result. The violations of monotonically increasing prediction accuracy could have multiple causes: our dataset is not big enough and the calibration method is not perfect. This result shows that nearly half of the results are given a probability of 90%, and their real accuracy is at the same level.

Figure 14. Confusion matrices of the training set and the test set.

Figure 14 shows the confusion matrices of our best model for the training and test sets. 'Applause' and 'crowd' are the two best classes in terms of accuracy. 'Door' and 'footsteps' are the most likely to be mistaken for each other, which makes sense, since a single footstep and a door slamming may sound similar. This mistake mostly happens when people walk very slowly. Another finding is that 'speech' does not perform as well in the test as in the training, partly because speech sound characteristics are more variable and complicated compared to the others, so that 150 samples are not enough for training, and partly because the voice bands are truncated, which harms this class the most.

5 OPEN DISCUSSION
To recognize indoor human activities using acoustic sensors, performance is not the only concern; efficiency and privacy are equally important considerations. Our results show that with the voice bands stripped off for privacy, the system can still detect human activities with quite high accuracy (86%). In order to find the best features for this task, we first conducted a comparative test of all single features. The results show that LPCC performs best for non-voice signals, and MFCC performs best for the full signal bands. These results match our expectation that MFCC is designed for speech recognition and voice-band details. We also tested combinations of features in order to further improve the accuracy. In a greedy search experiment, LPCC together with one frequency-domain feature (flux) and two time-domain features (STE, entropy) performed best, since they portray different aspects of the signal: LPCC and flux capture the static and dynamic characteristics of the frequency domain, respectively, while STE and entropy do a similar job in the time domain. The third feature test concerned the frame overlap, where the trade-off is between complexity and performance. In our experiment, a half-frame-length overlap (0.02 s frame length, 0.01 s overlap) was the best.

The classification experiments show that SVM performs best, and the prediction probability can be calibrated through the Platt scaling algorithm to filter out unreliable results. If we only keep the better half of the predictions, the accuracy increases to more than 90%. However, the real accuracy does not increase monotonically with the prediction probability. We think this is because our dataset is not sufficiently large for the Platt scaling algorithm to work well.

The most misclassified classes were 'door' and 'footsteps'; this happens when people walk very slowly and our segmentation algorithm sometimes falsely splits a series of footsteps into smaller events rather than considering them as a whole. In real applications, however,


we could differentiate the two classes with the help of other methods, for example with a sound localization algorithm, since a door cannot move. Even though our model is not perfect, we think this accuracy makes it competent for a real indoor event recognition system. Our model has also proved to be general, since the dataset comes from different contexts and from sound sources of high diversity. In real applications the accuracy could be even higher, since microphone-embedded devices are normally installed at fixed positions, where the sound sources should be more homogeneous and easier to classify.

6 ACKNOWLEDGEMENT
This work is a part of the COPAS (Cooperating Objects for Privacy Aware Smart public buildings) project.

REFERENCES
[1] [n. d.]. freeSFX.co.uk - FREESFX.CO.UK CONTENT PUBLISHER LICENCE AGREEMENT. ([n. d.]). https://www.freesfx.co.uk
[2] G. Acampora, D. J. Cook, P. Rashidi, and A. V. Vasilakos. [n. d.]. A Survey on Ambient Intelligence in Healthcare. 101, 12 ([n. d.]), 2470–2494. https://doi.org/10.1109/JPROC.2013.2262913
[3] Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, and Tuomas Virtanen. 2017. Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features. arXiv:1706.02293 [cs] (2017). http://arxiv.org/abs/1706.02293
[4] Mohammed Arif, Martha Katafygiotou, Ahmed Mazroei, Amit Kaushik, Esam Elsarrag, et al. 2016. Impact of indoor environmental quality on occupant well-being and comfort: A review of the literature. International Journal of Sustainable Built Environment 5, 1 (2016), 1–11.
[5] Christopher M. Bishop. 2006. Pattern recognition and machine learning. Springer, New York.
[6] R. Cai, Lie Lu, A. Hanjalic, Hong-Jiang Zhang, and Lian-Hong Cai. 2006. A flexible framework for key audio effects detection and auditory context inference. IEEE Transactions on Audio, Speech, and Language Processing 14, 3 (2006), 1026–1039. https://doi.org/10.1109/TSA.2005.857575
[7] S. Chu, S. Narayanan, and C. C. J. Kuo. 2009. Environmental Sound Recognition With Time–Frequency Audio Features. IEEE Transactions on Audio, Speech, and Language Processing 17, 6 (2009), 1142–1158. https://doi.org/10.1109/TASL.2009.2017438
[8] Michael Cowling and Renate Sitte. 2003. Comparison of techniques for environmental sound recognition. Pattern Recognition Letters 24, 15 (2003), 2895–2907. https://doi.org/10.1016/S0167-8655(03)00147-8
[9] Bing Dong and Burton Andrews. 2009. Sensor-based occupancy behavioral pattern recognition for energy and comfort management in intelligent buildings. In Proceedings of Building Simulation. 1444–1451. https://pdfs.semanticscholar.org/de95/b672e9e30b04749623c2d92c89f256eedda4.pdf
[10] Monica Drăgoicea, Laurenţiu Bucur, and Monica Pătraşcu. 2013. A service oriented simulation architecture for intelligent building management. In International Conference on Exploring Services Science. Springer, 14–28.
[11] Antti J Eronen, Vesa T Peltonen, Juha T Tuomi, Anssi P Klapuri, Seppo Fagerlund, Timo Sorsa, Gaëtan Lorho, and Jyri Huopaniemi. 2006. Audio-based context recognition. IEEE Transactions on Audio, Speech, and Language Processing 14, 1 (2006), 321–329.
[12] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound Technical Demo. In ACM International Conference on Multimedia (MM'13). ACM, Barcelona, Spain, 411–412. https://doi.org/10.1145/2502081.2502245
[13] Jerome H. Friedman. 1997. On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery 1, 1 (01 Mar 1997), 55–77. https://doi.org/10.1023/A:1009778005914
[14] Timothy D Griffiths, Adrian Rees, Caroline Witton, A Shakir Ra'ad, G Bruce Henning, and Gary GR Green. 1996. Evidence for a sound movement area in the human cerebral cortex. Nature 383, 6599 (1996), 425.
[15] M.T. Hagan, H.B. Demuth, and M.H. Beale. 2014. Neural network design. Martin Hagan. https://books.google.nl/books?id=4EW9oQEACAAJ
[16] Ebenezer Hailemariam, Rhys Goldstein, Ramtin Attar, and Azam Khan. [n. d.]. Real-time Occupancy Detection Using Decision Trees with Multiple Sensor Types. In Proceedings of the 2011 Symposium on Simulation for Architecture and Urban Design (SimAUD '11). Society for Computer Simulation International, 141–148. http://dl.acm.org/citation.cfm?id=2048536.2048555
[17] Toni Heittola, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen. 2013. Context-dependent sound event detection. EURASIP Journal on Audio, Speech, and Music Processing 2013, 1 (2013), 1.
[18] Eun Jeon, Jong-Suk Choi, Ji Lee, Kwang Shin, Yeong Kim, Toan Le, and Kang Park. 2015. Human detection based on the generation of a background image by using a far-infrared light camera. Sensors 15, 3 (2015), 6763–6788.
[19] Benjamin Kedem. 1986. Spectral analysis and discrimination by zero-crossings. Proc. IEEE 74, 11 (1986), 1477–1493.
[20] K. Lopatka, J. Kotus, and A. Czyzewski. 2016. Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimedia Tools and Applications 75, 17 (01 Sep 2016), 10407–10439. https://doi.org/10.1007/s11042-015-3105-4
[21] Konstantinos Makantasis, Antonios Nikitakis, Anastasios D Doulamis, Nikolaos D Doulamis, and Ioannis Papaefstathiou. 2018. Data-driven background subtraction algorithm for in-camera acceleration in thermal imagery. IEEE Transactions on Circuits and Systems for Video Technology 28, 9 (2018), 2090–2104.
[22] A. Mesaros, T. Heittola, and T. Virtanen. 2016. TUT database for acoustic scene classification and sound event detection. In 2016 24th European Signal Processing Conference (EUSIPCO). 1128–1132. https://doi.org/10.1109/EUSIPCO.2016.7760424
[23] Dalibor Mitrović, Matthias Zeppelzauer, and Christian Breiteneder. 2010. Features for Content-Based Audio Retrieval. In Advances in Computers. Vol. 78. Elsevier, 71–150. http://linkinghub.elsevier.com/retrieve/pii/S0065245810780037
[24] K. J. Piczak. 2015. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). 1–6. https://doi.org/10.1109/MLSP.2015.7324337
[25] Rashmi Priyadarshini and RM Mehra. 2015. Quantitative review of occupancy detection technologies. Int. J. Radio Freq 1 (2015), 1–19.
[26] Fariba Sadri. [n. d.]. Ambient Intelligence: A Survey. 43, 4 ([n. d.]), 36:1–36:66. https://doi.org/10.1145/1978802.1978815
[27] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. 2014. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 1041–1044.
[28] Jérôme Sueur, Sandrine Pavoine, Olivier Hamerlynck, and Stéphanie Duvail. 2008. Rapid acoustic survey for biodiversity appraisal. PLoS ONE 3, 12 (2008), e4065.
[29] Huy Dat Tran and Haizhou Li. 2011. Sound event recognition with probabilistic distance SVMs. IEEE Transactions on Audio, Speech, and Language Processing 19, 6 (2011), 1556–1568.
[30] Huy Dat Tran and Haizhou Li. 2011. Sound Event Recognition With Probabilistic Distance SVMs. IEEE Transactions on Audio, Speech, and Language Processing 19, 6 (2011), 1556–1568. https://doi.org/10.1109/TASL.2010.2093519
[31] Wikipedia contributors. 2018. Voice frequency — Wikipedia, The Free Encyclopedia. (2018). https://en.wikipedia.org/w/index.php?title=Voice_frequency&oldid=834458520 [Online; accessed 30-May-2018].
[32] Ting-Fan Wu, Chih-Jen Lin, and Ruby C Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, Aug (2004), 975–1005.
[33] Y. Zigel, D. Litvak, and I. Gannot. 2009. A Method for Automatic Fall Detection of Elderly People Using Floor Vibrations and Sound—Proof of Concept on Human Mimicking Doll Falls. IEEE Transactions on Biomedical Engineering 56, 12 (Dec. 2009), 2858–2867. https://doi.org/10.1109/TBME.2009.2030171

