
Digital Signal Processing 121 (2022) 103306

Contents lists available at ScienceDirect

Digital Signal Processing


www.elsevier.com/locate/dsp

A novel biometric identification system based on fingertip electrocardiogram and speech signals

Gokhan Guven a, Umit Guz b,*, Hakan Gürkan c

a The Scientific and Technological Research Council of Turkey (TUBITAK), Information Technologies Institute, Gebze, Kocaeli, Turkey
b FMV ISIK University, Department of Electrical-Electronics Engineering, Şile, İstanbul, Turkey
c Bursa Technical University, Department of Electrical and Electronics Engineering, Yıldırım, Bursa, Turkey

Article info
Article history: Available online 10 November 2021
Keywords: Biometric identification; Biometric recognition; CNN; Fingertip ECG; Speech

Abstract
In this research work, we propose a one-dimensional Convolutional Neural Network (CNN) based biometric identification system that combines speech and ECG modalities. The aim is to find an effective identification strategy while enhancing both the confidence and the performance of the system. In our first approach, we have developed a voting-based ECG and speech fusion system to improve the overall performance compared to the conventional methods. In the second approach, we have developed a robust rejection algorithm to prevent unauthorized access to the fusion system. We also present a newly developed ECG spike and inconsistent beat removal algorithm to detect and eliminate the problems caused by portable fingertip ECG devices and patient movements. Furthermore, we have achieved a system that can work with only one authorized user by adding a Universal Background Model to our algorithm. In the first approach, the proposed fusion system achieved a 100% accuracy rate for 90 people by taking the average of 3-fold cross-validation. In the second approach, by using 90 people as genuine classes and 26 people as imposter classes, the proposed system achieved 92% accuracy in identifying genuine classes and 96% accuracy in rejecting imposter classes.

© 2021 Elsevier Inc. All rights reserved.

1. Introduction

Biometric recognition is the task of identifying an individual based on his/her physiological or behavioral traits [1]. Over the last three decades, different identification systems have been developed based on different modalities such as fingerprint, face, iris, and palm print. Hardware issues, the practicability of the measurement, or the robustness against spoofing attacks are the main concerns of these systems [1–3]. A retinal scan is a relatively quick and secure procedure; however, the scanning technology is expensive, and the process is sometimes perceived as invasive and unpleasant [2]. The face can easily be captured with a camera's help; however, it is susceptible to illumination and pose variations [4]. The fingerprint and palmprint are the most widely used biometrics. Although the performance of identification systems based on such modalities is relatively high, the reliability of these systems has decreased as they can be stolen easily with latex [2,5,6] or, nowadays, with a high-definition camera. Therefore, a new biometric measure has emerged, which shows the prospect of enhancing the system's reliability and provides a secure alternative to traditional biometrics.

Electrocardiogram (ECG) is a biological sign of the heart's electrical activity [7]. It carries distinctive information depending on personal features such as age, gender, size, position, and heart anatomy [8,9]. Although it is not widespread in practical applications today, it is progressing towards being a modality on which studies have been increasing. It provides strong protection against forgery with its unique feature of liveliness detection [9,10]. The liveliness feature ensures that the individual to be identified is physically present at the time and place of enrollment [10]. Another advantage is that a technology that can artificially produce or imitate an individual's ECG signal has not been developed yet [11,12]. Therefore, it has the potential to be used in innovative home systems and in forensic and government enforcement applications such as control of computer and network access, border control, and cash point systems [11,13]. However, significant disadvantages are that signal irregularities may exist due to the individual's illness, or the procedure could be hampered by difficulties such as a long waiting time for data collection [13]. Sudden changes in emotion such as anger, excitement, and fear also significantly affect such systems, making it relatively difficult to produce higher performance than the traditional systems. Thus, it is necessary to enhance the accuracy of identification.

* Corresponding author.
E-mail addresses: [email protected] (G. Guven), [email protected] (U. Guz), [email protected] (H. Gürkan).

https://doi.org/10.1016/j.dsp.2021.103306
1051-2004/© 2021 Elsevier Inc. All rights reserved.

In this paper, we present an original work addressing the use of ECG signals together with speech signals for human identification purposes.

In our previous work, an ECG-based authentication system was proposed using three different feature extraction methods: Autocorrelation and Discrete Cosine Transform (AC/DCT), Mel-frequency cepstrum coefficients (MFCC), and the QRS complex of the ECG signal. In that research, we presented the average frame recognition rate of each individual [14]. In our other work, we developed a dry-contact portable fingertip ECG device that can collect ECG signals from the left and right hands' thumbs. Besides, we analyzed our existing system's performance by increasing the number of people in the fingertip ECG dataset collected with the proposed device [11].

In this work, we propose a better identification system that combines the benefits of the ECG and speech modalities. Thus, the speech signals increase the system's performance, whereas the ECG signal increases the confidence of the system. Furthermore, we propose an algorithm that prevents unauthorized users' access while at the same time identifying the genuine class within the system's response time. We have added a UBM dataset to the algorithm so that it can work even if there is only one registered authorized user. Thus, the proposed system contains practical solutions for identification and verification tasks in a single algorithm. Moreover, we present a newly developed ECG spike and inconsistent beat removal algorithm to detect and eliminate the unnecessary signals that affect the system performance.

The organization of this paper is as follows: Section 2 presents the related work on ECG-based biometric identification systems and speech-based identification systems. In Section 3, the methodology of the pre-processing stage, feature extraction algorithms, post-processing stage, and classifier is given. Section 4 contains the details of the proposed system. Section 5 gives information about the ECG and speech databases and shows the experimental results. The discussion and conclusion are presented in Sections 6 and 7, respectively.

2. Related works

Before presenting the details of the fingertip ECG and speech-based identification system employed in this study, we first present the literature reviews for ECG-based biometric identification systems and speech-based identification systems, respectively.

2.1. ECG-based identification systems

In recent years, there has been a shift towards the use of medical signals for biometric purposes due to the high security demands, and among all medical signals, the ECG signal has been accepted to be the most prominent [10]. The first study using the ECG for biometric identification purposes was presented by L. Biel et al. [15] in 2001, where a 12-lead ECG monitor was used to collect 360 biometric features for each person. The identification was achieved with 100% accuracy among 22 subjects.

Discriminant analysis was proposed by M. Kyoso and A. Uchiyama [16] to improve the performance of the ECG-based system. To perform the identification process, they suggested the extraction of four feature parameters: P wave duration, PQ interval, QRS interval, and QT interval, which are not affected by the R-R interval or the electrodes' condition.

Z. Zhang and D. Wei [17] improved M. Kyoso and A. Uchiyama's work and suggested new fiducial features: QRS, Q, R, and S durations; Q, R, S, and T amplitudes; ST segments; QRS area; and PR interval. In their system, ECG records were passed through Principal Component Analysis (PCA) to be transformed into a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (which is called the first principal component), the second greatest variance on the second coordinate, and so forth. The ECG data with reduced dimensionality was then trained with a discriminant method using Bayes' theorem, and their system achieved an 85.3% identification rate over lead-I ECG signals of 60 subjects.

In previous research, analytic features were used to capture local information in a heartbeat signal. The performance of such systems depends on the accurate detection of fiducial points or the discriminant amplitude of the features. Therefore, W. Yongjin et al. [13] suggested an appearance-based feature extraction method capturing the holistic patterns in a heartbeat signal. Their proposed approach depends on estimating and comparing the significant coefficients of the discrete cosine transform of the autocorrelated heartbeat signals. The solution they demonstrated was tested on ECG data from two public databases, PTB and MIT-BIH, and achieved 94.47% and 97.8% accuracy rates, respectively.

Z. Zhao and L. Yang [12] introduced a new non-fiducial feature extraction method based on matching pursuit (MP) sparse decomposition for ECG identification. In their system, the R locations of the ECG signals, detected using the wavelet-based QRS delineation method, were passed through the MP sparse decomposition algorithm. By finding the best matching feature projections of multidimensional ECG data onto the span of an overcomplete dictionary, a Support Vector Machine (SVM) classifier was trained and achieved a 95.3% identification rate for 20 people.

X. Lei et al. [18] proposed a deep learning feature-based identification system which reduces the dependence of algorithm accuracy on the origin and length of the ECG signals. Unlike other methods, both fiducial and non-fiducial ECG features were used in their system. A 1-D CNN algorithm was developed to obtain temporal points of the ECG signals and the DCT of the segmented ECG vectors. Their system was tested on a publicly available database using the backpropagation neural network (BPNN) and a non-linear SVM as classifiers and achieved a 99.33% accuracy rate on 100 subjects.

L. Wieclaw et al. [19] proposed a biometric identification system based on deep learning techniques. They constructed a fingertip ECG-based acquisition system which obtained the signals from the subject's right- and left-hand fingers by using Ag/AgCl electrodes. Using the fingertip ECG of 18 subjects, their system achieved 96% identification performance.

H. Gürkan et al. [20] proposed a 2-D CNN-based biometric identification system. Their system works on the principle of finding the QRS complex of each person in the database and putting them into a 256 x 256 QRS image matrix. By designing a proper 2-D CNN architecture for the QRS matrix, they achieved an identification rate of 98.08% for a database consisting of 46 people.

R. Srivastva et al. [21] proposed an ensemble learning technique by gathering four fine-tuned models, i.e., "ResNet" and "DenseNet", into one stacking model, i.e., "PlexNet". PlexNet therefore takes advantage of transfer learning, making a novel model for ECG biometrics that is both robust and secure. By combining two publicly available databases, PTB and CYBHi, 176 subjects with two recorded sessions each were gathered in a mixed dataset. Then, using the mixed dataset, 2-D ECG images of size 150 x 150 containing three heartbeats were fed forward through and trained on the four paths of PlexNet separately. The best identification performance was obtained as 99.66% when 63 people in the CYBHi dataset were used.

Y. Zhang et al. [22] proposed a solution for the three impediments to fingertip ECG: the impact of variation in ECG measurement systems, the high computational complexity of traditional CNN models, and the lack of sufficient fingertip samples. Therefore, they suggested using a recurrence plot (RP), which reflects the motion laws and the information of the nonlinear signal in the high-dimensional phase space.


Blindly segmented three-heartbeat cycles were transformed into a high-dimensional space by the RP. The best identification performance was obtained as 98.77% when 60 people in the CYBHi dataset were used.

2.2. Speech-based identification systems

The utilization of the speech modality in biometric identification, verification, and authentication was accepted by researchers long ago. The research on the speech modality has been divided into two categories, text-dependent and text-independent identification [23]. Text-dependent identification can also be stated as Fixed Phrase Verification because a predefined phrase is used during the training and identification periods. However, text-independent identification is not restricted to any fixed and prompted phrases. Such methods mainly focus on each speaker's spectral characteristics by extracting one or more codebook entities representative of that speaker. Therefore, it allows more freedom to the users. In the following, we provide a literature review for the text-independent identification method.

In 2014, H. Yu et al. [24] proposed a text-independent speaker identification method that used a histogram transform model on MFCC features. They created a system that calculates three neighboring MFCCs using dynamic MFCC properties and places them in a super vector. Then the probability density function (PDF) of the super vector was estimated using the histogram transform (HT) algorithm. When the super vector contained 100 consecutive MFCCs, it achieved a recognition rate of 97.6% on 20 people.

N. Almaadeed et al. [25] proposed a text-independent speaker identification system based on wavelet analysis and neural networks. Their system consists of three neural networks: Radial Basis Function (RBF), Probabilistic Neural Network (PNN), and General Regressive Neural Network (GRNN). Majority voting fusion was applied to the classifiers' results. The system was tested on 34 people in the Grid database, which had 1000 sentences for each person, and a 97.5% identification rate was achieved.

N. AboElenein et al. [26] introduced an improved text-independent speaker identification algorithm using MFCC and GMM methods. In the algorithm, each signal was passed through a pre-processing stage in which downsampling and silence-removing operations were conducted. After that, a pitch detection algorithm was used to detect whether the signals came from females or males, to improve the accuracy. Then, GMM classifiers were trained using the MFCC coefficients of each gender and achieved a 91% identification rate for 36 people.

A. Imran et al. [27] proposed a text-independent speaker identification system using MFCC features and a deep neural network classifier. In their system, MFCC features of 3-second audio signals for each speaker were extracted and mapped into a 299 x 13 2-D vector. The system was tested on 119 speakers in the MOOC database, and it achieved a 93.37% identification rate for a 3-second speech signal. It achieved a 94.44% identification rate for a 5-second speech signal, whereas 94.64% was achieved when a 7-second speech signal was used.

S. El-Moneim et al. [28] proposed text-independent research that pointed out the performance difference between feature extraction methods such as MFCCs, spectrum, and log-spectrum. 70 utterances of 5 female speakers were chosen for training, and 30 utterances were chosen for testing their system. The features of the 5 female speakers were trained by using a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). In their research, the recognition rate was stated to reach 95.33% using MFCC, whereas it reached 98.7% using the spectrum or log-spectrum.

3. Methodology

3.1. Pre-processing stage

The pre-processing stage is the first step, in which all the ECG and speech data are passed through multiple operations such as filtering, silence removing, or spike and inconsistent beat removal. It is a process that eliminates the noise effects on the signals in order to improve the classification performance.

3.1.1. Filtering process

First, both ECG and speech signals were passed through 50 Hz and 100 Hz IIR second-order notch filters. The first filter was used to remove the power-line noise, whereas the second removes its first harmonic. Then, the ECG signals were passed through a window-based FIR low-pass filter with a 150 Hz cut-off frequency; Hamming windowing with a length of 48 was applied to remove the high-frequency components. After that, a fourth-order IIR high-pass filter with a cut-off frequency of 0.5 Hz was applied to the signal to decrease the baseline wander caused by muscle artifacts or the patient's movements. A smoothing process was then applied to the ECG signals in order to further reduce the effect of muscle artifacts and improve the quality of the signal. Speech signals were passed through a window-based FIR low-pass filter with a 4 kHz cut-off frequency. Then a 10 Hz fourth-order IIR high-pass filter was applied to the signals to decrease the instability of the speech records.
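The chain above maps directly onto standard DSP primitives. The following Python/SciPy sketch is a minimal illustration of the described filters; the notch quality factor, exact FIR tap counts for the speech branch, and the use of zero-phase filtering are our assumptions, since the text specifies only filter types, orders, and cut-off frequencies.

import numpy as np
from scipy import signal

def preprocess_ecg(ecg, fs=1000):
    # 50 Hz and 100 Hz second-order IIR notch filters (power line + harmonic).
    for f0 in (50.0, 100.0):
        b, a = signal.iirnotch(f0, Q=30.0, fs=fs)   # Q is an assumption
        ecg = signal.filtfilt(b, a, ecg)
    # Window-based FIR low-pass, 150 Hz cut-off, Hamming window, length 48.
    fir = signal.firwin(48, 150.0, window='hamming', fs=fs)
    ecg = signal.lfilter(fir, 1.0, ecg)
    # Fourth-order IIR high-pass at 0.5 Hz against baseline wander.
    b, a = signal.butter(4, 0.5, btype='highpass', fs=fs)
    return signal.filtfilt(b, a, ecg)

def preprocess_speech(speech, fs=8000):
    # Window-based FIR low-pass near the 4 kHz cut-off (must stay < Nyquist).
    fir = signal.firwin(101, 3999.0, window='hamming', fs=fs)
    speech = signal.lfilter(fir, 1.0, speech)
    # 10 Hz fourth-order IIR high-pass against record instability.
    b, a = signal.butter(4, 10.0, btype='highpass', fs=fs)
    return signal.filtfilt(b, a, speech)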
3.1.2. ECG spike and inconsistent beat detection algorithm

Spikes and inconsistencies in the ECG signal are caused by the inability of portable ECG measurement devices that lack an onboard lead-off detection circuit to detect the signal's absence when the copper plates are not held. Therefore, we present a newly developed ECG spike and inconsistent beat detection algorithm to remove the inconsistent ECG beats and undesired spikes and thereby improve classification performance. The algorithm is presented for the first time in this article. It works on the principle of finding and choosing the most repetitive peaks, which contain the QRS information. Accordingly, the ECG signal is passed through four different filtering processes to obtain the R peaks containing the signal's QRS information [29]. Then, the ECG signal is framed from the R-peak points, and the standard deviation of each frame is calculated. After that, two thresholds, one higher and one lower, are applied to each frame's standard deviation. The frames within these thresholds are accepted as consistent ECGs; if not, they are eliminated. Then, the local maxima of each ECG frame are found, and two thresholds, one higher and one lower, are applied to them. The frames within these thresholds are accepted as spike-free ECGs; if not, they are eliminated. The prerequisite for the algorithm to work efficiently is that the number of consistent ECG beats in the recording must be greater than the number of inconsistent and undesired ones. If this requirement is not met, the algorithm cannot decide which part of the signal is desired and which part is undesired.

Fig. 1 shows the first thirty seconds of a person's 1-minute ECG recording. In this figure, the regions marked with 1 and 2 show the spikes in the ECG signal, while the region marked with 3 indicates the inconsistent beat. The proposed algorithm removes the spikes and inconsistent beats in the ECG signal and splits the signal into spike-free and inconsistency-free ECG frames, as shown in Fig. 2. In the proposed identification system, the second and fourth of the ECG frames in Fig. 2 are chosen, and the first ten-second parts of these signals are used, since the system's response time is 10 seconds.

3
G. Guven, U. Guz and H. Gürkan Digital Signal Processing 121 (2022) 103306

Algorithm 1 ECG spike and inconsistent beat detection algorithm.

Inputs: Fingertip ECG files from the databases
Outputs: Noise-free ECG signal vectors

1:  Apply bandpass filtering, differentiation, squaring, and moving-window integration to the ECG signal, and obtain a signal which contains the QRS information.
2:  Define window length W = 850, where Fs = 1000
3:  Frame the filtered signal into N vectors of length W
4:  for each vector i = 1 to N do
5:      Mi <- find the maximum value
6:      li <- find the index of the maximum value
7:      li <- li - filter group delay
8:  end for
9:  Sort M into descending order
10: Find the value of M in the middle (the median), store it in Th1
11: Define S1 as Th1/35
12: for each vector i = 1 to N do
13:     if Mi < (Th1 - S1) or Mi > (Th1 + S1) then
14:         Erase li from the locations
15:     end if
16: end for
17: for the ECG signal between li and li+1, until i = N - 1 do
18:     if li+1 - li > 2Fs then
19:         continue
20:     end if
21:     Find σi - μi and store it in Vi
22: end for
23: Sort V into descending order
24: Find the value of V in the middle (the median), store it in Th2
25: Define S2 as Th2/5
26: for the ECG signal between li and li+1, until i = N - 1 do
27:     if li+1 - li > 2Fs then
28:         continue
29:     else if Vi < (Th2 - S2) or Vi > (Th2 + S2) then
30:         Erase li from the locations
31:     end if
32: end for
33: for the ECG signal between li and li+1, until i = N - 1 do
34:     if li+1 - li > 2Fs then
35:         continue
36:     else
37:         Find the values of the local maxima and store them in L
38:     end if
39:     if Li < (Th1 - S1) or Li > (Th1 + S1) then
40:         Erase li from the locations
41:     end if
42: end for
43: Separate the ECG signal from location li to li+1 if the gap is not bigger than 2Fs

Fig. 1. ECG signal of a person which contains spikes and an inconsistent beat.

Fig. 2. The spike-free and inconsistency-free ECG frames of a person.
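For readers who prefer executable form, the NumPy sketch below implements the thresholding stages of Algorithm 1 under stated assumptions: the R-peak detection of step 1 is taken as given (e.g., from a Pan-Tompkins-style detector), the medians of the peak amplitudes and of the per-frame σ - μ statistic play the roles of Th1 and Th2, and the function and variable names are ours, not the authors'.

import numpy as np

def clean_ecg_frames(ecg, r_locs, fs=1000):
    # Illustrative re-implementation of Algorithm 1's thresholding stages.
    r_locs = np.asarray(r_locs)

    # Stage 1: keep R peaks whose amplitude lies in [Th1 - S1, Th1 + S1].
    amps = ecg[r_locs]
    th1 = np.median(amps)            # "value of M in the middle"
    s1 = th1 / 35.0
    r_locs = r_locs[(amps >= th1 - s1) & (amps <= th1 + s1)]

    # Stage 2: per-beat statistic (sigma - mu), band around its median.
    frames, stats = [], []
    for a, b in zip(r_locs[:-1], r_locs[1:]):
        if b - a > 2 * fs:           # gap longer than 2 s: inconsistent region
            continue
        f = ecg[a:b]
        frames.append(f)
        stats.append(f.std() - f.mean())
    stats = np.asarray(stats)
    th2 = np.median(stats)
    s2 = th2 / 5.0

    # Stage 3: also drop frames whose local maxima escape the amplitude band
    # (residual spikes between consecutive R peaks).
    clean = []
    for f, v in zip(frames, stats):
        if not (th2 - s2 <= v <= th2 + s2):
            continue
        interior = f[1:-1]
        local_max = interior[(interior > f[:-2]) & (interior > f[2:])]
        if local_max.size and local_max.max() > th1 + s1:
            continue
        clean.append(f)
    return clean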
3.1.3. Voice activity detection and silence removing

Voice Activity Detection (VAD) is a technique that detects the presence or absence of speech. A VAD algorithm assumes that the background noise statistics are stationary over a longer period of time than the presence of the speech signal. Jongseo Sohn et al. [30] developed a decision system that detects the presence or absence of speech by observing the estimated noise statistics in the current frame. Their decision rule derives from the likelihood ratio test (LRT), with the unknown parameters estimated using the maximum likelihood (ML) criterion. In addition to the decision system, they proposed a hang-over scheme to minimize misdetections of weak speech. In our proposed method, the VAD algorithm proposed by Jongseo Sohn was used to find the silence parts of the speech signals. After that, all the silence frames were concatenated into a vector and segmented into 20 ms frames. We found each frame's standard deviation, then calculated the average standard deviation and used it as a threshold. Then, the original speech signal, which had not yet been separated from its silence parts, was split into frames. If the standard deviation of a speech frame was lower than the threshold value, it was accepted as a silence frame and removed from the database.
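A minimal sketch of this energy-style rule, assuming the non-speech frames have already been collected by an external VAD (the Sohn detector itself is not reproduced here):

import numpy as np

def remove_silence(speech, silence, fs=8000, frame_ms=20):
    # `silence` is the concatenation of frames a VAD labeled as non-speech.
    n = int(fs * frame_ms / 1000)

    # Threshold: average standard deviation of the VAD-detected silence frames.
    sil_frames = [silence[i:i + n] for i in range(0, len(silence) - n + 1, n)]
    threshold = np.mean([f.std() for f in sil_frames])

    # Keep only the frames of the original signal whose std exceeds it.
    voiced = [speech[i:i + n] for i in range(0, len(speech) - n + 1, n)
              if speech[i:i + n].std() >= threshold]
    return np.concatenate(voiced) if voiced else np.array([])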
3.2. Feature extraction

Temporal fiducial points such as the P, R, Q, S, and T points, the P wave, QRS complex, T wave, PR interval, QT interval, PR segment, and ST segment were extracted for the ECG-based identification algorithm. The most significant features were determined to be the combination of the P wave, QRS complex, and T wave. In the experimental work, we concluded that the best results were obtained using the QRS complex when the number of classes is less than 45. However, whenever the number of classes increased, the accuracy rate was affected. Therefore, the QRS complex was expanded to cover the P and T waves. The crucial point is that the CNN algorithm works with a fixed number of feature samples. Hence, we conducted a test to find the points that best cover the P, QRS, and T waves in the fingertip ECG data. We found that 165 samples to the left and 319 samples to the right of the R points cover all of them for each user. So, we fixed the P-QRS-T wave size at 485 samples, including the R point, when the sampling rate of the ECG signals was 1000 Hz. After that, we extracted the P-QRS-T wave for every person to construct an ECG feature database. A P-QRS-T interval of a person is illustrated in Fig. 3.

MFCCs are the best features for text-independent speaker identification and verification systems because of their robustness [31–33]. MFCC feature extraction works on the principle of finding the short-term power spectrum of a sound based on the DCT of a log power spectrum on a non-linear Mel frequency scale. It models the behavior of the human auditory system [34]. Therefore, we split the speech signals into 20 ms frames with 65% overlap, and then we removed the silence frames by using the VAD algorithm [30]. After that, we extracted 20 MFCCs for each person's speech frames to construct the MFCC feature database.
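Both feature extractors reduce to a few lines. The sketch below assumes 1 kHz ECG with known R-peak locations and 8 kHz speech; the librosa call is one common way to obtain the 20 MFCCs per 20 ms frame described above, not necessarily the toolbox the authors used.

import numpy as np
import librosa

def extract_pqrst(ecg, r_locs, left=165, right=319):
    # Fixed 485-sample P-QRS-T window around each R peak (165 + 1 + 319).
    wins = [ecg[r - left:r + right + 1] for r in r_locs
            if r - left >= 0 and r + right + 1 <= len(ecg)]
    return np.stack(wins)                # shape: (num_beats, 485)

def extract_mfcc(speech, fs=8000):
    n = int(0.020 * fs)                  # 20 ms -> 160-sample frames
    hop = int(n * 0.35)                  # 65% overlap -> 56-sample hop
    mfcc = librosa.feature.mfcc(y=speech, sr=fs, n_mfcc=20,
                                n_fft=n, hop_length=hop)
    return mfcc.T                        # shape: (num_frames, 20)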


Table 1
1-D CNN architectures.

1-D CNN architecture for ECG-based identification:
    Layer                       Size
    Input Layer                 (1x485)
    Convolutional Layer         (1x3, 2)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Convolutional Layer         (1x5, 4)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Convolutional Layer         (1x7, 8)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Convolutional Layer         (1x9, 16)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Max Pooling Layer           (1x2)
    Fully Connected Layer       1x(485*N)
    Fully Connected Layer       1xN
    Softmax Layer               1xN

1-D CNN architecture for speech-based identification:
    Layer                       Size
    Input Layer                 (1x20)
    Convolutional Layer         (1x9, 8)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Convolutional Layer         (1x12, 16)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Convolutional Layer         (1x15, 32)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Max Pooling Layer           (1x2)
    Convolutional Layer         (1x18, 64)
    Batch Normalization Layer   -
    ReLU Layer                  -
    Max Pooling Layer           (1x2)
    Fully Connected Layer       1x(20*N)
    Fully Connected Layer       1xN
    Softmax Layer               1xN

 
3.3. Post-processing stage

3.3.1. Vector quantization
Vector quantization (VQ) is used to prevent the overfitting problem in machine learning. It eliminates redundant features so that correlated data does not confuse the classifier. In the ECG-based identification process, the VQ algorithm was applied only to the dataset used in the training process, and 16 significant P-QRS-T features were found for each person, because the ECG signal of a person does not vary much. In the speech-based identification, the VQ process was applied to both the training and test data, and 32 significant MFCC features were found for every 10 seconds of a person's speech signal. The following equations show how the algorithm proceeds for the speech-based identification system:

N = rt / (ft - ft x ovl)   (1)

where N represents the number of MFCC features found in a specific response time, rt represents the response time of the algorithm (10 s in our application), ft represents the MFCC framing time (0.02 s in our application), and ovl represents the frame overlap percentage (0.65 in our application).

k = 2^⌊log2(N / cal)⌋   (2)

In equation (2), cal is the calibration value (30 in our application), and k is the number of significant MFCC features which we want to find by using the VQ algorithm. With rt = 10 s, equation (1) gives N ≈ 1428 MFCC vectors, and equation (2) then gives k = 2^⌊log2(1428/30)⌋ = 2^5 = 32.

3.3.2. Min-max normalization
Min-max normalization is a process where the MFCC and P-QRS-T features are normalized before the training stage to decrease the negative effect of the recording devices. In the proposed method, the features were normalized into the range of 0 and 1 using the following formula:

X' = a + (X - Xmin)(b - a) / (Xmax - Xmin)   (3)

In the proposed method, b and a are equal to 1 and 0, respectively.
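The VQ step is, in effect, codebook learning over the feature vectors. The paper does not name the codebook training algorithm, so the k-means-based sketch below is an assumption of one standard way to build such a codebook; the arithmetic of Eqs. (1)-(2) is reproduced in code.

import numpy as np
from scipy.cluster.vq import kmeans2

def significant_mfcc(feats, rt=10.0, ft=0.02, ovl=0.65, cal=30):
    # Eq. (1): number of MFCC vectors in one response time (~1428 for 10 s).
    n = rt / (ft - ft * ovl)
    # Eq. (2): codebook size, k = 2^floor(log2(n / cal)) = 32 for cal = 30.
    k = 2 ** int(np.floor(np.log2(n / cal)))
    codebook, _ = kmeans2(feats, k, minit='++', seed=0)
    return codebook                      # k "significant" MFCC vectors

def min_max(x, a=0.0, b=1.0):
    # Eq. (3): scale features into [a, b] (here [0, 1]).
    return a + (x - x.min()) * (b - a) / (x.max() - x.min())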
Fig. 3. P-QRS-T interval of a person.
3.4. 1-D convolutional neural network classifier

Artificial Neural Networks (ANN) have become popular over a wide range of data with non-linear features during the last decade. With an increasing number of hidden layers, they provide complex decisions where classical machine learning algorithms could not. A CNN is a feed-forward ANN with additional convolutional and subsampling layers [35]. We can train either a massive 2-D visual database or a 1-D speech or ECG database with a proper training process by selecting appropriate convolutional and hidden layers. The proposed classifier contains two independent 1-D CNN architectures. Both architectures are summarized in Table 1.

For the ECG-based biometric identification problem, both 1-D CNN-based [36–38] and 2-D CNN-based [21,22,39] models have been proposed in the literature in recent years. Although all these works in which the ECG signal is used as a single modality have high biometric identification performance, 2-D CNN-based models provide slightly better identification performance than 1-D CNN-based models. However, 1-D CNN-based models have less computational complexity than 2-D CNN-based models [35]: when an NxN image is convolved with a KxK kernel in the two-dimensional convolution operation, the computational complexity is O(N²K²), whereas it is O(NK) in the one-dimensional convolution operation [35]. Therefore, 1-D CNN-based models can be preferred in low-cost and real-time applications.


In this work, we have developed two different 1-D CNN models for the biometric identification and verification problem based on fingertip ECG and speech signals. The first one is proposed for ECG-based biometric identification. It contains one input layer with a size of 1x485 and four convolutional layers that use ReLU (Rectified Linear Unit) neurons. The ReLU activation function is used to mitigate the vanishing gradient problem, which leads to very small weight updates and affects the speed of learning. At each convolutional layer, the stride is specified as 1, and zero padding is used to ensure the same input and output dimensions. The input of each convolutional layer is convolved using filters with kernel sizes of 1x3, 1x5, 1x7, and 1x9, respectively. The number of kernels is set to 2, 4, 8, and 16, respectively. After each convolutional layer, a batch normalization layer is utilized to stabilize the learning process and reduce the number of epochs required to train the networks. After the last convolutional layer, one max-pooling layer of size 1x2 is used to reduce both the feature space and the computational cost. Then, the output of this layer is flattened and passed through two fully connected layers with sizes 1x(485*N) and 1xN, where N represents the number of people. Finally, the Softmax layer is utilized to perform the probabilistic decision between the trained classes.

The second one is proposed for speech-based biometric identification. It is similar to the first one except for the kernel sizes, the number of max-pooling layers, and the number of kernels. The sizes of the kernels and the number of kernels are 1x9, 1x12, 1x15, 1x18, and 8, 16, 32, 64, respectively.

In the training process for both CNN architectures, the stochastic gradient descent method has been applied by setting the algorithm's learning rate, momentum, and mini-batch size to 0.01, 0.9, and 128, respectively. The training process is terminated after 40 epochs for the ECG modality, whereas after 50 epochs for the speech modality.
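As a concrete reference, the ECG branch of Table 1 can be written as follows. This is a PyTorch sketch of our reading of the architecture (the original was implemented in Matlab); the softmax is left to the loss function, as is idiomatic in PyTorch.

import torch
import torch.nn as nn

def ecg_branch(num_classes):
    # 1-D CNN for the 485-sample P-QRS-T windows (Table 1, ECG column).
    # Zero padding with stride 1 keeps the length at 485 through the convs.
    return nn.Sequential(
        nn.Conv1d(1, 2, kernel_size=3, padding=1),
        nn.BatchNorm1d(2), nn.ReLU(),
        nn.Conv1d(2, 4, kernel_size=5, padding=2),
        nn.BatchNorm1d(4), nn.ReLU(),
        nn.Conv1d(4, 8, kernel_size=7, padding=3),
        nn.BatchNorm1d(8), nn.ReLU(),
        nn.Conv1d(8, 16, kernel_size=9, padding=4),
        nn.BatchNorm1d(16), nn.ReLU(),
        nn.MaxPool1d(2),                        # 485 -> 242 samples
        nn.Flatten(),                           # 16 * 242 = 3872 features
        nn.Linear(16 * 242, 485 * num_classes),
        nn.Linear(485 * num_classes, num_classes),
        # CrossEntropyLoss applies the softmax of Table 1 during training.
    )

# Training setup from the text: SGD with momentum, batch size 128.
optimizer = torch.optim.SGD(ecg_branch(90).parameters(), lr=0.01, momentum=0.9)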

4. Implementation of the proposed system

In the proposed method (shown in Fig. 4), both the ECG and speech signals are assumed to be recorded by an acquisition system with both a microphone and an instrumentation amplifier. In the data acquisition stage, filters were applied to both signals to eliminate baseline wander, muscle noise, power-line noise, and unwanted frequency components. The minimum framing time was considered to be 10 seconds for the testing phase; no time constraints were set on the signals to be trained in our experiments. Half of the speech signals were assigned to the testing folder and the other half to the training folder. ECG signals recorded in the first week/month were assigned to the testing folder. On the other hand, the second records, taken a week/month later, were assigned to the training folder. Then, both ECG and speech signals were passed through the pre-processing stage: inconsistent beats and spikes were filtered out of the ECG signal, whereas the speech signals were separated from their silent parts. After that, MFCC features were extracted from the speech signals, and P-QRS-T features were extracted from the ECG signals. Then, the VQ algorithm was applied to the P-QRS-T features in the training folder, and 16 significant P-QRS-T features were found. For the speech signals in the training and testing folders, all the MFCC features found in 10 seconds were passed through VQ, and 32 significant MFCC features were extracted. Two 1-D CNN classifiers were trained using the features in the training folder. Then, the features from the testing folder were given as input to the CNN algorithm to obtain probabilistic scores; the higher the score, the higher the chance of the corresponding class ID. The number of scores we get is related to the number of classes we trained. If there were no class for rejecting unauthorized users, the proposed system would assign them to one of the known (genuine) classes. Therefore, class ID 0 is defined as an imposter rejection class, and the proposed system assigns unauthorized users to this class. This newly appointed class can also be called the Universal Background Model (UBM). When the ECG threshold value (Th_ECG) is higher than zero, we add a class to which the unauthorized users are rejected for the ECG-based identification algorithm. The same process is also applied for the speech-based identification algorithm when the value of the speech threshold (Th_Speech) is higher than zero. The ECG UBM and speech UBM were also passed through the same process as the genuine classes in the training stage, up to the vector quantization. For the ECG UBM, we found 1 significant P-QRS-T feature for each ECG signal in the UBM database, consisting of 152 people. For the speech UBM, we found 32 significant MFCC features for every person's speech signal, and a total of 4320 significant MFCC features were found for 135 people. After that, both 1-D CNNs were trained using the features of both the genuine classes and the additional UBM class (Class ID 0).

Fig. 4. Block diagram of the proposed biometric recognition system.

Two main experiments have been done in the evaluation of the proposed system. One of them was made with the threshold value set to zero, and the other by taking the threshold value at the point where the False Acceptance Rate (FAR, imposter) and the False Rejection Rate (FRR, genuine) intersect.


If the maximum value among the scores obtained from the classifiers is not higher than the threshold value, the class ID is given as '0', which means "reject the feature". If the maximum value among the scores is higher than the threshold value, the class that gets the highest score is appointed as the class ID. In a given 10 seconds, we accept that every person has 11 P-QRS-T features and 1428 MFCC features. The 1428 MFCC features are passed through vector quantization, 32 significant MFCC features are extracted, and these are given to the system together with the 11 P-QRS-T features. The class IDs found from these features are then passed through the voting algorithm, in which the most repeated ID is assigned as the classifier's outcome. When the outcome class ID is '0', the classifier rejects the user from the system. Otherwise, it accepts the user and gives the most repeated ID obtained from the CNN classifiers.

We have also integrated a soft score fusion algorithm, a commonly used method in the literature, into our system. This time, before the soft score fusion, we applied min-max normalization to the scores obtained by each classifier separately. Then we averaged the scores obtained from each classifier separately to create two independent score vectors corresponding to ECG and speech. Finally, we combined these two score vectors according to algorithms such as the Weighted Sum, Minimum t-norm, Product t-norm, Hamacher product t-norm, and Nilpotent minimum t-norm, and applied the same decision rule we created for the voting fusion.

The proposed system works like a traditional identification system if the values of Th_ECG and Th_Speech are zero and there are multiple classes containing only authorized users. The proposed system works like a traditional verification system if the values of Th_ECG and Th_Speech are higher than zero and there is one genuine class and multiple imposter classes in the datasets. Unlike traditional verification and identification systems, our system combines the benefits of both, so that it rejects unauthorized users even when there are multiple (genuine) users in the system.
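A compact sketch of both decision rules described above — threshold-based rejection with majority voting, and soft score fusion with a selectable t-norm. The function names and the per-frame score-matrix interface are our assumptions for illustration.

import numpy as np

def vote_decision(ecg_scores, speech_scores, th_ecg=0.0, th_speech=0.0):
    # Rows: per-feature softmax vectors (11 P-QRS-T + 32 MFCC per 10 s).
    # Column 0 is the UBM/rejection class, columns 1..N the genuine classes.
    votes = []
    for scores, th in ((ecg_scores, th_ecg), (speech_scores, th_speech)):
        for row in scores:
            best = int(np.argmax(row))
            votes.append(best if row[best] > th else 0)  # below th -> reject
    winner = int(np.bincount(np.array(votes)).argmax())
    return winner or None             # None: unauthorized user, rejected

def soft_fusion(ecg_avg, speech_avg, method='weighted', w=0.5):
    # ecg_avg / speech_avg: min-max-normalized average score per class.
    if method == 'weighted':          # Weighted Sum
        return w * ecg_avg + (1 - w) * speech_avg
    if method == 'product':           # Product t-norm
        return ecg_avg * speech_avg
    if method == 'minimum':           # Minimum t-norm
        return np.minimum(ecg_avg, speech_avg)
    if method == 'nilpotent':         # Nilpotent minimum t-norm
        m = np.minimum(ecg_avg, speech_avg)
        return np.where(ecg_avg + speech_avg > 1.0, m, 0.0)
    if method == 'hamacher':          # Hamacher product t-norm
        s = ecg_avg + speech_avg - ecg_avg * speech_avg
        return np.divide(ecg_avg * speech_avg, s,
                         out=np.zeros_like(s), where=s > 0)
    raise ValueError(method)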
5. Experimental results

5.1. Database and hardware properties

We have combined two fingertip ECG databases. The first one, the CYBHi database, was constructed by Hugo da Silva et al. [40], whereas the second is a newly constructed database recorded by our developed device, shown in Fig. 5 [11,41]. ECG signals were extracted from people's thumbs at a 1 kHz sampling rate in both databases. The recording sessions took place in two different weeks, with 1–5 minutes of recording time, for our database, and in two different months, with 2 minutes of recording time, for the CYBHi database. The signals taken in the first week or month were set aside for training the CNN algorithm, whereas the signals taken in the second week or month were used for testing the algorithm's performance. We augmented our database, which consists of the ECG signals of 58 people, with the CYBHi database, which consists of the ECG signals of 63 people. After the ECG spike and inconsistent beat algorithm, the total number of subjects decreased to 116 people, each having at least 10 seconds of data for training and 10 seconds of data for testing the proposed method. The data of 5 people were automatically erased from the database because they did not have enough records or were not suitable for the identification system; that is, there were not enough records in which the subjects' movements were stable enough not to affect the ECG signal. For constructing the ECG-based UBM, we used 152 people from the combination of the PTB QT and MIT-BIH Arrhythmia databases. The role of the UBM is essential because, when there is a single class to train, it makes it possible to reject unauthorized users from the system.

Fig. 5. Our developed device to construct the fingertip ECG database.

In our experiments, the speech database constructed by the LibriSpeech Corpus has been used [42]. The "train-clean-100" dataset, containing 251 speakers with 25 minutes of recording time, was selected. Speakers in the database read randomly selected chapters from different kinds of books in English. The probability of people reading the exact same text is low in this database; therefore, it is a convenient database to use in a text-independent identification system. All the data in the database were quantized with 16 bits at a sampling rate of 16 kHz. We randomly selected the speech signals of 116 people and downsampled them to 8 kHz. The speech signals of the remaining 135 people were selected for constructing the UBM.

The proposed biometric identification method, described in Section 4, was developed in Matlab 2018a on an Intel Core i7-3770 3.4 GHz processor and tested on the fingertip ECG and speech signals. The CNN training process for 90 people lasted approximately 20 minutes for ECG-based identification, whereas it took 37 minutes for speech-based identification. Thus, the training process of the fusion system took about one hour.

5.2. Experiments

In the first experiment, we used the speech and ECG signals of 30, 45, 60, 75, and 90 people, respectively. The subjects were divided into three folds, where in each fold the speech and ECG data were arranged according to the following rules. In the first fold, the ECG records from the subjects' first week or month were used for the training process, and the records from the second week or month were used for the testing process; the first half of the speech signals was used for training and the second half for testing. In the second fold, the test dataset was switched with the training dataset. In the third fold, the ECG and speech records in the testing and training folders were combined and selected in random order for the training and testing processes. In Table 2, the number of P-QRS-T and MFCC features in each fold is given with respect to the number of subjects.


Table 2
Properties of the test and train sets.

                             Test set                           Train set
# of      # of       # of         # of significant    # of significant    # of significant
folds     subjects   P-QRS-T      MFCC                P-QRS-T             MFCC
Fold-1    30         1841         27827               480                 25984
Fold-1    45         2715         35424               720                 39168
Fold-1    60         3868         48032               960                 53120
Fold-1    75         5265         61280               1200                67008
Fold-1    90         6292         74144               1440                81024
Fold-2    90         5783         81024               1440                74144
Fold-3    90         6037         75302               1440                82278

Table 3
Average accuracy (%) of 3-fold cross-validation.

          30 people   45 people   60 people   75 people   90 people
ECG       95.85       94.38       89.96       90.43       90.22
Speech    99.66       98.43       98.00       98.67       97.94
Fusion    100.0       100.0       100.0       99.83       99.92

Fig. 6. Detection Error Tradeoff.

Table 4
Classification performance of the proposed method for each fold.

ECG-based identification system
# of folds   # of subjects   Accuracy (%)   Sensitivity (%)   Specificity (%)   Fscore (%)
Fold-1       90              88.5           99.8              86.7              91.7
Fold-2       90              85.3           99.7              84.2              89.5
Fold-3       90              96.7           99.9              95.5              97.6

Speech-based identification system
# of folds   # of subjects   Accuracy (%)   Sensitivity (%)   Specificity (%)   Fscore (%)
Fold-1       90              97.5           99.9              97.6              98.7
Fold-2       90              96.6           99.9              96.4              98.1
Fold-3       90              99.5           99.9              99.3              99.6

The proposed fusion-based identification system
# of folds   # of subjects   Accuracy (%)   Sensitivity (%)   Specificity (%)   Fscore (%)
Fold-1       90              100            100               100               100
Fold-2       90              100            100               100               100
Fold-3       90              99.9           99.9              99.7              99.8

Fig. 7. FRR vs FAR for the ECG-based identification system.

Table 5
Identification accuracy for different score fusion algorithms.

Fusion method                          Accuracy (%)
Weighted sum (ECG=0.5, speech=0.5)     98.77
Product t-norm                         97.75
Minimum t-norm                         96.52
Weighted sum (ECG=0.3, speech=0.7)     99.80
Nilpotent t-norm                       87.73
Hamacher t-norm                        96.93

Fig. 8. FRR vs FAR for speech-based identification system.

The accuracy differences between the folds in Table 4 stem from changes in the statistical characteristics of the ECG signals recorded in two different weeks, which we verified with a statistical significance test over each fold. Taking the alpha value as 0.01, we found that the statistical characteristics of 16 people out of the 90 in the first two folds had changed. Besides, in the third fold, we found that the statistical characteristics of 7 people out of 90 had changed. This number was lower than in the first two folds because, in the third fold, we combined the ECG signals from the two different weeks and selected them randomly for the training and testing stages. The test showed that recordings taken in two different weeks have a strongly significant effect on the system's performance.

In Fig. 6, the Detection Error Tradeoff (DET) curve of the classifier for 90 people in the first fold is given. It is determined by calculating the average score of each class. The equal error rate was found to be 9.2% for the ECG-based system and 6.6% for the speech-based system, whereas it was 1.6% for the fusion of both systems.

In the second experiment, the speech and ECG signals of 26 people were set as imposter classes, and 90 people were set as genuine classes. We trained the system with the 90 genuine people and


used both the imposter and genuine people for testing the system. In the performance evaluation, the ECG and speech (rejection) thresholds were calculated by finding the intersection of the FRR (genuine) and FAR (imposter) curves, where both of them are minimized (shown in Fig. 7 and Fig. 8). The thresholds at the intersection points were found to be 0.815 for the ECG (rejection) threshold and 0.53 for the speech (rejection) threshold. After that, the performance of the system was evaluated using the obtained rejection thresholds, and the results are given in Table 6.

Table 6
Accuracy rates of the system tested with 90 genuine and 26 imposter people.

Proposed system   # of subjects   Genuine accuracy (%)   Imposter accuracy (%)   Equal error rate (%)
ECG               90              71.08                  71.05                   28.92
Speech            90              86.48                  87.82                   12.95
Fusion            90              91.68                  96.05                   6.82

Decreasing the threshold provides convenience, whereas increasing it provides better security; however, it makes users take multiple attempts to be accepted by the system. It must be chosen wisely, taking into account the application at hand.

In Table 6, the Genuine Accuracy and Imposter Accuracy were found by the following equations:

Genuine Accuracy = (Number of genuine classes correctly identified) / (Number of genuine attempts)   (4)

Imposter Accuracy = (Number of imposter classes correctly rejected) / (Number of imposter attempts)   (5)

In Fig. 7 and Fig. 8, the False Acceptance Rate (FAR) and False Rejection Rate (FRR) were found by the following equations:

FAR = (Number of incorrectly accepted access attempts by unauthorized users) / (Number of unauthorized users' attempts)   (6)

FRR = (Number of false rejections of genuine users) / (Number of genuine users' attempts)   (7)
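The intersection search behind Eqs. (6)-(7) can be expressed in a few lines. A minimal sketch, assuming per-attempt maximum softmax scores for the genuine and imposter users:

import numpy as np

def rejection_threshold(genuine_scores, imposter_scores):
    # Sweep candidate thresholds and return the FAR/FRR crossing point,
    # i.e., the operating point used for Th_ECG and Th_Speech.
    ths = np.linspace(0.0, 1.0, 1001)
    far = np.array([(imposter_scores >= t).mean() for t in ths])  # Eq. (6)
    frr = np.array([(genuine_scores < t).mean() for t in ths])    # Eq. (7)
    i = int(np.argmin(np.abs(far - frr)))
    return ths[i], (far[i] + frr[i]) / 2      # threshold, approx. EER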
In Fig. 7, it can easily be seen that the system accepted all the imposter users when the value of Th_ECG was low. Increasing the value of Th_ECG improved the imposter rejection accuracy but decreased the accuracy of correctly identifying genuine users. The same held for the speech signal (shown in Fig. 8). However, it differs at one point: enlarging the speech UBM dataset improved the imposter rejection accuracy significantly even while the value of Th_Speech was still low. In our research, we increased the speech UBM dataset to 250 people and saw that it decreased the FAR (imposter) to 42% while Th_Speech was at '0', without affecting the genuine accuracy. This indicated that increasing the speech UBM dataset significantly improves the imposter rejection performance. The same cannot be said for the ECG system: when we increased the ECG UBM dataset to 550 people, it only decreased the FAR (imposter) to 90% while Th_ECG was at '0', and it also decreased the accuracy for genuine users. Therefore, we suggest that, rather than increasing the ECG UBM dataset, the ECG UBM should be split into multiple (rejection) classes, which does not affect the genuine accuracy rate.

Tables 7 and 8 show the accuracy rates of the system with respect to changes in the learning rate and batch size of the CNN, respectively. The CNN was optimized according to our classification problem, and the hyperparameters used in its training were fine-tuned by performance comparison [43]. For our application, we concluded that stochastic gradient descent with momentum (SGDM) was a better optimizer than ADAM and RMSProp due to its performance and learning speed. The SGDM optimizer started to learn the model with a minimum learning rate of 0.02, whereas ADAM and RMSProp started to learn the model with a minimum learning rate of 0.005. Learning the model was completed in approximately 72 minutes with ADAM and RMSProp, whereas it took 57 minutes with SGDM. On the condition of increasing the imposter rejection accuracy of the fusion system, the optimum values for the learning rate and batch size of the CNN were found to be 0.01 and 128, respectively.

In the third experiment, we increased the ECG dataset by adding to our system the remaining ECG signals measured at the palm in the CYBHi database [40] and the ECG signals measured at the arm in the ECG-ID database [44]. Our ECG database was thereby increased to 226 people; 176 randomly selected people were used as genuine classes, whereas 50 people were used as imposter classes. As for the speech database, we selected 361 people from the "train-clean-360" dataset of LibriSpeech, which consists of 921 people. We randomly selected 176 people for the genuine classes and 50 people for the imposter classes, whereas the remaining 135 people were used for constructing the UBM. The experiment was conducted using the previous (rejection) thresholds, and the results were compared with the new threshold values found from the intersection of FRR and FAR. The results given in Table 9 show that the (rejection) threshold values change whenever the dataset is enlarged, so a suitable decision rule must be set in order to balance the imposter rejection and genuine acceptance accuracies.

In the fourth experiment, the effect of the time constraint was examined on 90 people's fingertip ECG and speech data in the first fold. Using the databases from the previous experiments, we changed the response time of the system to 1, 3, 5, 10, 20, and 60 seconds, respectively. If the length of the ECG or speech signals of some people did not meet the required time constraint, we evaluated the system with the maximum signal length each of them had. The experiment was conducted for each class trained with sufficient features without time constraints. In the performance evaluation, we tested the system by giving the features to the system within a specific time range, and the imposter rejection and genuine identification accuracy rates are given in Table 10 and Table 11.

In the fifth experiment, we exchanged the LibriSpeech database for the RedDots database [45], which had 62 speakers, including 49 male and 13 female speakers from 21 countries. The records were taken through mobile crowd-sourcing to benefit from a potentially wider population and greater diversity. The language was also English; however, the dataset was constructed to increase the difficulty of the identification task: recordings had background noises such as the sound of a musical instrument, mouse-clicking sounds, or the voice of crowds in the background, or had speakers whose pronunciation of the spoken language was poor. The database was used to simulate the performance of the proposed algorithm on speech signals recorded in noisy environments, and the results are given in Table 12.

In the final experiment, 1, 3, and 5 people were randomly selected as genuine classes, whereas the remaining classes among the 116 people were selected as imposter classes and used for training the proposed system. In addition to varying the number of genuine classes, we exchanged the person/people in the genuine class/classes three times from the database. Then we conducted a test to evaluate the average genuine identification accuracy and the average imposter rejection accuracy of the proposed system, and the results are given in Table 13.


Table 7
Accuracy rates of the system tested with 90 genuine and 26 imposter people by changing the learning rate of the CNN (batch size 128).

                Genuine accuracy (%)                 Imposter accuracy (%)
Learning rate   0.01    0.005   0.001   0.0005       0.01    0.005   0.001   0.0005
ECG             71.09   66.07   70.50   72.87        71.05   76.32   73.03   73.03
Speech          86.48   87.57   85.66   86.18        87.82   88.71   87.82   87.07
Fusion          91.68   93.47   89.90   91.49        96.05   93.42   93.42   93.42

Table 8
Accuracy rates of the system tested with 90 genuine and 26 imposter people by changing the batch size of the CNN (learning rate 0.01).

             Genuine accuracy (%)               Imposter accuracy (%)
Batch size   64      128     256     512        64      128     256     512
ECG          70.10   71.09   76.04   73.66      71.05   71.05   75.66   73.68
Speech       84.14   86.48   87.70   88.17      85.14   87.82   88.11   87.37
Fusion       91.09   91.68   91.29   95.25      96.05   96.05   90.13   94.74

Table 9
Accuracy rates of the system tested with 176 genuine and 50 imposter people.

                  Previous rejection thresholds          New rejection thresholds
Proposed system   Genuine acc. (%)   Imposter acc. (%)   Genuine acc. (%)   Imposter acc. (%)   Equal error rate (%)
ECG               78.74              60.31               73.24              72.76               27.03
Speech            59.85              99.09               85.83              88.56               13.09
Fusion            74.61              100.0               89.54              96.10               10.30

Table 10
Identification performance (%) for genuine people in a given response time.

          1 sec   3 sec   5 sec   10 sec   20 sec   60 sec
ECG       62.07   64.62   66.75   71.08    71.98    72.00
Speech    55.09   71.38   79.28   86.48    89.95    90.34
Fusion    63.15   72.82   82.14   91.68    90.69    93.89

Table 11
Rejection performance (%) for imposter people in a given response time.

          1 sec   3 sec   5 sec   10 sec   20 sec   60 sec
ECG       64.22   66.78   68.56   71.05    72.72    75.00
Speech    55.44   70.71   79.98   87.81    89.29    90.09
Fusion    69.07   85.66   92.63   96.05    90.69    93.89

Table 12
Genuine and imposter accuracy rates for the RedDots speech database.

Proposed system   # of genuine classes   # of imposter classes   Genuine accuracy (%)   Imposter accuracy (%)   Equal error rate (%)
ECG               48                     14                      75.51                  77.64                   23.93
Speech            48                     14                      67.10                  67.34                   32.87
Fusion            48                     14                      73.52                  85.36                   26.05

Table 13
Accuracy rates of the proposed system when few people are registered in the system.

Proposed system   # of genuine classes   # of imposter classes   Genuine accuracy (%)   Imposter accuracy (%)
ECG               1                      115                     100.0                  90.67
ECG               3                      113                     81.37                  87.07
ECG               5                      111                     70.93                  70.93
Speech            1                      115                     100.0                  99.60
Speech            3                      113                     99.40                  98.47
Speech            5                      111                     98.53                  98.37
Fusion            1                      115                     100.0                  99.93
Fusion            3                      113                     100.0                  98.93
Fusion            5                      111                     100.0                  98.63

6. Discussion

To the best of our knowledge, there is no research work in the literature which combines the ECG and speech modalities for biometric identification purposes. Therefore, we discussed the identification results of the relevant research in that area with respect to each modality. Additionally, we compared the results of our experimental work with the relevant research.

Although the results obtained in Tables 3 and 4 did not outperform the state-of-the-art recent research (shown in Table 14) in identification using ECG signals, the system offers solutions for the identification of multiple users in a real-time application. The obtained identification results can also be considered relatively moderate when we take into account the number of subjects and the response time of the system. We offered users the choice of changing the identification system into a verification system in a multiuser environment.

The results in Table 6 show that the identification rate of the system decreases by approximately 20% when the system is set up for verification. This decrease in identification accuracy is the major drawback of the proposed algorithm. Users can decide how accurately identifying authorized people, or how accurately rejecting unauthorized people, is needed for their application. For this reason, they can increase the identification performance for authorized people by decreasing the ECG threshold value. Fig. 7 shows the system sensitivity versus user convenience while the ECG thresholds are changing.

Additionally, we offer users a choice of changing the system's response time, while pointing out that this can decrease the identification rate. It is one of the advantages of our system, because the majority of the research did not point out how fast their systems identify a person. The identification performances corresponding to the system response times are shown in Tables 10 and 11. It can be seen that the more features were given to the system, the more the identification performance increased. Table 13 shows that the system can work with only one authorized person. We can see from the table that sharp changes in identification accuracy occur when the number of unauthorized people increases. It can be interpreted from the table that the features of an ECG signal are not as distinctive as the features of a speech signal when the system is set up for verification, since, in the speech identification system, increasing the number of unauthorized test subjects does not decrease the identification performance of authorized people.
fication performance of authorized people.
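The response-time behavior in Tables 10 and 11 can be read as the effect of aggregating per-frame decisions over longer observation windows. The sketch below illustrates this with a simple majority vote, assuming one predicted identity per one-second frame; the frame granularity and the voting rule are assumptions for illustration, not the paper's exact aggregation scheme.

```python
from collections import Counter

def identify_over_window(frame_labels, window_seconds):
    """Majority vote over the per-frame predictions seen so far.

    frame_labels holds one predicted identity per one-second frame (assumed
    rate). Longer windows average out noisy frames, which is consistent with
    the accuracies in Tables 10 and 11 generally rising with response time.
    """
    window = frame_labels[:window_seconds]
    label, count = Counter(window).most_common(1)[0]
    return label, count / len(window)              # identity and vote share

# Example: a noisy stream that stabilizes on subject 17 after a few seconds.
stream = [4, 17, 17, 9, 17, 17, 17, 17, 2, 17]
for w in (1, 3, 5, 10):
    print(w, identify_over_window(stream, w))
```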
Table 14
Comparison of the biometric identification systems in the literature.

Author                      Year   DB                        Modality        NS    Results (%)   RT
L. Biel et al. [15]         2001   Private                   ECG             20    IDR 98        NA
Z. Zhang et al. [17]        2006   NA                        ECG             60    IDR 85.3      10 s
W. Yongjin et al. [13]      2007   PTB                       ECG             13    IDR 74.45     1 HB
                                   MIT-BIH                                         IDR 74.95
Z. Zhao et al. [12]         2011   QT                        ECG             20    IDR 95.3      NA
H. Yu et al. [24]           2015   TIMIT                     Speech          20    IDR 97.6      ∼9 s
N. Almaadeed et al. [25]    2015   Grid                      Speech          34    IDR 97.5      NA
N. Aboelenien et al. [26]   2016   CHAINS                    Speech          36    IDR 91        2 s
X. Lei et al. [18]          2016   PTB                       ECG             100   IDR 99.3      NA
L. Wieclaw et al. [19]      2017   Private                   Fingertip ECG   18    IDR 96        NA
H. Gürkan et al. [20]       2019   MIT-BIH Arr.              ECG             46    IDR 99.3      NA
A. Imran et al. [27]        2019   MOOC                      Speech          119   IDR 93.37     3 s
                                                                                   IDR 94.44     5 s
                                                                                   IDR 94.64     7 s
S. El-Moneim et al. [28]    2020   Chinese Mandarin Corpus   Speech          5     IDR 98.7      NA
R. Srivastva et al. [21]    2021   CYBHI                     Fingertip ECG   63    IDR 99.66     3 HB
Y. Zhang et al. [22]        2021   CYBHI                     Fingertip ECG   60    IDR 98.77     3 HB
The proposed method         2021   CYBHI + Private           Fingertip ECG   90    IDR 90.17     10 s
(ECG modality)
The proposed method         2021   LibriSpeech               Speech          90    IDR 97.87     10 s
(speech modality)
The proposed method         2021   LibriSpeech               Speech          90    IDR 99.97     10 s
(ECG & speech fusion)              CYBHI + Private           Fingertip ECG

DB: database; NS: number of subjects; IDR: identification rate; RT: response time, i.e., the frame length of the ECG or speech signals used while evaluating the system; Private: a database constructed by the authors; NA: information not available or computable; HB: heartbeat.
The proposed speech-based identification system outperformed the other works given in Table 14 in terms of accuracy, number of subjects, and identification performance. Although the accuracy rates decreased when the system was set up for verification, they did not decrease as much as those of our identification system that uses only ECG signals. In addition, the contribution of the speech signals to the performance of our proposed fusion-based identification system is relatively high. This, however, makes our fusion system vulnerable to spoofing attacks when the weight of the speech signals relative to the ECG signals is increased, which is a major drawback of relying on speech signals. In order to eliminate the drawbacks that arise from the ECG and speech modalities alone, we combined the two modalities to increase the performance of the overall system. Thus, while the speech signals increase identification performance, the ECG signals increase the confidence of the proposed system through their inherent liveness detection.
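To make this weighting trade-off concrete, a minimal sketch of a weighted vote between the two modalities is given below; the function name, the weights, and the rejection threshold are illustrative placeholders, not the exact rule used in the proposed system.

```python
def fuse_decisions(ecg_scores, speech_scores, w_ecg=0.5, w_speech=0.5,
                   reject_threshold=0.6):
    """Fuse per-identity scores from the ECG and speech subsystems.

    ecg_scores / speech_scores: dicts mapping enrolled identity -> score
    in [0, 1]. Raising w_speech improves identification but, as noted above,
    increases exposure to speech spoofing; raising w_ecg strengthens the
    liveness assurance provided by the ECG signal.
    """
    fused = {k: w_ecg * ecg_scores[k] + w_speech * speech_scores[k]
             for k in ecg_scores}
    best = max(fused, key=fused.get)
    if fused[best] < reject_threshold:
        return None                                # rejected as an imposter
    return best                                    # identified genuine user
```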
7. Conclusion

In this research, we introduced a fusion-based system that combines the ECG and speech modalities for identification and verification tasks in a single algorithm. The proposed fusion algorithm was developed to work in real-time security applications with multiple users. In the algorithm, we provided a solution for the degradation of fingertip ECG signals caused by the patient's movements. The proposed fusion method works on the principle of a voting scheme, examining the outcome of each independent system.

The first experiment showed that the proposed fusion-based system achieved a 100% accuracy rate for 90 people when there was no imposter rejection feature. The second experiment showed that our algorithm rejected the imposter classes with a 96.05% accuracy rate, while it accepted and identified the genuine classes with a 91.68% accuracy rate for 90 people. In the third experiment, we enlarged both the ECG and speech databases, evaluated the system performance with 176 genuine and 50 imposter classes, and achieved an 89.54% accuracy rate for the genuine classes and a 96.1% imposter rejection accuracy. In the fourth experiment, we compared both imposter and genuine class accuracy rates by changing the response time of the proposed system. In the fifth experiment, the speech dataset was exchanged for the RedDots database, giving 48 genuine classes and 14 imposter classes; in this case, the proposed method achieved 73.52% genuine identification accuracy and 85.36% imposter rejection accuracy. In the final experiment, we showed that the proposed system also works when only a few people are registered.

It should be noted that when the proposed system is set up as a verification system, it begins to reject unauthorized people; however, its identification accuracy for genuine people decreases. Also, we give users a choice of how much security or convenience is needed for the proposed system according to the application.

It is concluded that the performance of the proposed fusion system is higher than the performances of the other works that use a single type of modality.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research work was supported by the Coordination Office for Scientific Research Projects, FMV ISIK University (Project Number: 14A203) and the Scientific Research Projects Unit, Bursa Technical University (Project Number: 181N14).

References

[1] S. Chauhan, A. Arora, A. Kaul, A survey of emerging biometric modalities, Proc. Comput. Sci. 2 (2010) 213–218, https://doi.org/10.1016/j.procs.2010.11.027.
[2] A. Fratini, M. Sansone, P. Bifulco, M. Cesarelli, Individual identification via electrocardiogram analysis, Biomed. Eng. Online 14 (1) (2015) 78, https://doi.org/10.1186/s12938-015-0072-y.
[3] J. Ribeiro Pinto, J. Cardoso, A. Lourenco, Evolution, current challenges, and future possibilities in ECG biometrics, IEEE Access 6 (2018) 34746–34776, https://doi.org/10.1109/ACCESS.2018.2849870.
[4] W. Wójcik, K. Gromaszek, M. Junisbekov, Face recognition: issues, methods and alternative applications, 2016, https://doi.org/10.5772/62950.
[5] M. Bassiouni, A machine learning technique for person identification using ECG signals, IOSR J. Appl. Phys. 1 (2016) 37.
[6] J. Arteaga-Falconi, H. Al Osman, A. El Saddik, ECG authentication for mobile devices, IEEE Trans. Instrum. Meas. 65 (2015) 1–10, https://doi.org/10.1109/TIM.2015.2503863.
[7] N. Samarin, D. Sannella, A key to your heart: biometric authentication based on ECG signals, arXiv:1906.09181.
[8] T.W. Shen, W.J. Tompkins, Y.H. Hu, One-lead ECG for identity verification, in: Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society, Engineering in Medicine and Biology, vol. 1, 2002, pp. 62–63.
[9] W.-H. Jung, S.-G. Lee, ECG identification based on non-fiducial feature extraction using window removal method, Appl. Sci. 7 (2017) 1205, https://doi.org/10.3390/app7111205.
[10] B. Noureddine, R. Fournier, A. Naït-ali, F. Reguig, A novel biometric authentication approach using ECG and EMG signals, J. Med. Eng. Technol. 39 (2015) 1–13, https://doi.org/10.3109/03091902.2015.1021429.
[11] G. Guven, H. Gürkan, U. Guz, Biometric identification using fingertip electrocardiogram signals, Signal Image Video Process. 12 (2018) 933–940, https://doi.org/10.1007/s11760-018-1238-4.
[12] Z. Zhao, L. Yang, ECG identification based on matching pursuit, in: 2011 4th International Conference on Biomedical Engineering and Informatics (BMEI), vol. 2, 2011, pp. 721–724.
[13] W. Yongjin, A. Foteini, D. Hatzinakos, K. Plataniotis, Analysis of human electrocardiogram for biometric recognition, EURASIP J. Adv. Signal Process. (2008), https://doi.org/10.1155/2008/148658.
[14] H. Gürkan, U. Guz, S. Yarman, A novel biometric authentication approach using electrocardiogram signals, in: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2013, pp. 4259–4262.
[15] L. Biel, O. Pettersson, L. Philipson, P. Wide, ECG analysis: a new approach in human identification, IEEE Trans. Instrum. Meas. 50 (3) (2001) 808–812.
[16] M. Kyoso, A. Uchiyama, Development of an ECG identification system, in: 2001 Conference Proceedings of the 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 4, 2001, pp. 3721–3723, https://doi.org/10.1109/IEMBS.2001.1019645.
[17] Z. Zhang, D. Wei, A new ECG identification method using Bayes' theorem, in: TENCON 2006 - 2006 IEEE Region 10 Conference, 2006, pp. 1–4.
[18] X. Lei, Y. Zhang, Z. Lu, Deep learning feature representation for electrocardiogram identification, in: 2016 IEEE International Conference on Digital Signal Processing (DSP), 2016, pp. 11–14.
[19] L. Wieclaw, Y. Khoma, P. Fałat, D. Sabodashko, V. Herasymenko, Biometric identification from raw ECG signal using deep learning techniques, in: 2017 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 1, 2017, pp. 129–133.
[20] H. Gürkan, A. Hanilci, ECG based biometric identification method using QRS images and convolutional neural network, Pamukkale Üniv. Müh. Bilim. Derg. 26 (2) (2020) 318–327.
[21] R. Srivastva, A. Singh, Y. Singh, PlexNet: a fast and robust ECG biometric system for human recognition, Inf. Sci. 558 (2021), https://doi.org/10.1016/j.ins.2021.01.001.
[22] Y. Zhang, Z. Zhao, D. Yanjun, X. Zhang, Y. Zhang, Human identification driven by deep CNN and transfer learning based on multiview feature representations of ECG, Biomed. Signal Process. Control 68 (2021) 102689, https://doi.org/10.1016/j.bspc.2021.102689.
[23] S. Marinov, Text dependent and text independent speaker verification systems: technology and applications, Högskolan i Skövde.
[24] H. Yu, Z. Ma, M. Li, J. Guo, Histogram transform model using MFCC features for text-independent speaker identification, in: Conference Record - Asilomar Conference on Signals, Systems and Computers, 2015, pp. 500–504.
[25] N. Almaadeed, A. Aggoun, A. Amira, Speaker identification using multimodal neural networks and wavelet analysis, IET Biometrics 4 (2015) 18–28, https://doi.org/10.1049/iet-bmt.2014.0011.
[26] N. Aboelenien, K. Amin, M. Ibrahim, M.M. Hadhoud, Improved text-independent speaker identification system for real time applications, in: Japan-Egypt International Conference on Electronics, Communications and Computers (JEC-ECC), 2016, pp. 58–62.
[27] A. Imran, Z. Kastrati, T. Svendsen, A. Kurti, Text-independent speaker ID employing 2D-CNN for automatic video lecture categorization in a MOOC setting, in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019, pp. 273–277.
[28] S. El-Moneim, M. Nassar, M. Dessouky, N. Ismail, A. El-Fishawy, F. Abd El-Samie, Text-independent speaker recognition using LSTM-RNN and speech enhancement, Multimed. Tools Appl. 79 (2020) 24013–24028, https://doi.org/10.1007/s11042-019-08293-7.
[29] J. Pan, W.J. Tompkins, A real-time QRS detection algorithm, IEEE Trans. Biomed. Eng. 32 (1985) 230–236.
[30] J. Sohn, S. Kim, W. Sung, A statistical model based voice activity detector, IEEE Signal Process. Lett. 6 (1999) 1–3, https://doi.org/10.1109/97.736233.
[31] S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. (1980) 357–366.
[32] S. Chakroborty, G. Saha, Feature selection using singular value decomposition and QR factorization with column pivoting for text-independent speaker identification, Speech Commun. 52 (9) (2010) 693–709, https://doi.org/10.1016/j.specom.2010.04.002.
[33] N. Almaadeed, A. Aggoun, A. Amira, Text-independent speaker identification using vowel formants, J. Signal Process. Syst. 82 (3) (2016) 345–356, https://doi.org/10.1007/s11265-015-1005-5.
[34] M. Xu, L.-Y. Duan, J. Cai, L.-T. Chia, C. Xu, Q. Tian, HMM-based audio keyword generation, in: K. Aizawa, Y. Nakamura, S. Satoh (Eds.), Advances in Multimedia Information Processing - PCM 2004, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 566–574.
[35] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, D.J. Inman, 1D convolutional neural networks and applications: a survey, Mech. Syst. Signal Process. 151 (2021) 107398, https://doi.org/10.1016/j.ymssp.2020.107398.
[36] Q. Zhang, D. Zhou, X. Zeng, HeartID: a multiresolution convolutional neural network for ECG-based biometric human identification in smart health applications, IEEE Access 5 (2017) 11805–11816, https://doi.org/10.1109/ACCESS.2017.2707460.
[37] E.J. da Silva Luz, G.J.P. Moreira, L.S. Oliveira, W.R. Schwartz, D. Menotti, Learning deep off-the-person heart biometrics representations, IEEE Trans. Inf. Forensics Secur. 13 (5) (2018) 1258–1270, https://doi.org/10.1109/TIFS.2017.2784362.
[38] R. Donida Labati, E. Muñoz, V. Piuri, R. Sassi, F. Scotti, Deep-ECG: convolutional neural networks for ECG biometric recognition, Pattern Recognit. Lett. 126 (2019) 78–85, Robustness, Security and Regulation Aspects in Current Biometric Systems.
[39] N. Bento, D. Belo, H. Gamboa, ECG biometrics using spectrograms and deep neural networks, Int. J. Mach. Learn. Comput. 10 (2020) 259–264.
[40] H. Plácido da Silva, A. Lourenco, A. Fred, N. Raposo, M. Aires-de Sousa, Check your biosignals here: a new dataset for off-the-person ECG biometrics, Comput. Methods Programs Biomed. 113 (2014) 503–514, https://doi.org/10.1016/j.cmpb.2013.11.017.
[41] G. Guven, Fingertip ECG signal based biometric recognition system, Master's Thesis, FMV ISIK University, 2016 (Supervisor: Associate Professor Hakan Gürkan, Co-supervisor: Associate Professor Umit Guz).
[42] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, 2015, pp. 5206–5210.
[43] M.A. Ozdemir, O.K. Cura, A. Akan, Epileptic EEG classification by using time-frequency images for deep learning, Int. J. Neural Syst. 31 (08) (2021) 2150026, https://doi.org/10.1142/S012906572150026X.
[44] A.L. Goldberger, L.A.N. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.-K. Peng, H.E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) e215–e220, https://doi.org/10.1161/01.CIR.101.23.e215.
[45] K.A. Lee, A. Larcher, H. Aronoitz, G. Wang, P. Kenny, The RedDots challenge: towards characterizing speakers from short utterances, in: Interspeech 2016, San Francisco, 2016, https://sites.google.com/site/thereddotsproject/reddots-challenge.

Gokhan Guven graduated from Isik University, Engineering Faculty, Department of Electrical-Electronics Engineering, Istanbul, Turkey, in 2014. He received the M.S. degree in Electronics Engineering from Isik University, Graduate School of Science, Istanbul, Turkey, in 2016. He has been pursuing the Ph.D. degree in Electronics Engineering at Isik University, School of Graduate Studies, since 2016. He was a research and teaching assistant at Isik University, Engineering Faculty, Department of Electrical-Electronics Engineering between 2014 and 2017. He has been working as a senior researcher at the Department of Information Technologies, The Scientific and Technological Research Council of Turkey (TUBITAK) since 2017. His research areas include speech processing and bio-signal processing.
Umit Guz received the B.S. degree in Electronics Engineering from Istanbul University, College of Engineering, Turkey, in 1994, and the M.S. and Ph.D. degrees in Electronics Engineering from the Institute of Science, Istanbul University, Turkey, in 1997 and 2002, respectively. He was awarded a post-doctoral research fellowship by the Scientific and Technological Research Council of Turkey (TUBITAK) in 2006. He was accepted as an international research fellow by the SRI (Stanford Research Institute)-International, Speech Technology and Research (STAR) Laboratory, Menlo Park, CA, USA, in 2006. He was awarded a J. William Fulbright post-doctoral research fellowship, USA, in 2007. He was accepted as an international research fellow by the International Computer Science Institute (ICSI), Speech Group, at the University of California at Berkeley, Berkeley, CA, USA, in 2007 and 2008. He worked as an Assistant Professor and an Associate Professor in the Department of Electrical-Electronics Engineering, Engineering Faculty at Isik University, Istanbul, from 2008 to 2013 and from 2013 to 2019, respectively. He has been a full-time professor in the Department of Electrical-Electronics Engineering, Faculty of Engineering and Natural Sciences at Isik University, Sile, Istanbul, Turkey, since 2019. His research interests cover speech processing, automatic speech recognition, natural language processing, machine learning, and bio-signal processing.

Hakan Gürkan received the B.S., M.S., and Ph.D. degrees in Electronics and Communication Engineering from Istanbul Technical University, Turkey, in 1994, 1998, and 2005, respectively. He was a Research Assistant in the Department of Electronics Engineering, Engineering Faculty, Isik University, Istanbul, Turkey, from 1998 to 2005. He then worked as an Assistant Professor and an Associate Professor in the Department of Electrical-Electronics Engineering, Engineering Faculty at Isik University, Istanbul, from 2009 to 2014 and from 2014 to 2017, respectively. In 2018, he joined the Electrical-Electronics Engineering Department at Bursa Technical University, where he is currently working as a Professor; he is also head of the Department of Electrical-Electronics Engineering at Bursa Technical University. His main research areas include biometric identification, machine learning, pattern recognition, and biomedical and speech signal modeling, representation, and compression.