
Article

Alzheimer’s Dementia Speech (audio vs text): Multi-model Machine Learning at High vs Low Resolution

Prachee Priyadarshinee 1,*, Christopher Johann Clarke 1, Jan Melechovsky 1, Cindy Ming Ying Lin 2, Balamurali B T 1, Jer-Ming Chen 1

1 Singapore University of Technology and Design, Singapore; (prachee, balamurali_bt, jerming_chen)@sutd.edu.sg, (christopher_clarke, jan_melechovsky)@mymail.sutd.edu.sg
2 [email protected]
* Correspondence: [email protected]

Abstract: We investigate a multi-modal approach (audio and text) for the automatic detection of Alzheimer’s Dementia from recordings of spontaneous speech. Sixteen features, including four feature extraction methods not previously applied in such contexts, were utilized and tested to determine their relative performance. These features encompass two modalities (audio vs text) at two resolution scales (frame-level vs file-level). We compared the accuracy of these features and found that text-based classification outperformed audio-based classification, with the best performance attaining 88.7%, surpassing other reports to date relying on the same dataset. For text-based classification in particular, the best file-level feature performed 9.8% better than the frame-level feature. Conversely, for audio-based classification, the best frame-level feature performed 1.4% better than the best file-level feature. This multi-modal multi-model comparison at high and low resolution offers insights into which approach is most efficacious, depending on the context.

Keywords: Alzheimer’s Dementia; Deep learning; Text/Acoustic analysis; spontaneous speech

1. Introduction

Alzheimer’s Dementia (AD) is a neuro-degenerative disease that progressively worsens with time. A gradual decline in language and speech is one of the important cues of AD. Due to the ever-increasing number of dementia patients worldwide, it is necessary to find quick, inexpensive, non-invasive and objective tools to detect it. Some previous studies [1,2] have reported a significantly higher number of syllables, lexicon, and difficult words in healthy subjects than in people suffering from Alzheimer’s Dementia. [1] also reported differences in vocabulary and word-usage patterns between CN and AD groups. [3] reported that a basic set of acoustic features, such as F0, F1, F2, jitter, shimmer, and mel-frequency cepstral coefficients (MFCC), has the potential to detect physiological changes in voice production. In addition, other speech indicators which are known symptoms of AD include mispronunciation [4], articulation rate [5], higher hesitation ratio [6], impairments of speech rhythm [7], fluency, speech tempo, speech-pause distribution [8,9], and speech-silence patterns, among others. Disfluencies present in speech can be indicative as a possible marker for AD [10,11]. Patients suffering from AD talk slowly, take longer pauses, and take more time to think of the correct words; each of these adds to the disfluency of speech. Thus, language and speech can be useful distinguishing markers between the healthy control (CN) group and patients suffering from Alzheimer’s Dementia (AD) [12,13]. Due to the popularity of machine learning (ML) tools in the last decade, methods using automatic speech analysis for the screening, early detection and intervention of AD are gaining momentum in aiding clinical diagnosis [13–16]. Capitalizing on the availability of a multitude of complex audio and text feature extraction techniques arising from breakthroughs in natural language processing (NLP) tools, pre-trained transformer-based


language models [17,18] and automatic speech recognition (ASR) methods [19], studies [20–24] have shown promising results in interpreting and predicting AD.

Commonly employed approaches for AD prediction use audio features [14,21,25–27], text representations [26–31], or a fusion of both [32]. While [33] relied solely on acoustic features (using Bayesian Networks (BN), Trees-Random Forest (RF), Adaboost (AB) and Meta-Bagging (MB) classifiers) to detect AD with a high accuracy, [34] achieved a high classification accuracy by utilizing only linguistic features (with pre-trained Transformer-based models). However, comparisons between the performance of acoustic-based features and text-based features in dementia classification showed that text-based features performed better than acoustic-based features [3,23,35,36]. Recently released research indicates that a fusion of acoustic and linguistic models improves AD prediction accuracy [32,37–43].

Deep Learning (DL) models have grown in popularity in speech recognition research [44], because they allow for the implementation of multiple layers that capture information at different granularity levels, improving prediction accuracy [45]. Using VGGish deep acoustic embeddings on the ADReSSo dataset [3], combined with other feature aggregation methods such as Fisher Vector encodings (FVs) and Bag-of-Audio-Words (BoAW), [40] achieved a high accuracy. On the ADReSSo-2021 dataset (the same as ours), [46] attained an accuracy of 83.33% using fine-tuned Bidirectional Encoder Representations from Transformers (BERT). For automatic AD identification from continuous speech, [47] utilized perplexity characteristics collected from N-gram language models. [48] investigated both language features retrieved from the transcripts and encoded pauses, and [49] applied multi-layered perceptrons (MLP) and recurrent neural networks (RNN) to several types of audio and linguistic characteristics. By combining text data (at both the word level and the phoneme level) with audio features, [38] reported high AD classification accuracy using cross-validation. They demonstrated that text features were more distinct than acoustic features on their own. They also noted that deep learning text systems frequently overfit on tiny datasets, and used a multi-scale multi-model approach with phoneme-level embeddings to train on a small dataset. ALBERT, XLNet, RoBERTa, BioBERT, BioClinicalBERT, ConvBERT, and BERT were all used by [1], with a reported best accuracy of 87.50% on the ADReSSo-2021 dataset.

To make sense of the multitude of approaches already developed, we report here a parallel investigation of a multi-modal feature space with multi-model, multi-feature cues in addressing Alzheimer’s dementia on the ADReSSo-2021 dataset [3]. Previous studies did not have the opportunity to compare these models on the same dataset. Here, we offer a rich comparison of sixteen different methods to accomplish the classification between the Alzheimer’s dementia (AD) and healthy control (CN) groups from a set of speech recordings. This comparison informs readers on the performance of various models and architectures on the same dataset at both the file-level and the frame-level, which has not been reported before.

2. Experimental Methodology

In our investigation, we use a set of audio recordings in which participants are asked by the interviewer to describe the Cookie Theft picture from the Boston Diagnostic Aphasia Examination, provided as part of the ADReSSo (Alzheimer’s Dementia Recognition through Spontaneous Speech) dataset [3]. Participants include both cognitively normal (CN) subjects and people who have been diagnosed with Alzheimer’s Dementia (AD). The Mini-Mental State Exam (MMSE) score was used to distinguish between CN and AD subjects. The interviews last anywhere between 30 seconds and 250 seconds.

2.1. Audio Pre-processing and Choice of Features

To ensure the training quality of our model, we wish to use only data from the subject. Hence, all audio associated with the interviewer (including when it overlapped with the subject) was removed, while retaining all other audio segments (such as silence and filler words), as these non-speech segments could still contain useful cues.

To achieve this, segments corresponding to the interviewer were manually removed using Adobe Audition [50]. Various training features were then extracted in both the audio and text domains, at both the frame-level and the file-level (detailed in Table 1). Text features were extracted from transcriptions of the recorded speech generated using Otter.ai [51], a commercial speech-to-text transcription service.

As speech cues can occur at different levels (individual phonated utterances, or the entire string of utterances across the interview), we analyze the data at two levels: “frame-level” (‘granular’ or ‘frame-by-frame’ descriptors) and “file-level” (across the entire interview recording).

Table 1. Training features used in our 16 models.

Audio, file-level: eGeMAPS; emobase-Large; emobase; Speech/Silence; Energy-time Plots
Audio, frame-level: openSMILE (prosody); VGG; openL3
Text, file-level: Keg of Text Analytics; Keg of Text Analytics-Extended; Summed Word Embedding; BERT; RoBERTa; distilBERT; XLNet
Text, frame-level: Word Embedding

2.2. Audio Feature Extraction

2.2.1. File-level Audio Features

File-level features (or high-resolution features) were extracted from the entire audio files provided (minus the interviewer speech). These include features extracted using various configuration files of the OpenSMILE library (emobase, emobase-Large, eGeMAPS), and our original features: Speech/Silence statistics and energy-time plots. OpenSMILE [52] (Speech and Music Interpretation by Large Space Extraction) is a convenient toolkit which provides acoustic feature extraction and is capable of extracting Low-Level Descriptors (LLD).
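As an illustration of this file-level extraction step, the minimal sketch below uses the opensmile Python wrapper to compute eGeMAPS and emobase functionals for one pre-processed recording. The file path is hypothetical; the emobase-Large configuration is not bundled with the Python package and would instead require the standalone SMILExtract tool with its configuration file.

```python
import opensmile

# File-level functionals of the eGeMAPS set (88 features per file).
egemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# File-level emobase functionals (988 features per file).
emobase = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav = "segmented_audio/subject_001.wav"  # hypothetical path to an interviewer-removed recording
egemaps_df = egemaps.process_file(wav)   # pandas DataFrame, shape (1, 88)
emobase_df = emobase.process_file(wav)   # pandas DataFrame, shape (1, 988)
print(egemaps_df.shape, emobase_df.shape)
```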

• emobase: We extracted 988 acoustic features: (26 Low-Level Descriptors + 26 delta) * 19 functionals [52,53]. The feature set contains the mel-frequency cepstral coefficients (MFCC), voice quality, fundamental frequency (F0), F0 envelope, line spectral pairs (LSP) and intensity features, along with their first- and second-order derivatives and several statistical functionals applied to these features, resulting in a total of 988 features for every speech segment.
• emobase-Large: 6552 file-level features ((56 low-level descriptors + 56 delta + 56 delta-delta) * 39 functionals) were extracted [52,53]. Local features or low-level descriptors (MFCCs, pitch, energy, voice quality, etc.), their first and second derivatives (i.e., delta and delta-delta) and statistical functionals (global features) were extracted.
• eGeMAPS: 88 file-level features (25 low-level descriptors and their functionals) were extracted using the eGeMAPS [54] configuration in the OpenSMILE toolkit. eGeMAPS offers a reduced fundamental feature set of 88 features, comprising F0 semitone, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index, slope V0 features, and their most common statistical functionals [52], and can detect physiological changes in voice production.
• Speech/Silence: Pauses were identified in the pre-processed audio using a Voice Activity Detection (VAD) algorithm [55]. Speech segments of less than 0.3 seconds and silence segments of less than 0.2 seconds are considered false detections and are converted into the opposite type. The feature set consists of nine features. Out of the nine features, five correspond to the number of pauses, which were organized into five “pause bins” (pb). Each pause bin is associated with a duration: <0.5 s (pb1), 0.5-1 s (pb2), 1-2 s (pb3), 2-4 s (pb4), and >4 s (pb5). The sixth feature is the Centre of Mass of these pause bins (i.e., the number of pauses in a bin multiplied by the bin number), calculated as a weighted sum of the 5 bins, pbcm = 1*pb1 + 2*pb2 + ... + 5*pb5. The remaining three features are based on “sprat”, the speech-chunk-to-pause-chunk ratio, where the ratio of the lengths of each pair of a speech chunk and its subsequent pause chunk is calculated. The first, second (median), and third quartiles of this speech-silence segment-pair duration ratio across the whole recording (spratq1, spratq2, spratq3), followed by the average duration of speech segments, were then calculated (a code sketch of this pause-bin computation is given after this list). Fig. 1 shows a typical distribution of the normalized Speech/Silence features analyzed. As seen in Figure 1, the pause-bin values of the healthy control (CN) group are consistently lower than those of the AD group. As expected, the opposite trend of a higher speech-chunk-to-pause-chunk ratio was observed for the spratq features in the CN group.

Figure 1. A typical distribution of normalized Speech/Silence features analyzed for Alzheimer’s Dementia (AD) and Control Normal (CN).

• Energy-time plots: Two types of images were generated to represent the time-amplitude signal of the segmented audio files: a plot with Cartesian coordinates (x = time, y = absolute value of amplitude) (Fig. 2 (a),(b)), and a polar plot with time mapped to the angle θ and absolute amplitude values mapped to the radii ρ (Fig. 2 (c),(d)). All values were normalised to a number between 0 and 1, except for θ, which was normalised to an angle between 0 and 2π. The resulting images were then used to train an image-based model.
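The sketch below illustrates the Speech/Silence pause-bin features described above, assuming the VAD step has already produced a list of (label, duration) segments. The segment format, the label strings, and the handling of empty ratio lists are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pause_features(segments):
    """segments: list of (label, duration_s) tuples with label in {"speech", "silence"},
    already cleaned of false detections (<0.3 s speech, <0.2 s silence)."""
    pauses = [d for lab, d in segments if lab == "silence"]
    # Five pause bins by pause duration: <0.5, 0.5-1, 1-2, 2-4, >4 seconds.
    edges = [0.0, 0.5, 1.0, 2.0, 4.0, np.inf]
    pb = [sum(1 for p in pauses if lo <= p < hi) for lo, hi in zip(edges[:-1], edges[1:])]
    # Centre of mass of the pause bins: weighted sum with the bin index as the weight.
    pbcm = sum((i + 1) * n for i, n in enumerate(pb))
    # "sprat": duration ratio of each speech chunk to its subsequent pause chunk.
    ratios = []
    for (lab1, d1), (lab2, d2) in zip(segments[:-1], segments[1:]):
        if lab1 == "speech" and lab2 == "silence" and d2 > 0:
            ratios.append(d1 / d2)
    q1, q2, q3 = np.percentile(ratios, [25, 50, 75]) if ratios else (0.0, 0.0, 0.0)
    return pb + [pbcm, q1, q2, q3]  # nine features per recording
```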

2.2.2. Frame-level Audio Features

Frame-level features extracted from audio frames include VGG and openL3 feature embeddings, and features extracted using another openSMILE configuration file, Prosody:

• VGG: Semantically meaningful feature embeddings were extracted from the audio using VGGish, resulting in a 128-dimensional embedding of the input audio feature (a mel-spectrogram modified to a log scale) extracted from each audio frame [56,57].
• openL3: openL3 embeddings were extracted from the audio signal, resulting in a 512-dimensional embedding of the input audio feature (mel-spectrogram) extracted from each audio frame [58] (see the sketch after this list).
• Prosody: Features include the fundamental frequency, voice probability and PCM loudness, among others [52].
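For the frame-level embeddings described above, a minimal sketch using the openl3 package is shown below (the VGGish pipeline follows the same per-frame pattern). The file path, hop size and content type are assumptions rather than the settings used in this study.

```python
import openl3
import soundfile as sf

# Hypothetical path to a pre-processed (interviewer-removed) recording.
audio, sr = sf.read("segmented_audio/subject_001.wav")

# One 512-dimensional embedding per analysis frame; hop size here is an assumption.
emb, timestamps = openl3.get_audio_embedding(
    audio, sr,
    content_type="env", input_repr="mel256",
    embedding_size=512, hop_size=0.5,
)
print(emb.shape)  # (num_frames, 512): a sequence suitable for a Bi-LSTM classifier
```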

2.3. Text Feature Extraction

Automatic transcription was performed for the manually segmented audio signal using Dropbox integration with Otter.ai [51]. From the generated transcripts, text features were extracted. Unfortunately, the automated transcription excluded instances of hesitation such as uhm, errr, etc., which may in fact contain useful cues for classification. We also note that transcription accuracy may be compromised by the audio quality; manual transcription would be required to assess the accuracy of the automated Otter.ai transcript. Similar to the audio feature extraction, both file-level and frame-level features were extracted from the transcript.

Figure 2. A typical energy-time plot is shown. Cartesian representation for (a) Control Normal (CN) and (b) Alzheimer’s Dementia (AD), and polar representation for (c) Control Normal (CN) and (d) Alzheimer’s Dementia (AD), shows rhythmic/metrical representations as different spatial features.

2.3.1. File-level Text Features

Linguistic metrics were extracted from the automatically transcribed text. Features from the BERT family (BERT, distilBERT, RoBERTa) and XLNet were extracted using the HuggingFace transformers [59] library. Keg of Text Analytics is an original feature.

• Keg of Text Analytics: The file-level text features extracted using Matlab’s Text Analytics Toolbox include: the total number of words (Wt); the total number of unique words (Wu); unique words normalised (Wu/Wt) (looking out for repeated words or repetitive use of simple words); speech rate in words per second (Wt/t); the number of words which are not ‘stop words’ (Wt − Ws) (‘stop words’ are words which can be omitted without losing the meaning of the sentence, e.g. ‘a’, ‘the’, ‘to’, ‘and’); the ratio of the number of words with ≥4, ≥5 and ≥6 letters to the number of unique words (although the correlation between longer words and complexity is not constant in English, it aids in filtering out short words); the number of nouns, pronouns, adjectives, verbs, adverbs, auxiliary verbs, ad-positions, coordinating conjunctions, interjections, subordinating conjunctions and determiners; and lastly, a binary representation of the presence of the word “cookie” (included because many dementia subjects struggle to use the word “cookie” in the “Cookie Theft” picture). Figure 3 shows a typical parts-of-speech comparison for AD and CN. Further, a word cloud of 45 randomly selected AD and CN subjects from training is shown in Figure 4 (words with fewer than four characters were ignored in these charts). The difference in word count between CN and AD is obvious (seen in both Figure 3 and Figure 4): the CN group has a larger vocabulary bank than the AD group. Furthermore, CN subjects frequently produce longer words, indicating a higher cognitive capacity.

Two sets of file-level text features were investigated: Keg of Text Analytics and Keg of Text Analytics-Extended. In the former, a total of 12 features were used (Wt, Wu, the ratio of the number of words with ≥4, ≥5 and ≥6 letters to the number of unique words, Wt − Ws, Wt/t, and the number of pronouns, nouns, adverbs, adjectives and auxiliary verbs), and in the latter, a super-set with 18 features (containing the former 12 plus 6 additional features: the number of ad-positions, coordinating conjunctions, interjections, subordinating conjunctions and determiners, and the binary representation of the presence of “cookie”) is used.

Figure 3. A typical parts-of-speech comparison, extracted as part of Keg of Text Analytics. Differences
between Alzheimer’s Dementia (AD) and Normal Control (CN) can be observed, particularly for
nouns.

Figure 4. Word cloud of transcribed speech from (a) Alzheimer’s Dementia (AD) and (b) Healthy
Control (CN) groups.

• Summed Word Embedding: Words from the transcripts are embedded into a vector space model using fastText pre-trained word embedding and sentence classification [60,61]. The hyperspace dimension for this embedding was 300, resulting in a vector of 300 elements for each word. The resulting vectors were then summed to produce an overall representation of the transcription (a sketch covering both the summed and per-word variants is given at the end of Section 2.3.2). Although the implication of this sum is not immediately intuitive, our results here suggest that, at least for this dataset, it seems helpful in providing cues for the overall transcript.

• BERT: The text transcriptions were used to generate a feature embedding vector from the pre-trained BERT (Bidirectional Encoder Representations from Transformers) [18] model. BERT pre-trains a language representation over a large amount of unlabeled textual data, employing an attention mechanism to learn a deeper sense of the contextual relationships between words in a text (a sketch of this embedding extraction with the HuggingFace library is given after this list).
• RoBERTa: RoBERTa (a Robustly optimized BERT approach) [17] uses dynamic masking (unlike BERT’s static masking), which increases the data variability (by augmentation) and helps the network learn more robust features.
• distilBERT: DistilBERT is a cheaper, smaller version of BERT [62], which is 60% faster and 40% smaller while retaining 97% of its performance.
• XLNet: XLNet [63] uses an autoregressive pretraining method, unlike BERT’s autoencoding-based pretraining.
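A minimal sketch of extracting a file-level embedding from a pre-trained model with the HuggingFace transformers library is shown below. The choice of roberta-base and the mean-pooling of the final hidden states are assumptions, since the pooling strategy is not specified above; the transcript string is a made-up placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"  # analogous: "bert-base-uncased", "distilbert-base-uncased", "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

transcript = "the boy is taking a cookie from the cookie jar"  # hypothetical Otter.ai output
inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)

# Mean-pool over tokens to obtain one file-level feature vector per transcript.
embedding = hidden.mean(dim=1).squeeze(0).numpy()   # shape (768,)
```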

2.3.2. Frame-level Text Features

Word Embedding: This feature extraction is similar to the Summed Word Embedding extraction, except that the vectors are not summed; instead, the 300-element feature representation of each individual word was analyzed.
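The sketch below covers both word-embedding variants using the fasttext package and the pre-trained cc.en.300 English vectors [60,61]; the model file name and the example tokens are assumptions.

```python
import numpy as np
import fasttext
import fasttext.util

# Downloads cc.en.300.bin into the working directory (large file) if not already present.
fasttext.util.download_model("en", if_exists="ignore")
ft = fasttext.load_model("cc.en.300.bin")

words = "the boy is on the stool reaching for the cookie jar".split()  # hypothetical transcript tokens

# Frame-level Word Embedding: one 300-dimensional vector per word.
vectors = np.stack([ft.get_word_vector(w) for w in words])  # shape (num_words, 300)

# File-level Summed Word Embedding: element-wise sum over all word vectors.
summed = vectors.sum(axis=0)                                # shape (300,)
```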

2.4. Training Models

This study looked into a variety of machine learning and deep learning models. Table 2 shows the models investigated for the sixteen feature sets. The choice of models was motivated by the success of these algorithms in supporting fast prototyping when working with various frame-level or file-level features. Various combinations of classifiers and their hyper-parameters were fine-tuned for every feature set, and the resulting training losses were compared. Those with the lowest training loss for a particular feature are reported here. Voting was also used to combine various ensemble models, including bagging, gradient boost, random forest, and adaboost classifiers, because the resulting combination outperformed the individual classifiers (a sketch of such an ensemble is given below). In the case of frame-level features, where long-term dependencies and dynamics are present in the sequences of data captured at the frame level, GMM-UBM and Bi-LSTM were used. The GMM-UBM uses a mixture of 512 Gaussians, and the Bi-LSTM models chosen were only one layer deep with 100 hidden units, except for openL3, where the model was two layers deep. For the Deep Neural Networks used for modelling the various text embeddings, a grid search was applied and hyperparameter optimization was performed to decide on a Dense (fully-connected) layer network with either ReLU or Softmax activations. A Convolutional Neural Network (CNN), a traditional approach for handling images, was chosen with four 3x3 convolution layers with max pooling for modelling the Energy-time plot feature set.
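As an illustration of the file-level classifiers, the sketch below assembles a hard-voting ensemble from the classifier families named above using scikit-learn. The estimator counts and the dummy feature matrix are illustrative assumptions, not the tuned values; the CatBoost model used for Speech/Silence comes from the separate catboost package and is not shown.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)

# Hard voting over bagging, gradient boosting, random forest and AdaBoost classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("bagging", BaggingClassifier(n_estimators=200)),
        ("gboost", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=500)),
        ("ada", AdaBoostClassifier()),
    ],
    voting="hard",
)

# Placeholder data: one row of file-level features per recording, labels 0 = CN, 1 = AD.
X_train = np.random.rand(166, 88)
y_train = np.random.randint(0, 2, 166)
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(np.random.rand(71, 88))
```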


Table 2. Training features and models investigated in this work. AD: Alzheimer’s Dementia, CN: Normal Control.

Resolution (Modality)   Feature                           Model (number of estimators, if applicable)
file-level (Audio)      eGeMAPS                           Bagging Classifier (500)
                        Emobase-Large                     Hard Voting
                        Emobase                           Bagging Classifier (200)
                        Speech/Silence                    Catboost (1000)
                        Energy-time plot                  CNN
file-level (Text)       Keg of Text Analytics             Random Forest (500)
                        Keg of Text Analytics-Extended    Random Forest (500)
                        Summed Word Embedding             Hard Voting
                        BERT                              DNN (4 layers)
                        distilBERT                        DNN (5 layers)
                        RoBERTa                           DNN (3 layers)
                        XLNet                             DNN (3 layers)
frame-level (Audio)     openSMILE (prosody)               GMM-UBM
                        VGG                               Bi-LSTM
                        OpenL3                            Bi-LSTM
frame-level (Text)      Word Embedding                    Bi-LSTM

2.5. Train-test Dataset and Result Evaluation Metric

The ADReSSo-2021 dataset [3] has subjects from a healthy control group (the CN cohort) and subjects with cognitive decline (assigned to the AD cohort). 237 audio recordings were made available, which were balanced for age and gender (to avoid biases). 166 of these were used for training, and the models were tested against a mutually exclusive set of 71 audio recordings. The numbers of audio recordings in the AD training and testing groups were 87 and 35, respectively. The CN training and testing groups contained 79 and 36 recordings, respectively.

The results for the AD classification task are shown in terms of accuracy: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative). We considered the AD class to be the positive one.
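For completeness, a toy sketch of this metric with made-up labels, using scikit-learn:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# 0 = CN, 1 = AD (AD is the positive class); labels below are illustrative only.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
acc = accuracy_score(y_true, y_pred)                     # (TP + TN) / total = 3/5
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```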

3. Results

All our 16 models performed well above chance (see Table 3), with the poorest performance arising from Speech/Silence at 66.2% accuracy. On the other hand, the best performing model was RoBERTa at an accuracy of 88.7%, followed by distilBERT at 85.9% and BERT at 84.5%. Looking at the confusion matrix (Figure 5) for RoBERTa, 29 out of 36 CN subjects and 34 out of 35 AD subjects were correctly classified. Comparison between file-level text features and file-level audio features reveals that the file-level text features outperform the file-level audio features: 78.87±7.66% vs 70.98±4.52% on average, respectively. This observation holds even on a broader scale (overall text vs overall audio): 78.87±7.09% vs 73.23±4.95% on average, respectively. This agrees with previous observations [3,23,35,36,38] that text features contain more distinguishing cues than audio features for identifying AD.

Regarding automated transcription, whose accuracy depends on the overall intelligibility of the recorded utterances, CN cases naturally transcribe more richly (due to increased verbosity and word variety) and more faithfully (due to better speech clarity) than AD cases. Transcriptions that yield meaningful strings of words (or fail to do so) result in contrasting text feature profiles (see Fig. 3), facilitating classification between AD and CN. Accordingly, we indeed observe our best performances, at 88.7%, 85.9% and 84.5%, when using BERT-related features. Among the other text feature sets, the Keg of Text Analytics and Keg of Text Analytics-Extended both resulted in identical accuracies of 76.1%, a drop in performance compared to the BERT-related features (associated with more AD misclassifications). Finally, the Summed Word Embedding resulted in a lower accuracy (73.2%), i.e., more misclassifications overall, in comparison with the Keg of Text Analytics features and the BERT-related text features.

Frame-level audio features (77.0±3.29%) performed better than file-level audio features (70.98±4.52%), suggesting that extended audio recordings need not be aggregated; rather, brief speech audio samples representing the intrinsic mechanics and dynamics of speaker idiosyncrasies may be preferred, being also easier to collect. Among the various file-level audio features explored, however, Emobase and Emobase-Large yielded accuracies of 70.4% vs 77.5%, respectively. In this case, Emobase-Large (consisting of a larger feature set) resulted in lower misclassification for both AD and CN. eGeMAPS, on the other hand, resulted in 67.6% accuracy, which is associated with more misclassification than Emobase (70.4%).

The best performing text-based feature at the file-level (RoBERTa) performed 9.8% better than the best frame-level text feature (Word Embedding). Conversely, the best audio-based feature at the frame-level (VGG) performed 1.4% better than the best file-level audio feature (Emobase-Large). Extracting deep features using VGGish's deep neural network at the frame-level could be a reason for its higher performance compared with the conventional acoustic features (low-level descriptors and functionals) of extractors such as Emobase-Large; it appears that VGG was able to find task-relevant features leading to better classification accuracy.

Our four proposed original feature extraction methods - Speech/Silence, Energy-time Plot, Keg of Text Analytics, and Keg of Text Analytics-Extended - achieved accuracies of 66.2%, 73.2%, 76.1% and 76.1%, respectively. These ab initio results may currently be lower than those of the well-established features, but with further exploration we may expect overall improvements in performance.

The pre-trained BERT, distilBERT, RoBERTa and XLNet models used in this study achieved accuracies of 84.5%, 85.9%, 88.7% and 67.6%, respectively (Table 3). These pre-trained models were used to extract features from the transcribed texts, followed by hyper-parameter optimization of the Deep Neural Network (DNN) classifiers. Appendix A details how the DNN models for BERT, distilBERT, RoBERTa and XLNet were optimized.

Table 4 compares the best result of this study to state-of-the-art models (at the time of writing) from other AD classification studies based on the same ADReSSo-2021 dataset, surpassing them in accuracy by 1.2%. Consistent with earlier reports of successful pre-trained BERT models in AD classification, RoBERTa achieved the best accuracy in this study, at 88.7%.
Table 3. Training features and their associated performance are shown. Accuracy is presented as a percentage reflecting the total correct predictions out of total predictions. Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative). The AD class is positive. AD: Alzheimer’s Dementia; CN: Normal Control.

Resolution (Modality)   Feature                           Accuracy (%) (Correct CN/Total CN, Correct AD/Total AD)
file-level (Audio)      eGeMAPS                           67.6 (26/36, 22/35)
                        Emobase-Large                     77.5 (30/36, 25/35)
                        Emobase                           70.4 (29/36, 21/35)
                        Speech/Silence*                   66.2 (25/36, 22/35)
                        Energy-time plot                  73.2 (25/36, 23/35)
file-level (Text)       Keg of Text Analytics             76.1 (30/36, 24/35)
                        Keg of Text Analytics-Extended    76.1 (30/36, 24/35)
                        Summed Word Embedding             73.2 (26/36, 26/35)
                        BERT*                             84.5 (30/36, 30/35)
                        distilBERT*                       85.9 (29/36, 32/35)
                        RoBERTa*                          88.7 (29/36, 34/35)
                        XLNet*                            67.6 (25/36, 23/35)
frame-level (Audio)     openSMILE (prosody)               78.9 (28/36, 28/35)
                        VGG                               78.9 (31/36, 25/35)
                        OpenL3                            73.2 (22/36, 30/35)
frame-level (Text)      Word Embedding                    78.9 (28/36, 28/35)
* See Appendix A for model architecture.

Figure 5. Confusion matrix (0: CN, 1: AD) for the best performing model, RoBERTa.

4. Discussion

This study presents a systematic comparison of various approaches and methods, and thus allows us some insight into the speech information that should be considered depending on the speech sampling context. If the audio quality is amenable to faithful transcription (low background noise, good speaker clarity, monolingual and well-documented accents), our study shows that file-level text is most effective, with the BERT family performing rather satisfactorily at 86.26±2.27%. However, care must be taken with file-level text, as the XLNet feature (67.6%) offered the second lowest accuracy among the sixteen features studied.

In contrast, if the audio quality is good (high signal-to-noise ratio) but the speaker does not enunciate clearly, or perhaps speaks in a way not compatible with automated text transcription (e.g. mixed languages used, a poorly documented accent), the results suggest that frame-level analysis is then preferred. It can be expected that phonatory dynamics captured in short time frames may contain enough inherent information about the speaker’s psycho-motor health, thus obviating the need to analyse an extended duration of speech audio. Further, such an approach, relying on audio feature sets alone, has the added benefit of not being limited to a particular carrier language, since it should reflect the speaker’s cognitive/physiological state without the extra burden and process of transcription pigeon-holed into a particular carrier language (which may introduce transcription errors compromising the efficacy of a text-based approach).

Frame-level features, such as sub-word or phoneme-level representations, are suitable for training on small datasets and have been shown to have higher dementia classification accuracy than file-level features (such as word- or sentence-level ones) [38]. On the other hand, file-level features have the benefit of larger temporal samples and increased opportunities to rely on cues that arise randomly or infrequently [48].

Overall, the text modality performs better on average (file-level + frame-level) than audio. This may suggest that, with the onset of AD, cognitive decline affects linguistic capacity more sensitively and may be apparent earlier than physiological decline. Given that factors such as semantic distance, lexical resource and grammar structure may become simpler as cognitive decline progresses, text-based methods could be more useful and provide greater insight into differentiating AD from CN, resulting in higher detection accuracy. However, text-based approaches require faithful and robust transcription processing as an additional step, complicating the analysis. Furthermore, many transcription services currently available automatically omit non-semantic utterances (such as mmm, uh, um), which may in fact further enhance detection accuracy [48,64].

Thus, although the results in the current study indicate that text-based methods outperform audio-based methods, the successful implementation of text-based methods depends sensitively on the context in question (such as the carrier language and the choice and availability of reliable transcription, among others). Audio-based methods, on the other hand, offer greater flexibility and are less sensitive to the context: as long as a good signal-to-noise ratio is achieved in recording the speaker’s utterances, analysis can still proceed with some reliability, agnostic to the carrier language(s) involved. Lastly, depending again on the audio context and linguistic requirements, a simple voting- and/or fusion-model can then be implemented to reconcile the multi-modal multi-model multi-feature approach and yield an optimal AD/CN classification outcome. However, this approach was not tested in our investigation due to the absence of a validation dataset.

Thus, although text generally gives better accuracy results than audio, it is limited by the accuracy of transcription services. Audio features, on the other hand, are independent of the carrier/target language.

Table 4. Comparison with other studies reporting on the ADReSSo-2021 dataset.

Reference        Best performing modality             Highest accuracy % (model)
[32]             Fusion                               83.7
[65]             Text                                 84.3 (FastText)
[66]             Multimodal (with Gating)             84 (BiLSTM)
[67]             Transcript                           84.5
[68]             Text (with Disfluency)               83
[69]             Text (with sentence-level pauses)    83
[1]              Text                                 87.5
[3] (Baseline)   Fusion                               78.8
This study       Text                                 88.7 (DNN)

5. Conclusions

An approach for assessing audio recordings of spontaneous speech utterances related to Alzheimer’s Dementia, utilizing a multi-model machine learning strategy, is presented. The study analysed the effectiveness of various features extracted from the audio recordings and from text derived from the audio, employing 16 different models to assess their relative performance. All models, trained on distinct feature sets, demonstrated accuracy above chance levels. Overall, text-based features extracted from the transcribed audio outperformed the audio-based features. The best performance was achieved by modelling the RoBERTa text embeddings, which attained an accuracy of 88.7%, achieving near-perfect classification for AD by identifying 34 out of 35 cases. This represents a 1.2% improvement over other state-of-the-art models for AD classification trained on the same ADReSSo-2021 dataset.

When comparing the level of granularity (resolution) in the feature extraction process, frame-level audio features generally outperform their file-level counterparts. However, as the number of features increases, such as with Emobase-Large, the results become more comparable. This suggests that a sufficient number of low-level descriptors can effectively represent both the AD and CN classes. In contrast, while file-level text embeddings have produced higher accuracy than frame-level word embeddings, this trend cannot be generalized, since only one model has been tested at the latter level.

The four original feature extraction methods we proposed, namely Speech/Silence, Energy-time Plot, Keg of Text Analytics, and Keg of Text Analytics-Extended, have demonstrated reasonable accuracy. While their performance may not yet match off-the-shelf feature sets, additional exploration and fine-tuning could lead to further improvements.

This investigation suggests that the transcribed textual data can produce meaningful word sequences that lead to contrasting results in text feature analysis. This is particularly helpful in classifying Alzheimer’s disease (AD) and cognitively normal (CN) patients, as factors such as semantic distance, vocabulary usage, and grammar structures tend to be simpler for dementia patients. While transcribing English audio datasets is relatively straightforward, transcription resources for other languages may not be as easily available nor as mature and robust, making the text-based approach sensitive to the linguistic context of the target speaker. On the other hand, an audio-based approach is not constrained by the carrier language, making it more generalizable and useful for developing AD classification models which are more universal, as long as a good signal-to-noise ratio of the speech utterances is captured. In practical terms, however, it is expected that the optimal solution is to implement a combination of both approaches to best reconcile differences arising from the context.

This study provides insight into the effectiveness of machine learning techniques trained on both textual and audio data for a given dataset, which is an improvement over previous studies that focused on only one (or one of the many) aspect(s) of the feature space or model selection. The implications of this research are particularly relevant, as they may assist clinicians and caregivers in detecting early-stage biomarkers of Alzheimer’s disease (AD). Significantly, this can be accomplished using non-invasive and cost-effective approaches, which could be of great benefit to patients and their families.

References

1. Ilias, L.; Askounis, D. Explainable identification of dementia from transcripts using transformer networks. IEEE Journal of Biomedical and Health Informatics 2022, 26, 4153–4164.
2. Forbes-McKay, K.E.; Venneri, A. Detecting subtle spontaneous language decline in early Alzheimer’s disease with a picture description task. Neurological Sciences 2005, 26, 243–254.

3. Luz, S.; Haider, F.; de la Fuente, S.; Fromm, D.; MacWhinney, B. Detecting cognitive decline using speech only: The ADReSSo challenge. arXiv preprint arXiv:2104.09356 2021.
4. Orange, J.B.; Lubinski, R.B.; Higginbotham, D.J. Conversational repair by individuals with dementia of the Alzheimer’s type. Journal of Speech, Language, and Hearing Research 1996, 39, 881–895.
5. Szatloczki, G.; Hoffmann, I.; Vincze, V.; Kalman, J.; Pakaski, M. Speaking in Alzheimer’s disease, is that an early sign? Importance of changes in language abilities in Alzheimer’s disease. Frontiers in Aging Neuroscience 2015, 7, 195.
6. Hoffmann, I.; Nemeth, D.; Dye, C.D.; Pákáski, M.; Irinyi, T.; Kálmán, J. Temporal parameters of spontaneous speech in Alzheimer’s disease. International Journal of Speech-Language Pathology 2010, 12, 29–34.
7. Martínez-Sánchez, F.; Meilán, J.J.; Vera-Ferrandiz, J.A.; Carro, J.; Pujante-Valverde, I.M.; Ivanova, O.; Carcavilla, N. Speech rhythm alterations in Spanish-speaking individuals with Alzheimer’s disease. Aging, Neuropsychology, and Cognition 2017, 24, 418–434.
8. Pastoriza-Domínguez, P.; Torre, I.G.; Diéguez-Vide, F.; Gomez-Ruiz, I.; Gelado, S.; Bello-López, J.; Ávila-Rivera, A.; Matias-Guiu, J.A.; Pytel, V.; Hernández-Fernández, A. Speech pause distribution as an early marker for Alzheimer’s disease. Speech Communication 2022, 136, 107–117.
9. Yuan, J.; Cai, X.; Bian, Y.; Ye, Z.; Church, K. Pauses for detection of Alzheimer’s disease. Frontiers in Computer Science 2021, 2, 624488.
10. Rohanian, M.; Hough, J.; Purver, M. Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer’s dementia recognition from spontaneous speech. arXiv preprint arXiv:2106.09668 2021.
11. López-de Ipiña, K.; Alonso, J.B.; Travieso, C.M.; Solé-Casals, J.; Egiraun, H.; Faundez-Zanuy, M.; Ezeiza, A.; Barroso, N.; Ecay-Torres, M.; Martinez-Lage, P.; et al. On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors 2013, 13, 6730–6745.
12. König, A.; Satt, A.; Sorin, A.; Hoory, R.; Toledo-Ronen, O.; Derreumaux, A.; Manera, V.; Verhey, F.; Aalten, P.; Robert, P.H.; et al. Automatic speech analysis for the assessment of patients with predementia and Alzheimer’s disease. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring 2015, 1, 112–124.
13. Pulido, M.L.B.; Hernández, J.B.A.; Ballester, M.Á.F.; González, C.M.T.; Mekyska, J.; Smékal, Z. Alzheimer’s disease and automatic speech analysis: A review. Expert Systems with Applications 2020, 150, 113213.
14. Meilán, J.J.; Martínez-Sánchez, F.; Carro, J.; Sánchez, J.A.; Pérez, E. Acoustic markers associated with impairment in language processing in Alzheimer’s disease. The Spanish Journal of Psychology 2012, 15, 487.
15. Hason, L.; Krishnan, S. Spontaneous speech feature analysis for Alzheimer’s disease screening using a random forest classifier. Frontiers in Digital Health 2022, 4.
16. Yang, Q.; Li, X.; Ding, X.; Xu, F.; Ling, Z. Deep learning-based speech analysis for Alzheimer’s disease detection: a literature review. Alzheimer’s Research & Therapy 2022, 14, 1–16.
17. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 2019.
18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 2018.
19. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 2020, 33, 12449–12460.
20. López-de Ipiña, K.; Alonso, J.B.; Barroso, N.; Faundez-Zanuy, M.; Ecay, M.; Solé-Casals, J.; Travieso, C.M.; Estanga, A.; Ezeiza, A. New approaches for Alzheimer’s disease diagnosis based on automatic spontaneous speech analysis and emotional temperature. In Proceedings of the International Workshop on Ambient Assisted Living. Springer, 2012, pp. 407–414.
21. Lopez-de Ipiña, K.; Alonso, J.B.; Solé-Casals, J.; Barroso, N.; Faundez-Zanuy, M.; Ecay-Torres, M.; Travieso, C.M.; Ezeiza, A.; Estanga, A.; et al. Alzheimer disease diagnosis based on automatic spontaneous speech analysis 2012.
22. Qiu, S.; Miller, M.I.; Joshi, P.S.; Lee, J.C.; Xue, C.; Ni, Y.; Wang, Y.; De Anda-Duran, I.; Hwang, P.H.; Cramer, J.A.; et al. Multimodal deep learning for Alzheimer’s disease dementia assessment. Nature Communications 2022, 13, 3404.

23. Pappagari, R.; Cho, J.; Moro-Velazquez, L.; Dehak, N. Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer’s disease and assess its severity. In Proceedings of the INTERSPEECH, 2020, pp. 2177–2181.
24. Haulcy, R.; Glass, J. Classifying Alzheimer’s disease using audio and text-based representations of speech. Frontiers in Psychology 2021, 11, 624137.
25. Balagopalan, A.; Novikova, J. Comparing acoustic-based approaches for Alzheimer’s disease detection. arXiv preprint arXiv:2106.01555 2021.
26. Kato, S.; Homma, A.; Sakuma, T. Easy screening for mild Alzheimer’s disease and mild cognitive impairment from elderly speech. Current Alzheimer Research 2018, 15, 104–110.
27. Khodabakhsh, A.; Demiroglu, C. Analysis of speech-based measures for detecting and monitoring Alzheimer’s disease. In Data Mining in Clinical Medicine; Springer, 2015; pp. 159–173.
28. Ahangar, A.A.; Jafarzadeh Fadaki, S.M.; Sehhati, A. The comparison of morpho-syntactic patterns device comprehension in speech of Alzheimer and normal elderly people. Zahedan Journal of Research in Medical Sciences 2018, 20.
29. Beltrami, D.; Calzà, L.; Gagliardi, G.; Ghidoni, E.; Marcello, N.; Favretti, R.R.; Tamburini, F. Automatic identification of mild cognitive impairment through the analysis of Italian spontaneous speech productions. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 2086–2093.
30. Fraser, K.C.; Meltzer, J.A.; Graham, N.L.; Leonard, C.; Hirst, G.; Black, S.E.; Rochon, E. Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex 2014, 55, 43–60.
31. Sanz, C.; Carrillo, F.; Slachevsky, A.; Forno, G.; Gorno Tempini, M.L.; Villagra, R.; Ibáñez, A.; Tagliazucchi, E.; García, A.M. Automated text-level semantic markers of Alzheimer’s disease. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring 2022, 14, e12276.
32. Ying, Y.; Yang, T.; Zhou, H. Multimodal fusion for Alzheimer’s disease recognition. Applied Intelligence 2022, pp. 1–12.
33. Al-Hameed, S.; Benaissa, M.; Christensen, H. Simple and robust audio-based detection of biomarkers for Alzheimer’s disease. In Proceedings of the 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT), 2016, pp. 32–36.
34. Searle, T.; Ibrahim, Z.; Dobson, R. Comparing natural language processing techniques for Alzheimer’s dementia prediction in spontaneous speech. arXiv preprint arXiv:2006.07358 2020.
35. Farzana, S.; Parde, N. Exploring MMSE score prediction using verbal and non-verbal cues. In Proceedings of the INTERSPEECH, 2020, pp. 2207–2211.
36. Clarke, C.J.; Melechovsky, J.; Lin, C.M.Y.; Priyadarshinee, P.; Balamurali, B.; Chen, J.M.; Kapoor, S.; Aharonov, O. Addressing multi-modal multi-model multi-feature cues in Alzheimer’s Dementia.
37. Koo, J.; Lee, J.H.; Pyo, J.; Jo, Y.; Lee, K. Exploiting multi-modal features from pre-trained networks for Alzheimer’s dementia recognition. arXiv preprint arXiv:2009.04070 2020.
38. Edwards, E.; Dognin, C.; Bollepalli, B.; Singh, M.K.; Analytics, V. Multiscale system for Alzheimer’s dementia recognition through spontaneous speech. In Proceedings of the INTERSPEECH, 2020, pp. 2197–2201.
39. Gosztolya, G.; Vincze, V.; Tóth, L.; Pákáski, M.; Kálmán, J.; Hoffmann, I. Identifying mild cognitive impairment and mild Alzheimer’s disease based on spontaneous speech using ASR and linguistic features. Computer Speech & Language 2019, 53, 181–197.
40. Syed, M.S.S.; Syed, Z.S.; Lech, M.; Pirogova, E. Automated screening for Alzheimer’s dementia through spontaneous speech. INTERSPEECH (to appear) 2020, pp. 1–5.
41. Syed, Z.S.; Syed, M.S.S.; Lech, M.; Pirogova, E. Tackling the ADRESSO challenge 2021: The MUET-RMIT system for Alzheimer’s dementia recognition from spontaneous speech. In Proceedings of the Interspeech, 2021, pp. 3815–3819.
42. Chen, J.; Ye, J.; Tang, F.; Zhou, J. Automatic detection of Alzheimer’s disease using spontaneous speech only. In Proceedings of the Interspeech. NIH Public Access, 2021, Vol. 2021, p. 3830.
43. Pappagari, R.; Cho, J.; Joshi, S.; Moro-Velázquez, L.; Zelasko, P.; Villalba, J.; Dehak, N. Automatic detection and assessment of Alzheimer disease using speech and language technologies in low-resource scenarios. In Proceedings of the Interspeech, 2021, pp. 3825–3829.
44. Kim, Y.; Lee, H.; Provost, E.M. Deep learning for robust feature generation in audiovisual emotion recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 3687–3691.
45. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.

46. Balagopalan, A.; Eyre, B.; Rudzicz, F.; Novikova, J. To BERT or not to BERT: Comparing speech and language-based approaches for Alzheimer’s disease detection. arXiv preprint arXiv:2008.01551 2020.
47. Guo, Z.; Ling, Z.; Li, Y. Detecting Alzheimer’s disease from continuous speech using language models. Journal of Alzheimer’s Disease 2019, 70, 1163–1174.
48. Yuan, J.; Bian, Y.; Cai, X.; Huang, J.; Ye, Z.; Church, K. Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease. Proc. Interspeech 2020 2020, pp. 2162–2166.
49. Sarawgi, U.; Zulfikar, W.; Soliman, N.; Maes, P. Multimodal inductive transfer learning for detection of Alzheimer’s dementia and its severity. arXiv preprint arXiv:2009.00700 2020.
50. Adobe Audition, version 23.0. https://www.adobe.com/products/audition.html. Accessed: 2022-12-30.
51. Otter.ai. https://otter.ai/login. Accessed: 2021-03-21.
52. Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
53. Parlak, C.; Diri, B. Emotion recognition from the human voice. In Proceedings of the 2013 21st Signal Processing and Communications Applications Conference (SIU). IEEE, 2013, pp. 1–4.
54. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing 2015, 7, 190–202.
55. Brookes, M.; et al. Voicebox: Speech processing toolbox for MATLAB. Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 1997, 47.
56. Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
57. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
58. Cramer, J.; Wu, H.H.; Salamon, J.; Bello, J.P. Look, listen, and learn more: Design choices for deep audio embeddings. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3852–3856.
59. Transformers - transformers 3.3.0 documentation, Hugging Face. https://huggingface.co/transformers/v3.3.0/index.html. Accessed: 2022-12-30.
60. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893 2018.
61. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 2016.
62. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 2019.
63. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 2019, 32.
64. Davis, B.H.; Maclagan, M.A. UH as a pragmatic marker in dementia discourse. Journal of Pragmatics 2020, 156, 83–99.
65. Meghanani, A.; Anoop, C.; Ramakrishnan, A.G. Recognition of Alzheimer’s dementia from the transcriptions of spontaneous speech using fastText and CNN models. Frontiers in Computer Science 2021, 3, 624558.
66. Rohanian, M.; Hough, J.; Purver, M. Alzheimer’s dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs. arXiv preprint arXiv:2106.15684 2021.
67. Pan, Y.; Mirheidari, B.; Harris, J.M.; Thompson, J.C.; Jones, M.; Snowden, J.S.; Blackburn, D.; Christensen, H. Using the outputs of different automatic speech recognition paradigms for acoustic- and BERT-based Alzheimer’s dementia detection through spontaneous speech. In Proceedings of the Interspeech, 2021, pp. 3810–3814.

68. Qiao, Y.; Yin, X.; Wiechmann, D.; Kerz, E. Alzheimer’s disease detection from spontaneous speech through combining linguistic complexity and (dis)fluency features with pretrained language models. arXiv preprint arXiv:2106.08689 2021.
69. Zhu, Y.; Obyat, A.; Liang, X.; Batsis, J.A.; Roth, R.M. WavBERT: Exploiting semantic and non-semantic speech using Wav2vec and BERT for dementia detection. In Proceedings of the Interspeech, 2021, pp. 3790–3794.

Author Contributions: Conceptualization, P.P., B.BT., C.J.C., and JM.C.; methodology, P.P., B.BT., C.J.C., JM.C., C.M.Y.L., and J.M.; software, P.P., B.BT., C.J.C., C.M.Y.L., and J.M.; validation, P.P., B.BT., C.J.C., and JM.C.; investigation, P.P., B.BT., C.J.C., JM.C., C.M.Y.L., and J.M.; data curation, P.P., B.BT., C.J.C., and JM.C.; writing—original draft preparation, P.P., B.BT., C.J.C., JM.C., and J.M.; writing—review and editing, P.P., B.BT., C.J.C., and JM.C.; visualization, P.P., B.BT., and C.J.C.; supervision, JM.C.; funding acquisition, JM.C.

Funding: This research was funded by the SUTD Growth Plan (SGP) Grant to Healthcare Sector (PIE-SGP-HC-2019-01).

Data Availability Statement: The dataset is free and made available by the DementiaBank Pitt Corpus: https://luzs.gitlab.io/adresso-2021/

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AD       Alzheimer’s Dementia
CN       Normal Control
MFCC     Mel-Frequency Cepstral Coefficients
ML       Machine Learning
DL       Deep Learning
NLP      Natural Language Processing
ASR      Automatic Speech Recognition
MLP      Multi-Layered Perceptrons
BERT     Bidirectional Encoder Representations from Transformers
MMSE     Mini-Mental State Exam
LLD      Low Level Descriptors
GMM-UBM  Gaussian Mixture Model-Universal Background Model
DNN      Deep Neural Networks
CNN      Convolutional Neural Networks
VAD      Voice Activity Detection
LSTM     Long Short-Term Memory
RNN      Recurrent Neural Networks
BN       Bayesian Networks
RF       Random Forest

Appendix A

BERT*: The DNN model used was 4 layers deep. The number of units in the first, second, third and fourth dense layers were 103, 76, 65, and 105, respectively. The associated activation functions were Softmax, ReLU, Softmax and ReLU, respectively. The Adam optimizer with a learning rate of 0.01 was used.

RoBERTa*: The DNN model used was 3 layers deep. The number of units in the first, second, and third dense layers were 65, 154, and 224, respectively. The activation functions used for these layers were Softmax, ReLU, and Softmax, respectively. The Adam optimizer was used with a learning rate of 0.023.
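As an illustration, a minimal Keras sketch of the RoBERTa DNN described above is given below. The input dimension and the final 2-unit softmax classification layer are assumptions, since only the three hidden layers and the optimizer settings are specified.

```python
import tensorflow as tf

# Three dense layers as specified (65/154/224 units, Softmax/ReLU/Softmax);
# the 768-dim input and the 2-unit output layer are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(768,)),             # assumed size of a RoBERTa embedding
    tf.keras.layers.Dense(65, activation="softmax"),
    tf.keras.layers.Dense(154, activation="relu"),
    tf.keras.layers.Dense(224, activation="softmax"),
    tf.keras.layers.Dense(2, activation="softmax"),   # assumed AD/CN output layer
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.023),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```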

DistilBERT*: The DNN model used for DistilBERT was 5 layers deep. The number of units in the first, second, third, fourth and fifth dense layers were 120, 29, 213, 98 and 152, respectively. The activation functions were ReLU, Softmax, Softmax, Softmax and ReLU, respectively. The Adam optimizer with a learning rate of 0.075 was used.

XLNet*: The DNN model used was three layers deep, with 103, 137 and 73 units in the first, second and third dense layers, respectively. The activation functions used for these three layers were Softmax, ReLU, and ReLU, respectively. The Adam optimizer with a learning rate of 0.05 was used.

Speech/Silence*: A 20-fold cross-validation and a grid search were conducted across learning rates and the number of PCA components. The final model’s learning rate was 0.05 and its PCA dimension was six. The classifier used was Catboost.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
