Alzheimer's Dementia Speech (Audio vs. Text): Multi-Model Machine Learning at High vs. Low Resolution
Abstract: We investigate a multi-modal approach (audio and text) for the automatic detection of Alzheimer's Dementia from recordings of spontaneous speech. Sixteen feature sets, including four feature extraction methods not previously applied in such contexts, were utilized and tested to determine their relative performance. These features encompass two modalities (audio vs. text) at two resolution scales (frame-level vs. file-level). We compared the accuracy of these features and found that text-based classification outperformed audio-based classification, with the best performance attaining 88.7%, surpassing other reports to date relying on the same dataset. For text-based classification in particular, the best file-level feature performed 9.8% better than the best frame-level feature. For audio-based classification, however, the best frame-level feature performed 1.4% better than the best file-level feature. This multi-modal, multi-model comparison at high and low resolution offers insights into which approach is most efficacious, depending on the context.
1. Introduction
Alzheimer's Dementia (AD) is a condition that worsens with time. Gradual decline in language and speech is one of the important cues of AD. It is necessary to find quick, inexpensive, non-invasive and objective tools to detect it, due to the ever-increasing number of dementia patients worldwide. Some previous studies [1,2] have reported a significantly higher number of syllables, lexicon, and difficult words in healthy subjects than in people suffering from Alzheimer's Dementia. [1] also reported differences in vocabulary and word patterns between CN and AD groups. [3] reported that a basic set of acoustic features, such as F0, F1, F2, jitter, shimmer, and mel-frequency cepstral coefficients (MFCC), has the potential to detect physiological changes in voice production. In addition, other speech indicators which are known symptoms of AD are mispronunciation [4], articulation rate [5], higher hesitation ratio [6], impairments of speech rhythm [7], fluency, speech tempo, speech-pause distribution [8,9], and speech-silence patterns, among others. Disfluencies present in speech can be indicative as a possible marker for AD [10,11]. Patients suffering from AD talk slowly, take longer pauses, and take more time to think of the correct words; each of these adds to the disfluency of speech. Thus, language and speech can be useful markers for distinguishing between the healthy control (CN) group and patients suffering from Alzheimer's Dementia (AD) [12,13]. Owing to the popularity of machine learning (ML) tools in the last decade, methods using automatic speech analysis for the screening, early detection and intervention of AD are gaining momentum in aiding clinical diagnosis [13–16]. Capitalizing on language models [17,18] and automatic speech recognition (ASR) methods [19], studies have increasingly applied these tools to speech-based AD detection.
Generally employed approaches for AD prediction use audio features [14,21,25–27], text representations [26–31], or a fusion of both [32]. While [33] relied solely on acoustic features (using Bayesian Networks (BN), Trees-Random Forest (RF), Adaboost (AB) and Meta-Bagging (MB) classifiers) to detect AD with a high accuracy, [34] achieved a high classification accuracy by utilizing only linguistic features (with pre-trained Transformer-based models). Several studies have reported that text-based features perform better than acoustic-based features [3,23,35,36].
Deep Learning (DL) models have grown in popularity in speech recognition research [44], because they allow for the implementation of multiple layers that capture information at different granularity levels, improving prediction accuracy [45]. Using VGGish deep acoustic embeddings on the ADReSSo dataset [3], combined with other feature aggregation methods such as Fisher Vector encodings (FVs) and Bag-of-Audio-Words (BoAW), [40] achieved a high accuracy. On the ADReSSo-2021 dataset (same as ours), [46] attained a high accuracy using Bidirectional Encoder Representations from Transformers (BERT). For automatic AD identification from continuous speech, [47] utilized perplexity characteristics collected from N-gram language models. [48] investigated both language features retrieved from the transcripts and encoded pauses, and [49] applied multi-layered perceptrons (MLP) and recurrent neural networks (RNN) to several types of audio and linguistic characteristics. By combining text data (both word level and phoneme level) with audio features, [38] reported high AD classification accuracy using cross-validation. They demonstrated that text features were more distinct than acoustic features on their own. They also noted that small datasets pose challenges for deep learning text systems. BERT and ConvBERT were used by [1] and reported to have the highest accuracy. Previous studies, however, did not have the opportunity to compare these models on the same dataset. Here, we offer a systematic comparison of multiple feature sets and models for distinguishing Alzheimer's Dementia (AD) from the healthy control (CN) group from a set of speech recordings. This comparison informs readers of the performance of various models and architectures on the same dataset at both the file level and the frame level, which has not been reported before.
2. Experimental Methodology
In our investigation, we use a set of audio recordings in which participants are asked by the interviewer to describe the Cookie Theft picture from the Boston Diagnostic Aphasia Examination. The recordings come from the ADReSSo (Alzheimer's Dementia Recognition through Spontaneous Speech only) dataset [3]. Participants include both cognitively normal (CN) subjects and people who have been diagnosed with Alzheimer's Dementia (AD). The Mini-Mental State Exam (MMSE) score was used to distinguish between CN and AD subjects.
In order to ensure the training quality of our model, we wish to ensure that only data from the subject is used. Hence, all audio associated with the interviewer (including when it overlapped with the subject) was removed, while retaining all other audio segments (such as silence and filler words), as these non-speech segments could still contain useful cues.
To achieve this, segments corresponding to the interviewer were manually removed using Adobe Audition [50]. Various training features were then extracted in both the audio and text domains, at both the frame level and the file level (detailed in Table 1). Text features were extracted from transcriptions of the recorded speech generated using Otter.ai [51], a commercial automatic transcription service.
As speech cues can occur at different levels (individual phonated utterances, or the entire string of utterances across the interview), we analyze the data at two levels: "frame-level" ('granular' or 'frame-by-frame' descriptors) and "file-level" (across the entire interview recording).
Table 1. Feature sets investigated, by modality (audio vs. text) and resolution (file-level vs. frame-level).

Modality | File-level | Frame-level
Audio | eGeMAPS; emobase-Large; emobase; Speech/Silence; Energy-time Plots | openSMILE (Prosody); VGG; openL3
Text | Keg of Text Analytics; Keg of Text Analytics-Extended; Summed Word Embedding; BERT; RoBERTa; distilBERT; XLNet | Word Embedding
File-level (low-resolution) features were extracted from the entire audio files provided (minus the interviewer speech). These include features extracted using various configuration files of the openSMILE library (emobase, emobase-Large, eGeMAPS), and our original features: Speech/Silence statistics and energy-time plots. openSMILE [52] (Speech and Music Interpretation by Large-space Extraction) is a convenient toolkit which provides acoustic feature extraction and is capable of extracting low-level descriptors (LLDs) as well as statistical functionals computed over them.
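For readers who wish to reproduce this step, a minimal sketch using the opensmile Python wrapper is given below. The wrapper bundles the emobase and eGeMAPSv02 sets; emobase-Large corresponds to the emo_large configuration of the full toolkit and is not included in the wrapper. The file path is a placeholder, not an actual file from the dataset.

```python
# Sketch: file-level ("functional") acoustic features with the opensmile wrapper.
import opensmile

# eGeMAPSv02 functionals: 88 features per file
egemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# emobase functionals: 988 features per file
emobase = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

if __name__ == "__main__":
    wav = "segmented/subject_001.wav"        # hypothetical pre-processed file (interviewer removed)
    x_egemaps = egemaps.process_file(wav)    # pandas DataFrame, shape (1, 88)
    x_emobase = emobase.process_file(wav)    # pandas DataFrame, shape (1, 988)
    print(x_egemaps.shape, x_emobase.shape)
```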
• emobase: We extracted 988 acoustic features: (26 low-level descriptors + 26 delta) * 19 functionals [52,53]. The feature set contains the mel-frequency cepstral coefficients (MFCC), voice quality, fundamental frequency (F0), F0 envelope, line spectral pairs (LSP), and intensity features, along with their first- and second-order derivatives and several statistical functions applied to these features, resulting in a total of 988 features for each audio file.
• emobase-Large: 6552 file-level features ((56 low-level descriptors + 56 delta + 56 delta-delta) * 39 functionals) were extracted [52,53]. Local features or low-level descriptors (MFCCs, pitch, energy, voice quality, etc.), their first and second derivatives (i.e., delta and delta-delta) and statistical functionals (global features) were extracted.
• eGeMAPS: 88 file-level features (25 low-level descriptors and their functionals) were extracted using the eGeMAPS [54] configuration in the openSMILE toolkit. The eGeMAPS set includes pitch (F0) in semitones, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index, slope V0 features, and their most common statistical functionals [54].
• Speech/Silence: Pauses were identified in the pre-processed audio using a Voice Activity Detection (VAD) algorithm [55]. Speech segments shorter than 0.3 s and silence segments shorter than 0.2 s are considered false detections and are converted into the opposite type. The feature set consists of nine features. Of the nine, five correspond to the number of pauses, organized into five "pause bins" (pb), each associated with a duration range: <0.5 s (pb1), 0.5–1 s (pb2), 1–2 s (pb3), 2–4 s (pb4), and >4 s (pb5). The sixth feature is the centre of mass of these pause bins (i.e., the number of pauses in a bin multiplied by the bin number), calculated across the whole recording. The remaining three features are based on "sprat", the speech-chunk to pause-chunk ratio: the ratio of the lengths of each pair of a speech chunk and the consequent pause chunk is calculated, and the first, second (median), and third quartiles of this ratio across the whole recording (spratq1, spratq2, spratq3), followed by the average duration of speech segments, were then computed (a sketch of this computation is given after this list). Fig. 1 shows these features for the two groups: the pause-bin counts of the healthy control (CN) group are consistently lower than those of the AD group. As expected, the opposite trend, a higher speech-chunk to pause-chunk ratio, is observed for the CN group.
• Energy-time plots: Two types of images were generated to represent the time-amplitude signal of the segmented audio files: a plot in Cartesian coordinates (x = time, y = absolute value of amplitude) (Fig. 2 (a),(b)), and a polar plot with time mapped to the angle θ and the absolute amplitude values mapped to the radii ρ (Fig. 2 (c),(d)). All values were normalised to a number between 0 and 1, except for θ, which was normalised to an angle between 0 and 2π. The resulting images were then used to train the image-based Convolutional Neural Network (CNN) classifier described below (a plotting sketch is also given after this list).
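The Speech/Silence statistics above can be sketched as follows. This is a rough reading of the description, assuming the VAD output is already available as (start, end, is_speech) segments with the short-segment merging applied upstream; the exact bin and centre-of-mass conventions are our assumptions rather than the implementation used in the study.

```python
# Sketch: nine Speech/Silence features from a list of (start_s, end_s, is_speech) segments.
import numpy as np

PAUSE_EDGES = [0.0, 0.5, 1.0, 2.0, 4.0, np.inf]   # pb1..pb5 boundaries in seconds

def speech_silence_features(segments):
    pauses = [e - s for s, e, speech in segments if not speech]
    # Five pause bins: number of pauses whose duration falls in each range.
    counts, _ = np.histogram(pauses, bins=PAUSE_EDGES)

    # "Centre of mass" of the pause bins: bin index weighted by its count.
    bin_ids = np.arange(1, 6)
    com = float((counts * bin_ids).sum() / max(counts.sum(), 1))

    # Speech-to-following-pause duration ratios ("sprat"), quartiles over the file.
    ratios = []
    for (s0, e0, sp0), (s1, e1, sp1) in zip(segments, segments[1:]):
        if sp0 and not sp1:                       # a speech chunk followed by a pause
            ratios.append((e0 - s0) / max(e1 - s1, 1e-6))
    q1, q2, q3 = (np.percentile(ratios, [25, 50, 75]) if ratios else (0.0, 0.0, 0.0))

    return np.array([*counts, com, q1, q2, q3])   # nine features per recording
```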
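The plotting recipe for the energy-time images maps directly onto a few lines of code. A minimal sketch is given below (using soundfile and matplotlib); the file names, image size and line width are illustrative choices, not the settings used in the study.

```python
# Sketch: Cartesian and polar energy-time images from a mono waveform,
# with |amplitude| and time scaled to [0, 1] and time mapped to [0, 2*pi] for the polar variant.
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

def energy_time_images(wav_path, out_prefix):
    x, sr = sf.read(wav_path)
    if x.ndim > 1:                                    # mix down to mono if needed
        x = x.mean(axis=1)
    amp = np.abs(x) / (np.abs(x).max() + 1e-12)       # |amplitude| in [0, 1]
    t = np.linspace(0.0, 1.0, len(amp))               # time in [0, 1]

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot(t, amp, linewidth=0.3)
    ax.axis("off")
    fig.savefig(f"{out_prefix}_cartesian.png", dpi=150)
    plt.close(fig)

    theta = 2.0 * np.pi * t                           # time -> angle
    fig, ax = plt.subplots(figsize=(3, 3), subplot_kw={"projection": "polar"})
    ax.plot(theta, amp, linewidth=0.3)
    ax.axis("off")
    fig.savefig(f"{out_prefix}_polar.png", dpi=150)
    plt.close(fig)
```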
Frame-level features extracted from audio frames include the VGG and openL3 feature embeddings, and features extracted using another openSMILE configuration file, Prosody:
• VGG: Semantically meaningful feature embeddings were extracted from the audio using VGGish, resulting in a 128-dimensional embedding per frame of the input audio feature (log-mel spectrogram) [56,57].
• openL3: openL3 embeddings were extracted from the audio signal, resulting in a 512-dimensional embedding per frame of the input audio feature (mel-spectrogram) [58] (see the sketch after this list).
• Prosody: Features include the fundamental frequency (F0), voicing probability, and PCM loudness, extracted frame by frame.
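A sketch of the openL3 extraction referred to above is given below; the content type, input representation and hop size are illustrative assumptions, not necessarily the settings used in this study. VGGish extraction is analogous but yields 128-dimensional frames.

```python
# Sketch: frame-level 512-d openL3 embeddings for one pre-processed recording.
import openl3
import soundfile as sf

audio, sr = sf.read("segmented/subject_001.wav")   # hypothetical path
if audio.ndim > 1:
    audio = audio.mean(axis=1)                     # mono

emb, ts = openl3.get_audio_embedding(
    audio, sr,
    content_type="env",        # environmental-sound model
    input_repr="mel128",       # mel-spectrogram front end
    embedding_size=512,
    hop_size=0.5,              # one embedding every 0.5 s
)
print(emb.shape)               # (n_frames, 512): sequence input for GMM-UBM / BiLSTM models
```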
Automatic transcription was performed for the manually segmented audio signal using the Dropbox integration with Otter.ai [51]. From the generated transcripts, text features were extracted. Unfortunately, the automated transcription excluded instances of hesitation such as uhm, errr, etc., which may in fact contain useful cues for classification.
Figure 2. Typical energy-time plots: Cartesian representations for (a) healthy control (CN) and (b) Alzheimer's Dementia (AD), and polar representations for (c) CN and (d) AD, showing rhythmic/metrical structure as different spatial features.
We also note that transcription accuracy may be compromised by the audio quality; manual transcription would be required to assess the accuracy of the automated Otter.ai transcripts. Similar to the audio feature extraction, both file-level and frame-level features were extracted from the transcribed text.
Linguistic metrics were extracted from the automatically transcribed text. Features from the BERT family (BERT, distilBERT, RoBERTa) and XLNet were extracted using the HuggingFace transformers [59] library. Keg of Text Analytics is an original feature set.
• Keg of Text Analytics: The file-level text features extracted using Matlab's Text Analytics Toolbox include: the total number of words (Wt); the total number of unique words (Wu); unique words normalised (Wu/Wt) (looking out for repeated words or repetitive use of simple words); the speech rate in words per second (Wt/t); the number of words which are not 'stop words' (Wt − Ws) ('stop words' are words which can be omitted without losing the meaning of the sentence, e.g. 'a', 'the', 'to', 'and'); the ratio of the number of words with ≥4, ≥5 and ≥6 letters to the number of unique words (although the correlation between longer words and complexity is not constant in English, it aids in filtering out short words); the number of nouns, pronouns, adjectives, verbs, adverbs, conjunctions and determiners; and, lastly, a binary representation of the presence of the word "cookie" (included because many dementia subjects struggle to use the word "cookie" when describing the "Cookie Theft" picture). A rough sketch of these counts is given after this subsection. Figure 3 shows a typical parts-of-speech comparison for AD and CN. Further, a word cloud of 45 randomly selected AD and CN subjects from training is shown in Figure 4 (words with fewer than four characters were ignored in these charts). The difference in word count between CN and AD is obvious (seen in both Figure 3 and Figure 4): the CN group has a larger vocabulary bank than the AD group. Furthermore, CN subjects produce longer words more frequently than AD subjects.
Two sets of file-level text features were investigated: Keg of Text Analytics and Keg of Text Analytics-Extended. In the former, a total of 12 features were used (Wt, Wu, the ratios of the number of words with ≥4, ≥5 and ≥6 letters to the number of unique words, Wt − Ws, Wt/t, and the numbers of pronouns, nouns, adverbs, adjectives and auxiliary verbs), and the latter is a super-set with 18 features (containing the former 12 plus six additional features).
Figure 3. A typical parts-of-speech comparison, extracted as part of Keg of Text Analytics. Differences
between Alzheimer’s Dementia (AD) and Normal Control (CN) can be observed, particularly for
nouns.
Figure 4. Word cloud of transcribed speech from (a) Alzheimer’s Dementia (AD) and (b) Healthy
Control (CN) groups.
• Summed Word Embedding: Words from the transcripts are embedded into a vector space model using fastText pre-trained word embeddings for word representation and sentence classification [60,61]. The hyperspace dimension of this embedding is 300, resulting in a vector of 300 elements for each word. The resulting vectors were then summed to produce an overall representation of the transcription. Although the implication of this sum is not immediately intuitive, our results suggest that, at least for this dataset, it carries useful discriminative information (an extraction sketch covering this and the BERT-family embeddings is given after this subsection).
• BERT: The text transcriptions were used to generate feature embedding vectors from the pre-trained BERT (Bidirectional Encoder Representations from Transformers) [18] model. BERT is pre-trained as a language representation over a large amount of unlabelled textual data and employs the Transformer attention mechanism to learn a deeper sense of linguistic context.
• RoBERTa: RoBERTa (a Robustly optimized BERT approach) [17] uses dynamic masking (unlike BERT's static masking), which increases data variability (by augmentation) and helps the network learn more robust features.
• distilBERT: DistilBERT is a cheaper, smaller version of BERT [62], which is 60% faster while retaining most of BERT's language-understanding performance.
• XLNet: XLNet [63] uses an autoregressive pretraining method, unlike BERT's autoencoding (masked language modelling) approach.
• Word Embedding: This feature extraction is similar to the Summed Word Embedding extraction, except that the vectors are not summed; instead, the 300-element feature representation of each word is retained, giving a frame-level (word-by-word) sequence of vectors.
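A rough Python analogue of a subset of the Keg of Text Analytics counts is sketched below. The study used Matlab's Text Analytics Toolbox; the tokenizer and the abbreviated stop-word list here are simplified stand-ins, and the part-of-speech counts are omitted.

```python
# Sketch: a few "Keg of Text Analytics"-style counts from one transcript.
import re

STOP_WORDS = {"a", "an", "the", "to", "and", "of", "in", "is", "it", "that"}  # illustrative subset

def keg_features(transcript: str, duration_s: float) -> dict:
    words = re.findall(r"[a-z']+", transcript.lower())
    wt = len(words)                                   # total words (Wt)
    wu = len(set(words))                              # unique words (Wu)
    feats = {
        "Wt": wt,
        "Wu": wu,
        "Wu_over_Wt": wu / wt if wt else 0.0,
        "words_per_sec": wt / duration_s if duration_s else 0.0,
        "non_stop_words": sum(w not in STOP_WORDS for w in words),
        "has_cookie": int("cookie" in words),
    }
    for n in (4, 5, 6):                               # ratio of >=n-letter words to unique words
        feats[f"ge{n}_over_Wu"] = sum(len(w) >= n for w in words) / wu if wu else 0.0
    return feats

print(keg_features("the boy is stealing a cookie from the cookie jar", duration_s=5.0))
```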
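The two kinds of file-level text embeddings can be sketched as follows. Mean pooling over the last hidden state and the public cc.en.300.bin fastText model are assumptions of this sketch, not necessarily the exact recipe used in the study.

```python
# Sketch: (a) summed 300-d fastText word vectors; (b) a pooled RoBERTa embedding.
import numpy as np
import fasttext
import torch
from transformers import AutoTokenizer, AutoModel

def summed_fasttext(transcript: str, ft_model) -> np.ndarray:
    vecs = [ft_model.get_word_vector(w) for w in transcript.lower().split()]
    return np.sum(vecs, axis=0) if vecs else np.zeros(300)

def roberta_embedding(transcript: str, tok, model) -> np.ndarray:
    enc = tok(transcript, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()       # pooled 768-d file-level vector

if __name__ == "__main__":
    text = "the boy is stealing a cookie from the jar"
    ft = fasttext.load_model("cc.en.300.bin")          # pre-trained English vectors from fasttext.cc
    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base").eval()
    print(summed_fasttext(text, ft).shape, roberta_embedding(text, tok, model).shape)
```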
This study looked into a variety of machine learning and deep learning models. Table 2 shows the investigated models for the sixteen feature sets. The choice of models was motivated by the success of these algorithms in supporting fast prototyping when working with various frame-level or file-level features. Various combinations of classifiers and their hyper-parameters were fine-tuned for every feature set, and the resulting training losses were compared. Those with the lowest training loss for a particular feature are reported here. Voting was also used to combine various ensemble models, including bagging, gradient boost, random forest, and adaboost classifiers, because the resulting combination outperformed the individual classifiers. For frame-level features, where long-term dependencies and dynamics are present in the sequences of data captured at the frame level, GMM-UBM and Bi-LSTM were used. The GMM-UBM uses a mixture of 512 Gaussians, and the BiLSTM models chosen were only one layer deep with 100 hidden units, except for openL3, where the model was two layers deep. For the Deep Neural Networks used for modelling the various text embeddings, a grid search was applied and hyperparameter optimization was performed to decide on a dense (fully-connected) layer network with either ReLU or Softmax activations. A Convolutional Neural Network (CNN), a traditional approach for handling images, was chosen with four convolution layers of 3×3 kernels with max pooling.
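A minimal scikit-learn sketch of the soft-voting ensemble named above is given below; the individual hyperparameters are illustrative placeholders rather than the tuned values, and X/y denote a file-level feature matrix and its CN/AD labels.

```python
# Sketch: soft-voting ensemble over bagging, gradient boosting, random forest and AdaBoost.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

voting = VotingClassifier(
    estimators=[
        ("bag", BaggingClassifier(n_estimators=100, random_state=0)),
        ("gb",  GradientBoostingClassifier(random_state=0)),
        ("rf",  RandomForestClassifier(n_estimators=300, random_state=0)),
        ("ab",  AdaBoostClassifier(random_state=0)),
    ],
    voting="soft",                       # combine predicted class probabilities
)
clf = make_pipeline(StandardScaler(), voting)
# clf.fit(X_train, y_train); accuracy = clf.score(X_test, y_test)
```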
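For the frame-level sequences, the single-layer BiLSTM with 100 hidden units described above can be sketched as follows (Keras); the masking of zero-padded frames and the sigmoid output head are assumptions of this sketch.

```python
# Sketch: one-layer BiLSTM (100 units) over variable-length frame sequences.
import tensorflow as tf

def build_bilstm(n_dims: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, n_dims)),              # (n_frames, n_dims) per file
        tf.keras.layers.Masking(mask_value=0.0),           # ignore zero-padded frames
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # AD vs CN
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_bilstm(n_dims=512)        # e.g. openL3 embeddings
model.summary()
```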
Table 2. Training features and models investigated in this work. AD: Alzheimer's Dementia; CN: Normal Control.
The ADReSSo-2021 dataset [3] contains subjects from a healthy control group (the CN cohort) and subjects with cognitive decline (assigned to the AD cohort). 237 audio recordings were made available, balanced for age and gender (to avoid biases). 166 of these were used for training, and the models were tested against a mutually exclusive set of 71 audio recordings. The numbers of audio recordings in the AD training and testing groups were 87 and 35, respectively. The CN training and testing groups contained 79 and 36 recordings, respectively.
The results for the AD classification task are reported in terms of accuracy: Accuracy = (True Positive + True Negative)/(True Positive + True Negative + False Positive + False Negative).
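For reference, the metric and a quick arithmetic check against the best model (the conversion to a count of correct decisions is our arithmetic, not stated explicitly in the text) are:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
0.887 \times 71 \approx 63 \ \text{correct decisions out of 71 test recordings.}
\]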
3. Results
All our 16 models performed well above chance (see Table 3), with the poorest performance arising from Speech/Silence at 66.2% accuracy. On the other hand, the best-performing model was RoBERTa, at an accuracy of 88.7%, followed by distilBERT at 85.9% and BERT at 84.5%. Looking at the confusion matrix (Figure 5) for RoBERTa, 29 out of 36 CN cases and 34 out of 35 AD cases were correctly classified. A comparison between file-level text features and file-level audio features reveals that the file-level text features perform better on average. This observation holds on a broader scale as well (overall text vs. overall audio): 78.87±7.09% vs. 73.23±4.95%, respectively, on average. This agrees with previous observations [3,23,35,36,38] that text features contain more distinguishing cues than audio features.
Regarding automated transcription, whose accuracy depends on the overall intelligibility of the recorded utterances, CN cases naturally transcribe more richly (due to increased verbosity and word variety) and more faithfully (due to better speech clarity) than AD cases. Transcriptions that yield meaningful strings of words (or not) will produce contrasting profiles in the text feature analysis (see Fig. 3), facilitating classification between AD and CN. Accordingly, we indeed observe our best performances, at 88.7%, 85.9% and 84.5%, when using BERT-related features. Among the other text feature sets, the Keg of Text Analytics and Keg of Text Analytics-Extended both resulted in identical accuracies of 76.1%, a drop in performance compared to the BERT-related features (associated with more AD misclassifications). Finally, the Summed Word Embedding resulted in greater AD and CN misclassification, with a lower accuracy (73.2%) in comparison with the Keg of Text Analytics features.
Frame-level audio features (77.0±3.29%) performed better than file-level audio features (70.98±4.52%), suggesting that extended audio recordings need not be aggregated; rather, brief speech audio samples representing the intrinsic mechanics and dynamics of speaker idiosyncrasies may be preferred, being also easier to collect and process. Among the various file-level audio features explored, however, emobase and emobase-Large yielded accuracies of 70.4% and 77.5%, respectively. In this case, emobase-Large (consisting of a larger feature set) resulted in lower misclassification for both AD and CN. eGeMAPS, on the other hand, resulted in 67.6% accuracy, which is associated with greater AD and CN misclassification.
The best-performing text-based feature at the file level (RoBERTa) performed 9.8% better than the best at the frame level (Word Embedding). Conversely, the best audio-based feature at the frame level (VGG) performed 1.4% better than the best file-level audio feature (emobase-Large). Extracting deep features using VGG's deep neural networks at the frame level could be a reason for its higher performance compared with the conventional acoustic features (low-level descriptors and functionals) of extractors such as emobase-Large.
Our four proposed original feature extraction methods (Speech/Silence, Energy-time Plot, Keg of Text Analytics, and Keg of Text Analytics-Extended) achieved accuracies of 66.2%, 73.2%, 76.1% and 76.1%, respectively. These ab initio results may currently be lower than those of the well-established features, but with further exploration we may expect overall improvements.
The pre-trained BERT, distilBERT, RoBERTa and XLNet models used in this study achieved accuracies of 84.5%, 85.9%, 88.7% and 67.6%, respectively (Table 3). These pre-trained models performed feature extraction by first processing the transcribed texts, followed by hyper-parameter optimization of the Deep Neural Network (DNN) models. Appendix A details how the DNN models for BERT, distilBERT, RoBERTa and XLNet were optimized.
Table 4 compares the best results of this study to state-of-the-art models (at the time of writing) from other AD classification studies based on the same ADReSSo-2021 dataset, surpassing them in accuracy by 1.2%. Consistent with earlier reports of successful pre-trained BERT models in AD classification, in this study RoBERTa achieved the best accuracy of 88.7%.
Table 3. Training features and their associated performance are shown. Accuracy is presented
as a percentage reflecting the total correct predictions out of total predictions. Accuracy = (True
Positive+True Negative)/(True Positive+True Negative+False Positive+False Negative). AD class is
positive. AD: Alzheimer’s Dementia; CN: Normal Control.
Figure 5. Confusion matrix (0 = CN, 1 = AD) for the best-performing model, RoBERTa.
4. Discussion
This study presents a systematic comparison of various approaches and methods, and thus allows us some insight into the speech information that should be considered depending on the speech sampling context. If the audio quality is amenable to faithful transcription (low background noise, good speaker clarity, monolingual and well-documented accents), our study shows that file-level text is most effective, with the BERT family performing rather satisfactorily at 86.26±2.27%. However, care must be taken with file-level text, as the XLNet feature set (67.6%) offered the second-lowest accuracy among the sixteen feature sets studied.
In contrast, if the audio quality is good (high signal-to-noise ratio) but the speaker does not enunciate clearly, or perhaps speaks in a way not compatible with automated text transcription (e.g. mixed languages, a poorly documented accent), the results guide us to suggest that frame-level analysis is preferred. It can be expected that the phonatory dynamics captured in short time frames contain enough inherent information about the speaker's psycho-motor health, thus obviating the need to analyse an extended duration of speech audio. Further, such an approach, relying on audio feature sets alone, has the added benefit of not being limited to a particular carrier language, since it should reflect the speaker's cognitive/physiological state without the extra burden of a transcription process pigeon-holed into a particular carrier language (which may introduce additional errors).
Frame-level features, such as sub-word or phoneme-level representations, are suitable for training on small datasets and have been shown to yield higher dementia classification accuracy than file-level features (such as word- or sentence-level representations) [38]. On the other hand, file-level features have the benefit of larger temporal samples and increased opportunities to capture longer-range context.
Overall, the text modality performs better on average (file-level + frame-level) than audio. This may suggest that, with the onset of AD, cognitive decline affects linguistic capacity more sensitively, and may be apparent earlier than physiological decline. Given that factors such as semantic distance, lexical resource and grammar structure may become simpler as cognitive decline progresses, text-based methods could be more useful and provide greater insight into differentiating AD from CN, resulting in higher detection accuracy. However, text-based approaches require faithful and robust transcription processing as an additional step, complicating the analysis. Furthermore, many transcription services currently available automatically omit non-semantic utterances (such as mmm, uh, um), which may in fact contain useful cues [64].
Thus, although the results in the current study indicate that text-based methods outperform audio-based methods, the successful implementation of text-based methods depends sensitively on the context in question (such as the carrier language and the choice and availability of reliable transcription, among others). Audio-based methods, on the other hand, offer greater flexibility and less sensitivity to the context: as long as a good signal-to-noise ratio is achieved in recording the speaker's utterances, analysis can still proceed with some reliability, agnostic to the carrier language(s) involved. Lastly, depending again on the audio context and linguistic requirements, a simple voting- and/or fusion-model can then be implemented to reconcile the multi-modal, multi-model, multi-feature approach and yield an optimal AD/CN classification outcome. This approach, however, was not tested in our investigation due to the scope of the present study.
Thus, although text generally gives better accuracy than audio, it is limited by the accuracy of transcription services. Audio features, on the other hand, are free of this constraint.
5. Conclusions
An approach for assessing audio recordings of spontaneous speech utterances, elicited by a picture-description task, for the detection of Alzheimer's Dementia was presented. The study analysed the effectiveness of various features extracted from the audio recordings and from text derived from the audio, employing 16 different models to assess their relative performance. All models trained on the various distinct feature sets demonstrated accuracy above chance levels. Overall, text-based features extracted from the transcribed audio outperformed the audio-based features.
Table 4. Comparison with other studies reporting from the ADReSSo-2021 dataset.
The best performance was obtained by modelling the RoBERTa text embeddings, which attained an accuracy of 88.7%, achieving near-perfect classification for AD by identifying 34 out of 35 cases. This represents a 1.2% improvement over other state-of-the-art models for AD classification trained on the same dataset.
When comparing the level of granularity (resolution) in the feature extraction process, frame-level audio features generally outperform their file-level counterparts. However, as the number of features increases, such as with emobase-Large, the results become more comparable. This suggests that a sufficient number of low-level descriptors can effectively represent both the AD and CN classes. In contrast, while the file-level text embeddings produced higher accuracy than the frame-level word embeddings, this trend cannot be generalized, since only one model was tested at the latter level.
The four original feature extraction methods we proposed, namely Speech/Silence, Energy-time Plot, Keg of Text Analytics, and Keg of Text Analytics-Extended, demonstrated reasonable accuracy. While their performance may not yet match off-the-shelf feature sets, additional exploration and fine-tuning could lead to further improvements.
This investigation suggests that the transcribed textual data can produce meaningful word sequences that lead to contrasting results in text feature analysis. This is particularly helpful in classifying Alzheimer's disease (AD) and cognitively normal (CN) subjects, as factors such as semantic distance, vocabulary usage, and grammar structures tend to be simpler for dementia patients. While transcribing English audio datasets is relatively straightforward, transcription resources for other languages may not be as easily available nor as mature and robust, making the text-based approach sensitive to the linguistic context of the target speaker. On the other hand, an audio-based approach is not constrained by the carrier language, making it more generalizable and useful for developing more universal AD classification models, as long as a good signal-to-noise ratio of the speech utterances is captured. In practical terms, however, it is expected that the optimal solution is to implement a combination of both approaches to best reconcile differences arising from the context.
This study provides insight into the effectiveness of machine learning techniques trained on both textual and audio data for a given dataset, which is an improvement over previous studies that focused on only one (or one of the many) aspect(s) of the feature space or model selection. The implications of this research are particularly relevant as they may assist clinicians and caregivers in detecting early-stage biomarkers of Alzheimer's disease (AD). Significantly, this can be accomplished using non-invasive and cost-effective approaches, which could be of great benefit to patients and their families.
References
1. Ilias, L.; Askounis, D. Explainable identification of dementia from transcripts using transformer networks. IEEE Journal of Biomedical and Health Informatics 2022, 26, 4153–4164.
2. Forbes-McKay, K.E.; Venneri, A. Detecting subtle spontaneous language decline in early Alzheimer's disease with a picture description task. Neurological sciences 2005, 26, 243–254.
3. Luz, S.; Haider, F.; de la Fuente, S.; Fromm, D.; MacWhinney, B. Detecting cognitive decline using speech only: The adresso challenge. arXiv preprint arXiv:2104.09356 2021.
4. Orange, J.B.; Lubinski, R.B.; Higginbotham, D.J. Conversational repair by individuals with dementia of the Alzheimer's type. Journal of Speech, Language, and Hearing Research 1996,
5. Szatloczki, G.; Hoffmann, I.; Vincze, V.; Kalman, J.; Pakaski, M. Speaking in Alzheimer's disease, is that an early sign? Importance of changes in language abilities in Alzheimer's
6. Hoffmann, I.; Nemeth, D.; Dye, C.D.; Pákáski, M.; Irinyi, T.; Kálmán, J. Temporal parameters of
7. Martínez-Sánchez, F.; Meilán, J.J.; Vera-Ferrandiz, J.A.; Carro, J.; Pujante-Valverde, I.M.; Ivanova, O.; Carcavilla, N. Speech rhythm alterations in Spanish-speaking individuals with Alzheimer's
8. Pastoriza-Domínguez, P.; Torre, I.G.; Diéguez-Vide, F.; Gomez-Ruiz, I.; Gelado, S.; Bello-López, J.; Ávila-Rivera, A.; Matias-Guiu, J.A.; Pytel, V.; Hernández-Fernández, A. Speech pause distribution as an early marker for Alzheimer's disease. Speech Communication 2022, 136, 107–117.
9. Yuan, J.; Cai, X.; Bian, Y.; Ye, Z.; Church, K. Pauses for detection of Alzheimer's disease. Frontiers
10. Rohanian, M.; Hough, J.; Purver, M. Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer's dementia recognition from spontaneous speech. arXiv
11. López-de Ipiña, K.; Alonso, J.B.; Travieso, C.M.; Solé-Casals, J.; Egiraun, H.; Faundez-Zanuy, M.; Ezeiza, A.; Barroso, N.; Ecay-Torres, M.; Martinez-Lage, P.; et al. On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors
12. König, A.; Satt, A.; Sorin, A.; Hoory, R.; Toledo-Ronen, O.; Derreumaux, A.; Manera, V.; Verhey, F.; Aalten, P.; Robert, P.H.; et al. Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease. Alzheimer's & Dementia: Diagnosis, Assessment & Disease
13. Pulido, M.L.B.; Hernández, J.B.A.; Ballester, M.Á.F.; González, C.M.T.; Mekyska, J.; Smékal, Z. Alzheimer's disease and automatic speech analysis: A review. Expert systems with applications
14. Meilán, J.J.; Martínez-Sánchez, F.; Carro, J.; Sánchez, J.A.; Pérez, E. Acoustic markers associated with impairment in language processing in Alzheimer's disease. The Spanish journal of psychology
15. Hason, L.; Krishnan, S. Spontaneous speech feature analysis for alzheimer's disease screening
16. Yang, Q.; Li, X.; Ding, X.; Xu, F.; Ling, Z. Deep learning-based speech analysis for Alzheimer's disease detection: a literature review. Alzheimer's Research & Therapy 2022, 14, 1–16.
17. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint
18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers
19. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 2020,
20. López-de Ipiña, K.; Alonso, J.B.; Barroso, N.; Faundez-Zanuy, M.; Ecay, M.; Solé-Casals, J.; Travieso, C.M.; Estanga, A.; Ezeiza, A. New approaches for Alzheimer's disease diagnosis based on automatic spontaneous speech analysis and emotional temperature. In Proceedings of the International Workshop on Ambient Assisted Living. Springer, 2012, pp. 407–414.
21. Lopez-de Ipiña, K.; Alonso, J.B.; Solé-Casals, J.; Barroso, N.; Faundez-Zanuy, M.; Ecay-Torres, M.; Travieso, C.M.; Ezeiza, A.; Estanga, A.; et al. Alzheimer disease diagnosis based on automatic
22. Qiu, S.; Miller, M.I.; Joshi, P.S.; Lee, J.C.; Xue, C.; Ni, Y.; Wang, Y.; De Anda-Duran, I.; Hwang, P.H.; Cramer, J.A.; et al. Multimodal deep learning for Alzheimer's disease dementia assessment.
23. Pappagari, R.; Cho, J.; Moro-Velazquez, L.; Dehak, N. Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer's Disease and Assess its
24. Haulcy, R.; Glass, J. Classifying Alzheimer's disease using audio and text-based representations
25. Balagopalan, A.; Novikova, J. Comparing acoustic-based approaches for alzheimer's disease
26. Kato, S.; Homma, A.; Sakuma, T. Easy screening for mild alzheimer's disease and mild cognitive impairment from elderly speech. Current Alzheimer Research 2018, 15, 104–110.
27. Khodabakhsh, A.; Demiroglu, C. Analysis of speech-based measures for detecting and monitoring Alzheimer's disease. In Data Mining in Clinical Medicine; Springer, 2015; pp. 159–173.
28. Ahangar, A.A.; Jafarzadeh Fadaki, S.M.; Sehhati, A. The Comparison of Morpho-Syntactic Patterns Device Comprehension in Speech of Alzheimer and Normal Elderly People. Zahedan
29. Beltrami, D.; Calzà, L.; Gagliardi, G.; Ghidoni, E.; Marcello, N.; Favretti, R.R.; Tamburini, F. Automatic identification of mild cognitive impairment through the analysis of Italian spontaneous speech productions. In Proceedings of the Proceedings of the Tenth International Conference
30. Fraser, K.C.; Meltzer, J.A.; Graham, N.L.; Leonard, C.; Hirst, G.; Black, S.E.; Rochon, E. Automated classification of primary progressive aphasia subtypes from narrative speech transcripts.
31. Sanz, C.; Carrillo, F.; Slachevsky, A.; Forno, G.; Gorno Tempini, M.L.; Villagra, R.; Ibáñez, A.; Tagliazucchi, E.; García, A.M. Automated text-level semantic markers of Alzheimer's disease. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 2022, 14, e12276.
32. Ying, Y.; Yang, T.; Zhou, H. Multimodal fusion for alzheimer's disease recognition. Applied
33. Al-Hameed, S.; Benaissa, M.; Christensen, H. Simple and robust audio-based detection of biomarkers for Alzheimer's disease. In Proceedings of the 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT), 2016, pp. 32–36.
34. Searle, T.; Ibrahim, Z.; Dobson, R. Comparing Natural Language Processing Techniques for Alzheimer's Dementia Prediction in Spontaneous Speech. arXiv preprint arXiv:2006.07358 2020.
35. Farzana, S.; Parde, N. Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues. In
36. Clarke, C.J.; Melechovsky, J.; Lin, C.M.Y.; Priyadarshinee, P.; Balamurali, B.; Chen, J.M.; Kapoor,
37. Koo, J.; Lee, J.H.; Pyo, J.; Jo, Y.; Lee, K. Exploiting Multi-Modal Features From Pre-trained Networks for Alzheimer's Dementia Recognition. arXiv preprint arXiv:2009.04070 2020.
38. Edwards, E.; Dognin, C.; Bollepalli, B.; Singh, M.K.; Analytics, V. Multiscale System for
39. Gosztolya, G.; Vincze, V.; Tóth, L.; Pákáski, M.; Kálmán, J.; Hoffmann, I. Identifying mild cognitive impairment and mild Alzheimer's disease based on spontaneous speech using ASR and linguistic features. Computer Speech & Language 2019, 53, 181–197.
40. Syed, M.S.S.; Syed, Z.S.; Lech, M.; Pirogova, E. Automated Screening for Alzheimer's Dementia through Spontaneous Speech. INTERSPEECH (to appear) 2020, pp. 1–5.
41. Syed, Z.S.; Syed, M.S.S.; Lech, M.; Pirogova, E. Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer's Dementia Recognition from Spontaneous Speech. In
42. Chen, J.; Ye, J.; Tang, F.; Zhou, J. Automatic detection of alzheimer's disease using spontaneous speech only. In Proceedings of the Interspeech. NIH Public Access, 2021, Vol. 2021, p. 3830.
43. Pappagari, R.; Cho, J.; Joshi, S.; Moro-Velázquez, L.; Zelasko, P.; Villalba, J.; Dehak, N. Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in
44. Kim, Y.; Lee, H.; Provost, E.M. Deep learning for robust feature generation in audiovisual emotion recognition. In Proceedings of the 2013 IEEE international conference on acoustics,
45. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. nature 2015, 521, 436–444.
46. Balagopalan, A.; Eyre, B.; Rudzicz, F.; Novikova, J. To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection. arXiv preprint
47. Guo, Z.; Ling, Z.; Li, Y. Detecting Alzheimer's disease from continuous speech using language
48. Yuan, J.; Bian, Y.; Cai, X.; Huang, J.; Ye, Z.; Church, K. Disfluencies and Fine-Tuning Pre-trained Language Models for Detection of Alzheimer's Disease. Proc. Interspeech 2020 2020, pp. 2162–2166.
49. Sarawgi, U.; Zulfikar, W.; Soliman, N.; Maes, P. Multimodal Inductive Transfer Learning for Detection of Alzheimer's Dementia and its Severity. arXiv preprint arXiv:2009.00700 2020.
52. Eyben, F.; Wöllmer, M.; Schuller, B. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the Proceedings of the 18th ACM international conference
53. Parlak, C.; Diri, B. Emotion recognition from the human voice. In Proceedings of the 2013 21st Signal Processing and Communications Applications Conference (SIU). IEEE, 2013, pp. 1–4.
54. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE transactions on affective computing 2015, 7, 190–202.
55. Brookes, M.; et al. Voicebox: Speech processing toolbox for matlab. Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 1997, 47.
56. Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
57. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing
58. Cramer, J.; Wu, H.H.; Salamon, J.; Bello, J.P. Look, listen, and learn more: Design choices for deep audio embeddings. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3852–3856.
60. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning word vectors for 157
61. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification.
62. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: smaller,
63. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing
64. Davis, B.H.; Maclagan, M.A. UH as a pragmatic marker in dementia discourse. Journal of
65. Meghanani, A.; Anoop, C.; Ramakrishnan, A.G. Recognition of alzheimer's dementia from the transcriptions of spontaneous speech using fastText and cnn models. Frontiers in Computer
66. Rohanian, M.; Hough, J.; Purver, M. Alzheimer's dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs. arXiv preprint arXiv:2106.15684 2021.
67. Pan, Y.; Mirheidari, B.; Harris, J.M.; Thompson, J.C.; Jones, M.; Snowden, J.S.; Blackburn, D.; Christensen, H. Using the Outputs of Different Automatic Speech Recognition Paradigms for
68. Qiao, Y.; Yin, X.; Wiechmann, D.; Kerz, E. Alzheimer's Disease Detection from Spontaneous Speech through Combining Linguistic Complexity and (Dis) Fluency Features with Pretrained
69. Zhu, Y.; Obyat, A.; Liang, X.; Batsis, J.A.; Roth, R.M. WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection. In Proceedings of the
Author Contributions: Conceptualization, P.P., B.BT., C.J.C., and JM.C.; methodology, P.P., B.BT., C.J.C., JM.C., C.M.Y.L., and J.M.; software, P.P., B.BT., C.J.C., C.M.Y.L., and J.M.; validation, P.P., B.BT., C.J.C., and JM.C.; investigation, P.P., B.BT., C.J.C., JM.C., C.M.Y.L., and J.M.; data curation, P.P., B.BT., C.J.C., and JM.C.; writing - original draft preparation, P.P., B.BT., C.J.C., JM.C. and J.M.; writing - review and editing, P.P., B.BT., C.J.C., and JM.C.; visualization, P.P., B.BT. and C.J.C.; supervision,
Funding: This research was funded by SUTD Growth Plan (SGP) Grant to Healthcare Sector (PIE-SGP-HC-2019-01).
Data Availability Statement: The dataset is free and made available by the Dementia Bank Pitt Corpus: https://luzs.gitlab.io/adresso-2021/
Abbreviations
The following abbreviations are used in this manuscript:
AD Alzheimer’s Dementia
CN Normal Control
MFCC Mel-Frequency Cepstral Coefficients
ML Machine Learning
DL Deep Learning
NLP Natural Language Processing
ASR Automatic Speech Recognition
MLP Multi-Layered Perceptrons
BERT Bidirectional Encoder Representations from Transformers
MMSE Mini-Mental State Exam
Appendix A
BERT*: The DNN model used was four layers deep. The numbers of units in the dense first, second, third and fourth layers were 103, 76, 65, and 105, respectively. The associated activation functions were Softmax, ReLU, Softmax and ReLU, respectively. The Adam optimizer was used.
RoBERTa*: The DNN model used was three layers deep. The numbers of units in the dense first, second, and third layers were 65, 154, and 224, respectively. The activation functions used for these layers were Softmax, ReLU, and Softmax, respectively. The Adam optimizer was used.
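As a concrete illustration, the RoBERTa head described above can be written as the following Keras sketch; the 768-dimensional input, the sigmoid output layer, and the default Adam learning rate are assumptions of this sketch, while the hidden-layer sizes and activations follow the text.

```python
# Sketch: dense classification head for pooled RoBERTa embeddings (Appendix A description).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(768,)),                      # pooled RoBERTa embedding (assumed size)
    tf.keras.layers.Dense(65,  activation="softmax"),
    tf.keras.layers.Dense(154, activation="relu"),
    tf.keras.layers.Dense(224, activation="softmax"),
    tf.keras.layers.Dense(1,   activation="sigmoid"),  # AD vs CN
])
model.compile(optimizer=tf.keras.optimizers.Adam(),   # tuned learning rate not specified here
              loss="binary_crossentropy", metrics=["accuracy"])
```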
DistilBERT*: The DNN model used for DistilBERT was five layers deep. The numbers of units in the dense first, second, third, fourth and fifth layers were 120, 29, 213, 98 and 152, respectively. The activation functions were ReLU, Softmax, Softmax, Softmax and ReLU, respectively. The Adam optimizer with a learning rate of 0.075 was used.
XLNet*: The DNN model used was three layers deep, and the numbers of units in the dense first, second and third layers were 103, 137 and 73, respectively. The activation functions used for these three layers were Softmax, ReLU, and ReLU, respectively.
Speech/Silence*: A 20-fold cross-validation and a grid search were conducted across learning rates and numbers of PCA components. The final model's learning rate was 0.05 and its PCA dimension was six. The classifier used was CatBoost.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.