Teixeira 2004
This thesis was submitted in fulfilment of the requirements for the degree of Doctor in Electrotechnical and Computer Engineering
(Engenharia Electrotécnica e de Computadores)
Supervisor:
May 2004
Jury:
Key words: TTS systems, speech synthesis, prosody, intonation, timing, F0, modeling, European
Portuguese.
Resumo
This work presents the development of a prosody system for European Portuguese (EP) for application in text-to-speech (TTS) systems. Basically, these systems read a written text automatically and consist of a sequence of several modules. These modules implement the pre-processing of the input text, the phonetic transcription and the suprasegmental processing, which consists in the introduction of prosodic patterns. The prosodic features are responsible for marking a communicative intention and for giving naturalness to the way the text is read. These features consist in the imposition of a rhythm, characterised by the segmental durations and pauses, of an intonation, described by a fundamental frequency (F0) curve, and of the intensity curve.
The so-called preparatory works, which were fundamental for the study and development of the system, are presented first. They begin with a preliminary study of the tonic syllable, identifying the ranges of variation of the parameters F0, duration and intensity in the tonic syllable in several contexts. Then the FEUP-IPB DB speech database used in the subsequent studies is presented. This EP speech database is labelled at the phoneme, word, sentence and F0 levels. Next, two syllabification algorithms are presented, one for written text and one for phoneme sequences. This chapter ends with the proposal of a set of rules to perform automatically the phonetic transcription of the most problematic graphemes in EP.
The proposed prosody model consists of several sub-models, namely a segmental duration model, which predicts the durations of the segments, and a model for the prediction of the F0 contour.
Two alternatives, based on artificial neural networks (ANNs), are proposed for the prediction of segmental durations.
The first proposal consists of an ANN carefully selected with regard to its architecture and type, as well as the set of features used in the input vector, always with the objective of minimising the error between the predicted and the measured durations. The second proposed model, called the alternative model, is based on the same assumptions as the first, but with one ANN dedicated to the prediction of the duration of each phoneme, in a total of 44 ANNs. This model achieved better results than the previous one.
A model for the insertion and prediction of pause durations, based on a preliminary study of the database used, is also proposed.
The proposed model for the prediction of the F0 contour is based on the Fujisaki model and is divided into two sub-models: one for the prediction of the parameters of the phrase commands (PCs) and another for the prediction of the parameters of the accent commands (ACs).
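For orientation, the Fujisaki model referred to here can be written, in its standard formulation from the literature (a sketch for reference; the thesis's own notation and parameter values may differ slightly), as a superposition of phrase and accent components on a logarithmic F0 scale:

```latex
\ln F_0(t) \;=\; \ln F_b
 \;+\; \sum_{i=1}^{I} A_{p,i}\, G_p\!\bigl(t - T_{0,i}\bigr)
 \;+\; \sum_{j=1}^{J} A_{a,j}\,\Bigl[ G_a\!\bigl(t - T_{1,j}\bigr) - G_a\!\bigl(t - T_{2,j}\bigr) \Bigr]
```

where $G_p(t) = \alpha^2 t\, e^{-\alpha t}$ for $t \ge 0$ (and $0$ otherwise) is the phrase-command response, and $G_a(t) = \min\bigl[\,1 - (1 + \beta t)\, e^{-\beta t},\; \gamma \bigr]$ for $t \ge 0$ (and $0$ otherwise) is the accent-command response. Here $F_b$ is the speaker's base frequency, and the parameters $A_p$, $T_0$, $A_a$, $T_1$ and $T_2$ correspond to the names Ap, T0, Aa, T1 and T2 used throughout this document.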
The reference PCs and ACs were manually estimated in 101 paragraphs of the FEUP-IPB database so as to minimise the error between the estimated and the measured F0 contours.
The prediction of the PCs is performed in two stages. The first consists of an algorithm that inserts PCs associated with the text, based on a mathematical model obtained from the experimental results. The second predicts the amplitude of the PCs, Ap, and their anticipation relative to their initial position, T0a. This anticipation makes it possible to determine their exact location in the speech signal. These two parameters are predicted with ANNs.
A strong association between ACs and syllables was found. This association led to the adoption of a methodology that predicts the ACs associated with syllables. Thus, the AC model consists of one ANN that predicts the existence of an AC associated with the syllable, plus three more that predict the parameters amplitude (Aa) and anticipation of the onset (T1a) and offset (T2a) times.
The final perceptual tests, using the category-judgement method on the MOS scale, resulted in a score of 4.6 for the original speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0 and 2.9 for the complete model (duration and F0 models). The final value for the complete model is at the 'Acceptable' level.
Key words: TTS systems, speech synthesis, prosody, intonation, rhythm, F0, modelling, Portuguese.
Résumé
This PhD thesis presents the development of a prosody system for European Portuguese (EP) for text-to-speech (TTS) applications. Basically, TTS systems perform the automatic reading of a text and consist of a sequence of several modules. These modules implement the text pre-processing, the phonetic transcription and the suprasegmental processing, which consists in the inclusion of prosodic patterns. Prosody is responsible for the communicative intention and guarantees naturalness in the spoken discourse. The prosodic features consist in the imposition of the timing, characterised by the segmental durations and pauses, the intonation, characterised by the fundamental frequency (F0) curve, and the intensity curve.
At the beginning, the so-called preparatory works, which were fundamental for the study and development of the system, are presented. They start with a preliminary study of the tonic syllable, in which the ranges of variation of the parameters F0, duration and intensity in the tonic syllable are identified in various contexts.
Then the FEUP-IPB DB speech database used in the following studies is presented. This EP speech database is labelled at the phoneme, word, sentence and F0 levels. Two syllabification algorithms are then presented, for written text and for phoneme sequences. This chapter of the thesis ends with the proposal of a set of rules to perform automatically the phonetic transcription of the most problematic graphemes in EP.
The proposed prosody model is composed of two sub-models: the duration model, which predicts the segmental durations, and the model that predicts the F0 contour.
Two proposals, based on artificial neural networks (ANNs), for predicting the segmental durations are presented.
The first consists of an ANN carefully chosen with regard to its architecture and type, as well as the set of input features, with the objective of minimising the error between the predicted and measured durations. The second proposal, called the alternative model, is based on the same considerations as the first but uses one dedicated ANN per phoneme, in a total of 44 ANNs. The alternative model with dedicated ANNs improved the final performance.
A model for the insertion and prediction of pause durations, based on a preliminary study of the FEUP-IPB database, is proposed.
The proposed model for predicting the F0 contour is based on the Fujisaki model and is composed of two sub-models: one predicts the Phrase Command (PC) parameters and the other predicts the Accent Command (AC) parameters. The reference PCs and ACs were manually estimated in 101 paragraphs of the database under the criterion of minimising the error between the estimated and measured F0 curves.
The prediction of the PCs is carried out in two stages. The first stage is performed by an algorithm responsible for inserting the PCs linked to the text, based on a mathematical model obtained from the experimental observations. The second stage of the model predicts the amplitude of the PCs, Ap, and the anticipation, T0a, relative to the initial position. The anticipation allows the exact position in the speech signal to be determined. Both parameters are predicted with ANNs.
A strong connection between ACs and syllables was found in the database. This strong connection justified the adopted methodology of predicting ACs associated with syllables. Consequently, the AC model is composed of one ANN that predicts the existence of an AC associated with the syllable and three other ANNs that predict the amplitude parameter (Aa) and the anticipation of the onset (T1a) and offset (T2a) times.
The final perceptual test, using the category-judgement method and the MOS scale, resulted in a score of 4.6 for the natural speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0 and 2.9 for the complete proposed model (duration and F0). The MOS for the complete model is at the 'Acceptable' level.
Key words: TTS systems, speech synthesis, prosody, intonation, rhythm, F0, modelling, European Portuguese.
Acknowledgements
I would like to thank my supervisor, Diamantino Freitas, for his support and advice, always given with gentleness, and for the opportunities to be involved in national and international projects and to cooperate with other international researchers and research laboratories.
My gratitude goes also to the colleagues of the LPF-ESI laboratory who were directly or indirectly involved in the work, namely Daniela Braga, Paulo Gouveia, Maria João Barros, Vagner Latsch and Helder Ferreira. Thanks to Constança Homem, who helped me with some translations, and to Esmeralda Miguel for the printing. I also express my thanks to Irene Fernandes for all her diligence.
I would like to pay homage to Prof. Carlos Espain, a senior member of LPF-ESI and a friend, who left us during this work.
A special thanks to Prof. Hiroya Fujisaki from the University of Tokyo for the important discussions and advice during the development work, and for reviewing part of this document.
I am also very grateful to Daniel Hirst from CNRS, Nick Campbell from ATR and Mark Huckvale from UCL for their important comments on and reviews of this document.
I am also grateful to the participants in the COST 258 Action “Naturalness of Synthetic Speech”, represented by its chairperson, Eric Keller, for the shared experiences and the contacts with some of the most important European researchers and research laboratories in this topic.
I also appreciated discussions related to my work with Alex Mohanagan, Eduardo Banga from the University of Vigo, Hansjörg Mixdorff from the University of Berlin, J.-P. Martens from the University of Gent, Isabel Trancoso and Luis Oliveira from INESC-L2F, the colleagues of the Univ. of Aveiro, namely Lurdes Moutinho and António Teixeira, and João Veloso from FLUP. The discussions with Luis Calôba, Manuel Seixas, Fernando Gil and Sérgio Netto from UFRJ-LPS were also welcome.
My thanks to the colleagues of my Department who allowed me to be released from teaching duties in 1999-2000 to develop this work. My appreciation to the directors of ESTiG-Bragança who gave me the conditions to develop this work, namely Rolando Dias and José Adriano. I would like to honour the memory of the director in charge at the beginning of this work, Prof. Alcínio Miguel. I am also grateful to the dean of the Polytechnic Institute of Bragança, Prof. Dionísio Gonçalves, for authorizing the application for the PRODEP scholarship.
I also express my thanks to RDP Porto for providing all the technical support for recording the database, and to the speaker Diamantino Guedes, who gave his voice and attention during the recording of the database.
I would like to express my gratitude to the colleagues who participated in the perceptual tests.
A special hug to my friends who helped me decompress during the coffee breaks with their always interesting talks, particularly my dear friend Paula Odete, as well as Luis Alves, Carlos Balsa, Alcina, Ana Moura, Henrique Gonçalves, Avelino Marques, João Nunes, Florbela, Ramiro Martins, Fernando Monteiro, João Ribeiro, Pedro Oliveira and many others.
Recognition to my canary friends, who allowed me to think beyond the PhD, keeping my mind healthy (I think).
Finally, last but not least, my thanks to my beloved wife Lina for all the extra tasks, responsibilities and fatigue that she has been exposed to during this work, and to Monica. The biggest thanks goes to my lovely Dorothy Rita for making me proud every day I play with her.
Contents
Abstract….....................................................................................................................................iii
Resumo….......................................................................................................................................v
Résumé….....................................................................................................................................vii
Acknowledgements…...................................................................................................................ix
Contents…....................................................................................................................................xi
List of Figures............................................................................................................................xvii
List of Tables… .........................................................................................................................xxi
Abbreviations............................................................................................................................xxv
1 INTRODUCTION.......................................................1
1.1 Foreword........................................................................................................................... 2
2 PREPARATORY WORK......................................... 15
2.1 Introduction.................................................................................................................... 16
A Prosody Model to TTS Systems
2.4 Syllabification................................................................................................................. 34
2.4.1 Introduction.............................................................................................................. 34
2.4.2 Syllable splitting of written text ............................................................................... 36
2.4.2.1 Rules .................................................................................................................... 36
2.4.2.2 Algorithm............................................................................................................. 37
2.4.3 Syllabic splitting of spoken text ............................................................................... 39
2.4.3.1 Rules .................................................................................................................... 39
2.4.3.2 Algorithm............................................................................................................. 39
2.4.4 Analysis and results ................................................................................................. 42
2.4.5 Conclusions.............................................................................................................. 42
3.6 Pauses.............................................................................................................................. 98
3.6.1 Pause occurrence...................................................................................................... 98
3.6.2 Pause duration.......................................................................................................... 99
3.6.3 Final considerations on studying pauses ................................................................ 101
BIBLIOGRAPHY................................................... 207
List of Figures
Fig. 2.1 – Recorded parameters for tonic and reference syllables using the developed package for
analysis. Top graph: waveform signal of the word “café” and its classifications, in red as 1 –
silence; 2 – unvoiced; 3 – mixed; 4 – voiced. Middle graph: F0. Bottom graph: Intensity............. 18
Fig. 2.2 – Relative variation of F0 in tonic syllable (95% confidence). .......................................... 19
Fig. 2.3 – Standard deviation of F0 variation between the three speakers...................... 20
Fig. 2.4 – Relative Duration of tonic syllable (95% confidence)..................................................... 21
Fig. 2.5 – Standard deviation of average duration between the three speakers. ............................ 21
Fig. 2.6 – Average intensity variation of tonic syllable for all speakers (95% confidence). ........... 22
Fig. 2.7 – Standard deviation of average intensity variation between the three speakers............... 23
Fig. 2.8 – Above: representation of the acoustic signal in the phoneme sequence [lej] in the word
‘lei’ – ‘law’. Below: spectrogram.................................................................................................... 28
Fig. 2.9 – Relative frequencies of the segments in the corpus. ........................................................ 30
Fig. 2.10 – Illustration of the speech rate for the different texts (here represented by its inverse, that is, the average time per segment). The figure shows the accumulated duration of elapsed segments. Track one is displayed using a solid line, track two using a dotted line, and so on for the 7 tracks............................................................................. 30
Fig. 2.11 – Flow chart for one-word syllabic splitting of a written text. V-vowel; C-consonant; ...-
any sequence of graphemes; .- syllable boundary; ?-grapheme not determined yet; bold- grapheme
already stored in the output string; underline-pointed grapheme by index i................................... 38
Fig. 2.12 – Flowchart of a spoken text syllabic splitting. ................................................................ 41
Fig. 2.13 – Previous processing blocks of phonetic transcription................................................... 44
Fig. 2.14 – Processing of phonetic transcription............................................................................. 45
Fig. 3.5 – Sequence of processing blocks prior to the development stage of the duration model and
its application to TTS. ...................................................................................................................... 76
Fig. 3.6 – Error histogram and normal distribution curve for every segment in both sets.............. 87
Fig. 3.7 – Normal probability distribution and absolute error curve for every segment in both sets.
......................................................................................................................................................... 87
Fig. 3.8 – Measured, predicted and average duration contours for the phoneme sequence in the
sentence “Conhece a situação na pele. Aprendeu-a na idade em que se aprende e se não esquece.”.
Meaning ‘Knows the situation on the skin. Learned it in the ages when we learn and don’t forget.’.
......................................................................................................................................................... 88
Fig. 3.9 – Measured and predicted duration contours for the paragraph “Que igualdade perante a
lei? João Amaral”. Meaning ‘How equal before the law? João Amaral’. ...................................... 89
Fig. 3.10 – Histogram of measured and predicted durations for phoneme [a]. .............................. 92
Fig. 3.11 – Histogram of measured and predicted durations for the burst part of phoneme [t]. .... 92
Fig. 3.12 – Error histogram and normal distribution curve for all segments in both sets with the
alternative model. ............................................................................................................................ 94
Fig. 3.13 – Normal probability distribution and absolute error curve for all segments in both sets
with the alternative model................................................................................................................ 94
Fig. 4.13 – Flow chart of the algorithm to connect ACs to syllables............................................. 125
Fig. 4.14 – Organization structures. On the top, the orthographic marks..................................... 126
Fig. 4.15 – Representation of Eligible positions, T0E, and anticipation, T0a, of PCs. .................. 128
Fig. 4.16 – Histogram and Gaussian approximation of distances from PCs not linked with
orthographic marks to previous PCs and next PCs. ...................................................................... 130
Fig. 4.17 – Weight for length of previous word. ............................................................................ 131
Fig. 4.18 – Flow chart to insert PC in text. ................................................................................... 132
Fig. 4.19 – Eligible area and candidate positions. ........................................................................ 133
Fig. 4.20 – Application example of the algorithm. ........................................................................ 133
Fig. 4.21 – Comparison of histograms of estimated and inserted PC distances............................ 135
Fig. 4.22 – Comparison of estimated and inserted PC positions. Black arrows are the estimated
PCs; magenta arrows are the inserted PCs................................................................................... 136
Fig. 4.23 – Evolution of ANN performances on the test set, over the size of the training set used.
....................................................................................................................................................... 138
Fig. 4.24 – Best Linear fit between target (T) and predicted (A) values for Ap (left) and T0a (right).
....................................................................................................................................................... 142
Fig. 4.25 – Probability error in test set for predicted Ap and T0a. Lines show the adjusted normal
probability distribution with a) µ=0.093, σ=0.075 and b) µ=0.148, σ=0.097............................. 142
Fig. 4.26 – Application example of the insertion PC model. PCs and components: black –
estimated; green - initial position of estimated PCs with predicted Ap and T0a; magenta –
predicted with PC model................................................................................................................ 143
Fig. 4.27 – Evolution of average ANN performances on the test set, over the size of the training set................................................................................................................................... 147
Fig. 4.28 – Best Linear fit between target (T) and predicted (A) values for Aa (left) and Probability error (|Aa_target − Aa_predicted|) in test set for predicted Aa (right), red line shows the adjusted normal probability distribution with µ=0.12 and σ=0.12. ........................................ 155
Fig. 4.29 – Best Linear fit between target (T) and predicted (A) values for T1a (left) and Probability error (|T1a_target − T1a_predicted|) in test set for the predicted T1a values (right), red line shows the adjusted normal probability distribution with µ=0.022 (s) and σ=0.024 (s)................ 157
Fig. 4.30 – Best Linear fit between target (T) and predicted (A) values for T2a (left) and
Probability error in test set for predicted T2a (right), red line shows the adjusted normal
probability distribution with µ=0.028 (s) and σ=0.026 (s). .......................................................... 157
Fig. 4.31 – Result of predicted ACs. In black, the estimated PCs, ACs and the associated F0
contour. In magenta, the predicted ACs, based on estimated PCs, and the corresponding F0
contour. Vertical lines represent word boundaries........................................................................ 160
Fig. 4.32 – Application of the complete F0 model. In black the estimated PCs, ACs and F0 contour.
In magenta the predicted ACs, PCs and F0 contour. .................................................................... 162
Fig. 4.33 – Application of the complete F0 model over the modified duration with the duration’s
model. In magenta the predicted ACs, PCs and F0 contour.......................................................... 163
Fig. 5.1 – Average opinion values of each subject for the 5 stimuli. ............................................. 171
Fig. 5.2 – Average opinion values by paragraph for the 5 stimuli. ............................................... 171
Fig. 5.3 – Analysis of opinion scores. ............................................................................................ 172
Fig. 5.4 – Comparison of measurement indicators by paragraph for Alternative Model.............. 175
Fig. 5.5 – Comparison of measurement indicators by paragraph for Model. ............................... 175
Fig. 5.6 – Comparison of measurement indicators by paragraph for No model. .......................... 176
Fig. 5.7 – Average opinion values for each subject in the 9 stimuli. ............................................. 181
Fig. 5.8 – Average opinion values for each paragraph in the 9 stimuli. ...................................... 182
Fig. 5.9 – Analysis of opinion scores by stimuli. Stimuli from 0 to 8 correspond to: 0 – No model;
1 – Natural; 2 – Durations; 3 – Estimated F0; 4 – Predicted ACs based on estimated ACs and PCs;
5 – Predicted ACs with estimated PCs; 6 – F0 Model; 7 – Duration + F0 model with Aa*0.75; 8 –
Durations + F0 model. .................................................................................................................. 184
Fig. 6.1 – PC and AC error components in stimuli 5 and 6, considering orthogonal axis. ........... 201
Fig. 6.2 – PC and AC error components in stimuli 5 and 6, considering non-orthogonal axis..... 202
List of Tables
(r); measured average (Av.) and predicted average (Pred. Av.); measured minimum value (Min.)
and predicted minimum value (Pred. Min.); measured maximum value (Max.) and predicted
maximum value (Pred. Max.)........................................................................................................... 95
Table 3.11: Statistics on pause occurrence...................................................................................... 98
Table 3.12: Parameters for the pause duration predictor. ............................................................ 100
Table 3.13: Best results for the intra-paragraph pause duration predictor. ................................. 100
Table 3.14: Marker type results for the pause duration predictor................................................. 100
Table 5.1: Portuguese and respective translation of the 5 paragraphs used in the perceptual test,
and respective number of segments. .............................................................................................. 170
Table 5.2: Correlation coefficient, r, and rmse between original and the other three stimuli in each
paragraph. ..................................................................................................................................... 170
Table 5.3: Mean Opinion Score (MOS) and standard deviation of the perceptual test................. 172
Table 5.4: Significance level between pairs of stimuli................................................................... 173
Table 5.5: Measurement indicators for models, by paragraph...................................................... 174
Table 5.6: Correlation coefficient along paragraphs between measurement indicators............... 176
Table 5.7: Mean values along paragraphs of evaluation measurements....................................... 177
Table 5.8: Correlation between mean values of evaluation measurements................................... 177
Table 5.9: Portuguese and respective translation of the 5 paragraphs used in the perceptual test.
....................................................................................................................................................... 179
Table 5.10: Objective measurements of each stimulus by paragraph. For each paragraph the first
line represents the correlation coefficient and second line the rmse. ............................................ 180
Table 5.11: Mean Opinion Score (MOS) and standard deviation of the perceptual test............... 183
Table 5.12: Significance level between pairs of stimuli. Stimuli from 0 to 8 have the same
correspondence as the ones in Fig. 5.9.......................................................................................... 184
Table 5.13: Indicator measurements for stimuli by paragraph. .................................................... 187
Table 5.14: Correlation coefficient along paragraphs between measurement indicators............. 187
Table 5.15: Mean values along paragraphs of indicator parameters. .......................................... 188
Table 5.16: Correlation between mean values along models of indicator parameters. ................ 188
Table 6.1: Summary of average (over the 5 paragraphs) evaluation parameters in the 4 stimuli types
used for perceptual tests. ............................................................................................................... 199
Abbreviations
Aa – Amplitude of AC;
ABU – Acoustic Building Unit;
AC – Accent Command;
ANN – Artificial Neural Network;
Ap – Magnitude of phrase command;
Ca – ANN that predicts the amplitude of the AC;
CA – ANN that predicts the existence of an AC associated with the syllable;
CEFAT – Centro de Estudos de Física, Acústica e Telecomunicações;
EP – European Portuguese;
F0 – Fundamental frequency;
FEUP – Faculty of Engineering of the University of Porto;
FEUP-IPB DB – FEUP-IPB speech DataBase;
FEUP-TTS – FEUP Text-To-Speech system;
LPF-ESI – Research Laboratory for Speech Processing, Electroacoustics, Signal and Instrumentation of FEUP;
LSS – Laboratory of Signals and Systems research unit of FCT, hosted at LPF-ESI;
MOS – Mean Opinion Score;
PC – Phrase Command;
r – Linear correlation coefficient;
rmse – Root mean squared error;
std – Standard deviation;
T0 – Onset time of PC;
T0a – Anticipation of PC;
T0E – Beginning of accent group where PC was inserted;
T1 – Onset time of AC;
T1a – Anticipation of the onset time of the AC;
T2 – Offset time of AC;
T2a – Anticipation of the offset time of the AC;
TPML – Text Processing Markup Language;
TTS – Text-To-Speech;
UFRJ – Federal University of Rio de Janeiro;
XML – eXtensible Markup Language;
δ – Mean absolute error;
σ – Standard deviation.
1 Introduction
This introductory chapter gives a short overview of what prosody is and describes the motivation and objectives of this work. The FEUP-TTS system for European Portuguese, which will initially host the proposed prosody model, is briefly described. Finally, an overview of this document is given and its original contributions are noted.
1.1 Foreword
This document reports and discusses the results of the work and experiments carried out in the construction of a prosody model for European Portuguese (henceforth EP). The work was developed by the author under a PhD programme in electrotechnical and computer engineering at FEUP.
The object of study is the European Portuguese language. The Portuguese language belongs to the family of Romance languages and is the fifth most widely spoken language in the world, with more than 200 million speakers. It is the official language in Portugal (10 million) (Europe), Brazil (175 million) (South America), Angola (10 million), Mozambique (20 million), Guinea-Bissau (1.3 million), São Tomé and Príncipe (165 thousand), Cape Verde (400 thousand) (Africa) and East Timor (800 thousand) (Asia). Each country has its own version of Portuguese. Even though each version can be understood in any of the Portuguese-speaking countries, the pronunciations differ, and it is not easy for a TTS system with a Brazilian version of Portuguese to be accepted in Europe, or vice versa.
The lack of resources for speech science in this language, such as tools or labelled databases,
introduced an inevitable delay in achieving the main objectives, since those resources had to be
created and prepared for this work.
The initial main objective of the work was to improve the naturalness of synthetic speech. This
objective led to a major effort in prosodic modelling, since prosody was the main shortcoming of
the existing TTS system.
There is no unique definition of what prosody is, but a broadly accepted view was summarised
by Ladd and Cutler [1983] into “concrete” and “abstract” categories. The “concrete” definition
deals with objective, physically measurable acoustic parameters such as F0, duration and intensity.
The “abstract” definition represents the linguistic point of view concerning its structure, “as phe-
nomena that involve phonological organization at levels above the segment”. That is why prosody
is considered a suprasegmental category. The first definition is closer to objective measurement,
the second to theory building, according to the caricature drawn by the authors. A third definition
was presented by Fujisaki [1997], which aggregates both previous definitions and brings together,
in a pleasing way, the work usually done by engineers and by linguists (not always working on the
same “prosody”):
“Prosody is the systematic organization of various linguistic units into an utterance or a coher-
ent group of utterances in the process of speech production. Its organization involves both segmen-
tal and suprasegmental features of speech, and serves to convey not only linguistic information, but
also paralinguistic and non-linguistic information.”
This definition is the most consonant with the present work. Even so, the work concentrates on the
suprasegmental features duration and F0, broadly known as the perceptually most important ones.
The proposed models for duration and F0 capture only part of the linguistic information, and neither
paralinguistic nor non-linguistic information is available. No syntactic, morphological or semantic
information is used, since it cannot be extracted automatically from text: no tool of the required
quality was available for use with the FEUP-TTS system. As for paralinguistic and non-linguistic
information, it is not related to the text at all, but only to the speaker. Since not all the information
conveyed by the speaker is used in the prosody model, the model cannot be expected to reproduce
exhaustively the same patterns as the speaker.
The same sequence of basic sounds can be produced with very different characteristics, depending
on the intention of the speaker. These characteristics are named prosodic features and consist of the
segmental durations (the duration of each sound), the intensity variation and the tone pattern. A
good prosodic pattern is essential to reach the objective of naturalness in synthetic speech. In
extreme cases, a different prosodic pattern can even change the meaning of the utterance. Different
ways of producing the same utterance can be natural, but not all prosodic patterns are natural, and
an unnatural prosodic pattern can be a reason for rejecting the synthetic speech.
All utterances carry a prosodic pattern, even a theoretically constant one, which, incidentally, is
not well accepted either. It is well known in the scientific community that the timing and the tone
(pitch, or F0) pattern are the most important features of prosody.
This work mainly consists of a proposed model to automatically produce the segmental durations
and the F0 pattern for EP written text. These prosodic feature patterns vary greatly with the
surrounding sounds, the meaning of the utterance, the sequence of words, the intention, the sentence
type and its length.
For humans, the task of producing a natural prosodic utterance is very simple. People speak
without thinking about prosody. They do not care about the duration of segments, the intensity or
the tone, nor do they think about sequences of segments. All this information is intuitively processed
by the human mind. However, humans still cannot build systems that do the same processing they
do intuitively.
This is where the author found some of the reasons to use ANNs to produce prosody. The funda-
mentals of ANNs are inspired by human neurons [Rumelhart and McClelland, 1986]. ANNs, just
like humans, can produce a result based on previous experience. For instance, the author's daughter,
when learning how to speak at about one year of age, could not yet utter words correctly, or know
their meanings, but she already produced a perfect prosody to express some feelings. Her neurons
were in an intense process of learning, and prosody was learned before vocabulary or sounds. This
was one of the reasons for the extensive use of ANNs in this work. Another, more objective reason
was the ANNs' capacity to process the input very quickly to determine the right output once the
training phase is accomplished. Other pattern recognition techniques could be used to complement
ANNs, or even to replace them, but this was considered to be out of the scope of the present thesis.
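The learning-from-examples behaviour described above can be illustrated with a deliberately tiny sketch: a single artificial neuron (perceptron) that learns the logical AND function from labelled examples. This toy is only an illustration of the principle; it is not one of the networks used in this work.

```python
# Toy perceptron: one artificial neuron learning AND from examples,
# illustrating "producing a result based on previous experience".

def step(x):
    """Threshold activation of the artificial neuron."""
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: list of ((x1, x2), target) pairs; returns weights and bias."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - step(w1 * x1 + w2 * x2 + b)
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
    return w1, w2, b

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, b = train_perceptron(data)
predictions = [step(w1 * x1 + w2 * x2 + b) for (x1, x2), _ in data]
```

After a few epochs the neuron reproduces all four training examples, and answering a new input costs only one weighted sum, which is also why a trained ANN is fast at run time.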
Later, in 1996, a subsequent project followed, aiming at improvements to the MULTIVOX TTS
system. This second version broke several restrictions of the first, but the main improvement was
in the acoustic module, introducing a formant-coded database of natural human speech. This second
version allowed the usage of a larger set of prosodic markers, requiring more prosodic knowledge
to deal with them. A strong need for a prosodic model was felt in this project.
Meanwhile, the author had gained experience in speech analysis through his Master's dissertation
[Teixeira, 1995], where he developed several analysis tools, such as automatic extraction of F0,
formant frequencies and respective bandwidths; voiced, unvoiced, mixed and silence classification;
and formant synthesis processing, among others. The preparation of this Master's dissertation
involved one year of research.
The field of the PhD work was defined by the strong need to improve the naturalness of European
Portuguese TTS systems, including the indispensable prosody module.
The original main objective of this work was defined as the improvement of the naturalness of
EP synthetic speech by prosodic and acoustic modelling of the EP language.
In order to fulfil this main goal, the originally planned tasks consisted of:
• study of the state of the art concerning TTS systems and prosody models;
• study and development of a consistent set of instrumental tools of analysis and synthesis
for prosodic studies;
• creation of new models or adaptation of existing ones to the EP language, their validation
and comparison with other known models;
• study and development of a new TTS system for EP to incorporate the previously planned
developments in speech naturalness at the acoustic and prosodic levels.
Meanwhile, the author participated in several projects through his integration, first in the CEFAT
research laboratory and then in the LSS research laboratory. Namely:
• the successor of the previous project, “SIRI”, also with the UFRJ team;
• the “ANTÍGONA” project, which aimed at the development of a speech interface for elec-
tronic commerce [Freitas et al., 2002];
• and, most importantly for this work, the COST 258 action “Naturalness of Synthetic
Speech”, in which the author and the laboratory had the opportunity to cooperate with sev-
eral other European speech research laboratories and researchers [Keller et al., 2002].
The participation in these projects, and mainly in the COST 258 action, provided a good background
in the state of the art and helped to clarify the original objectives, now focusing the main purpose
on a prosody model.
The participation of the laboratory in the “ANTÍGONA” project allowed the development of a
robust EP TTS system, namely FEUP-TTS, which will be briefly described below.
In the scope of improving the naturalness of a TTS system, several modules were found to need
improvements. Therefore, improvements were made in those modules, some under this PhD work
and others under the projects, by other colleagues. Namely, the pre-processing module received
several improvements by Hélder Ferreira [Report of ANTIGONA project, 01], [Braga et al., 2003]
and [Ferreira, 2003]. The linguistic module had several improvements made under this work, in
collaboration with other researchers, namely Paulo Gouveia and Daniela Braga, in work reported
in the next chapter, such as phonetic transcription, syllable division and labelling of the speech
database.
The acoustic module was also improved, and there are two alternatives. The first is a formant
synthesiser (formant module) with five formants, implemented in co-operation with Vagner Latsch
[Report of ANTIGONA project, 01]. The second uses pitch-synchronous concatenative techniques
and was developed by Barros [2002].
The prosody module is presented in this work. It consists basically of the model to predict
segmental durations, based on ANNs, and of the F0 prediction scheme based on the Fujisaki model,
with parameters predicted from text, also by means of ANNs.
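For reference, the standard Fujisaki formulation superimposes, in the log-F0 domain, a baseline Fb, phrase components (responses to phrase commands) and accent components (responses to accent commands). The sketch below uses α = 2.0, β = 20.0 and γ = 0.9, which are conventional defaults from the Fujisaki literature, not values estimated in this thesis.

```python
import math

ALPHA = 2.0    # natural angular frequency of the phrase control mechanism
BETA = 20.0    # natural angular frequency of the accent control mechanism
GAMMA = 0.9    # ceiling level of the accent component

def Gp(t):
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0

def Ga(t):
    """Step response of the accent control mechanism, clipped at GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)

def f0(t, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (s): baseline fb plus phrase commands (T0, Ap)
    and accent commands (T1, T2, Aa), superimposed in the log domain."""
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * Gp(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (Ga(t - t1) - Ga(t - t2))
    return math.exp(log_f0)
```

Before any command has taken effect the contour sits on the baseline, and each phrase or accent command raises the contour smoothly after its onset time.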
Since the prosody module was produced with the objective of being integrated in the FEUP-TTS
system, only the automatically available information was used. Although the focus of this disser-
tation is the report of the duration and F0 models, several preparatory modules were fundamental
and are also described as important issues, not just for the purposes of the prosody studies, but also
as resources for EP language researchers. Those resources are the FEUP-IPB database, the syllable
division module and the set of phonetic transcription rules.
The motivation for the extensive use of ANNs was based on the typology of the problems, for
which no sets of rules are known as solutions and a statistical tool like ANNs can efficiently achieve
good results, given a good representation and carefully prepared statistical information. ANNs
allow a good solution to be obtained without the functional mechanism being known, using only
the already known results. Even so, ANNs allow the evolution to a model where the phenomena
that interfere in the functional mechanism are known.
[Fig. 1.1 content: TEXT is passed to the pre-processing of text (conversion to plain text of
numerals, acronyms, abbreviations, dates, etc.) and then to the linguistic analysis (morphology and
syntactic structure; word, phrase and sentence boundaries).]
Another European Portuguese TTS system, DIXI, was reported by Oliveira and co-authors in
[Oliveira et al., 1991, 1993] and [Oliveira, 1996]. This system is a rule-based synthesiser using the
Klatt formant model. Later, the system evolved into DIXI+, a concatenation-based synthesiser
[Carvalho et al., 1998].
The architecture of the FEUP-TTS system can be described by the 5 combined modules presented
in Fig. 1.1.
The implementation of the first phase is based on linguistic context information of a morphological
nature extracted from the text, including gender and number.
This classification can immediately activate the adequate conversion, or an intermediate labelling
of the element to be converted later. This label activates the second phase, where a parser interprets
the meaning and finally converts the number to extended text format.
After that, the text is organised into smaller units such as sentences and paragraphs. The text is
also labelled with the mark-up language specially developed for this purpose, based on the XML
mark-up language. This TPML mark-up language is also extended to allow the insertion of prosodic
labels.
The word and sentence boundaries are easily generated, but the phrase boundaries depend
strongly on the syntactic structure.
The system has an organisational structure of dynamic variables which is prepared to receive the
information generated by the morpho-syntactic analysis and stores it for further use in subsequent
blocks.
In this phase, some morpho-prosodic labels should be introduced to be used in the grapheme-
phoneme conversion and in the prosodic module.
The pre-phonetic transcription converts some digraphs into their unequivocal phoneme represen-
tation, such as <rr> into [R], <lh> into [L] and <nh> into [J], in order to facilitate the subsequent
processing.
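A minimal sketch of this pre-transcription step, covering only the digraphs mentioned above (the function name and left-to-right scan are illustrative, not the actual FEUP-TTS implementation):

```python
# Rewrite each listed digraph as its unambiguous SAMPA symbol before
# the main grapheme-phoneme rules run. Only the digraphs mentioned in
# the text are covered here.

DIGRAPHS = {
    "rr": "R",   # <rr> -> [R]
    "lh": "L",   # <lh> -> [L]
    "nh": "J",   # <nh> -> [J]
}

def pre_transcribe(word):
    """Replace known digraphs by single SAMPA symbols, left to right."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)
```

For example, "carro" becomes "caRo" and "velho" becomes "veLo", leaving a one-symbol-per-phoneme string for the later rules.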
The following task is splitting words into syllables. The rules and algorithm are described below
in section 2.4.
Then, the tonic syllable is determined using a set of rules already described in [Teixeira, 1995].
After that, the grapheme-phoneme conversion is performed using, firstly, a table of exceptions,
then a set of rules described in section 2.5, and finally the set of co-articulation rules, described in
the same section. The co-articulation rules are still in the implementation phase.
The output of this block is the phoneme code sequence (using the SAMPA code [Wells, 2000]),
the delimitation of syllable, word, phrase, sentence and paragraph boundaries, and the identification
of tonic syllables.
The block of prosody pattern determination consists of the prosody models developed under this
work and described in further chapters.
Firstly, the phoneme sequences are converted into diphone sequences. Then two alternative
techniques are available for the acoustic processing: the formant synthesizer and the concatenation
synthesizer.
The formant synthesizer retrieves the sequence of frames of the diphones from the specific data-
base. This database consists of diphones of natural speech coded as sequences of frames. Fig. 1.2
presents the information in the sequence of frames of one diphone. Each frame corresponds to 10
ms of speech coded in the parameters F1, F2, F3, F4, F5 (5 formants), B1, B2, B3, B4, B5 (respec-
tive bandwidths), information about voicing/devoicing, and the amplitude of the excitation source.
The sequences of frames and the patterns of F0 and duration are then used as inputs to the synthe-
sizer module represented by the blocks of Fig. 1.3.
Fig. 1.2 – Sequence of 5 frames data of a diphone (F1 to F5, B1 to B5, voiced/unvoiced, amplitude).
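The frame layout just described can be sketched as a simple record; the field and variable names below are illustrative assumptions, not the actual storage format of the database:

```python
from dataclasses import dataclass

FRAME_MS = 10  # each frame covers 10 ms of speech

@dataclass
class Frame:
    formants_hz: tuple     # (F1, F2, F3, F4, F5)
    bandwidths_hz: tuple   # (B1, B2, B3, B4, B5)
    voiced: bool           # voicing/devoicing information
    amplitude: float       # amplitude of the excitation source

# A diphone is stored as a sequence of such frames, so its natural
# duration is simply the frame count times 10 ms.
diphone = [
    Frame((500, 1500, 2500, 3500, 4500), (60, 90, 120, 150, 180), True, 0.8),
    Frame((520, 1480, 2510, 3490, 4510), (60, 90, 120, 150, 180), True, 0.7),
]
duration_ms = len(diphone) * FRAME_MS
```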
Fig. 1.3 – Formant synthesizer block diagram. Ag and An denote the amplitudes of the excitation sources.
The glottal excitation is produced with the LF model [Fant et al., 1985] allowing the association
of physical characteristics with parameters of the model and better control over the voice quality.
The noise generator produces a noise signal generated by means of random numbers with a
Gaussian distribution.
The spectral correction filter was introduced to compensate for the observed difference in spectral
decay between natural human speech and synthetic speech, accounting for linear distortion in the
coding phase and in the selection of the source signal parameters.
The prosody manipulation is produced parametrically. F0 patterns are a sequence of values
indexed like the sequence of frames and are used to control the frequency of the glottal pulses. The
duration of segments controls the number of frames used to produce each segment, by a process of
removal or insertion of frames.
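Since each frame covers 10 ms, this duration control reduces to choosing the number of output frames and mapping each output position back onto the original frame sequence. The sketch below illustrates the idea with a uniform index mapping; the actual frame selection strategy of the synthesizer may differ.

```python
# Duration control by frame removal/insertion, assuming 10 ms frames:
# the target duration fixes the output frame count, and frames are
# dropped or duplicated by a uniform index mapping.

FRAME_MS = 10

def fit_duration(frames, target_ms):
    """Return a frame sequence whose total length matches target_ms."""
    n_target = max(1, round(target_ms / FRAME_MS))
    n_orig = len(frames)
    return [frames[min(n_orig - 1, (k * n_orig) // n_target)]
            for k in range(n_target)]
```

Stretching a 4-frame (40 ms) segment to 80 ms duplicates every frame, while compressing it to 20 ms keeps every other frame.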
The time-domain concatenation synthesizer described in [Barros, 2002] uses diphone units col-
lected from the FEUP-IPB speech database. The control of segmental duration is done by repetition
or deletion of pitch periods, and the F0 control is achieved by shortening or enlarging the pitch
periods.
Several parts of the work reported here have already been published in international specialised
conferences. This dissertation intends to report the work in more detail and sometimes with further
developments.
The main prosody work is documented in chapters 3 and 4 (duration model and F0 model,
respectively). The accessory works, although also important for the main prosody model, are
described in chapter 2, where several preparatory works are reported, and in chapter 5, where the
perceptual evaluations of the models are discussed.
Chapter 2 describes several components that serve as preparatory works towards the main prosody
model. The chapter starts, in section 2.2, with a preliminary study of the tonic syllable in EP. This
work was not used directly in the present model, but it was developed under this PhD work and the
resulting experience was important to clarify the research trajectory. The study presents preliminary
measurements of the modifications of the prosodic features syllable duration, F0 and intensity in
the tonic syllable, according to its position in the word and in the phrase. Section 2.3 describes the
FEUP-IPB speech corpus used in the development of the prosody model. The process of labelling
the database at the segmental, word and phrase levels is described, several statistics of the database
are presented and, finally, several phonetic modification phenomena found in the database are
reported. Section 2.4 describes two developed algorithms and sets of rules for the syllabic splitting
both of the written text and of the phonetic sequences resulting from the phonetic labelling of the
database. The chapter ends with section 2.5, where some contributions to the phonetic transcription
of EP text are presented.
Chapter 3 describes the proposed model to predict segmental durations. An overview of other
recent duration models is presented in section 3.2. In section 3.3, a model is proposed to predict
the segmental durations for EP based on one ANN. The aspects of ANN architecture and training,
as well as a study of important features, are presented. Section 3.4 then presents some parameters
used in the measurements, and the proposed model is evaluated. An alternative model sharing the
characteristics of the proposed one, but with dedicated ANNs for each segment type, is proposed
in section 3.5. Finally, section 3.6 presents a preliminary study of a model to predict pause insertion
and pause duration.
Chapter 4 presents the proposed model to predict F0 patterns from text with dedicated ANNs. A
short overview of F0 coding models is given in the introductory section 4.1. Section 4.2 discusses
theoretical and practical aspects of Fujisaki modelling. Section 4.3 describes the process of esti-
mating the model parameters. Section 4.4 clarifies the sequence of application of the model. Section
4.5 presents the whole process of inserting phrase commands, controlled by an algorithm, and the
prediction of their magnitudes and final positions with ANNs, as well as the study of the selection
of the ANNs and of the set of features. Section 4.6 presents the process of insertion of accent
commands, the prediction of their parameters with ANNs, and the selection of the ANNs and
features. In section 4.7, the results of the predicted F0 contour are analysed, separating the phrase
and accent components. The predicted F0 contour over the predicted segmental durations is also
analysed.
In chapter 5, the perceptual tests made to evaluate the proposed models are presented and discussed.
Section 5.2 compares the results of both proposed duration models with natural speech and with
the absence of a duration model. Section 5.3 discusses the loss in naturalness after the application
of each component of the prosody model (duration and F0 models).
Finally, chapter 6 presents the extended and summarised conclusions and future developments.
• the study of the variation of the prosodic acoustic parameters in the tonic syllable, already
published in [Teixeira et al., 1999] and [Teixeira and Freitas, 2002];
• the labelled speech corpus FEUP-IPB database for EP, already published in [Teixeira et
al., 2001];
• the algorithms for text syllabification and phonetic syllabification of EP, based on the con-
sideration of grammatical sequences of vowels and consonants and some complementary
rules, already published in [Gouveia et al., 2000];
• the contribution to the set of rules for the phonetic transcription of several graphemes in
EP.
The usage of ANNs in the models presented in chapter 3, for the prediction of segmental duration,
had already been tried for other languages with good results. The original contributions in this
model are the extended list of features and the dedicated ANNs for each type of segment proposed
in the alternative model. Both contributions proved to improve the performance of the final model.
This work was already partially published in [Teixeira and Freitas, 2002, 2003a, 2003b].
The estimation of the F0 contour with the Fujisaki model had already been published for several
languages, with their own peculiarities. The known published works reporting the prediction of the
F0 contour from text are [Navas, 2003] for the Basque language, and [Mixdorff, 1998, 2002] and
[Möbius et al., 1993] for the German language. Mixdorff presented in 2002 the prediction of
parameters by means of one ANN; the other works predict the parameters by rule processes based
on statistical analysis. Thus, the process of predicting the model parameters with dedicated ANNs
is also innovative. Other new contributions in this model are the process of insertion of the phrase
commands and the association of accent commands to syllables. Some parts of this work were
already published in [Teixeira et al., 2003, 2004].
2 Preparatory Work
This chapter describes several components developed during the work. These components are not
directly related to the prosody model, but are used by it and are essential to supply linguistic infor-
mation about the text. A preliminary study of the tonic syllable was done before the prosody model;
it gave several hints for the duration and F0 models and produced some quantitative information
about the tonic syllable. The FEUP-IPB phonetically labelled speech database, used in all the
following studies, is also described, as are the algorithms for the syllabification of the written text
and of the phoneme sequence produced by the speaker. Finally, several rules for EP grapheme-
phoneme transcription are presented, as an important part of producing accurate synthetic speech
in TTS systems.
2.1 Introduction
This chapter presents several separate blocks developed during this work by the author in
cooperation with other colleagues of the laboratory. Each sub-chapter describes one separate study,
but each of them makes an important or fundamental contribution to the final prosody model.
The first sub-chapter, 2.2, describes the variation of the prosodic parameters duration, F0 and
intensity introduced by the effect of the accented syllable. This study was done prior to the prosody
model presented in the next chapters and was presented and published in [Teixeira et al., 1999] and
[Teixeira and Freitas, 2002]. Although the following model has a radically different methodology
from the one followed in this study, very good suggestions for the development of the prosody
model resulted from the experience and discussion of this work.
Section 2.3 presents the FEUP-IPB speech corpus database for EP, which was used in all subse-
quent developments. The database was produced with the main objectives of supporting the devel-
opment of the prosody model and of serving as a source of speech segments for a TTS database.
The database is phonetically labelled and also has several other labels, described in the mentioned
section and in [Teixeira et al., 2001].
Section 2.4 describes the developed algorithms, also presented in [Gouveia et al., 2000], to split
words into syllables. Two distinct algorithms were developed. The first one, with a very good per-
formance, splits the written text and is intended to be applied in the TTS process. The second one
splits phonetic words as they were produced by the speaker. This second algorithm has the addi-
tional difficulty of dealing with several suppressions, very frequent in EP. This last algorithm was
used in all development studies of the prosodic model, since the source of information is the
phoneme sequences as they were produced by the speaker. These algorithms can be considered
part of the prosody model.
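As a rough illustration of syllabification driven by sequences of vowels and consonants, the toy sketch below splits a word at vowel nuclei and attaches a maximal legal onset to the following nucleus. It is not the thesis algorithm: the onset inventory is a simplified assumption, and EP diphthongs and hiatus are handled only crudely.

```python
# Toy CV-pattern syllabifier for illustration only.

VOWELS = set("aeiouáéíóúâêôãõà")
ONSETS = {"pr", "br", "tr", "dr", "cr", "gr", "fr", "vr",
          "pl", "bl", "cl", "gl", "fl", "ch", "lh", "nh"}  # simplified

def syllabify(word):
    w = word.lower()
    sylls, start, i = [], 0, 0
    while i < len(w):
        if w[i] not in VOWELS:
            i += 1
            continue
        j = i + 1                      # find the next vowel nucleus
        while j < len(w) and w[j] not in VOWELS:
            j += 1
        if j == len(w):                # no further nucleus: stop scanning
            break
        if w[j - 2:j] in ONSETS:       # two-consonant onset moves right
            cut = j - 2
        elif j - 1 > i:                # one consonant becomes an onset
            cut = j - 1
        else:                          # vowels in hiatus: split between them
            cut = j
        cut = max(cut, i + 1)
        sylls.append(w[start:cut])
        start = cut
        i = j
    sylls.append(w[start:])
    return sylls
```

For example, "palavra" splits as pa-la-vra and "ferro" as fer-ro, since "vr" is a legal onset while "rr" is not.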
The last section presents a set of rules to be used in the grapheme-phoneme conversion process of
the TTS system, and discusses the major difficulties for the EP language. The set of rules is already
implemented in the FEUP-TTS system. Some post-lexical rules are also proposed in order to reduce
the distance between phonetic transcription and phonological production.
2.2.1 Introduction
It is assumed by some authors, for instance [Zellner, 1998], [Andrade and Viana, 1988] and
[Mateus et al., 1990], that accurate modelling of tonic syllables is crucially important in the mod-
elling of prosody, and specifically in developing prosodic models to improve the naturalness of
synthetic speech. This requires the modification of the acoustic parameters duration, intensity and
fundamental voicing frequency, F0, but there are no previously published works that systematically
quantify the variation of these parameters for EP.
F0, duration or intensity variation in the tonic syllable may depend on its function in the context,
the word length, the position of the tonic syllable in the word, or the position of this word in the
sentence (initial, medial or final). The function of words will not be considered, since it is not
generally predictable by a TTS system. The main objective was to develop a quantified statistical
model to implement the necessary F0, intensity and duration variations on the tonic syllable for
TTS synthesis, considering only the position dependency.
2.2.2 Method
2.2.2.1 Corpus
A short corpus was recorded with phrases of varying lengths, with a selected tonic syllable always
containing the phoneme [E] (SAMPA code). The syllables were analysed in various positions in
the phrases and in isolated words. The short corpus was built bearing in mind that this study should
be extended to a larger corpus with other phonemes and with refinements in the method resulting
from this first stage.
Two words were considered for each of the three positions of the tonic syllable (final, penultimate
and antepenultimate stress). Three sentences were created with each word, plus one sentence with
the word in isolation, giving a total of 24 sentences. The nonsense word “fefeto” was also included.
The characteristics of the tonic syllable were then extracted and analysed in comparison with a
neighbouring (unstressed) reference syllable in the same word (e.g. Amélia, ferro, café: bold =
tonic syllable, underlined = reference syllable). The nonsense word is of particular interest because
it contains the same syllable twice, in pre-tonic and post-tonic positions, allowing the reference
syllable to be segmentally identical to the tonic syllable.
17
A Prosody Model to TTS Systems
The 24 sentences were read by three speakers, two male and one female. Each speaker read the
material three times. Recording was done directly to a PC hard disk using a unidirectional micro-
phone at 50 cm and a sound card (16 bits, 11 kHz). The recording room was only moderately
acoustically treated.
The MATLAB package was used for the analysis, and appropriate measuring tools were created.
All frames were first classified into voiced, unvoiced, mixed and silence. Intensity in dB was
calculated as in [Rowden, 1992], and in voiced sections the F0 contour was extracted using a
cepstral analysis technique [Rabiner and Schafer, 1978]. These three aspects of the signal were
verified by eye and by ear. The following values were recorded for tonic syllables (T) and reference
syllables (R), as depicted in Fig. 2.1: syllable duration (DT – tonic and DR – reference), maximum
intensity (IT and IR), initial (FA and FC) and final (FB and FD) F0 values, and the type of shape
of the F0 contour.
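A minimal sketch of cepstral F0 estimation in the spirit of the technique cited above: the quefrency of the dominant cepstral peak within a plausible pitch range gives the period. The window type, frame length and 70–400 Hz search range below are illustrative assumptions, not the settings of the original MATLAB tool.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT, enough for a single analysis frame."""
    n = len(x)
    w = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(x[t] * w[(k * t) % n] for t in range(n)) for k in range(n)]

def cepstral_f0(frame, fs, fmin=70.0, fmax=400.0):
    """Estimate F0 as fs / q, where q is the quefrency of the dominant
    real-cepstrum peak inside the [fmin, fmax] pitch range."""
    n = len(frame)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spec = dft([s * h for s, h in zip(frame, hann)])
    log_mag = [math.log(abs(c) + 1e-12) for c in spec]
    # log_mag is real and even, so a forward DFT (real part, scaled by
    # 1/n) equals the inverse DFT of the textbook definition.
    ceps = [c.real / n for c in dft(log_mag)]
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = max(range(qmin, qmax + 1), key=lambda i: ceps[i])
    return fs / q

# Synthetic check: a 200 Hz tone with 10 harmonics, sampled at 11 kHz
# (the corpus sampling rate), analysed over a 512-sample frame.
fs = 11000
frame = [sum(math.cos(2 * math.pi * h * 200.0 * t / fs) for h in range(1, 11))
         for t in range(512)]
estimate = cepstral_f0(frame, fs)
```

The estimate is quantised to fs/q on the integer quefrency grid, so its resolution around 200 Hz is a few Hz.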
Fig. 2.1 – Recorded parameters for tonic and reference syllables, using the developed analysis package. Top
graph: waveform of the word “café” and its frame classification, in red, as 1 – silence; 2 – unvoiced; 3 –
mixed; 4 – voiced. Middle graph: F0. Bottom graph: intensity.
The result for each sentence type comes from the analysis of 18 utterances (two different words per
sentence type, read three times by three speakers: 2×3×3).
The difference in F0 variation between tonic and reference syllables, relative to the initial value
of F0 in the tonic syllable, given by Eq. (2.1), was determined for all sentences. As these syllables
are in neighbouring positions, the common variation of F0 is the result of the sentence intonation;
the difference in F0 variation between the two syllables is due to the tonic position.
Relative variation of F0 = [(FB − FA) − (FD − FC)] / FA × 100 (%)        Eq. (2.1)
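Eq. (2.1) can be read directly as code (FA, FB: initial and final F0 of the tonic syllable; FC, FD: initial and final F0 of the reference syllable; the function name is illustrative):

```python
def relative_f0_variation(fa, fb, fc, fd):
    """Eq. (2.1): the F0 variation of the tonic syllable (FB - FA) minus
    that of the reference syllable (FD - FC), relative to FA, in %."""
    return ((fb - fa) - (fd - fc)) / fa * 100.0
```

For example, a tonic rise of 20 Hz against a reference rise of 5 Hz, starting from FA = 100 Hz, gives a relative variation of 15%.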
There are some cross-speaker tendencies, and some minor variations that seem irrelevant. Fig. 2.2
shows the average relative variation of F0 (±2σ, σ = standard deviation) of the tonic syllable for
all speakers.
Fig. 2.3 shows the standard deviation between the three speakers. In some cases (low standard
deviation) the F0 variations in the tonic syllable are similar for the three speakers, but in other cases
(high standard deviation) the F0 variations are very different. Reliable rules can therefore only be
derived in a few cases. Table 2.1 shows the cases that can be taken as more consistent rules, taking
the standard deviation into consideration. These rules can be interpreted as the situations where the
F0 variation should be incremented by the mentioned percentage amount.
[Fig. 2.2 – Average relative variation of F0 (±2σ) of the tonic syllable, in %, for the twelve combinations of
word position in the phrase (isolated, beginning, middle, end) and tonic-syllable position in the word
(beginning, middle, end).]
[Fig. 2.3 – Standard deviation (in %) of the relative F0 variation across the three speakers, by position of the
tonic syllable in the word and position of the word in the phrase.]
Although only the values of the F0 variation are reported here, the shape of the variation is also
important. The patterns were observed and recorded; in most cases they can be approximated by
exponential curves.
2.2.3.2 Duration
The relative duration of each tonic syllable was calculated by the relation in Eq. (2.2). For each
speaker, the average relative duration of the tonic syllable was determined, and tendencies were
observed for the position of the tonic syllable in the word and the position of this word in the phrase.
relative duration of tonic = DT / DR × 100 (%)        Eq. (2.2)
Fig. 2.4 shows the average duration (±2σ, σ = standard deviation) of the tonic relative to the
reference syllable for all speakers, at 95% confidence. A general increase can be seen in the duration
of the tonic syllable from the beginning to the end of the word. The low values of standard deviation
in Fig. 2.5 (compared with those of the previous figure) show that the patterns and ranges of
Chapter 2 - Preparatory Work
variation are quite similar across the three speakers, leading to the conclusion that variation in rela-
tive duration of the tonic syllable is speaker independent.
Rules for tonic syllable duration can be derived from Fig. 2.4, based on position in the word and
the position of the word in the phrase. Table 2.2 summarises these rules.
Note that when the relative duration is less than 100% the duration of the tonic syllable will be
reduced.
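A Table 2.2-style rule can be applied as a one-line scaling, following Eq. (2.2). A minimal sketch (the function and parameter names are mine, not the thesis's):

```python
def apply_duration_rule(reference_ms, relative_percent):
    """Scale a syllable duration by a Table 2.2-style rule, Eq. (2.2).

    relative_percent is the relative duration of the tonic: values
    below 100 shorten the tonic syllable, values above 100 lengthen it.
    """
    return reference_ms * relative_percent / 100.0
```

A 150% rule stretches a 180 ms reference to 270 ms, while an 80% rule reduces a 200 ms reference to 160 ms.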
Fig. 2.4 – Average relative duration of the tonic syllable for all speakers (±2σ, 95% confidence), by
position of the tonic syllable in the word and position of the word in the phrase.
Fig. 2.5 – Standard deviation of average duration between the three speakers.
There are still some questions about these results. Firstly, the reference syllable differs
segmentally from the tonic syllable. Secondly, the results were obtained for a specific set of syllables
and may not apply to other syllables. Thirdly, in synthesising a longer syllable, which constituents
are longer? Only the vowel, or should the consonants also be longer? Does the type of consonant
(stop, fricative, nasal, lateral) matter? A future study with a much larger corpus will address these issues.
2.2.3.3 Intensity
For each speaker the average intensity variation between tonic and reference syllables was
determined (Eq. (2.3)), in dB, according to the position of the tonic syllable in the word and the
position of this word in the phrase. There are cross-speaker patterns of decreasing relative intensity
in the tonic syllable from the beginning to the end of the word. Fig. 2.6 shows the average intensity
variation, ±2σ (95% confidence).
The standard deviation between speakers is shown in Fig. 2.7. The pattern of variation for this
parameter is consistent across speakers.
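Eq. (2.3) itself is not reproduced in this excerpt. A common way to express an intensity difference between tonic and reference syllables in dB, assuming it is computed from RMS amplitudes, is the following sketch; this is my reconstruction, not necessarily the thesis's exact formula:

```python
import math

def intensity_variation_db(rms_tonic, rms_reference):
    """Intensity difference in dB between tonic and reference syllables.

    Assumed form: 20 * log10 of the RMS amplitude ratio, which is the
    standard amplitude-based dB difference (Eq. (2.3) is not shown in
    this excerpt, so the exact formula is an assumption).
    """
    return 20.0 * math.log10(rms_tonic / rms_reference)
```

Equal amplitudes give 0 dB; a tonic syllable ten times the reference amplitude gives +20 dB.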
Fig. 2.6 – Average intensity variation of tonic syllable for all speakers (95% confidence).
Fig. 2.7 – Standard deviation of average intensity variation between the three speakers.
In contrast to the duration parameter, a general decreasing trend can be seen in the tonic syllable
intensity variation as its position changes from the beginning to the end of the word. Again, a set of
rules can be derived from Fig. 2.6, giving the change in intensity of the tonic syllable according to
its position in the word and in the phrase. Table 2.3 shows these rules. It can be seen that in cases 1,
2, 10 and 11 the inter-speaker variability is high and the rules are therefore unreliable.
Since in these experiments the tonic syllable always contains the phoneme [E], a rather open and
strongly pronounced phoneme, how much does this affect the results? In order to eliminate this
problem the reference syllable should ideally be identical to the tonic syllable, even if nonsense
words (like “fefeto”) have to be used.
the sets [1,2,3], [4,5,6], [7,8,9] and [10,11,12]. The average values of these sets show the effect of
the position of the word in the phrase.
Firstly, the variations of average relative duration and intensity of the tonic syllable are opposite
in phrase-initial, phrase-final and isolated words. Secondly, comparing the variation in average
relative variation of F0 in Fig. 2.2 and average relative duration in Fig. 2.4, the effect of syllable
position in the word is similar in the cases of phrase-initial and phrase-medial words, but opposite
in phrase-final words. Thirdly, for relative F0 and intensity variation shown in Fig. 2.2 and Fig. 2.6
respectively, opposite trends can be observed for phrase-initial words but similar trends for phrase-
final words. In phrase-medial and isolated words the results are too irregular for valid conclusions.
These qualitative comparisons are summarised in Table 2.4.
Table 2.4: Summary of qualitative trends (varying the tonic position from beginning to the end of word) for all
word positions in the phrase.
Word position
Parameter
Isolated Beginning Middle End
Relative F0 variation *
Relative duration
Intensity
* Irregular variation.
Finally, there are some general tendencies across all syllable and word positions. For F0 relative
variation, the most significant tendency is a regular decrease from the initial to the final position in
the phrase, but in isolated words the behaviour is irregular with an increase at the middle of the
word. There is a regular increase in the relative duration of the tonic syllable, up to 200%. Less
regular variation in intensity can be observed, moderately decreasing (2-3 dB) as the word varies
from the initial to the medial position in the phrase, but increasing (2-4 dB) in phrase-final words
and in isolated words.
In informal listening tests of each individual characteristic in synthetic speech, the most
important perceptual parameter is F0 and the least important is intensity. Duration and F0 are thus
the most important parameters for a synthesiser.
This preliminary study clarified some important issues. In future studies the reference syllable
should be similar to the tonic syllable for comparisons of duration and intensity values, and should
be contiguous to the tonic in a neutral context. Consonant duration should also be controlled. These
conditions are quite hard to fulfil in general, leading to the use of nonsense words containing the
same syllable twice.
For duration and F0 variations a larger corpus of text is needed in order to increase the
confidence levels. The default duration of each syllable should be determined and compared to the
duration in tonic position. The F0 variation in the tonic syllable is assumed to be independent of
segmental characteristics. The number and variety of speakers should also be increased so that the
results can be more generally applicable.
2.3.1 Introduction
The present database was built during this work because at the time there was no public
phonetically labelled European Portuguese DB. The FEUP/IPB-DB, described below, was aimed
at the development of a new high-quality EP TTS, for two purposes. The first purpose is to supply
word and phrase level annotations that are used to study and build prosody models for EP read
speech. The second is to provide a phonetically rich and natural database of EP phonemes and
articulations specifically recorded from the high-quality voice of a skilled professional speaker.
This database was phonetically segmented, labelled and annotated in a way that, because of its
structural organization, allows it to be used for quasi-automatic construction of the segmental base
of a TTS system.
It is also important to stress that this DB allows us to extract segmental and supra-segmental
features for EP, which means that it is the basis for a broader knowledge of EP phonetics and prosody.
The voice recordings were made in an acoustically treated professional studio of RDP, the public
national radio broadcasting company. The professional male speaker read the text materials and the
speech was digitally recorded using the regular studio equipment. The session had been carefully
prepared, with text preparation and trial readings. Different text materials serve different purposes
of the database and the speaker was instructed accordingly. After some editing of the digital sound
records, such as cutting out mistakes, sound material with a total duration of approximately
100 minutes was produced, organized in a set of sound tracks with durations between 2 and
3 minutes each. An audio CD in CDA format and a set of .wav files at 44.1 kHz sampling rate,
16 bits, mono, were produced.
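The stated recording format can be checked programmatically. A sketch using Python's standard `wave` module (the file name in the usage is hypothetical; the thesis did not use this tool):

```python
import wave

def check_db_format(path):
    """Verify a track matches the stated format: 44.1 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 44100
                and w.getsampwidth() == 2   # 16 bits = 2 bytes per sample
                and w.getnchannels() == 1)
```

Calling `check_db_format("track01.wav")` would return `True` only for a 44.1 kHz, 16-bit mono file.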
Section 2.3.2 describes the text corpus, section 2.3.3 the segmentation process, and section 2.3.4
reports several characteristics of the database. In section 2.3.5, some relevant phonetic aspects are
presented that resulted from the phonetic inspection, segmentation and labelling of the database, as
reported by the phonetician Daniela Braga [Teixeira et al., 2001] and incorporated here as an
important piece to complete the description of the database.
cally engineered log-atoms carrying all standard Portuguese diphones and several triphones in a
congruent context. Some text readings, due to their length, are divided into two or more sound
tracks.
The set of log-atoms consists of syllables with vowels, nasal vowels and diphthongs, read in a
continuous way in concatenative alternation between vocalic sounds or between vocalic and
consonantal sounds. This set was divided into 3 tracks. Its main purpose is to guarantee that some
specimens of each rare diphone are present in the database, spoken in as monotonous a way as
possible, for use in speech synthesis.
Each track was later divided into files associated with text paragraphs.
When one word starts right after the previous one without a break, the start-of-word code was
used to simultaneously label the end of one word and the beginning of the next. The same
procedure is used for phrase and sentence boundaries.
All the word and phrase labelling and about half of the phonetic labelling were done manually.
This task was accomplished by a professional phonetician and the production rate was about 1 day
per minute of sound material. The other half of the phonetic labelling was initially done using an
automatic alignment tool from the University of Gent [Vorstermans et al., 1996] and the result was
subsequently manually reviewed and corrected. This automatic alignment tool starts from the wave
file and the phonetic transcription of the text, as well as the word and phrase labels in the phonetic
transcription, to finally produce the phonetic labelling, inserting or removing some phones due to
reduction phenomena. This process is strongly encouraged because of the time it saves.
In spite of the usage of specific tools for the labelling process, phone boundary identification is
neither always obvious nor consensual. Fig. 2.8 shows an example of the difficulty of identifying
the boundary between [e] and [j] in the word ‘lei’ – ‘law’. The transition between [e] and [j] occurs
in the period of about 50 ms labelled as [ej] in the upper panel of the figure. It is clear that there is
no precise location for the boundary.
To minimise this problem, the database was labelled by only one phonetician, so as to keep the
identification of such boundaries consistent. No study was made to quantify the error in phoneme
and boundary labelling, since a study of that kind would require more phoneticians to label a
sample of the database and a comparison of the results of each labelling. However, some
observations point to an average labelling error of 5 to 10 ms.
Table 2.5: Phoneme, word and sentence level labels used in labelling the database.

Phoneme level (in SAMPA code):
  p, b, t, d, k, g              Burst segment of plosive consonants
  !                             Occlusion segment of plosive consonants
  f, v, s, z, S, Z              Fricatives
  m, n, J                       Nasals
  L, l, R, r                    Liquid consonants
  l*                            l in syllable-final position (velar)
  i, e, E, a, 6, O, o, u, @     Vowels
  i~, e~, 6~, o~, u~, w~, j~    Nasal vowels
  w, j                          Glides
  X                             Silence
  XX                            Inhalation
  “                             Beginning of tonic syllable

Word level:
  p                             Beginning of word
  f                             End of word

Sentence level:
  i                             Beginning of sentence
  .                             End of sentence
  , ! ( ) - ; : ... “           Every punctuation marker in the text
Language variation issues were taken into consideration in the construction of this DB, in
particular those related to dialectal or geographic varieties, as well as those concerning individual
tendencies, style or habits. These aspects will be described below.
Labelled files are read and processed with a function written in Matlab, making all the labelling
information available. Phone identity, phone duration, word boundary, sentence boundary and
punctuation information can thus be extracted from the labelling files.
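The same kind of extraction can be sketched outside Matlab. The label-file layout assumed below (one "end-time label" pair per line) is an illustrative convention of mine, not the documented FEUP/IPB-DB file format:

```python
def phone_durations(lines):
    """Return (label, duration) pairs from '<end_time> <label>' lines.

    Assumes each line gives the end time (in seconds) of one labelled
    segment; the first segment is taken to start at time 0. This file
    layout is an assumption for illustration only.
    """
    out, prev_t = [], 0.0
    for line in lines:
        t_str, label = line.split()
        t = float(t_str)
        out.append((label, t - prev_t))
        prev_t = t
    return out
```

For example, the lines `["0.10 l", "0.25 e", "0.40 j"]` yield durations of 0.10 s, 0.15 s and 0.15 s for [l], [e] and [j].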
Fig. 2.8 – Above: representation of the acoustic signal in the phoneme sequence [lej] in the word ‘lei’ – ‘law’.
Below: spectrogram.
2.3.4 Characteristics
Tracks 1, 2, 3, 4, 5, 7 and 8 were first manually labelled and the other tracks were, in a second
phase, semi-automatically labelled using the automatic alignment tool and then manually corrected.
Only these seven tracks were used in the following studies. These seven tracks give a total of
21 minutes of speech, consisting of 18,647 segments and 15,633 phones.
For each considered phone segment or phoneme, the relative occurrence frequency (in %),
average duration and standard deviation were determined in a general position and in the tonic
syllable position. These data are reported in Table 2.6. Fig. 2.9 displays the relative frequency of
each segment.
Comparing each phone segment's duration in a general position with its duration in tonic
syllable position, we can conclude that all vowels and the phoneme [l*] are longer in tonic position,
the phoneme [L] is shorter, and all other consonants, including the occlusions of plosives [!], are
not affected by the tonic syllable position.
Fig. 2.10 presents a graph showing the regularity of the speech rate in the readings of tracks 1, 2,
3, 4, 5, 7 and 8. A different slope along the time axis would indicate a distinct rate for that specific
track. The speech rate for the readings varies between 11.6 and 13.0 phones/sec. The average
speech rate is 12.2 phones/sec.
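The speech-rate figure is a simple ratio of phone count to elapsed time; a minimal sketch using the totals reported above for the seven tracks (21 minutes, 15,633 phones):

```python
def speech_rate(n_phones, duration_s):
    """Average speech rate in phones per second."""
    return n_phones / duration_s

# Overall figure for the seven labelled tracks: about 12.4 phones/sec,
# close to the 12.2 phones/sec per-track average reported in the text
# (the small difference presumably reflects exact track durations).
rate = speech_rate(15633, 21 * 60)
```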
Table 2.6: Percentage of occurrences, average duration and standard deviation of all phones, considering
general positions (including tonic) and just tonic syllable positions.
Fig. 2.9 – Relative frequency of each phone segment.
Fig. 2.10 – Illustration of the speech rate for the different texts (represented here by its inverse, that is,
average time per segment). The figure shows the accumulated duration of elapsed segments. Track one is
displayed using a solid line, track two using a dotted line, and so on for the 7 tracks.
Before any remarks on phonetic transcription, two main aspects must be considered: on one
hand, the inherent subjectivity of the transcriber when reporting the speech signal, and, on the
other, the linguistic variation factors. Therefore, being aware of these conditions, an attempt was
made to carry out an accurate and close phonetic transcription of the DB, following coherent
criteria. Some of the questions that have to be taken into consideration when labelling the speech
signal are now described. These are of great importance to the quality of the synthetic speech
subsequently produced, because of their strong impact on phonetic co-articulation events and
consequently on prosodic aspects.
Sociolinguistics explains that each language has a range of regional varieties that may differ in
phonetic, morphological, syntactic or even lexical aspects, though they still belong to the same
language. Political, sociological and historical reasons decide which variety is elected as the
standard and prestigious one. Hence, regional varieties are understood by these classes as
deviations, outsiders or outcasts. Considering language as a social phenomenon, it was decided to
choose standard Portuguese, for its official, institutional and academic importance and extension.
Nevertheless, some of the “dialectal slips” that are legitimate and interesting in a certain way are
going to be described.
These “dialectal slips” originate in relaxed articulation habits that sometimes occur even in a
professional speaker. Table 2.7 presents some of these habits that can be identified in the Oporto
region.
Linguistic variation is also related to phonetic context and inter-segmental co-articulation
phenomena. Beyond the classic well-known EP distribution features of the phonemes /l/ or /s/, this
DB allows an experimental and faithful report of Portuguese phonetic reality, especially concerning
suppressions, additions and allophones.
From the labelling of this DB, it can be observed that the vowels [@] and [u] are often practically
omitted, at every possible position in the word (beginning, middle, or end), except in tonic syllable
position. In any case, these phenomena occur in non-stressed syllables, thus producing unexpected
consonant clusters (Table 2.8).
These phenomena occur when two vowels of different qualities come together in an utterance.
Two events are expected:
- the two vowels melt and experience a quality change; this occurs between non-closed vowels
(e.g. <fica admirado> [fikadmiradu]; <contra o> [kõtrO]).
- one of the vowels, the closed one, [@] or [i], is reduced and becomes a semivowel; the result is
a diphthong (e.g. <se aprende> [sj6pre~d]; <na idade> [n6jdad]).
The above-described events are ancient and have existed in a conscious domain since Latin
literature, which used this knowledge for metrical and rhythmic purposes.
2.3.5.2.3 ADDITIONS
It is also common to produce reduced vocalic sounds, so-called “schwas”, between relaxed
consonantal groups such as the pairs plosive/lateral (pl, tl, kl, bl, dl, gl) or plosive/trill (pr, tr, kr,
br, dr, gr): e.g. <branco> [b@rãku].
2.3.5.2.4 ALLOPHONES
Using the common definition of an allophone in phonology as a variant of a phone, when
analysing the speech signal's physical and acoustical characteristics it can be observed that no two
equal phones can be found; they all have a certain degree of dispersion which allows them to vary
according to the speaker's mood, age, health, condition or other factors. Even so, there are some
essential features that remain intact and that carry the information conveyed. Additionally, there is
some contextual interference from neighbouring phones that changes phones so much that they can
only be recognized by the phonological structure of the word and its connections to the psycho-
cognitive meaning. Some of those changes motivated by the articulatory context are listed and
explained below:
- the <-te> syllable in word-final position followed by a pause: the closed reduced vowel [@] is
acoustically weak and its presence is not absolutely necessary for communicative success, which
causes its reduction; the plosive “fricatizes” into the voiceless fricative consonant closest to its
articulation point, [s]; we can observe this phenomenon in the database: e.g. <sete> [sEt].
- <-r> in word-final position followed by a pause: it is a different [r], longer in duration and
usually voiceless.
- <l> in a closed syllable: as this phoneme's contextual variant is already acknowledged in
Portuguese phonetics, we labelled it with a stipulated code, [l*], because of its great distinctive
acoustical importance.
- The “fricatization” of voiced plosives in an intervocalic context (< -b- > → β; < -d- > → ð;
< -g- > → γ) can also be observed.
Those seven tracks were also separated into paragraphs with their respective labels. Thus, the
total of 21 minutes of labelled speech is available in the format of seven tracks of seven newspaper
texts or as a set of 101 paragraphs.
Those 101 paragraphs were later prosodically labelled with accent commands and phrase
commands according to the Fujisaki model of F0 [Fujisaki et al., 2001], as described in chapter 4.
2.4 Syllabification
This work consists of an algorithm that carries out syllabic splitting automatically, as a stage in
the development of a more extensive work, the study of prosodic models for EP.
The syllabic splitting work [Gouveia et al., 2000] is conceived for application in two distinct
situations: in the first it is applied to the written text and in the second to the sequence of phonemes
actually produced in the reading of this text. Each of the applications has its peculiarities and
difficulties, which are described, as well as the solutions adopted. In the first case an error rate of
0.06% is obtained and in the second case the error rate is 0.89%. The algorithm is based on the
consideration of syllables of types V, VC, VCC, CV, CVC, CCV and CCVC, V being a vowel or
diphthong and C a consonant. It is assumed that these syllable categories cover all the existing
syllable realizations in Portuguese.
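The type inventory can be checked mechanically by reducing a syllable to its C/V skeleton. A sketch with a deliberately simplified vowel set (the thesis treats a whole diphthong as a single V, which the vowel-run collapse below approximates):

```python
VOWELS = set("aeiou")  # simplified; accented vowels etc. are omitted

# The syllable types assumed by the algorithm
LEGAL_TYPES = {"V", "VC", "VCC", "CV", "CVC", "CCV", "CCVC"}

def skeleton(syllable):
    """Map a syllable to its C/V pattern; a vowel run counts as one V."""
    pattern = "".join("V" if ch in VOWELS else "C" for ch in syllable)
    out = []
    for ch in pattern:
        if ch == "V" and out and out[-1] == "V":
            continue  # collapse a diphthong into a single V
        out.append(ch)
    return "".join(out)

def is_legal(syllable):
    return skeleton(syllable) in LEGAL_TYPES
```

For example, `skeleton("tras")` gives `"CCVC"` (legal) and `skeleton("mai")` collapses the falling diphthong to `"CV"` (legal), while a hypothetical `"stra"` maps to `"CCCV"`, which is outside the assumed inventory.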
2.4.1 Introduction
It is commonly accepted by authors of prosody models for other languages that the syllable is an
important unit in the determination of prosodic parameters such as phoneme durations or
fundamental frequency variations in speech synthesized from text. Since the aim of this work is the
construction of prosody models for a TTS system, a process is necessary to automatically split
words into syllables. These words can be in a written text or given as the sequence of phonemes
produced by the speaker.
In preceding studies for Portuguese [Catarino, 2000] and for Spanish [Benenati, 2000], it was
observed that there are several common syllable-splitting rules. It was also observed that the sets of
rules in the two references are not consistent enough to allow their implementation in an algorithm
for automatic syllabic splitting. Besides, they are not sufficient to solve all the cases. In these two
references there are some contradictory rules, such as the splitting of the digraphs <rr> and <ss> in
[Catarino, 2000] and the non-splitting of the same digraphs in [Benenati, 2000].
The rules presented in [Catarino, 2000] for Portuguese are the following:
− Diphthongs and triphthongs are not divided;
− Vowels forming a hiatus1 must be separated;
− The following digraphs are not split: <ch>, <lh>, <nh>, <qu>, <gu>;
− The following digraphs must be divided: <rr>, <ss>, <sc>, <sç>, <xc>;
− Impure consonantal clusters must be separated;
− Identical vowels and the consonant groups <cc> and <cç> must be separated;
− A consonant at the end of a prefix must be linked to the previous syllable if the word continues
with a consonant, or linked to the next syllable if it continues with a vowel.
The rules presented in [Benenati, 2000] for Spanish are the following:
− Whenever possible a syllable must finish with a vowel;
The previous sets of rules aim at the syllabic splitting of written text. In contrast, the
syllabification intended in this work must separate “phonetic syllables”. Therefore, syllables must
be separated the way they are pronounced; it makes no sense, for instance, to split the digraph <rr>
into different syllables, because together its letters are produced as just one phoneme.
In spite of the previous contradictions, syllabic division is a more or less objective question,
except in those situations where two vowels may form a hiatus or a diphthong, and in some cases
of consonant clusters. As rising diphthongs2 are unstable according to [Cunha and Cintra, 1997]
and [Bergström and Reis, 1997], and can be uttered as hiatus or diphthong, the cases where two
vowels may form a rising diphthong or a hiatus can always be treated as hiatus. When the vowel
sequence indicates a falling diphthong, very frequently it really is a diphthong. In any case, as will
be shown in the results section, the mistakes found are exclusively a few very rare situations of a
two-vowel sequence being erroneously interpreted as a falling diphthong.
Consonant clusters in word-medial position are in several situations (<bc>, <bd>, <bj>, <bs>,
<bt>, <cm>, <cn>, <ct>, <dj>, <dm>, <dq>, <cç>, <ds>, <dv>, <fn>, <ft>, <gd>, <gm>, <gn>,
<mn>, <pç>, <pn>, <ps>, <pt>, <tm> and <tn>) an ambiguous question from the viewpoints of
descriptive linguistics and psycholinguistics. From the point of view of descriptive linguistics,
these consonant clusters should be divided into different syllables, generating a division of the type
<rit-mo> – ‘rhythm’. On the other hand, from the psycholinguistic point of view, they should not
be divided, resulting in a division of the type <ri-tmo>. According to [Cunha and Cintra, 1997],
both are possible in a tense pronunciation. Yet for the same authors those consonant clusters at the
beginning of words are indivisible (e.g. <psi-có-lo-go> – ‘psychologist’). Both points of view were
implemented in different versions of the algorithm, but only the results of the first are reported.
As mentioned before, the syllabic division operations were applied to written text and spoken
text.
The first situation considers the grapheme sequence as written in the text after some
pre-processing.
The second situation, aiming only at the prosodic analysis, considers exactly the phones
resulting from the phonetic transcription of the FEUP-IPB database, as described in the previous
section. This case has the following type of sequence (using the symbols of Table 2.5):
2 Rising diphthongs consist of a sequence of semi-vowel followed by vowel. Falling diphthongs
consist of the opposite sequence, vowel followed by semi-vowel.
This short example of ‘spoken text’ corresponds to the text “Um porto vintage aqueceu o título.
Arraia miúda cresce a olhos vistos” – ‘A vintage port warmed the title. Small teams rise in
classification’.
The semi-vowels are grouped with vowels to form diphthongs. A semi-vowel surrounded by two
vowels is grouped with the previous vowel, forming a falling diphthong instead of a rising
diphthong, for the same reason pointed out before.
Syllables of written text with suppressed vowels become quite difficult to identify (e.g.
<futebol> [ftbOl] – ‘football’). New consonant clusters come out, formed from consonants
originally belonging to different syllables. As the objective is to identify the original sequence, this
leads to the consideration of syllables formed just by consonants, where the vowels were
suppressed. The word boundary and beginning-of-tonic-syllable codes are used as syllable
boundaries, facilitating the correct identification of those syllable boundaries.
The melted vowels of a two-word sequence introduce the problem of automatically deciding
which word they belong to.
The phenomenon of addition, as discussed in 2.3.5.2.3, is also very frequent, as is suppression
forming a new legal syllable (e.g. <bran-co> [b@-r6~-ku], <pa-ra> [pr6]); this again makes it
difficult to identify syllable boundaries matching the ones produced in the written form.
Section 2.4.2 describes the set of rules and their implementation for the written text. Section
2.4.3 describes the set of rules and their implementation for the spoken text. Section 2.4.4 presents
the results for both implementations and an error analysis. Section 2.4.5 presents the conclusions of
the syllable splitting rules.
2.4.2.1 Rules
The set of rules considered for syllabic splitting is based on the supposition that any EP syllable
belongs to one of the following groups: V, VC, VCC, CV, CVC, CCV and CCVC, where C means
a phonetic consonant and V a vowel or diphthong. This supposition is an enormous contribution to
the process of detecting syllable boundaries. The small number of cases not solved by this
supposition alone demands complementary rules. Just two types of situation are not solved by it:
the first case is two consonants between vowels (...VCCV...) and the second is three consonants
between vowels (...VCCCV...).
The first case is solved by the rule that the syllable boundary cannot separate a vowel from the
consonant before it (C-V); if the two consonants form an inseparable pair, that is, the first
consonant belongs to the group <b, p, d, t, g, k, v, f> (<k> corresponds to one of the letters <k>,
<c> or <q>) and the second one belongs to the group <l, r> (e.g. <a-tlas>), then the two consonants
start a new syllable; if not, the boundary will necessarily be between the consonants (e.g. <al-tas>).
The second case, (...VCCCV...), is solved by the following rule: as a sequence of three
consonants cannot belong to the same syllable, the boundary will be between the second and the
third consonants if the first two consonants form an inseparable pair or if the second consonant is
an <s>, since the consonant <s> preceded by another consonant places the boundary between the
consonants (e.g. <obs-tar>); if not, the boundary will necessarily be between the first two
consonants (e.g. <ul-tra>).
When two or more vowels occur in sequence, it is necessary to verify whether they form a
falling diphthong or a hiatus3. For falling diphthong detection, the sequence of a vowel followed by
a semi-vowel is searched for. The letters <i> and <u> are considered phonetic semi-vowels when
not preceding an <r> or <l> that is the last letter of the word or the first of two or more consonants
(e.g. semi-vowels: <cai>, <cai-ro>; hiatus: <ca-ir> and <ca-ir-mos>). They are not considered
semi-vowels when preceded by the same letter (e.g. <ni-ilismo>), when preceding the vowel <u>
(e.g. <ca-iu>), or in the case of nasalization (e.g. <a-in-da>). Finally, the letter <o> preceded by the
letter <a> is also considered a semi-vowel (e.g. <ao>).
2.4.2.2 Algorithm
The implementation was done in C. Fig. 2.11 illustrates the flow chart responsible for one word
split. The word to be processed is stored in the string designated by pal, being represented as in C
language. The character ‘\0’ is used for word end and the first string character has the index 0. The
variable i is the index of the grapheme of the word been processed. The functions vowel(x) and
semivow(x) have the function of identify if character x is phonetic vowel or semi-vowel, respec-
tively. Function put(x) sends to the output string the character x. The function seg(x,y) allows to
verify if x and y form one pair of inseparable consonants. Finally, the character ‘.’ is used as a syl-
lable boundary (e.g. being the word <fluxograma> the input string, the result is the string
<flu.xo.gra.ma>).
Fig. 2.11 – Flow chart for the syllabic splitting of one word of a written text. V: vowel; C: consonant;
...: any sequence of graphemes; .: syllable boundary; ?: grapheme not yet determined; bold: grapheme
already stored in the output string; underline: grapheme pointed to by index i.
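The flow chart logic can be sketched as follows. The thesis implementation is in C and handles diphthongs via semivow(x); this Python sketch keeps the names pal and seg, but simplifies vowel classification to a plain character set and treats every vowel letter as a nucleus (no diphthong/hiatus handling), so it is an approximation of the VCCV/VCCCV rules only:

```python
VOWELS = set("aeiou")          # simplified vowel test (stands in for vowel(x))
STOPS = set("bpdtgkcqvf")      # first member of an inseparable pair; k ~ k/c/q
LIQUIDS = set("lr")            # second member of an inseparable pair

def seg(c1, c2):
    """True if c1+c2 form an inseparable consonant pair (e.g. 'tl', 'br')."""
    return c1 in STOPS and c2 in LIQUIDS

def split_word(pal):
    """Insert '.' at syllable boundaries, following the VCCV/VCCCV rules."""
    out, i, n = [], 0, len(pal)
    while i < n:
        while i < n and pal[i] not in VOWELS:   # copy onset consonants
            out.append(pal[i]); i += 1
        if i < n:                               # copy the nucleus vowel
            out.append(pal[i]); i += 1
        cluster, j = [], i                      # consonants up to next vowel
        while j < n and pal[j] not in VOWELS:
            cluster.append(pal[j]); j += 1
        if j >= n:                              # word-final coda: no boundary
            out.extend(cluster); i = j
        elif not cluster:                       # ...V.V... (hiatus)
            out.append(".")
        elif len(cluster) == 1:                 # ...V.CV...
            out.append("."); out.append(cluster[0]); i = j
        elif len(cluster) == 2:
            if seg(cluster[0], cluster[1]):     # ...V.CCV... (e.g. a-tlas)
                out.append("."); out.extend(cluster)
            else:                               # ...VC.CV... (e.g. al-tas)
                out.append(cluster[0]); out.append("."); out.append(cluster[1])
            i = j
        else:                                   # three consonants
            if seg(cluster[0], cluster[1]) or cluster[1] == "s":
                out.extend(cluster[:2])         # ...VCC.CV... (e.g. obs-tar)
                out.append("."); out.extend(cluster[2:])
            else:                               # ...VC.CCV... (e.g. ul-tra)
                out.append(cluster[0]); out.append("."); out.extend(cluster[1:])
            i = j
    return "".join(out)
```

This reproduces the examples given in the text: `split_word("fluxograma")` yields `flu.xo.gra.ma`, and the pairs <al-tas>/<a-tlas> and <obs-tar>/<ul-tra> come out as described.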
2.4.3.1 Rules
In this case the distinction between diphthongs and hiatus is simplified because the semi-vowels
are already identified by their respective labels.
The major problem is caused by the suppression of several vowels, which creates consonant clusters belonging to different syllables and makes it difficult to identify syllable boundaries matching the respective written text. The suppression phenomena lead to the consideration of two more abstract syllable types, C and CC, besides the ones listed in 2.4.2.1. The new types appear in syllables of types CV, CVC and CCV where the vowel was suppressed. In syllables of type CCVC these suppression phenomena are very rare (e.g. [p@nEtr6S]). Thus, the option was not to consider syllables of type CCC, avoiding very frequent erroneous boundary identifications.
The syllabic splitting of spoken text is also based on the assumption that any syllable belongs to one of the types V, VC, VCC, CV, CVC, CCV, CCVC, C or CC. However, due to the additional difficulties introduced by the vowel suppression phenomena, an additional set of rules is needed for those specific situations:
− The consonants [l, r, S, z, Z] followed by another consonant always precede a syllable boundary (e.g. [sal-tu]). This group also includes the consonant [Z] resulting from the voicing of the unvoiced consonant [S] in a voiced context (e.g. [meZ-mu]).
− The consonants [S, z, Z] in word-final position belong to the previous syllable4.
− A vowel followed by one of the pairs of consonants {[bk], [bd], [bZ], [bs], [bt], [km], [kn], [kt], [ks], [dZ], [dm], [dk], [ds], [dv], [fn], [ft], [gd], [gm], [gn], [mn], [ps], [pn], [pt], [tm], [tn]} inserts a syllable boundary between the consonants, producing a syllable of the type VC (e.g. [ap-tu]). The same pairs of consonants at the beginning of a word are not separated (e.g. [pnew]) [Cunha and Cintra, 1997].
2.4.3.2 Algorithm
Before applying the syllabic splitting algorithm to each word, the occlusion marks (!) and the semi-vowel marks are removed from the string of phonetic symbols and their original positions are stored. After the syllabic splitting, those marks are re-introduced at their original positions to restore the correct sequence of segments (occlusive consonants and diphthongs).
The algorithm inserts one syllable boundary at a time, returning to its beginning to find the next boundary until the end of the word.
4 This rule originates some unrecognised syllables in words whose last syllable originally consisted of one of those consonants followed by a suppressed vowel (e.g. original phonetic word [Ri-a-Z@], unrecognised syllable [Ri-aZ]).
An execution cycle of the algorithm presented in Fig. 2.12 ends with a decision on the type of syllable detected: V, CV, CVC, etc.
The function F(x) reads the phoneme with index x. The phoneme can belong to one of the following groups: V – vowel; C – consonant; V1 – one of the vowels [a, 6, O, o]; C1 – one of the consonants [b, p, d, t, g, k, v, f]; C2 – one of the consonants [l, r]; C22 – one of the consonants [S, z, Z]; C3 – one of the consonants [l, r, S, z, Z]; C-C3 – a consonant not in group C3; ac – the phoneme carries the tonic-syllable marker (only the first phoneme of a syllable can carry this marker); fp – end of word (the last phoneme of the word precedes this mark).
The function cond(a,b) returns the logical value 1 (yes) if (F(a)=C1 and F(b)=C2), or if the syllable is at the beginning of the word and F(a)F(b) is one of the sequences {[bk], [bd], [bZ], [bs], [bt], [km], [kn], [kt], [ks], [dZ], [dm], [dk], [ds], [dv], [fn], [ft], [gd], [gm], [gn], [mn], [ps], [pn], [pt], [tm], [tn]}; otherwise it returns the logical value 0 (no).
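The pair test can be sketched directly in C. In this sketch the SAMPA symbols are kept as single characters and a word_initial flag stands in for the "syllable at the beginning of word" condition:

```c
#include <string.h>

/* Sketch of cond(a,b): returns 1 when the consonant pair a,b should stay
   together (C1 followed by C2, or one of the listed pairs at the start of
   a word), 0 otherwise. SAMPA symbols are represented by single chars. */
int cond(char a, char b, int word_initial) {
    static const char *pairs[] = {
        "bk", "bd", "bZ", "bs", "bt", "km", "kn", "kt", "ks", "dZ", "dm",
        "dk", "ds", "dv", "fn", "ft", "gd", "gm", "gn", "mn", "ps", "pn",
        "pt", "tm", "tn"
    };
    if (strchr("bpdtgkvf", a) && (b == 'l' || b == 'r'))
        return 1;                          /* C1 + C2: e.g. [pr], [fl] */
    if (word_initial) {
        char p[3] = { a, b, '\0' };
        for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
            if (strcmp(pairs[i], p) == 0)
                return 1;                  /* e.g. [pn] in [pnew] */
    }
    return 0;
}
```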
The decision process splitting the word [kaz6] is presented as an example. F(1) is the phoneme [k]; since it is a consonant, the algorithm takes the right branch and reads F(2). As F(2) is the phoneme [a], a vowel, the algorithm proceeds by the left branch and reads F(3). As F(3) is the consonant [z], which belongs to group C3, the left branch is taken and F(4) is read. As F(4) is the vowel [6], the decision is taken for a syllable of the type CV, inserting the boundary after [ka]. In the new cycle the algorithm reads the new F(1); as it is the consonant [z], the algorithm proceeds by the right branch and reads F(2). As F(2) is now the vowel [6], the algorithm follows the left branch and reads F(3). As F(3) is the end of word (fp), the new syllable is also of the type CV. The splitting of the word is finished, producing the syllables [ka-z6].
Fig. 2.12 – Flow chart for the syllabic splitting of one word of spoken (phonetic) text.
The presented results were measured using a test set of texts different from the one used in the development and refinement process. Both sets belong to the corpus used in the FEUP-IPB database.
The written text algorithm was tested with a set of non-repeated words taken from five texts of the mentioned corpus. Only words with more than two letters were considered. The algorithm committed just two mistakes in a total of 1164 words and 3387 syllables, corresponding to an error rate of 0.06% per syllable. Both errors correspond to a hiatus wrongly interpreted as a falling diphthong (<cai-re-mos> and <reu-ni-ão>). This error takes place when a vowel is followed by the grapheme <i> or <u> that, exceptionally, does not behave as a semi-vowel (the rules classify this generic case as a falling diphthong). The identified errors have no immediate solution, since other words of the same kind behave differently (e.g. <Cai-ro> and <reu-má-ti-co>). Considering the syllabic context would probably help to solve these cases.
The spoken text algorithm was tested with two texts from the mentioned corpus. Monosyllabic words were not considered. The algorithm produced fourteen (14) mistakes in a total of 1569 syllables, corresponding to an error rate of 0.89%. The mistakes took place in seven different words (this test admitted repeated words), at the underlined phonemes: <futebol> [ft-“bol], <evidentemente> [iv-de~-t-“me~-t], <ministério> [mniS-“tE-riw], <irresponsabilidade> [iRS-po~-s6-bli-“da], <industrial> [i~d-S-tri-“al], <acusação> [6k-z6-“s6~w], <demonstração> [dmo~S-tr6-“s6~w]. Tonic syllables are identified by <“>. All mistakes took place in syllables where the vowel was suppressed and the consonants were associated with the neighbouring syllable. Situations where the spoken text boundaries do not follow the written text boundaries but the produced syllables are phonetically ‘admissible’ (e.g. <pa-ra> uttered as [pr6]) were not counted as errors.
2.4.5 Conclusions
The developed algorithms show different error rates in the two applications (written and spoken text), as expected, due to the additional difficulty introduced in spoken text by the vowel suppression phenomena. Nevertheless, in both cases the error rate is very low: 0.06% and 0.89% for written and spoken text, respectively.
Unfortunately, there are no other published works with measured results available for comparison, but in both cases the results fulfil the objectives.
Previous works on grapheme-phone conversion for EP were presented in [Trancoso et al., 1994], [Teixeira, 1995] and, more recently, [Caseiro and Trancoso, 2002]. The work presented by Teixeira implements the grapheme-phoneme conversion of the EP version of the MULTIVOX TTS system [Teixeira et al., 1998] and proceeds in two phases. The first phase applies a list of rules, presented in a tabular format, that converts sequences of graphemes into sequences of phoneme codes. These rules specify the elementary conversion of graphemes, sequences of graphemes, words and parts of text. The second phase consists of a programmed application of more complex rules that corrects several sequences of the previous phase. Diamantino Caseiro and his co-workers presented a description and comparison of a rule-based approach, a data-driven approach by means of Weighted Finite State Transducers (WFSTs) trained with automatically transcribed material, and a hybrid approach. The best score was achieved with the rule system, with an error rate per word of 3.25%, whereas the compilation of that set of rules into WFSTs scored 3.56%. The WFSTs trained by the data-driven approach achieved a 9.02% error rate, and the combination of data-driven and knowledge-based approaches scored 3.94%. Although the rule-based approach achieved the best scores, the combination of data-driven and knowledge-based approaches is very promising.
Filipe Barbosa and co-authors presented a rule-based grapheme-phone transcription algorithm for a Brazilian Portuguese TTS system [Barbosa, Ferrari and Resende, 2003] with an accuracy rate of 98.4% per phone.
Despite the set of rules and the table of exceptions presented, the problem of homograph words remains unsolved. Many of these cases can be solved with knowledge of the morphology of the word, as for <espeto> verb [SpEtu] and <espeto> noun [Spetu], which have different morphological categories; but for words like <sede> noun ‘headquarters’ [sEd@] and <sede> noun ‘thirst’ [sed@], with identical morphological categories, even this information cannot help in the decision. Filipe Barbosa and co-authors presented a work [Barbosa et al., 2003] that disambiguates the word <sede> with an accuracy rate of 95%.
The set of grapheme-phoneme conversion rules described in this section was implemented in the FEUP-TTS system. The list of rules is not complete yet, but it solves almost all cases. Only some graphemes in EP, <a, e, o, x>, are more complex to convert by rules; indeed, they cannot be completely described by rules alone, without morphological knowledge or even knowledge of the origin of the word. Therefore a special list of rules and their exceptions was developed for those graphemes. The exceptions and other cases not solved by rules can be correctly converted using an additional conversion table.
The produced set of rules for EP incorporates the previous rules reported in [Teixeira, 1995]. As in other languages, EP has graphemes that are univocally converted into one phoneme, requiring a single simple rule; sequences of two graphemes converted into just one phoneme (e.g. <ch> – [S] and <lh> – [L]); and even one grapheme converted into more than one phoneme (e.g. <têm> – [t6~j~6~j~]). These cases are well behaved and always have the same conversion. The major problems arise in the conversion of the graphemes <a, e, o, x>, which can be converted into different phonemes according to the specific case, and no known set of rules solves all the cases. This work concentrates on those graphemes, since the others can be transcribed using rather immediate rules.
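The well-behaved digraph conversions can be sketched as a single left-to-right scan in C. This is an illustrative pass, not the FEUP-TTS rule engine; <nh> – [J] (SAMPA) is added here on the same pattern:

```c
#include <string.h>

/* One left-to-right pass rewriting the fixed digraphs <ch> -> [S],
   <lh> -> [L] and <nh> -> [J] (SAMPA); other graphemes pass through
   untouched, to be handled by later rules. */
void convert_digraphs(const char *in, char *out) {
    size_t j = 0;
    for (size_t i = 0; in[i] != '\0'; ) {
        if (in[i] == 'c' && in[i + 1] == 'h')      { out[j++] = 'S'; i += 2; }
        else if (in[i] == 'l' && in[i + 1] == 'h') { out[j++] = 'L'; i += 2; }
        else if (in[i] == 'n' && in[i + 1] == 'h') { out[j++] = 'J'; i += 2; }
        else out[j++] = in[i++];
    }
    out[j] = '\0';
}
```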
The set of phones used for EP in SAMPA code is presented in Table 2.9.
Fig. 2.13 displays the sequence of processing blocks leading to the phonetic transcription in the FEUP-TTS system. Fig. 2.14 displays the processing sequence of the phonetic transcription block.
The experiment consisted in finding one of the possible phonemes to transcribe the specific grapheme <a> or <e>. In the case of <a>, the possible phonemes considered were [a, 6]; in the case of <e>, the possible phonemes were [E, e, @, i]. For the grapheme <a> a perceptron layer ANN was used, and for the grapheme <e> a feed-forward ANN. About two thousand non-repeated words were used in the training set and another two thousand in the test set, for both graphemes. The list of features for both cases is presented in the following:
• Position of the grapheme's syllable with respect to the tonic syllable (5 possibilities: tonic syllable; previous; before previous; next; after next);
• Closed syllable finished with the <al> sequence (used just in the grapheme <a> ANN).
The relevance of each feature was measured by comparing the performance with and without the specific feature. The first three features (plus the last feature for grapheme <a>) really influence the performance; the other features do not influence the general performance alone, but together their inclusion in the feature set improves the performance.
The output of a perceptron layer is binary, so each level was associated with one of the target phones [a, 6]. The best (lowest) measured error rate (number of errors/number of graphemes) for the grapheme <a> in the test set was 1.7%.
The cases of grapheme <e> that must be converted into the nasal vowel [e~] were previously transcribed by rules. Several ANN output codifications could be used to select one of the four phonemes [E, e, @, i]. The option of 4 nodes was selected, associating one node to each phoneme and selecting the node with the highest output. This grapheme is more difficult to transcribe correctly because four phonemes can be obtained. The best measured error rate was 6.4%.
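The 4-node output decoding reduces to an argmax over the candidate phonemes, as in this sketch:

```c
#include <stddef.h>

/* One output node per candidate phoneme of grapheme <e>; the node with
   the highest activation selects the transcription. */
const char *decode_e(const double out[4]) {
    static const char *phones[4] = { "E", "e", "@", "i" };
    size_t best = 0;
    for (size_t i = 1; i < 4; i++)
        if (out[i] > out[best])
            best = i;
    return phones[best];
}
```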
The solution of transcribing a specific grapheme with a dedicated ANN was not completely explored concerning possible features, like more phoneme context information, or the implementation of some well-known rules. So there is still room for improvement in this matter with the present approach.
The sequence of phones to be produced by TTS systems is finally established after the application of the co-articulation rules. This final block of rules attempts to reduce the distance between the phonetic transcription and the phone sequence actually produced by a speaker. This distance is often created by co-articulation effects between neighbouring sounds, which is why this block is called co-articulation rules.
Grapheme <a> can in general be produced as the phone [a] or [6]. Table 2.10 displays the set of rules implemented in the FEUP-TTS system.
Table 2.10: Rules for the conversion of grapheme <a>, presented by priority order.
The accuracy of the conversion of the grapheme <a> was measured using texts not seen in the development phase. The text contains 5619 <a> graphemes in non-repeated words. The set of rules failed in 19 cases, resulting in an error rate of 0.34%.
Grapheme <e> can in general be produced as the phonemes [E], [e], [@] or [i]. In some particular cases, due to the articulation process, as will be seen in section 2.5.3, this grapheme can also be produced as the phoneme [6]. Table 2.11 displays the set of rules implemented in the FEUP-TTS system.
Table 2.11: Rules for the conversion of grapheme <e>, presented by priority order.
6 An open syllable ends in a vowel; a closed syllable ends in a consonant.
Grapheme <o> can in general be produced as the phonemes [O], [o] or [u]. Table 2.12 displays the set of rules implemented in the FEUP-TTS system.
Table 2.12: Rules for the conversion of grapheme <o>, presented by priority order.
The rules implemented in the FEUP-TTS system for <x> are presented in an algorithm format (easier to understand in this case):

The cases <proxi>, <próxi>, <maxim>, <auxili>, <troux> – [s]
<ex>
    at the beginning of the word
        as prefix – [S] (e.g. <ex-ministro>)
        followed by a vowel – [z] (e.g. <exemplo>)
        followed by a consonant – [S] (e.g. <exposto>)
    in word-medial position
        followed by a vowel
            preceded by <in> at the beginning of the word – [z] (e.g. <inexistente>)
            preceded by <s> – [ks] (e.g. <sexualidade>)
            other cases
                <e> in tonic syllable – [ks] (e.g. <convexo>)
                in non-tonic syllable
                    preceded by consonant + <l> – [ks] (e.g. <flexível>)
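The word-initial <ex> branch above can be sketched in C. Note one assumption of this sketch that the text does not state: the prefix case (<ex-ministro>) is taken to be signalled by a hyphen.

```c
#include <string.h>

/* Phone chosen for <x> when the word starts with <ex>, following the
   branch above. Assumes the prefix case is marked by a hyphen. */
const char *x_after_initial_ex(const char *word) {
    char next = word[2];                   /* character after "ex" */
    if (next == '-')
        return "S";                        /* prefix: <ex-ministro> */
    if (next != '\0' && strchr("aeiou", next) != NULL)
        return "z";                        /* before vowel: <exemplo> */
    return "S";                            /* before consonant: <exposto> */
}
```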
The accuracy of the conversion of the grapheme <x> was measured using texts not seen in the development phase. The text contains 1649 <x> cases in non-repeated words. The set of rules failed in 56 cases, resulting in an error rate of 3.4%.
The phonetic transcription is applied to each individual word. Co-articulation rules will take care of the phoneme modification phenomena that happen, by co-articulation effects, when words are spoken together.
Brinckmann and Trouvain [2003] reported, based on their experiments, that a group of listeners clearly rejected the synthetic speech produced with the lexical form resulting directly from the phonetic transcription, which sounds too unnatural, but perceived no difference between the original form (as uttered by a speaker) and the form produced by post-lexical rules (or co-articulation rules).
8 Crasis – contraction of two non-tonic vowels with similar or equal timbre into just one.
9 <e e> can follow this rule or the suppression of the glide [@].
The set of co-articulation rules presented in Table 2.13 will soon be implemented in the FEUP-TTS system.
3 Duration Model
This chapter describes one of the most important parts of the prosody model developed in this work: the segmental duration model. It starts with an overview of the most recent and prominent duration models. Then some considerations are made about the speech database concerning segmental durations, the architecture of the selected ANN is presented, some training functions are introduced, the training process is explained, and the set of features is detailed. Besides the proposed model, with one ANN predicting all segmental durations, an alternative model based on one ANN dedicated to each type of segment is also proposed. Results of both models are discussed. Finally, a simple model to insert pauses and predict their durations is presented.
3.1 Introduction
In this work, the word duration refers to the period of time that a given speech unit lasts; in other words, its length. Distinct speech units have been considered across different models and languages. Some authors use highly distinctive units such as syllables [Campbell and Isard, 1991] or Inter-Perceptual-Centre-Groups (IPCG) [Barbosa and Bailly, 1994]. Basically, IPCGs are speech segments between the beginnings of vowels or from the beginning of initial syllables. Others use models to estimate the length of the phonemes themselves [Córdoba et al., 1997]. Throughout this work, speech units will be regarded as segments. These segments are either a clearly indivisible part of a phoneme, e.g. the occlusion and explosion into which plosives can be divided, or the phoneme itself.
Ferreira [1998] claims that since phonological syllables in European Portuguese derive from the
collapse of weaker syllables, they cannot be regarded as rhythmic units, as opposed to other
languages.
A correct utterance requires the durations of its segments to fit together harmoniously. It is accepted that this prosodic parameter follows the F0 contour as the second most important parameter for achieving naturalness in speech. If we consider it to belong to the rhythmic dimension of prosody, then we also have to count the different types of breaks and their corresponding durations as part of this dimension.
Different types of phonemes have different degrees of elasticity concerning their durations. The standard deviation of a sufficiently vast amount of measured durations of a segment or phoneme is a reliable elasticity indicator [Campbell and Isard, 1991]. Thus, generally speaking, vowels have more elasticity than consonants. Exceptionally, some fricative consonants, [f], [s], [S], and the velar [l], have elasticity similar to vowels in Portuguese [Teixeira et al., 2001].
The difficulty in handling this matter lies in the set of features that may influence the duration of a segment, in their degrees of influence and in the way they correlate. Generally these features are not linearly independent and cannot form an orthogonal basis [van Santen, 1994]. Segments occupying stressed-syllable positions have larger durations; therefore, this feature should be taken into consideration, as well as others which, with more or less detail, characterize the context, such as the identity of the surrounding segments, within-word position, phrase position, etc. Semantic features such as emphasis, intonation groups or sentence type, prosodic features such as pitch accent levels, and syntactic features such as word class may also be used. The choice of features should not disregard whether the system in which the model is included is able to determine them.
Since some of the presented features aren’t linearly independent, their effects cannot be added to
others’ when measuring duration [van Santen, 1994]. One way to deal with this problem is to use
quasi-minimal sets of feature vectors and compare the average duration of the segments on those
sets to acknowledge the dependency between features. These quasi-minimal sets of features consist
of two sets of condition vectors in which all features but one match.
Syntactic features have a strong influence on the prosodic structure, and since the duration structure depends on the latter [Zellner, 1994], that would be reason enough to add them to the list of useful features. However, they do not appear in most TTS linguistic analysis models: syntactic analysis tools are still very costly for the system as a whole, which is why not all TTS systems include them.
Chapter 3 – Duration Model
Duration models use and handle the relevant features distinctively, and the traditional models can be distinguished by the way they handle them. Thus, there are rule-based models, such as the Keller-Zellner model, which apply more or less complex rules to lengthen or shorten the duration of the segments; mathematical models, such as the Klatt and van Santen models, which combine multiple features into a single expression, usually a sum-of-products, that establishes the duration of the segments; and finally statistical models, which apply generic tools such as Classification and Regression Trees (CARTs) or ANNs and take the sets of features as input to predict the duration of the segments. Some models combine several of these approaches, as do the Campbell and the Barbosa-Bailly models, which combine neural networks and mathematical models.
The next section presents the state of the art concerning duration models. Section 3.3 describes the proposed model, based on ANNs: it starts with some considerations on the speech database concerning segmental durations, then describes the ANN architecture and the training process. The set of features is discussed in section 3.3.4. The model is evaluated and its results discussed in section 3.4. In section 3.5, a variation of the model is presented as the alternative model. This variation basically consists in splitting the task of predicting segmental durations, performed by one ANN, among 44 dedicated ANNs. Since this alternative model improves the results, it will be taken as a serious candidate to be perceptually evaluated in chapter 5. The chapter ends with a simple proposed model to insert pauses and predict their durations.
Rule-based models should allow straightforward knowledge of the effect of each feature on the duration of the segments. Examples of this type of model are the Klatt rule-based model [Klatt, 1976], the rule-based algorithm for French presented by Zellner for different speech rates [Zellner, 1998], and the look-up table for Galician [Salgado and Banga, 1999].
Mathematical models usually appear as a Sum-of-Products, where the features are statistically
weighted and summed to produce the segmental duration [van Santen, 1994].
Statistical duration models have become more and more used with the availability of large phonetically labelled databases. Neural networks and regression trees are the most often used tools, applied in different ways for different languages and using different types of segments. Campbell [1993] introduced the concept of z-score to distribute the duration estimated by a neural network for a syllable among its segments, arguing that the syllable is the more stable unit. Barbosa and Bailly presented a two-step model for French [Barbosa and Bailly, 1997] and Brazilian Portuguese [Barbosa, 1997]. In the first step, using a neural network, they estimate the duration of the Inter-Perceptual-Centre-Group (IPCG), arguing that it is the more stable unit. In the second step they distribute the duration of the IPCG among its segments, using the z-score concept. This model can deal with different speech rates and with pauses. Other neural network-based models were presented for Spanish [Córdoba et al., 1999] and Arabic [Hifny, 2002]. An example of a CART-based model applied to Korean can be found in Chung [2002].
Some recent, successful duration models are now briefly described in terms of results and application to Text-To-Speech systems.
The model consists of an equation, Eq. (3.1), which is applied successively to a sequence of segments, starting with an initial or inherent segment.

Dp = Dmin,p + k × (Din − Dmin,p)        Eq. (3.1)

Here, Dp is the predicted duration for segment p, Dmin,p is the minimum duration for segment p, and Din is the output of the preceding rules. For the first segment of the sequence, Din equals the inherent duration of segment p. Finally, k is a parameter reflecting the contribution to duration of a set of features, expressed by the following rule:
k = kf1 × kf2 × … × kfN        Eq. (3.2)

where kfi is the value of feature i. k takes a value between 0 and 1 for shortening rules and greater than one for lengthening rules.
This model is based both on rules and on mathematical modelling. Moreover, it requires minimum and inherent duration values for each segment.
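Eqs. (3.1) and (3.2) combine into a small routine. A sketch in C (the factor values themselves come from Klatt's rules and are not reproduced here):

```c
#include <stddef.h>

/* Klatt rule chain: the feature factors k_fi multiply into a single k,
   which scales the stretch of the inherent duration above the minimum. */
double klatt_duration(double d_min, double d_inherent,
                      const double *k_f, size_t n_rules) {
    double k = 1.0;
    for (size_t i = 0; i < n_rules; i++)
        k *= k_f[i];                           /* Eq. (3.2) */
    return d_min + k * (d_inherent - d_min);   /* Eq. (3.1) */
}
```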
Si,j is the parameter that associates feature j, and possibly the correlation between features i and j, with the duration of segment p.
For a given set of features, several sums-of-products may be generated; the possibilities increase with the number of features.
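Written out, the generic sum-of-products form is as follows. This is a hedged reconstruction in the notation used above, since the equation itself did not survive in the text:

```latex
\mathrm{DUR}(f_1,\dots,f_N) \;=\; \sum_{t \in T} \; \prod_{i \in I_t} S_{i,t}(f_i)
```

where each product term t multiplies the parameters S of the features in its index set I_t, and the terms are summed to give the segmental duration.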
This model is basically a generalization of several existing models, namely of the previously described Klatt model. It is also used in the Jan van Santen model, which will now be briefly described.
The system is composed of a tree (Fig. 3.1) that can handle the linguistic heterogeneity of the
segments, allowing a separate treatment for each category and its own sum-of-products model at
the end of the tree. Each model differs from the remaining because the features affecting each
category also differ. For instance, the features affecting vowel duration are different from the ones
affecting intervocalic consonants. A second category classification distinguishes consonants according to their articulation and voicing: voiceless plosives, voiceless affricates, liquid consonants and glides, voiceless fricatives, nasals, voiced plosives, voiced affricates, voiced fricatives and aspirates. In addition, plosives and affricates are divided into two moments: occlusion and burst. There are tables of predicted parameter values for each model.
Fig. 3.1 – Tree of segment categories. Recovered node labels: All cases; Vowels; Consonants; Intervocalic; In Clusters; Coda.
• Pitch accent;
• Syllabic stress;
• Vowel identity;
• Phrasal position.
• Within-word position;
• Phrasal position.
The models for consonants in clusters as syllable onsets, phrase-medial codas and phrase-final
codas employ the following features distinctively:
• Syllable boundary;
• Silence.
The system has a sum-of-products model for vowels and several models for consonants.
The reported results refer to a correlation coefficient, over all segment types, of 0.93 for the parameter-estimation database and 0.884 for other databases, which is excellent. Perceptual tests compared this model with a previous one based on hundreds of duration rules; the van Santen model obtained an overall preference of 73%.
Final syllable duration and final segment duration increase according to the previous component. This increase goes from a minimum to a maximum empirical value, initially taking the same steps. It corresponds to the final lengthening usually observed in speech.
Rhythmic variance was also observed in post-verb position and within 4-to-6-word components. Rhythmic variance occurs when the lengthening of one element exceeds what is strictly necessary; consequently, the following element must be shortened so that the component ends “in time”. This leads to the inversion of the durations of variant word pairs.
The linear correlation between predicted and measured values reported by the author is never inferior to 0.7 for the final syllable plus pause, and usually around 0.8.
Later, in her PhD thesis, Brigitte Zellner [1998] suggested another duration model, also for French, which proceeds in two phases. The first phase predicts the syllable duration based on the type of word the syllable belongs to (lexical vs. grammatical), the position of the syllable in the word, group, sentence, etc. The second phase distributes that duration among the component segments of each syllable. The logic of that distribution varies with the type of syllabic structure.
When estimating syllable duration in the first stage, the author employs the following six parameters:
• X2 – Temporal groups (10 groups: minor initial; major initial in the beginning of the
sentence; minor initial after pause; major initial after pause; intermediate position; minor
final; minor final before pause; major final before pause; major final; major final in the
end of the sentence);
1 Each syllable contains a set of segments. The author attributes a set of duration classes to each set of
segments. It took 158 different combinations of segments to translate all the syllables in her study.
These parameters are combined into a linear model according to the following Eq. (3.4):
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6        Eq. (3.4)
Two models are considered, one for a fast speech rate and another for a slow speech rate. The statistically obtained coefficients are considerably different for the two models. For the fast speech rate the duration is strongly conditioned by the segmental class types (the segments are intrinsically long or short), whereas for the slow speech rate there is a higher degree of syllable elasticity, since duration is highly dependent on the number of segments. This work also tested a neural network model using the same parameters, but with worse results.
In the second stage, the syllable durations are distributed over the segments each syllable contains, according to the following algorithm [Zellner, 1998:139], for both speech rates:

If the syllable has a single segment,
    attribute the duration to that segment.
Otherwise,
    add the durations of the intermediate segments according to their classes;
    determine the difference between the predicted syllable duration and the sum of segmental durations;
    if the difference is not null, adjust:
        if the syllable has 2 segments,
            attribute the MAX or MIN value to the first segment;
            re-determine the difference between the predicted syllable duration and the sum of segmental durations;
            if the difference is still not null, adjust:
                attribute the MAX or MIN value to the second segment;
                re-determine the difference between the predicted syllable duration and the sum of segmental durations;
                if the difference is still not null,
                    adjust the nucleus so that the syllable has the predicted value;
        if the syllable has 3 segments,
            attribute MAX and MIN values to every segment;
            re-determine the difference between the predicted syllable duration and the sum of segmental durations;
            if the difference is still not null,
                adjust the nucleus downwards or upwards.
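The repartition idea above can be sketched roughly as follows; the class MIN/default/MAX values and the exact clamping order are invented placeholders, not Zellner's actual figures.

```python
# Sketch of syllable-to-segment repartition (after Zellner): give each
# segment a class-based default duration, then absorb the mismatch with
# the predicted syllable duration by clamping non-nucleus segments to
# their class MIN/MAX and letting the nucleus take the remainder.

CLASS_DUR = {"C": (30, 60, 120), "V": (50, 90, 180)}  # (MIN, default, MAX), ms

def distribute(predicted_ms, segment_classes, nucleus_index):
    durs = [CLASS_DUR[c][1] for c in segment_classes]
    if len(durs) == 1:
        return [predicted_ms]           # single segment gets the whole duration
    diff = predicted_ms - sum(durs)
    for i, c in enumerate(segment_classes):
        if i == nucleus_index or diff == 0:
            continue
        lo, _, hi = CLASS_DUR[c]
        new = min(max(durs[i] + diff, lo), hi)   # clamp to class MIN/MAX
        diff -= new - durs[i]
        durs[i] = new
    durs[nucleus_index] += diff         # nucleus absorbs any remaining difference
    return durs
```

For example, `distribute(200, ["C", "V", "C"], 1)` shrinks the first consonant and keeps the nucleus at its default, so the segment durations sum to the predicted 200 ms.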
The author presents a duration model for two speech rates. The result is given as the correlation coefficient between a sequence of predicted values and a sequence of values produced by a speaker, presented separately for the two stages of the model. For syllable duration, the correlation coefficients are 0.80 and 0.73 for the fast and slow speech rates, respectively; for segmental duration prediction, the coefficient is 0.74 at both rates,
A Prosody Model to TTS Systems
based on the syllable durations produced by a speaker. For the two stages jointly, the results were never below 0.7 for either rate.
The first stage uses a multi-layer perceptron ANN which describes the syllable according to 10 features, presented here in descending order of relevance:
• Break index;
• Function/content distinction;
• Stress index;
• Type of foot;
The second stage develops the elasticity concept, according to which the durations of a syllable's segments are obtained by applying a single z-score (a normalized duration) in Eq. (3.5), chosen so that the sum of the segmental durations equals the syllable duration, Eq. (3.6):

dur_i = exp(µi + z·σi)    Eq. (3.5)

Σi dur_i = dur_syllable    Eq. (3.6)

where µi and σi are, respectively, the mean and standard deviation of the transformed (logarithmic) durations for segment i.
The author registered that the model has difficulty in predicting final-syllable segmental durations, due to segmental lengthening in this position.
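The elasticity mechanism of Eqs. (3.5) and (3.6) amounts to finding the single z that makes the log-domain segment durations sum to the syllable duration. A minimal sketch, with invented µ and σ statistics:

```python
import math

def segment_durations(mu, sigma, syllable_dur, tol=1e-6):
    """Find z such that sum(exp(mu_i + z*sigma_i)) == syllable_dur (Eq. 3.6),
    then return the segment durations of Eq. (3.5). Bisection on z, which is
    valid because the total is monotonically increasing in z."""
    def total(z):
        return sum(math.exp(m + z * s) for m, s in zip(mu, sigma))
    lo, hi = -10.0, 10.0   # z range wide enough for practical durations
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) < syllable_dur:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return [math.exp(m + z * s) for m, s in zip(mu, sigma)]

# Hypothetical log-duration statistics (durations in ms) for 3 segments:
mu = [math.log(60), math.log(90), math.log(50)]
sigma = [0.3, 0.4, 0.25]
durs = segment_durations(mu, sigma, 250.0)
```

Because the default durations sum to 200 ms, the solver returns a positive z, stretching every segment proportionally to its own σ.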
Later, Campbell [1993] enhanced his model to handle the final-syllable lengthening problem, which affects the rhyme more than the onset. The modification consists of multiplying the z-score by an alpha value that depends on the context. This improvement gave the model better results, and the author presented a correlation coefficient of 0.93 for syllable duration. When comparing predicted durations with the measured durations produced by 4 speakers, the model achieved an average correlation coefficient of 0.71.
The perceptual centre (PCentre) is located at the vocalic onset when the syllable is not preceded by a silence; if there is a silence, the PCentre is usually placed earlier in the syllable. The interval between two perceptual centres is the Inter-Perceptual-Centre interval, and the corresponding unit is known as the IPCG.
The authors used an internal clock to actively control the speech rate. A feed-forward ANN
transforms simple ramps, indicating the length and function of each linguistic unit of the utterance,
into rhythmic contours according to speech rate, prosodic markers, nature of the vowel, number of
consonants in coda and number of consonants in IPCG [Barbosa and Bailly, 1997]. The ANN
predicts the duration’s logarithm using the following parameters:
• Sentence modality;
• Sentence extent (using a ramp with the number of IPCG in the phrase);
• Prosodic group extent (using a ramp with the number of IPCG in the group);
The distribution of durations to the IPCG constituents is accomplished with a modification of the repartition algorithm mentioned previously, developed by Campbell and Isard [1991], so as to include emerging pauses. That modification, first justified with experimental results and then presented, is based on the assumption that a pause has a minimum duration of approximately 60 ms, which was experimentally confirmed for different speech rates. The modified algorithm consists of:
Compute the z-score for a given IPCG;
If z is smaller than or equal to the critical value of 0.79,
    the procedure ends: segmental durations are obtained using Eq. (3.5);
If z is greater than 0.79,
    the z-score of the vowel, zv, is obtained with the regression of Eq. (3.7), setting zvs = z, and the new z-score is z = zv.
The values µi and σi in Eq. (3.5) were previously determined for each segment, using the mean and standard deviation over a database with several occurrences of each segment.
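The pause-emergence rule above can be sketched as follows; the regression coefficients standing in for Eq. (3.7) are hypothetical placeholders, not the fitted values.

```python
import math

Z_CRIT = 0.79  # critical z above which a pause emerges (Barbosa & Bailly)

def repartition_with_pause(z, mu, sigma, ipcg_dur, a=0.5, b=0.1):
    """Sketch of the modified repartition: at or below the critical z,
    stretch segments with Eq. (3.5); above it, cap the z-score with a
    regression (standing in for Eq. (3.7); a and b are hypothetical) and
    let the leftover IPCG time surface as a pause."""
    if z <= Z_CRIT:
        durs = [math.exp(m + z * s) for m, s in zip(mu, sigma)]
        return durs, 0.0                 # no pause emerges
    zv = a * z + b                       # hypothetical stand-in for Eq. (3.7)
    durs = [math.exp(m + zv * s) for m, s in zip(mu, sigma)]
    pause = ipcg_dur - sum(durs)         # remaining time becomes a pause
    return durs, pause
```

With a large z, the capped stretching leaves part of the predicted IPCG duration unassigned, and that remainder is realized as a pause (at least roughly 60 ms in the original work).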
The authors present the mean and standard deviation of the error, for segments in general and for pauses or silences, at 5 speech rates for the whole model. The test set exhibits values of -105 ± 113 ms for silence and 5 ± 43 ms for the remaining segments at a normal speech rate; for a fast speech rate, the test set displays values of 64 ± 144 ms for silence and 0 ± 28 ms for the remaining segments. Compared to the syllable as a unit, the IPCG gives better results for pauses and consonants, but no advantage for vowels. The results show very low mean error values, but this mean error is different from the absolute mean error mentioned in other works: a null mean error indicates that the time unit (sentence, text, etc.) over which it was measured has the same duration as its reference. Using the IPCG thus helps maintain the rhythmic structure of the sentence, including its speech rate, i.e., its total duration.
Later, Barbosa [1997] applied this model, with suitable adjustments, to Brazilian Portuguese, obtaining a mean error of 2 ms and a standard deviation of 36 ms.
• Level 1 - determines the specific segmental duration, influenced only by the articulation of
adjacent sounds, with no supra-segmental effects;
• Level 3 - modifies the previous level’s durations to establish final durations according to
the length of the word, the position of the word within the phrase and sentence boundaries.
Pauses are separately inserted in sentence break markers and between phrases.
The author reported that the durations are set about 98% correctly after level 3.
The reported absolute mean error and standard deviation values for allophones in the training set are 16.3 ms and 19.6 ms, respectively.
The model chose the phoneme as its segmental unit. The ANN is a multi-layer perceptron type.
For the network input, several parameters were studied, but only the following ones proved successful:
• Phoneme identity;
• Syllable stress;
• Phoneme in function-word;
• Position in the sentence (position of the phoneme within the syllable, of the syllable within
the word and of the word within the sentence);
• Number of phrasal units (number of phonemes in the syllable, number of syllables in the
word and number of words in the sentence);
• Beginning of the sentence (up to the first accent) and end of the sentence (after final
accent).
After suitable codification, these parameters yielded better results than those obtained for a reference set composed exclusively of phoneme identity and accent.
The network output corresponds to the duration of the phoneme, expressed as a logarithm normalized in standard deviations, since this codification showed better results than the other tested ones.
The authors made their assessment according to the specifications of the database they used. Their best result is an absolute error of 14.3 ms, far better than the results of their previous rule-based model.
If the ANN input contains all the features likely to influence segmental duration, and if its architecture is able to learn how each feature exerts its influence under different circumstances, based on a sufficiently large set of natural utterance examples, then the network should be able to predict the sequence of durations that corresponds to the natural utterance of the segments resulting from the analysis of a text.
This was the basic idea behind the creation of the model. The next sections describe the process of choosing a set of examples to "teach" the ANN, the choice of network architecture and its training, and the selection of the most suitable set of features and their parameterization. Finally, the model is evaluated and criticized.
Initially, the model used a large number of parameters, which were then modelled and tested to find the set with the best results, bearing in mind the degree of relevance of each parameter. The aim was not so much to reduce the number of features as to reduce the error in segmental duration prediction. The decision to include or exclude a particular feature was based on whether the correlation coefficient between predicted and original durations improved with that feature in the input vector. Since the correlation coefficient is very highly correlated (r = 0.999) with the MOS of a perceptual test using the whole paragraphs of the test set, as described in section 5.2.1.1, this process can be considered capable of selecting the features by their perceptual relevance.
There was an attempt to improve the structure of the ANN in terms of hidden layers, nodes per
layer, learning functions, output functions of the final layer and codification of the input and output
parameters.
High level linguistic features, such as morpho-syntactic features, were not considered due to the
lack of automatically accessible information at this stage.
The chosen segmental unit is the phoneme, but plosive consonants are divided into their two moments: the occlusion and the burst. The list of segments is presented in Table 2.6.
The corpus used here consists of more than 100 paragraphs of every type and dimension divided
into 7 texts, with a total of 18.700 sound segments for 21 minutes of speech uttered by the same
speaker. Training was done using sentences from 6 texts, with approximately 15.000 phoneme
segments and testing was done with the remaining text containing about 3.000 segments.
Although there are other features the model should not disregard, the identity of the phoneme segment is the most important one. This fact justifies an analysis of the corpus composition with respect to that feature. Fig. 2.9 shows the frequency distribution of the phoneme segments in the corpus, and this distribution is practically identical in the training and test sets, as seen in Fig. 3.2.
[Bar chart comparing the relative frequency (%) of each phoneme segment in the training set and in the test set, over the inventory: a 6 E e @ i O o u j w j~ w~ 6~ e~ i~ o~ u~ p !p t !t k !k b !b d !d g !g m n J l l* L r R v f z s S Z]
Fig. 3.2 – Relative frequency (%) of the phonemes in the training and test sets.
This duration model is valid for the speech rate at which the database was recorded. For a
different speech rate, another database would have to be recorded and labelled at the chosen rate,
and the ANN trained with new data. Some co-articulation phenomena, modelled for this speech
rate at the grapheme-phoneme conversion level, may differ for other speech rates. As mentioned in chapter 2, the speech rate for this database is 12.2 phones per second, equivalent to the normal reading of a news report.
Other phonetic change phenomena likely to influence the model and documented in [Teixeira et al., 2001], such as dialectal variation and contextual changes (suppression and reduction, vowel quality transformation, addition, allophones and phonetic changes), are treated in the phonetic transcription and co-articulation process and are thus, supposedly, included in the model.
Considering the features established ahead and the way they are parameterized in the network, the number of possible combinations for different input vectors is about 10^16. However, only a minute part of those vectors is linguistically possible, since many combinations are merely hypothetical. Of the total of 18.700 sound segments, around 1000 are pauses and silences; the remaining 17.700 phoneme segments were used for training and testing the model and were parameterized as vectors. Of these 17.700, only about 2% are repeated, leaving about 17.350 distinct vectors.
Several architectures were tested, with different network types, structures, number of hidden
layers and corresponding transfer functions, as well as number of nodes per layer.
Perceptron networks and recurrent networks (Hopfield and Elman networks) were also tested, but the learning results were never satisfactory. The adopted network is of the feed-forward type, trained using back-propagation algorithms, with good results from the beginning.
The network input has all the nodes necessary to codify the chosen parameters, discussed later. The output has one node, which indicates the segment duration value. Between one and four hidden layers were tested; the best option varies between one and two layers. Table 3.1 displays the performance values for the best architectures, where Log, Tan and Lin denote the hyperbolic logarithmic, hyperbolic tangent and linear transfer functions, respectively. The number of nodes in the input layer is not shown in the first column, but is discussed in detail in the following sections. A network with two hidden layers, 4 nodes in the first and 2 in the second, was chosen because it produced the best results in the testing phase.
Fig. 3.3 exhibits the architecture of the chosen network, with n input nodes; the duration, d, in the output layer, activated by the linear transfer function; a first hidden layer with 4 nodes, activated by the hyperbolic tangent transfer function; and a second hidden layer with 2 nodes, activated by the hyperbolic logarithmic transfer function. The nodes of subsequent layers are fully connected. The polarisation value, or bias, of each node is denoted by b. To avoid clutter, weights are not displayed in the figure, but one is associated with each filled arrow connecting one node to another, including the input nodes. The total number of weights is n×4 + 4×2 + 2×1 + 7 = 4n + 17.
[Fig. 3.3 – Architecture of the chosen network: input nodes p1 … pn fully connected to a first hidden layer of 4 hyperbolic tangent nodes (biases b1,1–b1,4), followed by a second hidden layer of 2 hyperbolic logarithmic nodes (biases b2,1, b2,2) and a linear output node (bias b3,1) producing the duration d; each node is drawn with the graph of its transfer function.]
The activation functions of each node are graphically displayed in Fig. 3.3. The activation functions of the first hidden layer, hyperbolic tangent functions, vary between -1 and +1; the second layer's hyperbolic logarithmic functions vary between 0 and +1. The output node's activation function is strictly linear.
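A minimal sketch of a forward pass through this architecture, with random placeholder weights, together with the 4n + 17 parameter count:

```python
import math, random

def forward(p, W1, b1, W2, b2, w3, b3):
    """Forward pass: n inputs -> 4 tanh nodes -> 2 log-sigmoid nodes
    -> 1 linear output node giving the duration d."""
    h1 = [math.tanh(sum(w * x for w, x in zip(row, p)) + b)
          for row, b in zip(W1, b1)]
    logsig = lambda a: 1.0 / (1.0 + math.exp(-a))
    h2 = [logsig(sum(w * x for w, x in zip(row, h1)) + b)
          for row, b in zip(W2, b2)]
    return sum(w * x for w, x in zip(w3, h2)) + b3

def n_weights(n):
    # n*4 + 4*2 + 2*1 connection weights plus 4 + 2 + 1 = 7 biases: 4n + 17
    return n * 4 + 4 * 2 + 2 * 1 + 7

# Example with n = 3 inputs and small random placeholder parameters:
random.seed(0)
n = 3
W1 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b2 = [0.0] * 2
w3, b3 = [0.5, -0.5], 0.1
d = forward([0.2, -0.1, 0.7], W1, b1, W2, b2, w3, b3)
```

In the real model the weights and biases are, of course, the result of training, not random draws.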
The architecture of an ANN should be carefully designed to guarantee that the number of available training vectors is several times (at least 5 times) larger than the number of weights of the ANN; otherwise the training set will not be enough to optimise all the ANN weights. Even if the predictions for the training set are very good, when data from other sets is used the predicted results will not reach the quality level obtained on the training set. In the present case, considering the architecture described above and the number of features discussed in the next section, the number of weights is 410 and the number of training vectors is about 15.000, about 36 times larger.
On the other hand, regardless of the ratio between the number of training vectors and the number of ANN weights, an over-fitting problem may occur if the number of training sessions is excessive. The network adapts itself perfectly to the training set (the smaller the ratio of training vectors to ANN weights, the easier this is), but fails to handle other input vectors: it 'memorizes' the training examples but doesn't 'learn' how to deal with the problem.
In order to avoid over-fitting, three sets were initially used in the training process: the training set, used to train the ANN; the validation set, used to stop training early when further training on the training set would hurt generalisation to the validation set; and a test set, used to evaluate whether the training and validation sets are representative of the universe of the problem. If they are not, the performance on the test set does not follow the performance on the training and validation sets.
Consequently, the database was first divided into a training set of approximately 13.000 vectors, a validation set of about 3.000 vectors and a test set of about 2.000 vectors. These sets were organized by assigning 5 texts to the training set and one to each of the others. Later, the test set was eliminated, since its performance closely followed that of the other two sets, proving that the training and validation sets are representative. The data of the test set was transferred to the training set, and the validation set was also used for testing. Hence, the final training set comprises approximately 15.000 vectors and the test set about 3.000.
The chosen performance function was the root-mean-square error between the predicted outputs
and the target values.
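The early-stopping procedure and the RMSE performance function described above can be sketched generically (this is not the Matlab toolbox code):

```python
import math

def rmse(pred, target):
    """Root-mean-square error: the chosen performance function."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def train_with_early_stopping(train_step, evaluate, max_epochs=500, patience=5):
    """Stop when the validation error has not improved for `patience` epochs."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step()            # one batch update over the whole training set
        val_err = evaluate()    # e.g. rmse() on the validation set
        if val_err < best:
            best, best_epoch, bad = val_err, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break           # further training would hurt generalisation
    return best, best_epoch
```

The `patience` value is a common generalisation of the stop-on-first-increase rule; the actual stopping criterion used by the toolbox may differ.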
For the network’s training, several variants of the back-propagation training algorithm were
tested, all available in Matlab®’s toolbox for neural networks [Demuth and Beale, 2000] and
shortly described next.
Every tested algorithm is of the ‘batch training’ type, i.e., at every iteration, the weights and
biases are only updated after every vector in the training set has been applied to the network.
• traingd - ‘Batch Gradient Descending’ – the weights and biases are updated in the
direction of the negative gradient of the performance function. The learning rate is fixed;
These two algorithms are usually very slow handling practical problems. The alternative is fast
learning algorithms, which can be divided into two categories. The first one uses heuristic
techniques developed from the analysis of the performance of the standard steepest descent
algorithm. One heuristic technique is the momentum technique, used in the previously mentioned
algorithm. The other two techniques are variable learning rate and resilient back-propagation. The
second category uses standard numerical optimization techniques. There are three types of
optimization techniques for neural nets: conjugate gradient, ‘quasi-Newton’ and ‘Levenberg-
Marquardt’.
• traingda - standard steepest descent algorithms use a constant learning rate throughout the
training, but the performance of the algorithm is very sensitive to the setting of the
learning rate. If the learning rate is too high, the algorithm may oscillate and become
unstable; if it is too small, the algorithm will take too long to converge. This algorithm
uses an adaptive learning rate, in order to keep the learning step as large as possible and
make sure the algorithm remains stable;
• trainrp - multilayer networks typically use hyperbolic transfer functions in the hidden
layers. This function compresses an infinite input range into a finite output range and one
of its main features is that its slope must approach zero as the input gets large. The
learning process is slow because the weight update is proportional to the gradient of the performance function, which is almost null in those regions. To eliminate this harmful effect of the small partial-derivative magnitudes, the algorithm uses only the sign of the derivative to determine the direction of the weight update; the magnitude has no effect on it [Riedmiller and Braun, 1993]. It requires only a small increase in memory resources.
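The sign-based update that trainrp implements can be sketched as follows, using the usual Riedmiller and Braun constants (eta+ = 1.2, eta- = 0.5); the handling of sign changes shown here is one common variant of the algorithm.

```python
# Only the SIGN of each partial derivative decides the update direction;
# the per-weight step size is adapted multiplicatively.

def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Rprop iteration; updates weights, steps and prev_grads in place."""
    for i, g in enumerate(grads):
        if prev_grads[i] * g > 0:        # same sign as before: accelerate
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif prev_grads[i] * g < 0:      # sign change: overshot, slow down
            steps[i] = max(steps[i] * eta_minus, step_min)
            g = 0.0                      # skip the update this iteration
        if g > 0:
            weights[i] -= steps[i]       # move against the gradient
        elif g < 0:
            weights[i] += steps[i]
        prev_grads[i] = g
    return weights
```

The key property, as noted above, is that a tiny gradient on the flat tails of the transfer functions no longer produces a tiny step.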
Basic back-propagation algorithms adjust the network weights in the steepest descent direction
(negative of the gradient). However, this process doesn’t necessarily lead to faster convergence.
Conjugate gradient algorithms search along conjugate directions of the performance function; their learning rate is adjusted at each iteration so as to minimize the performance function along those directions. These algorithms are usually faster than variable learning
rate algorithms, and sometimes even faster than resilient back-propagation algorithms, but their
results depend on the kind of problem they’re handling. They only require a little more storage than
the simpler algorithms, so they are often a good choice for networks with a large number of
weights. The alternatives available in the neural network ‘toolbox’ [Demuth and Beale, 2000] are:
traincgf – ‘Fletcher-Reeves Update’, traincgp – ‘Polak-Ribiére Update’, traincgb – ‘Powell-Beale
Restarts’ and trainscg – ‘Scaled Conjugate Gradient’. Essentially, they differ in the way the search
is done for a new direction.
• trainbfg - the Newton method uses a second-derivative matrix of the performance index at each iteration, the Hessian matrix. However, this matrix is complex and expensive to compute. ‘Quasi-Newton’ algorithms instead update an approximate matrix at each iteration, with the update computed as a function of the gradient, making the algorithm lighter. They require more computation per iteration and more storage than the conjugate gradient methods, although they generally converge in fewer iterations. This algorithm is recommended for smaller networks;
• trainoss - ‘One Step Secant Algorithm’ - this algorithm is an attempt to bridge the gap between the conjugate gradient algorithms and the previous algorithm, as far as storage and computation requirements are concerned. It does not store the complete Hessian matrix; it assumes, at each iteration, that the previous Hessian was the identity matrix.
To sum up:
The trainlm algorithm is recommended for networks with a few hundred weights; beyond that it is rather heavy in terms of memory. The trainrp algorithm is fast at pattern recognition, but ineffective for function approximation, because it degrades significantly when the error becomes small. The trainscg algorithm handles a wide range of problems, especially on large networks: it does not require much storage, it is almost as fast as trainlm on function approximation (even faster for large networks), it is almost as fast as trainrp on pattern recognition, and it does not degrade as much as trainrp when the error is small. The trainbfg algorithm is similar to trainlm in terms of performance and does not require as much storage, but its computation needs grow geometrically with the network size. Finally, the traingdx algorithm is usually the slowest and requires as much storage as trainrp; it is quite useful for slow-convergence situations.
For the present problem, each of the mentioned algorithms was tested. In the initial phase, with a large number of features and, consequently, a large number of weights to adjust, the resilient back-propagation algorithm, trainrp, was selected. Once the feature set was significantly reduced, and with it the number of weights to adjust, the trainlm algorithm became more practical and produced better results. With the other algorithms, the performance values were far from those of the adopted solution; in some cases the network was even unable to ‘learn’.
Fig. 3.4 displays the evolution of the error of the performance function for the training and validation sets in a training session with the trainlm algorithm. Training was automatically interrupted after 41 iterations (300 seconds2) to prevent over-fitting. With the resilient back-propagation algorithm, trainrp, the training time is approximately 10 times smaller, but the performance values are also worse.
[Log-scale plot of the performance function over 41 epochs: training curve in blue, validation in green, goal (0.0016) in black; final performance 0.00713325.]
Fig. 3.4 – Error evolution in the performance function in the training and validation sets during a training
session.
The very close evolution of performance in both the training and validation sets proves the homogeneity of the two sets. When the test set was used, it also followed the evolution of performance of the training and validation sets.
2
The training process ran on a computer with a Pentium IV 1.8 GHz processor and 512 Mbytes of RAM.
3.3.4 Features
The basic idea for the creation of this model was to collect every feature likely to influence the duration of a given segment, even if the influence of certain features is rather subtle or if some of them are strictly correlated with others.
This section will describe the set of tested features, how they are automatically extracted from
the text, the best way to codify them, whether or not they are influential and how.
[Fig. 3.5 – Block diagram: a TTS path (continuous lines) from the pre-processed text through syllable division and tonic syllable labelling; a model-development path (broken lines) from the phoneme, word and sentence label files through word labelling and phoneme line-up; both paths feed the common feature extraction block and the duration model.]
Fig. 3.5 – Sequence of processing blocks prior to the development stage of the duration model and its
application to TTS.
Before describing the chosen features and the way they are automatically extracted from the data, it is convenient to describe the sequence of processing blocks presented in Fig. 3.5. On the left, drawn with continuous lines, are the processing blocks of the TTS converter; on the right, drawn with broken lines, the processing blocks for the development of the duration model; below, in dotted lines, lie the common blocks and the model itself.
The two processing sequences, for the TTS application stage and for the development of the
model stage, are distinct because the object for the development stage is the labelled database and
not the text. However, one may question: why is the left sequence not the only source? Why are the
readings of the database not used and followed by a line-up of the phonetic transcription results and
the database labelling results?
In fact, that was not the option taken, since some TTS blocks, namely the phonetic transcription block, were developed simultaneously with the duration model and therefore still lacked stability when the development of the model began. Moreover, using the database durations would necessarily imply using a phoneme sequence matching the database sequence, and there was no guarantee that the phonetic transcription results would exactly match the labelling results. To prevent neglecting any specific aspect of the model when applied to TTS, the phonetic transcription block should be handled carefully with regard to post-lexical rules. The transcription should be phonological rather than phonetic, so that its results match those of the labelling for the same texts3.
The left block sequence in Fig. 3.5 takes the pre-processed text as input, so that any acronym, abbreviation or numeric character is already in full text form. A syllable division algorithm [Gouveia et al., 2000] is then applied to divide the text into syllables, and the tonic syllable is marked according to the rule set in [Teixeira, 1995]. Finally, the phonetic transcription is made and the co-articulation rules are applied, as described in the previous chapter.
At its input, the right block sequence, the model's development stage, has files containing three labelling levels: phonetic labelling, word labelling and sentence labelling, as seen in section 2.3. These files hold each time instant and its corresponding label (Table 2.5). Phoneme segments are not yet grouped into words or sentences, so the first processing block lines up the word markers and the phoneme segment markers, allowing segments to be easily assigned to the words they belong to. The same happens with sentences, making it easy to group words and phonemes belonging to the same sentence. The third processing block handles syllable division, but with a different algorithm from the one mentioned for the left side of the figure, since this division is phoneme-based instead of grapheme-based. Syllable identification became, in some cases, much harder, due to the various phoneme reductions and suppressions in the spoken text; however, the database markers at the beginning of the tonic syllable are very handy at this stage. The algorithm used in the syllable division was described in the previous chapter.
Either way, once the phoneme sequences, divided into syllables, the tonic syllables and the word and sentence boundaries are known, the accent groups and phrases can be identified. Every sentence marker in Table 2.5 was considered a phrase boundary marker4. As for accent groups, the idea was to combine words with their neighbouring monosyllables in order to create groups of over three syllables, but with only one tonic accent. These groups are made by a word combination process: each group should have more than two syllables and no fewer than two phonemes in the last one, unless it is the last word in the phrase. Because it lacks higher-level linguistic background
3 The phoneme sequences should be very close to those produced by the speaker, not the full lexical form. For that, the set of post-lexical or co-articulation rules proposed in the previous chapter is very important. Using the correct lexical sequence instead of the full lexical form is very important for naturalness, as reported in [Brinckmann and Trouvain, 2003].
4 Linguistically speaking, this is not an accurate phrase boundary identification process. However, because it was rather difficult to group phrases automatically while respecting linguistic criteria, these text groupings became, despite being less correct, what we call phrases.
information, this process sometimes fails and separates words which belong together as a unit (e.g. Vieira da Silva, Vila Real). If there is more than one tonic syllable marker in the group, only the final one is kept. The final step is numbering the syllables within each accent group. Accent groups usually have 3 to 5 syllables. An example of the application of the accent group concept is given by the following sentence (‘a strong reserve regarding the justice situation’): “uma forte / reserva / em relação / à situação / da justiça”.
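The greedy grouping described above can be sketched as follows; the syllable counts are supplied by hand here, and the two-phoneme check on the last word is omitted for brevity.

```python
# Sketch of accent-group construction: merge each word with neighbouring
# short words until the group has more than two syllables; a short tail
# at the end of the phrase is attached to the last group.

def accent_groups(words, syllables_per_word, min_syllables=3):
    groups, current, count = [], [], 0
    for w, n in zip(words, syllables_per_word):
        current.append(w)
        count += n
        if count >= min_syllables:       # group is big enough, close it
            groups.append(" ".join(current))
            current, count = [], 0
    if current:                          # attach a short tail to the last group
        if groups:
            groups[-1] += " " + " ".join(current)
        else:
            groups.append(" ".join(current))
    return groups

# The example sentence from the text, with hand-counted syllables:
words = ["uma", "forte", "reserva", "em", "relação", "à", "situação", "da", "justiça"]
sylls = [2, 2, 3, 1, 3, 1, 4, 1, 3]
groups = accent_groups(words, sylls)
```

On this input the sketch reproduces the grouping given above: "uma forte / reserva / em relação / à situação / da justiça".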
• Phoneme syllable position in relation to group’s tonic syllable – this feature was initially
codified so as to activate the input node that corresponds to one of the following 5
categories: before prior to tonic; prior to tonic; tonic; subsequent to tonic; after subsequent
to tonic. This feature was later re-codified into a single node with values obtained from the
correlation, r, between each category and segmental duration, according to Table 3.2. The
new codification reduces the number of input nodes without loss in final performance.
Table 3.2: codification of the ‘position’ feature in relation to the tonic syllable.
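The correlation-based re-codification can be sketched by correlating each category's one-hot indicator with segmental duration; the data below is invented for illustration, not taken from Table 3.2.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def recode(categories, durations):
    """Map each category of a feature to its correlation, r, with duration,
    so several one-hot input nodes collapse into a single node."""
    table = {}
    for c in set(categories):
        indicator = [1.0 if cat == c else 0.0 for cat in categories]
        table[c] = pearson(indicator, durations)
    return table

# Invented data: tonic-syllable segments tend to be longer.
cats = ["tonic", "pre", "tonic", "post", "pre", "tonic"]
durs = [120.0, 80.0, 130.0, 70.0, 85.0, 125.0]
table = recode(cats, durs)
```

Categories associated with longer durations get positive values and those associated with shorter durations get negative values, which is the ordering a table like Table 3.2 would encode.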
• Phoneme syllable type – initially codified activating one of the following categories5: V;
C; VC; CV; CC; VCC; CVC; CCV; CCVC, where V stands for vowel or diphthong and C
stands for consonant. This feature was later re-codified into a single node, with the values
obtained from the correlation between each category and segmental duration, according to
the third column in Table 3.3. Again, the new codification reduces the number of input
nodes without loss in final performance.
• Type of previous syllable – similar processing to the ‘type of syllable’ feature, but final
codification used different values, since the correlation values also differ. The last column
in Table 3.3 displays the codification values for this feature.
• Type of syllable vowel – Initially codified activating one of the following types: long
vowels – {a, E, e, O, o}; medial vowels – {6, i}; short vowels – {@, u}; diphthongs;
nasals – {6~, e~, i~, o~, u~}. This feature was later re-codified into a single node with the
values obtained from the correlation of each type with segmental duration, according to
third column in Table 3.4.
5 These categories were considered the only possible syllable types for Portuguese, section 2.4. Types C and
CC are only possible in the phonetic sequence due to vowel suppression in syllables of the CV, CVC and
CCV types.
• Type of previous syllable vowel – similar processing to the ‘type of syllable vowel’
feature, but final codification used different values, since the correlation values also differ.
The fifth column in Table 3.4 displays the codification values for this feature.
• Type of following syllable vowel – similar processing to the ‘type of syllable vowel’
feature, but final codification used different values, since the correlation values also differ.
The last column in Table 3.4 displays the codification values for this feature.
Table 3.3: Codification of the ‘syllable type’ and ‘previous syllable type’ features.
Table 3.4: Codification of the ‘syllable vowel’, ‘previous syllable vowel’ and ‘following syllable vowel’
features.
• Position in accent group – codified into two nodes, both showing normalized positions,
one from the beginning and the other from the end of the group.
• Position in phrase – codified into two nodes, both showing normalized positions, one from
the beginning and the other from the end of the phrase.
A Prosody Model to TTS Systems
• Distance to next pause – measured in seconds and normalized relatively to the maximum
value in database.
• Accent group length – codified into two nodes, both normalized, one showing the number
of group segments and the other the number of group syllables.
• Accent group position in the phrase – Codified into three nodes, by activating the node
that corresponds to the beginning, the middle or the end of the phrase.
• Final vowel suppression – Codified6 into a single node, showing whether or not there is
final vowel suppression. This feature is only used for the final phoneme in the word.
• Identity of the previous segment (-1) – After analysing the correlation between the identity
of the previous segment (-1) and the duration of the current segment, a total of 20 relevant
phones were found. Thus, this feature is codified into 20 nodes, by activating the node that
corresponds to the previous segment.
• Identity of the following segment (+1) – After analysing the correlation between the
identity of the following segment (+1) and the duration of the current segment, a total of
12 relevant phones was found. Thus, this feature is codified into 12 nodes, by activating
the node that corresponds to the following segment.
• Identity of the segment subsequent to the following (+2) – After analysing the correlation
between the identity of the segment subsequent to the following (+2) and the duration of
the current segment, a total of 4 relevant phones was found. Thus, this feature is codified
into 4 nodes, by activating the node that corresponds to the segment subsequent to the
following.
• Identity of the segment (+3) – After analysing the correlation between the identity of the
segment (+3) and the duration of the current segment, a total of 2 relevant phones was
found. Thus, this feature is codified into 2 nodes, by activating the node that corresponds
to the segment (+3).
For the first 6 features, the codification allowed a significant reduction of network input nodes
without any consequences for the model’s performance. For the features concerning
neighbouring segments (the last 4 features), the number of nodes was also considerably reduced:
segment types showing a weak correlation with segmental durations were not considered,
without harming the model’s performance.
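As a concrete illustration of this re-codification, a categorical feature that would otherwise occupy one input node per category can be collapsed into a single node carrying the category’s correlation value. The sketch below is a minimal Python illustration with placeholder r values; the actual values are those of Tables 3.2–3.4:

```python
# Sketch of the single-node re-codification of a categorical feature.
# One-hot codification uses one input node per category; the re-codification
# replaces them with a single node whose value is the correlation r between
# that category and segmental duration.
# The r values below are placeholders, not the thesis's actual numbers.

SYLLABLE_TYPES = ["V", "C", "VC", "CV", "CC", "VCC", "CVC", "CCV", "CCVC"]

# hypothetical correlations of each syllable type with segmental duration
R_SYLLABLE_TYPE = {"V": 0.10, "C": -0.05, "VC": 0.12, "CV": 0.02,
                   "CC": -0.08, "VCC": 0.15, "CVC": 0.09, "CCV": 0.01,
                   "CCVC": 0.11}

def one_hot(category):
    """Original codification: 9 input nodes, one active."""
    return [1.0 if t == category else 0.0 for t in SYLLABLE_TYPES]

def single_node(category):
    """Re-codification: 1 input node carrying the category's r value."""
    return [R_SYLLABLE_TYPE[category]]

print(len(one_hot("CVC")), "->", len(single_node("CVC")))  # 9 -> 1
```

The same scheme applies to the ‘position’, ‘syllable type’ and ‘syllable vowel’ features, each dropping from several one-hot nodes to a single real-valued node.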
6 The codification for this feature is distinct for both routes in Fig. 3.5. The left route, application to TTS,
extracts this information from the co-articulation block, where final vowel may or may not be suppressed
due to co-articulation. As for the right route, training and model development stage, that information is
harder to obtain because suppressions are not registered in the database. Therefore, final consonants are a
good indicator of whether or not suppression occurred. If the consonant is of the {r, l*, S} type, then
admittedly there was no suppression, since those consonants are likely to occur in word-final position; if the
consonant is of a different type, then admittedly suppression occurred, since no other consonant is likely to
appear in word-final position unless the vowel was omitted.
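The training-side heuristic in this footnote can be sketched as follows; the function name and the explicit vowel set are illustrative, and the phone symbols follow the SAMPA-style notation used in the thesis:

```python
# Sketch of the suppression heuristic used during training (footnote 6):
# if a word ends in a consonant of the {r, l*, S} type, assume the final
# vowel was NOT suppressed (these consonants legitimately occur word-finally);
# any other word-final consonant implies the final vowel was suppressed.

NO_SUPPRESSION_FINALS = {"r", "l*", "S"}
VOWELS = {"a", "6", "E", "e", "@", "i", "O", "o", "u",
          "6~", "e~", "i~", "o~", "u~"}

def final_vowel_suppressed(word_phones):
    """Return True if the heuristic says the word-final vowel was dropped."""
    last = word_phones[-1]
    if last in VOWELS:
        return False            # word actually ends in a vowel
    return last not in NO_SUPPRESSION_FINALS

print(final_vowel_suppressed(["k", "a", "z", "6"]))   # ends in vowel -> False
print(final_vowel_suppressed(["m", "a", "r"]))        # ends in r -> False
print(final_vowel_suppressed(["s", "E", "t"]))        # ends in t -> True
```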
Table 3.5: Final feature set, the corresponding importance and the correlation with the segmental durations.

| # Node | Feature                                                  | Detail              | r      | Importance        |
|--------|----------------------------------------------------------|---------------------|--------|-------------------|
| 1      | Position in relation to tonic syllable                   |                     | 0.145  | Relevant          |
| 2      | Type of syllable                                         |                     | 0.175  | Slightly relevant |
| 3      | Type of previous syllable                                |                     | -0.055 | Slightly relevant |
| 4      | Syllable vowel                                           |                     | 0.208  | Relevant          |
| 5      | Previous syllable vowel                                  |                     | -0.075 | Slightly relevant |
| 6      | Following syllable vowel                                 |                     | -0.151 | Slightly relevant |
| 7      | Position in accent group                                 | From beginning      | 0.026  | Slightly relevant |
| 8      | Position in accent group                                 | From end            | -0.153 | Relevant          |
| 9      | Position in phrase                                       | From beginning      | -0.038 | Slightly relevant |
| 10     | Position in phrase                                       | From end            | -0.244 | Relevant          |
| 11     | Distance to next pause                                   |                     | 0.203  | Relevant          |
| 12     | Accent group length                                      | # of syllables      | 0.052  | Slightly relevant |
| 13     | Accent group length                                      | # of phonemes       | 0.026  | Slightly relevant |
| 14     | Accent group position in the phrase                      | Beginning           | 0.015  | Slightly relevant |
| 15     | Accent group position in the phrase                      | Middle              | -0.081 | Relevant          |
| 16     | Accent group position in the phrase                      | End                 | 0.114  | Relevant          |
| 17     | Final vowel suppression                                  |                     | 0.082  | Slightly relevant |
| 18-61  | Identity of the segment                                  | Detail in Table 3.6 |        | Very relevant     |
| 62-81  | Identity of the previous segment (-1)                    | Detail in Table 3.6 |        | Relevant          |
| 82-93  | Identity of the following segment (+1)                   | Detail in Table 3.6 |        | Relevant          |
| 94-97  | Identity of the segment subsequent to the following (+2) | Detail in Table 3.6 |        | Relevant          |
| 98-99  | Identity of the segment (+3)                             | Detail in Table 3.6 |        | Relevant          |
The relative importance of the final feature set is presented in Table 3.5. Importance was
measured by removing one feature from the set and measuring the model’s new performance on
the test set; the resulting decrease in performance was the measure of importance for that
particular feature. The value r presented in the table is the correlation between the input feature
and the output; r was not directly used in the measure of importance. Table 3.6 shows in detail the
correlation of each segment identity, and of the neighbouring segments’ identities, with
segmental durations.
Table 3.6: Correlation between the segments and the surrounding segments with the segmental durations.

| # Node | Feature | Identity | r      | # Node | Feature    | Identity | r      |
|--------|---------|----------|--------|--------|------------|----------|--------|
| 18     | Phone   | a        | 0.262  | 59     | Phone      | s        | 0.235  |
| 19     | Phone   | 6        | 0.065  | 60     | Phone      | S        | 0.151  |
| 20     | Phone   | E        | 0.130  | 61     | Phone      | Z        | 0.060  |
| 21     | Phone   | e        | 0.122  | 62     | Phone (-1) | !p       | -0.189 |
| 22     | Phone   | @        | -0.019 | 63     | Phone (-1) | t        | 0.083  |
| 23     | Phone   | i        | 0.052  | 64     | Phone (-1) | !t       | -0.184 |
| 24     | Phone   | O        | 0.143  | 65     | Phone (-1) | k        | 0.050  |
| 25     | Phone   | o        | 0.118  | 66     | Phone (-1) | !k       | -0.121 |
| 26     | Phone   | u        | -0.025 | 67     | Phone (-1) | b        | 0.042  |
| 27     | Phone   | j        | -0.050 | 68     | Phone (-1) | !b       | -0.122 |
| 28     | Phone   | w        | -0.072 | 69     | Phone (-1) | d        | 0.071  |
| 29     | Phone   | j~       | -0.005 | 70     | Phone (-1) | !d       | -0.227 |
| 30     | Phone   | w~       | -0.012 | 71     | Phone (-1) | g        | 0.053  |
| 31     | Phone   | 6~       | 0.062  | 72     | Phone (-1) | !g       | -0.123 |
| 32     | Phone   | e~       | 0.140  | 73     | Phone (-1) | n        | 0.055  |
| 33     | Phone   | i~       | 0.110  | 74     | Phone (-1) | J        | 0.051  |
| 34     | Phone   | o~       | 0.093  | 75     | Phone (-1) | l        | 0.068  |
| 35     | Phone   | u~       | 0.046  | 76     | Phone (-1) | r        | 0.089  |
| 36     | Phone   | p        | -0.195 | 77     | Phone (-1) | R        | 0.053  |
| 37     | Phone   | !p       | 0.019  | 78     | Phone (-1) | v        | 0.060  |
| 38     | Phone   | t        | -0.187 | 79     | Phone (-1) | z        | 0.075  |
| 39     | Phone   | !t       | -0.068 | 80     | Phone (-1) | S        | -0.057 |
| 40     | Phone   | k        | -0.124 | 81     | Phone (-1) | Pause    | 0.082  |
| 41     | Phone   | !k       | -0.005 | 82     | Phone (+1) | a        | -0.083 |
| 42     | Phone   | b        | -0.121 | 83     | Phone (+1) | 6        | -0.119 |
| 43     | Phone   | !b       | -0.048 | 84     | Phone (+1) | u        | -0.077 |
| 44     | Phone   | d        | -0.226 | 85     | Phone (+1) | 6~       | -0.056 |
| 45     | Phone   | !d       | -0.110 | 86     | Phone (+1) | o~       | -0.052 |
| 46     | Phone   | g        | -0.123 | 87     | Phone (+1) | t        | -0.063 |
| 47     | Phone   | !g       | -0.050 | 88     | Phone (+1) | !t       | 0.107  |
| 48     | Phone   | m        | 0.009  | 89     | Phone (+1) | d        | -0.104 |
| 49     | Phone   | n        | -0.025 | 90     | Phone (+1) | !d       | 0.095  |
| 50     | Phone   | J        | 0.011  | 91     | Phone (+1) | l*       | 0.062  |
| 51     | Phone   | l        | -0.031 | 92     | Phone (+1) | v        | 0.053  |
| 52     | Phone   | l*       | 0.025  | 93     | Phone (+1) | Pause    | 0.282  |
| 53     | Phone   | L        | -0.002 | 94     | Phone (+2) | t        | 0.107  |
| 54     | Phone   | r        | -0.189 | 95     | Phone (+2) | d        | 0.091  |
| 55     | Phone   | R        | 0.026  | 96     | Phone (+2) | r        | -0.080 |
| 56     | Phone   | v        | 0.014  | 97     | Phone (+2) | Pause    | 0.141  |
| 57     | Phone   | f        | 0.097  | 98     | Phone (+3) | u        | 0.049  |
| 58     | Phone   | z        | 0.032  | 99     | Phone (+3) | Pause    | 0.110  |
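The leave-one-feature-out importance measure reported in Table 3.5 can be sketched generically; `train_and_score` below is a stand-in for the full ANN training and test-set evaluation, and the toy weights are hypothetical:

```python
# Sketch of the leave-one-feature-out importance measure: remove one feature,
# re-evaluate on the test set, and take the drop in performance as that
# feature's importance. `train_and_score` stands in for the real pipeline.

def feature_importance(features, train_and_score):
    """Map each feature to the performance drop caused by removing it."""
    baseline = train_and_score(features)
    importance = {}
    for f in features:
        reduced = [g for g in features if g != f]
        importance[f] = baseline - train_and_score(reduced)
    return importance

# Toy scorer: performance grows with an (assumed) weight per feature.
weights = {"identity": 0.30, "syllable_vowel": 0.05, "position": 0.04}
score = lambda feats: sum(weights[f] for f in feats)

imp = feature_importance(list(weights), score)
print(imp)  # identity dominates, as in Table 3.5
```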
Apart from the final feature set, other features, and other codifications for some features in the
final set, were studied, but they brought no benefit to the model’s performance, either individually
or as a whole. Those features are nevertheless listed and briefly described below:
• Previous segment duration – codified between 0 and 1 by dividing the duration in ms by
250; durations above 250 ms are codified as 1, i.e. min{D(ms)/250, 1}.
• Previous segment type – codified by activating one of the following: vowel; glide; nasal
vowel; plosive consonant; nasal consonant; lateral consonant; multiple vibrant (R); simple
vibrant (r); fricative consonant; pause. Codifying the surrounding segments this way was
less profitable for the model.
• Identity of previous segment (-2) – after analysing the correlation between the identity of
the previous segment (-2) and the duration of the current segment, no relevant phone
identity was found in that position.
• Identity of previous segment (-3) – after analysing the correlation between the identity of
the previous segment (-3) and the duration of the current segment, no relevant phone
identity was found in that position.
These were the results for the given database. For other data sets, the best results would not
exactly match these. However, the best feature set would probably not yield considerably
different results, because the features presented in Table 3.5 were confirmed on the test and
validation data sets. Even if the best feature set is different for other databases, it would probably
only differ in the ‘not relevant’ features, with no major changes to the model’s performance.
The network’s output node represents the duration of a given segment (ms/250), between 0 and
1, i.e., between 0 and 250 ms. Other codifications for segmental duration were tested, namely using
logarithmic functions, but the results did not improve. The ANN architecture is itself able to
model non-linear functions such as the logarithm, which may explain why the logarithmic
codification of segmental durations at the ANN output did not improve the model’s performance,
as it reportedly does for other model types.
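The output codification just described amounts to a clipped linear mapping; a minimal sketch:

```python
# Sketch of the output codification: segmental duration in ms is mapped to
# [0, 1] by dividing by 250 ms, and clipped at 1 for longer segments.

MAX_MS = 250.0

def encode_duration(ms):
    return min(ms / MAX_MS, 1.0)

def decode_duration(y):
    return y * MAX_MS

print(encode_duration(100.0))   # 0.4
print(encode_duration(300.0))   # 1.0 (clipped)
print(decode_duration(0.4))     # 100.0
```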
\sigma = \sqrt{ \frac{\sum_i x_i^2}{N} }        Eq. (3.8)
where N is the number of segments and x_i is the difference between each segment’s error and the
mean error:
x_i = e_i - \bar{e}        Eq. (3.9)
where the error, e_i, equals the difference between the measured7 and the predicted duration of
each segment:

e_i = d_i^{measured} - d_i^{predicted}        Eq. (3.10)
If the average error, \bar{e}, is null, the standard deviation equals the root mean square error, rmse,
used by some authors [Goubanova and Taylor, 2000] and given by:

rmse = \sqrt{ \frac{\sum_i e_i^2}{N} }        Eq. (3.11)
The mean absolute error (δ) indicates the mean magnitude of the error and is given by:

\delta = \frac{\sum_i |e_i|}{N}        Eq. (3.12)
7 ‘Measured durations’ means the durations resulting from the reading of the texts in the labelled database.
Since the variance between vectors A = [a_1 a_2 ... a_i ...] and B = [b_1 b_2 ... b_i ...] with the same
dimension N is:

V_{A,B} = \frac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{N}        Eq. (3.13)
the variance of a certain vector X with itself is just the squared standard deviation of that vector:

V_{X,X} = \sigma_X^2        Eq. (3.14)
The correlation coefficient between vectors A and B is then the cross-variance of those vectors,
divided by the product of their standard deviations:

r_{A,B} = \frac{V_{A,B}}{\sqrt{V_{A,A} \cdot V_{B,B}}} = \frac{V_{A,B}}{\sigma_A \cdot \sigma_B}        Eq. (3.15)
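The evaluation measures of Eqs. (3.8) to (3.15) can be computed directly; the sketch below uses NumPy and toy duration values, not data from the thesis:

```python
import numpy as np

# Sketch of the evaluation measures in Eqs. (3.8)-(3.15) on toy data.
measured  = np.array([80.0, 120.0, 60.0, 150.0, 90.0])   # ms (toy values)
predicted = np.array([85.0, 110.0, 65.0, 140.0, 95.0])

e = measured - predicted            # Eq. (3.10): per-segment error
x = e - e.mean()                    # Eq. (3.9)
sigma = np.sqrt((x**2).mean())      # Eq. (3.8): std. deviation of the error
rmse  = np.sqrt((e**2).mean())      # Eq. (3.11)
delta = np.abs(e).mean()            # Eq. (3.12): mean absolute error

# Eqs. (3.13)-(3.15): correlation as cross-variance over the std. deviations.
v_ab = ((measured - measured.mean()) * (predicted - predicted.mean())).mean()
r = v_ab / (measured.std() * predicted.std())

print(round(sigma, 3), round(rmse, 3), round(delta, 3), round(r, 3))
```

Note that rmse ≥ σ always holds, with equality exactly when the mean error is null, as stated above.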
Fig. 3.6 shows the error histogram of every segment in both sets (which have similar errors),
compared with the normal distribution. One may observe a concentration of very-low-error
situations, more frequent than under the normal distribution. Since the error is the difference
between measured and predicted values, the figure gives a clear account of the model’s difficulty
in predicting high duration values, and shows that low values are slightly over-predicted. This is a
typical characteristic of the output of a statistical model: there is a slight reduction in the
dynamics of the predicted durations when compared with the original ones.
Fig. 3.6 – Error histogram and normal distribution curve for every segment in both sets.
Fig. 3.7 – Normal probability distribution and absolute error curve for every segment in both sets.
Fig. 3.7 shows the normal probability curve and the absolute-error probability distribution for
every segment in both sets. If the error had a normal distribution, it would appear as a straight
line in the chart.
The analysis of the charts in Fig. 3.6 and Fig. 3.7 indicates that, although the distribution
deviates from the normal pattern, it is fairly close to it. Very-low-error situations are more
concentrated than under the normal distribution and are therefore more likely to occur. The higher
positive-error situations (a positive error occurs when the measured duration is greater than the
predicted one) are also more frequent than under the normal distribution.
When analysing the dispersion of measured and predicted durations through their standard
deviations (σ = 35.9 ms for measured and σ = 30.6 ms for predicted durations), one can see that
the model’s predicted durations show lower dispersion. This tendency confirms the model’s
difficulty in predicting the durations of segments much longer than average, which corroborates
the impressions from the visual inspection of some examples.
Fig. 3.7 shows that the model predicts 75% of the durations with an error below 20 ms, 90%
with an error below 30 ms and 95% with an error below 40 ms.
Fig. 3.8 shows measured, predicted and average durations for the phoneme sequence of a given
sentence. This example is not an attempt to evaluate the proximity of durations in the model, but
simply the application of the model to a sentence from the database. The average duration
sequence is obtained by replacing the duration of each segment with the average duration of the
corresponding phoneme in the database. The figure reveals the model’s difficulty in matching the
highest measured duration values. Fig. 3.9 shows predicted and measured durations for a different
sentence.
Fig. 3.8 – Measured, predicted and average duration contours for the phoneme sequence in the sentence
“Conhece a situação na pele. Aprendeu-a na idade em que se aprende e se não esquece.”. Meaning ‘Knows the
situation on the skin. Learned it in the ages when we learn and don’t forget.’.
Fig. 3.9 – Measured and predicted duration contours for the paragraph “Que igualdade perante a lei? João
Amaral”. Meaning ‘How equal before the law? João Amaral’.
Desirably, these results would be compared with those of other models, but that is a complex
task and probably not even a correct one, since each model has its own characteristics not covered
by these parameters, such as the ability to predict durations at different speech rates, as in the
Barbosa–Bailly and Keller–Zellner models, or the ability to insert pauses. Besides, models for
other languages use different databases; there is no common corpus for a precise evaluation. The
choice of the evaluation corpus is a relevant aspect, since results differ from sentence to sentence,
even for the same type of sentences. The very size of the database used in the model’s learning
stage is likely to influence the final results. Additionally, there is some divergence in the
indicators used to present results. The language itself imposes a different phoneme segment
inventory, which also varies from author to author: finer-grained segments may be used, causing
results to differ. Finally, the speech rate is not always the same, and is sometimes not even
mentioned; model results are very sensitive to the speech rate.
Thus, for the reasons stated above, this model was not objectively compared with other
duration models. Still, its standard deviation of approximately 20 ms, as well as its linear
correlation coefficient above 0.8, is at the state-of-the-art level of duration models, judging by the
relevant papers in the bibliography and by the systems presented earlier in this chapter.
In spite of a wide feature range specifying each segment, the phoneme identity feature is clearly
dominant. Consequently, an analysis of the model’s results, by segment type, is now presented.
Table 3.8 displays values concerning occurrence number, standard deviation of the error, mean
absolute error, linear correlation coefficient, measured and predicted average, measured and
predicted minimum8 and measured and predicted maximum9 for each type of segment in both sets.
8 If these minimum values are very low, they were either caused by a labelling error or by a sporadic situation;
consequently, they have little importance.
9 When the measured value is superior to 250 ms, it is limited to that value.
The best linear correlation value does not always correspond to the best standard deviation value,
as seen in the phoneme segments of [E] and [i], for instance.
Table 3.8: Values for each segment type (phone) in both sets: occurrence number (#); error standard deviation
(σ); mean absolute error (δ); linear correlation coefficient (r); measured average (Av.) and predicted average
(Pred. Av.); measured minimum value (Min.) and predicted minimum value (Pred. Min.); measured maximum
value (Max.) and predicted maximum value (Pred. Max.).
Fig. 3.10 and Fig. 3.11 contain examples of measured and predicted duration histograms for
vowel [a] and consonant [t], respectively. The predicted and measured duration histograms are
similar, and the same holds for the histograms of the other segments. Some differences between
the measured and predicted maximum values in Table 3.8 are due to outliers in the measured
values, as can be seen in Fig. 3.10 and Fig. 3.11.
Statistical data comparing the model’s results with utterance values (so far called ‘real’ values)
are not the only evaluation parameters, since the model’s performance is being compared with an
utterance which is possibly not the best one, and certainly not the only accurate one. Thus,
chapter 5 presents a perceptual test, an additional and important evaluation indicator for the
duration model.
Fig. 3.10 – Histogram of measured and predicted durations for phoneme [a].
Fig. 3.11 – Histogram of measured and predicted durations for the burst part of phoneme [t].
To answer these questions, the model was tested with an alternative application of the neural
network, from now on referred to as the alternative model. The alternative model consists of one
ANN for each type of segment, keeping all the features except the one concerning the identity of
the phoneme segment. The structure of each network is the same as in the previous model,
although in this case each network only sees the stimuli of its own segment type in each set. The
networks were individually trained, using a process similar to the one previously described.
In short, each of the 44 ANNs has the structure presented in Fig. 3.3: 55 input nodes; 4 nodes in
the first hidden layer, activated by the hyperbolic tangent function; 2 nodes in the second hidden
layer, activated by the logarithmic sigmoid function; and 1 node in the output layer, activated by
the linear function, corresponding to the segmental duration.
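The described 55–4–2–1 structure can be sketched as a plain forward pass; the weights below are random placeholders rather than trained parameters, and the helper names are illustrative:

```python
import numpy as np

# Sketch of one of the 44 per-phone networks: 55 inputs -> 4 tanh ->
# 2 log-sigmoid -> 1 linear output (the normalized segmental duration).
# Weights here are random placeholders, not the thesis's trained parameters.
rng = np.random.default_rng(0)

W1, b1 = rng.normal(size=(4, 55)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)) * 0.1, np.zeros(2)
W3, b3 = rng.normal(size=(1, 2)) * 0.1, np.zeros(1)

def logsig(a):
    """Logarithmic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_duration(features):
    """features: vector of 55 codified inputs; returns duration in ms."""
    h1 = np.tanh(W1 @ features + b1)   # first hidden layer (tanh)
    h2 = logsig(W2 @ h1 + b2)          # second hidden layer (log-sigmoid)
    y = W3 @ h2 + b3                   # linear output node
    return float(y[0]) * 250.0         # decode ms/250 back to milliseconds

print(predict_duration(np.zeros(55)))
```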
One advantage of the alternative model is that a given phoneme segment’s duration cannot be
“disturbed” in either direction by the influence of the other segments’ features. However, that
may also be a disadvantage, since parameter information for a given segment is not shared with
the others. This becomes more relevant when the number of stimuli for a segment is clearly
insufficient to train a sizeable network.
Fig. 3.12 shows an error histogram for all segments in both sets in comparison to the normal
distribution plot.
The error distribution in Fig. 3.12 is very similar to the error distribution for the initial model, in
Fig. 3.6.
Fig. 3.13 shows the normal probability distribution and absolute error curve for every segment in
both sets with the alternative model.
The absolute error curve in Fig. 3.13 has a pattern similar to that of Fig. 3.7. However, this
model predicts 75% of the duration values with an error below 18 ms, against 20 ms for the
previous model; 90% with an error below 30 ms; and 95% with an error below 37 ms, against
40 ms for the previous model.
Fig. 3.12 – Error histogram and normal distribution curve for all segments in both sets with the alternative
model.
Fig. 3.13 – Normal probability distribution and absolute error curve for all segments in both sets with the
alternative model.
Table 3.10: Values for each segment type (phone) in both sets of the alternative model: occurrence number
(#); error standard deviation (σ); mean absolute error (δ); linear correlation coefficient (r); measured average
(Av.) and predicted average (Pred. Av.); measured minimum value (Min.) and predicted minimum value
(Pred. Min.); measured maximum value (Max.) and predicted maximum value (Pred. Max.).
Table 3.10 displays values concerning occurrence number, error standard deviation, mean
absolute error, linear correlation coefficient, measured and predicted average, measured and
predicted minimum and measured and predicted maximum values for each type of segment in both
sets.
In comparison to Table 3.8, one can observe significantly different values for some phoneme
segments as far as standard deviation, mean absolute error and linear correlation coefficient are
concerned, usually with better results for this model. However, this model had greater difficulty
estimating extreme (very high or very low) segmental duration values. As expected, it exhibits
lower value dispersion for each phone, since the training set for each phone is also smaller.
This model was initially presented only as the alternative model because, at the beginning of the
study, the set of features and network input nodes was much larger than the current one, which
led to a larger set of network parameters to estimate during training. Since the number of training
situations should be at least 5 times the number of parameters, most phones in the training set
never reached those numbers, and the results were slightly worse than those of the original model.
With a significant reduction of network input nodes, most phones were able to satisfy that
requirement and, consequently, the model’s results improved significantly.
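The rule of thumb above is easy to check with simple arithmetic; counting the weights and biases of the 55–4–2–1 network described earlier:

```python
# Sketch: counting weights and biases of the 55-4-2-1 per-phone network and
# applying the rule of thumb that the number of training examples should be
# at least 5 times the number of parameters.

layers = [55, 4, 2, 1]
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
min_examples = 5 * params

print(params, min_examples)  # 237 parameters -> at least 1185 examples
```

With the earlier, much larger input layer, the parameter count (and hence the required number of training examples per phone) would have been several times higher, which is why many phones fell short of the requirement.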
The alternative model achieved better results than the original one. The perceptual test
presented in chapter 5 will be determinant for the choice between the two.
3.6 Pauses
As far as pauses are concerned, it is important to distinguish intra-paragraph pauses from inter-
paragraph pauses: the first occur within the paragraph, separating sentences or phrases, or are
imposed by punctuation (e.g. . , ; : ! ?); the second occur at paragraph boundaries and are usually
longer. Each type of pause has its own duration, which had to be studied.
This study used the speech corpus described in chapter 2. Due to recording and editing
conditions (interruptions and cuts between paragraphs), inter-paragraph pause durations were not
considered, since some of these pauses are artificial. There is always a relatively long pause
between paragraphs, but its duration is not part of this study; only intra-paragraph pauses are.
Due to several restrictions, such as a database not entirely adequate for the task and time
constraints, only a very simple and incomplete pausing model is presented.
This study comprises two different tasks: the first is to predict the locations where pauses occur;
the second, to model their durations.
Table 3.11 exhibits the number of occurrences of each type of studied sentence marker, and the
number of silences associated with those occurrences, for the texts used in the training set (the
same as for segmental durations). The table reports only markers that do not end a paragraph.
Apart from the reported cases, 119 pauses occurred between words with no punctuation marker.
*1 - Only hyphens between words were considered, not hyphens within words (e.g. ‘chamo-me’ –
my name is).
*2 - This type of sentence marker is always associated with a paragraph change, so there is no
intra-paragraph occurrence; however, this marker is known to impose a pause.
*3 - This marker was always (in just two cases) followed by another one (comma or full stop).
For the sentence markers “.” and “,”, the number of occurrences in the database is statistically
relevant, allowing the conclusion that there is always a pause associated with “.” and frequently a
pause associated with “,”. For the other markers, the number of occurrences has no statistical
significance, although it indicates that there is a pause associated with “?”, “!”, “;”, “:” and “(“;
that there is usually a pause associated with “-“; and that there is usually no pause associated
with “””.
For pauses between words with no sentence marker, there was an attempt to identify words
associated with pauses, before or after them, but no word occurred significantly often near a
pause.
Thus, although this issue needs further study, a preliminary rule was established for pause
insertion in a text to be synthesized: the occurrence of at least one of the following markers
imposes a pause: {. , ? ! ; : - ... (}. The other types of pauses (about 30%) were not considered.
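The preliminary rule can be sketched as a simple marker lookup (the tokenization below is illustrative):

```python
# Sketch of the preliminary pause-insertion rule: a pause is inserted after
# any of the listed orthographic markers; other pause types are not handled.

PAUSE_MARKERS = {".", ",", "?", "!", ";", ":", "-", "...", "("}

def imposes_pause(token):
    """True if this orthographic token triggers a pause."""
    return token in PAUSE_MARKERS

tokens = ["Conhece", "a", "situação", "na", "pele", ".", "Aprendeu-a", ","]
pauses_after = [i for i, t in enumerate(tokens) if imposes_pause(t)]
print(pauses_after)  # pause after '.' and ','
```

Note that, consistently with note *1 above, the word-internal hyphen in “Aprendeu-a” does not trigger a pause, since it is not a separate token.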
Pauses not associated with sentence markers should function as semantic group boundaries, and
could be obtained by a prosodic phrasing determination process. Since no automatic semantic-
group identification tools were available, the door is open for future research in this area. A
pausing model should consider prosodic phrasing, as in the work described by Viana and others
[2003], whose phrasing markers are very serious candidates for pausing.
A set of automatically-extractable features was assembled to enable the network to enhance its
performance. This set of features is presented in Table 3.12.

Table 3.12: Feature set for the intra-paragraph pause duration predictor.

| Feature                                       | Fields                                                       |
|-----------------------------------------------|--------------------------------------------------------------|
| Type of sentence marker associated with pause | By activating 1 of 3: {. , other}                            |
| Distance to previous pause                    | Time (s); number of segments; number of intonation groups    |
| Sentence marker of the previous pause         | By activating 1 of 4: {beginning of paragraph, . , , , other}|
| Distance to following pause                   | Time (s); number of segments; number of intonation groups    |
| Sentence marker of the following pause        | By activating 1 of 4: {end of paragraph, . , , , other}      |
Table 3.13: Best results for the intra-paragraph pause duration predictor.
Table 3.14: Marker type results for the pause duration predictor.
Training Test
*1 - Number of cases.
*2 - Average duration of the pauses produced by the speaker.
*3 - Standard deviation of the pauses produced by the speaker.
*4 – Root mean square error (Eq. (3.11)) of the difference between predicted durations and
durations produced by the speaker.
Table 3.13 exhibits the model’s best results for the training and test sets10; Table 3.14 displays the
results by type of marker associated with the pause, for both sets.
In this case, the root mean square error (rmse) replaced the standard deviation of the difference
between predicted and measured durations, since the error mean is not null: the model failed to
estimate the average duration value, both for all durations and for durations grouped by marker
type. This happens because the average duration differed significantly between the training and
test sets. Moreover, the rmse of the predicted durations is large in comparison with the standard
deviation (σ) of the measured durations, which shows that the results were not good.
A logarithmic codification for duration values was also tested, using Eq. (3.16), where D is the
pause duration in seconds and D’ is the codified duration. Yet, there was no improvement.
First, it was necessary to separate inter-paragraph from intra-paragraph pauses. Inter-paragraph
pauses were not studied due to cuts during the recording process. As for intra-paragraph pauses,
statistical data allowed some rules to be established regarding their location in relation to
sentence markers, although occurrences in the database were not numerous. There are also pauses
between words that are not associated with any sentence marker; for these, the only available
information was the words themselves, and it was impossible to establish a model given the
database restrictions and the amount of automatically-extractable information. Desirably,
linguistic information would enable an automatic division of the sentences into semantic groups
in this case [Oliveira, 2002], [Masaki et al., 2002]. Viana and her collaborators [2003] mention
that their phrasing module for European Portuguese achieves, on a correctly punctuated text, a
61% average rate of correctly inserted pauses with no false insertions. When word information is
added (functional or content-related), performance increases to 85%, but false insertions also
increase from 0 to 17%. When punctuation information is crossed with POS (part-of-speech)
information, correct pause insertion reaches 92% and false insertions drop to 4%, for a set of 12
labels.
In this work a very simple pausing model was proposed to insert pauses at intra-paragraph
breaks and predict their durations. The model inserts pauses only according to orthographic text
markers, disregarding other breaks that are also important; in the considered database, pauses
correlated with orthographic markers amount to 70% of the total. An ANN was proposed,
considering only the distances to, and the types of, the previous and following pauses. The rmse
(95 ms) and correlation coefficient (0.54) achieved on the test set are at the level of the results
produced by Navas [2003] for the Basque language using a CART-based approach (rmse of 80 ms
and correlation of 0.50).
10 These sets were obtained from the existing intra-paragraph pauses in the training and test texts of the
duration model.
The pausing module could be improved with a specific database containing many more pauses,
which would allow statistically relevant results, and with a significant amount of syntactic
information. For such a database there is no need to label the phonemes; it suffices to identify the
pauses by type and to register the syntactic information.
3.7 Conclusion
A model and an alternative version of it were proposed to predict phoneme segmental durations
from text, with a view to synthesizing the speech of that text. Both are based on feed-forward
artificial neural networks with a set of input features that specify the identity and context of a
given segment. The model consists of a single ANN with the identity of the segment codified in
the input features, while the alternative model consists of one ANN per segment identity, in a
total of 44 ANNs. The remaining features are the same in both models. The feature set, ANN
architecture and training alternatives were carefully optimised. Training was done on a database
of read texts with several types of sentences.
Both models achieved a very high performance level, but the alternative model, with its specific
training for each segment type, had slightly better final results: a standard deviation of the error of
18.2 ms and a correlation coefficient of 0.86, against 19.5 ms and 0.84 for the original model. The
perceptual relevance of this difference will be studied in chapter 5. It was shown that the
prediction of segmental duration benefits from splitting one large model into smaller, dedicated
model units.
The results presented are as good as the best reported in the literature for different models and other languages.
The model’s results were analysed by comparison with the durations obtained from the speech labelling of a text set. The way the texts were read is certainly not the only possible way, and although the model tried to “imitate” a (professional) speaker, his reading rhythm is not always coherent across sentences. This becomes quite obvious when the model’s results are preferred to the original ones. One should also take into consideration the error margin in the speech labelling itself. There are two types of errors: gross errors, resulting from the incorrect marking of segments; and precision errors, resulting from the lack of coherence in marking every segment at the same moment of the cycle. The first were deleted as they were found; the second are typical of the manual labelling process and reflect the difficulty of identifying phoneme boundaries. Consequently, there is a certain error margin in the very identification of the original segment durations.
The purpose of the duration model is its application in a text-to-speech synthesis system; there, the durations of voiced speech segments are always multiples of the fundamental period. Thus, there is no benefit in making the model’s durations more precise than the fundamental period itself. For a typical fundamental frequency of about 100 Hz, this period is 10 ms, and in some cases about half that.
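Since voiced-segment durations are realised as whole numbers of fundamental periods, model precision below one period is wasted. A small illustrative calculation (not part of the thesis):

```python
def quantize_to_period(duration_ms, f0_hz=100.0):
    """Round a predicted voiced-segment duration to a whole number of
    fundamental periods (illustrative arithmetic, not the thesis code)."""
    period_ms = 1000.0 / f0_hz          # 10 ms at F0 = 100 Hz
    n_periods = round(duration_ms / period_ms)
    return n_periods * period_ms

# A 96 ms prediction is realised as 100 ms at F0 = 100 Hz, so any model
# precision finer than one period (10 ms here) brings no audible benefit.
```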
As mentioned by several authors, namely Klatt [1976], there is a minimum difference in segmental duration that is perceived by the listener. It differs according to the length of the segment and its location within the word and sentence. In a summary of studies for several languages, Klatt points to 10 ms for segment durations of 100 ms for vowels, fricatives, plosives and nasals in Japanese, where some phonemes are differentiated by their durations. He also mentions 20 and 25 ms in studies by different authors for English. Lastly, he concludes that duration modifications below the minimum perceived value of 25 ms are, from the perception point of view, considerably less relevant than those above it.
The model is obviously not perfect. It can evolve, specifically in those rare situations where the error margin is large. However, for the model to improve when applied to a text-to-speech synthesis system, other blocks of the system should also be improved. Brinckmann and Trouvain [2003] mention that, for TTS purposes, the quality of the symbolic representation (instead of full lexical representation) matters more than some perceptually masked improvements in the duration prediction models. At this stage, the focus of this work is no longer on segmental modelling, but on the improvement of other synthesis blocks.
Finally, a simple pausing model to insert pauses and predict their durations was presented. Pause insertion is determined solely by orthographic punctuation markers, covering about 70% of existing breaks. Durations are predicted with an ANN using text information and contextual aspects only. The duration prediction model achieved promising results of 95 ms rmse and a correlation of 0.54. Still, the database was considered not suitable for the purpose, and the contextual information was not sufficient. A prosodic phrasing model with a larger linguistic basis is required.
4 Fundamental Frequency
In this chapter, some of the most relevant Fundamental Frequency (F0) models are reviewed. An introduction to the Fujisaki F0 contour generation model is given, as well as a description of the interactive tool that allows the semi-automatic estimation of the F0 parameters according to this model. There is a discussion of the methodology adopted in the estimation process, which associates Accent Commands (henceforth ACs) with syllables. Phrase Commands (henceforth PCs) are inserted by a rule-based method aligned with accent groups. The final position of each PC is determined by its anticipation relative to the accent group, which is predicted with an ANN, as is its magnitude. ACs are predicted with four ANNs, for amplitude, onset time, offset time, and whether or not an AC is associated with a given syllable.
4.1 Introduction
The F0 contour has proven to be the most relevant prosodic parameter for conferring naturalness on synthetic speech. Due to its complexity, it is also the issue most focused on in scientific publications related to prosody.
There is no consensus in the literature when it comes to defining prosody or prosodic models. There are usually two views on these concepts, even if the definitions differ a lot. One of them is more concrete [Ladd and Cutler, 1983]: it conceives prosody from a physical point of view, as a set of acoustic parameters that can be measured and modelled, including ‘pitch’ (F0), duration and intensity. This is the view adopted in this work.
Prosody may include non-lexical information regarding types of utterance (declarative, interrogative, etc.); it may also carry utterance functions such as sentence focus or the prominence of certain sections of the sentence. Moreover, prosody may contain information on the potential emotions of the utterance.
Prosody, here expressed by F0, represents information at the linguistic, nonlinguistic and paralinguistic levels, as defined by Fujisaki [1997:28] and transcribed below:
“Here I define linguistic information as the symbolic information that is represented by a set of
discrete symbols and rules for their combination. It can be represented either explicitly by the
written language, or can be easily and uniquely inferred from context.”
“On the other hand, paralinguistic information is defined as the information that is not inferable
from the written counterpart but is deliberately added by the speaker to modify or supplement
the linguistic information.”
“Nonlinguistic information concerns such factors as the age, gender, idiosyncrasy, physical,
and emotional states of the speaker, etc. These factors are not directly related to the linguistic
and paralinguistic contents of the utterances and cannot generally be controlled by the speaker,
...”.
Naturally, present TTS systems cannot handle paralinguistic and nonlinguistic information. However, this information is included in the databases from which prosodic models are built, and it is therefore also an uncontrolled part of these models. Thus, linguistic information is the only source providing hints to control the F0 contour according to the model in use.
Most TTS systems divide the intonation generation task into a linguistic component and an F0 generation component [Sproat, 1998]. The linguistic component is responsible for analysing the text, processing the input along with possible high-level markers. These markers, not deducible from the text, contain information on prosodic intentions, information that gives rise to prosodic events. The F0 generation component consists of the process of generating an F0 contour from linguistic representations or generated prosodic events. Traditionally, the F0 generation component is conceived to support a specific abstract representation.
Prosodic models become more effective the better the information the linguistic component provides. Syntactic and morphological analysers are likely to improve the linguistic component’s results significantly, since they allow more accurate decisions as far as the F0 generation component is concerned. The linguistic component alone is not sufficient to build a prosodic model, because the same sentence can be said in various ways, depending on the context. One can say:
• I didn’t *eat* the apple. (I picked it up, but gave it to the boy instead);
• I didn’t eat the *apple*. (I ate an orange and a pear, but not the apple).
In these examples of the same sentence, the linguistic information is precisely the same, but the emphasis lies on the marked word, according to the context. Cases such as these can only be solved with prosodic models prepared to handle high-level prosodic markers, so that linguistic features like emphasis or sentence focus can be identified. If a certain prosodic feature is not clearly inferable from the text, or if it lacks an identifying marker, a neutral production rather than a wrong one is preferred.
Interactive TTS systems require more freedom of prosodic expression than is currently allowed [Kochanski and Shih, 2002]. Most TTS systems are conceived to handle little or no prosodic information marked outside the text. Kochanski and Shih believe that the next generation of TTS applications will not suffer from these constraints, as they will be directed at dialogue applications and will thus contain information regarding the goals and intentions of the utterance. This information must be expressed by prosody, so “concept to speech” should be seriously considered in speech synthesis. Moreover, some applications require emotional simulation, stylistic variation, etc., so this information should be provided to TTS systems by adding markers to the text. With these markers, the system would not have to infer so much from the text and would consequently make fewer mistakes and attempt a more daring, less neutral utterance.
For that matter, the model presented here can easily be adapted to handle a set of prosodic markers from a marking system.
There are different intonation schools describing prosody. The best known are briefly described below:
• ToBI (Tone and Break Indices) – the most widely used basis for representing intonation and prosodic structure in several languages. It is based on thorough research of intonation systems and on the relation between intonation and prosodic structure for a given language. Each accent is represented by no more than two points, which abstractly specify the relative contrast between high (H) and low (L) tones [Pierrehumbert, 1980], [Hirschberg and Pierrehumbert, 1986] and [Silverman and Pierrehumbert, 1990] (Fig. 4.1). The ToBI system aims at specifying a minimal set of intonation category markers, which are usually interpreted as phonological distinctions of accent types. Frota [2000] made a prosodic characterisation of Standard European Portuguese using this model for the intonation description;
• Tilt – a model that represents intonation as a linear sequence of events, which may be F0 accents or boundary tones. Each event is characterised by continuous parameters representing amplitude, duration and ‘tilt’ (a measure of the shape of the event) [Taylor, 2000];
The Fujisaki model was adopted in this work for the following reasons:
• It has been successfully applied to TTS systems in other languages, namely Japanese, German [Mixdorff, 1998, 2002] and Basque [Navas, 2003]; those systems obtained improved results when moving from their original prosodic models to the Fujisaki model;
• It allows precise modelling of the F0 contour with a relatively small number of parameters, as will be seen in 4.3;
• Separating the phrase and accent components divides the problem, making a more rigorous analysis of each part possible.
The physiological basis and the mathematical formulation of the Fujisaki model are presented in section 4.2. Section 4.3 presents the tool developed for the estimation of the Fujisaki parameters, the estimation process itself, and the considerations taken in estimating those data. The organisation of the data and the definition of the several parts of speech used are explained in section 4.4. In section 4.5 a PC insertion model is presented, as well as a model to predict the magnitude of these commands and their anticipation relative to accent groups. In section 4.6 an AC prediction model is documented, where location, amplitude and duration are discussed. Results of each part of the model are presented. The predicted F0 contour is obtained after application of each part of the entire model, as documented in section 4.7, and is applied over the speech signal with modified segmental durations.
2. Inferring the units and the structures of prosody from the commands.
[Fig. 4.2 block diagram: linguistic (lexical, syntactic, semantic, pragmatic), paralinguistic (intentional, attitudinal, stylistic) and nonlinguistic (physical, emotional) information feed the chain message planning → utterance planning → motor command generation → speech sound production, yielding the segmental and supra-segmental features of speech.]
Fig. 4.2 – Processes by which various types of information are manifested in the segmental and supra-segmental features of speech. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki).
As step 1 is the inverse operation of the speech production process, it can be conducted more accurately and objectively if there is a quantitative model of the production stage. That model has been successfully applied to several languages. The process of inferring units and prosodic structures described in step 2 gave way, in this work, to the development of a statistical model able to generate the corresponding parameters from the text automatically.
[Fig. 4.3 block diagram: Phrase Commands (impulses of magnitude Ap) drive the phrase control mechanism Gp(t), and Accent Commands (pedestals of amplitude Aa) drive the accent control mechanism Ga(t); the phrase components, the accent components and the baseline loge Fb are added to produce the fundamental frequency contour loge F0(t).]
Fig. 4.3 – Functional model for the process of generating F0 contours. (Figure published in [Fujisaki, 2002], edited with courtesy of Hiroya Fujisaki).
Fig. 4.3 represents the process of generating F0 contours from PCs and ACs in Fujisaki’s model.
The PCs are a set of impulses, and the ACs are a set of stepwise functions. The F0 contour can be
expressed by Eq. (4.1), where Gp(t), Eq. (4.2), represents the impulse response function of the
phrase control mechanism and Ga(t), Eq. (4.3), represents the step response function of the accent
control mechanism.
log_e F0(t) = log_e Fb + Σ_{i=1}^{I} Ap_i Gp(t − T0_i) + Σ_{j=1}^{J} Aa_j {Ga(t − T1_j) − Ga(t − T2_j)}   Eq. (4.1)

Gp(t) = α² t exp(−αt) for t ≥ 0, and Gp(t) = 0 for t < 0   Eq. (4.2)

Ga(t) = min{1 − (1 + βt) exp(−βt), γ} for t ≥ 0, and Ga(t) = 0 for t < 0   Eq. (4.3)
where Fb is the baseline frequency; I and J are the numbers of PCs and ACs; Ap_i and T0_i are the magnitude and timing of the i-th PC; Aa_j, T1_j and T2_j are the amplitude, onset time and offset time of the j-th AC; α and β are the natural angular frequencies of the phrase and accent control mechanisms; and γ is the ceiling of the accent component.
Fujisaki assumes that the parameters α and β are constant at least within an utterance, and the parameter γ is set equal to 0.9. The rapid fall of F0 often observed at the end of a sentence can be regarded as the response of the phrase control mechanism to a negative impulse resetting the phrase component.
Fujisaki [1988, 2002] presents the physiological and physical mechanisms underlying the model.
The three components added on a logarithmic scale in the Fujisaki model are the F0 baseline, which is speaker dependent; the phrase component, which is related to prosodic phrasing; and the accent component, related to syllable or word accents. The first is constant, and the model to produce the last two components from text is explored in the next sections.
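The generation process of Eqs. (4.1)–(4.3) is compact enough to sketch directly. The following sketch (an illustration, not the thesis implementation) uses the constant values fixed later in Table 4.1 (Fb = 75 Hz, α = 2.0 /s, β = 20 /s, γ = 0.9):

```python
import math

# Sketch of Eqs. (4.1)-(4.3); constants from Table 4.1 and gamma = 0.9.
ALPHA, BETA, GAMMA, FB = 2.0, 20.0, 0.9, 75.0

def Gp(t):
    """Impulse response of the phrase control mechanism, Eq. (4.2)."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0

def Ga(t):
    """Step response of the accent control mechanism, Eq. (4.3)."""
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA) if t >= 0 else 0.0

def f0(t, phrase_cmds, accent_cmds):
    """F0(t) in Hz, Eq. (4.1). PCs are (T0, Ap) pairs; ACs are (T1, T2, Aa)."""
    log_f0 = math.log(FB)
    log_f0 += sum(Ap * Gp(t - T0) for T0, Ap in phrase_cmds)
    log_f0 += sum(Aa * (Ga(t - T1) - Ga(t - T2)) for T1, T2, Aa in accent_cmds)
    return math.exp(log_f0)
```

With no commands the contour stays at the baseline Fb; a sufficiently long AC of amplitude Aa saturates at Aa·γ above the baseline on the log scale, as Eq. (4.3) imposes.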
Fig. 4.4 displays the phrase components for different magnitudes, Ap. The shape of the phrase components is the same, but the higher the magnitude Ap, the higher the component and the faster the rising and falling slopes, which model the declination line of the F0 contour.
Fig. 4.5 displays the phrase components for different natural angular frequencies, α. The higher α is, the sharper the shape becomes, with faster rising and falling slopes and also a higher magnitude of the phrase component.
α should be chosen according to the shape of the lower values of the F0 contour, and the magnitude Ap adjusted according to the F0 amplitude.
[Fig. 4.4 plot: F0 (Hz) vs. time (s).]
Fig. 4.4 – Phrase component for PC magnitudes Ap = 0.15, 0.30, 0.50 and 0.80 with α = 2 /s, logarithmically added with Fb = 75 Hz.
[Fig. 4.5 plot: F0 (Hz) vs. time (s).]
Fig. 4.5 – Phrase components for PCs with α = 1, 2, 3 and 4 /s, Ap = 0.5, logarithmically added with Fb = 75 Hz.
Fig. 4.6, Fig. 4.7 and Fig. 4.8 display the accent component as a function of amplitude (Aa), accent command duration (T2−T1) and angular frequency (β), respectively.
The higher the amplitude Aa, the higher the amplitude of the accent component and the sharper its contour, with greater variation in the rising and falling curves. The length of the component is independent of the amplitude.
The accent component has different shapes depending on whether or not it reaches the maximum amplitude, which depends on the accent command duration. This component is the sum of a rising and a falling part, which start at the onset (T1) and offset (T2) times, respectively. If the offset comes before the end of the rising part, the rising and falling parts are added until the end of the rising part. The time between the end of the rising part and the offset corresponds to the flat part of the accent component, controlled by the γ parameter.
If the offset time comes after the completion of the rising part, the shape of the falling part of the component is the inverted rising part; both therefore have exactly the same duration, and the falling part starts exactly at the offset time. But if the offset time comes before the full rise of the rising part, the rising part is shorter than the falling part, and the falling part starts only after the offset time. This difference can be clearly observed in Fig. 4.7 in the component with T2 = 50 ms.
The accent component amplitude is limited by the value Aa·γ (on a logarithmic scale), as denoted by Eq. (4.3). The value of γ was set to 0.9, as proposed by Fujisaki in the above-mentioned references.
The duration of the rising and falling parts depends on β: the higher β is, the faster the rising and falling parts of the accent component.
[Fig. 4.6 plot: F0 (Hz) vs. time (s).]
Fig. 4.6 – Accent components for ACs with T1 = 0 s, T2 = 0.15 s, β = 30 /s and Aa = 0.15, 0.30, 0.50 and 0.80, logarithmically added with Fb = 75 Hz.
[Fig. 4.7 plot: F0 (Hz) vs. time (s).]
Fig. 4.7 – Accent components for ACs with β = 30 /s, Aa = 0.60, T1 = 0 s, and T2 = 0.05, 0.1, 0.15 and 0.2 s, logarithmically added with Fb = 75 Hz.
[Fig. 4.8 plot: F0 (Hz) vs. time (s).]
Fig. 4.8 – Accent components for ACs with Aa = 0.60, T1 = 0 s, T2 = 0.15 s and β = 20, 25, 30 and 35 /s, logarithmically added with Fb = 75 Hz.
To start, it is important to note the difference between the terms estimation and prediction. The word estimation is used for the bottom-up process of obtaining parameters (in this case, commands) from the F0 contour. The word prediction is used for the top-down process of obtaining parameters (in this case, commands) from text.
The process of parameter estimation is very laborious. There are algorithms to do this task automatically, such as those presented by [Mixdorff, 2000], [Rossi et al., 2002], [Fujisaki and Narusawa, 2002] and [Narusawa et al., 2001, 2002a, 2002b]. In this work, the Mixdorff algorithm was used as a first approximation, and the parameters were then manually corrected using a tool specially developed for this task.
Fb is speaker dependent; it is not constant even for one speaker and can vary slightly from utterance to utterance.
The parameters α and β do not vary much from one speaker to another, nor from one utterance to another, according to Fujisaki’s experience with many languages and speakers [personally reported], and can be approximated by 3.0 /s and 20 /s, respectively. A smaller value of α tends to miss small, short phrases, approximating several small phrases by one long phrase.
There is a physiological reason to consider the value of β somewhat different for the onset and offset of the accent command [Fujisaki, 2002]. It is larger for the offset, but the same value is used for both for the sake of reducing the number of variables.
Since the database used was recorded by a single speaker, the model developed here is optimised for the characteristics of that particular speaker. In order to reduce the number of variables of the model, and without loss of quality, some parameters were considered constant. The experience of estimating parameters for the speech of that particular speaker, based on the preliminary analysis of several utterances, showed that it would be appropriate to consider Fb, α and β with the values presented in Table 4.1.
Table 4.1 – Values adopted for the constant parameters.

Parameter   Value
Fb          75 Hz
α           2.0 /s
β           20 /s
A special-purpose tool was developed to support the manual estimation of parameters. The next section presents this tool.
The data handled by the tool come from several of the modules described before, such as syllabification and the intonation group rules, from the original F0 contour supplied by PRAAT files, and from the labelled files of the database.
Fig. 4.9 displays the data provided to help the manual labelling. From top to bottom are presented: the speech signal; the F0 determined by PRAAT (blue + signs); the estimated F0 produced from the labelled commands, i.e. accent components plus phrase components plus Fb (in black); the phrase components plus Fb (in black); PCs (black arrows); ACs (black pedestals); the syllables in descending lines (red – tonic syllable, blue – normal syllable, black – syllable without vowel), where each descending line is one accent group; the orthographic phrase marks (in red); the words (beginnings of words are marked with vertical cyan dotted lines); and finally the sequence of phoneme segments. All data are synchronised with the speech signal waveform. The top of the figure gives the root mean squared error between the estimated and original F0 contours, considering only the non-zero values.
[Fig. 4.9 plot residue removed; rmse1 = 2.96 Hz between estimated and original F0 contours.]
Fig. 4.9 – Example of the data provided by the tool to manually estimate the Fujisaki parameters.
The original F0 contour is determined using the PRAAT 4.0 software and saving the data into a file which is then read by the tool. The PRAAT command ‘To Pitch…’ is used to determine the original F0 contour; this command performs a pitch analysis based on an autocorrelation method. The post-processing algorithm removes all F0 values above a maximum threshold, as well as sequences of one to four F0 values where the variation before and after the sequence is higher than a chosen delta. For the present speaker, the threshold is 200 Hz and the delta variation is 10 Hz.
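The post-processing step can be sketched as follows; the run-detection details are an interpretation of the description above, not the original code:

```python
def postprocess_f0(frames, threshold=200.0, delta=10.0):
    """Sketch (interpretation, not the thesis code): discard frames above
    `threshold`, then discard short runs of 1-4 frames that jump by more
    than `delta` Hz relative to both the frame before and the frame after
    the run. A value of 0.0 marks an unvoiced/removed frame."""
    f0 = [v if 0.0 < v <= threshold else 0.0 for v in frames]
    for i in range(len(f0)):
        for length in range(1, 5):
            j = i + length
            if j > len(f0):
                break
            run = f0[i:j]
            if not all(v > 0.0 for v in run):
                continue
            if i == 0 or j == len(f0):
                continue  # no neighbour on one side: keep the run
            before, after = f0[i - 1], f0[j]
            if abs(run[0] - before) > delta and abs(run[-1] - after) > delta:
                for k in range(i, j):
                    f0[k] = 0.0  # remove the spurious run
    return f0
```

For example, an isolated 150 Hz spike inside a 100 Hz stretch is removed, while frames above the 200 Hz ceiling are dropped outright.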
Fig. 4.10 – Window with menus of the tool to manually estimate the Fujisaki parameters.
Fig. 4.10 displays the window of the tool. The left part of the menu bar contains the default items of a Matlab figure, and the right part contains the special-purpose tool menus. In the middle there is a menu with the identification of the paragraph (t2_p19, in this case). The Matlab figure toolbar, presented in a second line, provides zoom in and zoom out. The contents of the special-purpose menus are described below:
Play original:
• Select and play – allows the selection and playback of a part of the speech. The initial and final instants of the selected part remain available for the present signal and for the re-synthesised speech signal;
Play Re-synthesis:
• Load Re-synthesis – loads the file with the re-synthesised speech signal previously saved with PRAAT under a specific name;
• Select and play – allows the selection and playback of a part of the speech. The initial and final instants of the selected part remain available for the present signal and for the original speech signal;
C. Phrase:
• T0 – change PC position;
• Ap – change PC magnitude;
C. Accent:
Options:
• Load Commands – loads a saved set of commands and plots the commands, their respective components and the F0 contour.
The play menus are also available through shortcut keys. The change, insertion and deletion operations are done with the mouse.
The initial set of commands is plotted in black. Changed commands and the respective F0 contour are plotted in red. Fig. 4.10 displays an example with a change in the amplitude of the phrase command at about 3.3 s. The new rmse between the original F0 curve and the new estimated F0 curve is displayed beside the rmse of the initial set of commands.
Fujisaki recommends the use of a logarithmic scale, which has the advantage of a visually additive effect of the phrase and accent components during the manual command labelling. However, a linear scale was used here, in order to allow the representation in the same graphic of the phrase and accent commands (by a scale factor of 1/100) and the speech waveform (by adding a constant offset). The linear scale gives better resolution at higher frequencies, which is especially useful during manual labelling, and the manual command-labelling process remains very intuitive, in the author’s experience.
In the second phase, the set of commands was manually optimised using the tool described above. No linguistic constraints were taken into consideration. This optimisation started by adjusting the PCs’ position and amplitude, making the phrase component touch the valleys of the F0 contour. Next, ACs were changed and/or introduced to produce an estimated F0 closely fitting the original.
Fig. 4.11 displays the first and second phases of parameter estimation. The first estimation uses few ACs, making the estimated contour cross the original F0 contour without concern for following the original shape exactly. In the second phase, no restrictions on the proximity and number of ACs were kept, allowing a better fit and precise tracking of the original F0 shape. The rmse improved from 8.97 Hz to 4.39 Hz between the first and second phases of the estimation process over the whole paragraph, partially shown in Fig. 4.11.
After careful analysis of the ACs, a strong connection between ACs and syllables becomes clear. One AC is considered connected with one syllable if the accent component influences the F0 of the syllable, considering the delay between T1 of the AC and the effective contour of the respective accent component.
In order to decide objectively whether the accent component influences the F0 of the syllable, the concept of zone of influence was introduced: the interval between the instants where the accent component is higher than X% of its maximum value. If there is any intersection between the zone of influence of the AC and the voiced part of the syllable, the AC is considered a candidate to be connected to the syllable. Several values of X between 35% and 60% were considered.
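The zone-of-influence test can be sketched as follows (illustrative: a sampled accent component is assumed, and `zone_of_influence` and `is_candidate` are hypothetical names):

```python
def zone_of_influence(component, times, x_percent=35.0):
    """Interval where a sampled accent component exceeds x% of its own
    maximum (sketch of the concept described in the text)."""
    cutoff = max(component) * x_percent / 100.0
    above = [t for t, v in zip(times, component) if v > cutoff]
    return (min(above), max(above)) if above else None

def is_candidate(zone, voiced_start, voiced_end):
    """An AC is a candidate for a syllable if its zone of influence
    overlaps the voiced part of the syllable."""
    if zone is None:
        return False
    return zone[0] < voiced_end and voiced_start < zone[1]
```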
[Fig. 4.11 plot residue removed; rmse1 = 8.97 Hz (first phase), rmse2 = 4.39 Hz (second phase).]
Fig. 4.11 – Example of the estimated parameters in the first (black) and second (red) phases.
Assuming the general observation of the connection between ACs and syllables, there are some exceptions, discussed in the following topics:
• Syllables without any connected AC – These syllables either contain no voiced sound or contain voiced sound(s). The first case is obvious. Sometimes the next syllable needs a long excursion of F0, leading to a longer AC, and consequently the onset time (T1) of this AC must come early, starting in the current syllable. Such an AC must be associated with the next syllable and not with the current one. This case is very frequent in syllables with or without vowels. No rules were found for these cases yet, but they should be considered in the model;
• One AC with a zone of influence spanning more than one syllable – Such an AC must be considered as a sequence of contiguous ACs with identical amplitudes, where each new AC is associated with the respective syllable. This does not alter the accent component, because the accent component of one AC is the same as the sum of the accent components of two ACs with the same amplitude and total duration, if T2 of the first AC coincides with T1 of the second (i.e. the system is linear);
• Several ACs with zones of influence in the same syllable – Usually two, very rarely three, ACs appear in this case. These ACs are named candidates to be connected to the syllable. Candidates that can be connected to neighbouring syllables still without a connected AC must be connected to those. These cases are solved considering that the AC connected to a given syllable may influence the F0 of neighbouring syllables as well. There still remain the unsolved cases where more than one AC is connected to just one syllable.
These cases were analysed in order to examine the parameterisation of the ACs, to see whether two or more ACs were really needed to produce the F0 contour for these syllables. Once it was observed that no significant loss in the accuracy of fitting the F0 contour occurs with only one AC, a third phase of parameterisation was performed.
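The splitting of a spanning AC relies on the linearity of Eq. (4.1): one AC from T1 to T2 contributes exactly the same component as two contiguous ACs of the same amplitude sharing the split instant. A short numerical check using Ga of Eq. (4.3) (illustrative code, not the thesis implementation):

```python
import math

BETA, GAMMA = 20.0, 0.9  # values used throughout this chapter

def Ga(t):
    """Step response of the accent control mechanism, Eq. (4.3)."""
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA) if t >= 0 else 0.0

def accent_component(t, T1, T2, Aa):
    """Contribution of one AC to log F0, from Eq. (4.1)."""
    return Aa * (Ga(t - T1) - Ga(t - T2))

# Splitting an AC (0.0 s to 0.3 s) at 0.2 s into two contiguous ACs of the
# same amplitude leaves the summed component unchanged: the -Ga(t - 0.2)
# term of the first AC cancels the +Ga(t - 0.2) term of the second.
t = 0.25
whole = accent_component(t, 0.0, 0.3, 0.5)
split = accent_component(t, 0.0, 0.2, 0.5) + accent_component(t, 0.2, 0.3, 0.5)
```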
The third phase of parameterisation was just the correction of the cases where more than one AC was connected with the same syllable. This correction did not cause significant loss in the parameterised database: the global rmse varies from 3.94 Hz to 3.98 Hz and the correlation coefficient from 0.974 to 0.973. One example of this correction is presented in Fig. 4.12, where AC number 21, in black at 5.3 s, was deleted. In this figure the ACs (numbers in black) associated with syllables (numbers in blue) are also visible, together with the accent component corresponding to each AC.
It should be noted that the zone of influence of AC number 20 spans syllables 26 and 27, but, since syllable 26 already has one associated AC (number 19), AC number 20 becomes associated only with syllable 27. Also, the zone of influence of AC number 21 spans syllables 27 and 28, but does not coincide with the voiced part of syllable 28; in this case the AC is, again, associated just with syllable 27. AC number 18 is associated with syllables 24 and 25, so this AC will be divided, exactly at the end of the voiced part of syllable 24, into two ACs, each associated with its respective syllable.
Fig. 4.12 – Example of the AC parameters correction done in the third phase of parameters estimation.
An algorithm was implemented to connect ACs with syllables and identify syllables with more
than one related AC. The flow chart sequence for each syllable is presented in Fig. 4.13. In the flow
chart, the zone of influence of the ACs is between the time instants where these accent components
123
A Prosody Model to TTS Systems
keep values greater than 35% of their maximum. The ACs whose zone of influence overlaps the
voiced part of the current syllable and which are not yet related to previous syllables are
candidates to be related to the current syllable.
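The candidacy test described above can be sketched as follows; the function names and data layout are illustrative, not from the thesis implementation:

```python
# Illustrative sketch of the AC-to-syllable candidacy test described in the
# flow chart; function names and data layout are not from the thesis.

def zone_of_influence(times, component, threshold=0.35):
    """Interval where an accent component keeps values greater than
    35% of its maximum."""
    peak = max(component)
    above = [t for t, v in zip(times, component) if v > threshold * peak]
    return (min(above), max(above))

def is_candidate(zone, voiced, already_related):
    """An AC is a candidate for the current syllable if its zone of
    influence overlaps the syllable's voiced part and the AC is not
    yet related to a previous syllable."""
    (z0, z1), (v0, v1) = zone, voiced
    return not already_related and z0 < v1 and v0 < z1
```

For instance, an AC whose zone of influence does not reach the voiced part of the next syllable, as for AC number 21 above, is not a candidate for that syllable.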
Table 4.2: Root mean squared error and correlation coefficient between estimated F0 and post-processed
original F0 (non-zero values).

rmse (Hz)    r
3.98         0.973
No audibly perceptible difference seems to exist between the original speech and the speech
re-synthesised with estimated F0 contours. Nevertheless, some perceptual tests were made to
confirm this finding, and the results are presented in chapter 5.
Chapter 4 - Fundamental Frequency
Fig. 4.13 – Flow chart of the AC–syllable connection algorithm. When the previous syllable already
has an associated AC spanning both syllables, the algorithm splits the previous AC into two ACs at
the end of the voiced part of the previous syllable, connects the two new ACs to the previous and
current syllables, respectively, and then proceeds to the next syllable.
Fig. 4.14 presents the organization of linguistic and prosodic structures in a paragraph, from
lower to higher level: segment, syllable, word, accent group, prosodic phrase, phrase, sentence and
paragraph. The segment is one of the 44 different phonetic segments that were considered (Table
2.6). The syllable is defined in chapter 2. Words, delimited by spaces, are also well known. The
accent group is intended to be a prosodic structure and is defined in the previous chapter. The
prosodic phrase is also a prosodic structure and is delimited by PCs. Phrases are considered to be
any part of the text between two orthographic marks (, ! ( ) - ; : … " .), including the beginning
of the paragraph. Sentences are delimited by any of the following marks (. ? ! …). Paragraphs are
delimited by a carriage return in the text. The orthographic marks shown at the top of the figure
are boundaries for phrases, sentences and paragraphs.
Fig. 4.14 – Organization of linguistic and prosodic structures in a paragraph: segment, syllable,
word, accent group, phrase, sentence and paragraph, with orthographic marks (, ? .) at the top.
After the second phase of commands estimation several questions arose. These questions result
from the statistical analysis of the estimated commands and from how to use them to build the model.
• How should connections between ACs and syllables be dealt with? Assuming that there is a
generic connection between ACs and syllables, how should the situations be handled where those
connections are not direct? More specifically, the following cases:
The third question about ACs led to the third phase of the estimation process, as described in
section 4.3.2. The first two questions, related to PCs, will be studied and answered in the next
two sections.
Assuming the accent groups behave as prosodic words, the only eligible positions to insert PCs
are the beginnings of these groups. The onset time of the PCs, T0, is usually anticipated relative
to the beginning of the accent group. This anticipation, denoted T0a, is subtracted from the time
instant of the beginning of the accent group (the eligible position), T0E, to produce T0, as
depicted in Fig. 4.15 and expressed by Eq. (4.4).
T0 = T0E − T0a        Eq. (4.4)
Fig. 4.15 – Representation of Eligible positions, T0E, and anticipation, T0a, of PCs.
The following sections will deal with eligible positions in the text for inserting PCs, as well as
the prediction of magnitude, Ap, and anticipation, T0a.
128
Chapter 4 - Fundamental Frequency
Besides the PCs imposed by orthographic marks, about 70% of the total, there are other PCs, about
30%, not linked to punctuation.
The algorithm described below was designed to govern the location of inserted PCs. In the first
step, PCs linked with orthographic punctuation marks are inserted; subsequently, several candidate
positions for inserting other PCs are considered. For each candidate position a score is
determined by a mathematical model, as described in the next section.
Table 4.3 presents the percentage of occurrences of orthographic punctuation marks that originate
PCs, according to the parameters estimated from the database. In this table the punctuation marks
at the end of paragraphs are excluded, because no PCs are inserted at the end of a paragraph.
Although the punctuation marks "!", "…", "-", ";" and ":" occur too rarely to be statistically
relevant, the table suggests associating one PC with each orthographic punctuation mark. In the
case of the comma "," the percentage is not higher basically due to the proximity of some commas
to other punctuation marks.
Table 4.3: Numbers of occurrences of orthographic punctuation marks, associated PCs and percentages of
punctuation marks with associated PCs.

Orthographic punctuation   # of occurrences   # of PCs   %
.                                 67              64     96
,                                379             261     69
?                                 12              12    100
!                                  4               3     75
…                                  1               1    100
-                                  7               6     86
;                                  2               2    100
:                                  6               5     83
In this section only this type of PC will be discussed. The objective is to find anchors to
associate them with. Firstly, every beginning of a paragraph should receive one PC, given the
obvious absence of any punctuation there. Next, the possible existence of additional PCs is
analysed. Text and speech analysis of several of these PCs suggests that different factors
contribute to their locations. Factors such as the distance to the previous PC, the distance to
the next PC, the presence of a pause, the length of the previous word and the type of the next
word were statistically analysed and correlated with the presence of this type of PC.
For each candidate position, a score, S, is calculated by Eq. (4.5), which combines the weights
of these factors.
In Eq. (4.5), S is the score for the candidate position, and WpPC, WnPC, Wp, Wlpw and Wtw are
the weights for the distance to the previous PC, the distance to the next PC, the pause, the
length of the previous word and the type of the next word, respectively.
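Since Eq. (4.5) itself is not reproduced here, the following sketch assumes a simple additive combination of the five weights, with the distance weights given by the normal probability density function; the Gaussian means and deviations below are placeholders, not the Table 4.4 values:

```python
import math

# Sketch of the candidate scoring; Eq. (4.5) is not reproduced in the text,
# so a simple additive combination of the five weights is ASSUMED here, and
# the Gaussian means/deviations are placeholders, not the Table 4.4 values.

def normal_pdf(x, mu, sigma):
    """Normal probability density function used for the distance weights."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def score(d_prev, d_next, has_pause, w_lpw, w_tw,
          mu_prev=1.5, sd_prev=0.6, mu_next=1.8, sd_next=0.8):
    w_ppc = normal_pdf(d_prev, mu_prev, sd_prev)   # distance to previous PC
    w_npc = normal_pdf(d_next, mu_next, sd_next)   # distance to next PC
    w_p = 1.0 if has_pause else 0.0                # presence of pause
    return w_ppc + w_npc + w_p + w_lpw + w_tw      # assumed additive form
```

Note that the presence of a pause alone raises the score by 1, which, given the insertion threshold of 1 discussed later, makes pauses strong cues.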
Fig. 4.16 – Histogram and Gaussian approximation of distances from PCs not linked with orthographic marks
to previous PCs and next PCs.
The distance to previous PC and distance to next PC factors have different histograms, but both
can be approximated by normal distributions, as can be seen in Fig. 4.16. Table 4.4 presents the
relevant statistical data. The weights for these two factors are given by the normal probability
density function, with the respective means and standard deviations presented in Table 4.4,
evaluated at the distances to the previous and to the next PC.
The distance to the next PC is calculated as the end time of a so-called eligible area (see Fig.
4.19) plus 0.75 s, minus the candidate position. This procedure limits the used value of the
distance to the next PC to 4 s. However, the next PC can, eventually, be more distant than 4 s,
as depicted in Fig. 4.16.
The weight for the presence of a pause, Wp, is 1 if a pause is present and 0 otherwise.
For the weight relative to the length of the previous word, Wlpw, the length of the word plus the
length of any pause is considered. This factor shows a higher correlation with the presence of a
PC for values above 0.5 s. The weight used for this factor is logarithmic and is given by the
empirical Eq. (4.6).
(Figure: weight Wlpw, between 0 and 2.5, as a function of the length of the previous word plus
pause, from 0 to 1.5 s.)
The weight for the type of the next word, Wtw, was determined according to the correlation of
some words with this type of PC and is given in Table 4.5. This table, containing the most
correlated words, has weights between 0.7 and 1. For words not in the table, Wtw is 0.7, 0.5 or
0.2 for words with one, two, or more syllables, respectively. These values are empirical and
based on several observations.
The method starts by inserting PCs at the beginning of the paragraph and just after the
punctuation marks of Table 4.3. Then it removes each PC whose distance to the previous one is
less than 1 s, provided the previous sentence is not of the interrogative type.
Then, for the intervals between PCs that are longer than 3 s, candidate positions inside the
eligible area to insert a new PC are identified.
The candidate positions are the eligible positions inside the eligible area. The eligible area,
as depicted in Fig. 4.19, starts 0.6 s after the previous PC and ends at the minimum of (next PC
minus 0.75 s) and (previous PC plus 3.25 s). These limits for the eligible area ensure the
minimum distances to the previous and next PCs, according to Table 4.4.
Then the score S is calculated for each candidate according to Eq. (4.5), and only the
maximum-scored candidate is considered. If the maximum-scored candidate has a score greater than
1, one PC is inserted at its position. The process is repeated with the new set of PCs until the
end of the paragraph.
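The eligible-area limits above can be sketched as follows (the function name is illustrative):

```python
# Sketch of the eligible-area limits (Fig. 4.19) used when inserting PCs;
# the function name is illustrative.

def eligible_area(prev_pc, next_pc):
    """For an interval between PCs longer than 3 s, the eligible area
    starts 0.6 s after the previous PC and ends at the minimum of
    (next PC - 0.75 s) and (previous PC + 3.25 s)."""
    if next_pc - prev_pc <= 3.0:
        return None  # no new PC is considered for short intervals
    start = prev_pc + 0.6
    end = min(next_pc - 0.75, prev_pc + 3.25)
    return (start, end)
```

For PCs at 0.6 s and 7 s, as in the example of Fig. 4.20, this gives an eligible area from 1.2 s to 3.85 s.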
Fig. 4.20 presents an example of the application of the algorithm. Two PCs, at about 0.6 and 7 s,
were initially inserted at the beginning of the paragraph and at the eligible position near the
orthographic mark. Because the distance between them is greater than 3 s, the eligible area was
defined (orange box), and the four eligible positions inside it were taken as candidate positions.
For each candidate the respective score was determined. The second candidate position (at about
2.2 s) has the highest score, 2.7. Since this score is greater than 1, a new PC was inserted at
that position.
Fig. 4.19 – The eligible area: for intervals between PCs longer than 3 s, it starts 0.6 s after
the previous PC and ends at the minimum of (next PC − 0.75 s) and (previous PC + 3.25 s).
The appropriateness of the methodology was evaluated by measuring the closeness of the inserted
PCs to the labelled ones.
Table 4.6 shows the total numbers of inserted and manually labelled (estimated) PCs, which are
very close, as well as the averages of the respective distances. The standard deviations are
somewhat different, as expected, because of the statistical nature of the insertion process,
which generally reduces the variance of the model relative to the original. The histograms of
distances between adjacent labelled and inserted PCs, presented in Fig. 4.21, are similar in
basic shape.
Table 4.7 presents the number of correctly inserted PCs (C), determined as the number of inserted
PCs at a distance of no more than a tolerance X (evaluated for three values) from the nearest
labelled PC; the number of insertion errors (I), as the number of inserted PCs whose distance to
the nearest labelled PC is longer than X seconds; and the number of deleted PCs (D), as the
number of labelled PCs without an inserted PC at distance X or less1. The range of X is a
tolerance for T0a, which affects the exact position of the inserted PC. The maximum anticipation
T0a was experimentally observed to be almost 1 s. The recall rate (R) and precision rate (P) are
also presented, as adopted by Hirose et al. [2003] and determined by the expressions in Table 4.7.
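The recall and precision rates, computed from C, D and I as adopted from Hirose et al. [2003], can be sketched as:

```python
# Recall and precision rates as adopted from Hirose et al. [2003]:
# R = C / (C + D) and P = C / (C + I).

def recall_precision(c, d, i):
    """c: correctly inserted PCs, d: deleted PCs, i: insertion errors."""
    recall = c / (c + d)
    precision = c / (c + i)
    return recall, precision
```

With the values of Table 4.8 (C = 435, D = 211, I = 208), this reproduces R = 67.3% and P = 67.7%.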
Table 4.8 presents the recall rate and precision rate of inserted PCs when only the PCs at the
eligible position just after the estimated PCs are considered correctly inserted. This measure is
more demanding, since only the exact positions of labelled PCs are considered correct positions
for inserting PCs.
Table 4.6: Comparison between estimated and inserted PCs. The number of PCs, the minimum, maximum and
average distances and standard deviations in seconds.
Table 4.7: Numbers of correctly inserted PCs (C), insertion errors (I), deleted PCs (D), the recall rate (R) and
precision rate (P), at a tolerance time distance X, from the labelled PCs.
1 It must be noted that the numbers C+I and C+D in the case of Table 4.8 are exactly the numbers
of inserted and labelled PCs, respectively. In the case of Table 4.7, C+I is also equal to the
number of inserted PCs, but C+D exceeds the number of labelled PCs because more than one inserted
PC can fall inside the range X of the same labelled PC, counting as two correctly inserted PCs
but just one labelled PC.
Table 4.8: Correctly inserted (C), deletion errors (D), insertion errors (I), recall rate (R) and precision rate (P),
for the positions of inserted PC compared to the positions of estimated PC considering the eligible position.
C     D     I     R       P
435   211   208   67.3%   67.7%
Hirose and others [2003] reported a recall rate and a precision rate of 82% and 85%,
respectively, for a process of automatic extraction of PCs from F0 contours (an estimation
process) using linguistic information, where the correct and incorrect positions are clearly
known.
Taking into consideration the values reported by Hirose et al. [2003] for an automatic process of
PC estimation using linguistic information, and bearing in mind the differences relative to this
process of predicting PCs from text, the recall and precision rates achieved here are acceptable:
they are in the same range as those of Table 4.7 and relatively close to those of Table 4.8.
Fig. 4.21 – Comparison of histograms of estimated and inserted PC distances.
Visual inspection indicates that the inserted PCs are generally in coherent positions, as can be
observed in the example given in Fig. 4.22. The final exact position, T0, of each inserted PC
will be affected by the anticipation T0a.
Fig. 4.22 – Comparison of estimated and inserted PC positions. Black arrows are the estimated PCs; magenta
arrows are the inserted PCs.
Several architectures, in terms of type of network, structure, number of layers, number of nodes
in each layer, and activating functions, were considered, and the most appropriate were tested
for both ANNs. For each ANN several thousand training sessions were run, and the best-performing
session was selected as the performance for that architecture. Feed-forward networks trained with
back-propagation algorithms were selected as the type of network to solve the problem.
The networks' input layer has the necessary nodes to code the features discussed below. The
output node codes the predicted parameter, Ap or T0a. The output is 85% of the parameter value
divided by the maximum parameter value, normalized to have zero average and a standard deviation
equal to 1.
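The output coding just described can be sketched as follows (a minimal illustration; the thesis work used MATLAB, while Python is used here):

```python
import statistics

# Sketch of the output coding described above (the thesis work used MATLAB;
# Python is used here for illustration).

def scale_targets(values, max_value):
    """85% of each parameter value divided by the maximum value, then
    standardised to zero average and unit standard deviation."""
    scaled = [0.85 * v / max_value for v in values]
    mu = statistics.mean(scaled)
    sd = statistics.pstdev(scaled)
    return [(s - mu) / sd for s in scaled]
```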
Table 4.9 and Table 4.10 present the architectures, activating functions and training algorithms
of the networks with the best correlation coefficients for predicting Ap and T0a, respectively.
The column "Number of features" refers to the features presented later in 4.5.3.3 and corresponds
to the number of nodes of
the input layer. The additional feature between the cases with 20 and 21 features is the
magnitude of the previous PC.
Table 4.9: Best performance (correlation coefficient), architectures and training algorithms to predict Ap.
Table 4.10: Best performance (correlation coefficient), architectures and training algorithms to predict T0a.
ANNs with one or two hidden layers were used, with the number of nodes varying between 2 and 10.
The output node always has the linear activating function (Lin); the last hidden layer has, in
the best cases, the logarithmic sigmoid activating function (Log); and the first hidden layer,
when present, has the logarithmic sigmoid or hyperbolic tangent (Tan) activating function. The
Levenberg-Marquardt back-propagation training algorithm [Hagan and Menhaj, 1994] always gives the
best results, due to the relatively low number of nodes of the input layer (20 or 21).
The selected architecture to predict Ap is the feed-forward type with two hidden layers of two
nodes each, with logarithmic sigmoid activating functions.
The architecture of the T0a ANN is also the feed-forward type with two hidden layers, but with
four nodes in the first hidden layer, with the hyperbolic tangent activating function, and two
nodes in the second hidden layer, activated by the logarithmic sigmoid function.
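A forward pass through the selected Ap architecture can be sketched as follows; the weights used in testing are placeholders, not trained values, and `logsig` stands for the logarithmic sigmoid activation:

```python
import math

# Sketch of a forward pass through the selected Ap architecture (two hidden
# layers of two nodes, logarithmic sigmoid activations, linear output).
# Any weights supplied are placeholders, not trained values.

def logsig(x):
    """Logarithmic sigmoid activation (MATLAB's 'logsig')."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, act):
    """One fully connected layer: act(W x + b) for each node."""
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def ap_forward(features, p):
    h1 = layer(features, p["w1"], p["b1"], logsig)
    h2 = layer(h1, p["w2"], p["b2"], logsig)
    return layer(h2, p["w3"], p["b3"], lambda x: x)[0]  # linear output node
```

The T0a network differs only in its first hidden layer: four nodes with the hyperbolic tangent activation.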
The 101 paragraphs of the database were divided into a training set with 91 paragraphs and a test
set with the remaining 10 representative paragraphs, picked from the original seven tracks. The
training set has 553 patterns (85%) and the test set has 93 patterns (15%).
Training was done over the training set, using the test set for cross-validation in order to
avoid over-training. The test vector was used to stop training early if further training on the
training set would hurt generalization to the test set. The cost function used for training was
the mean squared error between output and target values.
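The early-stopping scheme can be sketched as below; the `patience` criterion is an assumption, since the exact stopping rule is not detailed in the text:

```python
# Sketch of the early-stopping scheme described above; the `patience`
# criterion is an ASSUMPTION, as the exact stopping rule is not detailed.

def train_with_early_stopping(step, test_error, patience=5, max_epochs=1000):
    """Runs one training epoch per call to `step` and stops when the
    test-set (cross-validation) error has not improved for `patience`
    epochs. Returns the best test-set error observed."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        err = test_error()
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best
```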
The training algorithms described in section 3.3.3 were used. The algorithms trainoss, 'One Step
Secant', and trainrp, 'Resilient back-propagation', give results with lower performance than
trainlm, 'Levenberg-Marquardt'. The latter is clearly the best algorithm for the dimension of the
network, although its training process is slower.
For each variation of the ANN, concerning architecture, training algorithm, activating functions,
and the set of features and its codification, several thousand training sessions were run, and
the best result was selected as the performance for that variation.
Fig. 4.23 displays the average performance of several architectures for the Ap and T0a ANNs,
considering different extensions of the training set. It is visible that for the Ap ANNs the
performance stabilises from 75% of the training set onwards, and more patterns do not improve
performance. For the T0a ANNs, however, performance is still increasing at 100% of the training
set, which suggests that more training patterns could improve the prediction of this parameter.
Fig. 4.23 – Evolution of ANNs performances in test set, over the used extension of the training set.
Several features and different codifications were considered in this study. The final set of
features and their codification are presented in this section, along with a brief discussion of
the excluded features.
Table 4.11 presents the list of features and their individual correlations with Ap and T0a. This
set was built by selecting features with high correlation with the output parameters, examining
the correlation between the features themselves, eliminating linearly correlated ones, and, by a
process of rejection, also the features that deteriorate the final performance.
Although some features presented in Table 4.11 do not individually have a significant correlation
with T0a, suggesting their exclusion from the ANN, in fact their joint presence improves the
prediction performance.
Table 4.11: Set of features and their correlations r with Ap and T0a.
Some features are highly mutually correlated, as is the case of features 3 and 6, 4 and 7, 11, 12
and 13, 15 and 16, and, finally, 18 and 19. Nevertheless, they do not carry exactly the same
information, and using them together improves the performance. An explanation of the features, as
measured in Table 4.11, follows:
1. the correlation coefficient values between most of the orthographic marks and Ap are similar
and substantial, and are not relevant for T0a. Therefore only the comma and the full stop were
classified separately. This feature was coded in four levels according to the correlation of each
mark with Ap: other mark=0, full stop=1/3, comma=2/3, no mark=3/3. This correlation and
codification mean that PCs generated by other marks or full stops have higher Ap than PCs
generated by commas or not associated with orthographic marks;
2. only the interrogative type of sentence showed a different correlation with Ap and T0a.
Therefore this feature was coded in two levels: interrogative type, 1, or other type, 0.
Different types of interrogatives were not distinguished. This is one of the most relevant
features regarding T0a;
10. is the length of the pause, if there is one, just before the PC. This feature is highly
correlated with Ap;
11. indicates whether the PC is at the beginning of a phrase. This position is correlated with
higher Ap;
12. indicates whether the PC is at the beginning of a sentence. This position is correlated with
higher Ap;
13. indicates whether the PC is at the beginning of a paragraph. This position is correlated with
higher Ap;
14. indicates whether the accent group starts with a tonic syllable. Slightly correlated with
higher Ap and longer T0a;
15. measured in seconds: the longer the interval since the preceding PC, the higher Ap and the
longer the anticipation T0a;
16. measured in number of syllables, with a correlation similar to the previous feature;
17. similar to feature 1, but coded in a different order due to the different levels of
correlation: other mark=0, comma=1/3, no mark=2/3, full stop=3/3. This correlation and
codification mean that Ap is higher for PCs following PCs generated by full stops;
18. measured in seconds, it is the phrase component length. The longer the phrase component, the
higher Ap and the shorter T0a. This is the most relevant feature for T0a;
19. measured in number of syllables, with correlations similar to the previous feature;
20. for this feature other marks are relevant. Therefore it is coded in the following seven
levels according to the correlation of each mark with Ap: "other mark"=0, "."=1/6, "…"=2/6,
";"=3/6, "?"=4/6, "no mark"=5/6, ":"=6/6. This means that Ap is higher for PCs preceding PCs
generated by ":", "no mark" and "?";
21. is negatively correlated with Ap and slightly negatively correlated with T0a. It is not used
in the Ap ANN because it deteriorates the performance of that ANN. On the other hand, in spite of
its low correlation with T0a, in that network this feature improves the performance.
In the codification, all features are normalised to the range between 0 and 1 after being divided
by an established maximum limit for each feature.
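The feature codification can be sketched as follows, using the mark-to-level mapping of feature 1 above as an example; the maximum limits are feature-specific constants not listed in the text:

```python
# Sketch of the feature codification: discrete marks coded in levels
# (feature 1 above) and continuous features divided by a maximum limit
# and kept in [0, 1]. The maximum limits are feature-specific constants.

def code_orthographic_mark(mark):
    """Feature 1 coding: four levels ordered by correlation with Ap."""
    levels = {"other": 0.0, ".": 1 / 3, ",": 2 / 3, "none": 1.0}
    return levels.get(mark, 0.0)

def normalise(value, max_limit):
    """Divide by the established maximum limit and clip to [0, 1]."""
    return min(max(value / max_limit, 0.0), 1.0)
```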
The final set of features for each ANN was established by the best performance achieved. For the
Ap ANN the final set is composed of features 1 to 20, even though feature 21 has a non-negligible
correlation with Ap. For the T0a ANN the final set also includes feature 21. For this ANN a set
of only the most correlated features (features 2, 5, 10, 14, 15, 16, 18 and 19) was also tried,
but with worse performance.
The use of almost the same set of features for both ANNs has the advantage that no further
processing is needed to determine other features. Feature 21, the feature used only in the T0a
ANN, is the output of the Ap ANN for the previous PC.
Table 4.12: Linear correlation coefficient obtained in the test set for the predicted Ap and T0a values, relative
to the estimated (labelled) values.
      Ap      T0a
r     0.772   0.649
Fig. 4.24 plots the best linear fit between target and predicted values for Ap and T0a in the
test set, with correlation coefficients of 0.772 and 0.649, respectively. For both parameters,
the predicted values are visibly concentrated in a narrower interval than the target values; that
is, the ANNs impose tighter limits on the minimum and maximum predicted values.
Fig. 4.25 plots the probability of the error in the test set for the predicted Ap and T0a values,
as well as the adjusted normal probability plot for the same data, in red. The figure shows an
error of less than 0.12 for 80% of the predicted Ap values, and less than 0.2 for 90%. The
prediction of T0a has an error of less than 0.2 s for 75% of the cases, and less than 0.3 s for
95%.
The average and standard deviation of the estimated Ap in the test set are 0.356 and 0.187,
respectively; for the predicted Ap they are 0.353 and 0.144. The average and standard deviation
of the estimated T0a in the test set are 0.367 and 0.233, respectively; for the predicted T0a
they are 0.364 and 0.146.
Best linear fits: A = 0.588 T + 0.143 with R = 0.772 for Ap (left), and A = 0.406 T + 0.215 with
R = 0.649 for T0a (right); the line A = T is also plotted.
Fig. 4.24 – Best Linear fit between target (T) and predicted (A) values for Ap (left) and T0a (right).
Fig. 4.25 – Probability error in test set for predicted Ap and T0a. Lines show the adjusted normal probability
distribution with a) µ=0.093, σ=0.075 and b) µ=0.148, σ=0.097.
Fig. 4.26 presents the predicted PCs for one example paragraph. The estimated PCs and phrase
components are plotted in black. The results of the Ap and T0a ANNs can be observed in the green
PCs and phrase components, which were predicted from the initial positions of the estimated PCs.
The results of the complete model can be observed in the magenta PCs and phrase components, where
Ap and T0a were predicted from the inserted PCs. The three plots allow the individual evaluation
of each component of the model: only the prediction of the magnitudes, Ap, and anticipations,
T0a, in green, and the insertion of PCs plus the prediction of the magnitudes and anticipations
in magenta.
Fig. 4.26 corresponds to the paragraph: "Na passada quinta-feira, na RTP1 a jornalista Judite de
Sousa entrevistou o senhor Procurador geral da República. O Senhor Doutor Cunha Rodrigues mostrou
mais uma vez conhecimento profundo das matérias" (Last Thursday, on RTP1, the journalist Judite
de Sousa interviewed the Attorney General of the Republic. Doctor Cunha Rodrigues once again
showed a deep knowledge of the matters).
Fig. 4.26 – Application example of the PC insertion model. PCs and phrase components: black –
estimated; green – initial positions of estimated PCs with predicted Ap and T0a; magenta –
predicted with the PC insertion model.
In the process of estimating the commands, the F0 contour can be fitted with more or less
precision according to the number of ACs used, as can be seen in Fig. 4.11, always without
modelling micro-prosody. There seem to be two levels of fitting the F0 contour: a broader
approximation and a narrower one. In the broader approximation, the ACs, fewer in number, are
associated with accent groups or with the accented syllables of the accent groups. In the
narrower approximation, the ACs, greater in number, can be associated with syllables. The best
approximation possibly depends on the language and on the capacity of the model to accurately
predict the AC parameters needed to produce a natural F0 contour.
As already discussed in 4.3.2, during the process of estimating the parameters of the Fujisaki
model, the connection between ACs and syllables was followed. This approach differs from the ones
used by Mixdorff [2002] or by Eva [2003], which consider one AC per accent group to be sufficient.
The present approach allows a more refined approximation of the F0 contour in the estimation
process, but it does not guarantee a more reliable prediction of the ACs, and the number of AC
parameters to predict is larger. This approach led to the third phase of the parameter
estimation, as documented in 4.3.2.
So, for each syllable with voiced segments, the model has to decide whether it will have one
associated AC or none, and then predict the parameters of the associated AC.
For each AC three parameters must be predicted: the amplitude, Aa, the onset time, T1, and the
offset time, T2. T1 and T2 are determined relative to the syllable's position. Concretely, T1 is
determined as the beginning of the voiced segments of the syllable minus an anticipation (Eq.
(4.7)), and T2 as the end of the voiced segments of the syllable minus an anticipation (Eq.
(4.8)). These anticipations, from now on T1a and T2a, are the timing parameters to be predicted,
once the beginning and end of the voiced sound are known.
where:
• Aa – amplitude of the AC;
• T1a – anticipation of the onset time T1;
• T2a – anticipation of the offset time T2.
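Eqs. (4.7) and (4.8), as described in the text, can be sketched as:

```python
# Sketch of Eqs. (4.7)-(4.8) as described in the text: the AC onset and
# offset are the voiced-segment boundaries minus the anticipations.

def ac_times(voiced_start, voiced_end, t1a, t2a):
    """Returns (T1, T2) for an accent command, given the beginning and
    end of the syllable's voiced segments and the anticipations."""
    t1 = voiced_start - t1a  # Eq. (4.7)
    t2 = voiced_end - t2a    # Eq. (4.8)
    return t1, t2
```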
The next sections discuss the architecture, the training and the optimization of the number of
features for each parameter presented above. Section 4.6.4 presents tables with the best results
in the test set for each parameter.
Table 4.13: Linear correlation coefficient between AC parameters calculated along the labelled database.
      Aa    T1a    T2a    Ca
Aa    1     0.33   0.43   0.61
T1a          1     0.34   0.29
T2a                 1     0.49
Ca                         1
Several architectures, in terms of type of network, structure, number of layers, number of nodes
in each layer, and activating functions, were considered and tested for the four ANNs.
Feed-forward networks trained with back-propagation algorithms were selected as the type of
network for the solution of the problem.
The networks' input layers have the necessary number of nodes to code the features discussed in
the next section. The output node codes the parameter: Ca, Aa, T1a or T2a.
For the ANN dedicated to predicting Ca, from now on the Ca ANN, a perceptron layer with just one
node was tested [Demuth and Beale, 2000], with a hard-limit2 transfer function (a 0/1 function)
in the output, since its output is binary, like the Ca data, but with poor results.
The output of the Ca ANN was tested with and without a normalisation pre-processing step that
converts the output to zero average and a standard deviation equal to 1. According to the results
presented in Table 4.15, it can be concluded that the normalisation is recommended.
Because the activating function of the last layer of the Ca ANN is linear, a threshold, L, must
be used to convert the output into a binary value. Values of L between 0.4 and 0.7 proved to be
good candidates for optimizing the performance of the Ca ANN, but an analysis of thousands of
cases showed that 0.5 was, most frequently, the best value of L. Different alternatives for L are
also presented in Table 4.15.
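The thresholding of the Ca ANN output can be sketched as follows; whether the boundary value itself maps to 1 is an assumption of this sketch:

```python
# Sketch of the Ca decision: the linear network output is compared with
# the threshold L (0.5 was most frequently the best value). Whether the
# boundary value itself maps to 1 is an ASSUMPTION of this sketch.

def ca_decision(linear_output, threshold=0.5):
    """Binary decision of associating an AC with the syllable."""
    return 1 if linear_output >= threshold else 0
```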
For the other three parameters, the output is 85% of the parameter value divided by the maximum
parameter value, normalized to have zero average and a standard deviation equal to 1.
4.6.2 Training
Training was done over the training set, which consists of 6329 syllables (86%), using the test
set, with 1026 syllables (14%), for cross-validation in order to avoid over-fitting. The test set
was built by randomly picking some paragraphs from every text. The test vector was used to stop
training early if further training on the training set would hurt generalization to the test set.
The cost function used for training was the mean squared error between output and target values.
The Ca ANN predicts, for each syllable, whether there will be an associated AC or not, so it is
applied to all syllables. The Aa, T1a and T2a ANNs, on the other hand, predict the parameters
only for the syllables that will have an associated AC. This leads to two alternatives for the
training set: using all syllables, or using only the syllables that have an associated AC,
because the values for the other syllables are zero in the training and test sets and are
therefore irrelevant in the predicted ACs.
The first alternative has the advantage that the ANNs are trained to predict a very low value for
Aa, T1a and T2a in syllables that should not have any associated AC, allowing the model to
recover from an incorrect AC insertion by the Ca ANN.
The second alternative has the advantage of training the ANNs only with the non-null patterns.
Since it is not clear which alternative is preferable, both were used; the results are reported in the fifth column of Table 4.17, Table 4.18 and Table 4.19. For the cases where the second alternative (training only with syllables with an associated AC) was used, marked in the tables with Y (yes), the correlation coefficient (r) was determined using only these syllables of the test set and is also presented in the tables. The last column of each table presents the r values determined over the test set using all syllables, with the predicted values of Aa, T1a and T2a considered null for syllables without an associated AC, as determined by the Ca ANN. The values of r in the last column were used to compare the performance of the ANN alternatives.
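The last-column r can be computed as sketched below (assuming NumPy; predictions are zeroed wherever the Ca network predicted no AC):

```python
import numpy as np

def overall_r(target, predicted, has_ac):
    """r over ALL test syllables: predicted Aa/T1a/T2a values are
    forced to zero wherever the Ca ANN predicted no AC, matching the
    last-column r used to compare the ANN alternatives."""
    pred = np.where(has_ac, predicted, 0.0)
    return float(np.corrcoef(target, pred)[0, 1])

# Toy example: the masked prediction is exactly 0.9x the target.
r = overall_r(np.array([0.0, 0.0, 1.0, 2.0]),
              np.array([0.5, 0.1, 0.9, 1.8]),
              np.array([False, False, True, True]))
```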
2 Hard-limit is a function whose output is zero if the input argument is less than 0, and 1 if the input argument is greater than or equal to 0.
Chapter 4 - Fundamental Frequency
The back-propagation training algorithms described in section 3.3.3 were used. The algorithm trainrp, 'Resilient back-propagation', gave results of lower quality than trainlm, 'Levenberg-Marquardt'. The latter is clearly the best algorithm for networks of this dimension, although its training process is slower.
For each variation of the ANN, concerning architecture, training algorithm, activating functions, set of features and its codification, as well as both training-set alternatives described above, several hundred training sessions were run and the best result was selected as the performance of that variation. Only the best-performing solutions are presented in Table 4.15, Table 4.17, Table 4.18 and Table 4.19.
Fig. 4.27 displays the average performance (r) on the same test set over several training sessions for each parameter, considering different dimensions of the training set. It is visible that for the Ca, Aa and T1a ANNs the performance stabilises after 90% of the training set, so more training patterns should not improve the performance on the test set. For the T2a ANN, however, the performance increased by 0.01 from 90% to 100% of the training set. This leads to the expectation that more training patterns could improve the performance of the T2a ANN, but not by much more than 0.01.
Fig. 4.27 – Evolution of average ANNs performances in the test set, over the dimension of training set.
4.6.3 Features
The sets of features were built taking into account the known and foreseeable dependencies, as well as local contextual information. An optimisation of the composition of the sets and of the ways of coding the features followed.
Table 4.14 presents the list of features used and their linear correlation, r, with each output parameter. The correlation coefficient was used to select the set of features most correlated with the output of the respective ANN. Although some features present a very low correlation with the output parameter, their use as an ensemble within the whole feature set improves the final performance.
A Prosody Model to TTS Systems
Table 4.14: List of features and their correlations, r, with Ca, Aa, T1a, and T2a.
An earlier codification of some features, such as the type of syllable, with one input node per category was experimented with. But, in order to reduce the number of input nodes without loss in performance, those features were re-coded into a single node: each category is now coded by its original correlation value.
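This re-coding can be illustrated as follows (a hypothetical sketch: the dictionary below uses the ordinal ranks reported for the syllable-type feature, whereas the thesis stores each category's actual correlation value):

```python
# Ordinal codes from the correlation-based ordering reported for the
# syllable-type feature (feature 4); illustrative ranks, not the
# original correlation values.
SYLLABLE_TYPE_CODE = {
    "C": 1, "CC": 2, "V": 3, "VC": 4, "VCC": 5,
    "CCVC": 6, "CCV": 7, "CVC": 8, "CV": 9,
}

def encode_syllable_type(syl_type):
    """Collapse the categorical feature into a single input-node value."""
    return SYLLABLE_TYPE_CODE[syl_type]
```

A single monotonically ordered node preserves the ranking information while cutting nine input nodes down to one.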
The variation of the performance in the test set with and without a particular feature was the basis for the final decision on whether to include each of the listed features in the input layer. Each of the listed features is coded in one node of the input layer. An explanation of the features follows:
1. The duration of the syllable. It is significantly correlated with the presence of an AC and with its amplitude, T1a and T2a;
2. Each segment of the syllable is considered voiced or voiceless according to its identity. This feature is the length from the beginning of the first voiced segment to the end of the last voiced segment inside the syllable. It is most strongly correlated with the presence of an AC, because syllables without voiced segments have no associated AC. It is also strongly correlated with Aa and even more with T2a, but is negatively correlated with T1a. This means that the longer the voiced part of the syllable, the later the onset time of the AC and the earlier its offset time;
3. The duration of the vowel or diphthong of the syllable, or zero for syllables where the vowel was suppressed. It is also strongly correlated with the presence of an AC, its amplitude and T2a. In fact, these first three features are significantly correlated among themselves, but do not carry exactly the same information;
4. This feature codes the type of syllable according to its vowel (V) - consonant (C) sequence. In a first phase of the work each type of syllable was coded in one node, but later all types were coded in just one node according to their correlation with the output parameters, whose ordering was identical for all four parameters. The codification, from lower to higher correlation, is: 1-C; 2-CC; 3-V; 4-VC; 5-VCC; 6-CCVC; 7-CCV; 8-CVC; 9-CV. This feature is the one most correlated with the presence of an AC, its amplitude and the onset time anticipation, T1a, and it is also strongly correlated with T2a;
5. In EP every word has one tonic syllable, which can be orthographically marked or determined by the simple set of rules described in chapter 2. Theoretically this syllable should be prominent, although sometimes speakers do not realise it as a stressed or accented syllable. This feature signals whether the syllable is tonic or not. Tonic syllables have a significant correlation with the presence of an AC, its amplitude and T2a;
6. Vowels were divided into five groups according to average length and category. Again, all groups were coded in just one node according to their correlation with the output parameters. The codification, from lower to higher correlation, is: 1-short vowels (u and @); 2-median vowels (i and 6); 3-diphthongs; 4-nasal vowels; 5-long vowels (a, E, e, o and O). The feature is strongly correlated with the presence of an AC, its amplitude and the anticipation of the offset instant, T2a, and moderately correlated with the anticipation of the onset instant, T1a;
7. Distance in seconds from the beginning of the syllable to the end of the sentence. It is slightly correlated with the presence of an AC and its amplitude, meaning fewer and weaker ACs at the end of sentences;
9. Number of ACs since the beginning of the phrase. It is slightly negatively correlated with T1a, meaning an earlier onset time for the first ACs of the phrase;
10. Distance in seconds from the beginning of the PC to the beginning of the syllable. It is slightly correlated with Aa, meaning slightly stronger ACs at the end of phrase components;
11. Number of ACs since the beginning of the PC. It is slightly correlated with Aa and has the same meaning as the previous feature;
12. Distance in seconds from the beginning of the syllable to the next PC. It is slightly negatively correlated with T2a, meaning later offset times for ACs far from the next PC;
13. Signals whether the present syllable belongs to the last word of the paragraph, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time, meaning fewer and weaker ACs in the last word of the paragraph;
14. Signals whether the present syllable is the last one of the paragraph, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, meaning fewer and weaker ACs in the last syllable of the paragraph;
15. Signals whether the present syllable belongs to the last word of the sentence, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time, meaning fewer and weaker ACs in the last word of the sentence;
16. Signals whether the present syllable is the last one of the sentence, coded as yes/no. It is slightly negatively correlated with the presence of an AC and its amplitude, and slightly correlated with a longer anticipation of the offset time, meaning fewer and weaker ACs in the last syllable of the sentence;
17. Position in word – the number of syllables from the beginning of the word. It is slightly negatively correlated with the presence of an AC, its amplitude and T1a, meaning fewer and weaker ACs in the last syllables of words;
18. Position in word – the number of syllables to the end of the word. It is slightly correlated with Aa and T1a, meaning stronger ACs in the first syllables of words;
19. Word length – the total number of syllables in the word. It is slightly negatively correlated with the presence of an AC, meaning that the longer the word, the smaller the number of ACs;
20. Word length – the duration of the word in seconds. It is slightly correlated with T2a, meaning that the longer the word's duration, the earlier the offset time of its ACs;
21. Amplitude of the previous AC. It is slightly correlated with Aa, meaning higher amplitudes for ACs whose previous AC had a higher amplitude;
23. Distance in seconds to the offset instant of the previous AC. The greater the distance to the offset time of the previous AC, the earlier the onset time of the present AC;
24. Distance in seconds to the previous pause. It is slightly negatively correlated with all parameters;
25. Distance in seconds to the next pause. It is slightly correlated with Aa;
26. This feature codes whether the present syllable is the last tonic syllable of an interrogative sentence without an interrogative word. It is coded as yes/no;
27. This feature codes whether the syllable belongs to an interrogative sentence without an interrogative word. It is coded as yes/no. This and the previous feature are intended to code the situation of the last tonic syllable of an interrogative sentence without an interrogative word, which is known to have a rising-falling F0 contour. Features 26 and 27 did not show a relevant correlation with the AC parameters, perhaps because sentences of this type are rare in the database.
Different groups of features were selected as inputs according to their correlation with the output parameter, and tested. The following tables present only the better performing groups.
Table 4.15 presents three sets of features. The set with 6 features uses just the first 6 features (the ones most correlated with the Ca parameter). The set of 25 uses the first 25 features. Finally, the set of 27 features uses all the presented features.
In Table 4.17, the set of 25 features is the first 25, the set of 27 is all the presented features, and the set of 9 features comprises just the ones most correlated with the Aa parameter (feature numbers 1, 2, 3, 4, 5, 6, 14, 16 and 17).
In Table 4.18 and Table 4.19, the set of 25 features is composed of the first 25 presented features.
For Aa, T1a and T2a, two performance parameters are presented. Both are linear correlation coefficients (r) in the test set between target and predicted vectors. The first is presented for the cases where training was done only with syllables with an associated AC (fifth column filled with Y); only the syllables with an AC predicted by the Ca ANN are used. The second r column uses all syllables: the target vectors have exactly the values resulting from the estimation process in all syllables, while the predicted vectors have the values predicted by the corresponding ANN, but with null elements in the syllables for which the Ca ANN predicted no AC.
In the following tables, the column AF lists the activating functions: L stands for the log-sigmoid function, T for the hyperbolic tangent and Lin for the linear function. In the Training algorithm column, RP means the resilient back-propagation algorithm and LM means the Levenberg-Marquardt algorithm.
In the tables for the Aa, T1a and T2a ANNs, some architectures appear with 3 nodes in the output layer. These ANNs predict the three output parameters Aa, T1a and T2a together, but the performance presented is just the one for the respective parameter; the output for the other two parameters is not very good and is discarded.
The first number in the architecture column of the following tables is the number of nodes in the first hidden layer; the last number is the number of output nodes. The input layer has as many nodes as there are features.
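As an illustration of this notation, a minimal forward pass for a '27-10-1' network with L-Lin activating functions might look as follows (a sketch with random weights, not the trained networks):

```python
import numpy as np

def logsig(x):   # 'L' activating function (log-sigmoid)
    return 1.0 / (1.0 + np.exp(-x))

def tansig(x):   # 'T' activating function (hyperbolic tangent)
    return np.tanh(x)

def linear(x):   # 'Lin' activating function
    return x

def forward(x, layers):
    """Forward pass of an MLP given as [(W, b, activation), ...]."""
    for W, b, act in layers:
        x = act(W @ x + b)
    return x

# A '27-10-1' network: 27 feature inputs, a 10-node log-sigmoid
# hidden layer, and one linear output node.
rng = np.random.default_rng(0)
net = [(rng.standard_normal((10, 27)), np.zeros(10), logsig),
       (rng.standard_normal((1, 10)), np.zeros(1), linear)]
y = forward(rng.standard_normal(27), net)
```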
To evaluate the performance of the Ca ANN, four parameters were used: the linear correlation coefficient (r), the accuracy (A, given by Eq. (4.9)), the recall rate (R, given by Eq. (4.10)) and the precision rate (P, given by Eq. (4.11)). The number of correct decisions is the number of times the output matches the target as to having an associated AC or not; the output is taken as 0 or 1 according to whether the ANN output is below or above the threshold L.
R (%) = C / (C + D) × 100%        Eq. (4.10)
P (%) = C / (C + I) × 100%        Eq. (4.11)
where C is the number of correctly inserted ACs, D is the number of deleted (i.e., not inserted) ACs, and I is the number of insertion errors (i.e., incorrectly inserted ACs).
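These measures can be computed as sketched below (a pure-Python illustration of the counts C, D and I and of Eq. (4.10) and (4.11); the accuracy follows the fraction-of-correct-decisions definition above):

```python
def ca_scores(pred, target):
    """Accuracy A, recall R and precision P for the binary AC decision.
    C = correctly inserted ACs, D = deleted (missed) ACs,
    I = incorrectly inserted ACs."""
    C = sum(1 for p, t in zip(pred, target) if p and t)
    D = sum(1 for p, t in zip(pred, target) if not p and t)
    I = sum(1 for p, t in zip(pred, target) if p and not t)
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    A = 100.0 * correct / len(target)
    R = 100.0 * C / (C + D) if C + D else 0.0
    P = 100.0 * C / (C + I) if C + I else 0.0
    return A, R, P
```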
Table 4.15 presents the best ANNs according to the obtained accuracy; r is also presented. It should be noted that the architectures with better accuracy also have better r values. The recall-rate and precision-rate performance parameters for the selected ANN are presented in Table 4.16.
As can be seen in Table 4.15, the accuracy varies very little between the presented architectures (from 88.60% to 89.28%). So which architecture should be selected? A good choice would be the one with the lowest number of weights because, with similar performance, it would be less computationally expensive. However, a very small difference in this parameter is more significant in the final F0 pattern than a similar difference in the other parameters, since this parameter decides whether or not a syllable has an associated AC. Moreover, there is no additional computation in determining the features, since they must be determined for the other ANNs anyway. For these reasons, the selected architecture has 27 nodes in the input layer and 10 in the hidden layer, keeping the lighter architecture in mind in case improvements in computational time should be needed. The accuracy of almost 90% achieved in the prediction of the existence of an AC is very promising for the final predicted F0 contour.
Table 4.15: Best performances (A and r) of the Ca ANN with different architectures, activating functions, training algorithms, sets of features, decision threshold L and output processing.
Architecture | AF | Training Alg. | # Features | L | Output processing | r | A(%)
27-10-1 L-Lin LM 27 0,5 Y 0,654 89,28
25-10-1 L-Lin LM 25 0,5 Y 0,652 89,18
27-10-1 T-Lin LM 27 0,5 Y 0,650 89,18
6-6-1 T-Lin LM 6 0,61 N 0,644 88,89
27-6-1 T-Lin LM 27 0,61 N 0,639 88,89
25-13-1 L-Lin LM 25 0,5 Y * 88,89
25-7-5-1 T-L-Lin LM 25 0,5 Y * 88,89
6-4-1 T-L RP 6 0,61 N 0,642 88,79
6-6-1 L-Lin LM 6 0,61 N 0,641 88,79
6-10-1 T-Lin LM 6 0,61 N 0,641 88,79
6-10-1 L-Lin LM 6 0,61 N 0,640 88,79
27-10-1 L-Lin LM 27 0,61 N 0,639 88,79
6-3-1 L-Lin RP 6 0,61 N 0,638 88,79
6-3-1 L-Lin RP 6 0,5 Y 0,637 88,79
6-6-4-1 T-L-Lin LM 6 0,61 N 0,642 88,69
6-4-1 L-T RP 6 0,61 N 0,639 88,69
6-6-1 L-Lin LM 6 0,5 Y 0,636 88,69
27-10-1 T-Lin LM 27 0,61 N 0,634 88,69
27-6-1 L-Lin LM 27 0,5 Y 0,630 88,69
25-6-4-1 T-L-Lin LM 25 0,5 Y * 88,69
25-10-1 L-T RP 25 0,5 Y * 88,69
25-13-1 L-T RP 25 0,5 Y * 88,69
25-6-1 L-Lin LM 25 0,5 Y * 88,69
25-4-4-4-1 L-T-L-T RP 25 0,5 Y * 88,69
27-6-1 L-Lin LM 27 0,5 Y 0,644 88,60
6-4-1 L-T RP 6 0,5 Y 0,634 88,60
27-6-1 L-Lin LM 27 0,61 N 0,634 88,60
27-3-1 L-Lin RP 27 0,5 Y 0,633 88,60
27-4-1 L-T RP 27 0,61 N 0,632 88,60
27-6-1 T-Lin LM 27 0,5 Y 0,631 88,60
27-3-1 L-Lin RP 27 0,61 N 0,631 88,60
6-3-1 L-Lin RP 6 0,5 Y 0,629 88,60
25-3-10-1 T-L-Lin LM 25 0,5 Y * 88,60
* Not measured value.
Table 4.16: Performance values for the best Ca ANN.
The strong correlation of the first 6 features with Ca shown in Table 4.14 proved to be really important, because no significant improvement was obtained by using more features. On the other hand, no deterioration in performance was observed when the other features were introduced.
Table 4.17 presents the best ANNs to predict Aa, according to the correlation coefficient.
Fig. 4.28 – Best linear fit between target (T) and predicted (A) values for Aa (left), and error probability (|Aatarget − Aapredicted|) in the test set for predicted Aa (right); the red line shows the adjusted normal probability distribution with µ=0.12 and σ=0.12.
The architecture selected to predict the AC amplitudes has 27 nodes in the input layer and 6 in the hidden layer. The correlation of 0.602 is quite good compared with previous similar works for other languages, but is still in a low range. The major errors occur at focus positions, where this information is still lacking. Fig. 4.28 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at zero target value and the horizontally aligned marks at zero predicted value correspond to wrongly inserted ACs and deleted ACs, respectively. The right side of the figure shows that 75% of the estimated Aa values have an error smaller than 0.2 and 95% have an error smaller than 0.35.
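Statements such as "75% of the estimated Aa values have an error smaller than 0.2" can be read as empirical error coverage, sketched below (an assumed NumPy helper, not a routine from the thesis):

```python
import numpy as np

def error_coverage(target, predicted, threshold):
    """Fraction of predictions whose absolute error is below
    `threshold`, the kind of figure read off the error-probability
    plots in Fig. 4.28 to Fig. 4.30."""
    err = np.abs(np.asarray(target) - np.asarray(predicted))
    return float(np.mean(err < threshold))

# Toy example: two of the four errors are below 0.2.
cov = error_coverage([0.0, 0.0, 0.0, 0.0], [0.1, 0.3, 0.05, 0.5], 0.2)
```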
Table 4.18 presents the best ANNs to predict T1a according to the correlation coefficient.
The ANN architecture selected to predict T1a has 25 nodes in the input layer and 10 in the hidden layer. The 25 input nodes receive the first 25 features of Table 4.14. The correlation of 0.743 is very good compared with previous similar works for other languages. Fig. 4.29 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at zero target value and the horizontal line at zero predicted value, hidden by other marks, correspond to wrongly inserted ACs and deleted ACs, respectively. The graphic on the right side of the figure shows that 90% of the T1a values have an error smaller than 50 ms.
Fig. 4.29 – Best linear fit between target (T) and predicted (A) values for T1a (left), and error probability (|T1atarget − T1apredicted|) in the test set for the predicted T1a values (right); the red line shows the adjusted normal probability distribution with µ=0.022 s and σ=0.024 s.
Table 4.19 presents the best ANNs to predict T2a according to the correlation coefficient. The best-performing architecture has 3 outputs, but just the one corresponding to T2a is used.
Fig. 4.30 – Best linear fit between target (T) and predicted (A) values for T2a (left), and error probability in the test set for predicted T2a (right); the red line shows the adjusted normal probability distribution with µ=0.028 s and σ=0.026 s.
The ANN architecture selected to predict T2a has 25 nodes in the input layer and 7 and 5 nodes in the two hidden layers. The 25 input nodes receive the first 25 features of Table 4.14. The correlation of 0.650 is quite good compared with previous similar works for other languages. Fig. 4.30 (left) displays the best linear fit between target and predicted values. The vertically aligned marks at zero target value and the horizontally aligned marks at zero predicted value correspond to wrongly inserted ACs and deleted ACs, respectively. The graphic on the right side of the figure shows that 90% of the T2a values have an error smaller than 60 ms.
The application of the model to a sample paragraph is presented in Fig. 4.31. The figure represents the predicted F0 contour for the utterance corresponding to the text "…and are certainly important to everyone, particularly to those with responsibilities in on-going reformation.". The predicted F0 contour was determined with the set of predicted ACs, and the estimated contour with the set of estimated ACs. The estimated PCs were used in the prediction of the ACs.
Table 4.20: Final performance of the prediction of the model parameters for ACs.
The major difficulties occur in the words "importantes" (important), "particularmente" (particularly), "responsabilidades" (responsibilities) and "reformas" (changes), where the speaker placed focus. This paralinguistic information is not provided at the model input, preventing it from fitting those contours well. Anyhow, it is visible that the F0 movement patterns are approximately well fitted.
Over the entire paragraph, the rmse grows from 3.95 Hz for the estimated F0 contour to 14.3 Hz for the predicted one. This difference may be interpreted as the loss in naturalness introduced by the AC model. A re-synthesised paragraph with a lower rmse than another does not necessarily have better naturalness; the same observation holds for the linear correlation coefficient indicator.
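The rmse between two F0 contours can be computed as sketched below (an assumed NumPy helper; the optional voiced mask is an assumption, since the text does not state how unvoiced frames are handled):

```python
import numpy as np

def rmse(f0_a, f0_b, voiced=None):
    """Root-mean-square error in Hz between two F0 contours sampled
    on the same time grid. The optional boolean `voiced` mask
    restricts the comparison to voiced frames."""
    a, b = np.asarray(f0_a, float), np.asarray(f0_b, float)
    if voiced is not None:
        a, b = a[voiced], b[voiced]
    return float(np.sqrt(np.mean((a - b) ** 2)))
```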
Fig. 4.31 – Result of predicted ACs. In black, the estimated PCs, ACs and the associated F0 contour. In magenta, the predicted ACs, based on estimated PCs, and the
corresponding F0 contour. Vertical lines represent word boundaries.
4.7.1 F0 model
Joining together the PC and AC models described above results in the F0 model, which predicts the F0 contour based on Fujisaki's proposal. The sequence of operation is to predict the PCs first and then the ACs, because the AC model depends on the predicted PCs.
Fig. 4.32 depicts a sample application of the F0 model. The predicted phrase component, in magenta, allows a good fit between the predicted and original F0 contours. The addition of the accent components, once again, generates quite a good fit with the original F0 contour. In this case, the rmse and the correlation coefficient change from 3.95 Hz to 15.6 Hz and from 0.972 to 0.543, respectively. The difference between Fig. 4.31 and Fig. 4.32 is just the prediction of the PCs by the model, so the loss in naturalness between these two figures is due only to the PC model. The rmse goes from 14.3 Hz to 15.6 Hz in this paragraph, a loss in accuracy of just 1.3 Hz attributable to the PC model. It should be mentioned that this loss is not additive in the final quality of the model. Anyhow, the small 1.3 Hz difference in rmse due to the PC model is certainly a good indicator of its quality.
Fig. 4.33 displays the speech signal waveform after modification of the segmental durations with the PSOLA algorithm, together with the respective determined F0 contour, the predicted phrase component plus Fb, and the likewise predicted accent component, along with the respective predicted PCs and ACs. All data (orthographic marks, words, syllables and phones) are presented in synchronism with the speech signal waveform.
Fig. 4.32 – Application of the complete F0 model. In black the estimated PCs, ACs and F0 contour. In magenta the predicted ACs, PCs and F0 contour.
Fig. 4.33 – Application of the complete F0 model over the modified duration with the duration’s model. In magenta the predicted ACs, PCs and F0 contour.
4.8 Conclusion
The current chapter presented a model to predict the F0 contour from text, for European Portuguese, based on the Fujisaki model. The F0 model is subdivided into two sub-models, one to predict the PCs and the other to predict the ACs. The parameters α, β and γ of the Fujisaki model are considered constant and equal to 2.0 s-1, 20 s-1 and 0.9, respectively. The baseline frequency Fb is also considered constant and equal to 75 Hz. These constant values were the ones that allowed the best fit between estimated and original F0 contours.
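With these constants, the F0 contour generation can be sketched using the standard Fujisaki superposition (ln F0 equals the baseline plus phrase commands filtered by the phrase-control response Gp and accent commands by the accent-control response Ga; the example PC and AC values below are arbitrary):

```python
import numpy as np

ALPHA, BETA, GAMMA, FB = 2.0, 20.0, 0.9, 75.0  # constants from the text

def Gp(t):
    """Phrase-control impulse response."""
    return np.where(t >= 0, ALPHA**2 * t * np.exp(-ALPHA * t), 0.0)

def Ga(t):
    """Accent-control step response, ceiling-limited at GAMMA."""
    g = np.where(t >= 0, 1.0 - (1.0 + BETA * t) * np.exp(-BETA * t), 0.0)
    return np.minimum(g, GAMMA)

def f0_contour(t, pcs, acs):
    """ln F0(t) = ln Fb + sum Ap*Gp(t-T0) + sum Aa*(Ga(t-T1)-Ga(t-T2)).
    pcs: list of (Ap, T0); acs: list of (Aa, T1, T2)."""
    ln_f0 = np.full_like(t, np.log(FB))
    for Ap, T0 in pcs:
        ln_f0 += Ap * Gp(t - T0)
    for Aa, T1, T2 in acs:
        ln_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(ln_f0)

# Arbitrary example: one PC at t=0 and one AC between 0.4 s and 0.7 s.
t = np.linspace(0.0, 2.0, 201)
f0 = f0_contour(t, pcs=[(0.5, 0.0)], acs=[(0.3, 0.4, 0.7)])
```

Because both component responses are non-negative (and T1 < T2), the contour never falls below Fb in this sketch.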
The database was parameterised with the Fujisaki model using a purpose-built tool to manually insert and correct labelled PCs and ACs. The F0 contour estimated by this process has an average rmse, relative to the determined F0 contours over the whole database, of 3.97 Hz. The speech signal re-synthesised with the estimated contour is difficult to distinguish perceptually from the original one.
The PC model operates in two steps. In the first step it inserts PCs associated with the beginnings of accent groups, based on orthographic marks and weighted candidates. The second step determines the exact position T0, by predicting an anticipation time T0a of the PC's time position, and its amplitude Ap, by means of two specific ANNs.
The locations of the inserted PCs seem to be consistent with the text and with the labelled PCs. The best linear correlation coefficients for the prediction of Ap and T0 are 0.772 and 0.646, respectively. These values are quite good compared with the ones presented by Mixdorff [2002] for his Integrated German Model (IGM), 0.73 and 0.53 respectively.
The AC model allows one AC to be assigned to each syllable. For each syllable, a first ANN decides whether there will be an associated AC or not; this ANN achieves an accuracy of 89.3%. For syllables with an associated AC, the amplitude, Aa, onset time, T1, and offset time, T2, of the AC have to be predicted. T1 and T2 are finally determined by subtracting anticipation times, T1a and T2a, from the beginning and end of the voiced part of the syllable, respectively. One ANN was developed for each parameter, giving final linear correlation coefficients of 0.602, 0.743 and 0.650 for the amplitude, the anticipation of T1 and the anticipation of T2, respectively. Again, these values are quite good compared with those of the IGM: 0.40, 0.61 and 0.63, respectively.
A value of β=30 s-1 was also tried and gave a better fit between predicted and original F0 contours; however, the re-synthesised speech does not sound very natural in most of the utterances. With β=20 s-1 this problem seems to be reduced.
The F0 contour produced with the predicted parameters approximately follows the measured F0. The major differences come from the difficulty in emphasising the "focus" word, due to the absence of this information in the training phase of the model. The final speech signal, produced by re-synthesis with the predicted F0 contour, is not completely natural yet, but is considered acceptable.
It is fair to mention that the model uses only some of the available linguistic information; for instance, syntactic information has not been used. Moreover, paralinguistic information is not extracted by the model, and several times the speaker produces a larger F0 movement that can be explained by this kind of information, which the model cannot follow.
A perceptual test is necessary to really evaluate the perceived naturalness at each phase of the entire model. The next chapter describes this test.
5 Perceptual Tests
This chapter presents two perceptual tests evaluating the duration models and F0 models developed in the previous chapters. The category-judgment method and the Mean Opinion Score (MOS) scale were followed to evaluate the perceived distance between the proposed models and the original stimuli. A comparison between the two proposed models to predict segmental durations is also described. The loss in naturalness along some components of the F0 model is measured, in order to evaluate each component of the model and identify which parts should be improved. A comparison between the objective measurements, r and rmse, and the subjective measure of perceived naturalness, MOS, is presented.
5.1 Introduction
The objective performances presented for each part of the developed models are not, by themselves, enough to understand the acceptability of the models. Can perceptual tests show how acceptable an objective performance is?
Two perceptual tests were made and are described in this chapter. The first considers only the duration models described in chapter 3. The second considers the F0 model, its sub-models, and duration-plus-F0 models, using the best duration model selected by the first perceptual test. The need to choose between the model and the alternative model for predicting segmental durations in the duration-plus-F0 stimuli was the main reason to perform the subjective evaluation as two perceptual tests instead of just one.
Both tests used five paragraphs of the test set, not used in training. Several stimuli made by copy-synthesis of the original paragraphs were prepared to be presented to listeners. Copy-synthesis stimuli with predicted segmental durations and/or a predicted F0 contour were prepared in the time domain using a TD-PSOLA algorithm [Moulines and Charpentier, 1990], [Moulines and Laroche, 1995] in the PRAAT software [Boersma and Weenink]. The pauses were kept the same as in the original stimulus.
The methodology described in [Standard Publication No. 297, IEEE, 1969] for category-judgment tests was generally followed, using the MOS scale.
Almost every listener was a college professor, with ages ranging from 24 to 35. Some of them are involved in speech synthesis.
The perceptual tests were presented to groups of one to five listeners at a time. The tests were performed in an office room with a low level of environmental noise. Stimuli were presented on a computer at the sound volume requested by the listeners.
Section 5.2 describes the perceptual test of stimuli with modified segmental durations, and section 5.3 the perceptual test of stimuli with modified F0 and with duration-plus-F0 modifications, according to the selected duration model and F0 model. At the end of each section, the correlation between the measured rmse and r and the perceived naturalness, MOS, is studied.
A total of five stimuli per paragraph were presented in random order to listeners in a blind test: listeners did not know whether they were hearing the original or a manipulated version. Listeners were informed about the type of modifications introduced in the original sound and asked to concentrate on timing acceptability. They could hear the stimuli as many times as they wanted and were asked to classify each stimulus on a scale from 1 to 5 (1 - Unsatisfactory, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent).
The other stimuli presented were produced with durations predicted by the model and by the alternative model presented in chapter 3.
Table 5.1 and Table 5.2 characterise the text of the five paragraphs used in the perceptual test, and the distance, measured by the correlation coefficient and rmse, between the original and the "alternative model", "model" and "No model" stimuli. The first paragraph is a short title, about two seconds long, while the others are paragraphs lasting between 10 and 13 seconds. Paragraphs 1 and 5 contain interrogative sentences, while the others contain only declarative sentences.
Stimuli of the alternative model have a correlation coefficient varying between 0.817 and 0.866, and an rmse between 17.7 and 23.7 ms. Stimuli of the model have a correlation coefficient between 0.790 and 0.882 and an rmse between 18.9 and 22.1 ms. Very close stimuli were produced by “No model”, with a correlation coefficient between 0.690 and 0.800 and an rmse between 21.2 and 28.0 ms. The presented values of correlation coefficient and rmse show that there is no congruence between these indicators along paragraphs. For instance, the alternative model has the best correlation coefficient for paragraph one and the worst rmse for the same paragraph. Which one better reflects naturalness? Perhaps the perceptual test can give a hint.
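The two proximity measures used throughout this chapter can be computed directly from the aligned duration sequences. A minimal sketch (the duration values below are illustrative, not taken from the test stimuli):

```python
import numpy as np

def duration_distance(original, predicted):
    """Correlation coefficient r and rmse between two aligned
    sequences of segmental durations (in milliseconds)."""
    original = np.asarray(original, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(original, predicted)[0, 1]
    rmse = np.sqrt(np.mean((original - predicted) ** 2))
    return r, rmse

# Illustrative durations (ms) for a short sequence of segments
orig = [80, 120, 65, 150, 90]
pred = [75, 130, 70, 140, 95]
r, rmse = duration_distance(orig, pred)
```

As the text notes, the two indicators need not agree: r is insensitive to a uniform offset or scaling of the predictions, while rmse penalises any absolute deviation.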
Twenty subjects participated in the test, 7 female and 13 male. The listeners were divided into two groups. The first group was composed of 8 listeners, all experienced in speech-related issues; the second comprised the remaining 12 listeners, who had no experience in the subject. The evaluation made by the experienced listeners was no different from the others’, therefore results are displayed jointly for the whole set of listeners.
A Prosody Model to TTS Systems
Table 5.1: Portuguese and respective translation of the 5 paragraphs used in the perceptual test, and respective number of segments.

Parag. 1 (36 phones)
PT: Que igualdade perante a lei? João Amaral.
EN: How equal in face of the law? João Amaral.

Parag. 2 (164 phones)
PT: As suas opiniões sobre a situação da justiça revelam muita reflexão e são certamente importantes para todos, particularmente para os que têm responsabilidades nas reformas a fazer.
EN: His opinions regarding the justice system reveal a lot of reflection and are certainly important to everyone, particularly to those with responsibilities in on-going reformation.

Parag. 3 (177 phones)
PT: Evidentemente que quem exerce um cargo tão sensível há cerca de quinze anos está sujeito a um desgaste natural. Mais ainda, quando a justiça está muito longe de satisfazer as aspirações e interesses dos cidadãos.
EN: It is obvious that someone with such high sensitive functions since fifteen years ago is exposed to natural strain. Moreover, when justice is far from satisfying the ambitions and interests of citizens.

Parag. 4 (209 phones)
PT: Há os processos contra gente importante que nunca mais terminam. Há a situação de quem é pobre, e que está objectivamente em situação de inferioridade quando tem de enfrentar na justiça os mais ricos e poderosos, que podem pagar advogados de luxo.
EN: There are lawsuits against important people, which are never-ending. There are poor people, clearly inferior when they have to face court against the richer and powerful, who can afford luxury lawyers.

Parag. 5 (204 phones)
PT: Mas, que igualdade perante a lei? Que igualdade, quando para muitos a justiça é praticamente inacessível? Como podem esses reclamar o cumprimento da lei, sem dinheiro para pagar a bons advogados e os elevados custos de um processo?
EN: But, how equal facing the law? How equal, if many still have almost no access to justice? How can they demand law enforcement, if they cannot afford good lawyers and the high costs of a lawsuit?
Table 5.2: Correlation coefficient, r, and rmse between original and the other three stimuli in each paragraph.
Fig. 5.1 – Average opinion values of each subject for the 5 stimuli.
Fig. 5.2 – Average opinion values of each paragraph for the 5 stimuli.
Fig. 5.1 shows the average opinion values of each listener for the 5 stimuli. Each bar is the average of five opinions. Original1 and original2 present the average opinion for the first and second original stimuli, respectively. They are treated separately to perceive the variation in the evaluation made by each subject. The original stimuli were globally the favourite, except in some cases where the segmental durations predicted by the model or by the alternative model were preferred over the original ones. The “No model” stimuli never came close to the opinion level of the durations imposed by the models.
The alternative model was classified better than the model by 12 subjects, and the model was classified better than the alternative model by 5 subjects. This denotes an evident preference for the alternative model.
Fig. 5.2 shows the average opinion values of each paragraph for the 5 stimuli. Each bar is the average of 20 opinions. Again, the opinion values for the duration models are very close to those of the original stimuli. The alternative model was even preferred in the first, second and fifth paragraphs. Also, again, the “No model” stimuli are far from both models.
The alternative model was classified better than the model in 3 paragraphs, and the model was classified better than the alternative model in the other 2 paragraphs. The preference for the alternative model is confirmed. No differences exist in the scores achieved by both models in the paragraphs with interrogative sentences (paragraphs 1 and 5), although the “No model” stimuli have higher scores in those paragraphs.
Table 5.3: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

       Original 1   Original 2   Alt. Model   Model   No model
MOS    4.13         4.27         3.93         3.78    2.88
Std    0.92         0.81         0.96         0.91    1.10
Fig. 5.3 – Analysis of opinion scores by type of stimulus (Original1, Original2, Alt. model, Model, No model).
Table 5.3 displays the MOS and respective standard deviation. Each MOS value is the average of 100 opinions. The analysis of variance for all types of stimuli resulted in a significance level of 100% (p < 10⁻¹² for F = 33.5), showing an evident dependency of the results on each type of stimulus.
Fig. 5.3 illustrates the analysis of the subjects’ opinions for each type of stimulus over 100 opinions. Mean Opinion Scores are represented by a thick black line. Blue boxes represent the lower and upper quartiles. Red lines represent the median score. Minimum and maximum values are presented with the thin black lines. Red plus signs represent the outliers. The figure evidences the equality of original1 and original2. The alternative model is close to the original and a little better than the model. Finally, although “No model” presents a reasonably good score, it is still far from the model and even farther from the alternative model.
Table 5.4 displays the significance level between each pair of stimuli given by analysis of variance. The first line of each cell presents the significance level, p, and the second line the respective confidence level given by Eq. (5.1). Orange background cells signal a low confidence level, meaning strong evidence to accept the hypothesis that these stimuli are the same. In opposition, the other cells present a sufficient level of confidence to reject the hypothesis that the levels are the same. In conclusion, the MOS differences between original1 and original2 are not significant, nor are those between original1 and the alternative model and between the alternative model and the model. All other MOS stimuli pairs are significantly different.
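The pairwise analysis of variance above can be reproduced with standard tools. A minimal sketch: Eq. (5.1) is not reproduced in this section, but the p/confidence pairs reported later (e.g. 0.1334 → 86%) are consistent with confidence = (1 − p)·100%, which is assumed here; the opinion scores below are illustrative, not the actual test data (each real column held 100 opinions).

```python
from itertools import combinations
from scipy.stats import f_oneway

# Illustrative opinion scores (1-5) for three types of stimuli
scores = {
    "original":  [5, 4, 4, 5, 4, 3, 5, 4],
    "alt_model": [4, 4, 3, 5, 4, 4, 4, 3],
    "no_model":  [3, 2, 3, 2, 3, 2, 2, 3],
}

p_values = {}
for a, b in combinations(scores, 2):
    _, p = f_oneway(scores[a], scores[b])  # one-way ANOVA on the pair
    p_values[(a, b)] = p
    confidence = (1.0 - p) * 100.0  # assumed form of Eq. (5.1)
    print(f"{a} vs {b}: p = {p:.4f}, confidence = {confidence:.0f}%")
```

With only two groups, the one-way ANOVA is equivalent to a two-sample t-test, so either would serve for the pairwise comparisons.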
5.2.1 Discussion
Original1 and original2 stimuli proved to be very well classified, within the levels of Good and Excellent, and with no significant difference between them.
The test confirmed a slight preference for the alternative model over the model. For some subjects, and in some paragraphs, the alternative model was even preferred over the original stimuli. In general the alternative model is very close to the original, with an average (original1 and original2) MOS distance of 0.27. The closeness is confirmed by the very low confidence level between the alternative model and original1. This result evidences the improvement achieved by the usage of dedicated ANNs.
The model also has a MOS close to the original (average distance of 0.42). It is still 0.15 points distant from the alternative model, but the analyses of variance show a very low confidence level between them. For some subjects the model was preferred over the original stimuli and the alternative model. In two paragraphs the model was preferred over the alternative model.
“No model” stands at a MOS distance of 0.9 and 1.05 from the model and the alternative model, respectively. Although “No model” was never preferred over the other stimuli by any subject or in any paragraph, it still stands at the Fair level (2.88).
Both proposed models stand at the Good level of acceptability, with MOS of 3.93 and 3.78. In spite of the low confidence level between the alternative model and the model, the alternative model is selected for further developments concerning F0 modelling, because of its slightly better scores in the perceptual test and in the objective measurements. This selected model, the alternative model, is also preferred by more subjects than the model.
A discussion follows about the correlation between the proximity measurements, given by correlation coefficient and rmse, and the perceived naturalness measurement, given by MOS.
Table 5.5 presents the objective distance of the segmental durations between original and modi-
fied stimuli, and the subjective evaluation by means of MOS, for the five paragraphs.
Table 5.5: Objective distances (r and rmse) of the segmental durations between original and modified stimuli, and the respective MOS, for the five paragraphs.

              Parag.     1      2      3      4      5
Alt. Model    r          0.866  0.817  0.842  0.870  0.865
              rmse (ms)  23.7   19.1   19.5   17.7   18.3
              MOS        4.2    4.2    3.9    3.5    4.0
Model         r          0.882  0.814  0.790  0.850  0.844
              rmse (ms)  19.0   19.2   22.1   18.9   19.5
              MOS        3.8    3.6    4.0    3.8    3.9
No model      r          0.733  0.69   0.751  0.800  0.766
              rmse (ms)  28.0   24.1   23.8   21.2   23.5
              MOS        3.3    2.5    2.7    2.8    3.2
Since the scales and meanings of the measurement indicators are different, some scaling was applied to rmse and MOS in order to represent them on a scale similar to that of the correlation coefficient. The modified rmse (mrmse) is determined by Eq. (5.2), and the modified MOS (mMOS) is determined by Eq. (5.3). The modification aims to represent all measurement indicators on an increasing scale with a maximum equal to one. Fig. 5.4, Fig. 5.5 and Fig. 5.6 display the measurements r, mrmse and mMOS along paragraphs for the alternative model, the model and “No model”.

mrmse = (30 − rmse) / 15        Eq. (5.2)
mMOS = MOS / 5        Eq. (5.3)
Fig. 5.4 – Measurements r, mrmse and mMOS along paragraphs for the alternative model:

Paragraph   1     2     3     4     5
r           0.87  0.82  0.84  0.87  0.87
mrmse       0.42  0.73  0.70  0.82  0.78
mMOS        0.84  0.84  0.78  0.70  0.80
Fig. 5.5 – Measurements r, mrmse and mMOS along paragraphs for the model:

Paragraph   1     2     3     4     5
r           0.88  0.81  0.79  0.85  0.84
mrmse       0.73  0.72  0.53  0.74  0.70
mMOS        0.76  0.72  0.80  0.76  0.78
Fig. 5.6 – Measurements r, mrmse and mMOS along paragraphs for “No model”:

Paragraph   1     2     3     4     5
r           0.73  0.69  0.75  0.80  0.77
mrmse       0.13  0.39  0.41  0.59  0.43
mMOS        0.66  0.50  0.54  0.56  0.64
The correlation between the measurement indicators is presented in Table 5.6. No significant correlation seems to exist between the subjective and objective measurements. In the case of rmse, the correlation is even significantly positive¹, suggesting that the indication of rmse varies in the opposite direction of MOS. The most significant correlation is found between rmse and r, but, anyhow, at a low level of -0.38.
In conclusion, no correlation seems to exist along paragraphs, for the same model, between the objective and subjective measurements, and only a low correlation exists between the objective measurements.
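The per-paragraph correlation analysis for one model can be sketched as follows, using the alternative-model values from Table 5.5 and the rescalings of Eq. (5.2) and Eq. (5.3):

```python
import numpy as np

# Alternative-model values per paragraph, from Table 5.5
r_vals = np.array([0.866, 0.817, 0.842, 0.870, 0.865])
rmse   = np.array([23.7, 19.1, 19.5, 17.7, 18.3])   # ms
mos    = np.array([4.2, 4.2, 3.9, 3.5, 4.0])

# Eq. (5.2) and Eq. (5.3): increasing scales with maximum 1
mrmse = (30.0 - rmse) / 15.0
mmos  = mos / 5.0

# Pairwise correlations between the three indicators along paragraphs
corr = np.corrcoef(np.vstack([r_vals, mrmse, mmos]))
```

The resulting 3×3 matrix holds the along-paragraph correlations between r, mrmse and mMOS for this single model; the analysis summarised in Table 5.6 combines such comparisons.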
And what about the correlation between the subjective and objective measurements along models?
Table 5.7 presents the mean values along paragraphs of the evaluation measurements r, rmse / mrmse and MOS / mMOS for the alternative model, the model and “No model”. Table 5.8 presents a very strong correlation between each pair of measurements. Therefore, a very strong correlation exists between the objective and subjective measurement indicators when evaluating a model. In addition, the measurement correlation coefficient, r (0.999), seems to be even more correlated with MOS than rmse is (0.994). A strong correlation (0.992) also exists between the objective measurements, correlation coefficient and rmse.

¹ Since naturalness is measured by MOS on an ascending scale and by rmse on a descending scale, similar indications by both measurements should be denoted by a negative correlation between them.
The objective of the second perceptual test is to measure the loss of naturalness introduced by each component of the prosody model and to evaluate the quality, concerning naturalness, of the F0 model and of the complete prosody model (durations + F0).
The standard category-judgment test was followed. This test prescribes the presentation of references for the scale, concretely the references for excellent and unsatisfactory. The reference of excellence was the original recorded sound. An unsatisfactory reference for an F0 contour is very ambiguous, so it was decided to produce a flat F0 with the average F0 value (103 Hz) to be used as the unsatisfactory reference. Apart from the reference stimuli presented to subjects at the beginning of the evaluation of each paragraph, two stimuli to be evaluated were produced as copies of the references. The original was named “1- Natural” and the one with the constant F0 value was named “0- No model”.
Besides these two reference stimuli, seven more stimuli were presented to the 19 subjects, in a total of 9 stimuli for each of the 5 paragraphs. The other 7 stimuli correspond to:
4. Predicted ACs based on estimated ACs and PCs – Modified F0 contour imposed by predicted ACs. The ANN features were determined based on estimated PCs and ACs. The AC parameters in each syllable are predicted by the ANN using features of estimated ACs and PCs instead of previously predicted ACs and PCs. Any badly predicted accent component is due to the prediction process itself and not to bad previous predictions. The errors of previous bad predictions do not propagate;
5. Predicted ACs with estimated PCs – Modified F0 contour imposed by predicted ACs. In this case, estimated PCs are used. Features concerning previous ACs are determined in each syllable. In opposition to the previous stimuli, a possible bad AC prediction can influence the prediction of the present AC, through the features concerning previous ACs. But no prediction of PCs is used;
6. F0 Model – Modified F0 contour imposed by predicted PCs and ACs. The F0 contour is totally predicted from the text. The complete model presented in chapter 4 is applied;
Table 5.9: Portuguese and respective translation of the 5 paragraphs used in the perceptual test.
Table 5.10: Objective measurements of each stimulus by paragraph. For each stimulus type the first line presents the correlation coefficient and the second line the rmse.

                                 Parag. 1   2      3      4      5
Durations             r          0.869      0.882  0.830  0.801  0.892
                      rmse (ms)  13.8       26.0   18.1   21.2   14.2
Estimated F0          r          0.953      0.964  0.979  0.969  0.971
                      rmse (Hz)  5.2        5.2    3.8    3.0    3.7
Pred. ACs based on    r          0.267      0.693  0.734  0.770  0.585
est. ACs and PCs      rmse (Hz)  19.3       14.7   12.9   11.9   13.2
Pred. ACs with        r          0.306      0.554  0.621  0.756  0.515
est. PCs              rmse (Hz)  21.8       16.2   15.2   12.9   15.2
F0 Model              r          0.530      0.528  0.293  0.647  0.433
                      rmse (Hz)  16.1       16.8   19.1   11.3   17.4
Durations + F0 Model  r          0.482      0.605  0.377  0.627  0.461
with 0.75*Aa          rmse (Hz)  17.9       16.8   18.4   11.3   17.4
Durations + F0 Model  r          0.503      0.639  0.380  0.594  0.481
                      rmse (Hz)  19.0       16.9   19.2   11.7   14.6
No model              r          -          -      -      -      -
                      rmse (Hz)  22.4       23.7   20.1   13.3   15.9
Listeners were informed about the type of modifications introduced in the original sound and asked to concentrate on intonation acceptability. For each paragraph the references of excellent and unsatisfactory were presented, and then all the paragraph stimuli were presented in random order to give a first impression of them. Then the nine stimuli were presented in the same order to listeners in a blind test, without them knowing whether they were listening to the original or to a manipulated version. Subjects were asked to classify each stimulus on a scale from 1 to 5 (1- Unsatisfactory, 2- Poor, 3- Fair, 4- Good, 5- Excellent). The test sheet used in the test is part of the appendix.
Nineteen subjects participated in the perceptual test, 7 female and 12 male. Seven subjects had already participated in the previous perceptual test.
Table 5.9 presents the text paragraphs used in the perceptual test. These paragraphs, taken from several pieces of news, belong to the test set of the database, not used in training. Mainly declarative sentences were used. The third paragraph starts with an interrogative, and the fifth paragraph has one citation.
Table 5.10 presents the measured correlation coefficient and rmse (compared to the original) for each stimulus in each paragraph. In the case of the “No model” stimuli, the correlation coefficients could not be determined due to the constant value of F0. The F0 Model produced a correlation coefficient varying from 0.29 to 0.65 and an rmse varying from 11 to 19 Hz. The complete model produced a correlation coefficient varying from 0.38 to 0.64 and an rmse varying from 12 to 19 Hz. The complete model with 0.75*Aa has a correlation coefficient from 0.38 to 0.63 and an rmse from 11 to 18 Hz.
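For F0 contours the same r and rmse measures apply, but frames must be comparable. A minimal sketch, under the assumption (not stated in the text) that only frames voiced in both contours are compared, with zeros marking unvoiced frames:

```python
import numpy as np

def f0_distance(f0_ref, f0_test):
    """r and rmse (Hz) between two F0 contours, comparing only
    frames where both contours are voiced (F0 > 0)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_test = np.asarray(f0_test, dtype=float)
    voiced = (f0_ref > 0) & (f0_test > 0)
    ref, test = f0_ref[voiced], f0_test[voiced]
    r = np.corrcoef(ref, test)[0, 1]
    rmse = np.sqrt(np.mean((ref - test) ** 2))
    return r, rmse

# Illustrative contours (Hz); zeros mark unvoiced frames
ref  = [0, 110, 118, 125, 120, 0, 105, 98]
test = [0, 112, 115, 130, 118, 0, 100, 96]
r, rmse = f0_distance(ref, test)
```

Note that r is undefined for a constant contour (zero variance), which is why the table reports no correlation coefficient for the flat-F0 “No model” stimuli.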
Fig. 5.7 shows the average opinion values of each listener for the 9 stimuli. Each bar is the aver-
age of five opinions.
Fig. 5.7 – Average opinion values for each subject in the 9 stimuli.
Fig. 5.8 – Average opinion values for each paragraph in the 9 stimuli.
The duration model stimuli confirm their closeness to the original stimuli. In this test they were evaluated as equal to the original by 5 subjects. The stimuli produced with the estimated F0 contour were evaluated better than the original by 3 subjects, and for another subject they were at the same level. The three types of stimuli with predicted F0 contour were at a similar level, each preferred by different subjects. For all subjects these types of stimuli remain at a lower level than the original, estimated or duration model stimuli. The complete model without reduction of Aa was preferred over the complete model with reduction of Aa by 8 subjects, the opposite preference was shown by 6 subjects, and they were evaluated equally by 5 subjects. These stimuli remain at a lower level than the original, estimated, and duration model stimuli. The application of the F0 model over the duration model was preferred over just the F0 model by 3 subjects, but the modification of durations imposes a general slight decrease in naturalness. The “No model” stimuli remain at a much lower level than any other stimuli.
Fig. 5.8 shows the average opinion values for the 9 stimuli in each paragraph. Each bar is the average of 19 opinions. Again, the stimuli produced with the duration model and estimated F0 get opinions very close to the original stimuli, being even preferred in the third paragraph. The stimuli with the three levels of predicted F0 (stimuli 4, 5 and 6) remain at a lower opinion level than the stimuli produced with estimated F0. No significant difference seems to exist between the three levels of predicted F0. The stimuli produced with the duration and F0 models, with or without Aa reduction, remain at a slightly lower level than the ones produced with predicted F0. The model with Aa reduced is slightly preferred in 3 paragraphs while the model without reduction is preferred in the other two paragraphs. Stimuli produced with “No model” remain at a very low level compared with the others.
Table 5.11 displays the MOS and respective standard deviation for the 9 types of stimuli. Each MOS value is the average of 95 opinions. The analysis of variance for all types of stimuli resulted in a significance level of 100% (p < 10⁻¹² for F = 214), showing an evident dependency of the results on each type of stimulus.
Table 5.11: Mean Opinion Score (MOS) and standard deviation of the perceptual test.

Stimulus                                    MOS    Std
1- Natural                                  4.61   0.57
2- Durations                                4.20   0.76
3- Estimated F0                             4.38   0.64
4- Pred. ACs based on est. ACs and PCs      3.31   0.74
5- Predicted ACs with estimated PCs         3.14   0.80
6- F0 Model                                 3.09   0.73
7- Durations + F0 Model with 0.75*Aa        2.83   0.73
8- Durations + F0 Model                     2.87   0.76
0- No model                                 1.24   0.46
Fig. 5.9 illustrates the analysis of the subjects’ opinions for each type of stimulus over 95 opinions. Mean Opinion Scores are represented by a thick black line. Blue boxes represent the lower and upper quartiles. Red lines represent the median score. Minimum and maximum values are presented with the thin black lines. Red plus signs represent outliers.
Fig. 5.9 – Analysis of opinion scores by stimulus. Stimuli from 0 to 8 correspond to: 0 – No model; 1 – Natural; 2 – Durations; 3 – Estimated F0; 4 – Predicted ACs based on estimated ACs and PCs; 5 – Predicted ACs with estimated PCs; 6 – F0 Model; 7 – Durations + F0 model with 0.75*Aa; 8 – Durations + F0 model.
Table 5.12: Significance level, p, and respective confidence level (in parentheses) between pairs of stimuli. Stimuli from 0 to 8 have the same correspondence as the ones in Fig. 5.9.

0 vs. each of 1 to 8: p = 0
1 vs. 2: 0.0095 (99%);  1 vs. 3: <0.001;  1 vs. 4 to 8: 0
2 vs. 3: 0.0879 (91%);  2 vs. 4 to 8: 0
3 vs. 4 to 8: 0
4 vs. 5: 0.1334 (86%);  4 vs. 6: 0.04 (96%);  4 vs. 7: <0.001;  4 vs. 8: <0.001
5 vs. 6: 0.6355 (36%);  5 vs. 7: 0.0051 (99%);  5 vs. 8: 0.0157 (98%)
6 vs. 7: 0.0145 (98%);  6 vs. 8: 0.0409 (96%)
7 vs. 8: 0.7266 (27%)
The original stimuli and the ones produced by the duration model and estimated F0 have 75% of their opinions at 4 or above, and a minimum of 3. The original has a MOS of 4.6, estimated F0 a MOS of 4.4 and the duration model a MOS of 4.2. The stimuli produced with F0 contours predicted by different blocks of the F0 model (stimuli 4, 5 and 6) have almost all their opinions between 3 and 4. The F0 model even has more than half of its opinions at level 3. Their MOS are 3.3, 3.1 and 3.1, respectively, for the model with predicted ACs using estimated ACs and PCs, the model with predicted ACs using estimated PCs, and the F0 model. The stimuli produced with the F0 model over the duration model have ¾ of their opinions between 2 and 3. Their MOS are 2.9 and 2.8, respectively, for Aa without reduction and with reduction. “No model” has almost all opinions at level 1.
Table 5.12 displays the significance level between each pair of types of stimuli given by analysis of variance. Each cell presents the significance level, p, and the respective confidence level given by Eq. (5.1). Low confidence levels mean strong evidence to accept the hypothesis that the stimuli are the same; the other cells present a sufficient level of confidence to reject the hypothesis that the levels are the same. In conclusion, the MOS differences between stimuli 4 (Predicted ACs based on estimated ACs and PCs) and 5 (Predicted ACs with estimated PCs) are not significant, and the MOS differences between stimuli 5 and 6 (F0 Model) and between stimuli 7 (Durations + F0 model with 0.75*Aa) and 8 (Durations + F0 model) are not significant at all. All other MOS stimuli pairs are significantly different.
5.3.1 Discussion
Natural stimuli were very well classified within the level of Excellent.
This second perceptual test confirmed the Good acceptability of the duration model (the alternative model). The stimuli produced with durations modified according to this model achieved a MOS of 4.2, very close to the natural stimuli (4.6). The distance to the natural stimuli (0.41) was similar to the one in the first test (0.27).
The MOS of estimated F0 (4.4) is at the level of a Good acceptability of naturalness. The distance to the MOS of the natural stimuli was very low (0.23), proving the closeness between the original and the re-synthesis with estimated F0, as was pointed out by the preliminary tests. If there is a strong correlation between rmse and MOS, as will be shown below, this result can be extended to the complete database, since the mean rmse of the paragraphs in the perceptual test (4.18 Hz) is at the same level as the rmse in the complete database (3.97 Hz).
This perceptual test proves that the F0 contour of European Portuguese can be modelled by Fujisaki's model with high closeness to the original intonation.
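Fujisaki's model superimposes phrase components (PCs) and accent components (ACs) on a base frequency in the logarithmic F0 domain. A minimal sketch of contour generation follows; the command values and the constants alpha, beta and the ceiling gamma are illustrative defaults, not parameters estimated from the FEUP-IPB corpus:

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t)."""
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def fujisaki_f0(t, fb, phrases, accents):
    """ln F0(t) = ln Fb + sum of PCs + sum of ACs.
    phrases: list of (Ap, T0); accents: list of (Aa, T1, T2)."""
    lnf0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrases:
        lnf0 += ap * phrase_component(t - t0)
    for aa, t1, t2 in accents:
        lnf0 += aa * (accent_component(t - t1) - accent_component(t - t2))
    return np.exp(lnf0)

# Illustrative contour: base frequency 103 Hz (the average F0 used
# above), one phrase command and two accent commands
t = np.linspace(0.0, 2.0, 201)
f0 = fujisaki_f0(t, fb=103.0,
                 phrases=[(0.5, 0.0)],
                 accents=[(0.3, 0.3, 0.6), (0.25, 1.0, 1.4)])
```

Each AC contributes between its onset time T1 and offset time T2 with amplitude Aa, which is why the ANNs above predict exactly those three parameters per syllable, plus the existence of the command itself.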
The MOS for stimuli 4 (Predicted ACs based on estimated ACs and PCs) was 3.3. These stimuli were produced with the same phrase components as the ones in estimated F0; just the accent component is predicted. Concretely, these stimuli evaluate the result of the four ANNs that predict whether an AC is associated with each syllable or not, and its respective onset time, offset time and amplitude. No interference of previous bad predictions exists, since the features related to previous ACs are based on estimated ACs. The degradation in perceived naturalness introduced by this prediction is the difference between the MOS of estimated F0 (4.38) and the present MOS (3.31). A degradation of more than one MOS point is significant. This step is the one that introduces the most perceived degradation in naturalness, deserving further developments to improve the prediction of ACs.
Stimuli 5 (Predicted ACs with estimated PCs) differ from the previous stimuli just in the ANN features related to previous ACs, which are determined, in this case, based on previously predicted ACs. The MOS is 3.1, denoting a degradation of just 0.2 in perceived naturalness, but with a very low level of confidence between these two MOSs (86%).
The previous two paragraphs give the answers to the second and third questions. The prediction of a new AC, the previous ACs and PCs being known, has a cost in perceived naturalness of 1 in 5, and the prediction of a whole set of ACs has a cost of 1.3 in 5.
The prediction of PCs can be evaluated by the degradation in perceived naturalness between stimuli 5 and 6 (F0 model), because the only difference between those two stimuli is the use of predicted instead of estimated PCs. Stimulus number 6 also evaluates the complete F0 model. These stimuli achieved a MOS of 3.1, at the level of Fair naturalness. Concerning the PC model, its excellent performance is evident, because the degradation of perceived naturalness was just 0.05 in 5, with no significance at all (confidence level = 36%).
Stimuli 7 (Durations + F0 model with 0.75*Aa) and 8 (Durations + F0 model) consist of the proposed final version of the prosody model, the first with reduction of the accent components and the second just as predicted by the model. The degradation in perceived naturalness introduced by the inclusion of the duration model can be evaluated by the difference between the MOS values of stimuli 6 and 8. This difference is 0.2 in 5, similar to the difference between the natural and durations stimuli (0.4). The smaller difference between stimuli 6 and 8 can be explained by the fact that some badly predicted durations can be masked by poor F0 intonation. The reduction of the accent components produced no significant difference in MOS (0.04). Moreover, the very low level of confidence (27%) given by the analysis of variance shows no evidence that the stimuli are different. So, the reduction of the accent components did not prove to improve the perceived naturalness.
Finally, the complete model (Durations + F0 model) has a MOS of 2.9, at the level of Fair naturalness.
Similarly to the discussion of the previous perceptual test, a discussion follows about the correlation between the objective measurements, given by correlation coefficient and rmse, and the subjective evaluation, given by MOS, for modified F0 contours.
Table 5.13 presents the objective distance of the F0 contours between original and modified stimuli, and the subjective evaluation by means of MOS, for the five paragraphs. Stimulus 2 (Durations) is not included in this discussion because its modification is in the timing domain, while the other stimuli have their modifications in the F0 domain. The measures rmse and correlation coefficient for stimuli 7 and 8 are computed in comparison to the F0 contour determined after the timing modifications.
The correlation between the measurements along paragraphs is presented in Table 5.14. As in the case of the duration models, no significant correlation seems to exist between the subjective and objective measurements. In this case, the correlation between the two objective measurements, r and rmse, is -0.79. This value denotes a significant correlation between them. So, generally, the higher the correlation coefficient, the lower the rmse between the predicted and measured F0 contours along paragraphs. This correlation was not verified for the segmental durations.
Concerning the correlation between objective and subjective measurements along models, Table 5.15 presents the mean values along paragraphs of the measurement indicators r, rmse and MOS for the different types of stimuli with modified F0. Table 5.16 presents the correlation between those measurements, along stimuli 3 to 8. The correlation between rmse and MOS, r(rmse, MOS), considering also stimulus 0, is presented in the bottom line of the table. A very strong correlation exists between each pair of parameters. Therefore, a very strong correlation exists between the objective and subjective parameters when evaluating a model, as, again, between the two objective parameters, r and rmse. In this case, both objective parameters have the same correlation (0.976) with the subjective parameter. The negative values in the correlations involving rmse are due to its scale, which decreases with better closeness, in opposition to MOS and r.
Table 5.16: Correlation between mean values along models of indicator parameters.
In conclusion, as with the duration models, the perceived naturalness of two paragraphs produced with the same F0 model cannot be evaluated by comparing their own correlation coefficients or rmse. But the general naturalness of an F0 model, as with the duration models, can be evaluated by its rmse or correlation coefficient measured along several paragraphs.
5.4 Conclusion
In general, the perceptual tests confirmed the objective results, since a high correlation was found between perceived naturalness and the rmse and correlation coefficient of segmental durations or F0 contours along several paragraphs. However, the perceived naturalness of two paragraphs produced with the same duration model or F0 model cannot be evaluated by comparing their own correlation coefficients or rmse.
Concerning the proposed duration models, the perceptual tests confirmed the improved results achieved by the usage of dedicated ANNs. In face of the results, the alternative model, at the level of Good on the MOS scale, was selected for further developments.
The second perceptual test proved that the F0 contour of European Portuguese can be modelled by Fujisaki's model with high closeness to the original intonation.
The F0 model achieved the level of Fair on the MOS scale, although a significant reduction in naturalness was perceived. The test was designed to separate the loss in naturalness after the application of each sub-model. Almost all the loss in naturalness was perceived after the AC model, and no additional significant loss was felt after the PC model. These results may indicate that the PC model is of rather good quality and that the AC model needs to be improved. But the discussion in section 6.3 recommends a more detailed error contribution analysis.
The complete proposed model (durations + F0) achieves the level of Fair on the MOS scale.
6 Conclusions and Future Work
This chapter closes the thesis by making some observations about the most time-consuming tasks, briefly presenting the issues documented in the previous chapters and their detailed conclusions. A discussion follows about the error contribution of the sub-models of a model, together with a summary of the conclusions. The future work section points out some possibilities to improve the proposed model and the way to be followed in the near future.
Several constraints not identified at the beginning appeared during the work and were successfully overcome with some additional working time. In particular, some tasks were very time-consuming. The effort put into those tasks is not reflected in the previous descriptions. Those time-consuming, but not reported, tasks are mentioned here:
• construction of the FEUP-IPB speech database – this task consisted in preparing the corpus, finding a skilled professional speaker, recording the signal waveform, conveniently editing and storing the corresponding files, manually labelling the speech wave files at the phonetic, word and phrasal levels and, finally, identifying and correcting the inevitable errors and mistakes. This was one of the most time-consuming tasks, due to the extension of the speech database;
• estimation of the Fujisaki model parameters – this task consisted in creating a tool that allows an easy and intuitive way of manually estimating the PCs and ACs, and in the process of manually estimating those parameters in all the tracks used;
• training and refinement of ANNs – a huge number of different ANNs were trained hundreds of times. Combinations of different types of ANN, numbers of layers, numbers of nodes per layer, their respective activation functions, training algorithms, and different sets of features and their codifications were tested in order to select the best architecture accurately. Each candidate architecture was trained hundreds of times with different random seeds; each seed leads to a good but different final solution. Anyhow, the best solutions of different sessions have very similar performance;
• programming the extraction of features – hundreds of features were used in the duration and F0 ANN models. Programming the automatic extraction of those features from the labelled files and testing the results was also a very laborious task;
• extraction of reported results – all the reported intermediate and final measured results were obtained with routines developed for the particular purpose of obtaining those measures over the whole database;
• development of visualization tools – special tools were developed to allow the visualiza-
tion of the signal waveform, F0, commands, text, syllables, phones and other labels. Sev-
eral figures presented in chapter 4 were produced with those tools;
• publication of papers – several scientific papers were published reporting parts of this work. As is well known, the process of writing a scientific paper, preparing the presentation and presenting it at a scientific meeting takes several weeks of work. Despite the well-known richness of the knowledge acquired in this process, the total time taken by several publications is significant in a PhD schedule.
Other tasks, such as the writing of this document, are visible in the main text, and several other minor time-consuming tasks were also carried out. All those tasks together support this PhD thesis.
The duration and the F0 models, presented in chapters 3 and 4 respectively, involve several preparatory works and previous processing modules that were presented in chapter 2. Also in chapter 2, a preliminary study of the characteristics of the tonic syllable is reported. A final evaluation of the model and its components was performed and reported in chapter 5, by means of perceptual tests.
The results of this study were not directly used in the proposed prosody model, because the model depends on more features. Anyhow, the study played an important role, at that time, in clarifying the studies that followed.
The labelled speech database of EP that was created is an important resource, not available at the beginning of the work. Section 2.3 reports the FEUP-IPB speech corpus specially developed for this work. The database consists of several tracks read by a skilled professional speaker, totalling approximately 100 minutes. The speech files were labelled at the phonetic, word and phrase levels. Later, the Fujisaki model parameters, PCs and ACs, were estimated in 101 paragraphs. Some phonetic statistics were reported, as well as several phonetic change phenomena found in the database, such as dialectal and contextual changes. This database was used throughout the prosodic study.
Section 2.4 reports a module, developed prior to the prosody model, whose operation consists in splitting words into syllables. Two algorithms were proposed: one to split written text and the other to split the ‘spoken text’, i.e. the transcribed sequence of phonemes. Both algorithms were based on considering only syllables of the types V, VC, VCC, CV, CVC, CCV and CCVC as admissible in EP, plus a small additional set of rules. The second algorithm also considers syllables of the types C and CC, admitting that the original types CV, CVC and CCV suffered vowel reduction. The error rates measured in a text not seen during development were 0.06% and 0.89% per division, respectively. The second algorithm has an understandably higher error rate than the first
one, because of the additional difficulty of dealing with vowel reduction, which is very frequent in EP. Anyhow, both solutions reached acceptably low error rates.
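The splitting over admissible templates can be sketched as follows. This is a minimal illustration operating on consonant/vowel (C/V) class patterns rather than real graphemes; it assumes a maximal-onset choice of at most two consonants, and it does not reproduce the thesis's full rule set.

```python
# Hedged sketch of a template-based syllabifier over C/V patterns,
# assuming the admissible EP templates listed in the text
# (V, VC, VCC, CV, CVC, CCV, CCVC). Illustrative only.

ADMISSIBLE = {"V", "VC", "VCC", "CV", "CVC", "CCV", "CCVC"}

def syllabify(pattern: str) -> list:
    """Split a C/V pattern into syllables: between two nuclei, up to two
    consonants become the onset of the next syllable; the rest close the
    previous one. Word-final consonants all go to the last syllable."""
    vowels = [i for i, c in enumerate(pattern) if c == "V"]
    syllables = []
    prev_end = 0
    for idx, v in enumerate(vowels):
        if idx + 1 < len(vowels):
            nxt = vowels[idx + 1]
            onset_next = min(nxt - v - 1, 2)   # maximal onset, at most CC
            end = nxt - onset_next
        else:
            end = len(pattern)                 # final coda
        syllables.append(pattern[prev_end:end])
        prev_end = end
    return syllables
```

For example, `syllabify("CVCCCV")` yields `["CVC", "CCV"]`, both admissible templates.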
Section 2.5 presents several proposed sets of rules to convert graphemes into phonemes. This task is not part of the prosody model, but it is also very important to improve naturalness in synthetic speech. The presented rules are not exhaustive, considering that most graphemes have well-known and stable rules. Only the graphemes <a>, <e>, <o> and <x> deserved special attention; for those graphemes, an enlarged set of rules and respective exceptions is documented. The process of phonetic transcription from text in FEUP-TTS is also described, and considers a table of exceptions applied before the rules. The measured error rates for the graphemes <a> and <x> were 0.34% and 3.4% per phoneme, respectively. The larger error rate for grapheme <x> reflects the large number of unpredictable situations in the production of that grapheme's sound. No error rate measurements were made for the other graphemes, but a large error rate is expected because of the large number of phonemes those graphemes can be converted into. The elimination of those errors consists in adding the detected error situations to the table of exceptions. The problem of homograph words remains unsolved by the table of exceptions; for those cases, morphological and contextual information is needed. Post-lexical, or co-articulation, rules were also presented, to be applied after the grapheme-phoneme conversion rules. Those rules intend to reduce the unnatural distance between the formal lexical transcription of the text and the usual naturally produced phonetic sequence, justified by co-articulation effects.
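The transcription flow described above, with the exception table consulted before the rules, can be sketched as follows. The table entry and the single rule shown are illustrative placeholders (SAMPA-like symbols), not the thesis's actual data.

```python
# Hedged sketch: exceptions take priority over grapheme rules, as in the
# FEUP-TTS flow described above. Entries and the rule are hypothetical.

EXCEPTIONS = {"taxi": "taksi"}    # hypothetical entry for grapheme <x>

def x_rule(word, i):
    """Illustrative default rule: grapheme <x> -> /S/ (as in 'xaile')."""
    return "S"

def transcribe(word):
    if word in EXCEPTIONS:                    # table of exceptions first
        return EXCEPTIONS[word]
    out = []
    for i, g in enumerate(word):
        out.append(x_rule(word, i) if g == "x" else g)
    return "".join(out)
```

Detected error situations would be eliminated simply by adding the offending words to `EXCEPTIONS`, exactly as the text describes.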
6.2.2 Timing
Chapter 3 describes the segmental duration model. Two alternative models are proposed based
on the same concepts. The first one uses one ANN to predict the duration of any segment. The sec-
ond alternative uses one dedicated ANN for each segment type using the same set of features pro-
posed in the first alternative. A preliminary model to insert and predict pauses is also proposed.
The chapter starts by describing the state of the art in segmental duration models. The development of the first proposed model consisted in selecting the ANN architecture, the training algorithm, and the set of features and their codification. The architecture of the ANN was selected by a process of experimenting with all relevant alternatives and rejecting the ones that produced poor results. For the ones with the best performance, several hundreds of training sessions with random initial weights were performed in order to get the very best performance. The ones with better performance were then iteratively experimented with different sets of features. The set of features was selected by initially including the features with a relevant correlation coefficient with the output, and then measuring the final performance with and without each feature or group of features. The final selected architecture was a feed-forward ANN with 99 nodes in the input layer, 4 nodes in the first hidden layer activated by a hyperbolic tangent transfer function, 2 nodes in the second hidden layer activated by a logarithmic sigmoid transfer function, and one node in the output layer activated by a linear transfer function. The ANNs were trained with the Levenberg-Marquardt back-propagation algorithm. The features of the final set can be grouped in three levels of relevance:
• relevant features: position in relation to the tonic syllable; type of the vowel of the syllable; position relative to the end of the accent group and to the end of the phrase; distance to the next pause; position of the accent group in the phrase; identity of the previous segment; identities of the next three segments;
• slightly relevant features: type of syllable; type of the previous syllable; type of the vowel of the previous and next syllables; position relative to the beginning of the accent group and to the beginning of the phrase; length of the accent group; position of the accent group in the phrase counted from the beginning; suppression or not of the final vowel.
The inclusion of any single one of the slightly relevant features does not improve the final performance. However, the inclusion of several of them together really does improve it.
Several other types of features concerning linguistic and contextual information were considered, but were found not relevant enough. Very special attention was paid to the codification process in order to maximize performance while reducing the number of input nodes. In this way, features like the position in relation to the tonic syllable, the syllable type and the type of the vowel of the syllable were codified in just one node each, without loss in performance. The values for those features are taken from a table built by considering the correlation of their different possibilities with the output. The identity of the segment, however, was coded in 44 nodes, because any codification with a lower number of nodes reduces the final performance.
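The selected architecture can be sketched as a plain forward pass. The weights below are random placeholders, not trained values (in the thesis they are fitted with Levenberg-Marquardt), so the sketch only illustrates the 99-4-2-1 layer structure and the transfer functions.

```python
import numpy as np

# Hedged sketch of the selected 99-4-2-1 feed-forward architecture:
# tanh first hidden layer, log-sigmoid second hidden layer, linear
# output node predicting one segmental duration.

rng = np.random.default_rng(0)

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))

sizes = (99, 4, 2, 1)
W = [rng.standard_normal((m, n)) * 0.1
     for n, m in zip(sizes[:-1], sizes[1:])]   # placeholder weights
b = [np.zeros(m) for m in sizes[1:]]

def predict_duration(x):
    h1 = np.tanh(W[0] @ x + b[0])      # 4 tanh nodes
    h2 = logsig(W[1] @ h1 + b[1])      # 2 log-sigmoid nodes
    return (W[2] @ h2 + b[2]).item()   # linear output node: duration
```

The 99-dimensional input would carry the codified features discussed above (44 nodes for segment identity, one node each for the table-codified features, and so on).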
The unusual consideration of a large number of features contributed significantly to improving the final results.
The final results in a test set, comparing the predicted values with the measured ones as produced by the speaker, were a standard deviation of 19.46 ms and a correlation coefficient of 0.839. A statistical analysis of the prediction error shows that 75% of segments have an error below 20 ms, 90% below 30 ms and 95% below 40 ms.
An alternative model was proposed using basically all the attributes of the previous one, namely the basic ANN architecture and set of features, but using one dedicated ANN for each type of segment. This alternative model has the advantage that each segment is predicted with an ANN trained only with segments of its type, excluding the effects of other segment types. The disadvantage is that the knowledge of other segment types is not used in the training process of the dedicated ANN. Is the information of other segment types useful for a given type of segment? The final objective results of this alternative model showed that this information is not useful and should not be used.
The set of features of the alternative model is the same as in the previous model, excluding the identity of the segment. Concretely, the final alternative model consists of 44 ANNs with 55 nodes in the input layer and hidden and output layers equal to those of the previous model's ANN.
The final results of this alternative model in the same test set were a standard deviation of 18.2 ms and a correlation coefficient of 0.861. A statistical analysis of the prediction error shows that 75% of segments have an error below 18 ms, 90% below 30 ms and 95% below 37 ms.
The comparison of the standard deviations of measured and predicted durations with both models, as well as some observations, confirms the lower dispersion of the values predicted by the model, as expected from a statistical model. The models have more difficulty in predicting very long segment durations. An analysis of predicted durations by phoneme (segment type) showed that the maximum predicted values are always lower than the measured ones and that the minimum predicted values are always higher than the measured ones. This shows, once again, the smaller range of the durations predicted by the statistical model.
Both proposed models have an objective evaluation at the level of state-of-the-art models for other languages.
In chapter 5, a perceptual test was presented comparing both models with the natural speech produced by the speaker and with a ‘model’, called ‘no model’, that imposes on each segment a duration equal to the average duration of its segment type. The following conclusions resulted from the comparison of the MOS over 100 evaluations of each model:
• The alternative model was slightly preferred over the first proposed model, with an MOS of 3.93 against 3.78. Both models achieved the level of Good acceptability on the MOS scale. However, the low confidence level between the opinions on the two models shows no significant evidence that the models are different.
• Two original stimuli of each sentence were presented to the subjects. Their MOS were 4.13 and 4.27, i.e. at the level of Good acceptability on the MOS scale. The distance of the alternative model stimuli to the average of the original stimuli was 0.27; its closeness is confirmed by the very low confidence level between the alternative model and one of the original stimuli. The first proposed model's stimuli are 0.42 away from the original stimuli.
• The ‘no model’ stimuli had an MOS of 2.88, which is 0.9 and 1.05 away from the first proposed model and the alternative model, respectively. Although the ‘no model’ stimuli were never preferred over the other stimuli, for any subject or paragraph, they still reach the Fair level on the MOS scale.
The perceptual tests confirmed the good acceptability of both proposed segmental duration models.
The slight preference for the alternative model justified its selection for further development in the final proposed prosody model.
About 70% of pauses are imposed by orthographic punctuation marks; the other 30% occur between words and are usually associated with prosodic phrasing. Only the pauses associated with punctuation marks were studied for pause insertion, because of the absence of the syntactic information needed to determine semantic group boundaries.
The statistical analysis of the database showed that the sentence marker “.” always imposes the insertion of a pause. The comma, “,”, imposes the insertion of a pause in 65% of cases. Other punctuation markers, like “?”, “!”, “;”, “:” and “(”, seem to always impose a pause, but without statistical significance. Finally, the quotation marker imposes a pause only 20% of the time.
An ANN was proposed to predict the duration of pauses. The features used consisted of the type of sentence marker associated with the previous, current and next pauses, and the distances to the previous and following pauses, in a total of 17 input nodes. The results achieved in the test set were an rmse of 95 ms and a correlation coefficient of 0.54.
Although the final results in the test set are similar to those achieved in other work [Navas, 2003], the model is not considered reliable, because the rmse is significant compared with the standard deviation of the measured durations. A larger database, built for pause studies, is needed; it does not need to be phonetically labelled, but it must contain a large number of pauses.
Chapter 4 begins with an overview of some ways of coding the F0 contour for prosodic modelling. It then presents the concepts behind the Fujisaki model and its mathematical formulation. The effects of varying each parameter of the PCs and ACs are analysed. Then, the process of estimating the parameters, and the tool developed to carry it out, are presented. After that, a model for the prediction of PCs is proposed, consisting of an algorithm to control their insertion in the text and ANNs to predict their magnitudes and final positions. Next, a model to control the ACs is proposed; this model predicts the existence or not of an AC associated with each syllable, and its amplitude, onset time and offset time, using ANNs. Finally, the results are analysed.
Seven tracks of the FEUP-IPB speech database were separated into 101 paragraphs of variable length. The values of the baseline frequency, Fb, the natural angular frequency of the phrase control mechanism, α, the natural angular frequency of the accent control mechanism, β, and the relative ceiling level of the accent components, γ, were experimentally verified to have constant values for the present speaker: 75 Hz, 2.0 /s, 20 /s and 0.9, respectively. For each paragraph, the PCs were inserted so that the phrase component crosses the lower levels of the F0 contour. Then, the ACs were estimated with the initial aim of reducing the distance between the original and estimated F0 contours. A strong relation between ACs and syllables was found, so the inserted ACs were associated with syllables. Syllables with voiced sounds usually have one associated AC; sometimes there is no AC associated with a syllable, but two ACs are never associated with one syllable. The final rmse between the estimated and original F0 was 3.98 Hz and the correlation coefficient was 0.973. Usually, no perceptible differences exist between the original utterances and those re-synthesised with the estimated F0 contours. Later, the perceptual test confirmed this proximity, with the MOS of the estimated F0 contours at 4.38 and of the original ones at 4.61.
It is important to mention that the Fujisaki model allows some degree of freedom between the PCs and the ACs used to produce a very similar F0 pattern, but this freedom is severely reduced by using rules or linguistic constraints.
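The Fujisaki superposition with the speaker-specific constants reported above (Fb = 75 Hz, α = 2.0 /s, β = 20 /s, γ = 0.9) can be sketched as follows. The PC and AC parameter values used in the example are illustrative, not estimated from the database.

```python
import numpy as np

# Hedged sketch of the Fujisaki model: ln F0 is the baseline plus
# phrase components (impulse responses) plus accent components
# (ceiling-limited step responses), then exponentiated.

FB, ALPHA, BETA, GAMMA = 75.0, 2.0, 20.0, 0.9   # speaker constants above

def Gp(t):
    """Phrase-control impulse response (critically damped 2nd-order)."""
    return np.where(t >= 0, ALPHA ** 2 * t * np.exp(-ALPHA * t), 0.0)

def Ga(t):
    """Accent-control step response, limited by the ceiling gamma."""
    r = np.where(t >= 0, 1.0 - (1.0 + BETA * t) * np.exp(-BETA * t), 0.0)
    return np.minimum(r, GAMMA)

def f0_contour(t, pcs, acs):
    """pcs: list of (T0, Ap); acs: list of (T1, T2, Aa)."""
    ln_f0 = np.full_like(t, np.log(FB))
    for T0, Ap in pcs:
        ln_f0 = ln_f0 + Ap * Gp(t - T0)
    for T1, T2, Aa in acs:
        ln_f0 = ln_f0 + Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(ln_f0)

t = np.linspace(0.0, 3.0, 301)
f0 = f0_contour(t, pcs=[(0.0, 0.5)], acs=[(0.5, 0.9, 0.4)])
```

The degree of freedom mentioned above shows up directly here: different (PC, AC) combinations can produce nearly the same `f0` curve.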
The PC model operates in two phases. The first one inserts PCs in the text, associated with the beginnings of the accent groups. The second phase predicts the magnitude and the anticipation used to determine the final exact position in the speech timing.
The first phase consists of an algorithm to insert PCs in the text. According to experimental measurements, 70% of the PCs are associated with orthographic marks; the remaining 30% have no such association. Although the percentage of PCs associated with orthographic marks is very similar to the one presented for pauses, there is no full correspondence between pauses and PCs: the number of pauses is higher than the number of PCs. The eligible positions to insert PCs are only the beginnings of the accent groups. The algorithm starts by inserting one PC at every orthographic mark. Then it removes the PCs that are very close to the previous one. Then, it inserts PCs in the gaps between PCs longer than 3 s, by means of a weighted score. The score considers the following factors: the distance to the previous and next PC, the presence of a pause, the length of the previous word and the type of the previous word.
The weight of every factor was experimentally determined. The positions of the inserted PCs are very consistent with the positions of the labelled ones.
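The two-pass placement just described can be sketched as follows. Times are in seconds; the 3 s gap threshold comes from the text, while the minimum spacing of 0.5 s and the scoring function are illustrative placeholders for the weighted score actually used.

```python
# Hedged sketch of the PC placement algorithm described above:
# pass 1 inserts one PC per orthographic mark and drops PCs too close
# to the previous one; pass 2 fills gaps longer than 3 s with the
# best-scoring eligible accent-group onset.

MIN_GAP, MAX_GAP = 0.5, 3.0   # MIN_GAP is an assumed spacing

def place_pcs(mark_times, candidate_times, score):
    """mark_times: accent-group onsets at orthographic marks;
    candidate_times: all eligible accent-group onsets;
    score: ranks candidates when filling long gaps."""
    pcs = []
    for tm in sorted(mark_times):            # pass 1
        if not pcs or tm - pcs[-1] >= MIN_GAP:
            pcs.append(tm)
    filled = True
    while filled:                            # pass 2: fill gaps > 3 s
        filled = False
        for a, b in zip(pcs, pcs[1:]):
            if b - a > MAX_GAP:
                inside = [c for c in candidate_times
                          if a + MIN_GAP < c < b - MIN_GAP]
                if inside:
                    pcs.append(max(inside, key=score))
                    pcs.sort()
                    filled = True
                    break
    return pcs
```

A real `score` would combine the distance, pause, word-length and word-type factors with the experimentally determined weights.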
The second phase consists in the prediction, by means of ANNs, of the magnitude of the PC and of its anticipation relative to the initial eligible position (the beginning of the accent group). Two ANNs were used because of the low correlation between the parameters Ap and T0a (anticipation). The process of selecting the architecture and the set of features was very similar to the one described for the duration model. The selected ANN to predict Ap is a feed-forward ANN with 20-2-2-1 nodes in its layers; the first and second hidden layers use the logarithmic sigmoid transfer function, and the output node a linear transfer function. The selected ANN to predict T0a is a feed-forward ANN with 21-4-2-1 nodes in its layers; the first and second hidden layers use the hyperbolic tangent and logarithmic sigmoid transfer functions, and the output node a linear transfer function. The Levenberg-Marquardt back-propagation training algorithm was used for both ANNs. A set of 20 features was used in Ap's ANN, and the magnitude of the previous PC was used as an additional feature in the T0a ANN. The final correlation coefficients in the test set were 0.772 and 0.649 for Ap and T0a, respectively. These values are the highest published in similar works, although the perceptual test results should not be disregarded.
The AC model predicts the existence or not of an AC associated with each syllable and, in the positive case, predicts the parameters: amplitude, onset time anticipation and offset time anticipation. The onset and offset times are determined by an anticipation relative to the beginning and end of the voiced part of the speech in the syllable. Again, the low correlation between parameters led to the use of four ANNs. The process of selecting the architectures and the sets of features was similar to the one described for the duration model. A set of 25 or 27 features was used, according to the parameter, and the final performance in the test set was measured for each parameter.
Again, these results are the highest published in similar works. However, some important parameters, like Aa, still show low correlation; some information is still missing in the model to improve this parameter. Observed results showed that most of the ACs produce an accent component that, added to the other components, fits the original F0 contour closely. Nevertheless, in some other cases, low accent components did not follow the high values of the original F0 pattern. In those cases only a rough approximation is achieved, because of the amplitude, because of T1 or T2, or because of the closeness between ACs. Again, the perceptual test is important for a final judgement on the obtained quality.
Table 6.1 summarises the objective evaluation of the stimuli used in the perceptual test: estimated F0 (stimuli 3), predicted ACs with estimated PCs (stimuli 5), predicted ACs and PCs (stimuli 6) and durations + F0 models (stimuli 8).
In order to evaluate the loss introduced by the AC model, the distance in F0 contours when replacing the estimated ACs by the predicted ones was measured. This was the most significant loss measured in the whole model. In the paragraphs used, the rmse increased by about 12 Hz and the correlation coefficient decreased by about 0.4, comparing columns 2 and 3 of Table 6.1. This does not mean that the accent component deteriorates that much, as will be discussed in section 6.3.
To evaluate the loss introduced by the PC model, the F0 contour with predicted ACs is taken as the reference and compared with the F0 contour with both PCs and ACs predicted. This model assumes that the ACs are dependent on the PCs. The new F0 contour is produced by predicting the ACs again, because the set of PCs is new and the AC model uses this information in its input features. The AC model stays exactly the same, and all features, except the PC features, also stay exactly the same. The new set of ACs differs from the reference one only through the change in the PC features. No significant loss was measured between the new and the reference F0 contours: Table 6.1 presents an insignificant reduction in rmse (-0.2 Hz) and a reduction in r of 0.06, between columns 3 and 4.
Table 6.1: Summary of the average (over the 5 paragraphs) evaluation parameters for the 4 stimuli types used in the perceptual tests.
When the F0 model (AC and PC models) is applied over the duration model, no significant changes in the rmse and r of the predicted F0 contours exist, as can be observed in columns 4 and 5 of Table 6.1. This is coherent, because no change in the F0 patterns is introduced by the duration model. The change in the MOS between those columns is due to the timing changes and not to F0 pattern changes.
The perceptual test presented in chapter 5 compares 9 stimuli, in order to measure the audible degradation in naturalness introduced by each component of the whole prosody model. In general, the MOS of the perceptual test for each type of stimuli confirms the objectively measured results commented on above. The following main observations resulted from the comparison of the MOS over 95 evaluations of each type of stimuli:
• Stimuli with the estimated F0 contour, by means of the manually labelled PCs and ACs, were relatively close to the original stimuli: 4.4 and 4.6 on the MOS scale, respectively.
• Stimuli with the F0 contour predicted by means of estimated PCs and predicted ACs got a score of 3.1, denoting a significant degradation in perceived naturalness. The degradation from the estimated F0 stimuli on the MOS scale was almost 1.3. This subjective evaluation confirms the previously discussed objective results. Again, this degradation may not be due only to the accent component.
• Stimuli with the F0 contour predicted by means of predicted PCs and predicted ACs, in other words the complete F0 model, got a score of 3.1, denoting no additional degradation introduced by the prediction of the PCs. The very low confidence level (36%)
between the stimuli with estimated and predicted PCs denotes that no evidence exists to consider them different types of stimuli.
Generally, the results of the perceptual test confirm the objectively measured results.
Information about focus is decisive for improving the correctness of the accent components.
The duration model was also considered in this second test, together with the complete model. Again, one stimulus with just the segmental durations predicted by the alternative model was used, and another stimulus with the predicted F0 contour applied over the previous speech signal. This last stimulus corresponds to the proposed complete prosody model. The following main observations resulted:
• Stimuli with segmental durations predicted by the alternative model achieved an even better score (4.2) than in the previous test. This small change could be caused by the presence in the same test of other stimuli (the F0-modified stimuli) with less naturalness. The distance from these stimuli to the original ones was 0.4.
• The complete model achieved a final score of 2.9, which is at the Fair level on the MOS scale. This score is 0.2 away from the score of the stimuli with predicted F0, corresponding to the loss resulting from the introduction of the segmental duration model.
A similar decrease of 1.3 in the MOS occurred with the introduction of the predicted F0. This decrease can be observed in two comparisons: the comparison of the stimuli with estimated and predicted F0 contours (decrease from 4.4 to 3.1), and the comparison of the stimuli with just the segmental durations modified with those with the complete model (decrease from 4.2 to 2.9).
However, no similar decrease in MOS occurred with the introduction of the duration model, as seen, on the one hand, in the comparison of the original and predicted-duration stimuli (decrease from 4.6 to 4.2) and, on the other hand, in the comparison of the stimuli with predicted F0 with those with the complete prosody model (decrease from 3.1 to 2.9). The smaller decrease in the second case can be explained by the lower level of naturalness in the F0 contour.
Finally, a comparison between the objective and subjective measurements used to evaluate the model was made in chapter 5. This comparison led to the conclusion that “perceived naturalness in two paragraphs produced with the same model cannot be evaluated by comparing their own correlation coefficients or rmse; but the general naturalness of a model can be evaluated by the rmse or correlation coefficient measured along several paragraphs”. The rmse and r along several paragraphs are highly correlated with the MOS of the perceptual test.
Fig. 6.1 represents the error in S5, eS5, and in S6, eS6, considering that the PC and AC error axes are orthogonal. eS5 has only an AC error component (ACe), with value eAC, and no PC error component (PCe), since the estimated PCs were used and, supposedly, have no error. S6 has the same AC component, eAC, but a different absolute (measured) error, eS6. The figure shows that even a rather small increase in the absolute error between S5 and S6, δe, can correspond to a significant increase in the PC error component, δePC.
Indeed, the present model considers that the PC and AC axes are not orthogonal, because the ACs are considered dependent on the PCs; it must be remembered that the set of features of the AC model contains features related to the PCs. Therefore, Fig. 6.2 presents the same analysis considering non-orthogonal axes. The represented angle between the axes was selected to produce a clear figure and was not measured. It can be observed that now even a lower absolute (measured) error in situation S6 can produce a significant PC error component.
A group of experiments and measurements can be studied with the objective of measuring the angle between the PC and AC components.
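The decomposition in Figs. 6.1 and 6.2 can be sketched numerically: given the measured absolute errors eS5 (assumed to be purely AC error, eAC) and eS6, the PC error component follows from the law of cosines for an assumed angle between the PC and AC error axes. An angle of 90° reproduces the orthogonal case; the true angle was not measured in the thesis, so any other value is an assumption.

```python
import math

# Hedged sketch: recover the PC error component e_pc from the measured
# absolute errors, for an assumed angle between the PC and AC error axes.

def pc_error(e_ac, e_s6, angle_deg=90.0):
    """Solve e_s6^2 = e_ac^2 + e_pc^2 + 2*e_ac*e_pc*cos(angle) for e_pc
    (law of cosines on the two error components)."""
    c = math.cos(math.radians(angle_deg))
    b = 2.0 * e_ac * c
    disc = b * b - 4.0 * (e_ac ** 2 - e_s6 ** 2)
    return (-b + math.sqrt(disc)) / 2.0
```

With orthogonal axes, eAC = 3 and eS6 = 5 give ePC = 4. With an obtuse angle (e.g. 120°), the same ePC = 4 is compatible with an eS6 even smaller than eAC, which is exactly the point made about Fig. 6.2.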
Fig. 6.1 – PC and AC error components in stimuli 5 and 6, considering orthogonal axes.
Fig. 6.2 – PC and AC error components in stimuli 5 and 6, considering non-orthogonal axes.
This analysis demonstrated that a non-significant change in the error of the F0 pattern when the PC model is applied does not mean that the PC model does not introduce degradation. A similar conclusion can be drawn from the analysis of the error components of the F0 model and the duration model.
The main outcomes of this work can be summarised as follows:
• a prosody model for EP, for TTS purposes, to be implemented in FEUP-TTS system;
• a speech database in EP, FEUP-IPB, labelled at the phoneme, word, phrase and F0 lev-
els;
Two segmental duration models based on ANNs were proposed. The most important conclusions can be summarized in the following items:
• the use of a large number of features contributed to improve the final results;
• both proposed segmental duration models have a Good acceptability in the objective
and subjective measurements.
• the use of one dedicated ANN for each type of segment improves the final performance of the model, because the knowledge carried by other types of segments may damage the learning process of the ANN;
• The level of Good was achieved by the duration model in the perceptual tests.
A model to predict the F0 contour, based on the Fujisaki model and using mainly ANNs to predict the Phrase Commands and the Accent Commands, was developed. The PC sub-model proceeds in two phases: the first phase associates PCs with the text, based on a mathematical model obtained from the experimental data; the second phase predicts the PCs' magnitudes and exact positions in the speech signal using ANNs. Then, the AC model associates ACs with syllables and predicts their amplitudes, onset times and offset times, using ANNs. The following main conclusions can be pointed out:
• the process and features assure good correlation coefficients for the predicted parameters;
• the loss in naturalness measured by the MOS is significant when the AC model is applied and is not significant when applying the PC model; but this does not necessarily mean that the AC model is solely responsible;
• The level of Fair was achieved by the F0 model in the perceptual tests.
The complete model achieved a final score of 2.9, which is at the Fair level on the MOS scale.
A Prosody Model to TTS Systems
Finally, a high correlation was found between the MOS of a perceptual test and the rmse and r measured between predicted and original values of segmental duration and F0 along several paragraphs. This leads to the conclusion that rmse and r, when measured along several paragraphs, are very good evaluators of the perceived naturalness of a model (duration model or F0 model).
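The two objective measures referred to here follow their standard definitions; a short sketch:

```python
import math

def rmse_and_r(pred, orig):
    """Root-mean-square error and Pearson correlation coefficient r between
    predicted and original values (segmental durations or F0 samples)."""
    n = len(pred)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, orig)) / n)
    mp, mo = sum(pred) / n, sum(orig) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, orig))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in orig))
    return rmse, cov / (sp * so)
```

Averaged over several paragraphs, these two numbers tracked the MOS closely in the perceptual tests reported above.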
Since this work was mainly dedicated to the prosody module, the further developments pointed out here focus on this module. Some hints for improving the performance of the duration and F0 models are discussed.
For instance, the identification and special treatment of longer segments in the duration model could introduce a small improvement in the final performance. However, as long as only the same restricted information is used, no significant improvement in the duration model is to be expected.
A special-purpose database devoted to pausing studies would allow the development of a better pausing or phrasing module. The use of reliable prosodic phrasing could, perhaps, introduce some improvements in the final model.
Concerning the F0 module, some improvements may be introduced in the AC model, for instance by imposing additional restrictions on the proximity between ACs, or by associating ACs with units other than syllables. Both suggestions reduce the number of ACs, producing a looser fit to the original F0 and a flatter prosody. A flatter F0 pattern, although less interesting, can attenuate or even hide occasional wrong movements. However, real improvements can only be achieved by going beyond the restrictions of the presently available information. The identification of prominent words or syllables, or of the focus, is the most relevant information needed. For this, several other kinds of input information must be used, because prominence information cannot be derived from text morphology alone. Syntactic knowledge can probably provide some additional information useful for the segmental duration model [Ribeiro et al., 2003], but semantic knowledge is more reliable for producing focus information.
As written before, only part of the linguistic information has been used. Further significant developments will surely need other kinds of information, for instance a module to produce non-linguistic and/or paralinguistic information.
The idea of feeding new kinds of information into prosodic systems is not new, but it is limited by the difficulty of dynamically obtaining non-linguistic or paralinguistic information. Some researchers have suggested that the new generation of synthesizers will be not text-to-speech but concept-to-speech. Possibly, this approach intends to avoid the need to extract the non-linguistic and paralinguistic information introduced by the speaker in speech production. Additionally, concept-to-speech already contains the semantic knowledge. It is well known that speech conveys information or concepts by way of words; without words there is no speech. The new difficulty introduced by this prospective generation of synthesizers will be the concept-to-text or concept-to-words processing.
The present prosody module was produced for read speech from a particular speaker. It can be developed further by the introduction of several other functionalities: in the future, the prosody model can incorporate different speech rates, different prosodic styles, emotions and different text types besides read speech. Perhaps it can be optimised for the theme of the text, depending on whether it is news, a weather forecast, a scientific document, mathematical formulae, etc. It could also be extended to provide features for facial modelling.
This prosody module cannot be considered complete without a model to predict the intensity pattern.
Unfortunately, no prosody module can yet produce truly natural patterns. Some very specialised applications with reasonably natural synthetic speech can be found but, despite the long evolution of TTS systems over the years, no TTS system yet exists that can produce natural speech for all applications.
Looking into the future, we see the long way still to be crossed in order to reach the pursued objective of truly natural synthetic speech; but looking backwards, we can also see the long way already crossed. This gives hope of reaching that objective in a not-so-distant future.
Bibliography
Allen, J.; Hunnicutt, S. and Klatt, D. H.. (1987). From Text to Speech: The MITalk System. Cambridge University Press, Cambridge.
Andrade, E. and Viana, M.. (1988). Ainda Sobre o Ritmo e o Acento em Português. In actas do 4º Encontro
da Associação Portuguesa de Linguística. Lisbon, 3-5.
Barbosa, F.; Ferrari, L. and Resende, F. G.. (2003). A Methodology to Analyse Homographs for a Brazilian
Portuguese TTS System. In Computational Processing of the Portuguese Language, 6th International
Workshop, PROPOR Proceedings. Faro, pp.57-61.
Barbosa, F.; Pinto, G.; Resende, F. G.; Gonçalves, C. A.; Monserrat, R. and Rosa, M. C.. (2003). Grapheme-
Phone Transcription for a Brazilian Portuguese TTS. In Computational Processing of the Portuguese Lan-
guage, 6th International Workshop, PROPOR Proceedings. Faro, pp.23-30.
Barbosa, P. and Bailly, G.. (1994). Characterisation of rhythmic patterns for text-to-speech synthesis. Speech
Communication, 15: 127-137.
Barbosa, P. and Bailly, G.. (1997). Generation of pauses within the z-score model. In Progress in Speech Synthesis by Van Santen J. P. H., Sproat R. W., Olive J. P. and Hirschberg J. Editors. Springer Verlag, New York, pages 365-381.
Barbosa, P.. (1997). A Model of Segment (and Pause) Duration Generation for Brazilian Portuguese Text-to-
Speech Synthesis. Proceedings of Eurospeech’97, Rhodes, pages 2655-2658.
Barros, M. J.. (2002). Estudo Comparativo e Técnicas de Geração de Sinal para Síntese da Fala. Master The-
sis, Faculdade de Engenharia da Universidade do Porto.
Benenati, C. (2000). Separación en Silabas.
http://www.lclark.edu/~benenati/silabacento/silabas.html.
Bergström, M. and Reis, N.. (1997). Prontuário Ortográfico e Guia da Língua Portuguesa. Editorial Notícias.
Boersma, P.. (1993). Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise
ratio of a sampled sound. Proceedings of the Institute of Phonetics Science of the University of Amsterdam
17: 97-110.
Braga, D.; Freitas, D.; Teixeira, J. P. and Marques, A.. (2003). On the Use of Prosodic Labelling in Corpus-
Based Linguistic Studies of Spontaneous Speech. In proceedings of Text Speech and Dialogue, Ceske
Budejovice, Czech Republic, pages 388-394.
Braga, D.; Freitas, D. and Ferreira, H.. (2003). Processamento Linguístico Aplicado à Síntese da Fala. In pro-
ceedings of III Congresso Luso-Moçambicano de Engenharia, Maputo/Moçambique. 2º Vol. Pg. 1349-
1360.
Brinckmann, C. and Trouvain, J.. (2003). The Role of Duration Models and Symbolic Representation for Tim-
ing in Synthetic Speech. International Journal of Speech Technology 6, 21-31.
Campbell, W. N. and Isard, S. D.. (1991). Segment durations in a syllable frame. Journal of Phonetics, 19: 37-47.
Campbell, W. N.. (1992). Syllable-based Segmental Duration. In Talking Machines: Theories, Models and Designs, by G. Bailly, C. Benoit and T. Sawallis, Elsevier, Oxford, pages 211-224.
Campbell, W. N.. (1993). Predicting Segmental Durations for Accommodation Within a Syllable-Level Tim-
ing Framework. Proceedings of Eurospeech’93, vol. 2, pages 1081-1084.
Campbell, W. N.. (2000). Timing in Speech: A Multi-Level Process. In Prosody: Theory and Experiment. Ed-
ited by Merle Horne, Kluwer Academic Publishers, pages 281-334.
Carvalho, P.; Oliveira, L.; Trancoso, I. and Viana, M.. (1998). Concatenative Speech Synthesis for European
Portuguese. Proc. of the third ESCA/COCOSDA International Workshop on Speech Synthesis. Jenolan
Caves, Australia.
Caseiro, D. and Trancoso, I.. (2002). Grapheme-to-Phone Using Finite State Transducers. Proc. 2002 IEEE
Workshop on Speech Synthesis. Santa Monica, California.
Catarino, D.. (2000). Separação Silábica, http://www.option-line.com/members/dilson/Silabas.htm.
Chu, M. and Feng, Y.. (2001). Study on Factors Influencing Durations of Syllables in Mandarin. Proceedings
of Eurospeech’01, Scandinavia, pages 927-930.
Córdoba, R.; Vallejo, J. A.; Montero, J. M.; Gutierrez-Arriola, J.; López, M. A. and Pardo, J. M.. (1997).
Automatic Modelling of Duration in a Spanish Text-to-Speech System Using Neural Networks. Proceed-
ings of Eurospeech’99, vol. 4, pages 1619-1622.
Cunha, C. and Cintra, L.. (1997). Nova Gramática do Português Contemporâneo, Edições João Sá
da Costa.
D’Alessandro, C. and Mertens, P.. (1995). Automatic pitch contour stylization using a model of tonal percep-
tion. Computer Speech and Language 9, 257-288.
Demuth, H. and Beale, M.. (2000). Neural Network Toolbox, for use with Matlab – User’s Guide, version 4,
by the Math Works.
Dutoit, T. and Leich, H.. (1992). Improving the TD-PSOLA Text-to-Speech Synthesizer with a Specially De-
signed MBE Re-Synthesis of the Segments Database. In Vandewalle, J., Boite, R., Moonen, M. and Ooster-
linck, A. (eds), SIGNAL PROCESSING VI: Theories and Applications. Elsevier Science Publishers B. V.
Fackrell, J.; Vereecken, H.; Martens, J.-P. and Van Coile, B.. (1999). Multilingual prosody modelling using
cascades of regression trees and neural networks. Proceedings of Eurospeech’99, Budapest, pp. 1835-1838.
Fackrell, J.; Vereecken, H.; Grover, C.; Martens, J.-P. and Van Coile, B.. (2002). Corpus-based Development
of Prosodic Models Across Six Languages, pages 120-128, in E. Keller, G. Bailly, A. Monaghan, J. Terken,
& M. Huckvale (editors), Improvements in Speech Synthesis, Edited by John Wiley & Sons,West Sussex.
Fant, G.; Liljencrants, J. and Lin, Q.. (1985). A four parameter model of glottal flow. In Speech Transmission
Laboratory – QPSR, 1:1-12.
Ferreira, H.. (2003). Contributo para a leitura automática de textos científicos. Graduation final project
/FEUP. July, 2003. http://www.fe.up.pt/~hfilipe/projecto
Ferreira, M. C.. (1998). Intonation in European Portuguese. In Intonation Systems – A Survey of Twenty Lan-
guages, by Daniel Hirst and Albert Di Cristo, Cambridge University Press, pages 167-178.
Freitas, D.; Moura, A.; Braga, D.; Ferreira, H.; Teixeira, J. P.; Barros, M. J.; Gouveia, P. and Latsch, V..
(2002). A Project of Speech Input and Output in an E-commerce Application. In Advances in Natural Lan-
guage Processing, Proceedings of Third International Conference, PorTAL 2002. Faro, Portugal.
Fromkin, V. and Rodman, R.. (1983). Introdução à Linguagem. Editora Almedina. Coimbra. Portugal.
Frota, S.. (1991). Para a Prosódia da Frase: Quantificador, Advérbio e Marcação Prosódica (Somente alguns
tópicos em foco). Masters Dissertation, Faculdade de Letras da Universidade de Lisboa.
Frota, S.. (2000). Prosody and Focus in European Portuguese, Phonological Phrasing and Intonation. Gar-
land Publishing Inc., New York.
Fujisaki, H. and Hirose, K.. (1984). Analysis of voice fundamental frequency contours for declarative sen-
tences of Japanese. Journal of the Acoustical Society of Japan (E), 5(4):233-241.
Fujisaki, H. and Narusawa, S.. (2002). Automatic Extraction of Model Parameters from Fundamental Fre-
quency Contours of Speech. Proceedings for 2001 2nd Plenary Meeting and Symposium on Prosody and
Speech Processing, pp. 133-138. Sanjo-Kaikan, University of Tokyo.
Fujisaki, H.; Narusawa, S.; Ohno, S. and Freitas, D.. (2003). Analysis and Modeling of F0 Contours of Portu-
guese Utterances Based on the Command-Response Model. Proceedings of Eurospeech’03, Geneva. Pages
2317-2320.
Fujisaki, H.. (1983). Dynamic characteristics of voice fundamental frequency in speech and singing. In MacNeilage, P. F., Editor, The Production of Speech, pages 39-55. Springer-Verlag.
Fujisaki, H.. (1988). A note on the physiological and physical basis for the phrase and accent components in
the voice fundamental frequency contour. In Fujimura, O., Editor, Vocal Fold Physiology: Voice Produc-
tion, Mechanisms and Functions, pages 347-355. Raven, New York.
Fujisaki, H.. (1997). Prosody, Models, and Spontaneous Speech. In Sagisaka, Y., Campbell, N. and Higuchi,
N., Computing Prosody, edited by Springer-Verlag New York, Inc. Pages 27-42.
Fujisaki, H., (2002). Modeling in study of Tonal Features of Speech with Application to Multilingual Speech
Synthesis. Proceedings of Joint International Conference of SNLP and Oriental COCOSDA. Thailand.
Goubanova, O. and Taylor, P.. (2000). Using Bayesian Belief Networks for model duration in text-to-speech
systems. Proceedings of ICSLP 2000, Beijing.
Goubanova, O.. (2001). Predicting segmental duration using Bayesian belief network. Proceedings 4th ISCA
Tutorial and Research Workshop on Speech Synthesis, Scotland.
Gouveia, P. D.; Teixeira, J. P. and Freitas, D.. (2000). Divisão Silábica Automática do Texto Escrito e Falado.
Actas do V PROPOR, Processamento Computacional da Língua Portuguesa Escrita e Falada, Atibaia – S.
Paulo. Pages 65-74.
Granqvist, S.. (1996). Enhancements to the Visual Analogue Scale, VAS, for listening tests. Speech, Music
and Hearing, Quarterly Progress and Status Report, Royal Institute of Technology. Pages 61-65.
Guimarães, R. C. and Cabral, J. A. S.. (1997). Estatística. Edição Revista, McGraw Hill de Portugal.
Hagan, M. T. and Menhaj, M.. (1994). Training feedforward networks with the Marquardt algorithm, IEEE
Transactions on Neural Networks, vol. 5, nº 6, pp.989-993.
Hirose, K.; Furuyama, Y.; Narusawa, S.; Minematsu, N. and Fujisaki H.. (2003). Use of Linguistic Informa-
tion for Automatic Extraction of F0 Contour Generation Process Model Parameters. Proceedings of Eu-
rospeech 2003, Geneva. Pages 141-144.
Hirschberg, J. and Pierrehumbert, J. B.. (1986). The intonational structuring of discourse. Proceedings of the
24th ACL Meeting. Pages 136-144, New York.
Hirst, D. and Di Cristo, A.. (1998). Intonation Systems – A Survey of Twenty Languages. Cambridge Univer-
sity Press.
Hirst, D. and Espesser, R.. (1993). Automatic modelling of fundamental frequency using a quadratic spline
function. Travaux de L’Institut de Phonétique d’Aix, 15, 71-85.
Hirst, D.; Di Cristo, A. and Espesser, R.. (2000). Levels of Representation and Levels of Analysis for the De-
scription of Intonation Systems. In Merle Horne, Prosody: Theory and Experiment. Edited by Kluwer Aca-
demic Publishers, Dordrecht, pages 51-87.
Hirst, D.. (2002). Automatic Analysis of Prosody for Multi-lingual Speech Corpora. In E. Keller, G. Bailly, A.
Monaghan, J. Terken and M. Huckvale, Improvements in Speech Synthesis, Cost 258: The naturalness of syn-
thetic speech, edited by John Wiley & Sons,West Sussex. Pages 320-327.
Horne, M.. (2000). Prosody: Theory and Experiment. Kluwer Academic Publishers. Dordrecht.
Huang, X.; Acero, A. and Hon, H.. (2001). Spoken Language Processing – A guide to Theory, Algorithm, and
System Development. Prentice Hall, New Jersey.
Huckvale, M.. Speech Filing System: Tools for Speech Research. http://www.phon.ucl.ac.uk/resource/sfs/
Keller, E. and Zellner, B.. (1997). Les Défis Actuels en Synthèse de la Parole, Etudes de Lettres. Revue de la
Faculté des Lettres de l’Université de Lausanne.
Keller, E.; Bailly, G.; Monaghan, A.; Terken, J. and Huckvale, M.. (2002). Improvements in Speech Synthesis,
Cost 258: The naturalness of synthetic speech. Edited by John Wiley & Sons,West Sussex.
Klatt, D. H.. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Jour-
nal of Acoustic Society of America, 59, 1208-1220.
Kochanski, G. and Shih, C.. (2002). Prosody and Prosodic Models. Tutorial of ICSLP 2002 Denver.
Ladd, D. R. and Cutler, A.. (1983). Models and Measurements in the study of prosody. In, Cutler, A. E Ladd,
D. R., Prosody: Models and Measurements. Springer-Verlag, Berlin.
LiMin, Fu. (1994). Neural Networks in Computer Intelligence. McGraw-Hill International Editions, Computer
Science Series.
Masaki, M.; Kashioka, H. and Campbell, N.. (2002). Modeling the Timing Characteristics of Different Speak-
ing Styles. Proceeding of IEEE 2002 Workshop on Speech Synthesis.
Mateus, M.; Andrade, A.; Viana, M. and Villalva, A.. (1990). Fonética, Fonologia e Morfologia do Portu-
guês. Universidade Aberta, Lisbon.
McClelland, J. L. and Rumelhart, D. E.. (1986). Parallel Distributed Processing – Explorations in the Micro-
structure of Cognition. Volume 2 – Psychological and Biological Models. The Massachusetts Institute of
Technology Press.
Mixdorff, H. and Jokisch, O.. (2001). Building An Integrated Prosodic Model of German. Proceedings of Eu-
rospeech’01, Aalborg. Pages 947-950.
Mixdorff, H.. (1998). Intonation Patterns of German – Model-based Quantitative Analysis and Synthesis of F0
Contours. Doktor-Ingenieurs Dissertation, Technische Universität Dresden.
Mixdorff, H.. (2000). A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters.
Proceedings of ICASSP 2000, vol. 3, pages 1285 – 1288, Istanbul.
Mixdorff, H.. (2002). An Integrated Approach to Modeling German Prosody. Doktor-Ingenieur habilitatus
Dissertation, Technische Universität Dresden.
Möbius, B.; Pätzold, M. and Hess, W.. (1993). Analysis and synthesis of German F0 contours by means of Fu-
jisaki’s model. Speech Communication 13, 53-61.
Moulines, E. and Charpentier, F.. (1990). Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones. Speech Communication 9, 453-467.
Moulines, E. and Laroche, J.. (1995). Non-Parametric techniques for pitch-scale and time-scale modification
of speech. Speech Communication 16, 175-205.
Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2001). Automatic Extraction of Parameters from
Fundamental Frequency Contours of Speech. Proceedings of ICSP 2001, Daejeon, Korea.
Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2002a). A Method for Automatic Extraction of
Model Parameters from Fundamental Frequency Contours of Speech. Proceedings of ICASSP 2002, vol. 1
pp.509-512, Orlando, USA.
Narusawa, S.; Minematsu, N.; Hirose, K. and Fujisaki, H.. (2002b). Automatic Extraction of Model Parame-
ters from Fundamental Frequency Contours of English Utterances. Proceedings of ICSLP 2002, vol. 3
pp. 1725-1728, Denver, USA.
Navas, E.; Hernáez, I. and Sánchez, J.. (2002a). Basque Intonation Modelling for Text To Speech Conversion.
Proceedings of ICSLP’02, Denver, USA.
Navas, E.; Hernáez, I. and Sánchez, J.. (2002b). Subjective Evaluation of Synthetic Intonation. IEEE 2002
Workshop on Speech Synthesis. Santa Monica, USA.
Navas, E.; Hernáez, I.; Armenta, A.; Etxebarria, B. and Salaberria, J.. (2000). Modelling Basque Intonation
Using Fujisaki’s Model and CARTs. In state of the art in speech synthesis digest, 3/1 – 3/6.
Navas, E.. (2003). Modelado Prosódico del Euskera Batúa para Conversión de Texto a Habla. PhD thesis,
Universidad del País Vasco, Escuela Superior de Ingenieros de Bilbao.
Olaszy, G.. (1991). The inherent time structure of speech sounds. In Mária Gósy, Temporal Factors in Speech,
a collection of papers, edited by Research Institute for Linguistics, Hungarian Academy of Sciences.
Olaszy, G.; Németh, G. and Olaszy, P.. (2001). Automatic Prosody Generation – a Model for Hungarian. Pro-
ceedings of Eurospeech’01, Aalborg. Pages 525-528.
Oliveira, L.; Viana, M. and Trancoso, I.. (1991). DIXI – Portuguese Text-to-Speech System. Proc. of Euros-
peech’91. Genoa, Italy.
Oliveira, L.; Viana, M. and Trancoso, I.. (1993). DIXI: Sistema de Síntese da Fala a Partir do Texto para o
Português. Proc. EPLP’93 – 1º Encontro de Processamento da Língua Portuguesa Escrita e Falada. Lis-
boa.
Oliveira, L.. (1996). Síntese de Fala a Partir de Texto. PhD thesis, Universidade Técnica de Lisboa.
Oliveira, M.. (2002). Pausing Strategies as Means of Information Processing in Spontaneous Narratives. Pro-
ceedings of Speech Prosody 2002, Aix-En-Provence. Pages 539-542.
Pierrehumbert, J. B.. (1980). The Phonology and Phonetics of English Intonation. PhD thesis, Massachusetts
Institute of Technology.
Rabiner, L. and Schafer, R.. (1978). Digital Processing of Speech Signals. Prentice-Hall.
Ribeiro, R.; Oliveira, L. and Trancoso, I.. (2003). Using Morphossyntactic Information in TTS Systems: Com-
paring Strategies for European Portuguese. In Proc. PROPOR 2003. Faro, Portugal. Pages 143-150.
Riedmiller, M. and Braun, H.. (1993). A direct adaptive method for faster backpropagation learning: The
RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks.
Rossi, P.; Palmieri, F.; Cutugno, F.. (2002). A Method for Automatic Extraction of Fujisaki-Model Parame-
ters. Proceedings of Speech Prosody 2002, Aix-En-Provence. Pages 615-618.
Rumelhart, D. E. and McClelland, J. L.. (1986). Parallel Distributed Processing – Explorations in the Micro-
structure of Cognition. Volume 1 – Foundations, The Massachusetts Institute of Technology Press.
Salgado, X. F. and Banga, E. R.. (1999). Segmental Duration Modelling in a Text-to-Speech System for the
Galician Language. Proceedings of Eurospeech’99, Budapest. Pages 1635-1638.
Silverman, K. and Pierrehumbert, J.. (1990). The Timing of Prenuclear High Accents in English. Papers in
Laboratory Phonology I, J. Kingston and M. Beckman, (eds), Cambridge University Press, Cambridge
UK. 72-106.
Souza, M. N.; Caprini, E. J.; Machado, C. G.; Ludolf, M. V.; Calôba, L. P.; Seixas, J. M.; Resende, F. G.; Net-
to, S. L.; Freitas, D.; Teixeira, J. P.; Espain, C.; Pêra, V. and Moreira, F.. (1999). Developing a Voiced In-
formation Retrieval System for the Portuguese Language Capable to Handle Both Brazilian and Portuguese
Spoken Versions. Proceedings of Eurospeech’99, Budapest.
Taylor, P.. (1994). The rise / fall / connection model of intonation. Speech Communication 15, 169-186.
Taylor, P.. (2000). Analysis and Synthesis of Intonation using the Tilt Model. Journal of the Acoustical Society of America, vol. 107(3), pp. 1697-1714.
Teixeira, J. P. and Freitas, D.. (2002). Acoustic Characterisation of the Tonic Syllable In Portuguese, pages
120-128, in E. Keller, G. Bailly, A. Monaghan, J. Terken, & M. Huckvale (Editors), Improvements in
Speech Synthesis, Edited by John Wiley & Sons,West Sussex.
Teixeira, J. P. and Freitas, D.. (2003a). Evaluation of a Segmental Durations Model for TTS. In Computa-
tional Processing of the Portuguese Language, 6th International Workshop, PROPOR Proceedings. Faro,
pp.40-48.
Teixeira, J. P. and Freitas, D.. (2003b). Segmental Durations Predicted With a Neural Network. Proceedings
of Eurospeech’03, Geneva. Pages 169-172.
Teixeira, J. P.; Freitas, D. and Fujisaki, H.. (2003). Prediction of Fujisaki Model’s Phrase Commands. Pro-
ceedings of Eurospeech’03, Geneva. Pages 397-400.
Teixeira, J. P.; Freitas, D. and Fujisaki, H.. (2004). Prediction of Accent Commands for the Fujisaki Intonation
Model. Proceedings of Speech Prosody 2004, Nara - Japan. Pages 451-455.
Teixeira, J. P.; Freitas, D.; Braga, D.; Barros, M. J. and Latsch, V.. (2001). Phonetic Events from the Labeling
the European Portuguese Database for Speech Synthesis, FEUP/IPB-DB. Proceedings of Eurospeech’01,
Aalborg. Pages 1707-1710.
Teixeira, J. P.; Freitas, D.; Gouveia, P.; Olaszy, G. and Németh G.. (1998). MULTIVOX – Conversor Texto
Fala Para Português. In III Encontro Para o Processamento Computacional da Língua Portuguesa Escrita
e Falada - PROPOR, Porto Alegre – Brasil.
Teixeira, J. P.; Rosa, E.; Freitas, D. and Pinto, M. da G.. (1999). Acoustical Characterization of the Accented
Syllable in Portuguese, A Contribution to the Naturalness of Speech Synthesis, Proceedings of the Eu-
rospeech’99, Budapest. Volume 4, Page 1651-1654.
Teixeira, J. P.. (1995). Modelização Paramétrica de Sinais Para Aplicação em Sistemas de Conversão Texto-
Fala. Masters dissertation, Faculdade de Engenharia da Universidade do Porto.
Trancoso, I.; Viana, M.; Silva, M.; Marques, G. and Oliveira, L.. (1994). Rule-Based versus Neural Network
Based Approaches to Letter-to-Phone Conversion for Portuguese Common and Proper Names. In Proc. In-
ternational Conference on Spoken Language Processing. Yokohama, Japan.
Van Santen, J. P. H.. (1992). Contextual Effects on Vowel Duration. Speech Communication. 11(6):513-546.
Van Santen, J. P. H.. (1994). Assignment of segmental duration in text-to-speech synthesis. Computer Speech
and Language, 8, 95-128.
Van Santen, J. P. H.. (1997). Segmental Duration and Speech Timing. In Sagisaka, Y., Campbell, N. and Higuchi, N., Computing Prosody, edited by Springer Verlag, New York.
Vereecken H.; Martens J.-P.; Grover C.; Fackrell J. and Van Coile B.. (1998). Automatic Prosodic Labeling of
6 Languages. Proceedings of ICSLP’98, Sydney, Australia, Vol. 4, pp. 1399-1402.
Viana, M. C.; Oliveira, L. and Mata, A. I.. (2001). Prosodic Phrasing: Machine and Human Evaluation. TTS
workshop 2001, Edinburgh.
Viana, M. C.; Oliveira, L. and Mata, A. I., (2003). Prosodic Phrasing: Machine and Human Evaluation. Inter-
national Journal of Speech Technology 6, 83-94.
Vorstermans, A.; Martens, J.-P. and Van Coile, B.. (1996). Automatic segmentation and labeling of multi-lingual
speech data, Speech Communication, 271-293.
Zellner, B., (1994). Pauses and the Temporal Structure of Speech. In Eric Keller, Fundamentals of Synthesis
and Speech Recognition, Basic Concepts, State-of-the-Art and Future Challenges, by John Wiley & Sons,
Chichester.
Zellner, B., (1998). Caractérisation et prédiction du débit de parole en français – Une étude de cas. Thèse pré-
sentée pour obtenir le grade de Docteur en Lettres, Université de Lausanne.
Zellner, B.. (2001). Les enjeux de la simulation scientifique L’exemple du rythme de la parole. Actes des
Journées Prosodie 10-11 Octobre 2001.
Zvonik, E. and Cummins, F.. (2002). Pause Duration and Variability in Read Texts. Proceedings of ICSLP’02,
Denver, USA.
Matlab® – The Language of Technical Computing, Using Matlab, version 6, 2000. Math Works.
Standard Publication No. 297, IEEE, (1969). IEEE Recommended Practice for Speech Quality Measurements. IEEE Transactions on Audio and Electroacoustics. Vol. AU-17, no. 3. 1969.