


International Journal of Computer Applications (0975 – 8887)
Volume 71– No.24, June 2013

The Development of Pashto Speech Synthesis System

Muhammad Akbar Ali Khan
Department of Computer Systems Engineering,
University of Engineering & Technology, Peshawar, Pakistan

Sahibzada Abdur Rehman Abid
Department of Computer Systems Engineering,
University of Engineering & Technology, Peshawar, Pakistan

Fatima Tuz Zuhra
Department of Computer Science,
University of Peshawar, Peshawar, Pakistan

Nasir Ahmad
Department of Computer Systems Engineering,
University of Engineering & Technology, Peshawar, Pakistan

ABSTRACT
This paper presents a Pashto text-to-speech (TTS) synthesis system based on data-driven techniques such as Classification and Regression Trees (CART), bigrams, and Non-Uniform Units (NUUs). A modular concatenative TTS system has been developed for the Pashto language. Speech synthesis is carried out through a series of steps intended to provide a progressively more complete transcription of the text, from which the final speech signal is then generated. The steps can be divided into two modules: a Natural Language Processing (NLP) module and a Digital Signal Processing (DSP) module. These steps incrementally enrich the information derived from the input and place it on a generally accessible internal data structure. The goal is to gather enough information on this internal data structure to generate intelligible and natural speech.

General Terms
Speech synthesis, Pashto speech synthesis, concatenative speech synthesis

Keywords
Pashto speech synthesis, Classification and Regression Tree, Non Uniform Units, Pashto TTS

1. INTRODUCTION
Speech synthesis is the process which takes a sequence of words as input and converts them to an acoustic signal. It is the opposite of speech recognition, where speech is converted into the corresponding text. Systems for automatically generating speech parameters from a linguistic representation (such as a phoneme string) were not available until the 1960s [1], and systems for converting ordinary text into speech were first developed in the 1970s, with MITalk being the most popular such system at the time [2]. In the early days of synthesis, research efforts were devoted mainly to simulating human speech production mechanisms, using basic articulatory models based on electro-acoustic theories. Though this modeling is still one of the ultimate goals of synthesis research, advances in computer science have widened the field of text-to-speech processing to include not only human speech production but also the modeling of text processing [3]. In [4], a TTS system for the Maltese language has been proposed, transforming arbitrary textual input into spoken output.

In the proposed Pashto speech synthesis system, the input is Pashto text and the output is its corresponding speech signal. A number of applications can potentially take advantage of such a system. The rest of the paper is organized as follows. Section 2 describes different methods of speech synthesis. Section 3 explains the proposed Pashto speech synthesis system, while Section 4 concludes the findings of this work.

2. SPEECH SYNTHESIS METHODS
A number of methods for speech synthesis have been proposed in the literature. All of these methods fall largely into one of the following three categories: articulatory synthesis, formant synthesis, or concatenative synthesis. Each of these methods has its own advantages and disadvantages.

2.1 Articulatory Synthesis
Articulatory synthesizers are physical models based on a detailed description of the physiology of speech production and the physics of sound generation in the human vocal apparatus [5]. To make computers speak by articulatory synthesis, the human vocal apparatus is modeled by combining electrical, mechanical, and electronic components, and a robotic talking head is made that produces sound similar to a person's [6]. It is the most difficult approach, as the physiology of human speech production is not yet fully understood. Recent progress in speech production imaging, articulatory control modeling, and tongue biomechanics modeling has led to significant improvements in the way articulatory synthesis is performed [7]. Articulatory synthesizers are computationally costly and difficult to debug, which is why they remain far from practical applications.

2.2 Formant Synthesis
Formant synthesis is a descriptive acoustic-phonetic approach to speech synthesis [3]. In formant synthesis, parameters such as fundamental frequency and noise levels are varied over time to create a waveform of artificial speech. Formant synthesis is based on the source-filter model of speech; it is the most broadly used synthesis method and has two basic structures, cascade and parallel. Synthesis of different voices and voice characteristics, and the modeling of emotive speech, have kept research on formant synthesis active [8]. At least three formants are required to produce intelligible speech; however, up to five formants have been used to produce higher-quality speech. Each formant is usually modeled with a two-pole resonator, which enables both the formant frequency and its bandwidth to be specified [9]. Rule-based formant synthesis is based on a set of rules determining the parameters necessary to synthesize a desired utterance [2].
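The two-pole resonator mentioned above can be illustrated with a short sketch. The Klatt-style coefficient formulas below are a standard textbook formulation rather than anything specified in this paper, and the /a/-like formant values, sample rate, and source model are illustrative assumptions.

```python
import math

def resonator_coeffs(f, bw, fs):
    """Klatt-style two-pole resonator coefficients for formant
    frequency f (Hz) and bandwidth bw (Hz) at sample rate fs (Hz)."""
    c = -math.exp(-2.0 * math.pi * bw / fs)
    b = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * f / fs)
    a = 1.0 - b - c                      # normalizes the gain at DC to 1
    return a, b, c

def apply_resonator(x, f, bw, fs):
    """Filter x through one resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(f, bw, fs)
    y = [0.0, 0.0]
    for s in x:
        y.append(a * s + b * y[-1] + c * y[-2])
    return y[2:]

fs = 16000
f0 = 120                                  # fundamental frequency (Hz)
# 100 ms impulse-train source, a crude stand-in for the glottal source
source = [1.0 if n % (fs // f0) == 0 else 0.0 for n in range(1600)]
speech = source
for f, bw in [(730, 60), (1090, 90), (2440, 120)]:  # rough /a/ formants, cascade
    speech = apply_resonator(speech, f, bw, fs)
```

Cascading three such resonators over a periodic source, as above, is the cascade structure the section refers to; the parallel structure would instead sum the resonator outputs with per-formant gains.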


The infinite number of sounds that formant synthesis can provide makes it more flexible than the other synthesis methods.

2.3 Concatenative Synthesis
Concatenative synthesis is the generation of natural-sounding synthesized speech waveforms by selecting and concatenating speech units from a large database [10]. It is the simplest way of producing natural and intelligible synthetic speech. Locating the correct unit is the most important factor in concatenative synthesis. Shorter units need less memory, but the collection and labeling of the speech samples becomes complex and difficult. Longer units, on the other hand, need more memory; however, more naturalness, fewer concatenation points, and finer control of co-articulation can be achieved. The units used can be words, syllables, demisyllables, phonemes, diphones, or triphones [11]. The word is perhaps the most natural unit for written text and a suitable unit for a limited-vocabulary synthesis system. Concatenation of words is relatively easy to perform, and the co-articulation effects within a word are captured in the stored units. However, words uttered in isolation differ greatly from their utterance in continuous sentences, making the synthesized continuous speech sound unnatural [2]. Phonemes are the most commonly used units in speech synthesis, as they are the standard linguistic representation of speech. Moreover, the inventory of fundamental units is usually between 40 and 50, which is clearly the minimum compared with the other units [2].

3. PASHTO SPEECH SYNTHESIS SYSTEM
The transduction procedure for Pashto speech synthesis is achieved through a sequence of steps which gives a detailed transcription of the text, from which the corresponding speech is finally derived. These steps can be divided into two modules, the NLP module and the DSP module, as shown in Figure 1.

Fig 1: Pashto Text to Speech System. [Block diagram: Text -> NLP Module (Preprocessor -> Morphological Analysis -> Contextual Analysis -> Prosody Parser -> Phonetizer -> Prosody Generator) -> DSP Module (Concatenative Synthesis).]

3.1 Natural Language Processing
The NLP module analyzes the text to derive a suitable phonetic transcription that can finally be used by the DSP module. The subtasks of the NLP module are discussed in more detail in the following.

3.1.1 Pre-Processing
The preprocessor block transforms the text into processable input in the form of a word list. The function of the preprocessor is to divide the incoming sentences into tokens and to resolve punctuation ambiguity, such as whether a full stop indicates the end of a sentence.

3.1.2 Morphological Analysis
The morphological analyzer uses lexical information to obtain a morphological parse for each word and thus recognizes its possible part-of-speech categories. The part-of-speech categories of Pashto words can be expressed in the form of a morphological dictionary which gives a list of all words linked with their part-of-speech categories, as shown below in Table 1.

Table 1: Pashto words and corresponding parts of speech
Word        POS                  Word      POS
tlwyzywn    Noun                 Iw        Noun
bIa         Adverb               zh        Pronoun
hm          Adverb               kE        Postposition
slamwnh     Noun                 wRandE    Adverb
kwm         Verb                 tasw      Pronoun
ghg         Intransitive verb    ahtram    Noun
d           Pronoun              aw        Conjunction
,           Punctuation          gwr@i     Verb
twns        Noun                 ph        Preposition
amnItI      Adjective            wkR@i     Verb
.           Punctuation          hklh      Noun

3.1.3 Contextual Analysis
For the contextual analysis of Pashto, a bigram model has been used. In the bigram model, the probability of a tag depending on the previous tag is considered. The bigram model is sketched using a set of states that represent the part-of-speech categories of the grammar. Every transition from state y to state x is associated with a transition probability P(cx|cy), which is the probability for a word of category cy to be followed by a word of category cx. The transition probability is the probability of a part-of-speech category following the current one, while the emission probability is the probability of a word occurring within a given category. A state-dependent probability P(wx|cy) is calculated for each state and every word in the vocabulary, which gives the probability that category cy appears as word wx. The transition and emission probabilities are shown in Table 2.

The emission and transition probabilities are determined from the occurrences of word and tag combinations in a corpus. The emission probability P(wx|cy) is estimated by the number of times wx appears as cy, divided by the total number of words with part-of-speech category cy.
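The count-based estimation just described can be sketched in a few lines. The toy tagged corpus below is hypothetical (a handful of words from Table 1); the real system estimates these counts from a tagged Pashto corpus.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs; the real system uses a
# tagged Pashto corpus with the categories listed in Table 2.
corpus = [("d", "Pronoun"), ("twns", "Noun"), ("ph", "Preposition"),
          ("hklh", "Noun"), ("kE", "Postposition"), ("bIa", "Adverb"),
          ("gwr@i", "Verb"), (".", "Punctuation")]

tags = [t for _, t in corpus]
tag_count = Counter(tags)                    # #(c): tokens per category
emit_count = Counter(corpus)                 # #(w, c): word w tagged c
trans_count = Counter(zip(tags, tags[1:]))   # #(cy followed by cx)

def emission(word, tag):
    """P(w|c), estimated as #(w, c) / #(c)."""
    return emit_count[(word, tag)] / tag_count[tag]

def transition(prev_tag, tag):
    """P(cx|cy), estimated as #(cx after cy) / #(cy)."""
    return trans_count[(prev_tag, tag)] / tag_count[prev_tag]
```

For this toy corpus, emission("twns", "Noun") is 0.5, since one of the two Noun tokens is twns; real corpus counts yield tables of probabilities like Table 2.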


P(wx|cy) ≈ #(wx, cy) / #(cy)                    (1)

In the same way, the bigram transition probability between categories cy and cx is estimated by the number of times cx appears after cy, divided by the total number of words with part-of-speech category cy:

P(cx|cy) ≈ #(cx, cy) / #(cy)                    (2)

Once the emission and transition probabilities are estimated, finding the best sequence of tags for a given sentence reduces to selecting the sequence of part-of-speech tags with the highest probability given the sequence of words and the bigram model.

Table 2: Emission and Transition Probabilities
Part of speech  Words       Emission     POS                Transition
(POS)                       Probability                     probability
Adjective       amnItI      0.2000       Noun               0.6000
                chWr        0.2000       Transitive Verb    0.2000
                mhm         0.2000       Verb               0.2000
                mtid        0.2000
                pwrh        0.2000
Adverb          bIa         0.2500       Adverb             0.2500
                hm          0.2500       Pronoun            0.5000
                nh          0.5000       Verb               0.2500
Conjunction     aw          0.6667       Adjective          0.1667
                chE         0.3333       Postposition       0.1667
                                         Pronoun            0.5000
                                         Verb               0.1677
Intransitive    ghg         0.5000       Noun               0.5000
Verb            Im          0.5000       Punctuation        0.5000
Noun            afghanstan  0.0294       Adjective          0.0588
                amrika      0.0294       Adverb             0.0882
                ghwrdzng    0.0294       Conjunction        0.0294
                hklh        0.0294       Intransitive verb  0.0588
                tlwyzywn    0.0588       Verb               0.4412
                twns        0.0588       Noun               0.0588
                ...                      Postposition       0.0294
                dzwakwnw    0.0294       Preposition        0.0588
                                         Pronoun            0.1765
Postposition    kE          0.3333       Adjective          0.3333
                srh         0.6667       Preposition        0.3333
                                         Pronoun            0.3333
Preposition     ph          0.6667       Conjunction        0.3333
                th          0.3333       Noun               0.3333
                                         Pronoun            0.3333
Pronoun         d           0.5333       Adjective          0.0667
                dE          0.0667       Noun               0.7333
                xpl         0.0667       Pronoun            0.1333
                zmwng       0.0667       Verb               0.0667
                ...
Punctuation     ,           0.2000       Conjunction        0.2500
                .           0.8000       Noun               0.2500
                                         Pronoun            0.5000
Transitive      chpawl      1.0000       Noun               1.0000
Verb
Verb            awr@i       0.0833       Conjunction        0.2500
                bh          0.0833       Preposition        0.0833
                wRandE      0.1667       Pronoun            0.1667
                ...                      Punctuation        0.3333
                                         Verb               0.1667

3.1.4 Prosodic Parser
In Pashto speech synthesis, prosodic phrases are identified with a rather simple "chinks 'n chunks" algorithm [12]. In the proposed system, a prosodic phrase break is automatically set when a word belonging to the chunks group is followed by a word classified as a chink. Chinks comprise conjunctions, prepositions, pronouns, and postpositions; chunks comprise adjectives, adverbs, intransitive verbs, nouns, transitive verbs, verbs, and punctuation. The classes of chinks and chunks considered for the synthesis of Pashto speech are given in Table 3.

Table 3: Pashto Chinks and Chunks
Pashto Chinks: zh, xpl, aw, tasw, d, srh, ph, kE, chE, dE, IE, lh, hghh, twlw, mwng, etc.
Pashto Chunks: Iw, dzl, ghg, bIa, Im sId, wRandE, sIlman, slamwnh, awnR@y, kwm, afghanstan, mhm, amrIka, ashna, twns, tlwyzywn, xbrwnh, gwr@i, awr@i, srprst, jmhwr, etc.

3.1.5 Phonetizer
In the proposed synthesis system, a corpus-based phonetizer has been developed, implemented as a decision tree trained on real data. The features used in the decision tree are only the letter currently being phonetized, the part of speech of the current word, and the letters to the left and right of the current letter. In the Pashto training corpus, a phonetic transcription is given for each word, so that each letter of the word obtains its phonetic symbol. A phonetic character is assigned to each phoneme by choosing the phonetic symbol used in the corpus. The CART tree is implemented in MATLAB recursively, so that building a tree from its top is the same as building a tree from any of its interior nodes. This phonetization was tested on the entire Pashto test corpus to obtain the part-of-speech details for each word, and no error was found.

3.1.6 Prosody Generator
Prosody is achieved as a result of unit selection from a large speech corpus. Phonetic features such as the current and neighbouring phonemes, as well as linguistic features such as stress, the position of the phoneme within its word, the position of the word within its prosodic phrase, the position of the prosodic phrase within the sentence, and the part-of-speech tag of the current word, are used to find a sequence of speech segments, or units, taken from the speech corpus whose features most closely match those of the speech unit to be synthesized.
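The chinks 'n chunks phrase-break rule of Section 3.1.4 is simple enough to sketch directly. The class sets follow the category groups named in the text; the example tag sequence is hypothetical.

```python
# Category groups from Section 3.1.4
CHINKS = {"Conjunction", "Preposition", "Pronoun", "Postposition"}
CHUNKS = {"Adjective", "Adverb", "Intransitive verb", "Noun",
          "Transitive verb", "Verb", "Punctuation"}

def phrase_breaks(tags):
    """Return indices i such that a prosodic phrase break falls after
    word i: a chunk-group word followed by a chink-group word."""
    return [i for i in range(len(tags) - 1)
            if tags[i] in CHUNKS and tags[i + 1] in CHINKS]

# Hypothetical tag sequence for a short tagged sentence:
tags = ["Pronoun", "Noun", "Preposition", "Noun", "Verb", "Punctuation"]
breaks = phrase_breaks(tags)   # a break after the first Noun (a chunk)
```

For this sequence the only break falls after index 1, where a Noun (chunk) is followed by a Preposition (chink).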

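A minimal sketch of the Viterbi-style unit-selection search that the prosody generator and the DSP module rely on: each target diphone has a short list of candidate units from the database, each candidate is scored with a target cost (feature mismatches) plus a concatenation cost, and the cheapest path through the candidates is kept. The cost functions and unit records below are illustrative assumptions, not the paper's actual selection cost.

```python
def target_cost(unit, target):
    # Illustrative cost: number of mismatched linguistic features.
    return sum(1 for k in target if unit["features"].get(k) != target[k])

def concat_cost(prev, unit):
    # Free join if the two units were adjacent in the corpus recording.
    return 0 if prev["end"] == unit["start"] else 1

def viterbi_select(targets, candidates):
    """candidates[i] is a list (at most 10 in the paper's system) of
    database units for targets[i]; returns the minimum-cost sequence."""
    best = [(target_cost(u, targets[0]), [u]) for u in candidates[0]]
    for tgt, cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            cost, path = min(((c + concat_cost(p[-1], u), p) for c, p in best),
                             key=lambda cp: cp[0])
            new_best.append((cost + target_cost(u, tgt), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])[1]

# Two target diphones, two hypothetical candidate units each:
targets = [{"phone": "aI"}, {"phone": "Ib"}]
candidates = [
    [{"features": {"phone": "aI"}, "start": 0, "end": 10},
     {"features": {"phone": "aX"}, "start": 50, "end": 60}],
    [{"features": {"phone": "Ib"}, "start": 10, "end": 20},
     {"features": {"phone": "Ib"}, "start": 80, "end": 90}],
]
chosen = viterbi_select(targets, candidates)
```

Here the search picks the matching first candidate of each list, since they join with zero concatenation cost.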

3.2 Digital Signal Processing
The DSP module operates on the phonetic transcription obtained from the previous module and creates the speech waveform that can be reproduced audibly. In this work, the concatenative synthesis approach has been adopted. Twenty Pashto sentences are stored in the text corpus, and the same sentences are recorded and stored as .wav files. The HMM-based text-to-speech alignment system [13] is used to create the segmentation files. Each line of a segmentation file contains a start point, an end point, and a phoneme name. Alignment, on the other hand, is trained on the degree of correspondence between the assumed phonemic transcription and the actual list of phonetic units produced. In some cases a difference between the assumed phonemic transcription and the actual list of phonetic units occurs due to co-articulation, which cannot be taken into account in the phonemic transcriptions. The segmentation files are checked and corrected where needed using the Wavesurfer tool. A speech unit database is generated from the segmented speech, containing information about the current phoneme, the previous phoneme, the next phoneme, the index of the part of speech of the current word, the index of the current prosodic phrase within the current sentence, the number of prosodic phrases remaining until the end of the sentence, the index of the current word within the current prosodic phrase, the number of words remaining until the end of the current prosodic phrase, the index of the sentence containing the phoneme, and the start and end points of the current phoneme in the related .wav file. A few entries in the database are shown in Figure 2.

'_#In1113' [1] [   0] [ 108]
'I#Wn1113' [1] [ 108] [ 394]
'WIDn1113' [1] [ 394] [ 699]
'DWZn1122' [1] [ 699] [ 928]
'ZDLn1122' [1] [ 928] [1324]
'LZXn1122' [1] [1324] [1993]
Fig 2: Database entries

NUU synthesis formats targets by appending the linguistic context features to them. It then searches the speech unit database for available diphones similar to the target diphones and selects a maximum of 10 units per diphone to accelerate the search process. The Viterbi algorithm finds the best sequence of units by minimizing the selection cost. Finally, the selected diphones from the speech corpus are concatenated to produce the final synthetic speech.

4. CONCLUSION
A Pashto speech synthesis system utilizing the bigram model, CART, and NUU techniques has been presented. The emission and transition probabilities for each word in the Pashto word dictionary are calculated through the bigram model. CART is efficient due to its low computational requirements and great flexibility. NUU synthesis matches the available diphones in the speech unit database to the target diphones, while the Viterbi algorithm finds the best sequence of units. The Pashto speech synthesis system produces natural speech for the sentences in the speech corpus, while for other sentences it produces nearly natural audio, with minor discontinuities. In future work, the problems of acronyms, abbreviations, and out-of-vocabulary words will be considered.

5. REFERENCES
[1] R. Sproat and J. Olive, "Text-to-Speech Synthesis", in V. K. Madisetti and D. B. Williams (eds.), Digital Signal Processing Handbook, Ch. 46, CRC Press, 1998.
[2] J. Allen, M. S. Hunnicutt, and D. Klatt, From Text to Speech, Cambridge University Press, Cambridge, 1987.
[3] J. Allen, M. S. Hunnicutt, and D. Klatt, From Text to Speech: the MITalk System, Cambridge University Press, Cambridge, 1987.
[4] P. J. Farrugia, "Text-To-Speech Technologies for Mobile Telephony Services", MSc thesis, Dept. of Computer Science and AI, University of Malta, 2005.
[5] S. Parthasarathy and C. H. Coker, "Automatic estimation of articulatory parameters", Computer Speech and Language, vol. 6, no. 1, pp. 37-75, 1992.
[6] B. Baxter and W. J. Strong, "WINDBAG—a vocal-tract analog speech synthesizer", Journal of the Acoustical Society of America, vol. 45, no. 1, p. 309, 1969.
[7] P. Birkholz, D. Jackel, and B. J. Kröger, "Construction and control of a three-dimensional vocal tract model", ICASSP 2006, Toulouse, France, pp. 873-876, 2006.
[8] R. Carlson, B. Granström, and I. Karlsson, "Experiments with voice modelling in speech synthesis", Speech Communication, vol. 10, pp. 481-490, 1991.
[9] R. Donovan, "Trainable Speech Synthesis", PhD thesis, Cambridge University Engineering Department, England, 1996.
[10] A. J. Hunt and A. W. Black, "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database", ATR Interpreting Telecommunications Research Labs, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan, 1996.
[11] S. Lemmetty, "Review of Speech Synthesis Technology", MSc thesis, Helsinki University of Technology, Department of Electrical and Communications Engineering, March 30, 1999.
[12] M. J. Liberman and K. W. Church, "Text Analysis and Word Pronunciation in Text-to-Speech Synthesis", in S. Furui and M. M. Sondhi (eds.), Advances in Speech Signal Processing, pp. 791-831, Dekker, New York, 1992.
[13] F. Malfrere, O. Deroo, T. Dutoit, and C. Ris, "Phonetic Alignment: Speech-Synthesis-based versus Viterbi-based", Speech Communication, vol. 40, no. 4, pp. 503-517, 2003.

IJCATM : www.ijcaonline.org
