TTS SRM Speech
Introduction
• Technology - converts written text into spoken language.
• Process - transforms text input, such as written words and
sentences, into audible speech.
TTS-18th Century
• First instance of a machine that
could produce synthesized
speech - late 18th century.
TTS block diagram
Typical TTS system
TTS - working
• Front-end - Two major tasks.
• Text normalization / Pre-processing / Tokenization - Converts raw
text containing symbols like numbers and abbreviations into the
equivalent of written-out words.
• Text-to-phoneme or grapheme-to-phoneme conversion -
Assigns phonetic transcriptions to each word, and divides and marks
the text into prosodic units, like phrases, clauses, and sentences.
• Phonetic transcriptions and prosody information together make up the
symbolic linguistic representation that is output by the front-end.
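The two front-end tasks can be sketched in a few lines of Python. This is a minimal illustration, not how a production front-end works: the abbreviation table, the digit-by-digit number reading, and the tiny ARPAbet-style lexicon are all hypothetical stand-ins for the much larger rule sets and dictionaries a real system would use.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "mr.": "mister"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {  # toy ARPAbet-style pronunciations
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def normalize(text):
    """Task 1: turn raw text into written-out lowercase words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            # Simplification: read numbers digit by digit.
            words.extend(DIGITS[int(d)] for d in token)
        else:
            # Drop punctuation attached to the word.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Task 2: attach a phonetic transcription to each word.

    Unknown words are flagged rather than guessed.
    """
    return [(w, LEXICON.get(w, ["<unk>"])) for w in words]


words = normalize("Dr. Smith has 2 cats.")
symbolic = to_phonemes(words)
```

Here `symbolic` plays the role of the symbolic linguistic representation, minus the prosodic unit marking, which this sketch leaves out.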
TTS - Working
• Back-end
• Synthesizer - Converts the symbolic linguistic representation into
sound.
• In certain systems, this part includes the computation of the target
prosody (pitch contour, phoneme durations), which is then imposed on
the output speech.
TTS – Key aspects
• Input Text
• Can be in the form of documents, articles, messages, or any other
written content.
• Text Analysis
• The system first analyzes the input text to understand its linguistic
structure, including sentence boundaries, punctuation, and pronunciation.
• This process involves breaking down the text into smaller units,
such as words, phrases, and sentences.
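Breaking text into smaller units can be illustrated with two regular expressions. This is a naive sketch under a stated assumption: a sentence ends at ".", "!" or "?" followed by whitespace, so an abbreviation such as "Dr." would wrongly end a sentence. Real systems use more careful boundary detection. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"[A-Za-z0-9']+|[.,!?;:]", sentence)


sentences = split_sentences("Hello there. How are you? Fine!")
tokens = [split_words(s) for s in sentences]
```

Keeping punctuation as separate tokens is a deliberate choice here, since later stages can use it as a cue for pauses and intonation.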
TTS – Key aspects
• Linguistic Processing
• Linguistic and phonetic rules are used to process the text.
• This involves determining the appropriate pronunciation of words,
handling punctuation and numbers, and applying intonation
patterns.
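Two of these rules, reading numbers aloud and picking an intonation pattern from punctuation, can be sketched as follows. Both functions are hypothetical and heavily simplified: the number reader only covers 0 to 999, and the intonation rule ignores the fact that many questions (for example wh-questions) do not end on a rise.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Read an integer from 0 to 999 as a cardinal number."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    tail = " " + number_to_words(rest) if rest else ""
    return ONES[hundreds] + " hundred" + tail


def intonation_label(sentence):
    """Pick a coarse intonation pattern from the final punctuation mark."""
    stripped = sentence.rstrip()
    if stripped.endswith("?"):
        return "rising"
    if stripped.endswith("!"):
        return "emphatic"
    return "falling"
```

For instance, `number_to_words(42)` gives "forty-two" and `intonation_label("Ready?")` gives "rising".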
TTS – Key aspects
• Prosody
• Prosody refers to the rhythm, intonation, and stress patterns in
spoken language.
• TTS systems use prosody to make the synthesized speech
sound natural and expressive.
• This includes variations in pitch, timing, and emphasis to mimic
the way humans speak.
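As a rough illustration of how pitch, timing, and emphasis might be assigned, the sketch below builds a per-syllable pitch and duration plan. Everything in it is an assumption made for the example: the function name, the 120 Hz starting pitch, the 3 Hz per-syllable decline, the stress boost, and the fixed durations are placeholder values, not measurements.

```python
def pitch_contour(syllables, base_hz=120.0, decline_hz=3.0,
                  stress_boost_hz=20.0, question=False):
    """Return (text, f0_hz, duration_s) for each (text, stressed) syllable."""
    plan = []
    for index, (text, stressed) in enumerate(syllables):
        f0 = base_hz - decline_hz * index   # pitch drifts down over the phrase
        duration = 0.15
        if stressed:
            f0 += stress_boost_hz           # emphasis: higher pitch
            duration = 0.22                 # emphasis: longer syllable
        plan.append((text, f0, duration))
    if question and plan:
        text, f0, duration = plan[-1]
        plan[-1] = (text, f0 * 1.3, duration)  # final rise for a question
    return plan


statement = pitch_contour([("hel", False), ("lo", True)])
```

A synthesizer would then be asked to realize each syllable at the planned pitch and duration.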
TTS – Key aspects
• Speech Synthesis
• Once the linguistic and phonetic processing is complete, the
TTS system generates the audio signal that represents the
spoken text.
• This can be done using different methods, including
concatenative synthesis (reusing pre-recorded speech
fragments) or parametric synthesis (generating speech from
mathematical models).
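The difference between the two approaches can be shown with plain lists of audio samples. This is only a sketch: the "recorded units" are whatever sample lists the caller supplies, and a single sine tone is a very crude stand-in for a parametric model. The function names and the 16 kHz sample rate are assumptions for the example.

```python
import math

SAMPLE_RATE = 16000  # samples per second, assumed for this sketch


def crossfade_concat(units, overlap=4):
    """Concatenative idea: join pre-recorded units with a short linear crossfade."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def parametric_tone(f0_hz, duration_s, amplitude=0.5):
    """Parametric idea: compute samples from parameters instead of recordings."""
    n = round(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * f0_hz * t / SAMPLE_RATE)
            for t in range(n)]


joined = crossfade_concat([[1.0] * 10, [0.0] * 10], overlap=4)
tone = parametric_tone(100.0, 0.01)
```

The crossfade softens the discontinuity at each join, which is one of the main sources of audible artifacts when stitching recorded segments together.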
TTS – Key aspects
• Output
• The synthesized speech is the final output of the TTS system.
• It can be played through speakers, headphones, or integrated
into various applications and devices.
TTS – Types
• Concatenative synthesis
• Based on the concatenation (stringing together) of segments of
recorded speech. Among the classical (non-neural) methods, it
generally produces the most natural-sounding synthesized speech,
although joins between segments can be audible.
• Formant synthesis
• Does not use human speech samples at runtime. Instead, the
synthesized speech output is created using additive
synthesis and an acoustic model.
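A toy version of formant synthesis by additive synthesis: sum the harmonics of a fundamental frequency, giving more weight to harmonics that fall near the formant frequencies. The resonance curve, the 100 Hz bandwidth, and the formant values used in the example call are illustrative assumptions, and the function names are hypothetical.

```python
import math


def formant_gain(freq_hz, formants, bandwidth_hz=100.0):
    """Weight of one harmonic: a simple peak centred on each formant."""
    return sum(1.0 / (1.0 + ((freq_hz - f) / bandwidth_hz) ** 2)
               for f in formants)


def synthesize_vowel(f0_hz, formants, duration_s=0.05, sample_rate=16000):
    """Add up weighted harmonics of f0_hz; no recorded speech is used."""
    harmonics = []
    k = 1
    while k * f0_hz < sample_rate / 2:  # stay below the Nyquist frequency
        harmonics.append(k * f0_hz)
        k += 1
    gains = [formant_gain(h, formants) for h in harmonics]
    n = round(duration_s * sample_rate)
    samples = [
        sum(g * math.sin(2 * math.pi * h * t / sample_rate)
            for g, h in zip(gains, harmonics))
        for t in range(n)
    ]
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalize to the range -1..1


# Illustrative formant values for an "ah"-like vowel (assumed, not measured).
vowel = synthesize_vowel(100.0, [730.0, 1090.0, 2440.0])
```

Changing the formant list while keeping `f0_hz` fixed changes the vowel quality but not the pitch, which is the core idea behind this family of synthesizers.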
TTS model – Basic components
Linguistic Features
• Phonetics - the study of speech sounds in their physical aspects
• Phonology - the study of speech sounds in their cognitive aspects
• Morphology - the study of the formation of words
• Syntax - the study of the formation of sentences
• Semantics - the study of meaning
• Pragmatics - the study of language use
Acoustic Features
Frequency:
The number of pulses produced by vocal cord vibration per unit of time. The
rate of vibration depends on the length, thickness, and tension of the vocal
cords, and thus differs among child, adult male, and adult female speech.
A speech sound contains two types of frequency: the fundamental frequency
(F0), which reflects the rate of vocal cord vibration during phonation and is
perceived as pitch, and the formant frequencies, which reflect the
configuration of the vocal tract.
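One common way to estimate F0 from a signal is autocorrelation: a periodic signal looks most like itself when shifted by exactly one period. The sketch below applies that idea to a synthetic tone. The function name, the 60 to 400 Hz search range, and the test signal are assumptions for the example, and real speech would need framing and voicing detection on top of this.

```python
import math


def estimate_f0(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Return the frequency whose period best matches the signal with itself."""
    min_lag = int(sample_rate / fmax)   # shortest period considered
    max_lag = int(sample_rate / fmin)   # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test signal: 0.1 s of a 200 Hz sine sampled at 8 kHz.
rate = 8000
signal = [math.sin(2 * math.pi * 200.0 * t / rate) for t in range(800)]
f0 = estimate_f0(signal, rate)
```

On this clean tone the estimate lands on the 40-sample period, which corresponds to 200 Hz.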
Acoustic Features
Time: Time as a property of speech sounds reflects the duration of a given
sound.
Amplitude: On a spectrogram, amplitude is shown by the darkness of the bands:
the greater the sound energy present at a given time and frequency, the
darker the mark at the corresponding point on the screen.
Formant: A formant is a concentration of acoustic energy around a particular
frequency in the speech wave. There are several formants, each at a different
frequency, spaced at roughly 1000 Hz intervals (about one in each 1000 Hz
band) for a typical adult male vocal tract.
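The roughly 1000 Hz spacing can be reproduced with a standard textbook idealization: treat the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The tube length of 17.5 cm and the speed of sound of 350 m/s below are assumed round values for an adult male speaker, and the function name is hypothetical.

```python
def tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]


formants = tube_formants()
spacing = formants[1] - formants[0]
```

With these assumed values the first three resonances come out near 500, 1500, and 2500 Hz, about 1000 Hz apart. A shorter tube, as in a child's vocal tract, pushes all of them higher and further apart.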
TTS models – Data flow
18 October 2023
TTS – Applications
1. Accessibility Services
• TTS lets visually impaired individuals access digital content,
including websites, books, and documents.
• Screen readers use TTS to read aloud the content of computer screens
to users with visual impairments.
• TTS enables the conversion of written text into audio format, making
books and documents accessible to people who prefer to listen rather
than read.
4. Language Learning
TTS – Applications
5. GPS and Navigation
• Navigation systems use TTS to provide turn-by-turn directions and
location information to drivers and pedestrians.
6. Assistive Technology
• TTS aids individuals with learning disabilities, such as
dyslexia, by reading text aloud to help them understand and learn
content.
7. Multilingual Support
• TTS can pronounce text in various languages and, combined with
machine translation, speak translated text, making it valuable for
travellers and international business communication.
TTS – Applications
8. Audio Descriptions for Video Content
• TTS can be used to add audio descriptions for visually impaired
individuals in movies, TV shows, and online video content.
9. Read-Aloud Software
• TTS is employed in software that reads documents, emails,
and web content aloud, assisting people with reading
difficulties or those who prefer auditory input.
TTS – Applications
14. Communication Devices
• TTS is integrated into augmentative and alternative communication
(AAC) devices for individuals with speech disabilities.
15. Real-Time Language Translation
• TTS can assist in real-time translation apps by converting translated text
into speech for communication between speakers of different languages.
TTS – Applications
17. Broadcasting and Podcasting
• TTS technology can generate synthetic voices for news, weather, and
other segments in broadcasting and podcasting.
TTS – Limitations
• Artificial or Robotic-Sounding Speech
• Inaccurate Pronunciation
• Lack of Emotion or Expression
• Limited Language Support
• Technical Limitations
• Unnatural Pausing or Pacing
• Background Noise or Static
• Text to Speech API
TTS – Emerging trends
• Advances in Neural Text-to-Speech
• Voice Cloning
• Overdubbing
• Emotional TTS
• Multilingual TTS
• Singing TTS
TTS – Notable innovations
• Neural TTS (NTTS): The adoption of neural networks, particularly deep learning
techniques, has transformed speech synthesis. NTTS models, like WaveNet and
Tacotron, have made voices sound significantly more natural and human-like.
• End-to-End Models: End-to-end TTS models combine text analysis and voice
synthesis into a single network. This simplifies the TTS process and often results
in more coherent and expressive speech.
• Expressive TTS: Greater control over the expressiveness of generated speech.
Users can modify parameters like pitch, speed, emotion, and accent, making TTS
voices highly customizable.
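One standard way to pass pitch and speed settings to a synthesizer is SSML (Speech Synthesis Markup Language, a W3C standard), whose `prosody` element carries `pitch` and `rate` attributes. The helper below is a hypothetical sketch that only builds the markup string. Which attribute values a given TTS engine accepts varies, so the values in the example call are assumptions to check against that engine's documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, pitch="medium", rate="medium"):
    """Wrap text in an SSML prosody element with the requested pitch and rate."""
    return "<speak><prosody pitch={} rate={}>{}</prosody></speak>".format(
        quoteattr(pitch), quoteattr(rate), escape(text)
    )


ssml = build_ssml("Hi & bye", pitch="+10%", rate="slow")
```

Escaping the text matters because characters such as "&" and "<" would otherwise produce invalid XML.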
TTS – Notable innovations
• Multilingual TTS: Models handle multiple languages and dialects more effectively.
This includes models that can switch between languages within the same sentence
and offer higher-quality output.
• Zero-shot Learning: Some models can now generate speech in languages they were
not explicitly trained on. This is achieved by training on multiple languages and
leveraging multilingual embeddings.
• Voice Cloning and Customization: Systems can clone a specific voice or let users
customize a voice based on a few recorded samples. This has opened up
possibilities for personalizing TTS experiences.
TTS – Notable innovations
• Adaptive TTS: Systems can adapt to a user's voice, speech patterns, or
pronunciation, resulting in more personalized and natural-sounding output.
• Reduced Data Requirements: Newer models achieve good results with smaller
datasets, reducing the amount of training data required.
• Eco-Friendly TTS: Models are becoming more environmentally friendly through
lower energy consumption during training and usage.
TTS – Demonstration
• https://ttsmaker.com/
• https://www.ibm.com/demos/live/tts-demo/self-service/home
• https://speechify.com/voiceover/?landing_url=https%3A%2F%2Fspeechify.com%2Fblog%2Fexamples-of-text-to-speech%2F
• https://paperswithcode.com/task/text-to-speech-synthesis
TTS – Market Research
• The global text-to-speech (TTS) market was valued at $2.8 billion in 2021.
• It is projected to reach $12.5 billion by 2031, growing at a CAGR of 16.3%
from 2022 to 2031.
• The TTS market is segmented by industry vertical, offering, deployment
model, type, language, and enterprise size.
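The growth figures can be sanity-checked with the usual compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes ten compounding years from the 2021 value to the 2031 projection. The quoted 16.3% is stated for 2022 to 2031, and the 2022 base value is not given here, so a small difference from the quoted rate is expected.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


implied_rate = cagr(2.8, 12.5, 10)          # from the 2021 and 2031 figures
projected_2031 = project(2.8, 0.163, 10)    # applying the quoted 16.3% rate
```

Under these assumptions the implied rate is a little above 16% per year, and compounding 16.3% for ten years from $2.8 billion lands slightly above $12.5 billion, so the quoted figures are mutually consistent to within rounding and the choice of base year.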
TTS – 11 Best Text to Speech Tools in 2023
1. Murf
2. Descript
3. Speechify
4. Listnr
5. Synthesia
6. Speechelo
7. Notevibes
8. Fliki
9. FreeTTS
10. Synthesys
11. Lovo
TTS – Major companies & Products
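• Amazon.com, Inc. – Amazon Polly
• Microsoft Corporation – Microsoft Azure (Speech service)
• Google LLC – Google Cloud Text-to-Speech API
• IBM Corporation – IBM Watson Text to Speech
• Nuance Communications, Inc. – Nuance Vocalizer (Siri is an Apple product
that has used Nuance speech technology)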