TTS SRM Speech

Uploaded by

pratik665123
Copyright
© All Rights Reserved

Innovations in Text to Speech Synthesis

Department of Electronics and Communication Engineering


SRM IST

1
Introduction
• Technology - converts written text into spoken language.
• Process - transforms text input, such as written words and
sentences, into audible speech.
• TTS synthesis - used to provide accessibility, enhance user
experiences, and improve communication.

2
TTS-18th Century
• The first machines that could produce synthesized speech date to
the late 18th century (for example, Wolfgang von Kempelen's
speaking machine).
• Austrian inventor Joseph Faber later built the “Euphonia”,
exhibited in the 1840s - it used bellows, reeds, and a keyboard
to produce a range of sounds, including synthesized speech.
• Euphonia - could imitate the human voice to a certain extent.
3
TTS architecture

4
TTS block diagram

5
Typical TTS system

6
TTS - working
• Front-end - Two major tasks.
• Text normalization / Pre-processing / Tokenization - Converts raw
text containing symbols like numbers and abbreviations into the
equivalent of written-out words.
• Text-to-phoneme or grapheme-to-phoneme conversion -
Assigns phonetic transcriptions to each word, and divides and marks
the text into prosodic units, like phrases, clauses, and sentences.
• Phonetic transcriptions and prosody information together make up the
symbolic linguistic representation that is output by the front-end.
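The text normalization step described above can be illustrated with a minimal Python sketch. The lookup tables and the function names number_to_words and normalize are assumptions made for illustration, not part of any particular TTS system; a production front-end would rely on much larger lexicons and context-dependent rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (an assumption for this sketch).
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations and spell out integers below 1000."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Replace each standalone run of 1-3 digits with its written-out form.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(normalize("Dr. Rao lives at 42 Park St."))
```

With these toy tables, the example sentence becomes "Doctor Rao lives at forty-two Park Street". Ambiguous cases (for instance "St." meaning "Saint") are exactly what real normalizers need context to resolve.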

7
TTS - working
• Back-end
• Synthesizer - Converts the symbolic linguistic representation into
sound.
• In certain systems, this part includes the computation of the target
prosody (pitch contour, phoneme durations), which is then imposed on
the output speech.

8
TTS – Key aspects
• Input Text
• Can be in the form of documents, articles, messages, or any other
written content.

• Text Analysis
• The system first analyzes the input text to understand its linguistic
structure, including sentence boundaries, punctuation, and pronunciation.
• This process involves breaking down the text into smaller units,
such as words, phrases, and sentences.
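As a rough illustration of breaking text into smaller units, the sketch below splits text into sentences and then into word and punctuation tokens. The function names and the regular expressions are assumptions chosen for brevity; the sentence rule is naive and would, for example, wrongly split after an abbreviation such as "Dr.".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Separate words (allowing an internal apostrophe) from punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)


for sentence in split_sentences("Hello there! How are you? Fine."):
    print(tokenize(sentence))
```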
9
TTS – Key aspects
• Linguistic Processing
• Linguistic and phonetic rules are used to process the text.
• This involves determining the appropriate pronunciation of words,
handling punctuation and numbers, and applying intonation
patterns.
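Determining pronunciation is often done with a pronunciation lexicon plus fallback rules for unknown words. The following is a minimal sketch of that idea, not a real grapheme-to-phoneme system: the three-word LEXICON, the one-letter LETTER_RULES, and the function to_phonemes are all hypothetical, and the symbols follow an ARPAbet-style notation.

```python
# Hypothetical mini-lexicon (ARPAbet-style symbols); entries are illustrative.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
    "to": ["T", "UW"],
}

# Crude letter-to-sound fallback for words missing from the lexicon.
LETTER_RULES = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}


def to_phonemes(word: str) -> list[str]:
    """Look the word up in the lexicon; otherwise spell it out by rule."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    for letter in word:
        phonemes.extend(LETTER_RULES.get(letter, []))
    return phonemes


print(to_phonemes("Speech"))  # found in the lexicon
print(to_phonemes("bat"))     # falls back to letter rules
```

The fallback shows why rules alone are not enough for English: a letter-by-letter mapping gets regular words like "bat" right but fails on most irregular spellings.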

KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India
18 October 2023
10
TTS – Key aspects
• Voice Selection
• TTS systems typically offer a selection of voices, each with its
own unique characteristics and accents.
• Users can choose the voice that best suits their preferences and
the context in which the speech will be used.

11
TTS – Key aspects
• Prosody
• Prosody refers to the rhythm, intonation, and stress patterns in
spoken language.
• TTS systems use prosody to make the synthesized speech
sound natural and expressive.
• This includes variations in pitch, timing, and emphasis to mimic
the way humans speak.
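A simple way to picture pitch and timing variation is to assign each syllable a target pitch and duration. The sketch below assumes a linearly falling baseline pitch across the phrase, a pitch boost and lengthening on stressed syllables, and lengthening of the final syllable. All numeric values and the function names are illustrative assumptions, not measurements or a standard model.

```python
def pitch_contour(stresses, start_hz=220.0, end_hz=180.0, stress_boost=1.15):
    """Return one target F0 value (Hz) per syllable.

    stresses: list of booleans, True where the syllable is stressed.
    The baseline falls linearly from start_hz to end_hz across the phrase;
    stressed syllables are raised by the factor stress_boost.
    """
    n = len(stresses)
    contour = []
    for i, stressed in enumerate(stresses):
        frac = i / (n - 1) if n > 1 else 0.0
        f0 = start_hz + (end_hz - start_hz) * frac
        contour.append(f0 * stress_boost if stressed else f0)
    return contour


def syllable_durations(stresses, base_ms=150.0, stress_factor=1.3,
                       final_factor=1.5):
    """Lengthen stressed syllables and the phrase-final syllable."""
    durations = [base_ms * (stress_factor if s else 1.0) for s in stresses]
    if durations:
        durations[-1] *= final_factor
    return durations


# Three syllables with stress on the middle one.
pattern = [False, True, False]
print(pitch_contour(pattern))
print(syllable_durations(pattern))
```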

12
TTS – Key aspects
• Speech Synthesis
• Once the linguistic and phonetic processing is complete, the
TTS system generates the audio signal that represents the
spoken text.
• This can be done using different methods, including
concatenative synthesis (reusing pre-recorded speech
fragments) or parametric synthesis (generating speech from
mathematical models).
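The concatenative idea can be sketched as joining stored waveform fragments with a short crossfade at each boundary. In this illustration, sine bursts produced by the hypothetical helper tone stand in for pre-recorded speech units; the function names, the overlap length, and the sample rate are assumptions, and every unit is assumed to be at least as long as the overlap.

```python
import math


def tone(freq_hz, n_samples, sample_rate=16000):
    """Stand-in for a recorded speech unit: a short sine burst."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def concatenate(units, overlap=32):
    """Join units, linearly crossfading `overlap` samples at each boundary."""
    output = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)      # fade-in weight of the new unit
            idx = len(output) - overlap + i  # position in the old unit's tail
            output[idx] = (1 - w) * output[idx] + w * unit[i]
        output.extend(unit[overlap:])
    return output


units = [tone(200, 160), tone(300, 160), tone(250, 160)]
speech = concatenate(units)
print(len(speech))  # total length minus the overlapped samples
```

The crossfade is one simple way to reduce clicks at the joins; real unit-selection systems also choose units so that pitch and spectrum match at the boundaries.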

13
TTS – Key aspects
• Output
• The synthesized speech is the final output of the TTS system.
• It can be played through speakers, headphones, or integrated
into various applications and devices.

14
TTS – Types
• Concatenative synthesis
• Based on the concatenation (stringing together) of segments of
recorded speech; among classical methods it generally produces
the most natural-sounding synthesized speech, although joins
between segments can cause audible glitches.

• Formant synthesis
• Does not use human speech samples at runtime; instead, the
synthesized speech output is created using additive
synthesis and an acoustic model.
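A minimal sketch of the formant approach: harmonics of a fundamental frequency are summed (additive synthesis), each weighted by a simple resonance curve that peaks at the formant frequencies (the acoustic model). The resonance shape, bandwidth, function names, and the approximate formant values used for an "ah"-like vowel are all assumptions for illustration.

```python
import math


def formant_gain(freq_hz, formants, bandwidth_hz=100.0):
    """Toy acoustic model: sum of resonance peaks centred on each formant."""
    return sum(1.0 / (1.0 + ((freq_hz - f) / bandwidth_hz) ** 2)
               for f in formants)


def synthesize_vowel(f0_hz, formants, n_samples=800, sample_rate=16000):
    """Additive synthesis: sum harmonics of f0, weighted by formant_gain."""
    n_harmonics = int((sample_rate / 2) // f0_hz)  # no harmonic above Nyquist
    gains = [formant_gain(k * f0_hz, formants)
             for k in range(1, n_harmonics + 1)]
    samples = []
    for t in range(n_samples):
        phase = 2 * math.pi * f0_hz * t / sample_rate
        samples.append(sum(g * math.sin(k * phase)
                           for k, g in enumerate(gains, start=1)))
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalise to the range [-1, 1]


# Approximate, illustrative formants for an "ah"-like vowel at 120 Hz pitch.
vowel = synthesize_vowel(120.0, [700.0, 1200.0, 2600.0])
print(len(vowel))
```

Changing the formant list changes the vowel quality while f0_hz sets the pitch, which is why no recorded speech is needed at runtime.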
15
TTS model – Basic components

16
Linguistic Features
• Phonetics - the study of speech sounds in their physical aspects
• Phonology - the study of speech sounds in their cognitive aspects
• Morphology - the study of the formation of words
• Syntax - the study of the formation of sentences
• Semantics - the study of meaning
• Pragmatics - the study of language use

17
Acoustic Features
Frequency:
The number of individual pulsations produced by vocal cord vibration per
unit of time. The rate of vibration depends on the length, thickness, and
tension of the vocal cords, and thus differs between child, adult male, and
adult female speech.
A speech sound contains two types of frequencies: the fundamental frequency
(F0), which reflects the rate of vocal cord vibration during phonation and is
perceived as pitch, and the formant frequencies, which relate to vocal
tract configuration.
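One common way to estimate F0 from a signal is autocorrelation: a periodic signal correlates most strongly with itself when shifted by one full period. The sketch below is a bare-bones version under simplifying assumptions (a clean synthetic signal, a fixed search range of 60 to 400 Hz); the function name and parameters are illustrative.

```python
import math


def estimate_f0(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic "voiced" signal: a 200 Hz fundamental plus a weaker 2nd harmonic.
rate = 8000
signal = [math.sin(2 * math.pi * 200 * t / rate)
          + 0.5 * math.sin(2 * math.pi * 400 * t / rate)
          for t in range(800)]
print(estimate_f0(signal, rate))
```

Real speech is noisier and only quasi-periodic, so practical pitch trackers add normalisation, voicing decisions, and smoothing on top of this basic idea.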
18
Acoustic Features
Time: Time as a property of speech sounds reflects the duration of a given
sound.
Amplitude: On a spectrogram, amplitude is shown by the darkness of the bands:
the greater the intensity of the sound energy present at a given time and
frequency, the darker the mark at the corresponding point on the screen.
Formant: A formant is a concentration of acoustic energy around a particular
frequency in the speech wave. There are several formants, each at a different
frequency, roughly one in each 1000 Hz band. That is, formants occur at
roughly 1000 Hz intervals.
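The roughly 1000 Hz spacing can be motivated with a textbook idealisation that is not stated on the slide: treating the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at F_n = (2n - 1) * c / (4 * L). The sketch below assumes a tract length L of 0.175 m (a typical adult male value) and a speed of sound c of 350 m/s; the function name is hypothetical.

```python
def tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances of a uniform tube closed at one end: (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]


print(tube_formants())
```

Under these assumptions the resonances come out near 500, 1500, and 2500 Hz, that is, about 1000 Hz apart. A shorter vocal tract (as in children) pushes all formants higher and spaces them further apart, and real vowels shift the formants away from these neutral values.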
19
TTS models – Data flow

20
TTS – Applications
1. Accessibility Services
• For visually impaired individuals - allowing them to access digital content,
including websites, books, and documents.
• Screen readers use TTS to read aloud the content of computer screens
to users with visual impairments.

2. Voice Assistants and Virtual Agents


• Voice-activated virtual assistants like Siri, Alexa, and Google Assistant
use TTS to provide spoken responses and communicate with users.
• Virtual customer service agents and chatbots often employ TTS for
human-like interactions.
21
TTS – Applications
3. Audiobooks and E-Books

• TTS enables the conversion of written text into audio format, making
books and documents accessible to people who prefer to listen rather
than read.

4. Language Learning

• TTS is used in language learning applications to provide pronunciation
and fluency exercises for learners.

22
TTS – Applications
5. GPS and Navigation
• Navigation systems use TTS to provide turn-by-turn directions and
location information to drivers and pedestrians.
6. Assistive Technology
• TTS aids individuals with learning disabilities, such as
dyslexia, by reading text aloud to help them understand and learn
content.

7. Multilingual Support
• Combined with machine translation, TTS can pronounce text in
various languages, making it valuable for travellers and
international business communication.
23
TTS – Applications
8. Audio Descriptions for Video Content
• TTS can be used to add audio descriptions for visually impaired
individuals in movies, TV shows, and online video content.
9. Read-Aloud Software
• TTS is employed in software that reads documents, emails,
and web content aloud, assisting people with reading
difficulties or those who prefer auditory input.

10. Assisting the Elderly

• TTS can help older adults with vision or cognitive impairments by
reading reminders, notifications, and messages.
24
TTS – Applications
11. Accessibility in Web and Mobile Apps
• Many websites and mobile applications integrate TTS features to ensure
accessibility for all users.

12. Entertainment and Gaming


• TTS is used in video games for character voices and narration,
enhancing the gaming experience.

13. Voiceovers for Videos and Presentations

• TTS can be used to create voiceovers for videos and presentations when
human narration is unavailable or not cost-effective.

25
TTS – Applications
14. Communication Devices
• TTS is integrated into augmentative and alternative communication
(AAC) devices for individuals with speech disabilities.
15. Real-Time Language Translation
• TTS can assist in real-time translation apps by converting translated text
into speech for communication between speakers of different languages.

16. Interactive Educational Content


• TTS is used in educational software and e-learning platforms to provide
interactive and engaging content

26
TTS – Applications
17. Broadcasting and Podcasting
• TTS technology can generate synthetic voices for news, weather, and
other segments in broadcasting and podcasting.

18. Voice User Interfaces (VUI)


• VUIs for smart devices, home automation, and vehicles use TTS to
provide users with voice-guided interactions.

27
TTS – Limitations
• Artificial or Robotic-Sounding Speech
• Inaccurate Pronunciation
• Lack of Emotion or Expression
• Limited Language Support
• Technical Limitations
• Unnatural Pausing or Pacing
• Background Noise or Static
• Text to Speech API
28
TTS – Emerging trends
• Advances in Neural Text-to-Speech
• Voice Cloning
• Overdubbing
• Emotional TTS
• Multilingual TTS
• Singing TTS

29
TTS – Notable innovations
• Neural TTS (NTTS): The adoption of neural networks, particularly deep learning
techniques, has revolutionized TTS synthesis. NTTS models, like WaveNet and
Tacotron, have made voices sound significantly more natural and human-like.
• End-to-End Models: End-to-end TTS models combine text analysis and voice
synthesis into a single network. This simplifies the TTS process and often results
in more coherent and expressive speech.
• Expressive TTS: Offers greater control over the expressiveness of generated
speech. Users can modify parameters like pitch, speed, emotion, and accent,
making TTS voices highly customizable.
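One widely used way to pass pitch and speed settings to a TTS engine is SSML (Speech Synthesis Markup Language), a W3C standard whose prosody element carries pitch and rate attributes. The sketch below builds a minimal SSML fragment with Python's standard library. The helper name build_ssml is hypothetical, which attribute values a given service accepts varies, and a complete SSML document normally also declares a version and XML namespace on the root element, omitted here for brevity.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, pitch="medium", rate="medium"):
    """Wrap text in <speak><prosody pitch=... rate=...> markup."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "rate": rate})
    prosody.text = text  # special characters are escaped on serialisation
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Hello", pitch="+10%", rate="slow"))
```

Emotion and accent are not covered by the prosody element itself; where engines support them, they are typically exposed through voice selection or vendor-specific extensions.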

30
TTS – Notable innovations
• Multilingual TTS: Models handle multiple languages and dialects more
effectively. This includes models that can switch between languages in the
same sentence and offer higher-quality output.
• Zero-shot Learning: Some models can now generate speech in languages they
were not explicitly trained on. This is achieved by training on multiple
languages and leveraging multilingual embeddings.
• Voice Cloning and Customization: Systems can clone a specific voice or allow
users to customize a voice based on a few recorded samples. This has opened up
possibilities for personalizing TTS experiences.

31
TTS – Notable innovations
• Emotional TTS: Models can convey different emotions in their speech, making
them more suitable for applications like virtual assistants, customer service
bots, and storytelling.
• Real-Time TTS: Reduces the latency between text input and speech output. This
is critical for interactive applications such as voice assistants.
• Low-Resource Languages: TTS models are being built for languages with limited
resources, which can be especially beneficial for preserving and promoting
linguistic diversity.
• Open-Source TTS Projects: The availability of open-source TTS projects and
datasets has democratized TTS development and research.

32
TTS – Notable innovations
• Adaptive TTS: Systems can adapt to a user's voice, speech patterns, or
pronunciation, resulting in more personalized and natural-sounding output.
• Reduced Data Requirements: Newer models achieve good results with smaller
datasets, reducing the amount of training data required.
• Eco-Friendly TTS: Models are becoming more environmentally friendly by
optimizing energy consumption during training and usage.

33
TTS – Demonstration
• https://ttsmaker.com/
• https://www.ibm.com/demos/live/tts-demo/self-service/home
• https://speechify.com/voiceover/?landing_url=https%3A%2F%2Fspeechify.com%2Fblog%2Fexamples-of-text-to-speech%2F
• https://paperswithcode.com/task/text-to-speech-synthesis

34
TTS – Market Research
• The global text-to-speech (TTS) market was valued at $2.8 billion in 2021.
• It is projected to reach $12.5 billion by 2031, growing at a CAGR of 16.3%
from 2022 to 2031.
• The TTS market is segmented by Industry Vertical, Offering, Deployment
Model, Type, Language, and Enterprise Size.
• Major players - Acapela Group, Amazon.com, CereProc, Google, Inc., IBM
Corporation, iFlytek, iSpeech, LumenVox LLC, Microsoft Corporation,
NextUp Technologies, Nuance Communications, ReadSpeaker, Sestek,
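The growth figures on this slide can be sanity-checked with the compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes ten compounding years from the 2021 valuation to 2031; the slide quotes the rate for 2022 to 2031, so the base year differs slightly and the comparison is only approximate. The function names are illustrative.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


# Figures from the slide, in billions of US dollars.
print(round(implied_cagr(2.8, 12.5, 10) * 100, 1))  # implied rate in percent
print(round(project(2.8, 0.163, 10), 1))            # projection at 16.3%
```

Under that assumption the implied rate is about 16.1% per year, and compounding 16.3% for ten years gives about $12.7 billion, so the quoted figures are roughly consistent with each other.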

35
TTS – 11 Best Text to Speech Tools in 2023
1. Murf
2. Descript
3. Speechify
4. Listnr
5. Synthesia
6. Speechelo
7. Notevibes
8. Fliki
9. FreeTTS
10. Synthesys
11. Lovo
36
TTS – Major companies & Products
• Amazon.com, Inc. – Amazon Polly
• Microsoft Corporation – Microsoft Azure Text to Speech
• Google LLC – Google Cloud Text-to-Speech API
• IBM Corporation – IBM Watson Text to Speech
• Nuance Communications, Inc. – Nuance Vocalizer (Siri itself is an Apple product)

37
38
