
International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)

Volume 4 Issue 7, July 2015

A REVIEW ON SPEECH TO TEXT CONVERSION METHODS
Miss. Prachi Khilari (1), Prof. Bhope V. P. (2)
(1), (2) Department of E&TC Engineering, G.H.R.C.O.E.M, Ahmednagar,
Savitribai Phule University of Pune

ABSTRACT:

Speech is the first important primary need and the most convenient means of communication between people. The interaction between humans and computers is called the human-computer interface. This paper gives an overview of the major technological perspectives and an appreciation of the fundamental progress of speech-to-text conversion, and it also reviews the techniques developed in each stage of the speech-to-text conversion process. A comparative study of the different techniques is made stage by stage. The paper concludes with a view on future directions for developing human-computer interface techniques in different mother tongues; it also discusses the various techniques used in each step of a speech recognition process and attempts to analyze an approach for designing an efficient system for speech recognition. With modern processes, algorithms, and methods we can process speech signals easily and recognize the text. In this system, we are going to develop an on-line speech-to-text engine. However, the transfer of speech into written language in real time requires special techniques, as it must be very fast and almost 100% correct to be understandable. The objective of this review paper is to summarize and compare different speech recognition systems and approaches to speech-to-text conversion, and to identify research topics and applications which are at the forefront of this exciting and challenging field.

Keywords: Speech-to-Text conversion, Automatic Speech Recognition, Speech Synthesis.

I. INTRODUCTION:

In modern civilized societies, speech is one of the most common methods of communication between humans. The different ideas formed in the mind of the speaker are communicated by speech in the form of words, phrases, and sentences by applying proper grammatical rules. Speech is the primary mode of communication among human beings and also the most natural and efficient form of exchanging information. By classifying speech into voiced, unvoiced and silence (V/UV/S) regions, an elementary acoustic segmentation of speech, which is essential for speech processing, can be obtained. The succession of individual sounds, called phonemes, corresponds roughly to the sounds of the letters of the alphabet which make up human speech.

Most of the information in the digital world is available only to the few who can read or understand a particular language. Language technologies can provide solutions in the form of natural interfaces so that digital content can reach the masses and facilitate the exchange of information across people speaking different languages [4]. These technologies play a vital role in multi-lingual societies such as India, which has about 1652 dialects/native languages. A speech-to-text conversion system takes input from a microphone in the form of speech, which is then converted into text and displayed on the desktop. Speech processing is the study of speech signals and the various methods used to process them. Speech processing is employed in various applications such as speech coding, speech synthesis, speech recognition and speaker recognition. Among these, speech recognition is the most important one. The main purpose of speech recognition is to convert the acoustic signal obtained from a microphone or a telephone into a set of words [13, 23]. In order to extract and determine the linguistic information conveyed by a speech wave, we have to employ computers or electronic circuits. This process is performed for several applications such as security devices, household appliances, cellular phones, ATM machines and computers. This survey deals with different methods of speech-to-text conversion that are useful for different languages, such as the phoneme-to-grapheme method, conversion for the Bengali language, and HMM-based speech synthesis methods [17].

1. Type of Speech:

Speech recognition systems can be separated into different classes by describing what type of utterances they can recognize [1].

1.1 Isolated Word:

Isolated word recognizers usually require each utterance to have quiet on both sides of the sample window. The system accepts single words or single utterances at a time and has a "Listen / Not-Listen" state. "Isolated utterance" might be a better name for this class.

1.2 Connected Word:

Connected word systems are similar to isolated word systems, but they allow separate utterances to be "run together" with only a minimum pause between them.


1.3 Continuous Speech:

Continuous speech recognizers allow users to talk almost naturally while the computer determines the content. Recognizers with continuous speech capabilities are some of the most difficult to create, because they must use special methods to determine utterance boundaries.

1.4 Spontaneous Speech:

At a basic level, spontaneous speech can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features, such as words being run together.

2. Types of Speaker Model:

All speakers have their own distinctive voices, due to their unique physical characteristics and personalities. Speech recognition systems are broadly classified into two main categories based on speaker models, namely speaker dependent and speaker independent [1].

2.1 Speaker Independent Models:

Speaker independent systems are designed for a variety of speakers. They recognize the speech patterns of a large group of people. Such systems are the most difficult to develop, are the most expensive, and offer less accuracy than speaker dependent systems. However, they are more flexible.

2.2 Speaker Dependent Models:

Speaker dependent systems are designed for a specific speaker. These systems are usually easier to develop, cheaper and more accurate, but not as flexible as speaker adaptive or speaker independent systems. They are generally more accurate for the particular speaker, but much less accurate for other speakers.

3. Types of Vocabulary:

The size of the vocabulary of a speech recognition system affects the complexity, processing requirements, performance and precision of the system. Some applications only require a few words (e.g. numbers only); others require very large dictionaries (e.g. dictation machines). In ASR systems the types of vocabularies can be classified as follows.

a. Small vocabulary - tens of words
b. Medium vocabulary - hundreds of words
c. Large vocabulary - thousands of words
d. Very-large vocabulary - tens of thousands of words
e. Out-of-vocabulary - mapping a word that is not in the vocabulary onto an unknown-word class.

Apart from the above characteristics, environment variability, channel variability, speaking style, sex, age and speed of speech also make the ASR problem more complex. An efficient ASR system must cope with this variability in the signal.

II. LITERATURE REVIEW:

1. Yee-Ling Lu, Man-Wai and Wan-Chi Siu describe text-to-phoneme conversion using recurrent neural networks trained with the real-time recurrent learning (RTRL) algorithm [3].

2. Penagarikano, M. and Bordel, G. present a technique to perform speech-to-text conversion; experimental tests carried out over a task-oriented Spanish corpus are reported, together with analytical results.

3. Sultana, S., Akhand, M. A. H., Das, P. K. and Hafizur Rahman, M. M. explore Speech-to-Text (STT) conversion using SAPI for the Bangla language. Although the achieved performance is promising for STT-related studies, they identified several elements to improve the performance and accuracy, and they suggest that the approach of this study will also be helpful for speech-to-text conversion and similar tasks in other languages [3].

4. Moulines, E., in the paper "Text-to-speech algorithms based on FFT synthesis," presents FFT synthesis algorithms for a French text-to-speech system based on diphone concatenation. FFT synthesis techniques are capable of producing high quality prosodic adjustments of natural speech. Several different approaches are formulated to reduce the distortions due to diphone concatenation.

5. Decadt, Jacques, Daelemans, Walter and Wambacq describe a method to improve the readability of the textual output of a large-vocabulary continuous speech recognition system when out-of-vocabulary words occur. The basic idea is to replace uncertain words in the transcriptions with a phoneme recognition result that is post-processed using a phoneme-to-grapheme converter. This technique uses machine learning concepts.

III. SPEECH TO TEXT SYSTEM:

Speech is an exceptionally attractive modality for human-computer interaction: it is "hands free"; it requires only modest hardware for acquisition (a high-quality microphone or microphones); and it arrives at a very modest bit rate. Recognizing human speech, especially continuous (connected) speech, without burdensome training (speaker-independent), for a vocabulary of sufficient size (60,000 words), is very hard. However, with modern processes, algorithms and methods we can process speech signals and recognize the spoken text comparatively easily. In this system, we are going to develop an on-line speech-to-text engine [4]. The system acquires speech at run time through a microphone and processes the sampled speech to identify the uttered text. The recognized text can be stored in a file. It can supplement other, larger systems, giving users a different choice for data entry.
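To make the idea of such an on-line engine concrete, the short Python sketch below captures one utterance from a microphone and appends the recognized text to a file. It is only a sketch under stated assumptions: it relies on the third-party SpeechRecognition package (which, like the Voice SMS application discussed later in this paper, delegates recognition to Google's web service), and the file name, language code and error handling are illustrative choices rather than part of the system described here.

```python
# Minimal on-line speech-to-text sketch (assumes the third-party
# "SpeechRecognition" package: pip install SpeechRecognition pyaudio).
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:              # acquire speech at run time
    recognizer.adjust_for_ambient_noise(source)
    print("Speak now...")
    audio = recognizer.listen(source)        # one utterance

try:
    # Delegate recognition to Google's web speech API (illustrative choice).
    text = recognizer.recognize_google(audio, language="en-IN")
    with open("recognized_text.txt", "a", encoding="utf-8") as f:
        f.write(text + "\n")                 # store the recognized text in a file
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech was not intelligible.")
except sr.RequestError as err:
    print("Recognition service unavailable:", err)
```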


Figure 1. Basic block diagram of a speech-to-text system.

A speech-to-text system can also improve system accessibility by providing a data-entry option for blind, deaf, or physically disabled users. Voice SMS is an application developed in this work that permits a user to record spoken messages and convert them into an SMS text message. The user can then send the message to the entered telephone number. Speech recognition is done via the Internet, connecting to Google's server. The application is adapted to input messages in English. Speech recognition in Voice SMS uses a technique based on hidden Markov models (HMMs), which is currently the most successful and most flexible approach to speech recognition.
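Because hidden Markov models are singled out here as the most successful recognition technique, the following minimal sketch shows the computation an HMM-based recognizer repeats for every candidate word model: the forward algorithm, which scores how likely an observed (vector-quantized) feature sequence is under a given model. The toy model parameters and observation indices are made-up illustrative values, not taken from this paper.

```python
# Forward-algorithm sketch for a discrete-observation HMM (illustrative values).
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) for an HMM with initial probabilities pi,
    transition matrix A and emission matrix B (rows = states)."""
    alpha = pi * B[:, obs[0]]                # initialize with the first symbol
    log_like = 0.0
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]   # propagate one step and emit
        scale = alpha.sum()                  # scale to avoid numerical underflow
        log_like += np.log(scale)
        alpha /= scale
    return log_like + np.log(alpha.sum())

# A toy 3-state, 4-symbol word model (numbers are made up for illustration).
pi = np.array([1.0, 0.0, 0.0])
A  = np.array([[0.6, 0.4, 0.0],
               [0.0, 0.7, 0.3],
               [0.0, 0.0, 1.0]])
B  = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.1, 0.7, 0.1, 0.1],
               [0.1, 0.1, 0.4, 0.4]])
obs = [0, 0, 1, 2, 3]                        # vector-quantized feature indices

print("log-likelihood:", forward_log_likelihood(pi, A, B, obs))
```

In a word recognizer, this score is computed for every word model and the word with the highest likelihood is chosen.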
1. Speech Production & Speech Perception:

Figure 2. The speech chain.


The process starts with the message information, which can be thought of as having a number of different representations during the process of speech production. For example, the message could be represented initially as English text. In order to "speak" the message, the talker implicitly converts the text into a symbolic representation of the sequence of sounds corresponding to the spoken version of the text. This step, called the language code generator, converts text symbols into phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced. The third step in the speech production process is the conversion to "neuro-muscular controls," i.e., the set of control signals that direct the neuro-muscular system to move the speech articulators, namely the tongue, lips, teeth, jaw and velum, in a manner that is consistent with the sounds of the desired spoken message and with the desired degree of emphasis. The end result of the neuro-muscular controls step is a set of articulator motions (continuous control) that cause the vocal tract articulators to move in a prescribed manner in order to create the desired sounds. Finally, the last step in the speech production process is the "vocal tract system" that physically creates the necessary sound sources and the appropriate vocal tract shapes over time so as to create an acoustic waveform, such as the one shown in the figure, that encodes the information in the desired message into the speech signal.
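The "language code generator" step described above is, at its core, a text-to-phoneme mapping. The toy sketch below illustrates the idea with a hypothetical three-word lexicon; a real system would use a full pronunciation dictionary together with letter-to-sound rules for unseen words.

```python
# Toy "language code generator": map words to phoneme sequences.
# The mini-lexicon below is hypothetical; real systems use large
# pronunciation dictionaries plus letter-to-sound rules for unseen words.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "to":     ["T", "UW"],
    "text":   ["T", "EH", "K", "S", "T"],
}

def text_to_phonemes(sentence):
    """Return the phonetic symbol sequence for a sentence, word by word."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(LEXICON.get(word, ["<UNK>"]))   # unknown words are flagged
    return phonemes

print(text_to_phonemes("Speech to text"))
# ['S', 'P', 'IY', 'CH', 'T', 'UW', 'T', 'EH', 'K', 'S', 'T']
```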


IV. AUTOMATIC SPEECH RECOGNITION (ASR):

1. Basic Principle:

ASR systems operate in two phases. First is a training phase, during which the system learns the reference patterns representing the different speech sounds (e.g. phrases, words, phones) that constitute the vocabulary of the application. Each reference is learned from spoken examples and stored either in the form of templates obtained by some averaging method or as models that characterize the statistical properties of the pattern [6]. Second is a recognition phase, during which an unknown input pattern is identified by comparing it against the set of references.

Figure 3. Basic principle of ASR.

2. Speech Recognition Techniques:

The goal of speech recognition is for a machine to be able to "hear," "understand," and "act upon" spoken information. The earliest speech recognition systems were attempted in the early 1950s at Bell Laboratories. The goal of automatic speaker recognition, in turn, is to analyze, extract, characterize and recognize information about the speaker's identity. A speaker recognition system may be viewed as working in four stages:

a. Analysis
b. Feature extraction
c. Modeling
d. Testing

a. Speech analysis:

Speech data contains different types of information that reveal a speaker's identity. This includes speaker-specific information due to the vocal tract, the excitation source and behavioural features. The physical structure and dimensions of the vocal tract, as well as the excitation source, are unique to each speaker. This uniqueness is embedded in the speech signal during speech production and can be used for speaker recognition.

b. Feature Extraction Technique:

Feature extraction is the most important part of speech recognition, since it plays an important role in separating one speech sample from another; every utterance has different individual characteristics embedded in it [6]. These characteristics can be extracted with a wide range of feature extraction techniques that have been proposed and successfully exploited for the speech recognition task. The extracted features should meet some criteria when dealing with the speech signal:

a. The extracted speech features should be easy to measure.
b. They should not be susceptible to mimicry.
c. They should show little fluctuation from one speaking environment to another.
d. They should be stable over time.
e. They should occur frequently and naturally in speech.

V. SYSTEM DESIGN & IMPLEMENTATION:

Figure 4. System architecture.
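As context for the implementation described in the next paragraph, the sketch below shows a minimal frame-based front end: framing with overlap using the paper's values (frames of 450 samples with a 300-sample overlap at an 8 kHz sampling rate) and a naive energy-threshold voice activity detector. The threshold and the synthetic test signal are illustrative assumptions; the actual system uses more elaborate VAD and LPC/cepstral analysis.

```python
# Frame-based front-end sketch: framing with overlap plus a naive
# energy-threshold VAD. Frame length and overlap follow the paper
# (450 samples, 300-sample overlap at 8 kHz); the threshold is illustrative.
import numpy as np

FRAME_LEN = 450      # samples per frame
OVERLAP   = 300      # samples shared with the previous frame
HOP       = FRAME_LEN - OVERLAP   # 150-sample frame advance

def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def energy_vad(frames, threshold=1e-3):
    """Return a boolean mask marking frames whose mean energy exceeds threshold."""
    energy = np.mean(frames.astype(float) ** 2, axis=1)
    return energy > threshold

# Synthetic example: 0.5 s of silence followed by 0.5 s of a 440 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs // 2) / fs
signal = np.concatenate([np.zeros(fs // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])

frames = frame_signal(signal)
speech_frames = frames[energy_vad(frames)]   # frames kept for LPC/cepstral analysis
print(f"{len(frames)} frames total, {len(speech_frames)} classified as speech")
```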


The system implements speech-to-text conversion using isolated word recognition with a vocabulary of ten words (the digits 0 to 9) and statistical modeling (HMM) for machine speech recognition. In the training phase, the uttered digits are recorded using 16-bit pulse code modulation (PCM) with a sampling rate of 8 kHz and saved as a wave file using sound recorder software. We use the MATLAB wavread command to convert the .wav files into speech samples. Generally, a speech signal recorded in a noisy environment consists of noise-speech-noise, so recognizing the actual speech within the given samples is important. We divide the speech signal into frames of 450 samples each with an overlap of 300 samples, i.e. two-thirds of a frame length. The speech is separated from the pauses using voice activity detection (VAD) techniques. The system performs speech analysis and synthesis using the linear predictive coding (LPC) method. From the LPC coefficients we obtain the weighted cepstral coefficients and cepstral time derivatives, which form the characteristic vector for a frame. Then, the system performs vector quantization using a vector codebook, and the resulting vectors form the observation sequence. For each word in the vocabulary, the system builds an HMM model and trains it during the training phase. The training steps, from VAD to HMM model building, are performed using PC-based C programs. We load the resulting HMM models onto an FPGA for the recognition phase. In the recognition phase, the speech is acquired in real time from the microphone through a codec and is stored in the FPGA's memory. These speech samples are preprocessed, and the probability of the observation sequence is calculated for each model. The uttered word is recognized by maximum likelihood estimation.

VI. APPLICATIONS OF SPEECH TO TEXT SYSTEM:

The application field of STT is expanding fast, while the quality of STT systems is also increasing steadily. Speech synthesis systems are also becoming more affordable for ordinary customers, which makes these systems more suitable for everyday use and more cost effective. Some uses of STT are described below [5].

1. Aid to the Vocally Handicapped

A hand-held, battery-powered synthetic speech aid can be used by vocally handicapped people to express their words. The device has a specially designed keyboard which accepts the input and converts it into the required speech within the blink of an eye.

2. Source of Learning for the Visually Impaired

Listening is an important skill for people who are blind. Blind individuals rely on their ability to hear or listen to gain information quickly and efficiently. Students use their sense of hearing to gain information from books on tape or CD, but also to assess what is happening around them.

3. Games and Education

Synthesized speech can also be used in many educational institutions, in fields of study as well as in sports. A teacher may tire at some point of time, but a computer with a speech synthesizer can teach the whole day with the same performance, efficiency and accuracy.

4. Telecommunication and Multimedia

STT systems make it possible to access vocal information over the telephone. Queries to such information retrieval systems could be put through the user's voice (with the help of a speech recognizer), or through the telephone keypad. Synthesized speech may also be used to speak out short text messages on mobile phones.

5. Man-Machine Communication

Speech synthesis can be used in several kinds of human-machine interactions and interfaces. For example, in warning and alarm systems, clocks and washing machines, synthesized speech may be used to give more exact information about the current situation [5]. Speech signals are far better than warning lights or buzzers, as they enable a person to react to the signal faster even when an obstacle prevents them from seeing the light.

6. Voice Enabled E-mail

Voice-enabled e-mail uses voice recognition and speech synthesis technologies to enable users to access their e-mail from any telephone. The subscriber dials a phone number to access a voice portal; then, to collect their e-mail messages, they press a couple of keys and, perhaps, say a phrase like "Get my e-mail." Speech synthesis software converts the e-mail text to a voice message, which is played back over the phone. Voice-enabled e-mail is especially useful for mobile workers, because it makes it possible for them to access their messages easily from virtually anywhere (as long as they can get to a phone), without having to invest in expensive equipment such as laptop computers or personal digital assistants.

VII. CONCLUSION:

In this paper, we discussed the topics relevant to the development of STT systems. Speech-to-text conversion may seem effective and efficient to its users if it produces natural speech and after several modifications are made to it. Such a system is useful for deaf and dumb people to interact with other people in society. Speech-to-text conversion is a critical research and application area in the field of multimedia interfaces. This paper gathers important references to literature related to the endogenous variations of the speech signal and their importance in automatic speech recognition. A database has been created from the words and syllables of various domains. The desired speech is produced by the concatenative speech synthesis approach. Speech synthesis is advantageous for people who are visually handicapped. This paper has given a clear and simple step-by-step overview of the working of a speech-to-text (STT) system. The system takes input data from a microphone in the form of voice, preprocesses that data and converts it into text format displayed on the PC. The user types the input string and the
system reads it from the database or data store where the words, phones, diphones and triphones are stored. In this paper, we presented the development of an existing STT system by adding a spell-checker module to it for a different language. There are many speech-to-text (STT) systems available in the market, and much improvement is also going on in this research area to make the speech more effective and natural, with stress and emotion.

VIII. ACKNOWLEDGMENT:

The authors would like to thank the Director/Principal Dr. Vinod Chowdhary, Prof. Aade K. U. and Prof. Bhope V. P., Savitribai Phule Pune University, for their useful discussions and suggestions during the preparation of this technical paper.

IX. REFERENCES:

[1] Sanjib Das, "Speech Recognition Technique: A Review", International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 3, May-Jun 2012.

[2] Sneha K. Upadhyay, Vijay N. Chavda, "Intelligent system based on speech recognition with capability of self learning", International Journal For Technological Research In Engineering, ISSN (Online): 2347-4718, Volume 1, Issue 9, May 2014.

[3] Deepa V. Jose, Alfateh Mustafa, Sharan R., "A Novel Model for Speech to Text Conversion", International Refereed Journal of Engineering and Science (IRJES), ISSN (Online): 2319-183X, Volume 3, Issue 1, January 2014.

[4] B. Raghavendhar Reddy, E. Mahender, "Speech to Text Conversion using Android Platform", International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 3, Issue 1, January-February 2013.

[5] Kaveri Kamble, Ramesh Kagalkar, "A Review: Translation of Text to Speech Conversion for Hindi Language", International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Volume 3, Issue 11, November 2014.

[6] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, "A Review on Speech Recognition Technique", International Journal of Computer Applications (0975-8887), Volume 10, No. 3, November 2010.

[7] Penagarikano, M., Bordel, G., "Speech-to-text translation by a non-word lexical unit based system", Proceedings of the Fifth International Symposium on Signal Processing and Its Applications (ISSPA '99), vol. 1, pp. 111-114, 1999.

[8] Olabe, J. C., Santos, A., Martinez, R., Munoz, E., Martinez, M., Quilis, A., Bernstein, J., "Real time text-to-speech conversion system for Spanish", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '84), vol. 9, pp. 85-87, March 1984.

[9] Kavaler, R. et al., "A Dynamic Time Warp Integrated Circuit for a 1000-Word Recognition System", IEEE Journal of Solid-State Circuits, vol. SC-22, no. 1, pp. 3-14, February 1987.

[10] Aggarwal, R. K. and Dave, M., "Acoustic modelling problem for automatic speech recognition system: advances and refinements (Part II)", International Journal of Speech Technology (2011) 14:309-320.

[11] Ostendorf, M., Digalakis, V., & Kimball, O. A. (1996), "From HMM's to segment models: a unified view of stochastic modeling for speech recognition", IEEE Transactions on Speech and Audio Processing, 4(5), 360-378.

[12] Fujii, Y., Yamamoto, K., Nakagawa, S., "Automatic Speech Recognition Using Hidden Conditional Neural Fields", ICASSP 2011, pp. 5036-5039.

[13] Mohamed, A. R., Dahl, G. E., and Hinton, G., "Acoustic Modelling using Deep Belief Networks", submitted to IEEE Transactions on Audio, Speech, and Language Processing, 2010.

[14] Sorensen, J., and Allauzen, C., "Unary data structures for Language Models", INTERSPEECH 2011.

[15] Kain, A., Hosom, J. P., Ferguson, S. H., Bush, B., "Creating a speech corpus with semi-spontaneous, parallel conversational and clear speech", Tech Report: CSLU-11-003, August 2011.
