VUIs and Mobile Applications (Chapter 7)
Introduction
As their name indicates, voice user interfaces (VUIs) are interfaces that allow users to
interact with computing systems through the use of voice. Although our voice can be used in different
ways, VUIs typically refer to communication through the use of language. This narrows down the
problem at hand, as communication through VUIs is a subset of communication through aural user
interfaces. It is possible to communicate with computers through sounds other than pronounced
words and sentences: different sounds can convey information through their frequency, amplitude,
duration, and other properties that make them unique. However, language is how we naturally
communicate, be it through voice or text; therefore, when referring to VUIs, we refer to
communicating with machines using pronounced language.
A voice user interface (VUI) is a system that allows spoken human interaction with computers.
A VUI typically uses speech recognition to understand spoken requests from the user and is able
to answer those requests through text or voice output.
VUIs have recently increased dramatically in popularity. A VUI makes use of speech recognition
technology to enable users to interact with devices using just their voices. VUIs allow efficient
interactions that are more ‘human’ than other forms of user interface, such as the mouse or
keyboard, since “speech is the primary means of human communication.” A VUI will generally be
much faster than a traditional UI.
A voice command device (VCD) is a device, such as a computer, that is controlled through a VUI.
Voice user interfaces are now appearing everywhere, especially on newer smartphones. They are
also integrated into recent automobiles, domotics (home automation), computers, home appliances
(washing machines, etc.), and TV remote controls.
There are three basic technologies at the heart of systems that interact with users through the
use of voice:
1- Voice recognition,
2- Voice transcription, and
3- Speech synthesis (a subset of which is referred to as text-to-speech).
Qualities of Speech
The qualities of speech are those things that differentiate speech from other types of aural
input. To build good VUIs, a general understanding of the physical qualities of speech not only
gives us better high-level insight into the operation of voice recognition engines but also leads us
toward building better VUIs. So, let us survey what these qualities of speech are:
Amplitude
Speech, like any other type of aural input, has a loudness level. The loudness of a sound
is determined by the amplitude of the sound wave that produces it. The amplitude of speech is
important in that input devices are designed to receive sounds within certain amplitude thresholds.
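As a rough point of reference (standard acoustics, not part of the original text), loudness is usually expressed on a logarithmic decibel scale relative to a reference sound pressure:

    L_p = 20 \log_{10}\!\left(\frac{p}{p_0}\right)\ \mathrm{dB}, \qquad p_0 = 20\,\mu\mathrm{Pa}

Conversational speech measured at about one meter is typically on the order of 60 dB SPL, which gives a feel for the amplitude range that microphone input stages are built to handle.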
Frequencies and Pitch
Another quality of speech is the set of frequencies that make up the sounds which, in turn,
make up speech. Along with the amplitude of our voice, the combination of these frequencies is what
produces the different pitches and sounds that we use to speak.
Meaning and Context
Sounds make up words, words make up phrases and sentences, and a combination of spoken
sounds, words, phrases, and sentences makes up spoken language. However, the meaning of
different combinations of sounds, words, phrases, and sentences in spoken language depends
largely on the context in which the spoken language is used.
Language
Language is a way of combining audible sounds, signs, and gestures to allow people to
communicate. When dealing with VUIs, we are concerned with spoken language, which is the
subset of aural expression that can represent a language. Interpreting this spoken language is the
final goal of both voice recognition and voice transcription systems.
Voice transcription
Transcription converts recorded speech into a written format; it can be performed by a
machine or by a human. Automated voice transcription systems exist today, but they are speaker
dependent. This means that the voice transcription system has to learn the speech patterns of the
user through a training process. This training process can be long, taking several hours, before the
system is able to recognize the user’s voice reasonably well.
Voice recognition
Voice recognition allows the recognition of a word, a phrase, a sentence, or multiple sentences
pronounced by voice against a finite set of possible matches. In other words, the computing system
tries to match something said by the user to a given set of possibilities that it already understands.
For example, telephone companies have used this technology for years in their directory systems,
asking the user to “Please say one to reach Bob and two to reach Phil.” If the user says “three,” the
system cannot find a match and tells the user, “That’s not a valid option; please say one to reach
Bob and two to reach Phil.” Some major distinctions between voice transcription and voice
recognition, as defined in this text, are the following:
1- Voice recognition systems rely on predefined interactions with the user. These interactions
can be composed of predefined words, phrases, or sentences.
2- Voice recognition systems attempt to map these predefined words, phrases, sentences, or
composites to something that is understandable by the computing system.
3- The performance of voice recognition systems is inversely related to the size of the grammar
vocabulary for a given transaction. This is not necessarily true of voice transcription systems, as
their performance is much less dependent on the specifics of a particular user interaction.
4- Voice recognition systems are typically speaker independent. Whereas most voice
transcription systems are tuned to a given user’s voice and speech patterns, voice
recognition systems are designed to be independent of the user’s voice.
The functionality offered by today’s voice recognition systems can take one or more of the
following forms:
1- Directed Dialogue: As the term suggests, this is the case where every response by the user is
preceded by a list of acceptable responses as well as an explanation of them. An example of
such a dialogue is shown in the following figure:
As you can see in this example, the dialogue is directed by the system. The analog of this type
of dialogue in the GUI world would be an interface composed entirely of list boxes that are bound
to predetermined selections.
2- Natural Language: The ultimate goal of voice recognition is to have the computing system
understand the commands that you give it even if they are paraphrased or put in a different
order.
This is an example of a natural language interaction between a user and the computing
system. As you can see in this example, although the system can only understand sentences
and words that concern turning lights on and off or adjusting the temperature in the house,
the user is able to put those words and sentences in any order desired.
3- Mixed-Mode Dialogues: Mixed-mode dialogues are often referred to as mixed initiative or
natural language as well (though the latter is a bit of a misnomer, as mixed-mode dialogues
are not as flexible as natural language dialogues). Mixed-mode dialogues allow natural
language interactions while directing the user so as to keep the possible responses to a given
question to a minimum.
This example shows a mixed-mode dialogue between our fictitious system and Bob. Note that
although the system drives the conversation with the user and tries to limit the expected
responses, it understands the user’s natural language response.
Obviously, directed dialogue:
1- Limits the interaction to a few specific questions and answers, sometimes with a list of
possible responses.
2- Works in situations with few potential customer responses.
3- Is easy to create, but that is about the end of its benefits in terms of customer experience.
Natural language addresses these common concerns:
1- It lets the caller speak freely, as if speaking to a live person.
2- Natural language processing uses AI to interpret whatever the customer says.
3- Callers may respond with full sentences, and the system will pick out the most important
information and generate a helpful response.
4- This approach is well suited to situations with many possible responses.
5- It is more complex to develop, but natural language is the best option for most VUI systems
today.
The following is a natural language VUI example:
Java Speech APIs
As in the case of other technologies, the Java platform offers a canonical API; JSAPI, or the Java
Speech API, is this canonical API. Although this API is no different from any other API,
considering it has two benefits.
First, because it has been agreed on by more than one commercial entity, there is less bias in it
toward any particular platform implementation.
Second, it gives us a good high-level view of what any API may implement in providing access to
the underlying technologies for a VUI. There are three main packages in JSAPI:
1. javax.speech: This package provides the infrastructure to connect to the voice channels for input
and output and to manage dictionary vocabularies dynamically. It also provides the interfaces
that are later used by the other two packages in JSAPI.
2. javax.speech.synthesis: As its name may suggest, this package provides an API suitable for
providing an interface to speech-synthesis systems. This package provides the utilities to adjust
the different values for the quality of speech provided by the speech-synthesis engine for tighter
control of the synthesis. It also provides JSML (Java Speech Markup Language) hooks into the
system so that the synthesis can be done based on JSML.
3. javax.speech.recognition: This package provides the interfaces for managing grammars, rules,
recognition results, and the settings of the recognition engine. As may be suspected, it takes
advantage of JSGF (the Java Speech Grammar Format).
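To make these packages more concrete, the following is a minimal sketch loosely modeled on the "hello world" samples in the JSAPI specification. It assumes a JSAPI-compliant speech engine is installed and registered with Central (the specification itself ships no engine); the class name, the grammar name, and the two-word vocabulary are illustrative, not part of the original text.

import java.io.StringReader;
import java.util.Locale;
import javax.speech.Central;
import javax.speech.EngineModeDesc;
import javax.speech.recognition.*;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class JsapiSketch {
    public static void main(String[] args) throws Exception {
        // javax.speech.synthesis: speak a prompt with a text-to-speech engine.
        Synthesizer synth =
                Central.createSynthesizer(new SynthesizerModeDesc(Locale.US));
        synth.allocate();
        synth.resume();
        synth.speakPlainText("Please say one to reach Bob and two to reach Phil.", null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

        // javax.speech.recognition: constrain the input with an inline JSGF grammar.
        Recognizer rec = Central.createRecognizer(new EngineModeDesc(Locale.US));
        rec.allocate();
        String jsgf = "#JSGF V1.0;\ngrammar directory;\npublic <choice> = one | two;";
        RuleGrammar grammar = rec.loadJSGF(new StringReader(jsgf));
        grammar.setEnabled(true);

        // Print whichever of the two allowed words the engine recognizes.
        rec.addResultListener(new ResultAdapter() {
            public void resultAccepted(ResultEvent e) {
                Result result = (Result) e.getSource();
                ResultToken[] tokens = result.getBestTokens();
                System.out.println("Caller said: " + tokens[0].getSpokenText());
            }
        });
        rec.commitChanges();
        rec.requestFocus();
        rec.resume();
    }
}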
VXML
VXML (Voice Extensible Markup Language) is the W3C's standard XML format for
specifying interactive voice dialogues between a human and a computer. VXML is used for
- telephone-based speech applications, and
- voice browsing of the web.
VXML allows voice applications to be developed and deployed in a similar way to HTML
for visual applications: whereas HTML documents are interpreted by a visual web browser, VXML
documents are interpreted by a voice browser. VXML has two main components:
- tags, which control what the program does, and
- grammars, which control what speech is recognized, i.e., what the user can say.
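As an illustration (not taken from the original text), a minimal VXML document for the earlier directory example might look roughly like the following. Element details and grammar formats vary between voice platforms, so treat this as a sketch of the tag-plus-grammar structure rather than a definitive document.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="directory">
    <field name="choice">
      <!-- prompt: what the system says -->
      <prompt>Please say one to reach Bob and two to reach Phil.</prompt>
      <!-- inline grammar: what the user is allowed to say -->
      <grammar version="1.0" root="choice" mode="voice">
        <rule id="choice">
          <one-of>
            <item>one</item>
            <item>two</item>
          </one-of>
        </rule>
      </grammar>
      <nomatch>That is not a valid option. <reprompt/></nomatch>
      <filled>
        <prompt>Connecting you now.</prompt>
      </filled>
    </field>
  </form>
</vxml>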
Advantages of VoiceXML
1. To provide a simple mechanism for building VUIs.
2. To separate user interaction code (in VoiceXML) from service logic (e.g., scripts).
3. To provide a language that is portable across multiple voice recognition platforms.
4. To enable one VXML document to hold multiple voice interactions. This lessens the number of
interactions between the application running on the voice platform and the applications, databases,
or other interfaces providing the business logic necessary to generate the VXML document.
5. To offer a small but sufficient set of features for basic telephony call control and text-to-
speech interactions. Although the functionality offered through VXML for these functions is
sufficient for smaller efforts, more comprehensive efforts, particularly for mobile applications, may
require additional capabilities.
Using VXML for Mobile Applications
There is a great deal of effort today to implement embedded voice recognition systems
on mobile devices. VXML can be produced from existing GUIs using transcoding mechanisms
that convert another user interface markup language to VXML, using XSL transformations or other
technologies. In theory, the primary advantage of using VXML in this manner is to reduce the
necessary development effort, to create consistency among different interfaces, and to reduce the
cost of changes and maintenance during the lifetime of an application. In practice, things do not
quite work that way because of a number of factors. Some of these are as follows:
1. There is a large performance cost in using VXML on the server side. This problem can typically
be easily solved by increasing CPU and memory. Because CPU and memory are now mostly
inexpensive commodities, the problem is not significant. Nevertheless, it is important to be aware
of the performance loss when moving from a legacy VUI to a VXML-based VUI.
2. Because VXML has been designed around a browser model that utilizes ECMAScript to reduce
traffic with other applications, it is stateful. And because it is stateful, the task of transforming a
generic interface to VXML can be a fairly complicated one.
(ECMAScript is a standard scripting language, developed with the cooperation of Netscape and
Microsoft and mainly derived from Netscape's JavaScript, the widely used scripting language
employed in Web pages to affect how they look or behave for the user.)
3. Generating VXML with typical server-side technologies such as JSP or ASP has the drawback
of increasing network traffic, although it has the advantage of allowing us to apply the same
techniques used for generating HTML-based Web pages to generate VXML pages (a sketch of this
approach follows this list).
4. Automated conversion mechanisms that consume HTML or XHTML to produce VXML are
typically flawed. Whereas the same technologies do a fair job with text and GUIs, VUIs are much
more sensitive to errors. One wrong or poorly designed interaction with the user will turn the user
off from using the system.
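To illustrate the server-side generation approach mentioned in point 3, the sketch below produces a VXML document from a plain Java servlet, in the same way an HTML page would be generated. The servlet class name, the "caller" request parameter, and the prompt text are hypothetical, and a standard javax.servlet container is assumed.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that generates a VXML page dynamically, reusing the
// same server-side pattern normally used to generate HTML pages.
public class DirectoryVxmlServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String caller = req.getParameter("caller");
        if (caller == null) {
            caller = "caller";
        }
        // The registered MIME type for VoiceXML documents.
        resp.setContentType("application/voicexml+xml");
        PrintWriter out = resp.getWriter();
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">");
        out.println("  <form>");
        out.println("    <block>");
        out.println("      <prompt>Welcome, " + caller
                + ". This menu was generated on the server.</prompt>");
        out.println("    </block>");
        out.println("  </form>");
        out.println("</vxml>");
    }
}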