
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND

MACHINE LEARNING

MINI PROJECT

LIP READING USING A DEEP LEARNING MODEL

AND OPENCV

BY
ASHWIN KUMAR C (221501014)
HARIHARAN A (221501035)
ABSTRACT

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches
separated the problem into two stages: designing or learning visual features, and prediction. More recent deep
lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However,
existing work on models trained end-to-end performs only word classification, rather than sentence-level
sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton
& Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous
communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length
sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the
connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is
the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features
and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped
speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art
accuracy (Gergen et al., 2016).

KEY POINTS:

• Automated lipreading

• Classification with deep learning

• Sequence prediction in speech recognition


INTRODUCTION

Lipreading, the process of interpreting spoken language by visually observing the movements of a speaker's
mouth, is an essential aspect of human communication and speech comprehension. This skill is particularly
significant in situations where auditory cues are limited or absent, such as in noisy environments or for
individuals with hearing impairments. The importance of lipreading is underscored by the well-documented
McGurk effect (McGurk & MacDonald, 1976), a perceptual phenomenon where conflicting auditory and visual
speech signals result in the perception of a completely different phoneme. This effect highlights the complex
interplay between visual and auditory information in speech perception and underscores the challenges
involved in accurately interpreting spoken language through lipreading alone.

Lipreading is inherently a difficult task for humans, especially in the absence of contextual information. The
subtle and often ambiguous nature of lip movements, compounded by the fact that many visual speech cues
(such as those made by the lips, tongue, and teeth) are latent or partially obscured, makes accurate lipreading
a formidable challenge. Pioneering studies by Fisher (1968) and Woodward & Barber (1960) revealed that
certain visual phonemes, or "visemes," are frequently confused with one another, leading to a high rate of
errors in human lipreading. Fisher (1968), for example, identified five categories of visual phonemes from a list
of 23 initial consonant phonemes that are commonly mistaken for each other when viewed without auditory
input. These errors are often asymmetrical, with similar patterns observed for final consonant phonemes as
well.
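
As a rough illustration of why visual-only decoding is ambiguous, the snippet below groups a few phonemes into viseme classes whose members look nearly identical on the lips. The grouping shown is a simplified, illustrative assumption and does not reproduce Fisher's exact 1968 categories.

```python
# Illustrative viseme grouping (a simplified assumption, not Fisher's exact categories):
# phonemes within a group look nearly identical on the lips, which is why
# purely visual decoding frequently confuses them.
VISEME_GROUPS = {
    "bilabial":    ["p", "b", "m"],   # lips pressed together
    "labiodental": ["f", "v"],        # lower lip against upper teeth
    "dental":      ["th", "dh"],      # tongue tip visible between teeth
    "rounded":     ["w", "r"],        # rounded/protruded lips
}

def viseme_of(phoneme: str) -> str:
    """Return the viseme group of a phoneme, or 'other' if ungrouped."""
    for group, phonemes in VISEME_GROUPS.items():
        if phoneme in phonemes:
            return group
    return "other"

print(viseme_of("b"))  # -> "bilabial": visually hard to distinguish from "p" or "m"
```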

As a result, human lipreading performance is generally poor, even among individuals with extensive experience
in lipreading. Research by Easton & Basala (1982) demonstrated that hearing-impaired individuals achieve an
accuracy of only 17±12% when attempting to lipread a limited set of 30 monosyllabic words, and 21±11% for a
set of 30 compound words. These findings underscore the inherent limitations of human lipreading,
particularly when contextual information is scarce.

Given these challenges, there is a compelling need to automate the lipreading process, leveraging the power of
modern technology to overcome the limitations of human perception. The potential applications of automated
lipreading are vast and varied, encompassing areas such as improved hearing aids, silent dictation in public
spaces, enhanced security measures, robust speech recognition in noisy environments, biometric identification,
and the processing of silent films.

However, automating lipreading is a complex task, primarily due to the need to extract and interpret
spatiotemporal features from video sequences. Unlike traditional approaches, which often separate feature
extraction and prediction into distinct stages, recent advances in deep learning have enabled the development
of models that can learn these features end-to-end. Despite this progress, most
existing models have been limited to word-level classification, lacking the ability to perform sentence-level
sequence prediction—a crucial requirement for practical lipreading applications.
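
Since our project pairs the deep learning model with OpenCV, the sketch below illustrates one way the mouth region could be cropped from each video frame before feature learning. It is only a minimal illustration: LipNet's actual preprocessing uses facial landmark detection, and the "lower third of the face" heuristic, the crop size, and the video path here are assumptions.

```python
import cv2
import numpy as np

# Rough mouth-region extraction with OpenCV (a sketch, not LipNet's exact
# preprocessing). The crop heuristic, output size, and video path are illustrative.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_crops(video_path: str, size=(100, 50)):
    """Yield one resized grayscale mouth crop per frame of the video."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]      # lower third of the face
        yield cv2.resize(mouth, size)
    cap.release()

# Hypothetical usage:
# frames = np.stack(list(mouth_crops("speaker1_utterance.mpg")))  # shape (T, 50, 100)
```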

In this context, we introduce LipNet, a novel end-to-end model designed to address the challenges of sentence-
level lipreading. To the best of our knowledge, LipNet is the first model capable of simultaneously learning
spatiotemporal visual features and a sequence model to make sentence-level predictions. Drawing inspiration
from advancements in automatic speech recognition (ASR), LipNet operates at the character level and employs
spatiotemporal convolutional neural networks (STCNNs), recurrent neural networks (RNNs), and the
connectionist temporal classification (CTC) loss function (Graves et al., 2006) to achieve this goal.
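
To make this pipeline concrete, the following is a simplified PyTorch sketch of an STCNN followed by a bidirectional GRU and a CTC head. The layer sizes, frame dimensions, and vocabulary are illustrative assumptions and do not reproduce LipNet's published hyperparameters.

```python
import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Simplified STCNN + bidirectional GRU + CTC head (illustrative sizes)."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Spatiotemporal convolutions: kernels span time as well as space.
        self.stcnn = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool only spatially
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.gru = nn.GRU(input_size=64 * 12 * 25, hidden_size=hidden,
                          num_layers=2, bidirectional=True, batch_first=True)
        # One extra output for the CTC blank symbol.
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, x):                      # x: (B, 1, T, 50, 100)
        feats = self.stcnn(x)                  # (B, 64, T, 12, 25)
        B, C, T, H, W = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        out, _ = self.gru(feats)               # (B, T, 2*hidden)
        return self.fc(out).log_softmax(-1)    # per-frame character log-probs

# Training-step sketch with the CTC loss (Graves et al., 2006); inputs are dummies.
model = LipReader(vocab_size=27)               # 26 letters + space (assumed vocabulary)
ctc = nn.CTCLoss(blank=27)                     # blank is the last index
video = torch.randn(2, 1, 75, 50, 100)         # 2 clips of 75 frames, 50x100 mouth crops
log_probs = model(video).permute(1, 0, 2)      # CTC expects (T, B, vocab+1)
targets = torch.randint(0, 27, (2, 20))        # dummy character indices
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 75, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()
```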

Our empirical evaluation on the GRID corpus (Cooke et al., 2006), one of the few publicly available datasets for
sentence-level lipreading, demonstrates the effectiveness of LipNet. The model achieves a remarkable 95.2%
sentence-level word accuracy on an overlapped speakers split—a benchmark task widely used in the lipreading
research community. This performance not only surpasses the previous state-of-the-art accuracy of 86.4%
reported by Gergen et al. (2016) for word-level classification but also generalizes well to unseen speakers,
achieving an accuracy of 88.6%.
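
Sentence-level word accuracy can be computed as one minus the word error rate, i.e. one minus the word-level edit distance between the reference and the predicted sentence, divided by the reference length. A minimal sketch follows; the example sentences are hypothetical GRID-style commands, not actual model output.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (r != h))      # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def word_accuracy(reference: str, hypothesis: str) -> float:
    """1 - WER, computed on whitespace-tokenized words."""
    ref, hyp = reference.split(), hypothesis.split()
    return 1.0 - edit_distance(ref, hyp) / len(ref)

# Hypothetical GRID-style example (not real model output):
print(word_accuracy("place blue at f two now", "place blue at f two now"))  # 1.0
print(word_accuracy("place blue at f two now", "place blue in f two now"))  # ~0.833
```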

Furthermore, we compare LipNet's performance with that of hearing-impaired individuals tasked with
lipreading the same sentences from the GRID corpus. On average, these individuals achieve an accuracy of
52.3%, while LipNet attains 1.69 times that accuracy, underscoring the model's potential to outperform
human lipreaders in practical applications.

To further understand LipNet's decision-making process, we apply saliency visualization techniques (Zeiler &
Fergus, 2014; Simonyan et al., 2013) to interpret the model's learned behavior. These visualizations reveal that
LipNet focuses on phonologically important regions in the video, confirming that the model is effectively
capturing the relevant visual cues for accurate lipreading. Additionally, by analyzing intra-viseme and inter-
viseme confusion matrices at the phoneme level, we observe that the majority of LipNet's errors occur within
viseme categories, suggesting that while the model is highly effective, some ambiguities remain when
contextual information is insufficient for disambiguation.
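
The saliency technique of Simonyan et al. (2013) can be sketched as computing the gradient of the predicted scores with respect to the input pixels; frames and regions with large gradient magnitude are the ones that most influence the prediction. The sketch below assumes a model such as the LipReader sketch shown earlier and is illustrative, not LipNet's exact visualization code.

```python
import torch
import torch.nn as nn

def saliency_map(model: nn.Module, video: torch.Tensor) -> torch.Tensor:
    """Gradient-based saliency (Simonyan et al., 2013): |d score / d pixel|.

    `video` has shape (1, 1, T, H, W); returns a (T, H, W) map whose large
    values mark the frames and pixels that most affect the model's output.
    """
    video = video.clone().requires_grad_(True)
    log_probs = model(video)                 # (1, T, vocab+1)
    # Sum the most confident class score at every time step and
    # backpropagate it to the input pixels.
    score = log_probs.max(dim=-1).values.sum()
    score.backward()
    return video.grad.abs().squeeze(0).squeeze(0)   # (T, H, W)

# Hypothetical usage with the LipReader sketch above:
# sal = saliency_map(model, torch.randn(1, 1, 75, 50, 100))
# sal.mean(dim=(1, 2)) then ranks frames by how strongly they drive the prediction.
```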

In summary, LipNet represents a significant advancement in the field of automated lipreading, offering a
powerful tool for sentence-level speech interpretation from visual input. Its high accuracy, ability to generalize
across speakers, and interpretability make it a promising solution for a wide range of practical applications in
communication, security, and beyond.
