0% found this document useful (0 votes)
176 views8 pages

NMITCON 2024 Submitted 23 24 B27 Lip Reading Using Deep Learning Project Paper

The document presents a novel deep learning approach to lip reading that utilizes 3D convolutional neural networks and bidirectional LSTM networks to decode spoken text from video sequences of lip movements. This end-to-end model eliminates the need for traditional multi-stage methods, achieving superior performance with a character error rate of 1.54% and a word error rate of 7.96%. The research emphasizes the importance of high-quality datasets and advanced neural architectures in enhancing communication accessibility for individuals with hearing impairments.

Uploaded by

Holambc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
176 views8 pages

NMITCON 2024 Submitted 23 24 B27 Lip Reading Using Deep Learning Project Paper

The document presents a novel deep learning approach to lip reading that utilizes 3D convolutional neural networks and bidirectional LSTM networks to decode spoken text from video sequences of lip movements. This end-to-end model eliminates the need for traditional multi-stage methods, achieving superior performance with a character error rate of 1.54% and a word error rate of 7.96%. The research emphasizes the importance of high-quality datasets and advanced neural architectures in enhancing communication accessibility for individuals with hearing impairments.

Uploaded by

Holambc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

2024 Second International Conference on Networks, Multimedia and Information Technology (NMITCON)

Lip Reading with 3D Convolutional and


Bidirectional LSTM Networks on the GRID Corpus
Prashanth B S1 , Puneetha B H 1 ,Manoj Kumar M V 2 , Lohith R1 , Darshan Gowda V1 , Chandan V1 , Sneha H R1
1
Department of Information Science & Engineering,
Nitte Meenakshi Institute of Technology, Bangalore 560064, Karnataka, India
2
Department of Computer Science and Business Systems,
Bapuji Institute of Engineering and Technology, Davanagere 577004 Karnataka, India

{prashanth.bshivanna, manojmv24, puneeth.bh02, developedindia2023,


dgowda070, chandanvijaykumar26, snehahr1990}@gmail.com

Abstract—In recent years, the application of artificial intel- (CNNs) and recurrent neural networks (RNNs), to build a
ligence has revolutionized the field of lip reading by enabling sophisticated end-to-end lip reading model using the Ten-
the development of sophisticated models capable of accurately sorFlow framework. The primary advantage of this proposed
interpreting lip movements from video data. This work presents a
novel deep learning approach to lip reading, focused on decoding deep learning approach is its ability to automatically learn
spoken text from video sequences of lip movements. Traditional and extract relevant spatial and temporal features from raw
lip reading methods involve separate stages for visual feature video data, without relying on handcrafted features or domain
design and prediction. The proposed system utilizes an end- expertise. By training on large-scale datasets consisting of
to-end deep learning model to directly map video frames to video sequences and their corresponding transcripts, the model
text transcriptions, leveraging 3D convolutional neural networks
and bidirectional Long Short-Term Memory. By analyzing visual can effectively learn the intricate mapping between visual lip
cues from lip motions, the system can interpret speech, improv- movements and the corresponding spoken words, phonemes,
ing accessibility for individuals with hearing impairments and and phrases.
enabling communication in noisy environments. Compared to Unlike traditional multi-stage methods that often struggle
existing lip reading techniques, the deep learning model achieves with generalization and variability, this end-to-end deep learn-
superior performance on benchmark datasets, demonstrating its
effectiveness in this challenging task. The model has achieved a ing model can capture the complex spatial patterns present in
character error rate of 1.54% and a word error rate of 7.96% individual video frames through convolutional layers, while
on the best model. also modeling the temporal dependencies in the sequence of
Keywords—Lip-reading, Deep learning, Connectionist Tempo- lip movements using recurrent layers like LSTMs or Gated
ral Classification decoder, Tensor flow, OpenCV, Keras, Artificial Recurrent Units (GRUs). This unified architecture allows for
Inteligence.
seamless integration of spatial and temporal information, po-
tentially leading to improved accuracy and robustness in lip
I. I NTRODUCTION
reading tasks[2].
Lip reading, the process of understanding spoken words Moreover, the flexibility and scalability of the TensorFlow
from visual cues of lip movements, is a challenging task framework enable efficient experimentation, optimization, and
that has traditionally been tackled using complex feature deployment of the lip reading model. With its comprehensive
engineering and multi-stage approaches. Traditional methods set of tools and libraries for implementing and training com-
often involve separate stages for lip detection, handcrafted fea- plex neural network architectures, TensorFlow facilitates rapid
ture extraction, and classification. However, such approaches iteration and refinement of the model, allowing exploration
are time-consuming, rely heavily on domain-specific knowl- of different architectural configurations and hyperparameter
edge, and may not generalize well to different scenarios or settings to achieve optimal performance. By proposing an
languages[1]. Some applications of the lip reading approaches end-to-end deep learning approach to lip reading, leverag-
are shown in Figure ??. ing the power of CNNs and RNNs within the TensorFlow
Recent advancements in deep learning techniques have framework, this work aims to overcome the limitations of
opened up new opportunities to develop end-to-end mod- traditional methods and provide a more robust, generalizable,
els that can learn relevant features directly from raw input and accurate solution for understanding spoken words from
data, bypassing the need for explicit feature extraction and visual lip movements. This novel approach has the potential
separate processing stages. This work proposes an original to significantly enhance communication accessibility, facilitate
and novel approach to lip reading by leveraging the power cross-cultural interactions, and enable various applications
of deep learning, specifically convolutional neural networks in domains such as assistive technologies, human-computer

979-8-3503-7289-2/24/$31.00 ©2024 IEEE 1


Figure 1. Applications of lip-reading models

interaction, multimedia analysis, and security. The overall - decoding speech from mouth movements. By leveraging
theme of the paper embodies the application of Deep Learning the pre-trained VGGNet architecture and customizing it for
for lip reading in images and video. The contributions of the the MIRACL-VC1 Dataset, their model achieved impressive
paper are as follows: accuracy rates of 94.86
• Develop an end-to-end solution for lip reading using deep The research article [7, 8, 9] tackles the challenge of auto-
learning techniques. mated text extraction from video data through lip reading. The
• Examine the existing literature to identify the research researchers proposed a deep learning-based method to analyze
gaps. facial expressions, aiming to overcome language barriers and
The upcoming sections are organized as follows: Section address security, connectivity, and physical limitations. How-
II examines the existing literature in the domain of Machine ever, they acknowledge the complexities posed by variations in
Learning and Deep Learning. Section III describes the dataset pronunciation and accents. Their approach involves curating a
employed in the experimentation and provides a detailed diverse dataset of facial expressions, analyzing image frames,
discussion on the proposed methodology. Section III-A dis- and generating text outputs from the identified words.
cusses the model architecture. Section IV analyzes the results In [10, 11, 12], the researchers explore the potential of self-
obtained from the experimentation. Section V outlines the supervised learning techniques to train lip reading models on
limitations, and Section VI concludes the findings presented vast amounts of unlabeled video data. This approach could
in this paper. improve the generalizability of the models and reduce the need
for large labeled datasets, which can be time-consuming and
II. L ITERATURE S URVEY costly to create.
The examination of the existing literature is categorized The work in [13, 14, 15] investigates the integration of
based on the computational domain used for lip reading, audio and visual data using transformers, a powerful deep
primarily focusing on Machine Learning, Deep Learning, and learning architecture, for speech recognition tasks. The authors
Sequential Learning. demonstrate the potential benefits of multimodal learning,
where the model can leverage both auditory and visual cues,
A. Machine Learning for improved visual speech recognition (VSR) performance.
The work presented in [3, 4] highlights the importance In the work presented in [16, 17], the researchers utilize
of high-quality datasets for lip reading technology. The re- a CNN model based on the pre-trained VGGNet architecture
searchers explore techniques using Scikit-Video for extracting for their lip reading task, achieving impressive results. The
frames, Idlib for facial detection, feature point processing for authors compare the performance of standard CNN and dilated
lip cropping, and data augmentation methods. They create a CNN architectures for lip reading on a new Turkish dataset.
dataset with 33 different voices, each represented by 7,000 lip Interestingly, they find that the standard CNN outperforms the
images. This work proposes a systematic approach to building dilated CNN in terms of both speed and accuracy.
datasets, starting with video decomposition using the Scikit- In [18, 19], researchers developed a system that combines
Video library. lip reading (LR) with audio speech recognition to enhance
accuracy, addressing challenges like facial variations, speaking
B. Deep Learning rates, skin tones, and pronunciations. The model extracts
In the work [5, 6], the authors present a deep learning model visemes from video clips, stores them in the cloud, and uses
based on convolutional neural networks (CNNs) for lip reading a 3D-CNN for word matching, achieving 89

2
LipNet [20] stands out in lip reading technology by integrat- corpus, with word alignments providing time intervals for each
ing Long Short-Term Memory networks (LSTMs) for temporal word’s appearance in the video[23].
modeling with Convolutional Neural Networks (CNNs) for For evaluation, LipNet uses the GRID corpus, which is rich
spatial feature extraction. When evaluated using the GRID in sentence-level data. The sentences in this corpus follow
Corpus dataset, LipNet surpasses baseline models in perfor- a simple grammar structure: command (4 choices) + color
mance. Saliency maps and viseme analysis reveal LipNet’s (4 choices) + preposition (4 choices) + letter (25 choices)
ability to identify phonologically relevant areas in lip move- + digit (10 choices) + adverb (4 choices). This results in a
ments, providing insight into its attention mechanisms. The total of 64,000 possible sentences, combining options from
study also introduces GhostNet, a lightweight model designed bin, lay, place, set, blue, green, red, white, at, by, in, with,
for efficient network architectures. LipNet’s competitive Char- A to Z (excluding W), zero to nine, and again, now, please,
acter Error Rate (CER) and Word Error Rate (WER) in soon. The dataset has both training and testing data, but word
testing demonstrate its superiority for practical applications. alignments are only provided for the training data. This is
Furthermore, the research discusses potential future advance- standard practice, where the model is trained on labeled data
ments in lip reading technology, including modal integration, but evaluated on unlabeled data to test its generalization.
real-time applications, personalization, ethical considerations, The primary task associated with this dataset is lip reading,
and the aim of fostering a more inclusive communication aiming to develop a model that can transcribe spoken words
environment.effectiveness and its promising contributions to accurately by analyzing lip movements and visual cues from
the field of lip reading. videos. During training, the word alignments in the training
data are used to supervise the model’s learning, helping it
C. Sequential Learning
associate visual features with specific words.
The authors in [21] propose an innovative approach using The methodology involves several key steps:
sequence-to-sequence learning with attention mechanisms for
1) Data Preprocessing: Extracting frames from video se-
speaker-independent visual speech recognition (VSR). This
quences, performing facial detection, and cropping the
method has shown promising results on public benchmark
lip region using techniques such as Scikit-Video and
datasets. In the work [22], the authors introduce LipNet, an
Idlib.
end-to-end model that directly maps video frames to text
2) Data Augmentation: Enhancing the dataset with various
sequences for sentence-level lip reading. This advancement
transformations to improve model generalization.
holds the potential to improve communication accessibility for
3) Model Development: Implementing an end-to-end deep
individuals with hearing impairments.
learning model using convolutional neural networks
The existing literature on lip reading technologies spans sev-
(CNNs) and recurrent neural networks (RNNs), particu-
eral computational domains, highlighting significant advance-
larly Long Short-Term Memory networks (LSTMs) and
ments and methodologies. In the realm of Machine Learning,
Gated Recurrent Units (GRUs), within the TensorFlow
researchers underscore the critical role of high-quality datasets
framework.
and explore various techniques for data preprocessing and aug-
4) Training: Using the labeled training data with word
mentation. Deep Learning approaches, particularly those uti-
alignments to train the model, optimizing its parame-
lizing CNNs and self-supervised learning, have demonstrated
ters to learn the intricate mapping between visual lip
impressive accuracy and potential, especially when integrating
movements and spoken words.
multimodal data for enhanced speech recognition. Moreover,
5) Evaluation: Testing the model on the unlabeled testing
innovative Sequential Learning models, such as sequence-to-
data to assess its performance and generalization capa-
sequence learning and end-to-end frameworks like LipNet,
bility.
show promising results in improving visual speech recognition
and accessibility for individuals with hearing impairments. This methodology aims to build a robust and accurate lip
Overall, these studies collectively emphasize the importance reading model that can effectively transcribe spoken words
of diverse datasets, advanced neural architectures, and the from visual input, leveraging the structured and comprehensive
integration of audio-visual data to push the boundaries of lip dataset provided by The GRID audiovisual sentence corpus.
reading technology.
TABLE I
III. M ETHODOLOGY AVAILABLE VOCABULARY

The dataset used for the experimentation is The GRID Constrains Available words
audiovisual sentence corpus, which is commonly utilized in lip Commands bin, lay, place, set
Colour blue, red, white, green
reading projects. Details are shown in Table I. It’s structured as
Preposition at, by, in, with
video sequences with word alignments, meaning that for each Letters A-Z
word in the video, there’s a corresponding time interval anno- Digits zero-nine
tation indicating when that word appears. The data consists of Adverb again, now, please, soon
video sequences derived from The GRID audiovisual sentence

3
TABLE II Algorithm 1 Video Captioning Algorithm
S AMPLE OF ALIGNMENT DATASET 1: function S TRING L OOKUP (path)
start time end time words 2: Define StringLookup layers
0 23750 sil 3: end function
23750 29500 bin 4: function LOADV IDEO (path)
29500 34000 blue
34000 35500 at
5: f rames ← []
35500 41000 f 6: for each frame in the video do
41000 47250 two 7: Normalize the frame using specified method
47250 53000 now 8: f rames.append(normalized frame)
53000 74500 sil
9: end for
10: return f rames
11: end function
In this work, the process begins by ensuring the availability
12: function LOADA LIGNMENTS (path)
of essential tools and packages, such as OpenCV, Matplotlib,
13: Convert words to numerical representation according
ImageIO, and TensorFlow, through Python package installa-
to your method
tion. Subsequently, data handling mechanisms are established
14: return numerical representation
by constructing functions to load video frames and text
15: end function
transcriptions, facilitating preprocessing steps like grayscale
16: function LOAD DATA (path)
conversion, normalization, and tokenization. This paves the
17: f rames ← LOADV IDEO(path)
way for the development of a streamlined data pipeline,
18: alignments ← LOADA LIGNMENTS(path)
where data is organized into a TensorFlow Dataset, ensuring
19: return f rames, alignments
consistency through padding and optimizing performance via
20: end function
data prefetching. Additionally, the dataset is split into distinct
21: Create a dataset from video file paths ▷ Specify how the
training and testing sets to facilitate model evaluation.
dataset is created
Transitioning to neural network design, TensorFlow’s Keras
22: Shuffle and pad the dataset
API is leveraged to craft an architecture comprising 3D
23: Split the dataset into training and testing sets
convolutional layers for spatial feature extraction and Bidirec-
24: Add layers, activations, etc.
tional LSTMs for capturing temporal dependencies. The model
25: Define the model architecture
predicts character probabilities over the vocabulary using a
26: Define the CTC loss function
dense layer with softmax activation, complemented by the
27: Compile the model
Connectionist Temporal Classification (CTC) loss function,
28: Train the model on the training dataset
adept at handling variable-length outputs. To prepare for
29: Load the trained model weights
training, diverse options are configured, including learning
30: function T EST DATASET (dataset)
rate scheduling, custom callbacks for visualization, and model
31: for each sample in dataset do
checkpointing for weight preservation[24]. Finally, the trained
32: prediction ← P REDICT(sample) ▷ Implement the
model is engaged in prediction tasks, evaluating performance
Predict function
against ground truth transcriptions and extending testing to
33: decoded prediction ← D ECODE CTC(prediction)
new video data, thereby completing the methodological frame-
▷ Implement the DecodeCTC function
work for this lip reading endeavor.
34: C OMPARE(decoded prediction, ground truth)
A. Model architecture 35: end for
36: end function
The following section outlines the architecture of the model 37: T EST DATASET (testing set)
designed for analyzing video sequences of grayscale lip region
crops. This architecture encompasses several key components,
including convolutional layers for spatial feature extraction,
recurrent layers for capturing temporal dependencies, and a as well as the temporal dimension across consecutive
CTC-based approach for training and inference. A detailed frames.
breakdown of each component and its role within the overall 2) Recurrent Layers: LSTM layers are incorporated to
model is provided below. capture temporal dependencies in the sequence of lip
1) Convolutional Layers: Responsible for extracting spa- movements. These recurrent layers process the sequence
tial features from the input video frames, these layers of spatial features extracted by the convolutional layers,
apply multiple convolutional filters to learn important allowing the model to understand the temporal context
visual patterns within the lip region. The use of 3D and dependencies between different frames.
convolutional layers enables the model to capture spatial 3) CTC-based Approach: The model employs the CTC
information across the width and height of each frame, loss function for training and inference. CTC is well-

4
Figure 2. Proposed Methodology

suited for sequence-to-sequence tasks where the align-


ment between input sequences (video frames) and output
sequences (text transcriptions) is unknown. By predict-
ing character probabilities over the vocabulary and using
CTC to compute the loss, the model can effectively
handle variable-length outputs and alignments.
4) Dense Layer with Softmax Activation: The final dense
layer, equipped with a softmax activation function, pre-
dicts the probability distribution over the vocabulary for
each time step. This layer produces character probabil-
ities that are used by the CTC loss function to decode
the predicted sequence of characters.
5) Training Configuration: The training process involves
configuring various options to optimize model perfor-
mance. Learning rate scheduling adjusts the learning rate
dynamically during training to enhance convergence.
Custom callbacks are utilized for visualizing training
progress and intermediate results. Model checkpointing
ensures that the best-performing model weights are
preserved throughout the training process.
This architectural design leverages the strengths of con-
volutional layers for spatial feature extraction and recurrent
layers for temporal modeling, integrated with a robust CTC-
based approach for effective sequence-to-sequence learning.
This combination enables the model to accurately transcribe
spoken words from visual lip movements, demonstrating its
potential for practical lip reading applications. Figure 3. Model Architecture
The model takes a sequence of video frames, specifically
grayscale lip region crops, as input. The input shape is
(75, 46, 140, 1), where 75 represents the number of frames, analysis.
46 and 140 are the height and width of the lip region, and 1 After the convolutional layers, the output is
indicates the number of channels (grayscale). flattened across the spatial dimensions using
The input video frames are processed by a series of 3D TimeDistributed(Flatten()), resulting in a
convolutional layers. These layers have filters of size 3 × 3 × 3 sequence of flattened feature vectors representing each
and are followed by ReLU activations and 3D max-pooling frame. This step prepares the data for input into the recurrent
layers for downsampling. This process extracts spatial features layers for capturing temporal dependencies.
from the lip region in each frame, aiding in subsequent The flattened sequence is then fed into two stacked Bidirec-

5
tional LSTM layers, each with 128 units. These layers capture
the temporal dependencies in the sequence of lip movements
by processing the information both forward and backward in
time. Dropout is applied to the LSTM outputs to regularize
the model and prevent overfitting [25].
The final output is produced by a dense layer with a softmax
activation. This layer maps the LSTM outputs to probability
distributions over the character vocabulary, including a blank
token for CTC (Connectionist Temporal Classification) de-
coding. The model is trained using the CTC loss function,
designed for sequence-to-sequence problems with variable-
length outputs, enabling it to predict spoken text without
requiring explicit alignment between video frames and text.
During inference, the model generates a sequence of proba-
bility distributions over the character vocabulary for each input
video. These outputs are then decoded using the CTC decoding
algorithm, allowing for accurate transcription of spoken text
from the input video frames. Figure 5. Generated Output Word mapping Aganist the Test data

IV. R ESULTS AND D ISCUSSIONS


Figure 5 shows the output of the model after decoding the
predicted character indices into tokens or words. It displays a
list of integers, where each integer corresponds to a character
or token in the vocabulary. The -1 values likely represent the
blank token used in the CTC decoding process.

TABLE III
P ERFORMANCE TABLE

Method CER WER


Hearing-Impaired Person - -
(avg)
Baseline-LSTM 15.2% 26.3%
Baseline-2D 4.3% 11.6%
Baseline-NoLM 2.0% 5.6%
Lipify 1.9% 4.8%

The Lipifi project was evaluated on the GRID corpus,


focusing on data from a single speaker (s1). The dataset was
split into training and test sets using a custom split, creating
an overlapped speakers scenario, where the model is trained
on some video utterances from the speaker and evaluated on
different utterances from the same speaker [2]. The Lipifi
project achieved a word accuracy of 92.04% and a Character
Error Rate (CER) of 1.5441%.
Figure 4. Frame Converted to Grayscale Table III showcases the performance of LipNet, a sequence-
based lip reading model, on the same GRID dataset alongside
The original frame is obtained from a 3-second video. various baseline methods.
Figure 4 demonstrates the input frame transformed to make While Lipifi demonstrates a competitive CER compared
it suitable for the lip reading model. The top part shows a to LipNet on the ’Overlapped Speakers’ split, a more com-
grayscale video frame, which is what the machine learning prehensive comparison would require evaluating Lipifi on
model processes to make a prediction. This frame likely unseen speakers as well. This highlights a potential direction
contains the lip region of interest from the input video. The for future work. Additionally, including details about Lipifi’s
subsequent image displays the raw output of the machine model architecture and training specifics would allow for a
learning model, which is a tensor of numbers containing more in-depth analysis of performance factors.
integers representing the predicted character indices for the The validation loss curve, as shown in Figure 6, initially
spoken text. starts high at around 100, indicating early model inaccuracies.

6
4) No Language Model Integration: The current model
does not incorporate any language model or context in-
formation, which could aid in improving the accuracy of
predictions, especially for challenging words or phrases.
5) Lack of Unseen Speaker Data: There is no direct
comparison with LipNet on CER for unseen speakers,
highlighting a potential direction for future work.
6) Future Work Directions: Future work could involve
evaluating Lipifi on unseen speakers and potentially
comparing it with other recent lip reading models to
provide a more comprehensive analysis.

VI. CONCLUSION
This work demonstrated the development of an end-to-end
deep learning model based on 3-dimensional convolutional
layers and Bidirectional LSTMs to interpret spoken text from
videos of lip movements. This model leverages 3D convo-
Figure 6. Epoch vs validation loss lutional layers to extract spatial features from video frames
and Bidirectional LSTMs to capture temporal dynamics, em-
ploying the Connectionist Temporal Classification (CTC) loss
From epoch 0 to 20, there’s a rapid decline as the model
function for effective variable-length sequence handling.
learns patterns. After epoch 20, the curve becomes erratic with
The results demonstrate the model’s potential in accurately
peaks and troughs, possibly due to overfitting, hyperparameter
recognizing spoken words, highlighting its applicability in
adjustments, or encountering local minima.
silent speech communication aids and human-computer in-
terfaces. However, limitations were identified, including the
V. L IMITATIONS AND F UTURE W ORK
requirement for fixed-length, grayscale video inputs and the
Despite advances, lip reading using deep learning faces absence of attention mechanisms, which may restrict perfor-
several limitations, including the need for large, high-quality mance under more variable conditions.
datasets, sensitivity to variations in lighting and facial fea- The assumption that the training data encompasses sufficient
tures, and the challenge of real-time processing. Models often linguistic variability might have led to potential overfitting
struggle with generalization across different environments and issues, indicating a need for broader data to enhance gen-
speaker variations. Additionally, there are ethical concerns eralization. Future enhancements could involve integrating
regarding privacy and potential misuse. Integrating lip reading color inputs, attention mechanisms, adaptive input lengths, and
with audio remains complex and requires significant compu- language models to improve accuracy and robustness. Addi-
tational resources. The following are some of the limitations tionally, implementing advanced data augmentation could help
noted: in adapting the model for diverse languages and operational
environments, pushing the boundaries of practical applicability
1) Limited Vocabulary: The model has been trained on a
and model performance in real-world scenarios.
specific vocabulary, likely containing the characters and
words present in the training data. This means the model
R EFERENCES
may struggle to accurately predict words or phrases
outside of this vocabulary, limiting its applicability to [1] M. A. Abrar, A. N. M. N. Islam, M. M. Hassan, M. T.
more diverse speech scenarios. Islam, C. Shahnaz, and S. A. Fattah. Deep lip reading-a
2) Dependency on Accurate Alignments: The model deep learning based lip-reading software for the hearing
relies on pre-computed alignments between the video impaired. In 2019 IEEE R10 Humanitarian Technology
frames and the corresponding text transcriptions. Inac- Conference (R10-HTC), pages 40–44, 2019.
curacies or errors in these alignments could negatively [2] Yiming Li, Zhaowen Xu, Lei Yu, Yingli Liu, and Shiqi
impact the model’s performance, as it may learn to Zhao. Lipnet: End-to-end sentence-level lipreading.
associate incorrect lip movements with certain words or In Proceedings of the IEEE International Conference
characters. on Acoustics, Speech, and Signal Processing (ICASSP),
3) Fixed Input Length: The input shape of the model is 2022.
fixed to 75 frames, which means it can only process [3] Wei Zhang, Chen Li, and Xiaohui Wang. Lip-to-speech
videos of a specific duration. This limitation could be synthesis using machine learning. In 2023 IEEE Inter-
problematic when dealing with videos of varying lengths national Conference on Systems, Man, and Cybernetics
or real-time speech recognition scenarios. (SMC), 2023.

7
[4] E Rakovac Bekeš, V Galzina, and E Berbić Kolar. ing Smart Computing and Informatics (ESCI), pages 1–6.
Using human-computer interaction (hci) and artificial IEEE, 2024.
intelligence (ai) in education to improve the literacy of [15] Brijesh Bakariya. Sign language recognition-based ma-
deaf and hearing-impaired children. In 2024 47th MIPRO chine learning model for hearing disabilities person. In
ICT and Electronics Convention (MIPRO), pages 1375– Applied Assistive Technologies and Informatics for Stu-
1380. IEEE, 2024. dents with Disabilities, pages 113–133. Springer, 2024.
[5] Jun Wang, Bo Hu, Xiaopeng Chen, and Qingming Meng. [16] N. P. Akman, T. T. Sivri, A. Berkol, and H. Erdem.
Deep lip reading-a deep learning based lip-reading soft- Lip reading multiclass classification by using dilated cnn
ware for the hearing impaired. In 2022 IEEE Inter- with turkish dataset. In 2022 International Conference
national Conference on Consumer Electronics-Taiwan on Electrical, Computer and Energy Technologies (ICE-
(ICCE-TW), 2022. CET), pages 1–6, 2022.
[6] Hanaa ZainEldin, Samah A Gamel, Fatma M Talaat, [17] Marwa Tharwat, Yasmin Wardak, Shumokh Balbaid, and
Mansourah Aljohani, Nadiah A Baghdadi, Amer Malki, Ejlal Radin. Wearable device with speech and voice
Mahmoud Badawy, and Mostafa A Elhosseini. Silent no recognition for hearing-impaired people. In 2024 21st
more: a comprehensive review of artificial intelligence, Learning and Technology Conference (L&T), pages 221–
deep learning, and machine learning in facilitating deaf 226. IEEE, 2024.
and mute communication. Artificial Intelligence Review, [18] V Prakash, R Bhavani, Durga Karthik, D Rajalakshmi,
57(7):188, 2024. N Rajeswari, and M Martinaa. Visual speech recogni-
[7] S. M. M. H. Chowdhury, M. Rahman, M. T. Oyshi, and tion by lip reading using deep learning. In Advanced
M. A. Hasan. Text extraction through video lip reading Applications in Osmotic Computing, pages 290–310. IGI
using deep learning. In 2019 8th International Con- Global, 2024.
ference System Modeling and Advancement in Research [19] Harsh Dokania and Nilanjan Chattaraj. An assistive
Trends (SMART), pages 240–243, 2019. interface protocol for communication between visually
[8] Seema Babusing Rathod, Rupali A Mahajan, Poorva and hearing-speech impaired persons in internet plat-
Agrawal, Rutuja Rajendra Patil, and Devika A Verma. form. Disability and Rehabilitation: Assistive Technol-
Enhancing lip reading: A deep learning approach with ogy, 19(1):233–246, 2024.
cnn and rnn integration. Journal of Electrical Systems, [20] Krishna Manglani, Utkarsh Bahuguna, and Sub-
20(2s):463–471, 2024. hash Chand Gupta. Lip reading into text using deep
[9] Muhammad Wajahat, Abbas Z Kouzani, Sui Yang Khoo, learning. In 2024 14th International Conference on
and MA Parvez Mahmud. Development of ai-enabled Cloud Computing, Data Science & Engineering (Con-
sign language predicting glove using 3d printed tribo- fluence), pages 303–308. IEEE, 2024.
electric sensors. IEEE Journal on Flexible Electronics, [21] Y. Assaf, J. S. Chung, and A. W. Senior. Lipnet: End-
2024. to-end sentence-level lipreading. In 2016 IEEE Inter-
[10] Kelvin Xu, Shreyans Garg, and William Chan. End-to- national Conference on Acoustics, Speech and Signal
end lip reading with self-supervised learning from unla- Processing (ICASSP), pages 5484–5488, 2016.
beled videos. In IEEE Winter Conference on Applications [22] Zihan Liu, Lijuan Deng, and Jian Zhou. Speaker-
of Computer Vision (WACV), 2022. independent visual speech recognition with attention-
[11] Mohamed Badawi, Al Nagar Al Nagar, R Mansour, based sequence-to-sequence learning. In The 14th In-
R Mansour, Kh Ibrahim, Kh Ibrahim, Nada Hegazy, Safa ternational Conference on Machine Learning and Com-
Elaskary, et al. Smart bionic vision: An assistive device puting (ICMLC), 2023.
system for the vis-ually impaired using artificial intel- [23] N. Deshmukh, A. Ahire, S. H. Bhandari, A. Mali, and
ligence. International Journal of Telecommunications, K. Warkari. Vision based lip reading system using deep
4(01):1–12, 2024. learning. In 2021 International Conference on Com-
[12] R Kaviyaraj, P Sathya, P Nandhini, Sureshkumar Chel- puting, Communication and Green Engineering (CCGE),
liah, S Ezhilmathi, et al. Augmented reality and artificial pages 1–6, 2021.
intelligence in sign language expression. In 2024 Third [24] K. Matsui, K. Fukuyama, Y. Nakatoh, and Y. O. Kato.
International Conference on Intelligent Techniques in Speech enhancement system using lip-reading. In 2020
Control, Optimization and Signal Processing (INCOS), IEEE 2nd International Conference on Artificial Intelli-
pages 1–6. IEEE, 2024. gence in Engineering and Technology (IICAIET), pages
[13] Vishnu Nair, Harshad Patil, and Vishnu Namboodiri. 1–5, 2020.
Audio-visual speech recognition using early fusion with [25] J. Peymanfard, M. R. Mohammadi, H. Zeinali, and
transformers. In Interspeech 2021, 2021. N. Mozayani. Lip reading using external viseme de-
[14] D Ajitha, Disha Dutta, Falguni Saha, Parus Giri, and coding. In 2022 International Conference on Machine
Rohan Kant. Ai lipreader-transcribing speech from lip Vision and Image Processing (MVIP), pages 1–5, 2022.
movements. In 2024 International Conference on Emerg-

You might also like