
Deformation Flow Based Two-Stream Network for Lip Reading

Jingyun Xiao¹,², Shuang Yang¹, Yuanhang Zhang¹,², Shiguang Shan¹,², Xilin Chen¹,²
¹ Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
² University of Chinese Academy of Sciences, Beijing 100049, China

Abstract— Lip reading is the task of recognizing speech content by analyzing movements in the lip region when people are speaking. Observing the continuity of adjacent frames in the speaking process, and the consistency of the motion patterns among different speakers when they pronounce the same phoneme, we model the lip movements in the speaking process as a sequence of apparent deformations in the lip region. Specifically, we introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region. The learned deformation flow is then combined with the original grayscale frames in a two-stream network to perform lip reading. Different from previous two-stream networks, we make the two streams learn from each other during training by introducing a bidirectional knowledge distillation loss to train the two branches jointly. Owing to the complementary cues provided by the different branches, the two-stream network shows a substantial improvement over using either branch alone. A thorough experimental evaluation on two large-scale lip reading benchmarks is presented with detailed analysis. The results accord with our motivation, and show that our method achieves state-of-the-art or comparable performance on these two challenging datasets.

I. INTRODUCTION

Visual speech recognition, also known as lip reading, is the task of decoding speech content based on the visual cues of a speaker's lip motion. Lip reading is a developing topic that has received growing attention in recent years. It has broad application prospects in hearing aids, special education for hearing-impaired people, complementing acoustic speech recognition in noisy environments, and new human-machine interaction methods, among many other potential applications.

The field of video understanding has progressed significantly in recent years. However, lip reading, a special case of video understanding, remains challenging. Different from coarse-grained video analysis tasks, such as action detection and action recognition, lip reading is a fine-grained video analysis task, and requires subtle spatial information in the lip region as well as continuous and discriminative temporal information of lip motion. While humans outperform machines in action recognition, machines have already exceeded humans in lip reading. This is partly because the visual details and lip motions are too subtle for humans to capture and analyze, while machines have an innate advantage in this respect.

Recent lip reading methods are based on deep learning and often operate in an end-to-end fashion. Although promising performance has been achieved by these methods, several issues still demand more consideration. First, most existing lip reading methods extract frame-wise features and then model the temporal relationships with RNNs, with less consideration of the innate spatiotemporal correlation of adjacent frames. Second, one main difference between lip reading and other video analysis tasks is that the input video is focused on the face, and is usually a crop of the lip region. This places higher demands on the discriminative power of subtle facial information in the videos.

In this paper, we propose a Deformation Flow Network (DFN) to generate the deformation flow of the face in a video. It is trained in a completely self-supervised manner, with no need for labeled data. The deformation flow is a sequence of deformation fields. A deformation field is a mapping of the correlated pixels from the source frame to the target frame, which directly represents the motion information from the source frame to the target frame. By computing the deformation field between each pair of adjacent frames, we can capture and represent the motion of the face in the video.

For effective lip reading, we use both the computed deformation flow and the raw videos as the input to a two-stream network. The two branches predict the probabilities of each word class independently. To make the two branches exchange information during training, we adopt knowledge distillation, and utilize a bidirectional knowledge distillation loss to help the two branches learn from each other's predictions. At test time, we fuse predictions from both branches to make the final prediction. We observe that even a simple average of the predictions is more accurate than the results of either single branch. This suggests that the two sources of input, the raw video and the deformation flows, provide complementary cues for the lip reading task.

Our contributions are threefold: (a) we propose a Deformation Flow Network (DFN) to generate deformation flows that capture the motion information of faces, which is trained in a self-supervised manner; (b) we use the deformation flows and the raw videos as the inputs to a two-stream network, which provide complementary cues for lip reading, and utilize a bidirectional knowledge distillation loss to train the two branches jointly; (c) we conduct extensive experiments on LRW [4] and LRW-1000 [17], demonstrating the effectiveness of our methods.

* This work was done by Jingyun Xiao during his internship at the Institute of Computing Technology, Chinese Academy of Sciences.
Fig. 1. Overview of the Deformation Flow Based Two-stream Network. [Diagram: 1×88×88×T grayscale inputs pass through the grayscale branch (front-end: 3D-Conv + ResNet-18; back-end: Bi-GRU + FC) to produce 512×1×1×T visual features and probabilities P_g; the DFN encoder-decoder turns the video into 2×88×88×T deformation flow inputs for the second branch (front-end: 2D-Conv + ResNet-18; back-end: Bi-GRU + FC), producing P_d; the two branches are linked by the bidirectional knowledge distillation loss, and their predictions are multiplied into the final prediction P.] Given an input video, we first feed it to the Deformation Flow Network to generate the deformation flow. Then the raw video and the deformation flow are fed into the two branches separately. Each branch predicts the probability of each word class independently. At test time, we fuse the results of each branch to improve classification performance. During training, we propose a bidirectional knowledge distillation loss to enable the two branches to exchange learned knowledge.

II. RELATED WORKS

In this section, we briefly review previous works on deep learning methods for lip reading, as well as self-supervised methods for facial deformation modeling.

A. Deep Learning Methods for Lip Reading

With the rapid development of deep learning in recent years, some works have begun to apply deep learning methods to lip reading and obtained considerable improvements over traditional methods using hand-engineered features. Noda et al. [7] were the first to employ a convolutional neural network (CNN) to extract features for lip reading. Wand et al. [12] used Long Short-Term Memory (LSTM) networks to replace the traditional classifier for lip reading, and achieved considerable improvement. In 2016, Chung et al. [4] proposed an end-to-end lip reading model and compared several strategies of processing the frames for word classification, which laid a solid foundation for subsequent progress in lip reading. Since then, most recent lip reading approaches have followed an end-to-end paradigm. Concurrently, Assael et al. [1] proposed LipNet, the first end-to-end sentence-level lip reading model. In 2017, Stafylakis et al. [10] proposed a new word-level lip reading model that attains 83.0% classification accuracy on the LRW dataset, a significant improvement over prior art. It uses a combination of a single 3D convolution layer, ResNet [5], and bidirectional LSTM networks [6]. The proposed architecture shows strong spatiotemporal modeling power, and successfully copes with many of the in-the-wild variations that LRW presents. Inspired by the success of deep spatiotemporal convolutional networks and two-stream architectures in action recognition, Weng et al. [14] introduced deep spatiotemporal convolutional networks to lip reading. They also employ optical flow and two-stream networks. However, optical flow is hard to obtain, and it costs considerable time and storage. Moreover, most existing optical flow methods are unable to capture the fine-grained motion information of the lip region.

B. Self-supervised Facial Deformation

Recently, there has been a series of works using deformation fields and warping methods for face manipulation, facial attribute learning and other face-related tasks.

The Deforming Autoencoder (DAE) [9] presents an unsupervised method to disentangle the shape (in the form of a deformation field) and appearance (texture information disregarding pose variations) of a face. The learned features are demonstrated to be effective for face manipulation, landmark localization and emotion estimation.

X2Face [16] is a network that can generate face images with a target pose and expression. In the evaluation stage, given a source face and a driving face, the network is able to generate a new face that preserves the identity, appearance, hairstyle and other attributes of the source face, while possessing the pose and expression of the driving face. In the training stage, it uses a pixel-wise L1 loss between the generated frame and the driving frame to supervise the training process. In this way, training requires no annotations.

FAb-Net [15] has a similar architecture to X2Face. However, it aims to learn facial attributes in a self-supervised manner. The learned facial attributes are demonstrated to achieve results comparable to, and even surpassing, features learned by supervised methods in several tasks.

Inspired by these works, we propose the Deformation Flow Network (DFN) to model the lip motion in the speaking process for lip reading, which is also trained in a self-supervised manner.

Fig. 2. The architecture of DFN. It consists of an encoder and a decoder. Given a source frame and a target frame, the encoder encodes them into two feature vectors, v_s and v_t. The decoder takes the concatenation of v_s and v_t as input, and generates a deformation field. The source frame is warped by the deformation field to generate an output frame. A pixel-wise L1 loss between the output frame and the target frame supervises the network effectively. The DFN is trained in a completely self-supervised manner.

Fig. 3. Examples of the source frames, target frames, output frames, deformation flow generated by DFN, and optical flow generated by PWC-Net [11]. The color variations indicate that the deformation flow captures more details of the face than the optical flow.
III. METHODS

In this section, we introduce our Deformation Flow Network (DFN) for generating the deformation flow, the Deformation Flow Based Two-stream Network (DFTN) for word-level lip reading, and the bidirectional knowledge distillation loss for training the two-stream network jointly.

An overview of the pipeline is shown in Fig. 1. Given an input video (i.e., a cropped grayscale image sequence of the lip region), we first feed it to the Deformation Flow Network to generate a series of deformation fields, one for each pair of adjacent frames. The resulting deformation field sequence is the deformation flow of the original video. Next, the grayscale video and the deformation flow are fed into the two branches separately for recognition. The two branches are optimized with individual classification losses, and with a bidirectional knowledge distillation loss, which helps the two branches learn from each other. At test time, we fuse the results of each branch to make the final prediction for the input video.
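As a concrete illustration, the per-adjacent-pair computation can be sketched as follows. This is our own illustrative code, not the authors' implementation: `dfn` stands for the trained network of Sec. III-A, and repeating the last field so the flow has T entries (matching the 2×88×88×T input shape in Fig. 1) is an assumption the paper does not spell out.

```python
import torch

def compute_deformation_flow(frames: torch.Tensor, dfn) -> torch.Tensor:
    """Build the deformation flow of a clip.

    frames: (T, 1, H, W) grayscale frames; dfn(source, target) is assumed
    to return a (1, 2, H, W) deformation field for a single frame pair.
    Returns a (T, 2, H, W) flow; the last field is repeated so the flow
    stays aligned with the T input frames (our assumption).
    """
    fields = [
        dfn(frames[i : i + 1], frames[i + 1 : i + 2])  # field for pair (i, i+1)
        for i in range(frames.shape[0] - 1)
    ]
    fields.append(fields[-1])  # pad from T-1 fields to T entries
    return torch.cat(fields, dim=0)
```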
A. Deformation Flow Network

The architecture of the Deformation Flow Network (DFN) is shown in Fig. 2. The input to the DFN is a pair of frames (i.e., a source frame and a target frame). The output of the DFN is a deformation field, which is a 2-channel map with the same size as the input frames. The DFN consists of an encoder and a decoder. The encoder encodes the source frame s and target frame t into a source vector v_s and a target vector v_t. The decoder takes the concatenation of v_s and v_t as input, and generates a deformation field d, which predicts the relative offsets (δx, δy) for each pixel location (x, y) in the target frame relative to the source frame. An output frame o is generated by sampling from the source frame s with the offsets (δx, δy) of the deformation field d:

o(x, y) = s(x + δx, y + δy)    (1)

The output frame o = D(s, t) is expected to be identical to the target frame t, which can be supervised by a pixel-wise L1 loss between the output frame and the target frame:

L_1 = \frac{1}{n} \sum_{(x,y)} |o(x, y) − t(x, y)|    (2)

Given the above optimization target, the DFN can be trained in a completely self-supervised manner, with no need for any extra manual annotations. Examples of the source frames, target frames and output frames are shown in Fig. 3. It is worth noting that since the deformation field is estimated at the pixel level, it can capture very subtle variations of faces and directly represent the motion information, which means it also has great potential in other face-related tasks beyond lip reading.
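The sampling step of Eq. (1) can be implemented as a differentiable warping operation, so the L1 loss of Eq. (2) back-propagates through the deformation field. Below is a minimal PyTorch sketch of this warping and loss, not the paper's released code; the tensor layout and the normalized-coordinate convention are our own assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_deformation_field(source: torch.Tensor, field: torch.Tensor) -> torch.Tensor:
    """Warp `source` (B, 1, H, W) with `field` (B, 2, H, W) of per-pixel
    offsets (dx, dy), following Eq. (1): o(x, y) = s(x + dx, y + dy)."""
    b, _, h, w = source.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=source.device),
        torch.linspace(-1, 1, w, device=source.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to the same normalized coordinates.
    offsets = torch.stack(
        (field[:, 0] * 2.0 / (w - 1), field[:, 1] * 2.0 / (h - 1)), dim=-1
    )
    # Bilinear sampling from the source frame (differentiable).
    return F.grid_sample(source, base + offsets, align_corners=True)

def reconstruction_loss(source, target, field):
    """Pixel-wise L1 loss of Eq. (2) between the warped output and the target."""
    output = warp_by_deformation_field(source, field)
    return (output - target).abs().mean()
```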
B. Deformation Flow Based Two-stream Network

In this subsection, we introduce the two branches (i.e., the grayscale branch and the deformation flow branch) of the Deformation Flow Based Two-stream Network, as well as the fusion strategy of the two branches, in detail.

Firstly, we introduce the baseline model in this paper. The grayscale branch adopts the widely used architecture proposed by [10], which is a combination of a CNN and an RNN, except that we use Gated Recurrent Units (GRU) [2] instead of LSTMs. Specifically, it consists of a front-end (i.e., a single layer of 3D CNN followed by ResNet-18 [5]) and a back-end (i.e., a 2-layer bidirectional RNN with GRUs). The front-end extracts the visual features for each frame, and outputs a sequence of feature vectors. The back-end decodes the feature sequences, and predicts the probability of each word class. The deformation flow branch mostly mirrors the structure of the grayscale branch. The only difference is that the first layer of this branch is a 2D convolution layer, while it is a 3D convolution layer in the grayscale branch. The detailed architecture is shown in Fig. 5.

Fig. 5. The architecture of the two-stream network for lip reading. [Diagram: each branch stacks a convolution layer (3D-Conv for the grayscale branch, 2D-Conv for the deformation flow branch), ResNet-18, GRU, an FC layer and a softmax layer; each branch has its own classification loss, and the bidirectional knowledge distillation loss connects the two.] The learning process is guided by both the classification losses and the bidirectional knowledge distillation loss.

A massive amount of work on two-stream networks has explored methods to fuse the two branches. In this work, we experimented with different fusion strategies; the results are presented in Sec. IV-D. Among all the strategies, we find that fusing the output probabilities of the two branches gives the best performance.

However, the problem with fusing the predicted probabilities from individual branches is that the two branches are optimized separately, and lack interaction in the training stage. We wish to design a method that can help the two branches exchange the knowledge they learn during the training process. Therefore, we propose the bidirectional knowledge distillation loss.

C. Bidirectional Knowledge Distillation Loss

In this subsection, we introduce the bidirectional knowledge distillation loss as an additional supervision for training the two branches jointly.

Fusion strategies for the two-stream architecture have been widely explored in the field of action recognition. Here, we adopt the method of knowledge distillation. The two branches are able to perform word-level classification as two independent models. The outputs of the fully-connected (FC) layers of the grayscale branch and the deformation branch are denoted as z_g and z_d, respectively. We then obtain the predicted probability distributions over all classes, q_g and q_d, as:

q^{(i)} = \frac{\exp(z^{(i)}/T)}{\sum_j \exp(z^{(j)}/T)},    (3)

where T is a parameter known as temperature. T is usually set to 1 for classification tasks, in which case the equation becomes the standard softmax function. In knowledge distillation, a large T makes the probability distribution q "softer", which is easier for a student network to learn from than a one-hot vector corresponding to the ground truth. In our work, we set T to 20. The knowledge distillation loss is defined as:

L_{KD}(q_t, q_s) = −\sum_{i=1}^{N} q_t^{(i)} \log q_s^{(i)},    (4)

where q_t and q_s denote the soft probability distributions of the teacher network and student network, respectively, and N denotes the number of classes.

Since we expect the two branches to learn from each other, we adopt a bidirectional knowledge distillation loss:

L_{BiKD}(q_g, q_d) = L_{KD}(q_g, q_d) + L_{KD}(q_d, q_g)    (5)

Therefore the final objective function of the two-stream network is:

L = L_{CE}(z_g, y) + L_{CE}(z_d, y) + λ L_{BiKD}(q_g, q_d),    (6)

where L_{CE} represents the standard cross-entropy loss for classification tasks, y is the one-hot vector indicating the word class label of the video, and λ is a hyper-parameter indicating the weight of L_{BiKD}.
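Eqs. (3)–(6) translate directly into a few lines of code. The following is a minimal PyTorch sketch of the joint objective, assuming `logits_g` and `logits_d` are the FC outputs of the two branches; T = 20 and the initial λ = 100 follow the paper, while detaching the teacher side is our own assumption, not a detail the paper specifies.

```python
import torch.nn.functional as F

def kd_loss(teacher_logits, student_logits, T=20.0):
    """Distillation term of Eq. (4) on the temperature-softened
    distributions of Eq. (3). The teacher side is detached so each branch
    learns from, but does not chase, the other's prediction (assumption)."""
    q_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_q_student = F.log_softmax(student_logits / T, dim=-1)
    return -(q_teacher * log_q_student).sum(dim=-1).mean()

def dftn_objective(logits_g, logits_d, labels, lam=100.0, T=20.0):
    """Total loss of Eq. (6): per-branch cross-entropy plus the
    bidirectional distillation term of Eq. (5). lam = 100 is the paper's
    initial weight (see Sec. IV-B); it is halved as training progresses."""
    ce = F.cross_entropy(logits_g, labels) + F.cross_entropy(logits_d, labels)
    bikd = kd_loss(logits_g, logits_d, T) + kd_loss(logits_d, logits_g, T)
    return ce + lam * bikd
```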
IV. EXPERIMENTS

A. Datasets

The proposed methods are evaluated on two large-scale public lip reading datasets, LRW [4] and LRW-1000 [17]. Here we give a brief overview of the two datasets.

LRW. LRW [4] is a large and challenging word-level lip reading dataset. Each sample of LRW is a video snippet of 29 frames captured from BBC programs. The label is the corresponding word class of the video snippet. The dataset has 500 word classes, and each class has around 1000 training samples, 50 validation samples and 50 testing samples. The total duration of LRW is approximately 173 hours. The main challenges of LRW are: (a) the variability of appearance and pose of the speakers, (b) similar word classes such as "benefit" and "benefits", or "allow" and "allowed", which demand strong discriminative power from models, and (c) the target words do not exist in isolation in the videos; rather, they are presented with surrounding context, which requires the model to focus on the correct keyframes.

LRW-1000. LRW-1000 [17] is the first public large-scale Mandarin lip reading dataset. It is a naturally-distributed large-scale benchmark for lip reading in the wild, which contains 1,000 word classes with more than 700,000 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. It is a challenging dataset, marked by the following properties: (a) it contains significant image quality variations such as lighting conditions and scale, as well as variations in speaker attributes such as pose, speech rate, age and make-up, (b) the frequency of each word class is imbalanced, which is consistent with the natural case that some words occur more frequently than others in everyday life, and (c) the samples of the same word are not limited to a constant length range, to allow for modeling of different speech rates. These factors make LRW-1000 a challenging lip reading benchmark with a large lexicon.

TABLE I
EVALUATION OF DIFFERENT INPUTS ON LRW.

    Input                                                 Accuracy (%)
    Grayscale                                             81.91
    Deformation Flow                                      77.24
    Deformation Flow (optimized by classification loss)   79.43
    Optical Flow                                          67.81

B. Implementation Details

Data preprocessing. For both LRW and LRW-1000, we resize the cropped images of the lip region to 96×96 as input. For LRW, we randomly crop the input to 88×88 during training and apply random horizontal flipping. For LRW-1000, we take a central 88×88 crop, and do not apply random flipping.

Network architecture. For the DFN, we employ a ResNet-18 [5] as the encoder, and 7 cascaded pairs of deconvolutional layers and bilinear upsampling layers as the decoder. The encoder yields a 256-dimensional vector for each frame. The decoder takes the concatenation of a source vector v_s and a target vector v_t as input, which is 512-dimensional, and then generates a 2-channel deformation field with the same size as the input frames. The two channels of the deformation field denote the offsets along the x and y axes at each pixel location.

For the lip reading model, as mentioned earlier, we employ ResNet-18 as the front-end and GRUs as the back-end. More specifically, for the grayscale branch, the front-end is a single 3D convolution layer followed by a ResNet-18 network, which yields a 512-dimensional vector for each frame. For the deformation branch, we use a single 2D convolution layer on top of the ResNet-18 network. As for the back-end, we use a 2-layer bidirectional Gated Recurrent Unit (Bi-GRU) RNN with 1024 hidden units to process the sequence of 512-dimensional vectors, one vector per frame.

Training strategies. We use the three-stage training strategy proposed in [10], with the Adam optimizer and default hyperparameters. For LRW, the learning rate is initialized to 0.0001 and halved every time the validation loss stagnates, until the model converges. For LRW-1000, the learning rate is initialized to 0.001. In all of our experiments, when the validation loss stagnates for the first time, we reduce the learning rate of the back-end to 10% of the learning rate of the front-end. This policy works well in alleviating overfitting. As for the weight of the bidirectional knowledge distillation loss, we initialize it to 100, and halve it every time the validation loss stagnates.
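This schedule can be expressed with standard PyTorch parameter groups; the sketch below is our own reading of the description above (the `front_end`/`back_end` arguments are placeholder modules), not the authors' training script.

```python
import torch
import torch.nn as nn

def build_optimizer(front_end: nn.Module, back_end: nn.Module, lr: float = 1e-4):
    """Adam with two parameter groups, so the back-end learning rate can
    later be scaled independently of the front-end (lr=1e-4 is the LRW
    initial value; LRW-1000 starts at 1e-3)."""
    return torch.optim.Adam([
        {"params": front_end.parameters(), "lr": lr},
        {"params": back_end.parameters(), "lr": lr},
    ])

def on_validation_stagnation(optimizer, kd_weight: float, first_stagnation: bool):
    """Applied whenever the validation loss stagnates: halve every learning
    rate and the distillation weight; on the first stagnation, additionally
    drop the back-end LR (group 1) to 10% of the front-end LR (group 0)."""
    for group in optimizer.param_groups:
        group["lr"] *= 0.5
    if first_stagnation:
        optimizer.param_groups[1]["lr"] = 0.1 * optimizer.param_groups[0]["lr"]
    return kd_weight * 0.5
```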
C. Evaluation of DFN

We performed a thorough evaluation of the DFN over several aspects on LRW [4].

Firstly, the source frames, target frames, output frames, and generated deformation fields are shown in Fig. 3. As can be seen, the output frame matches the target frame quite well. Visualization of the deformation field shows a clear separation between the lip region, which carries the motion information we wish to capture, and the neighboring regions. This indicates that the DFN can generate precise deformation fields, which meets our expectation of directly capturing motion in the speakers' faces, especially in the lip region.

Secondly, we studied the reconstruction quality of the output frames qualitatively and quantitatively. As shown in Fig. 4, the DFN is able to reconstruct faces of varying poses by warping the source frames. We randomly chose 2000 pairs of target frames and output frames to evaluate the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index. The average PSNR is 26.86 and the average SSIM index is 0.82, which also demonstrates the effectiveness of our method.

Fig. 4. Examples of source frames, target frames, output frames, and the difference images of the output frames and target frames.

Inspired by the observations in [8], we further experiment with replacing the L1 loss with a classification loss to supervise the DFN. This should help the DFN learn to generate task-specific deformation flows which better suit the lip reading task. Specifically, we freeze the decoder and unfreeze the encoder of the DFN when training the deformation flow branch with the classification loss, after pretraining in the self-supervised manner. As shown in Fig. 6, the action of mouth opening or closing is slightly amplified in the output frames compared with the motion in the target frames. The classification accuracy is also improved, as shown in Table I.

Finally, we compared the DFN with a state-of-the-art optical flow method, PWC-Net [11], on the lip reading task, qualitatively and quantitatively, in the following aspects.

(1) We utilize the pretrained model from [11] to generate the optical flow between adjacent frames in the video, and use the optical flow for lip reading. The generated optical flow and deformation flow are shown in Fig. 3. The comparison shows that the deformation flow reflects more fine-grained details.

(2) We use the deformation flow generated by the DFN and the optical flow generated by PWC-Net as inputs, and evaluate their lip reading performance on LRW respectively. The results are presented in Table I. They indicate that our task-specific deformation flow is more effective for the lip reading task.

(3) We also compared the network complexity (i.e., floating point operations (FLOPs) and the number of parameters) of the DFN and PWC-Net, as shown in Table II. The results show that the computational complexity of the DFN is much lower than that of PWC-Net, which is one of our motivations for proposing the DFN. The greatly reduced complexity makes it possible to use the DFN in real-time applications.
TABLE II
COMPUTATION EXPENSE OF DIFFERENT NETWORKS.

    Network             GFLOPs   # Params
    DFN                 14.5     7.95M
    PWC-Net             635      9.37M
    Lip Reading Model   18.4     40.5M

Fig. 6. The output frames and deformation fields generated by DFN, where the encoder is optimized with the classification loss instead of the L1 loss. The output frames have slight differences from the target frames. According to the views in [8], optical flow learned for action recognition in a task-specific manner differs from traditional optical flow and improves the performance of action recognition. This is also the case with the deformation flow.

Fig. 7. Examples of the inputs of the grayscale branch and the deformation flow branch.
D. Evaluation of DFTN

In this subsection, we present ablation studies of DFTN on LRW and LRW-1000.

Evaluation of each single branch. We pretrained the two branches (i.e., the grayscale branch and the deformation flow branch) of the two-stream network independently. The inputs of the two branches are shown in Fig. 7. The grayscale branch alone is also the baseline model in this paper. The results in terms of recognition accuracy on LRW and LRW-1000 are shown in Table III.

Evaluation of the two-stream network. We fused the probabilities predicted by the two branches to make the final classification of the testing samples. The results are shown in Table III. Empirically, we found that multiplicative fusion, i.e., taking the product of the probabilities, results in higher recognition accuracy than additive fusion, i.e., taking the average of the probabilities of the two branches.

Evaluation of the bidirectional knowledge distillation loss. To make the two branches exchange the learned knowledge and further improve the performance of DFTN, we trained the two-stream network with the bidirectional knowledge distillation loss as an additional supervision. The results are presented in Table III. The bidirectional knowledge distillation not only improves the accuracy of the joint prediction, but also improves the prediction accuracy of each branch when it works independently.

TABLE III
EVALUATION OF DFTN ON LRW AND LRW-1000.

    Method                                  LRW (%)   LRW-1000 (%)
    Grayscale branch (baseline)             81.91     38.56
    Deformation flow branch                 79.43     36.44
    Two-stream                              83.03     41.46
    Grayscale branch (with L_BiKD)          82.93     38.76
    Deformation flow branch (with L_BiKD)   80.85     37.47
    Two-stream (with L_BiKD)                84.13     41.93

Evaluation of different fusion strategies and distillation strategies. To further validate the effectiveness of the bidirectional knowledge distillation loss, we conducted experiments to compare the performance of different fusion strategies and distillation strategies. We experimented with two fusion methods that fuse the intermediate features of the two branches rather than the probabilities:

1) Average the outputs of the FC layers of the two branches, feed the resulting vector to a softmax layer to get the probability distribution, and then compute the cross-entropy loss;
2) Adopt the fusion method in [14], i.e., sum the outputs of the last layers of the ResNets of the two branches, feed the resulting vector to the back-end to get the probability distribution, and then compute the cross-entropy loss.

Besides the above fusion strategies, we also experimented with two unidirectional knowledge distillation strategies to compare with the bidirectional knowledge distillation strategy:

1) Distill knowledge from the grayscale branch to the deformation flow branch.
2) Distill knowledge from the deformation flow branch to the grayscale branch.

TABLE IV
EVALUATION OF DIFFERENT STRATEGIES ON LRW.

    Method                                    Accuracy (%)
    Grayscale branch                          81.91
    Deformation flow branch                   79.43
    Avg (FC)                                  82.13
    Add (Res4)                                82.52
    Mul (probabilities)                       83.03
    Mul (probabilities) (with L_KD(d→g))      82.14
    Mul (probabilities) (with L_KD(g→d))      82.92
    Mul (probabilities) (with L_BiKD)         84.13

The results are presented in Table IV. They indicate that the fusion of the output probabilities performs better than the fusion of the intermediate features of the two branches (mid-fusion). Also, the bidirectional knowledge distillation outperforms unidirectional knowledge distillation with an obvious improvement.
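At test time, the best-performing strategy above amounts to an element-wise product of the two branches' class probabilities. A minimal sketch, with `probs_g` and `probs_d` standing for the softmax outputs of the two branches (our own naming); the renormalization does not change the argmax but keeps the output a distribution:

```python
import torch

def fuse_predictions(probs_g: torch.Tensor, probs_d: torch.Tensor,
                     mode: str = "mul") -> torch.Tensor:
    """Fuse per-class probabilities (B, N) from the two branches.

    'mul' (product) is the best-performing strategy in Table IV;
    'avg' is the additive alternative it is compared against.
    """
    if mode == "mul":
        fused = probs_g * probs_d
        return fused / fused.sum(dim=-1, keepdim=True)  # renormalize
    elif mode == "avg":
        return 0.5 * (probs_g + probs_d)
    raise ValueError(f"unknown fusion mode: {mode}")

# Predicted word class per sample:
# prediction = fuse_predictions(probs_g, probs_d).argmax(dim=-1)
```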
E. Comparison with State-of-the-Art

Comparison with other methods on LRW. We compared our method with other word-level lip reading methods [4], [10], [14] on LRW. The results are presented in Table V. The model in [14] employs deep 3D CNNs and optical flow based two-stream networks, and achieved the previous state-of-the-art performance. Our method outperforms it, and establishes a new state of the art.

TABLE V
COMPARISON WITH OTHER METHODS ON LRW.

    Method                           Accuracy (%)
    Chung16 [4]                      61.10
    Chung17 [3]                      76.20
    Stafylakis17 [10]                83.00
    Stafylakis17 [10] (reproduced)   77.80
    Weng19 [14]                      84.07
    DFTN                             84.13

Comparison with other methods on LRW-1000. We compared our method with other word-level lip reading methods [17], [13] on LRW-1000. The results are presented in Table VI. Our method shows a considerable improvement over all previous methods on LRW-1000 and achieves state-of-the-art performance.

TABLE VI
COMPARISON WITH OTHER METHODS ON LRW-1000.

    Method        Accuracy (%)
    Yang19 [17]   38.19
    Wang19 [13]   36.91
    DFTN          41.93

V. CONCLUSION

In this paper, we propose a Deformation Flow Network (DFN) to generate the deformation flow, a way to model the lip movements in the speaking process as a sequence of deformations over the lip region. Notably, the network is lightweight and trained in a self-supervised manner. To take advantage of the complementary cues provided by the deformation flow and the raw videos, we propose a Deformation Flow Based Two-stream Network (DFTN) for word-level lip reading. Different from previous methods that fuse the features of the two branches, we employ a bidirectional knowledge distillation loss to help the two branches interact with each other and exchange knowledge during training. Finally, we compare our method with other word-level lip reading methods, and show that our method achieves state-of-the-art performance. Our work makes a first attempt to introduce facial deformation to generate a new modality. It provides potential applications and possibilities not only for lip reading, but also for other face analysis tasks.

VI. ACKNOWLEDGMENTS

This work is partially supported by the National Key R&D Program of China (No. 2017YFA0700804) and the National Natural Science Foundation of China (No. 61702486, 61876171).

REFERENCES

[1] Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
[2] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[3] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3444–3453. IEEE, 2017.
[4] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, pages 87–103. Springer, 2016.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[7] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using convolutional neural network. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[8] L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, and M. J. Black. On the integration of optical flow and action recognition. In German Conference on Pattern Recognition, pages 281–297. Springer, 2018.
[9] Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 650–665, 2018.
[10] T. Stafylakis and G. Tzimiropoulos. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105, 2017.
[11] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
[12] M. Wand, J. Koutník, and J. Schmidhuber. Lipreading with long short-term memory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6115–6119. IEEE, 2016.
[13] C. Wang. Multi-grained spatio-temporal modeling for lip-reading. arXiv preprint arXiv:1908.11618, 2019.
[14] X. Weng and K. Kitani. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading. arXiv preprint arXiv:1905.02540, 2019.
[15] O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. arXiv preprint arXiv:1808.06882, 2018.
[16] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2Face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–686, 2018.
[17] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
