2nd International Conference on “Advancement in Electronics & Communication
Engineering (AECE 2022) July 14-15, 2022
Video Transcription and Summarization using NLP
Khushi Porwal*, Harshit Srivastava**, Ritik Gupta***, Shivesh Pratap Mall****, Nidhi Gupta*****
* (Department of Computer Science and Engineering, Student, Raj Kumar Goel Institute of Technology, Ghaziabad-201003, [email protected])
** (Department of Computer Science and Engineering, Student, Raj Kumar Goel Institute of Technology, Ghaziabad-201003, [email protected])
*** (Department of Computer Science and Engineering, Student, Raj Kumar Goel Institute of Technology, Ghaziabad-201003, [email protected])
**** (Department of Computer Science and Engineering, Student, Raj Kumar Goel Institute of Technology, Ghaziabad-201003, [email protected])
***** (Department of Computer Science and Engineering, Assistant Professor, Raj Kumar Goel Institute of Technology, Ghaziabad-201003, [email protected])
I. Introduction

The fields of artificial intelligence and natural language processing are growing exponentially, as are their applications. This has helped fabricate many tools for expediting complex calculations, data extraction, and exploration. In this fast-paced world where time is invaluable, people do not have time to watch long videos on the internet. To avoid investing so much time, they instead watch videos at twice the speed to get an impression of what is explained in them.

The internet is flooded with audio and video content, and the list keeps growing every second. In addition to this content, online meetings and events have become a daily routine since the pandemic. It is clear that people can miss one of these meetings or lectures because of busy work, time conflicts, or other time constraints. To solve this problem, we need a set of tools that can not only convert this audio-video content into text but also qualitatively summarize it without changing its meaning. This saves a lot of time, and the extracted text can be used in different ways. [4]

Text summaries themselves will be very useful products that will probably be used every day in the near future, as similar applications can be found in many areas where only summaries of audio or video content are needed.

II. Related Work

2.1 Amazon Transcribe
Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert speech into text. With Amazon Transcribe, you can record voice input, create easy-to-read transcripts, improve accuracy with language adaptation, and filter content to ensure customer privacy. Practical use cases include transcription and analysis of customer-agent calls and creation of video captions.
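As a brief illustration of how such a service is driven programmatically, the following is a minimal sketch of submitting a transcription job with the boto3 SDK. The bucket, file, and job names are placeholders, and this service is not part of the pipeline described in Section III.

import boto3

# Hypothetical example: submit an audio file already uploaded to S3 for transcription.
transcribe = boto3.client("transcribe", region_name="us-east-1")
transcribe.start_transcription_job(
    TranscriptionJobName="lecture-transcript-demo",            # placeholder job name
    Media={"MediaFileUri": "s3://example-bucket/lecture.mp3"},  # placeholder S3 location
    MediaFormat="mp3",
    LanguageCode="en-US",
)
# The finished transcript can later be retrieved with get_transcription_job().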
2.2 Google Transcribe

Live Transcribe is a real-time captioning smartphone application developed by Google. It takes speech and turns it into real-time captions using only the phone's microphone. It also permits two-way verbal exchange through a type-back keyboard for users who cannot or do not wish to speak.

III. Methodology

3.1 Transcript Generation

The transcript of a video is generated in a series of steps that convert the video into audio and then into text. This is made possible using the following Python modules and libraries.

3.1.1 Moviepy

Moviepy is a well-known Python module for editing videos. Using this library, videos can be cut, concatenated, composited, and modified with custom effects; titles can also be inserted. It is compatible with all common video formats, including GIFs. ffmpeg is the multimedia framework required by moviepy to function; it ships with ffplay, a simple media player, and ffprobe, a command-line tool for inspecting media files. Moviepy is used here to convert the .mp4 video into the corresponding .mp3 audio file. It is installed with: pip install moviepy.
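A minimal sketch of this conversion step is given below; the file names are placeholders.

from moviepy.editor import VideoFileClip

# Load the input video and write its audio track out as an .mp3 file.
video = VideoFileClip("input_video.mp4")
video.audio.write_audiofile("extracted_audio.mp3")
video.close()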
3.1.2 Pydub

This Python library works natively with .wav audio files. It is used to play and edit audio files and to perform similar operations on them. It is installed with: pip install pydub. AudioSegment (pydub.AudioSegment) is the wrapper class used here to read the .mp3 audio and convert it into a corresponding .wav audio file.
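A short sketch of this format conversion, again with placeholder file names:

from pydub import AudioSegment

# Read the .mp3 produced by moviepy and export it as .wav for speech recognition.
audio = AudioSegment.from_mp3("extracted_audio.mp3")
audio.export("extracted_audio.wav", format="wav")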
3.1.3 Transcription

Transcription of a video means extracting text from the video file or, more precisely, producing a textual form of the audio in that video. The process starts from an audio file in .wav format, which enables us to perform operations on it. Pydub provides the function split_on_silence(), which splits the audio into small chunks based on the silences found in the file. A folder is created for these chunks so that they can be fetched easily when they need to be converted to text. A loop is then run over the chunks to process every part of the original audio; this loop is the heart of the whole process. Each chunk obtained is converted and exported into its respective .wav file. An instance of speech_recognition's Recognizer class is created to recognize speech from an audio source. Inside the loop, every chunk is passed to the record() function, which reads the file into an AudioData instance. The most important step follows: the AudioData is passed to recognize_google(), which returns the extracted text. This is repeated for every chunk and, finally, the concatenated text is returned as the transcript of the video.
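A minimal sketch of this chunk-and-recognize loop is shown below; the silence thresholds, folder name, and file names are illustrative.

import os
import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Split the .wav audio on silences, then recognize each chunk with the
# Google Web Speech API via the speech_recognition library.
recognizer = sr.Recognizer()
audio = AudioSegment.from_wav("extracted_audio.wav")
chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=audio.dBFS - 14)

os.makedirs("audio_chunks", exist_ok=True)       # folder that holds the exported chunks
transcript = ""
for i, chunk in enumerate(chunks):
    chunk_path = os.path.join("audio_chunks", f"chunk{i}.wav")
    chunk.export(chunk_path, format="wav")       # export each chunk as its own .wav file
    with sr.AudioFile(chunk_path) as source:
        audio_data = recognizer.record(source)   # read the chunk into an AudioData instance
    try:
        transcript += recognizer.recognize_google(audio_data) + ". "
    except sr.UnknownValueError:
        pass                                     # skip chunks with no recognizable speech

print(transcript)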
3.2 Text Summarization

Text summarization is one of the applications of Natural Language Processing (NLP) that can have a huge impact on our ever-growing virtual life. The Covid-19 pandemic has made virtual meetings and webinars an integral part of everyone's life. Text summarization can make it easy to get a summary of these hours-long meetings and webinars. For webinars and instructional programs where captions are not available, speech recognition may be performed on the audio to obtain the transcript.
Once the text corresponding to the sequence is available, one can apply text summarization techniques to obtain a summary.

The techniques used in text summarization can be divided into two groups:

1. Statistical analysis based on information-retrieval techniques: In this approach, a subset of existing words, phrases, or sentences in the original text is selected to form the summary. The sentences are ranked based on various features, and the final summary includes a few top-ranked sentences.

2. Natural Language Processing (NLP) analysis based on information-extraction techniques: In this technique we make use of artificial intelligence, performing a detailed semantic analysis of the source text to build a source representation designed for a particular application. [2] A summary representation is then formed from this source representation, and the output summary text is synthesized from it. The generated summaries contain new phrases and sentences that may not appear in the source text.
3.2.1 Text Rank Algorithm

TextRank is a text summarization technique used in Natural Language Processing to generate document summaries. It uses an extractive approach and is an unsupervised, graph-based text summarization technique: sentences are treated as graph nodes, edges are weighted by sentence similarity, and the most central sentences are selected for the summary.
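The paper does not give an implementation; the following is a minimal extractive sketch in the spirit of TextRank, using TF-IDF cosine similarity between sentences and PageRank for ranking. The library choices and parameters are ours, not prescribed by the paper.

import networkx as nx
from nltk.tokenize import sent_tokenize                    # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, num_sentences: int = 3) -> str:
    """Return the top-ranked sentences of `text` as an extractive summary."""
    sentences = sent_tokenize(text)
    if len(sentences) <= num_sentences:
        return text
    # Build a sentence-similarity graph from TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)                            # rank sentences by graph centrality
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))     # keep original sentence order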
Fig 1: Working Flow of Video Summarization System

IV. Result and Analysis

The system takes the video from either YouTube or the local system and divides it into several frame-based audio chunks; the audio chunks are further divided into tokens, and each token is then converted to text using a machine-learning Hugging Face model.
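The paper does not specify which Hugging Face model is used; as a hedged sketch, the default transformers summarization pipeline can be applied to the transcript. The model choice, file path, and length limits below are our assumptions.

from transformers import pipeline

# Hypothetical usage: summarize the transcript produced in Section 3.1.3.
# Long transcripts usually exceed the model's input limit and must be summarized in chunks.
summarizer = pipeline("summarization")
with open("transcript.txt") as f:                # placeholder path to the saved transcript
    transcript_text = f.read()
summary = summarizer(transcript_text, max_length=150, min_length=40, do_sample=False)
print(summary[0]["summary_text"])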
Table 1: Metric table of results

S.No. | Time Duration of Video | Total Size | Summary Time
1 | 0 min 54 sec | 1.28 MB | 30 sec
2 | 2 min 13 sec | 3.81 MB | 1 min
3 | 2 min 17 sec | 5.34 MB | 1.25 min
4 | 7 min 17 sec | 14.2 MB | 1.34 min
5 | 30 min 21 sec | 43.23 MB | 1.5 min

Table 1 shows the results of the proposed algorithm used to obtain the video summarization. The algorithm gives less than 5 seconds of error in the output video for the given input.

Table 2: Metric table of processing time and memory usage
S.No. | Total Time of Video | Summary Time Requested by User | Memory Usage
1 | 0 min 54 sec | 1 min | 22 MB
2 | 2 min 13 sec | 3 min | 72 MB
3 | 2 min 17 sec | 3 min | 72 MB
4 | 7 min 17 sec | 5 min | 102 MB
5 | 30 min 21 sec | 7 min | 177 MB

Table 2 shows the metric table of results with memory usage and processing time. Here, the memory usage and processing time depend on the total time of the input video and the summary time requested by the user.

V. Conclusion

A huge number of video recordings is available on the Internet, and it has become very difficult to spend time watching them. The increase in video content on the internet requires an efficient way of representing videos. Summarizing the transcripts of videos allows us to quickly look for the important content in a video and helps us save time. Our video transcription model is ideal for indexing or subtitling video and/or multi-speaker content and uses machine learning technology.

We propose an algorithm to automatically summarize video programs. We use concepts from text summarization, applied to transcripts derived using automatic speech recognition. We also use temporal analysis of pauses between words to detect sentence boundaries. We have shown that the dominant word pair selection algorithm works well in identifying main topics in video speech transcripts. The problem of deriving good evaluation schemes for automatically generated video summaries is still a complex and open problem.

References

[1] Sanjana R, Sai Gagana V, Vedavathi K R, and Kiran K N. Video Summarization using NLP. International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, vol. 08, issue 08, August 2021.

[2] Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. In: Association for the Advancement of Artificial Intelligence (AAAI), 2018.

[3] Mrigank Rochan, Linwei Ye, and Yang Wang. Video Summarization Using Fully Convolutional Sequence Networks. In: International Conference on Learning Representations, 2018.

[4] Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Naokazu Yokoya. Video Summarization using Deep Semantic Features. In: Proc. Advances in Neural Information Processing Systems (NIPS), 2016.

[5] Wang F. and Ngo C.W. Rushes video summarization by object and event understanding. In: TRECVID Workshop on Rushes Summarization at ACM Multimedia Conference, September 2007.

[6] You J., Liu G., Sun L., and Li H. A multiple visual models based perceptive analysis framework for multilevel video summarization. IEEE Trans. Circuits Syst. Video Tech., 17(3), 2007.

[7] Ngo C.W., Ma Y.F., and Zhang H.J. Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Tech., 15(2):296–305, 2005.
[8] Xu C., Shao X., Maddage N.C., and Kankanhalli M.S. Automatic music video summarization based on audiovisual-text analysis and alignment. In: Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.

[9] Duan L.Y., Xu M., Chua T.S., Tian Q., and Xu C. A Mid-Level Representation Framework for Semantic Sports Video Analysis. In: Proc. 11th ACM Int. Conf. on Multimedia, 2003.

[10] Ferman A.M. and Tekalp A.M. Two-stage hierarchical video summary extraction to match low-level user browsing preferences. IEEE Trans. Multimedia, 5(2):244–256, 2003.

[11] P. Sushma, S. Nagaprasad, and V. Ajantha Devi. YouTube: Big Data Analytics using Hadoop and Map Reduce. International Journal of Engineering Research in Computer Science and Engineering (IJERCSE), vol. 5, issue 4, April 2018.

[12] J. Oh and K. A. Hua, "An efficient technique for summarizing videos using visual contents," Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2000), July 30-August 2, 2000, New York, NY.

[13] S. Pfeiffer, R. Lienhart, S. Fischer, and W. Effelsberg, "Abstracting digital movies automatically," Journal of Visual Communication and Image Processing, vol. 7, no. 4, pp. 345–353, December 1996.

[14] R. Lienhart, "Dynamic video summarization of home video," Proceedings of SPIE Conference on Storage and Retrieval for Media Databases 2000, vol. 3972, January 2000, San Jose, CA, pp. 378–389.

[15] L. He, E. Sanocki, A. Gupta, and J. Grudin, "Auto-summarization of audio-video presentations," Proceedings of the 7th ACM International Multimedia Conference, 30 October - 5 November 1999, Orlando, FL, pp. 489–498.

[16] K. Ratakonda, I. M. Sezan, and R. J. Crinon, "Hierarchical video summarization," Proceedings of SPIE Conference on Visual Communications and Image Processing, vol. 3653, January 1999, San Jose, CA, pp. 1531–1541.

[17] S. M. Iacob, R. L. Lagendijk, and M. E. Iacob, "Video abstraction based on asymmetric similarity values," Proceedings of SPIE Conference on Multimedia Storage and Archiving Systems IV, vol. 3846, September 1999, Boston, MA, pp. 181–191.

[18] L. Agnihotri, K. V. Devera, T. McGee, and N. Dimitrova, "Summarization of video programs based on closed captions," Proceedings of SPIE Conference on Storage and Retrieval for Media Databases 2001, vol. 4315, January 2001, San Jose, CA, pp. 599–607.

[19] M. A. Smith and T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 1997, San Juan, PR.

[20] M. Christel, A. G. Hauptman, A. S. Warmack, and S. A. Crosby, "Adjustable filmstrips and skims as abstractions for a digital video library," Proceedings of the IEEE Conference on Advances in Digital Libraries, May 19-21, 1999, Baltimore, MD.