INSTITUTE OF ENGINEERING
PURWANCHAL CAMPUS
BY
Angal Dahal(PUR076BCT010)
Kajal Singh(PUR076BCT037)
Kushal Acharya(PUR076BCT042)
Maheshwar Prasad Bhatt(PUR076BCT045)
March, 2024
“NEPALI DOCUMENT SUMMARIZER USING mT5”
By
Angal Dahal(PUR076BCT010)
Kajal Singh(PUR076BCT037)
Kushal Acharya(PUR076BCT042)
Maheshwar Prasad Bhatt(PUR076BCT045)
Project Supervisor
Assoc. Prof. Binaylal Shrestha
March, 2024
COPYRIGHT©
The author has agreed that the Library, Department of Electronics and Computer Engineering, Purwanchal Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisor(s) who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Purwanchal Campus, Institute of Engineering in any use of the material of this report. Copying, publication, or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Purwanchal Campus, Institute of Engineering and the author's written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Purwanchal Campus, Institute of Engineering
Dharan, Sunsari
Nepal
DECLARATION
We declare that the work hereby submitted for the Bachelor of Engineering in Computer Engineering at Institute of Engineering, Purwanchal Campus entitled “NEPALI DOCUMENT SUMMARIZER USING mT5” is our own work and has not been previously submitted by us at any university for any academic award.
Angal Dahal(PUR076/BCT/010)
Kajal Singh(PUR076/BCT/037)
Kushal Acharya(PUR076/BCT/042)
Maheshwar Prasad Bhatt(PUR076/BCT/045)
March, 2024
RECOMMENDATION
The undersigned certify that they have read and recommended to the Department of
Electronics and Computer Engineering for acceptance, a project entitled “Nepali Document Summarizer using mT5”, submitted by Angal Dahal, Kajal Singh, Kushal Acharya, and Maheshwar Prasad Bhatt in partial fulfillment of the requirements for the award of the degree of “Bachelor of Engineering in Computer Engineering”.
..........................................................................
Assoc. Prof. Binaylal Shrestha
Supervisor
Department of Electronics and Computer Engineering
Purwanchal Campus, Institute of Engineering, Tribhuvan University
..........................................................................
Assoc. Prof. Surendra Shrestha, (PhD)
External Examiner
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering, Tribhuvan University
..........................................................................
Asst. Prof. Pravin Sangroula
Head of Department
Department of Electronics and Computer Engineering
Purwanchal Campus, Institute of Engineering, Tribhuvan University
DEPARTMENTAL ACCEPTANCE GOES HERE
ACKNOWLEDGEMENT
We would like to extend our heartfelt gratitude to the respective HOD of the DE-
PARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING Mr. Pravin
Sangroula and all the teachers of this department for granting us the opportunity to
do the major project on “Nepali Document Summarizer”. Their unwavering support,
guidance, and encouragement have been invaluable in shaping this endeavor. First and
foremost, we would like to express our sincere appreciation to our project cluster head,
Mr. Binaylal Shrestha his expertise, valuable insights, and continuous support will play
a crucial role in the development and execution of this project which will undoubt-
edly lead us toward successful outcomes. Moreover, we are thankful to the participants
who have willingly agreed to provide the necessary data for training our “Nepali Doc-
ument Summarizer”. Their involvement and support will be integral to the success of
our project. In conclusion, we are deeply grateful to all the individuals (friends, expert
seniors) who have helped us in taking on this project. Their collective efforts and con-
tributions will drive the successful completion of this endeavor. Thank you all for your
continued support guidance, and encouragement.
Angal Dahal(PUR076BCT010)
Kajal Singh(PUR076BCT037)
Kushal Acharya(PUR076BCT042)
Maheshwar Prasad Bhatt(PUR076BCT045)
ABSTRACT
This project focuses on developing and evaluating a Nepali Document Summarizer using advanced Natural Language Processing techniques. The goal is to automatically generate concise summaries from diverse Nepali documents and articles, addressing the common challenge of information overload. By sourcing data from various Nepali documents and articles and following an iterative development approach, including the addition of a document upload feature, the Summarizer demonstrates commendable performance in distilling complex information into succinct summaries. Evaluation with the ROUGE metric validates its effectiveness, with consistently strong scores. In conclusion, the project successfully addresses the challenges of summarizing Nepali-language documents, offering a valuable tool for efficient information extraction and contributing to more accessible and time-efficient information consumption.
TABLE OF CONTENTS
COPYRIGHT iii
DECLARATION iv
RECOMMENDATION v
DEPARTMENTAL ACCEPTANCE vi
ACKNOWLEDGEMENT vii
ABSTRACT viii
LIST OF FIGURES xi
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 LITERATURE REVIEW 3
3 METHODOLOGY 5
3.6.1 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.6.2 T5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.6.3 mT5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
REFERENCES 18
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 Background
Text summarization has evolved significantly, driven by advancements in NLP and machine learning techniques. Early methods relied on handcrafted rules but struggled with nuanced meaning. Later, transformers, introduced in Vaswani et al.'s “Attention Is All You Need”, revolutionized NLP by improving parallelization and capturing long-range dependencies. Models like BERT, GPT-2, GPT-3, and T5 further enhanced text summarization. Several attempts have been made to summarize Nepali text, with scholars achieving good results using pre-trained transformer models.
1.2 Problem Statement
The problem addressed by the ‘Document Summarizer’ task is to develop a summarization model and a web app that delivers the model's results to the end user. The task involves a large corpus of Nepali texts and their summarized forms. The goal is to achieve a good summarization model with a ROUGE score comparable to the current benchmark.
1.3 Objectives
• To summarize a given text while comprehending its meaning, with a ROUGE-L score close to 0.443.
1.4 Scope
CHAPTER 2
LITERATURE REVIEW
In 1958, research on summarizing articles and papers used statistical information derived from word frequency and distribution, which was then fed to a machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance were extracted and printed out to become the “auto-abstract” [5]. This was an early phase of summarization work in the field of NLP, which used manual heuristics and a mathematical model to calculate the significance of a word in a sentence and how well it captures the overall meaning of the text, generating an extractive summary.
There have been two main approaches to summarization, namely extractive and abstractive. In either case, the use of neural networks requires a minimum of preprocessing and relieves the need for manual feature engineering. Some studies argue that extractive summaries are more interpretable and are expected to give better results than abstractive summaries [8]. Other studies suggest that abstractive summarizers perform better overall; their results indicate that the margin by which abstraction outperforms extraction is greater when controversiality is high, providing a context in which the need for generation-based methods is especially great [1]. This suggests that the effectiveness of extractive and abstractive summarization depends on the context provided.
LSTM cells are also useful for text summarization: the GloVe algorithm is used to build embeddings that capture features of a Nepali word corpus, which are fed to LSTM cells, and ROUGE-1 and ROUGE-2 are calculated on the test set against human-generated summaries. LSTM cells capture features within a corpus very well. An extractive method is used to select the sentences with the greatest weight [4].
The transformer architecture, introduced by Vaswani et al. in 2017, marked a significant milestone in NLP. Transformers rely on self-attention mechanisms, which allow the model to weigh the importance of different words in a sequence when encoding or decoding, capturing long-range dependencies more effectively than traditional RNNs and CNNs [7]. This architecture led to the development of models like BERT, GPT, and T5, each pre-trained on large text corpora and fine-tuned for specific NLP tasks.
Leveraging neural networks like BERT, pre-trained models are fine-tuned for summarization tasks. While news-domain data dominates training, the study investigates the adaptability of these models to academic texts. Robustly pre-trained models show promise in generalization, yet human evaluation underscores the necessity for improved assessment methods and metrics, revealing potential discrepancies between metric-based and human-perceived quality in text summarization [3].
CHAPTER 3
METHODOLOGY
The system consists of a web client (browser) with which the user interacts to access the features of the model. A WSGI application handles the routing of each HTTP request to the proper Python function, which contains the logic to preprocess the input; in our case this involves parsing the text from the document. The parsed text is then fed to the model, and the output summary is returned to the user via the same WSGI application, where the browser presents it to the user.
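As an illustration of this serving flow, below is a minimal sketch of such a WSGI application using Flask. The route name, form fields, and the summarize() helper are assumptions for illustration, not the project's exact implementation.

from flask import Flask, request

app = Flask(__name__)

def summarize(text: str) -> str:
    # Stand-in for the fine-tuned mT5 model described later in this chapter.
    return text[:200]

@app.route("/summarize", methods=["POST"])
def summarize_route():
    # The browser sends the document (or raw text) with the HTTP request;
    # the WSGI layer routes it here, where the input is parsed and preprocessed.
    text = request.form.get("text", "")
    if "document" in request.files:
        text = request.files["document"].read().decode("utf-8")
    summary = summarize(text)       # feed the parsed text to the model
    return {"summary": summary}     # returned to the browser for display

if __name__ == "__main__":
    app.run()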
The use case diagram of the system is given in the figure below. Various stages have to be performed to achieve automatic text summarization for Nepali documents.
Figure 3.2: Use case diagram of the model
For model fine-tuning and testing we used a publicly available dataset from Hugging Face. The dataset contains around 15,580 training examples and around 1,732 test examples in the DatasetDict format defined by Hugging Face. Each example consists of an article averaging about 1,000 tokens and its summary averaging about 270 tokens.
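For illustration, loading such a dataset looks like the following sketch; the dataset identifier is a placeholder, since the exact Hugging Face dataset id is not restated here.

# Minimal sketch of loading the Hugging Face DatasetDict used for fine-tuning.
from datasets import load_dataset

dataset = load_dataset("username/nepali-summarization")  # placeholder dataset id
print(dataset)                   # DatasetDict with "train" and "test" splits
print(len(dataset["train"]))     # ~15,580 training examples
print(len(dataset["test"]))      # ~1,732 test examples
print(dataset["train"][0])       # one article/summary pair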
For the core task of Nepali text summarization we selected mT5-base as the base model to fine-tune. mT5 is a multilingual variant of the T5 model, which generalizes every NLP task as a text-to-text transformation, providing adaptability and a unified approach.
Text preprocessing involved removing HTML tags and English-language advertisements. A prefix specific to the summarization task was added to the text. Since the model needs to understand the syntactic and semantic context of the sentence, we kept the punctuation and stop words as they are.
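A minimal sketch of this preprocessing is shown below; the regular expressions and the prefix string are illustrative assumptions rather than the exact rules used.

# Strip HTML tags and long English runs (advertisements), then add a task prefix.
import re

PREFIX = "summarize: "

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                     # remove HTML tags
    text = re.sub(r"(?:[A-Za-z]+[\s.,!?]*){4,}", " ", text)   # drop long English runs (ads)
    text = re.sub(r"\s+", " ", text).strip()                  # normalize whitespace
    # Punctuation and stop words are kept so the model sees full syntactic context.
    return PREFIX + text

print(preprocess("<p>काठमाडौं। Click here to subscribe for daily updates! आज मौसम सफा छ।</p>"))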
3.6 Model Architecture
3.6.1 Transformer
The Transformer architecture follows an encoder-decoder structure but does not rely on
recurrence and convolutions in order to generate an output.
In short, the task of the encoder is to map an input sequence to a sequence of continuous representations, which is then fed into the decoder. The decoder receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next [7].
Encoder
The encoder consists of a stack of N = 6 identical layers, each of which is composed of two sublayers:
1. The first sublayer implements a multi-head self-attention mechanism.
2. The second sublayer is a fully connected feed-forward network applied to each position.
The six layers of the Transformer encoder apply the same linear transformations to all the words in the input sequence, but each layer employs different weight (W1, W2) and bias (b1, b2) parameters to do so. Each sublayer is also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed between the sublayer input x and the output generated by the sublayer itself, sublayer(x), i.e. layernorm(x + sublayer(x)).
The positional encoding vectors are of the same dimension as the input embeddings and
are generated using sine and cosine functions of different frequencies. Then, they are
simply summed to the input embeddings in order to inject the positional information.
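For reference, the standard sinusoidal encodings of Vaswani et al. [7] (not spelled out in this report) are

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\]

where pos is the token position and i indexes the embedding dimension.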
Decoder
The decoder shares several similarities with the encoder. The decoder also consists of a
stack of N = 6 identical layers that are each composed of three sublayers:
1. The first sublayer receives the previous output of the decoder stack, augments it
with positional information, and implements multi-head self-attention over it. While
the encoder is designed to attend to all words in the input sequence regardless of their
position in the sequence, the decoder is modified to attend only to the preceding words.
Hence, the prediction for a word at position i can only depend on the known outputs for
the words that come before it in the sequence. In the multi-head attention mechanism
(which implements multiple single-attention functions in parallel), this is achieved by introducing a mask over the values produced by the scaled multiplication of the matrices Q and K. This masking is implemented by suppressing the matrix values that would otherwise correspond to illegal connections (a small code sketch of this masking is given at the end of this subsection):
\[
\text{mask}(QK^{T}) =
\text{mask}\begin{pmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{pmatrix}
=
\begin{pmatrix}
e_{11} & -\infty & \dots & -\infty \\
e_{21} & e_{22} & \dots & -\infty \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{pmatrix}
\tag{3.3}
\]
2. The second sublayer implements a multi-head attention mechanism similar to the one in the encoder, except that it takes its queries from the previous decoder sublayer and its keys and values from the output of the encoder, allowing the decoder to attend to all words in the input sequence.
3. The third sublayer implements a fully connected feed-forward network, similar to the one implemented in the second sublayer of the encoder.
Furthermore, the three sublayers on the decoder side also have residual connections
around them and are succeeded by a normalization layer. Positional encodings are
also added to the input embeddings of the decoder in the same manner as previously
explained for the encoder.
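To make the masking in the first decoder sublayer concrete, the following is a minimal PyTorch sketch (an illustration, not the exact implementation) of applying the causal mask of Equation (3.3) to the attention scores QKᵀ:

# Causal masking: position i cannot attend to any later position j > i.
import torch

def causal_mask(scores: torch.Tensor) -> torch.Tensor:
    m, n = scores.shape[-2], scores.shape[-1]
    # Entries above the main diagonal correspond to "illegal" future connections.
    future = torch.triu(torch.ones(m, n), diagonal=1).bool()
    return scores.masked_fill(future, float("-inf"))

scores = torch.randn(5, 5)                             # e.g. QK^T for a 5-token sequence
weights = torch.softmax(causal_mask(scores), dim=-1)   # future positions get weight 0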
3.6.2 T5
3.6.3 mT5
The objective of the mT5 model was to follow the T5 recipe as closely as possible, specifically the ”T5.1.1” recipe. One of the most important distinctions is the use of a “line length filter” that requires pages to contain at least three lines of text with 200 or more characters. The sampling strategy for pre-training multilingual models involves balancing the representation of languages. Boosting lower-resource languages by sampling according to a probability function helps prevent overfitting or underfitting. The hyperparameter α controls the degree of boosting, with values like 0.3 striking a balance between high- and low-resource language performance [9].
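Concretely, following [9], pre-training examples are sampled from a language L with probability

\[
p(L) \propto |L|^{\alpha},
\]

where |L| is the number of examples available in language L and the exponent α (0.3 in mT5) controls how strongly lower-resource languages are boosted.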
Because Nepali made up a very low percentage (0.69%) of the dataset used for training, the model suffered heavily from “accidental translation”. To address this, we followed a language-specific tokenization approach, which means creating a tokenizer for a specific language or a subset of languages rather than a general multilingual one. Following this approach, we redefined the tokenizer to include only English and Nepali tokens. Since we work exclusively with those texts, this completely avoids cross-lingual errors, improving the model's performance. It also reduced the model size significantly.
Fine-tuning the pre-trained multilingual model on a Nepali dataset using Hugging Face's prebuilt trainer involved first loading the pre-trained model from the Hugging Face model hub and preparing the Nepali dataset for fine-tuning by tokenizing the text data; batch tokenization was used for this task. Next, the trainer was configured with specific hyperparameters such as batch size, weight decay, and learning rate. The prebuilt trainer then facilitated the fine-tuning process, iterating through the dataset to adjust the model's parameters and minimize the loss function. After fine-tuning, the model's performance was evaluated on a separate validation dataset to assess its adaptation to the Nepali language. Additional fine-tuning iterations or adjustments to hyperparameters were conducted based on the evaluation results.
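A minimal sketch of this setup with Hugging Face's prebuilt Seq2SeqTrainer is shown below. The dataset id, column names ("article", "summary"), prefix, and hyperparameter values are illustrative assumptions, not the exact values used in this project.

from datasets import load_dataset
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

dataset = load_dataset("username/nepali-summarization")  # placeholder dataset id

def preprocess(batch):
    # Batch tokenization: prefix the inputs, then tokenize articles and summaries.
    inputs = ["summarize: " + text for text in batch["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-nepali-summarizer",
    per_device_train_batch_size=4,   # batch size (assumed value)
    learning_rate=3e-4,              # learning rate (assumed value)
    weight_decay=0.01,               # weight decay (assumed value)
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()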
For summary generation in mT5, encoding entails tokenizing the input text into subword units using our tokenizer and passing it through multiple layers of self-attention and feed-forward neural networks in the encoder to capture contextual information. During decoding, the model generates the summary token by token, with the input to the decoder including the encoded representations from the encoder and a task-specific prefix indicating the summarization task. The decoder, consisting of multiple layers of self-attention and feed-forward networks, attends to the encoded input representations and previously generated tokens to predict the next token in the summary sequence, with cross-attention mechanisms enabling the decoder to incorporate relevant information from the input text encoded by the encoder. This iterative process continues until an end-of-sequence token is generated or a maximum summary length is reached, resulting in accurate and concise summaries of the input text.
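A minimal sketch of this generation step is given below; the checkpoint path, prefix, and decoding parameters (beam size, lengths) are assumptions for illustration.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

checkpoint = "mt5-nepali-summarizer"          # path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MT5ForConditionalGeneration.from_pretrained(checkpoint)

def summarize(text: str, max_summary_tokens: int = 256) -> str:
    # Prepend the task-specific prefix and encode the document.
    inputs = tokenizer("summarize: " + text, return_tensors="pt",
                       max_length=1024, truncation=True)
    # Decode token by token until EOS or the maximum length is reached.
    output_ids = model.generate(**inputs, max_length=max_summary_tokens,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(summarize("नेपाली समाचार लेखको पाठ यहाँ राख्नुहोस् ..."))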
CHAPTER 4
The graph plots the loss against the number of training steps. The blue curve represents the training loss and the yellow curve represents the validation loss.
The graph shows that the model starts with high training and validation loss, which then gradually decreases, indicating the model's ability to learn from the dataset and optimize its weights and biases. The training and validation losses follow the same pattern: both take a large dive during the initial steps and settle down to a base value as training moves forward.
The table below shows the ROUGE-1, ROUGE-2 and ROUGE-L scores for our model.

Metric     Recall     Precision   F1 Score
ROUGE-1    0.635170   0.372573    0.464093
ROUGE-2    0.442052   0.257207    0.320759
ROUGE-L    0.566153   0.332508    0.414010
The ROUGE scores of our model were comparable to the scores achieved by other similar research [4]. These ROUGE scores indicate the degree of unigram, bigram, and longest-common-subsequence overlap between the generated summaries and the reference summaries. They provide quantitative measures of our summarization model's ability to accurately capture content overlap at different levels of granularity.
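For illustration, the following minimal Python sketch (not the evaluation code used in this project) shows how a ROUGE-L F-score can be computed from the longest common subsequence (LCS) between a generated summary and a reference; tokenization here is simple whitespace splitting.

# ROUGE-L = LCS-based recall/precision/F1 between candidate and reference tokens.
def lcs_length(a: list, b: list) -> int:
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(generated: str, reference: str) -> dict:
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(gen) if gen else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_l("काठमाडौंमा आज पानी पर्‍यो", "काठमाडौंमा आज भारी पानी पर्‍यो"))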
Below is the summary output for a news article sourced from OnlineKhabar, used to test our model's ability to handle unseen data.
Figure 4.2: Output for first article
The summarization model effectively captured the essence of the news article, highlighting key details about the hostel's construction.
Below is the summary output for a news article sourced from Setopati.
Figure 4.3: Output for second article
The resulting summary captured the essence of the news article.
Figure 4.4: Output for Nepali text document.
CHAPTER 5
5.1 Conclusions
In conclusion, the Document Summarizer successfully met the objectives set forth in
the project, particularly in the context of handling Nepali language documents. The
model demonstrated proficiency in summarizing diverse Nepali documents, offering a
valuable tool for efficiently extracting key insights from extensive textual data.
The positive results obtained from the evaluation indicate the potential for broader applications of the Document Summarizer in handling Nepali language documents, contributing to more accessible and time-efficient information consumption.
5.2 Limitations
Recognizing and addressing the limitations of the current system is crucial for its successful development, implementation, and eventual adoption. Ongoing refinement and adaptation will be necessary to overcome these challenges and enhance the system's effectiveness in real-world settings.
REFERENCES
[1] [reference not recovered]
[3] E. Hermansson and C. Boddien. Using pre-trained language models for extractive text summarisation of academic papers. 2020.
[4] R. S. Khanal, S. Adhikari, and S. Thapa. Extractive method for Nepali text summarization using text ranking and LSTM. Proceedings of 10th IOE Graduate Conference, 10, October 2021.
[5] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[8] [reference not recovered]
[9] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. Proceedings of NAACL-HLT, pages 483–498, 2021.