Final Report
A PROJECT REPORT
Submitted by
RISHIK P 111720102101
PAVAN P 111720102102
LOKESH P 111720102111
March 2024
R.M.K. ENGINEERING COLLEGE
(An Autonomous Institution)
R.S.M. Nagar, Kavaraipettai-601 206
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
We earnestly express our sincere gratitude and regard to our beloved Chairman Shri. R. S.
Munirathinam, our Vice Chairman, Shri. R. M. Kishore, and our Director, Shri. R.
Jyothi Naidu, for the interest and affection shown towards us throughout the course.
We convey our sincere thanks to our Principal, Dr. K. A. Mohamed Junaid, for his constant
support and encouragement.
We express our sincere thanks to our Professor and Head of the Department of Computer
Science and Engineering, Dr. T. Sethukarasi, for her commendable support and guidance.
We would like to express our sincere gratitude to our Project Guide, Ms. P. Baby Shamini,
Assistant Professor, for her valuable suggestions towards the successful completion of this
project.
We take this opportunity to extend our thanks to all faculty members of the Department of
Computer Science and Engineering, our parents and our friends for all that they meant to us
during the course of this work.
ABSTRACT
With the growth of the internet, information has been digitized as files and documents.
Information retrieval is the process of obtaining the required documents from vast databases.
As the number of documents multiplies every five years, it is crucial to retrieve information
effectively. Therefore, automatic indexing of documents is necessary for efficient information
retrieval. In the related work, challenges of information retrieval such as classification of
similar documents, extraction of keywords, clustering of scientific documents and evaluation
of similarity are addressed using natural language processing methods. This research
introduces an unsupervised model using Gensim to rank documents based on the user's query.
The proposed model preprocesses the documents, generates a similarity score for each
document with respect to the user's query and ranks the documents based on that score. The
proposed methodology makes it easier to quickly retrieve the relevant documents based on the
user's needs. To evaluate the performance of the proposed method, we collected documents
from different domains on the internet. The proposed method retrieves comparatively more
relevant documents than the traditional methods.
TABLE OF CONTENTS
2 SYSTEM ANALYSIS 10
2.1 Existing System 10
3 SYSTEM DESIGN 14
3.1 System Architecture 14
3.2 UML Diagrams 18
3.2.1 Use Case Diagram 18
3.2.2 Class Diagram 19
3.2.3 Data Flow Diagram 20
3.2.4 Activity Diagram 21
4 SYSTEM IMPLEMENTATION 22
4.1 Modules 22
4.2 Module description 23
4.2.1 Data Set Collection 23
4.2.2 Data Pre-Processing 23
4.2.3 Building and applying gensim 26
4.2.4 Ranking the document 27
4.2.5 Computing the evaluation parameters 27
4.3 Algorithms 28
4.3.1 Natural language processing 28
4.3.2 Gensim 29
4.3.3 Term Frequency-Inverse Document Frequency 30
4.4 Testing 32
LIST OF FIGURES
4.3.1 NLP 28
4.3.2 Gensim 29
4.3.3 TF-IDF 30
01 Original Text 45
02 Tokenization 46
03 Removal of punctuation 47
04 Word folding 47
06 Word Stemming 48
LIST OF TABLES
TABLE NO. TABLE NAME PAGE NUMBER
4.1 Modules 19
LIST OF ABBREVIATIONS
03 AI Artificial intelligence
06 OS Operating system
CHAPTER 1
INTRODUCTION
For many years, retrieval of information has been a tremendous task for any organization
because the volume of data keeps growing continuously. It is an uphill battle for the user to
retrieve the required information from an immense amount of data. Information in any
organization is stored in the form of documents, and the documents available in digital form
double every five years, so the identification of similar documents is beneficial. Similarity
among the data plays a crucial role in information retrieval. Segregation of documents helps
the user in many ways, but manual segregation of documents takes a lot of time and human
resources. This project therefore presents a technological approach that simplifies the task by
automatically segregating data. To carry out this idea, various documents are used as input.
The user provides a query as input, and the system automatically ranks the documents based
on their similarity to that query. This chapter briefly introduces the work by first discussing
the overview of the project and the problem statement, followed by the objective, existing
systems, the significance, and finally the limitations. Document similarity is a method of
retrieving similar documents from a large document collection for a single query. Ranking
the documents in the query result by similarity helps the user access the required information
efficiently. Document similarity is mainly used in the process of information retrieval, which
is a major task in any organization: it helps users formulate solutions to present problems by
analysing past results. Information or results are mainly stored as documents or in portable
document formats.
1.1 Problem Statement
Because of the increasing amount of data, search engines encounter difficulties in fetching
results that are relevant to users' search queries. Traditional document ranking methods are
mostly based on similarity computations between documents and queries. In many cases
users may want to retrieve documents that are not only similar but also general or broad
regarding a certain topic. So, to rank the documents efficiently and accurately, there is a need
for document ranking using a semantic measure.
[2]. Qian Liu et al. (2021) proposed the use of association rules for measuring word
similarity at a global level and fuzzy similarity to measure the top-k words, in IEEE
Access, vol. 9, pp. 126801-126821, 2021.
This work proposed the use of association rules for measuring word similarity at a global
level and fuzzy similarity to measure the top-k words. For the top-k words, the authors
proposed a similarity measure over word embeddings in which both local and global
information are considered: the global information is measured with association rules and the
local information is measured by word embeddings. The authors compared this proposed
method to eight state-of-the-art baselines. The datasets used by the authors are TREC disks
4 & 5, WT10G, and RCV1. The authors designed a fuzzy logic system which overcomes the
problems associated with combining the two types of measures by inferring the similarity
between words and then returning the top-k selected words. The advantage of fuzzy logic is
that it provides a flexible and convenient way to transform expert knowledge expressed in
natural language into fuzzy rules. This paper contains three components, local, global and
fuzzy systems, but there is no component-wise validation.
This work used the standard K-means algorithm, which relies on a distance measure. The
authors experimented with datasets for tasks such as grouping similar news articles, analysis
of customer feedback, text mining, duplicate content detection, and finding similar
documents. In the dataset, the number of assigned categories matches the number of clusters.
Two evaluation measures, purity and entropy, are used to assess the quality of a clustering
result. Euclidean distance measures are used to find the document clusters more effectively,
while the Pearson and Jaccard methods are more suitable for finding rational clusters with
high clarity values, represented by documents from a single group dominating each cluster.
The advantage of the experiment is that the frequency is calculated according to the typed
dataset, but the accuracy is very low.
[4]. Shuaizhang et al. (2019) proposed an extended citation model for scientific document
clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi:
10.1109/ACCESS.2021.3125729.
This work proposed an extended citation model for scientific document clustering, combining
a citation network and a textual similarity network to enhance the performance of scientific
document clustering. The experiments were conducted using the PMC and PubMed
databases, two popular databases in the biomedical field that provide a large number of
open-access and full-text scientific documents, with 10,996 scientific documents. The authors
used Java and R programming for the implementation of the experiment. The data were
preprocessed before the experiment was conducted, and the authors constructed a textual
similarity network and integrated it with the citation network, co-citation network and
bibliographic coupling network. They proved the practicability of their proposed extended
citation model by comparing it with the traditional bibliographic coupling model and the
textual similarity model for scientific document clustering using the R programming
language. They used a random walk algorithm, a popular community detection algorithm,
whose input is the similarity network and whose output is the clustering results. The
advantage is the efficient clustering of scientific documents by considering the frequency of
the document. The limitation is that a limited dataset was used.
In this model, the authors used Maximum Entropy Principle based Document Ranking with
Term Selection Analysis (MEPDR-TSA) for cross-lingual information retrieval (CLIR). The
user query in the Tamil language is translated into English. Then, the MEPDR technique is
employed for ranking the documents and TSA is used for choosing a set of retrieved
documents for each query. Finally, the retrieved English documents are converted back into
Tamil using Google Translate, and the results are tested against precision, recall, and F-score.
The authors found that Maximum Entropy Principle based Document Ranking with Term
Selection Analysis, with the Tamil query converted into English and the retrieved English
documents converted back into Tamil, performs better than existing methods such as Okapi
BM25, IB-MLIR, MULM, KNN, and n-gram. However, this method should be improved
with advanced document ranking methods.
This work proposed a multi-criteria indexing and retrieval (MCIR) model for web pages and
documents. It uses different retrieval methods to obtain an accurate document, handles page
ranking algorithm issues, and utilizes the top seven criteria for indexing and retrieving
results. In the first phase, users enter the required queries and the MCIR goes through
crawling, online or offline. The first step is finding pages or documents existing on the web
(when working online) or in files stored on a machine (when working offline). Once the
system finds a page URL or document path, it visits it and evaluates it according to the user's
search query. In the second phase, the weight model generates a weight for each criterion
according to user preferences, and then rank statistics are called to return the final page
weight to users. This model proved that ranking through multiple criteria gives different
results than ranking through one or two criteria as in previous algorithms. It retrieves
documents and web pages based on the top seven criteria: page or document votes, keywords
in the domain, page content and URL, page publish date, page modified date, number of
links, page load time, and bad links.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan
Mahmood, Twitter trends: A ranking algorithm analysis on real time data, Expert
Systems with Applications, Volume 164, 2021.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department
of Informatics, Athens University of Economics and Business, Greece, Institute for
Language and Speech Processing, Research Center ‘Athena’, Greece, 2021.
This work used POSIT-DRMM (PDRMM), a differentiable extension of DRMM, and
proposed an architecture to jointly rank documents and snippets with respect to a question,
the two particularly important stages in question answering (QA) for large document
collections. The authors instantiated the proposed architecture using a recent neural relevance
model (PDRMM) and a BERT-based ranker. Using biomedical data (from BIOASQ), they
showed that the two resulting joint models (PDRMM-based and BERT-based) vastly
outperform the corresponding pipelines in snippet retrieval, the main goal in QA for
document collections, while using fewer parameters and remaining competitive in document
retrieval. They also provided a modified version of the Natural Questions dataset suitable for
document and snippet retrieval. The advantage of this method is that its document retrieval
results are better than DRMM and several other neural rankers. However, the dataset should
be extended to a multi-granular task; BIOASQ already includes such a multi-granular task,
but exact answers are provided only for factoid questions and they are freely written by
humans, as in MS-MARCO, with similar limitations. Hence, appropriately modified versions
of the BIOASQ datasets are needed.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline
Honerlaw, Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon,
Michael Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking
algorithm via semantic similarity assessment improves efficiency of medical chart
review, Journal of Biomedical Informatics, Volume 132, 2022.
The authors used the pGUESS algorithm, a prior-guided semantic similarity measure of the
informativeness of a clinical note for a given phenotype. The algorithm scores the relevance
of a note as the cosine similarity between SEVnote and SEVref. The pGUESS algorithm is
fully knowledge-based except for assigning notes into three categories via clustering. The
results on note ranking for CAD performed at both VHA and PHS suggest high
transportability of the pGUESS algorithm across two different healthcare systems, since the
pGUESS algorithm does not require local EHR data but only knowledge sources and
embedding vectors. The algorithm reduced the burden of chart review and improved the
efficiency and accuracy of human annotation; determining patient disease status via chart
review is a critical yet labour-intensive task in EHR-based training or validation of robust
prediction algorithms.
The advantage of this algorithm is that the overall ranking quality, as measured by the rank
correlation, was the highest for pGUESS compared to all other methods. However, the
publicly available TMGuassian algorithm does not allow out-of-sample prediction, so the
authors only tested the portability of LDAgibbs and LEAvem.
This work used a combination of a traditional statistical method and a deep learning model,
as well as a novel model based on multi-model nonlinear fusion proposed in the paper. The
Ant Financial dataset comes from the Chinese text similarity contest held by Alipay; its data
mainly come from Alipay's customer service and contain 100,000 records. Semantic textual
similarity (STS) datasets from 2012 to 2015 are also used in the experiment. The model uses
the Jaccard coefficient based on part of speech, Term Frequency-Inverse Document
Frequency (TF-IDF) and the word2vec-CNN algorithm to measure the similarity of sentences,
combining the traditional statistics-based sentence similarity calculation and completing the
coarse-grained extraction of the sentence. The results of the Jaccard algorithm, the TF-IDF
algorithm and word2vec-CNN are fed into a shallow fully connected neural network to train
the model and give an ideal classification result. The Jaccard algorithm captures grammatical
information, TF-IDF calculates sentence similarity from term frequency and inverse document
frequency, and with the word2vec-CNN algorithm the sentence feature matrix is weighted by
a multi-feature attention mechanism, which increases the performance. The experimental
results show that the proposed sentence similarity calculation method based on multi-feature
fusion can balance the calculation results of multiple models: the matching accuracy of the
method based on multi-model nonlinear fusion is 84%, and the F1 value of the model is 75%.
In this experiment, the similarity of a sentence is measured using the respective meanings of
the words in the sentence. The limitations are that the word vector given by the word2vec
model is static and cannot describe dynamic changes of semantics, and the accuracy can be
improved.
1.3 System Requirements
The feasibility of the project is analysed in this phase and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out to ensure that the proposed system is
not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
ECONOMICAL FEASIBILITY:
The economic feasibility of document ranking depends on various factors, primarily the specific
application and the value it brings to users or businesses. Document ranking, often associated
with information retrieval and search engines, can have economic benefits in different contexts.
Here are some considerations:
User satisfaction and Engagement
Productivity and Efficiency
Adaptability and Scalability
TECHNICAL FEASIBILITY:
The technical feasibility of document ranking involves assessing whether the implementation
of a document ranking system is technically viable given the available technology, resources
and infrastructure.
Here are key considerations for evaluating the technical feasibility of document ranking:
Algorithm complexity
Data availability and Quality
Feature Engineering
Testing and Evaluation
OPERATIONAL FEASIBILITY:
The operational feasibility for document ranking refers to whether the system can be effectively
integrated into existing operations and processes. It assesses the practicality of implementing
the document ranking system within the operational context of an organization.
Here are key considerations for evaluating the operational feasibility of document ranking:
User acceptance
Data input and output
Resource availability
Legal and regulatory compliance
CHAPTER 2
SYSTEM ANALYSIS
2.1 Existing System
Natural language processing lies at the intersection of computer science and artificial
intelligence. It mainly deals with the interaction between computers and human language and
focuses on how computers process and analyse large amounts of data; NLP techniques help
computers understand contexts in a document. Due to the increase in data in digital formats
such as documents, classification of the data becomes a difficult task, so different methods are
used to classify the data based on the user's requirements. Traditional methods such as manual
retrieval of data take a huge amount of time and effort; therefore, automatic retrieval of data
is adopted. Initially, the raw data is converted into a numeric format. Secondly, the data is
preprocessed by performing tokenization and removal of stop words, which carry no meaning
of their own. Finally, according to the user's query, the similarity metrics are measured. These
steps are carried out with the help of NLP techniques. Existing methods mainly use hybrid
models to obtain results. A hybrid model combines two or more techniques, since the accuracy
of a single technique is lower than that of hybrid techniques. Hybrid techniques can be used in
the analysis of large and complex documents, the entertainment industry, resume ranking and
many more areas. However, the usage of a hybrid model makes the process complex. The
hybrid method is a combination of various NLP techniques such as TF-IDF, the Jaya and grey
wolf optimizers, and the longest common subsequence, sometimes compared with machine
learning methods such as CNNs. The hybrid method also increases the time complexity of the
work, and most existing work does not extend the similarity metrics or the dictionary.
Limited understanding of context:
Document ranking systems may struggle to understand the context of the content they
are evaluating.
They often rely on keyword matching and statistical patterns, which might not capture
the nuanced meaning of documents accurately.
Limited handling of multimedia content:
Many existing document ranking methods are primarily designed for text-based
content and may not effectively handle multimedia content such as images, videos, or
audio.
This can limit the usefulness of search engines for users seeking diverse types of
information.
Scalability challenges:
As the volume of documents and user queries continues to grow, existing document
ranking methods may struggle to maintain scalable performance without sacrificing
relevance or quality.
This can result in slower response times or degraded search experiences for users.
Limited personalization:
Many existing ranking methods return the same results to every user, regardless of
individual history or preferences.
This lack of personalization can lead to suboptimal search experiences for users who
have diverse needs and preferences.
2.2 Proposed System:
In this model, ranking of the documents based on their similarity to the user's query is
performed using natural language processing. Ranking the documents helps in the efficient
retrieval of information in any firm, and this model helps in identifying the most relevant and
important information based on the user's requirement.
The proposed work is implemented in Python 3.8 with the libraries Gensim, spaCy, NumPy
and nltk, the corpora, models and similarities modules, and other mandatory libraries. There
are many applications of natural language processing such as information retrieval,
classification of documents, spell checking, estimation of similarity, keyword extraction,
language translation and many other information retrieval problems, and most information
retrieval problems are solved using NLP techniques. In natural language processing, data
preprocessing plays an important role in solving the problem. The proposed model is
implemented using an unsupervised learning algorithm and the well-known Python library
Gensim, which ranks the documents based on their similarity to the user's query with better
accuracy. The dataset, which contains simple text, is preprocessed with different steps such as
word tokenization, removal of punctuation, stop word removal and word stemming; word
stemming plays a crucial role in finding the accurate similarity. The preprocessed documents
are split into source and query documents, the proposed model is applied to the preprocessed
documents, the source is compared with the other target documents, a score is obtained for
each document, and the documents are ranked based on their similarity to the user's query.
The performance of the proposed method is better than that of the traditional methods.
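A minimal end-to-end sketch of this pipeline is shown below. It assumes the gensim and nltk
packages are installed; the three inline sample documents and the query string are illustrative
stand-ins for the project's text files, not the project's actual data.

from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem.porter import PorterStemmer

# Illustrative in-memory documents; the actual system reads them from text files.
documents = [
    "MS Dhoni is a former captain of the Indian cricket team.",
    "Dhoni led India to victory in the 2011 Cricket World Cup final.",
    "Photosynthesis converts sunlight into chemical energy in plants.",
]

stemmer = PorterStemmer()

def preprocess(text):
    # Tokenize, lowercase and strip punctuation, remove stop words, then stem.
    return [stemmer.stem(tok) for tok in simple_preprocess(text) if tok not in STOPWORDS]

texts = [preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(texts)                    # word -> integer id
corpus = [dictionary.doc2bow(text) for text in texts]     # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                         # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[corpus])      # similarity index

query = "Who captained the Indian cricket team?"          # example user query
scores = index[tfidf[dictionary.doc2bow(preprocess(query))]]

# Rank the documents by similarity to the query, highest score first.
for doc_id, score in sorted(enumerate(scores), key=lambda item: item[1], reverse=True):
    print(doc_id, round(float(score), 3))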
Time-efficient:
The automatic generation of solutions to natural language problems reduces the time
users spend on the task.
Market Research and Analysis:
Automatic summarization of research papers, extraction of keywords, and clustering
of similar documents in a query help researchers.
Streamlined processes:
Traditional communication channels such as help centres and customer care have been
replaced by chatbots to improve customer satisfaction.
E-learning:
NLP machine learning technology can examine the language used in a classroom to
define the mental states of both teachers and students.
CHAPTER 3
SYSTEM DESIGN
Preprocessing:
Preprocessing consists of term folding, term tokenization, removal of punctuation,
stop term elimination and term stemming. Preprocessing is the first thing we do to a
document to make further processing easy. It raises reliability and accuracy:
preprocessing can increase the correctness and quality of a dataset, making it more
reliable by removing missing or inconsistent data values brought on by human or
computer mistakes, and it ensures consistency in the data.
Term Folding:
Preprocessing is applied to the data once it has been received. Term folding is a
preprocessing technique that lowercases words that are currently in uppercase, so that
two occurrences of the same term appear identical in lowercase.
Removal of Punctuation:
Term Tokenization:
Term tokenization then separates the raw text into tokens, which are words and
sentences. By studying the words, these symbols help the reader in determining the
context and analyzing the text's meaning.
Elimination of stop terms occurs in the next preprocessing stage. In any language, stop
terms are a group of frequently used terms; stop words in English include "the," "is,"
and "and," for instance. Removing these unnecessary words lets computers concentrate
on the crucial ones. It is one of the most commonly used preprocessing steps across
different NLP applications.
Term Stemming:
Word stemming, the final preprocessing stage, removes the last few characters from a
word, frequently resulting in incorrect spelling or meaning. Lemmatization, by contrast,
considers the context and converts the word to its meaningful base form, which is
called a lemma. For instance, stemming the word 'Eating' would return 'Eat'.
After preprocessing is finished, the input source document and the set of documents in
the dataset are stored as preprocessed data.
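The five preprocessing stages described above can be sketched as follows; the sample
sentence, the small stop-word list and the use of NLTK's PorterStemmer are illustrative
choices rather than the exact configuration used in the project.

import string
from nltk.stem.porter import PorterStemmer

text = "The Cricketer M. S. Dhoni was Eating dinner after the match!"

# Term folding: lowercase everything so identical terms compare equal.
folded = text.lower()

# Removal of punctuation.
no_punct = folded.translate(str.maketrans("", "", string.punctuation))

# Term tokenization: split the raw text into word tokens.
tokens = no_punct.split()

# Stop term elimination with a small illustrative stop list.
stop_words = {"the", "is", "and", "a", "an", "was", "after"}
filtered = [tok for tok in tokens if tok not in stop_words]

# Term stemming: reduce each remaining word to its root form.
stemmer = PorterStemmer()
stemmed = [stemmer.stem(tok) for tok in filtered]

print(stemmed)   # expected: ['cricket', 'm', 's', 'dhoni', 'eat', 'dinner', 'match']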
Building the Gensim Model:
Gensim is an open-source Python natural language processing library used for
unsupervised modelling. Its notable features are scalability, robustness, platform
independence and efficient multicore implementations. Gensim is used for fastText,
Word2vec, LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and
tf-idf (term frequency-inverse document frequency). fastText, which uses a neural
network for word embedding, is a library for learning word embeddings and text
classification. Word2vec, used to produce word embeddings, is a group of shallow,
two-layer neural network models. LSA is a technique in NLP (Natural Language
Processing) that allows us to analyse relationships between a set of documents and
the terms they contain. LDA is a technique in NLP that allows sets of observations to
be explained by unobserved groups; these unobserved groups explain why some
parts of the data are similar. tf-idf, a numeric statistic in information retrieval, reflects
how important a word is to a document in a corpus, and it is often used by search
engines to score and rank a document's relevance given a user query. The facilities
provided by Gensim for building topic models and word embeddings are unparalleled,
and it also provides convenient facilities for text processing. It handles large text files
even without loading the whole file into memory, and it does not require costly
annotations or hand tagging of documents because it uses unsupervised models.
The core concepts of Gensim are the document, corpus, vector and model. A document
is an object of the text sequence type, known as 'str' in Python 3. A corpus may be
defined as a large and structured set of machine-readable texts produced in a natural
communicative setting; in Gensim, a collection of document objects is called a corpus.
The corpus serves as input for training a model and as a topic extractor. A vector is a
mathematical representation of a document. A model refers to an algorithm used for
transforming vectors from one representation to another. For working with text
documents, Gensim also requires the words, i.e. tokens, to be converted to their unique
ids. For this, it provides the Dictionary object, which maps each word to a unique
integer id; it does so by converting the input text to a list of words and then passing it
to the corpora.Dictionary() object. In Gensim, the dictionary object is used to create a
bag-of-words (BoW) corpus, which is further used as the input to topic modelling and
other models. The Term Frequency-Inverse Document Frequency model is also a
bag-of-words model, but it differs from the regular corpus in that it down-weights
tokens, i.e. words, that appear frequently across documents. During initialisation, the
tf-idf model algorithm expects a training corpus with integer values (such as a
Bag-of-Words corpus). At transformation time, it takes a vector representation and
returns another vector representation of the same dimensionality, in which the value
of the features that were rare at training time is increased; it essentially converts
integer-valued vectors into real-valued vectors.
After building the Gensim model, the input data is compared with the dataset to
measure the similarity: the score of each target document, i.e. each document in the
dataset, is measured with respect to the source document, and the target documents
are sorted according to the score obtained, which means the documents are ranked
based on the scores.
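The behaviour described above, a Dictionary mapping tokens to ids, a bag-of-words
corpus, and a tf-idf transformation that down-weights tokens occurring in every
document, can be illustrated with a toy corpus; the three token lists below are invented
for the example.

from gensim import corpora, models

# "document" and "ranking" appear in every text; "cricket" appears in only one.
texts = [
    ["document", "ranking", "cricket"],
    ["document", "ranking", "news"],
    ["document", "ranking", "biography"],
]

dictionary = corpora.Dictionary(texts)                # maps each token to an integer id
bow_corpus = [dictionary.doc2bow(t) for t in texts]   # sparse (id, count) vectors

tfidf = models.TfidfModel(bow_corpus)                 # trained on the integer count vectors

# Transforming the first bag-of-words vector returns real-valued weights. Tokens that
# occur in every document get an inverse document frequency of zero and are dropped,
# so only the informative token "cricket" remains, with the highest possible weight.
for token_id, weight in tfidf[bow_corpus[0]]:
    print(dictionary[token_id], round(weight, 3))     # prints: cricket 1.0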
3.2 UML Diagrams
3.2.2 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, attributes, operations (or methods), and the relationships among the classes. It
explains which class contains which information. This class diagram contains a document with
a document type, as well as a folder and a document version.
3.2.3 Data Flow Diagram
The Data Flow Diagram (DFD) shows the information flow in the system: the user uploads
and views the data, and the system evaluates it and provides the result.
3.2.4 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modelling
Language, activity diagrams can be used to describe the business and operational step-
by-step workflows of components in a system. An activity diagram shows the overall
flow of control.
CHAPTER 4
SYSTEM IMPLEMENTATION
4.1 Modules:
4.2 Module Description:
4.2.1 Data Set Collection:
for machines to analyze. In the samples of the dataset, we break the words into smaller
tokens.
Removal of Punctuation:
Word folding:
Word folding also covers the process of removing diacritical marks or accents from the
text. This step is important because different languages may use different accents, and
some languages have multiple accents for the same letter.
Word folding is followed by stop word removal. In natural language processing, stop
word removal is a common technique used for text preprocessing. Stop words are words
that are commonly used in a language but do not contribute to the meaning of a
sentence; examples of stop words in English include "a", "an", "the", "is", "are", "of",
and "in". These words are usually removed from the text during the preprocessing stage
as they don't provide any value for the analysis of the text. The main purpose of stop
word removal is to reduce the size of the dataset and improve the accuracy of
downstream analysis of text in the model. In this stage, the text already preprocessed
from the dataset is further processed with stop word removal and then passed to the
next stage. In conclusion, stop word removal is a common technique used in natural
language processing for preprocessing text data, useful for reducing the size of the
dataset and improving the accuracy of the model.
Word Stemming:
Finally, the crucial step in the preprocessing is stemming, which follows stop word
removal. Stemming is a common technique used in Natural Language Processing (NLP)
for text pre-processing. It is the process of reducing a word to its base or root form,
called a stem, by removing suffixes and prefixes from the word. Stemming is useful
because it helps to reduce the dimensionality of the text data, making it easier to analyse
and process. It also helps to normalize the text, allowing similar words to be treated as
the same, which can improve the accuracy of the model and thereby aid information
retrieval.
For example, consider the word "running". The Porter Stemming Algorithm removes
the suffix "ing" to get "runn" and then applies the rule for the double consonant "nn" to
get "run". In the samples of the dataset, the preprocessed data is passed to the stemming
stage, which returns the root of each word in the sample. In conclusion, stemming is a
valuable technique in NLP that helps to normalize text and reduce dimensionality.
However, it is important to use it judiciously and to combine it with other techniques to
achieve the best results in text processing.
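A small illustration of this behaviour with NLTK's Porter stemmer follows; the word list
is arbitrary. Note that a stem such as 'studi' is not a dictionary word, which is exactly the
kind of spelling distortion mentioned above.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "eating", "studies", "ranked"]:
    print(word, "->", stemmer.stem(word))
# running -> run, eating -> eat, studies -> studi, ranked -> rank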
In conclusion, text data processing is a crucial step in natural language processing that
involves the conversion of raw textual data into a structured format that can be used for
analysis and modelling. This process involves several techniques such as tokenization,
stop word removal, stemming/lemmatization, and part-of-speech tagging, which can be
used individually or in combination depending on the specific requirements of the
analysis.
4.2.3 Building and Applying Gensim:
Initially, the build starts by importing the Gensim and natural language processing
libraries. The corpora, models, and similarities modules are imported from the gensim
library; these modules are used to create a dictionary of words from the corpus, train a
TF-IDF model, and calculate document similarities. Several sample documents are read
from text files and stored in a list called documents: the first document is stored in doc1,
the second document is stored in doc2, and so on. The documents are collected from
different domains from Wikipedia. The text of each document is split into words and
stored in another list called text_corpus; this creates a list of lists, where each inner list
contains the words of a single document. A Dictionary object is created using the
text_corpus list, which creates a mapping between words and unique integer IDs. A
corpus object is created by converting each document in text_corpus to a bag-of-words
representation using the doc2bow method of the Dictionary object; this creates a list of
sparse vectors, where each vector represents the frequency of each word in a single
document. A TfidfModel object is trained on the corpus object, which creates a TF-IDF
representation of each document in the corpus; this assigns a weight to each word in
each document based on how important it is to the document relative to the other
documents in the corpus. A MatrixSimilarity object is created from the TF-IDF corpus;
this object allows us to calculate the similarity between any two documents in the
corpus. A query document is defined as the first document in documents. The text of the
query document is converted to a bag-of-words representation using the same Dictionary
object that was used to create the corpus. The similarity between the query document
and each document in the corpus is calculated using the MatrixSimilarity object; this
produces a list of similarity scores, where each score represents the similarity between
the query document and a single document in the corpus. The similarity scores are
sorted in descending order, and the document ID and similarity score for each document
in the corpus are printed to the console.
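The steps listed above correspond to the listing in Appendix I; a condensed sketch with
inline strings in place of the project's text files is shown below.

from gensim import corpora, models, similarities

documents = [
    "dhoni is a wicket keeper and a former indian captain",
    "the indian captain won the world cup",
    "stock markets fell sharply on monday",
]
text_corpus = [doc.split() for doc in documents]

dictionary = corpora.Dictionary(text_corpus)
corpus = [dictionary.doc2bow(text) for text in text_corpus]

tfidf_model = models.TfidfModel(corpus)                      # TF-IDF weights per document
similarity_index = similarities.MatrixSimilarity(tfidf_model[corpus])

# The query document is the first document in the collection.
query = documents[0]
query_vec = dictionary.doc2bow(query.lower().split())
sims = similarity_index[tfidf_model[query_vec]]

# Sort the document ids by similarity score, highest first, and print the ranking.
for doc_id, score in sorted(enumerate(sims), key=lambda item: item[1], reverse=True):
    print(doc_id, round(float(score), 3))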
The accuracy of the proposed Gensim model is compared with other similar methods; the
accuracy of the Gensim model is higher than that of the other traditional models.
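A sketch of the accuracy computation is shown below; the ranked results and the relevant set
are invented for illustration and do not reproduce the figures reported later in this report.

# Hypothetical ranked output: (document id, similarity score), highest first.
result_docs = [(0, 1.00), (1, 0.41), (2, 0.00)]
documents = ["doc about dhoni", "doc about cricket", "doc about stocks"]

# Documents judged relevant to the query (illustrative ground truth).
relevant_docs = {"doc about dhoni", "doc about cricket"}

top_k = 2   # evaluate only the top-k ranked documents
retrieved = [documents[doc_id] for doc_id, _ in result_docs[:top_k]]
num_correct = sum(1 for doc in retrieved if doc in relevant_docs)

accuracy = num_correct / len(relevant_docs)
print(f"Accuracy: {accuracy:.2f}")   # Accuracy: 1.00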
4.3 Algorithms:
4.3.1 Natural Language Processing:
Natural language processing (NLP) lies at the intersection of computer science and
artificial intelligence. It mainly deals with the interaction between computers and human
language and focuses on how computers process and analyse large amounts of data; NLP
techniques help computers understand contexts in a document. NLP plays a significant
role in document ranking, especially in information retrieval systems such as search
engines. Document ranking refers to the process of determining the relevance of
documents to a user's query and presenting them in a ranked order.
4.3.2 Gensim:
Gensim is a popular open-source natural language processing library used for
unsupervised topic modelling that specializes in creating and manipulating vector space
models of natural language data. Vector space models represent text documents as
high-dimensional vectors, which can be analysed using various mathematical operations
to discover patterns, similarities, and relationships between them. Gensim provides a
suite of tools for building, training, and using vector space models, with a focus on
scalability, performance, and ease of use. One of the main features of Gensim is its
support for multiple text corpus formats, including plain text, CSV, and preprocessed
corpus formats such as MmCorpus and LDA-C. Gensim provides a flexible and efficient
way to preprocess text data, which involves tokenizing, stemming, stop-word removal,
and other tasks that are necessary to convert raw text into a form that can be used to
build vector space models. Preprocessing is typically done using Gensim's built-in
functions or custom pipelines, which can be configured to meet the specific needs of the
user. Gensim supports several popular vector space models, including bag-of-words,
TF-IDF, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), and
word2vec. These models differ in their underlying assumptions and mathematical
techniques, but they all share the goal of representing text documents as vectors in a
high-dimensional space. For example, the bag-of-words model represents each
document as a vector of term frequencies, where each term corresponds to a dimension
in the vector space.
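As an example of these vector space models, the short sketch below builds a two-topic LSI
representation on top of a TF-IDF corpus; the toy token lists are illustrative only.

from gensim import corpora
from gensim.models import LsiModel, TfidfModel

texts = [
    ["cricket", "captain", "match"],
    ["captain", "world", "cup"],
    ["stock", "market", "shares"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# TF-IDF weighting followed by a 2-topic LSI (Latent Semantic Indexing) model.
tfidf = TfidfModel(corpus)
lsi = LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

# Each document is now a dense vector in a 2-dimensional latent topic space.
for doc_vector in lsi[tfidf[corpus]]:
    print(doc_vector)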
4.3.3 Term Frequency-Inverse Document Frequency:
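In its standard textbook form, the TF-IDF weight of a term is its term frequency multiplied by
its inverse document frequency; Gensim's TfidfModel implements a variant of this with a
base-2 logarithm and vector normalization by default. A small sketch of the generic formula:

import math

def tfidf_weight(term_count, doc_length, num_docs, doc_freq):
    # Term frequency: how often the term occurs in this document.
    tf = term_count / doc_length
    # Inverse document frequency: terms that are rare across the corpus score higher.
    idf = math.log(num_docs / doc_freq)
    return tf * idf

# A term occurring 3 times in a 100-word document and in 2 of 10 corpus documents:
print(round(tfidf_weight(3, 100, 10, 2), 4))   # approximately 0.0483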
4.3.4 Bag of words:
The doc2bow method is a function provided by the Gensim library for converting a
document (list of words) into a bag-of-words format. Bag-of-words (BOW) is a
commonly used representation of text in natural language processing. In BOW, a
document is represented as a sparse vector of word frequencies, where each dimension
corresponds to a unique word in the vocabulary. The doc2bow method takes a list of
tokens as input and returns a list of tuples. Each tuple represents a word in the document
and its frequency count. The first element of the tuple is the word's index in the
vocabulary, and the second element is the word's frequency count in the
document. Overall, the doc2bow method is an important tool for text processing in
natural language processing. It provides a simple and efficient way to convert
documents to a bag-of-words format, which can be used in various downstream tasks
such as topic modeling, clustering, and classification.
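A short illustration of doc2bow with a made-up vocabulary:

from gensim import corpora

texts = [["dhoni", "captain", "dhoni"], ["captain", "cricket"]]
dictionary = corpora.Dictionary(texts)

# doc2bow converts a token list into sparse (word_id, frequency) tuples.
bow = dictionary.doc2bow(["dhoni", "dhoni", "cricket"])
print(bow)                                           # e.g. [(1, 2), (2, 1)]; ids depend on the dictionary
print([(dictionary[i], count) for i, count in bow])  # the counts: 'dhoni' twice, 'cricket' once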
4.4 Testing:
Software testing techniques are methods used to design and execute tests to evaluate
software applications. It involves rigorous unit testing to validate the functionality of
individual modules, comprehensive integration testing to ensure seamless interaction
between components, and manual testing to assess overall system performance,
usability, and accessibility.
4.4.1.2 Integration Testing:
Integration testing plays a crucial role in ensuring the seamless interaction between various
components of the system in our project.
Determine the key integration points in your document ranking system. These might include
the interaction between document tokenization, term weighting, similarity calculation, and
the ranking algorithm. Verify the flow of data between different components. Ensure that
data is passed correctly from one module to another and that the transformations are applied
as intended.
Use mock objects or stubs to simulate external dependencies, such as databases or external
APIs. This allows you to control the input and focus on the interactions between the internal
components. Test how different components interact with each other. For example, ensure
that the term weights calculated during term weighting are correctly used in the similarity
calculation, and that the results are then appropriately considered in the ranking algorithm.
If your document ranking system interacts with external systems (e.g., a search engine
platform, database, or caching system), perform tests to ensure a smooth integration. Test
scenarios like data retrieval, updates, and error handling.
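A minimal sketch of this idea using Python's built-in unittest.mock is shown below; the
rank_documents helper and its interface are hypothetical and stand in for the project's actual
modules.

import unittest
from unittest import mock

def rank_documents(query, documents, similarity_fn):
    # Score every document against the query, then sort by score, highest first.
    scores = [(i, similarity_fn(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda item: item[1], reverse=True)

class RankingIntegrationTest(unittest.TestCase):
    def test_ranking_uses_similarity_scores(self):
        # Stub the similarity calculation so the test focuses on the interaction
        # between the scoring step and the ranking step.
        fake_similarity = mock.Mock(side_effect=[0.2, 0.9, 0.5])
        ranking = rank_documents("query", ["d1", "d2", "d3"], fake_similarity)
        self.assertEqual([doc_id for doc_id, _ in ranking], [1, 2, 0])
        self.assertEqual(fake_similarity.call_count, 3)

if __name__ == "__main__":
    unittest.main()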
System testing plays a crucial role in evaluating the entire document ranking system as a whole
to ensure that it meets the specified requirements and functions correctly in a real-world
environment. Identify and define various test scenarios that represent typical and edge use
cases. These scenarios should cover a range of queries, document types, and user interactions.
Perform end-to-end testing to simulate the entire user journey, from submitting a query to
receiving and displaying the ranked document results. Ensure that the system behaves as
expected at every step.
CHAPTER 5
Document ranking based on similarity includes the usage of NLP techniques and the
application of machine learning algorithms. The model applies NLP techniques for
preprocessing, which include word tokenization, removal of punctuation, stop word removal
and word stemming. The dataset contains documents from different fields, with categories
including biography, news and handwritten chapters; each category contains one source
document and several target documents. The input of the proposed Gensim model is the
preprocessed documents, and the output of the proposed system is the documents ranked by
their similarity with respect to the source document. The proposed method calculated the
similarity score before stemming and after stemming: the measurement of the similarity score
is more accurate when stemming is used, and without stemming the similarity scores are not
accurate. The accuracy of the proposed method is higher than that of the other traditional
methods. A sample output of the proposed method is shown below.
SAMPLE
Initially, the dataset is collected from the sources; the documents contain information
about the biography of Dhoni.
The dataset, consisting of plain text, is preprocessed with the following steps: word
tokenization, removal of punctuation, word folding, stop word removal and word
stemming.
The preprocessed documents are passed to the proposed method and the ranked
documents are obtained.
5.1 Similarity Scores before and after stemming
CHAPTER 6
CONCLUSION
This project, "Document Ranking Based on Similarity using Natural Language Processing
Technique", ranks documents based on their similarity score with respect to the source
document, which plays a crucial role in information retrieval. In the era of the digital world,
digital information has been increasing widely and doubles every five years. Manual access to
data is a difficult and time-consuming process, and the traditional methods for accessing
documents have not been accurate. Preprocessing of text plays an important role in NLP-based
models, yet most of the existing methods did not focus on preprocessing of text. The proposed
Gensim model processes the text with five different methods: tokenization, removal of
punctuation, word folding, stop word removal and word stemming. Word stemming plays an
important role in measuring the similarity score. The existing models only focused on finding
the similarity of documents and grouping them into clusters, whereas the proposed method
ranks the documents based on the similarity score with respect to the user query. The proposed
model is more accurate than other traditional models, with an accuracy of 1 after stemming
and 0.86 before stemming. The proposed model helps in information retrieval applications
such as web engines, search engines, the entertainment and news industries and many more.
Our work can be further improved by considering homonym ambiguity in the documents;
handling homonyms of words can also improve the similarity measure of documents.
REFERENCES
[1]. Benzi Xu et al. (2021) used pseudo-longest-common-subsequence (pseudo-LCS) and the
Jaccard similarity coefficient proposed based on this analysis and principal component
analysis (PCA), Volume 132, 2022.
[2]. Qian Liu et al. (2021) proposed the use of association rules for measuring word similarity
at a global level and fuzzy similarity to measure the top-k words, in IEEE Access, vol. 9,
pp. 126801-126821, 2021.
[4]. Shuaizhang et al. (2019) proposed an extended citation model for scientific document
clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi:
10.1109/ACCESS.2021.3125729.
[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria
indexing and ranking model for documents and web pages on large scale data, Journal of
King Saud University – Computer and Information Sciences, 2021.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan Mahmood,
Twitter trends: A ranking algorithm analysis on real time data, Expert Systems with
Applications, Volume 164, 2021.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department of
Informatics, Athens University of Economics and Business, Greece, Institute for Language
and Speech Processing Research Center ‘Athena’, Greece, 2021.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw,
Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael
Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via
semantic similarity assessment improves efficiency of medical chart review, Journal of
Biomedical Informatics, Volume 132, 2022.
[11]. Bo Xu, Hongfei Lin, Yuan Lin, Kan Xu, Two-stage supervised ranking for emotion
cause extraction, Knowledge-Based Systems, Volume 228, 2021.
[13]. M. AbuSafiya, "Measuring Documents Similarity using Finite State Automata," 2020
2nd International Conference on Mathematics and Information Technology (ICMIT), 2020,
pp. 208-211.
[16]. F. Ye, X. Zhao, W. Luo, D. Li and W. Min, "Query-Adaptive Remote Sensing Image
Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity," in IEEE
Access, vol. 8, pp. 116824-116839, 2020.
[17]. Jesus Serrano-Guerrero, Francisco P. Romero, Jose A. Olivas, A relevance and quality-
based ranking algorithm applied to evidence-based medicine, Computer Methods and
Programs in Biomedicine, Volume 191, 2020.
[18]. Yun Li, Yongyao Jiang, Chaowei Yang, Manzhu Yu, Lara Kamal, Edward M.
Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Improving search ranking
of geospatial data based on deep learning using user behavior data, Computers &
Geosciences, Volume 142, 2020.
[22]. R. Dong, Z. -g. Wei, C. Liu and J. Kan, "A Novel Loop Closure Detection Method
Using Line Features," in IEEE Access, vol. 7, pp. 111245-111256, 2019.
[23]. J. Kim, "A Document Ranking Method with Query-Related Web Context," in IEEE
Access, vol. 7, pp. 150168-150174, 2019.
[24]. C. Xia, T. He, W. Li, Z. Qin and Z. Zou, "Similarity Analysis of Law Documents Based
on Word2vec," 2019 IEEE 19th International Conference on Software Quality, Reliability
and Security Companion (QRS-C), 2019, pp. 354-357.
[25]. Y. Ma, P. Zhang and J. Ma, "An Ontology Driven Knowledge Block Summarization
Approach for Chinese Judgment Document Classification," in IEEE Access, vol. 6, pp.
71327-71338, 2018.
[27]. M. Liu, B. Lang, Z. Gu and A. Zeeshan, "Measuring similarity of academic articles with
semantic profile and joint word embedding," in Tsinghua Science and Technology, vol. 22,
no. 6, pp. 619-632, December 2017.
[28]. Olga Vechtomova, Murat Karamuftuoglu, Lexical cohesion and term proximity in
document ranking, Information Processing & Management, Volume 44, Issue 4, 2008.
[29]. Czesław Daniłowicz, Jarosław Baliński, Document ranking based upon Markov chains,
Information Processing & Management, Volume 37, Issue 4, 2001.
[30]. H. Shen, L. Xue, H. Wang, L. Zhang and J. Zhang, "B+-Tree Based MultiKeyword
Ranked Similarity Search Scheme Over Encrypted Cloud Data," in IEEE Access, vol. 9, pp.
150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
APPENDIX I - SOURCE CODE
1.PreProcessing
import spacy
from nltk.stem.porter import PorterStemmer
from gensim.utils import simple_preprocess

# The lines below reconstruct the elided middle of this listing (loading the spaCy
# model, reading the source document, tokenizing and lemmatizing it); the model name
# and the intermediate variable names are assumptions.
nlp = spacy.load("en_core_web_sm")
with open("/content/pf-source.txt", "r") as f:
    raw_text = f.read()

tokens = simple_preprocess(raw_text)            # tokenize, lowercase, strip punctuation
lemmatized_tokens = [tok.lemma_ for tok in nlp(" ".join(tokens))]
lemmatized_text = " ".join(lemmatized_tokens)

# Print the resulting text
print(lemmatized_text)
doc2 = nlp(lemmatized_text)
2. Gensim
from gensim import corpora, models, similarities
# Define some sample documents
with open("/content/pf-source.txt", "r") as f:
    doc1 = f.read()
with open("/content/pf-2.txt", "r") as f:
    doc2 = f.read()
#doc2 = "This document is the second document"
with open("/content/pf-3.txt", "r") as f:
    doc3 = f.read()
with open("/content/pf-4.txt", "r") as f:
    doc4 = f.read()
with open("/content/pf-5.txt", "r") as f:
    doc5 = f.read()
with open("/content/ps-6.txt", "r") as f:
    doc6 = f.read()
# Create a corpus of documents
documents = [doc1, doc2, doc3, doc4,doc5,doc6]
text_corpus = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(text_corpus)
corpus = [dictionary.doc2bow(text) for text in text_corpus]
# The model, similarity index and query below are reconstructed steps that are elided
# from this listing; they follow the description in Section 4.2.3.
tfidf_model = models.TfidfModel(corpus)
similarity_index = similarities.MatrixSimilarity(tfidf_model[corpus])

# The query document is the first document in the collection.
query = documents[0]
query_vec = dictionary.doc2bow(query.lower().split())

# Calculate the similarities between the query vector and each document in the corpus
sims = similarity_index[tfidf_model[query_vec]]

# Rank the documents by similarity score in descending order.
result_docs = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)
for doc_id, sim_score in result_docs:
    print(doc_id, sim_score)
3. Accuracy
relevant_docs = [doc1, doc2, doc3, doc4, doc5, doc6]
# the relevant documents are assumed to be doc1 through doc6
num_relevant_docs = len(relevant_docs)
num_correct = 0
for doc_id, sim_score in result_docs:
    if documents[doc_id] in relevant_docs:
        num_correct += 1
accuracy = num_correct / num_relevant_docs
print(f"Accuracy: {accuracy:.2f}")
APPENDIX II - SCREENSHOTS
Figure 02 Tokenization
Figure 03 Removal of punctuation
Figure 05 Stop word removal
Figure 07 Document ranking based on the similarity score before stemming