
20CS713 PROJECT PHASE II

DOCUMENT RANKING BASED ON SIMILARITY USING


NATURAL LANGUAGE PROCESSING TECHNIQUE

A PROJECT REPORT

Submitted by

RISHIK P 111720102101
PAVAN P 111720102102
LOKESH P 111720102111

in partial fulfillment for the award of the degree


of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

R.M.K. ENGINEERING COLLEGE


(An Autonomous Institution)
R.S.M. Nagar, Kavaraipettai-601 206

March 2024
R.M.K. ENGINEERING COLLEGE
(An Autonomous Institution)
R.S.M. Nagar, Kavaraipettai-601 206

BONAFIDE CERTIFICATE

Certified that this project report “DOCUMENT RANKING BASED ON SIMILARITY


USING NATURAL LANGUAGE PROCESSING TECHNIQUE” is the bonafide work
of RISHIK P (111720102110), PAVAN P (111720102112), LOKESH P (111720102101)
who carried out the 20CS713 Project Phase II work under my supervision.

SIGNATURE

Dr. T. Sethukarasi, M.E., M.S., Ph.D.
Professor and Head
Department of Computer Science and Engineering
R.M.K. Engineering College
R.S.M. Nagar, Kavaraipettai,
Tiruvallur District – 601206.

SIGNATURE

Ms. P. Baby Shamini, M.E., (Ph.D.)
Assistant Professor
Department of Computer Science and Engineering
R.M.K. Engineering College
R.S.M. Nagar, Kavaraipettai,
Tiruvallur District – 601206.

Submitted for the Project Viva–Voce held on .................................... at R.M.K. Engineering College, Kavaraipettai, Tiruvallur District – 601206.

INTERNAL EXAMINER EXTERNAL EXAMINER

I
ACKNOWLEDGEMENT

We earnestly portray our sincere gratitude and regard to our beloved Chairman Shri. R. S.

Munirathinam, our Vice Chairman, Shri. R. M. Kishore and our Director, Shri. R.

Jyothi Naidu, for the interest and affection shown towards us throughout the course.

We convey our sincere thanks to our Principal, Dr. K. A. Mohamed Junaid, for being the

source of inspiration in this college.

We reveal our sincere thanks to our Professor and Head of the Department, Computer

Science and Engineering, Dr. T. Sethukarasi, for her commendable support and

encouragement for the completion of our project.

We would like to express our sincere gratitude to our Project Guide, Ms. P. Baby Shamini,

Assistant Professor, for her valuable suggestions towards the successful completion of this

project.

We take this opportunity to extend our thanks to all faculty members of Department of

Computer Science and Engineering, parents and friends for all that they meant to us during

the crucial times of the completion of our project.

II
ABSTRACT

As a result of the growth of the internet, information has been digitized as files and documents.
Information retrieval is the process of obtaining the required documents from vast databases.
As the number of documents multiplies every five years, it is crucial to retrieve information
effectively, and automatic indexing of documents is therefore necessary for efficient information
retrieval. In the related work, challenges of information retrieval such as classification of
similar documents, extraction of keywords, clustering of scientific documents and evaluation
of similarity are addressed using natural language processing methods. This research provides
an unsupervised model, built with the Gensim library, to rank documents based on the user's
query. The proposed model preprocesses the documents, generates a similarity score for each
document with respect to the user's query and ranks the documents based on that score. The
proposed methodology makes it easier to quickly retrieve the relevant documents based on the
user's needs. To evaluate the performance of the proposed method, we collected documents
from different domains on the internet. The proposed method retrieves comparatively more
relevant documents than traditional methods.

Keywords: Information retrieval, Document Ranking, Similarity and Natural Language Processing.

III
TABLE OF CONTENTS

CHAPTER TITLE PAGE NO


ABSTRACT III
LIST OF FIGURES VI
LIST OF TABLES VII
LIST OF ABBREVIATIONS VIII
1 INTRODUCTION 01
1.1 Problem Statement 02
1.2 Literature Survey 02
1.3 System Requirement 08
1.3.1 Hardware Requirements 08
1.3.2 Software Requirements 08
1.3.3 Feasibility Study 08

2 SYSTEM ANALYSIS 10
2.1 Existing System 10

2.1.1 Disadvantages of Existing System 10

2.2 Proposed System 12

2.2.1 Advantages of Proposed System 12

3 SYSTEM DESIGN 14
3.1 System Architecture 14
3.2 UML Diagrams 18
3.2.1 Use Case Diagram 18
3.2.2 Class Diagram 19
3.2.3 Data Flow Diagram 20
3.2.4 Activity Diagram 21

IV
4 SYSTEM IMPLEMENTATION 22
4.1 Modules 22
4.2 Module description 23
4.2.1 Data Set Collection 23
4.2.2 Data Pre-Processing 23
4.2.3 Building and applying gensim 26
4.2.4 Ranking the document 27
4.2.5 Computing the evaluation parameters 27
4.3 Algorithms 28
4.3.1 Natural language processing 28
4.3.2 Gensim 29
4.3.3 Term Frequency-Inverse Document Frequency 30

4.3.4 Bag of words 31

4.4 Testing 32

4.4.1 Testing Methods 32

4.4.1.1 Unit Testing 32

4.4.1.2 Integration Testing 33

4.4.1.3 System Testing 33

5 RESULTS & DISCUSSION 34


6 CONCLUSION 36
REFERENCES 37
APPENDIX I - SOURCE CODE 41
APPENDIX II - SCREENSHOTS 45

V
LIST OF FIGURES

FIGURE NO FIGURE NAME PAGE NO

3.1 System Architecture 14

3.2.1 Use Case Diagram 18

3.2.2 Class Diagram 19

3.2.3 Data flow diagram 20

3.2.4 Activity Diagram 21

4.3.1 NLP 28

4.3.2 Gensim 29

4.3.3 TF-IDF 30

4.3.4 Bag of words 31

5.1 Similarity scores before and after stemming 35

5.2 Difference between similarity scores 35

01 Original Text 45

02 Tokenization 46

03 Removal of punctuation 47
04 Word folding 47

05 Stop Word Removal of text 48

06 Word Stemming 48

07 Document ranking based on the similarity score before stemming 49

08 Document ranking based on the similarity score after stemming 49

VI
LIST OF TABLES

TABLE NO TABLE NAME PAGE NO

4.1 Modules 19

5.1 Similarity scores before and after stemming 35

VII
LIST OF ABBREVIATIONS

S.NO ABBREVIATION EXPANSION

01 NLP Natural Language Processing

02 TF-IDF Term Frequency-Inverse Document Frequency

03 AI Artificial Intelligence

04 NLTK Natural Language Toolkit

05 SVM Support Vector Machine

06 OS Operating System

07 PCA Principal Component Analysis

08 STS Semantic Textual Similarity

09 RAM Random Access Memory

10 LSA Latent Semantic Analysis

11 MLIR Multi-Lingual Information Retrieval

12 BSMRS Basic Similarity-based Multi-keyword Ranked Search

13 ESMRS Enhanced Similarity-based Multi-keyword Ranked Search

VIII
CHAPTER 1
INTRODUCTION

Over many years, retrieval of information has been a tremendous task for any
organization due to the vast volume of data growing continuously. It is an uphill battle for the
user to retrieve the required information from an immense amount of data. Information or
data is stored in the form of documents in any organization. The volume of documents available
in digital form doubles every five years, so the identification of similar documents is
beneficial. Similarity among the data plays a crucial role in information retrieval. Segregation
of documents helps the user in many ways, but manual segregation of documents takes a lot of
time and human resources. Thus, this project demonstrates a technological approach that
simplifies the task by automatically segregating data. To carry out this idea, various
documents are used as input. The user supplies a query document, and the system
automatically ranks the documents with respect to the query based on similarity. This chapter briefly
introduces the work by first discussing the overview of the project and the
problem statement, followed by the objective, existing systems, the significance, and finally
the limitations. Document similarity is a method of grouping similar documents from large
document datasets under one query. Ranking the documents for a query based on similarity
helps the user to access the required information efficiently. Document similarity is
mainly used in the process of information retrieval, which is a major task in
any organization. Information retrieval helps users to formulate solutions for present
problems by analyzing past results. Information or results are mainly stored in the form of
documents or portable document formats.

1
1.1 Problem Statement

Because of the increasing amount of data, search engines encounter difficulties
in fetching results that are highly relevant to users' search queries. Traditional document ranking
methods are mostly based on similarity computations between documents and queries. In many
cases users may want to retrieve documents that are not only similar but also general or broad
regarding a certain topic. So, to rank the documents efficiently and accurately, there is a need
for document ranking using a semantic measure.

1.2 Literature Survey

[1]. Benzi Xu et al. (2021) used the pseudo-longest-common-subsequence (pseudo-LCS)
and the Jaccard similarity coefficient, combined using principal component analysis (PCA),
Volume 132, 2022.

This work used the pseudo-LCS and the Jaccard similarity coefficient, combined through PCA,
and K-medoids was also used to handle the soft constraints. To effectively measure the
similarity of the operation sequences, a deep analysis was performed to determine the
information requirements and characteristics of the operation sequence similarity problem.
The modified pseudo-LCS is proposed to record the first two pieces of information, and a
corresponding backtracking algorithm is also presented. The Jaccard similarity coefficient is
used to measure the last piece of information. These two similarity coefficients are combined
based on PCA to generate a novel comprehensive similarity coefficient. The numerical
illustration shows that it can distinguish all the different cases with rational similarity
values. Typical process route discovery is a practical problem; two conflicting soft
constraints are introduced and solved by the K-medoids method. The proposed method
effectively measures the similarity of the operation sequence, and the resulting similarity
coefficient helps the user to compute the ranks of the sequences. The proposed methodology,
however, still has soft constraints while clustering the documents. The reported advantage of
this work is improved product quality.

[2]. Qian Liu et al. (2021) proposed the use of association rules for measuring word
similarity at a global level and fuzzy similarity to measure the top-k words, in IEEE
Access, vol. 9, pp. 126801-126821, 2021.

This work proposed the use of association rules for measuring word similarity at a global level and
fuzzy similarity to measure the top-k words. For the top-k words, the authors proposed a
similarity measure for word embedding in which both local and global information are considered.
The global information is measured with association rules and the local information is measured
by word embedding; the authors also compared this proposed method to eight state-of-the-art
baselines. The data sets used by the authors are TREC disks 4&5, WT10G, and RCV1. The
authors built a fuzzy logic system which overcomes the problems associated with combining
the two types of measures by inferring the similarity between words and then returning the
top-k selected words. The advantage of fuzzy logic is that it provides a flexible and
convenient way to transform expert knowledge expressed in natural language into fuzzy
rules. This paper contains three components: local, global and fuzzy systems, but there is
no component-wise validation.

[3]. N. Kumar, S. K. Yadav and D. S. Yadav, "Similarity Measure Approaches Applied
in Text Document Clustering for Information Retrieval," 2020 Sixth International
Conference on Parallel, Distributed and Grid Computing (PDGC), 2020, pp. 88-92.

This work used the standard K-means algorithm, which computes a distance
measure. Experiments were carried out on datasets of similar news articles and on tasks such as
analysis of customer feedback, text mining, duplicate content detection, and finding similar
documents. In the dataset, the number of assigned categories is the same as the number of
clusters. Two evaluation measures, purity and entropy, give the quality of a clustering
result. Euclidean distance measures are used to perform the document clustering
more effectively. The Pearson and Jaccard methods are more suitable for finding rational clusters
with high purity values, where each cluster is dominated by documents from a single group.
The advantage of the experiment is that the frequency is calculated according to the
typed dataset, but the accuracy is very low.

[4]. Shuai Zhang et al. (2021) proposed an extended citation model for scientific
document clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi:
10.1109/ACCESS.2021.3125729.

This work proposed an extended citation model for scientific document clustering, together with a
citation network and a textual similarity network, to enhance the performance of scientific
document clustering. The experiments were conducted using the PMC and PubMed databases,
two popular databases in the biomedical field which provide a large number of open-access,
full-text scientific documents, with 10,996 scientific documents used. Java and R
programming were used for the implementation of the experiment. The data were preprocessed
before the experiment, and the authors constructed a textual similarity network and
integrated the citation network, co-citation network, bibliographic coupling
network, and textual similarity network. They proved the practicability of their proposed
extended citation model by comparing it with the traditional bibliographic coupling model
and the textual similarity model for scientific document clustering using the R
programming language. They used a random walk algorithm, a popular community
detection algorithm, whose input is the similarity network and whose output is the clustering
result. The advantage is the efficient clustering of scientific documents by considering the
frequency of the document. The limitation is that a limited dataset was used.

[5]. M. P. Mahalakshmi and N. S. Fatima, "Maximum Entropy Principle based


Document Ranking with Term Selection Analysis for Cross-Lingual Information
Retrieval," 2021 Third International Conference on Intelligent Communication
Technologies and Virtual Mobile Networks (ICICV), 2021, pp. 1015-1019.

In this model, the authors used Maximum Entropy Principle based Document Ranking with
Term Selection Analysis (MEPDR-TSA) for CLIR. The user query in the Tamil language is
translated into the English language. Then, the MEPDR technique is employed for the
ranking of the documents and TSA is used for choosing a set of retrieved documents from
each query. Finally, the retrieved English documents are converted back into the Tamil
language using Google Translate, and the results are tested against precision,
recall, and F-score. The authors proposed Maximum Entropy Principle based Document
Ranking with Term Selection Analysis, converting the Tamil query into English and then
converting the retrieved English documents back into Tamil. The proposed method performs
better than existing methods such as Okapi BM25, IB-MLIR, MULM, KNN, and n-gram,
but it should be improved with advanced document ranking methods.

[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi


criteria indexing and ranking model for documents and web pages on large scale data,
Journal of King Saud University - Computer and Information Sciences, 2021.

This work proposed a multi-criteria indexing and retrieval model for web pages and
documents. It uses different retrieval methods to get an accurate document and it handles the
page ranking algorithm issues; the model utilizes the top seven criteria for indexing and
retrieving results. First phase: users enter the required queries. The MCIR then
crawls online or offline. The first step is finding pages or documents existing on the web, if it
is working online, or through stored files on a machine if it is working offline. Once the system
finds a page URL or document path, it visits it and evaluates it according to the user's search
query. Second phase: at this stage, the weight model starts generating a weight for each
criterion according to user preferences. It then computes rank statistics to return the final page
weight to the user. This model proved that ranking through multiple criteria gives different
results than one or two criteria as compared to previous algorithms. It retrieves the documents
and web pages based on the top seven criteria: page or document votes, keywords in (domain,
page content, URL), page publish date, page modified date, number of links, page load time,
and bad links.

[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan
Mahmood, Twitter trends: A ranking algorithm analysis on real time data, Expert
Systems with Applications, Volume 164, 2021.

This work explored the Term Frequency-Inverse Document Frequency (TF-IDF), Combined
Component Approach (CCA) and Biterm Topic Model (BTM) approaches for finding the
topics and the terms within given topics. Data set: data is collected using the Twitter
application programming interface (API), which extracts data from Twitter sources. First, the
data is extracted from the Twitter API and the results are stored in an .xlsx file. In the next step,
data cleaning involves deleting unused data and duplicates, doing a spell check, and other
modifications that make the data easier to understand. Stemming translates past-tense verbs
into their present-tense forms, tokenization develops tokens in connection with the given roles
of words in a sentence, and normalization transforms the text into a consistent form. Data
integration is a procedure that gathers information from several sources and unifies it. The data
reduction procedure assists with large-scale dataset analysis by condensing high-volume
datasets while retaining all relevant information. In the last phase, the dataset is used to apply
various models, including TF-IDF, CCA, and BTM, to extract the subjects from the tweet
collection.

[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department
of Informatics, Athens University of Economics and Business, Greece, Institute for
Language and Speech Processing, Research Center ‘Athena’, Greece, 2021.

This work used POSIT-DRMM (PDRMM), a differentiable extension of DRMM. The authors
proposed an architecture to jointly rank documents and snippets with respect to a question, the
two particularly important stages in QA for large document collections. They instantiated the
proposed architecture using a recent neural relevance model (PDRMM) and a BERT-based
ranker. Using biomedical data (from BIOASQ), they showed that the two resulting joint
models (PDRMM-based and BERT-based) vastly outperform the corresponding pipelines in
snippet retrieval, the main goal in QA for document collections, using fewer parameters, and
also remain competitive in document retrieval. They provided a modified version of the
Natural Questions dataset, suitable for document and snippet retrieval. The documents are
retrieved using fewer parameters, such as snippets of the document for the questions. The
advantage of this method is that document retrieval results are better than DRMM and several
other neural rankers. However, the dataset should be extended to a multi-granular task;
BIOASQ already includes such a multi-granular task, but exact answers are provided only for
factoid questions and they are freely written by humans, as in MS-MARCO, with similar
limitations. Hence, appropriately modified versions of the BIOASQ datasets are needed.

[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline
Honerlaw, Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon,
Michael Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking
algorithm via semantic similarity assessment improves efficiency of medical chart
review, Journal of Biomedical Informatics, Volume 132, 2022.

The authors used the pGUESS algorithm, a prior-guided semantic similarity measure, to
quantify the informativeness of a clinical note for a given phenotype. The algorithm scores the
relevance of a note as the cosine similarity between SEVnote and SEVref. The pGUESS
algorithm is fully knowledge-based except for assigning notes into the three categories via
clustering. The results on note ranking for CAD performed at both VHA and PHS suggest high
transportability of the pGUESS algorithm across two different healthcare systems, since the
pGUESS algorithm does not require local EHR data but only knowledge sources and
embedding vectors. The algorithm reduced the burden of chart review and improved the
efficiency and accuracy of human annotation. Determining patient disease status via chart
review is a critical yet labour-intensive task needed to train or validate robust EHR-based
prediction algorithms.

The advantage of this algorithm is that the overall ranking quality, as measured by the rank
correlation, was the highest for pGUESS compared to all other methods. However, the publicly
available TMGaussian algorithm does not allow out-of-sample prediction, so the authors only
tested the portability of LDAgibbs and LEAvem.

[10] P. Zhang, X. Huang, Y. Wang, C. Jiang, S. He and H. Wang, "Semantic Similarity


Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion," in IEEE
Access, vol. 9, pp. 8433-8443, 2021.

This work used a combination of traditional statistical methods and a deep learning model, as
well as a novel model based on multi-model nonlinear fusion proposed in the paper. The Ant
Financial data set comes from the Chinese text similarity contest held by Alipay; its data
mainly come from Alipay's customer service data and contain 100,000 records.
Semantic textual similarity (STS) datasets from 2012 to 2015 are also used in the experiment.
The model uses the Jaccard coefficient based on part of speech, Term Frequency-Inverse
Document Frequency (TF-IDF) and the word2vec-CNN algorithm to measure the similarity of
sentences respectively. The model combines the traditional sentence similarity calculation
methods based on statistics and completes the coarse-grained extraction of the sentence. The
results of the Jaccard algorithm, TF-IDF algorithm and word2vec-CNN are input into a
shallow fully connected neural network to train the model and give an ideal classification
result. The Jaccard algorithm captures grammatical information, TF-IDF calculates
sentence similarity from the word frequency and inverse document frequency, and in the
word2vec-CNN algorithm the sentence feature matrix is weighted by a multi-feature
attention mechanism, which increases the performance. The experimental results of the
proposed method show that the proposed sentence similarity calculation method based on
multi-feature fusion can balance the calculation results of multiple models. Experimental
results show that the matching accuracy of the sentence similarity calculation method based on
multi-model nonlinear fusion is 84%, and the F1 value of the model is 75%. In this experiment,
the similarity of the sentence is measured with the respective meaning of the words in the
sentence. The limitations are that the word vector given by the word2vec model is static and
cannot describe the dynamic change of semantics, and the accuracy can be improved.

7
1.3 System Requirements

1.3.1 Hardware Requirements

Operating system : Windows 8+

RAM : 4 GB Minimum
Hard disk or SSD : More than 500 GB
Processor : Intel 3rd generation or higher, or Ryzen, with 8 GB RAM

1.3.2 Software Requirements

Front End : HTML, CSS, BOOTSTRAP
Framework : Flask
Monitor : SVGA
Server side Script : Python
Scripts : JavaScript, jQuery

1.3.3 Feasibility Study

The feasibility of the project is analysed in this phase and a business proposal is put forth with
a very general plan for the project and some cost estimates. During system analysis, the feasibility
study of the proposed system is carried out. This is to ensure that the proposed system is
not a burden to the company. For feasibility analysis, some understanding of the major requirements
for the system is essential.

Three key considerations involved in the feasibility analysis are


 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 OPERATIONAL FEASIBILITY

ECONOMICAL FEASIBILITY :
The economic feasibility of document ranking depends on various factors, primarily the specific
application and the value it brings to users or businesses. Document ranking, often associated with
information retrieval and search engines, can have economic benefits in different contexts.
Here are some considerations:
 User satisfaction and Engagement
 Productivity and Efficiency
 Adaptability and Scalability
TECHNICAL FEASIBILITY:
The technical feasibility of document ranking involves assessing whether the implementation
of a document ranking system is technically viable given the available technology, resources
and infrastructure.
Here are key considerations for evaluating the technical feasibility of document ranking:
 Algorithm complexity
 Data availability and Quality
 Feature Engineering
 Testing and Evaluation

OPERATIONAL FEASIBILITY:
The operational feasibility for document ranking refers to whether the system can be effectively
integrated into existing operations and processes. It assesses the practicality of implementing
the document ranking system within the operational context of an organization.
Here are key considerations for evaluating the operational feasibility of document ranking:
 User acceptance
 Data input and output
 Resource availability
 Legal and regulatory compliance

9
CHAPTER 2
SYSTEM ANALYSIS
2.1 Existing System

The existing system of Natural Language Processing is confined to the stream of computer
science and artificial intelligence. Natural language processing mainly deals with the
interaction between computers and human language and focuses on how computers
process and analyze large amounts of data. NLP techniques help computers to
understand the context in a document. Due to the increase in data in digital formats such as
documents, the classification of data becomes a difficult task, so different methods are used
to classify the data based on the user's requirements. Traditional methods such as manual
retrieval of data take huge time and effort; therefore, automatic retrieval of data takes
place. Initially, the raw data should be converted into a real number format. Secondly, the
preprocessing of the data takes place by performing tokenization and removal of stop words,
which do not carry any meaning. Finally, according to the user's query, the similarity metrics
are measured. These steps take place with the help of NLP techniques. The existing methods
mainly use a hybrid model to obtain the results. The hybrid model combines two or more
techniques, since the accuracy of a single technique is less than that of hybrid techniques. The
hybrid technique can be used in the analysis of large and complex documents, the
entertainment industry, resume ranking and many more. But the usage of the hybrid model
makes the process complex. The hybrid method is a combination of various NLP techniques
such as TF-IDF, the Jaya and grey wolf optimizers, and the longest common subsequence,
possibly combined with machine learning methods such as CNN. The hybrid method also
increases the time complexity of the work, and most existing work does not extend the
similarity metrics or the dictionary.

2.1.1 Disadvantages of Existing System

Bias and fairness issue:

 Document ranking algorithms may inadvertently perpetuate or amplify biases present


in the training data.
 If the training data is biased, the system might favor certain groups or perspectives,
leading to unfair rankings.

10
Limited understanding of context:
 Document ranking systems may struggle to understand the context of the content they
are evaluating.
 They often rely on keyword matching and statistical patterns, which might not capture
the nuanced meaning of documents accurately.

Difficulty in handling multimedia texts:

 Many existing document ranking methods are primarily designed for text-based
content and may not effectively handle multimedia content such as images, videos, or
audio.

 This can limit the usefulness of search engines for users seeking diverse types of
information.

Scalability challenges:

 As the volume of documents and user queries continues to grow, existing document
ranking methods may struggle to maintain scalable performance without sacrificing
relevance or quality.

 This can result in slower response times or degraded search experiences for users.

Limited personalization:

 Existing document ranking methods often provide one-size-fits-all rankings based on


the query and document content, without considering individual user preferences,
search history, or context.

 This lack of personalization can lead to suboptimal search experiences for users who
have diverse needs and preferences.

11
2.2 Proposed System:

In this model, ranking the documents based on similarity to the user's query is done using
Natural Language Processing. The ranking of the documents helps in the efficient retrieval of
information in any firm. This model helps in identifying the most relevant and important
information based on the user's requirements.
The proposed work is implemented in Python 3.8 with the libraries Gensim, spaCy, NumPy and
NLTK, the corpora, models and similarities modules, and other mandatory libraries. There are
many applications of Natural Language Processing such as information retrieval, classification
of documents, spell checking, estimation of similarity, keyword extraction, language
translation and many other information retrieval problems. Most information retrieval
problems are solved using NLP techniques, and in these techniques data processing plays an
important role in solving the problem. The proposed model is implemented using an
unsupervised learning algorithm and the well-known Python library Gensim, which ranks the
documents based on similarity to the user's query with better accuracy. The dataset, which
consists of simple text, is pre-processed with different steps such as word tokenization,
removal of punctuation, stop word removal and word stemming. Word stemming plays a
crucial role in finding the accurate similarity. The preprocessed documents are split into
source and query documents, the proposed model is applied to the preprocessed documents,
the source is compared with the other target documents, the scores of the documents are
obtained and the documents are ranked based on similarity to the user's query. The
performance of the proposed method is better than that of traditional methods.
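A minimal sketch of the assumed environment set-up is shown below; the package list follows the libraries named above, while the installation command and the NLTK resource names are illustrative assumptions rather than part of this report.

# Illustrative set-up for the proposed model (Python 3.8).
# Packages as listed above; exact versions are not fixed by this report.
#   pip install gensim nltk spacy numpy

from gensim import corpora, models, similarities  # dictionary, TF-IDF model, similarity index
import nltk

# NLTK resources assumed for preprocessing (tokenization and stop words).
nltk.download("punkt")
nltk.download("stopwords")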

2.2.1 Advantages of Proposed System


 Better data handling:
In the present digital world, a lot of unstructured data such as documents, pdf and
emails are handled with the help of NLP techniques. NLP techniques help in
information retrieval, efficient access to data, and ordering of data.

 Time-efficient:
The automatic generation of solutions to any natural language problem reduces the
time complexity for the users.

12
 Market Research and Analysis:
Automatic summarization of research papers, Extraction of keywords, and clustering
of similar documents in query help the researchers.

 Streamlined processes:
Traditional methods of communication such as help centres and customer care are
replaced with chatbots to improve customer satisfaction.

 Improve customer satisfaction:


NLP techniques such as sentiment analysis of feedback, customer satisfaction
surveys, and review analysis help in the efficient analysis of customer problems,
give relevant results, and improve customer satisfaction.

 E-learning:
NLP machine learning technology can examine the language used in a classroom to
define the mental states of both teachers and students.

 Provide high-quality information:


With the help of NLP techniques, the user gets high-quality, relevant
information.

13
CHAPTER 3

SYSTEM DESIGN

3.1 System Architecture

Figure 3.1 System Architecture

 Preprocessing:
Preprocessing consists of term folding, term tokenization, removal of punctuation,
stop term elimination and term stemming. Preprocessing is the first thing to do to a
document so that further processing is easier. It raises reliability and accuracy.
Preprocessing data can increase the correctness and quality of a dataset, making it
more reliable by removing missing or inconsistent data values brought on by human
or computer mistakes. It ensures consistency in data.

 Term Folding:

Preprocessing is applied to the data once it has been received. Term folding is a
preprocessing technique that lowercases words that are currently in uppercase. As a
result, two terms that differ only in case appear as the same lowercase term.

 Removal of Punctuation:

Grammar is defined as the rules for forming well-structured sentences. While
describing the syntactic structure of well-formed programs, grammar plays a very
essential and important role. In simple words, grammar denotes the syntactical rules
that are used for conversation in natural languages. The data's punctuation and spaces
are also eliminated. For simpler processing, we should eliminate these punctuation marks.

 Term Tokenization:

Term tokenization then separates the raw text into tokens, which are words and
sentences. By studying the words, these tokens help the reader in determining the
context and analyzing the text's meaning.

 Stop Term Elimination:

Elimination of stop term occurs in the next preprocessing stage. In any language, stop
terms are a group of frequently used terms. Stop words in English include "the," "is,"
and "and," for instance. Stop words are used to remove unnecessary words so that
computers can concentrate on the crucial ones. It is one of the most commonly used
preprocessing steps across different NLP applications.

 Term Stemming:

Word stemming, the final preprocessing stage, removes the final few characters from
a word, frequently resulting in inaccurate spelling and meaning. Lemmatization
considers the context and converts the word to its meaningful base form, which is
called Lemma. For instance, stemming the word 'Eating' would return 'Eat'.

 UPDATED SOURCE & TARGET DOCUMENTS:

After preprocessing is finished, the input source document and the set of documents in
the dataset are updated with the preprocessed data.

15
 BUILT GENSIM MODEL:

Gensim is an open-source Python natural language processing library used for
unsupervised modeling. The features of Gensim are its scalability, robustness, platform
independence and efficient multicore implementations. Gensim provides fastText,
Word2vec, LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and
tf-idf (term frequency-inverse document frequency). fastText, which uses a neural network
for word embedding, is a library for learning word embeddings and text
classification. Word2vec, used to produce word embeddings, is a group of shallow,
two-layer neural network models. LSA is a technique in NLP (Natural Language
Processing) that allows us to analyse relationships between a set of documents and
the terms they contain. LDA is a technique in NLP that allows sets of observations to
be explained by unobserved groups; these unobserved groups explain why some
parts of the data are similar. tf-idf, a numeric statistic in information retrieval, reflects
how important a word is to a document in a corpus. It is often used by search engines
to score and rank a document's relevance given a user query. The facilities provided
by Gensim for building topic models and word embeddings are unparalleled. It also
provides convenient facilities for text processing and can handle large text files even
without loading the whole file into memory. Gensim does not require costly annotations
or hand tagging of documents because it uses unsupervised models. The core concepts
of Gensim are document, corpus, vector and model. A document is an object of the
text sequence type, known as 'str' in Python 3. A corpus may be defined as a large and
structured set of machine-readable texts produced in a natural communicative setting;
in Gensim, a collection of document objects is called a corpus. The corpus serves as
input for training a model and as a topic extractor. A vector is a mathematical
representation of a document. A model refers to an algorithm used for transforming
vectors from one representation to another. For working on text documents, Gensim
also requires the words, i.e. tokens, to be converted to their unique ids. To achieve
this, it provides the Dictionary object, which maps each word to a unique integer id. It
does this by converting the input text to a list of words and then passing it to the
corpora.Dictionary() object. In Gensim, the dictionary object is used to create a bag of
words (BoW) corpus, which is further used as the input to topic modelling and other
models. The Term Frequency-Inverse Document Frequency model is also a
bag-of-words model. It is different from the regular corpus because it down-weights
tokens, i.e. words, appearing frequently across documents. During initialisation, the
tf-idf model algorithm expects a training corpus having integer values (such as the
Bag-of-Words model). Then, at the time of transformation, it takes a vector
representation and returns another vector representation. The output vector has the
same dimensionality, but the value of the features that were rare at the time of training
is increased. It basically converts integer-valued vectors into real-valued vectors.
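The sketch below illustrates the Dictionary and TF-IDF transformation described above; the token lists are illustrative examples, not documents from the collected dataset.

from gensim import corpora, models

texts = [["document", "ranking", "uses", "similarity"],
         ["similarity", "between", "documents"],
         ["ranking", "documents", "by", "similarity", "score"]]

dictionary = corpora.Dictionary(texts)               # maps each word to a unique integer id
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # integer-valued (id, count) vectors

tfidf = models.TfidfModel(bow_corpus)                # trained on the bag-of-words corpus
print(bow_corpus[0])                                 # e.g. [(0, 1), (1, 1), (2, 1), (3, 1)]
print(tfidf[bow_corpus[0]])                          # same ids with real-valued weights; words
                                                     # frequent across documents are down-weighted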

 COMPARING AND RANKING THE DOCUMENTS:

After building the Gensim model, the input data is compared with the documents in the
data set to measure the similarity: the score of each target document (a document in the
dataset) is computed with respect to the source document, and the target documents are
sorted according to the scores obtained, i.e. the documents are ranked based on their
scores.

17
3.2 UML Diagrams

3.2.1 Use Case Diagram

The use-case diagram presents the functionality provided by a system in


terms of actors, their goals and any dependencies between those use cases.
The actors involved in the project are the user and the system. The user
uploads the dataset for pre-processing and the system evaluates and predicts
the result.

Figure 3.2.1 Use Case Diagram

18
3.2.2 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, attributes, operations (or methods), and the relationships among the classes. It
explains which class contains which information. In this class diagram there is a Document
class with a document type, along with Folder and DocumentVersion classes.

Figure 3.2.2 Class Diagram

19
3.2.3 Data Flow Diagram

The Data Flow Diagram (DFD) shows the information flow in the system: the user uploads
and views the data, and the system evaluates it and provides the result.

Figure 3.2.3 Data Flow Diagram

20
3.2.4 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modelling
Language, activity diagrams can be used to describe the business and operational
step-by-step workflows of components in a system. An activity diagram shows the overall
flow of control.

Figure 3.2.4 Activity Diagram

21
CHAPTER 4

SYSTEM IMPLEMENTATION

4.1 Modules:

Module                                  Functions of Module

Data Set Collection                     - upload_csv
                                        - retrieve_file
                                        - prompt_user

Data Pre-processing                     - handle_null_values
                                        - select_feature
                                        - encode_categorical

Building and applying gensim            - topic_id
                                        - topic_words
                                        - lda_model

Ranking the documents                   - scoring_documents
                                        - ranking_algorithms
                                        - learning_to_rank models
                                        - result_presentation

Computing the evaluation parameters     - Precision
                                        - Accuracy
                                        - F1score

22
4.2 Module Description:
4.2.1 Data Set Collection:

Data collection is the process of gathering and measuring information on variables of


interest, in an established systematic fashion that enables one to answer stated research
questions, test hypotheses, and evaluate outcomes. The collected data is pre-processed
before giving it to the classification algorithm, which involves data cleaning and the
removal of the missing data that is needed before moving forward with the
procedure. The dataset is in the form of simple, plain text collected from Wikipedia
and other unstructured documents. The dataset is collected in the form of samples,
where each sample contains one query document and a corpus of other documents.
The samples of the dataset are collected from different domains such as biography,
crime stories, and the history of a flower.
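A minimal sketch of how such a sample could be read into memory is given below; the folder name and the .txt file layout are assumptions for illustration only.

import os

def load_documents(folder="dataset"):
    # Read every plain-text file in the folder into a list of document strings.
    documents = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                documents.append(f.read())
    return documents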

4.2.2 Data Pre-Processing:


Data processing of text refers to the process of cleaning, preprocessing, and
transforming raw text data to make it suitable for analysis and modeling. The data
processing of text consists of five crucial steps:
 Word tokenization
 Removal of punctuation
 Word folding
 Stop word removal
 Word stemming
Word tokenization:

Tokenization is a fundamental step in Natural Language Processing (NLP) that involves


breaking down a text into smaller units, known as tokens. A token is a sequence of
characters that represents a single element of meaning in the text. These tokens could be
individual words, phrases, or even individual characters. Tokenization is a crucial step
in NLP as it is the foundation for most downstream tasks such as sentiment analysis,
text classification, and language modeling. The primary goal of tokenization is to split
text into smaller units that can be more easily analyzed and processed by a computer.
Without tokenization, it would be challenging to process large volumes of text data
effectively. Tokenization helps to reduce the complexity of text data, making it easier
for machines to analyze. In the samples of the dataset, the text is broken down into
smaller units.
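A minimal tokenization sketch using NLTK is shown below; the sample sentence is illustrative and assumes the punkt resource has been downloaded.

from nltk.tokenize import word_tokenize   # requires nltk.download("punkt")

text = "Document ranking retrieves relevant documents quickly."
tokens = word_tokenize(text)
print(tokens)
# ['Document', 'ranking', 'retrieves', 'relevant', 'documents', 'quickly', '.']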

Removal of Punctuation:

Removal of punctuation follows the tokenization of plain text. Punctuation removal


is an essential step in preprocessing of text, especially in information retrieval.
Punctuation refers to characters that are used to enhance the readability and convey
meaning in written language, such as commas, periods, question marks, and exclamation
marks. However, when dealing with text data, these symbols are often irrelevant or even
detrimental to the analysis process. Removing punctuation from text data is a simple but
powerful technique that can improve the performance of NLP models and increase the
accuracy of results. The primary reason for removing punctuation is to simplify the text
and make it easier for the model to process. Punctuation can disrupt the continuity of the
text and introduce noise into the analysis process. When a model is trained on a dataset
that includes punctuation, it may not perform well when it encounters text without
punctuation. Removing punctuation ensures that the model can process text consistently,
regardless of the presence or absence of punctuation. Removal of punctuation can also
enhance the accuracy of text analysis tasks. In the samples of the dataset, the tokenized
text is preprocessed with punctuation removal after tokenization is complete. Finally,
the punctuation-free text is passed to the next stage. In conclusion, removing
punctuation from text data is a simple but effective technique to simplify the text and
make it easier to process, and it is used to improve the performance and accuracy of
the model.
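A minimal punctuation-removal sketch is shown below; the token list is illustrative.

import string

tokens = ['Document', 'ranking', ',', 'retrieves', 'relevant', 'documents', '.']
# Strip punctuation characters from each token and drop tokens that become empty.
table = str.maketrans("", "", string.punctuation)
no_punct = [t.translate(table) for t in tokens]
no_punct = [t for t in no_punct if t]
print(no_punct)   # ['Document', 'ranking', 'retrieves', 'relevant', 'documents']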

Word folding:

Word folding follows the removal of punctuation. Word folding is a text


normalization technique that aims to standardize words by converting them to a
common format. It involves collapsing different variations of a word into a single
representation so that they can be easily compared or searched. The process of word
folding typically involves two steps: case folding and accent folding. Case folding is
the process of converting all the characters in a word to either uppercase or lowercase.
This is necessary because uppercase and lowercase letters can represent the same letter,
and word folding aims to standardize the representation of words. Accent folding is the
process of removing diacritical marks or accents from the text. This step is important
because different languages may use different accents, and some languages have
multiple accents for the same letter.
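A minimal word-folding sketch covering both steps is shown below; the sample words are illustrative.

import unicodedata

def fold(word):
    # Case folding followed by accent folding, as described above.
    word = word.casefold()
    word = unicodedata.normalize("NFKD", word)          # separate letters from accents
    return "".join(c for c in word if not unicodedata.combining(c))

print(fold("Résumé"))    # 'resume'
print(fold("DOCUMENT"))  # 'document'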

Stop word removal:

Stop word removal follows word folding. In natural language processing, stop
word removal is a common technique used for text preprocessing. Stop words are words
that are commonly used in a language but do not contribute to the meaning of a
sentence. Examples of stop words in English include "a", "an", "the", "is", "are", "of",
and "in". These words are usually removed from the text during the preprocessing stage
as they don't provide any value for the analysis of the text. The main purpose of stop
word removal is to reduce the size of the dataset and improve the accuracy of the
downstream analysis of text in the model. In this step, the preprocessed text from the
dataset is processed with stop word removal and then passed to the next stage. In
conclusion, stop word removal is a common technique used in natural language
processing for preprocessing text data, useful in reducing the size of the dataset and
improving the accuracy of the model.
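A minimal stop word removal sketch using the NLTK stop word list is shown below; the token list is illustrative and assumes the stopwords resource has been downloaded.

from nltk.corpus import stopwords   # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
tokens = ['the', 'ranking', 'of', 'the', 'documents', 'is', 'based', 'on', 'similarity']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['ranking', 'documents', 'based', 'similarity']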

Word Stemming:

Finally, the crucial step in the preprocessing is stemming, which follows stop word
removal. Stemming is a common technique used in Natural Language Processing
(NLP) for text pre-processing. It is the process of reducing a word to its base or root
form, called a stem or lemma. This is done by removing the suffixes and prefixes from
the word, which results in the stem being derived. Stemming is useful because it helps
to reduce the dimensionality of the text data, making it easier to analyze and process. It
also helps to normalize the text, allowing similar words to be treated as the same, which
can improve the accuracy of the model and helps in information retrieval.

For example, consider the word "running". The Porter Stemming Algorithm would
apply the following rules: remove the suffix "ing" to get "runn", then reduce the double
consonant "nn" to get "run". In the samples of the dataset, the preprocessed data is
passed to the stemming stage, which returns the root of each word in the sample. In
conclusion, stemming is a valuable technique in NLP that helps to normalize text and
reduce dimensionality. However, it is important to use it judiciously and to combine it
with other techniques to achieve the best results in text processing.
Overall, text data processing is a crucial step in natural language processing that
involves the conversion of raw textual data into a structured format that can be used for
analysis and modelling. This process involves several techniques such as tokenization,
stop word removal, stemming/lemmatization, and part-of-speech tagging, which can be
used individually or in combination depending on the specific requirements of the
analysis.
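A minimal stemming sketch using NLTK's Porter stemmer is shown below; the sample words are illustrative.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "eating", "documents", "ranked"]:
    print(word, "->", stemmer.stem(word))
# running -> run, eating -> eat, documents -> document, ranked -> rank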

4.2.3 Building and applying gensim:

Initially, the build step imports the Gensim library and the natural language processing
libraries. The corpora, models, and similarities modules are imported from the
gensim library. These modules are used to create a dictionary of words from the
corpus, train a TF-IDF model, and calculate document similarities. Several sample
documents are read from text files and stored in a list called documents. The first
document is stored in doc1, the second document is stored in doc2, and so on. The
documents are collected from different domains from Wikipedia. The text of each
document is split into words and stored in another list called text_corpus. This creates a
list of lists, where each inner list contains the words of a single document. A Dictionary
object is created using the text_corpus list. This creates a mapping between words and
unique integer IDs. A corpus object is created by converting each document in
text_corpus to a bag-of-words representation using the doc2bow method of the
Dictionary object. This creates a list of sparse vectors, where each vector represents the
frequency of each word in a single document. A TfidfModel object is trained on the
corpus object, which creates a TF-IDF representation of each document in the corpus.
This assigns a weight to each word in each document based on how important it is to the
document relative to the other documents in the corpus. A MatrixSimilarity object is
created from the TF-IDF corpus. This object allows us to calculate the similarity
between any two documents in the corpus. A query document is defined as the first
document in documents. The text of the query document is converted to a bag-of-words
representation using the same Dictionary object that was used to create the corpus. The
similarity between the query document and each document in the corpus is calculated
using the MatrixSimilarity object. This produces a list of similarity scores, where each
score represents the similarity between the query document and a single document in
the corpus. The similarity scores are sorted in descending order, and the document ID
and similarity score for each document in the corpus are printed to the console.
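The sketch below pulls the steps described above into one pipeline: building the dictionary and bag-of-words corpus, training the TF-IDF model, indexing the corpus with MatrixSimilarity and ranking the documents against a query. The token lists and the query are illustrative, not the actual dataset.

from gensim import corpora, models, similarities

text_corpus = [["nile", "longest", "river", "africa"],
               ["rose", "flower", "history", "garden"],
               ["river", "flows", "through", "africa"]]

dictionary = corpora.Dictionary(text_corpus)
corpus = [dictionary.doc2bow(doc) for doc in text_corpus]

tfidf = models.TfidfModel(corpus)                                    # TF-IDF weights per document
index = similarities.MatrixSimilarity(tfidf[corpus],
                                      num_features=len(dictionary)) # similarity index over the corpus

query = ["longest", "river", "in", "africa"]                         # illustrative user query
sims = index[tfidf[dictionary.doc2bow(query)]]                       # similarity to every document

# Rank the documents by similarity score, highest first.
for doc_id, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(doc_id, round(float(score), 3))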

4.2.4 Ranking the document:

Ranking documents based on user requirements is an essential task in information


retrieval. The dataset consists of plain text, which is preprocessed; the preprocessed,
unstructured documents are split and vectorized using the TF-IDF model. The distance
between the query document and each document in the corpus is computed, the
similarity score with respect to the user's requirement is calculated, and the corpus
documents are ranked based on the similarity score.

4.2.5 Computing the evaluation parameters:

The accuracy of the proposed Gensim model is compared with other similar methods. The
accuracy of the Gensim model is higher than that of other traditional models.
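Since precision, accuracy and F1-score are listed as evaluation parameters, a simple set-based sketch of how such scores can be computed for a ranked result list is given below; the document ids are illustrative and the exact evaluation protocol is an assumption.

def precision_recall_f1(retrieved, relevant):
    # Set-based precision, recall and F1 at a chosen ranking cut-off.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(retrieved=[2, 0, 5], relevant=[0, 2, 3]))   # approx. (0.67, 0.67, 0.67)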

27
4.3 Algorithms:
4.3.1 Natural Language Processing:
Natural language processing (NLP) is confined to the stream of computer science and
artificial intelligence. Natural language processing mainly deals with the interaction
between computers and human language and focuses on how computers process and
analyze large amounts of data. NLP techniques help computers to understand the
context in a document. Natural Language Processing (NLP) plays a significant role in
document ranking, especially in information retrieval systems such as search engines.
Document ranking refers to the process of determining the relevance of documents to a
user's query and presenting them in a ranked order.

Figure 4.3.1 NLP

28
4.3.2 Gensim:
Gensim is a popular open-source natural language processing library used for
unsupervised topic modelling that specializes in creating and manipulating vector
space models of natural language data. Vector space models represent text documents as
high-dimensional vectors, which can be analysed using various mathematical operations
to discover patterns, similarities, and relationships between them. Gensim provides a
suite of tools for building, training, and using vector space models, with a focus on
scalability, performance, and ease of use. One of the main features of Gensim is its
support for multiple text corpus formats, including plain text, CSV, and preprocessed
corpus formats such as MMCorpus and LDA-C. Gensim provides a flexible and efficient
way to preprocess text data, which involves tokenizing, stemming, stop-word removal,
and other tasks that are necessary to convert raw text into a form that can be used to
build vector space models. Preprocessing is typically done using Gensim's built-in
functions or custom pipelines, which can be configured to meet the specific needs of the
user. Gensim supports several popular vector space models, including bag-of-words,
TF-IDF, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), and
word2vec. These models differ in their underlying assumptions and mathematical
techniques, but they all share the goal of representing text documents as vectors in a
high-dimensional space. For example, the bag-of-words model represents each
document as a vector of term frequencies, where each term corresponds to a dimension
in the vector space.

Figure 4.3.2 Gensim

29
4.3.3 Term Frequency-Inverse Document Frequency:

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that


reflects how important a word is to a document in a collection or corpus of documents.
It is widely used in natural language processing and information retrieval for tasks
such as text classification, document clustering, and search engine ranking. The TF-
IDF statistic is calculated based on the frequency of each word in a document and the
inverse frequency of the word across all documents in the corpus. This calculation
results in a weight for each word that reflects its relative importance in the document.
The term frequency (TF) of a word in a document is simply the number of times the
word appears in the document. It is usually normalized by dividing it by the total
number of words in the document to account for differences in document length. The
TF value is high for words that appear frequently in a document and low for words
that appear infrequently. The inverse document frequency (IDF) of a word is a
measure of how common or rare the word is across all documents in the corpus. It is
calculated by taking the logarithm of the total number of documents in the corpus
divided by the number of documents in the corpus that contain the word. The IDF
value is high for words that appear in few documents and low for words that appear in
many documents.
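One common formulation consistent with the description above is sketched below; Gensim's default TfidfModel uses a slightly different (log-scaled) weighting, so the exact numbers are illustrative.

import math

def tf_idf(term, doc_tokens, all_docs):
    # Normalized term frequency times log(N / document frequency).
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

docs = [["river", "africa", "river"], ["flower", "garden"], ["river", "history"]]
print(round(tf_idf("river", docs[0], docs), 3))   # 0.27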

Figure 4.3.3 TF-IDF

30
4.3.4 Bag of words:
The doc2bow method is a function provided by the Gensim library for converting a
document (list of words) into a bag-of-words format. Bag-of-words (BOW) is a
commonly used representation of text in natural language processing. In BOW, a
document is represented as a sparse vector of word frequencies, where each dimension
corresponds to a unique word in the vocabulary. The doc2bow method takes a list of
tokens as input and returns a list of tuples. Each tuple represents a word in the document
and its frequency count. The first element of the tuple is the word's index in the
vocabulary, and the second element is the word's frequency count in the
document. Overall, the doc2bow method is an important tool for text processing in
natural language processing. It provides a simple and efficient way to convert
documents to a bag-of-words format, which can be used in various downstream tasks
such as topic modeling, clustering, and classification.
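A minimal doc2bow sketch is shown below; the token lists are illustrative and the printed ids depend on the dictionary that is built.

from gensim import corpora

texts = [["ranking", "documents", "by", "similarity"],
         ["similarity", "of", "documents"]]
dictionary = corpora.Dictionary(texts)

tokens = ["documents", "similarity", "documents"]
print(dictionary.doc2bow(tokens))
# e.g. [(1, 2), (3, 1)]: word id 1 ("documents") occurs twice, word id 3 ("similarity") once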

Figure 4.3.4 Bag of words

31
4.4 Testing:

Software testing techniques are systematic methods for designing and executing tests that evaluate a software application. In this project, testing involves rigorous unit testing to validate the functionality of individual modules, comprehensive integration testing to ensure seamless interaction between components, and manual system testing to assess overall performance, usability, and accessibility.

4.4.1 Testing Methods:

4.4.1.1 Unit Testing:
Unit testing involves testing individual components or functions of the document ranking system to ensure they behave as expected. The system is first broken down into smaller units, such as the functions or modules responsible for tokenization, term weighting, similarity calculation, and the ranking algorithm. Test cases are then created for each unit to cover a range of scenarios, including edge cases and typical use cases; for document ranking, these might include different types of queries, various document structures, and scenarios where certain components might fail. Each unit test is kept isolated from external dependencies by mocking or stubbing external services or modules, so that the source of any failure can be pinpointed. Typical checks verify that the tokenization unit correctly splits documents of various types into tokens, that the term weighting unit assigns appropriate weights for different term frequencies and document lengths, and that the similarity calculation unit accurately computes the similarity between a query and a document against predefined cases with known results.
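A minimal unit-test sketch using Python's unittest framework is shown below; it exercises the tokenization step in isolation, using Gensim's simple_preprocess as the tokenizer, with made-up input strings.

import unittest
from gensim.utils import simple_preprocess

class TestTokenizationUnit(unittest.TestCase):
    def test_tokenization_lowercases_and_drops_punctuation(self):
        tokens = simple_preprocess("Dhoni, the captain!")
        self.assertEqual(tokens, ["dhoni", "the", "captain"])

    def test_empty_document_yields_no_tokens(self):
        self.assertEqual(simple_preprocess(""), [])

if __name__ == "__main__":
    unittest.main()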
4.4.1.2 Integration Testing:

Integration testing plays a crucial role in ensuring seamless interaction between the various components of the system. The key integration points of the document ranking system are first identified; these include the hand-offs between document tokenization, term weighting, similarity calculation, and the ranking algorithm. The flow of data between components is then verified to ensure that data is passed correctly from one module to another and that each transformation is applied as intended. Mock objects or stubs are used to simulate external dependencies, such as databases or external APIs, so that the input can be controlled and the focus stays on the interactions between internal components. Tests confirm, for example, that the term weights produced during term weighting are correctly used in the similarity calculation and that the similarity results are then appropriately considered by the ranking algorithm. If the document ranking system interacts with external systems (e.g., a search engine platform, database, or caching system), additional tests cover data retrieval, updates, and error handling to ensure smooth integration, as sketched below.
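The following sketch, again using unittest, is one way to exercise several components together; it builds a tiny in-memory corpus (the sentences are invented for illustration) and checks that the tokenization, TF-IDF weighting, and similarity index interact correctly by asserting that the source document ranks highest against itself.

import unittest
from gensim import corpora, models, similarities

class TestRankingPipelineIntegration(unittest.TestCase):
    def test_source_document_ranks_highest_against_itself(self):
        docs = ["dhoni is the captain of india",
                "india won the world cup under dhoni",
                "the weather is sunny today"]
        texts = [d.split() for d in docs]                      # tokenization step
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words step
        tfidf = models.TfidfModel(corpus)                      # term weighting step
        index = similarities.MatrixSimilarity(tfidf[corpus])   # similarity index

        query_vec = dictionary.doc2bow(docs[0].split())
        sims = index[tfidf[query_vec]]
        # The query document itself should be the top-ranked result
        self.assertEqual(int(sims.argmax()), 0)

if __name__ == "__main__":
    unittest.main()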
4.4.1.3 System Testing:

System testing plays a crucial role in evaluating the entire document ranking system as a whole, to ensure that it meets the specified requirements and functions correctly in a real-world environment. Test scenarios are identified and defined to represent both typical and edge use cases, covering a range of queries, document types, and user interactions. End-to-end testing then simulates the entire user journey, from submitting a query to receiving and displaying the ranked document results, to ensure that the system behaves as expected at every step.
CHAPTER 5

RESULTS & DISCUSSION
Document ranking based on similarity uses NLP techniques together with machine learning. The model applies NLP preprocessing, which includes word tokenization, removal of punctuation, stop word removal, and word stemming. The dataset contains documents from different fields, grouped into categories such as biography, news, and hand-written chapters; each category contains one source document and several target documents. The input to the proposed Gensim model is the set of preprocessed documents, and the output is the documents ranked by their similarity to the source document. The similarity score was calculated both before and after stemming: the scores measured after stemming are more accurate than those obtained without stemming, and the accuracy of the proposed method is higher than that of other traditional methods. A sample output of the proposed method is shown below.

SAMPLE

• Initially, the dataset is collected from the sources; the documents contain information about the biography of Dhoni.
• The plain-text dataset is preprocessed with the following steps: word tokenization, removal of punctuation, word folding, stop word removal, and word stemming.
• The preprocessed documents are passed to the proposed method and the ranked documents are obtained.
5.1 Similarity Scores Before and After Stemming

5.2 Difference Between Similarity Scores
CHAPTER 6

CONCLUSION
This project, “Document Ranking Based on Similarity using Natural Language Processing Technique”, ranks documents by their similarity score with respect to the source document, which plays a crucial role in information retrieval. In the era of the digital world, digital information has been growing widely and is estimated to double every five years. Manual access to this data is a difficult and time-consuming process, and traditional methods for accessing documents have not been accurate. Preprocessing of text plays an important role in NLP models, yet most existing methods did not focus on it. The proposed Gensim model processes the text with five methods: tokenization, removal of punctuation, word folding, stop word removal, and word stemming. Word stemming in particular plays an important role in measuring the similarity score. Existing models focused only on finding the similarity of documents and grouping them into clusters, whereas the proposed method ranks the documents by similarity score with respect to the user query. The proposed model is more accurate than other traditional models, with an accuracy of 1 after stemming compared with 0.86 before stemming. The proposed model supports information retrieval applications such as web and search engines, the entertainment and news industries, and many more. The work can be further improved by considering homonym ambiguity in the documents, since handling homonyms can also improve the similarity measure of documents.
REFERENCES

[1] Benzi Xu et al. (2021) proposed a method based on the pseudo-longest-common-subsequence (pseudo-LCS), the Jaccard similarity coefficient and principal component analysis (PCA), Volume 132, 2022.

[2] Qian Liu et al. (2021) proposed the use of association rules for measuring word similarity at a global level and fuzzy similarity to measure the top-k words, in IEEE Access, vol. 9, pp. 126801-126821, 2021.

[3] N. Kumar, S. K. Yadav and D. S. Yadav, "Similarity Measure Approaches Applied in Text Document Clustering for Information Retrieval," 2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC), 2020, pp. 88-921.

[4] S. Zhang et al. (2019) proposed an extended citation model for scientific document clustering, in IEEE Access, vol. 7, pp. 57037-57046, 2019.

[5] M. P. Mahalakshmi and N. S. Fatima, "Maximum Entropy Principle based Document Ranking with Term Selection Analysis for Cross-Lingual Information Retrieval," 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), 2021, pp. 1015-1019.

[6] Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria indexing and ranking model for documents and web pages on large scale data, Journal of King Saud University – Computer and Information Sciences, 2021.

[7] Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan Mahmood, Twitter trends: A ranking algorithm analysis on real time data, Expert Systems with Applications, Volume 164, 2021.

[8] Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and Snippet Ranking in Question Answering for Large Document Collections, Department of Informatics, Athens University of Economics and Business, Greece, Institute for Language and Speech Processing Research Center ‘Athena’, Greece, 2021.

[9] Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw, Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review, Journal of Biomedical Informatics, Volume 132, 2022.

[10] P. Zhang, X. Huang, Y. Wang, C. Jiang, S. He and H. Wang, "Semantic Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion," in IEEE Access, vol. 9, pp. 8433-8443, 2021.

[11] Bo Xu, Hongfei Lin, Yuan Lin, Kan Xu, Two-stage supervised ranking for emotion cause extraction, Knowledge-Based Systems, Volume 228, 2021.

[12] M. F. Bashir, H. Arshad, A. R. Javed, N. Kryvinska and S. S. Band, "Subjective Answers Evaluation Using Machine Learning and Natural Language Processing," in IEEE Access, vol. 9, pp. 158972-158983, 2021.

[13] M. AbuSafiya, "Measuring Documents Similarity using Finite State Automata," 2020 2nd International Conference on Mathematics and Information Technology (ICMIT), 2020, pp. 208-211.

[14] V. Kuppili, M. Biswas, D. R. Edla, K. J. R. Prasad and J. S. Suri, "A Mechanics-Based Similarity Measure for Text Classification in Machine Learning Paradigm," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 2, pp. 180-200, April 2020.

[15] M. J. Kim, J. S. Kang and K. Chung, "Word-Embedding-Based Traffic Document Classification Model for Detecting Emerging Risks Using Sentiment Similarity Weight," in IEEE Access, vol. 8, pp. 183983-183994, 2020.

[16] F. Ye, X. Zhao, W. Luo, D. Li and W. Min, "Query-Adaptive Remote Sensing Image Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity," in IEEE Access, vol. 8, pp. 116824-116839, 2020.

[17] Jesus Serrano-Guerrero, Francisco P. Romero, Jose A. Olivas, A relevance and quality-based ranking algorithm applied to evidence-based medicine, Computer Methods and Programs in Biomedicine, Volume 191, 2020.

[18] Yun Li, Yongyao Jiang, Chaowei Yang, Manzhu Yu, Lara Kamal, Edward M. Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Improving search ranking of geospatial data based on deep learning using user behavior data, Computers & Geosciences, Volume 142, 2020.

[19] R. Pelánek, "Measuring Similarity of Educational Items: An Overview," in IEEE Transactions on Learning Technologies, vol. 13, no. 2, pp. 354-366, April-June 2020.

[20] G. Venkanna and D. K. F. Bharati, "Optimal Text Document Clustering Enabled by Weighed Similarity Oriented Jaya With Grey Wolf Optimization Algorithm," in The Computer Journal, vol. 64, no. 1, pp. 960-972, Nov. 2019.

[21] S. Zhang, Y. Xu and W. Zhang, "Clustering Scientific Document Based on an Extended Citation Model," in IEEE Access, vol. 7, pp. 57037-57046, 2019.

[22] R. Dong, Z.-g. Wei, C. Liu and J. Kan, "A Novel Loop Closure Detection Method Using Line Features," in IEEE Access, vol. 7, pp. 111245-111256, 2019.

[23] J. Kim, "A Document Ranking Method with Query-Related Web Context," in IEEE Access, vol. 7, pp. 150168-150174, 2019.

[24] C. Xia, T. He, W. Li, Z. Qin and Z. Zou, "Similarity Analysis of Law Documents Based on Word2vec," 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C), 2019, pp. 354-357.

[25] Y. Ma, P. Zhang and J. Ma, "An Ontology Driven Knowledge Block Summarization Approach for Chinese Judgment Document Classification," in IEEE Access, vol. 6, pp. 71327-71338, 2018.

[26] Q. Mahmood, M. A. Qadir and M. T. Afzal, "Application of COReS to Compute Research Papers Similarity," in IEEE Access, vol. 5, pp. 26124-26134, 2017.

[27] M. Liu, B. Lang, Z. Gu and A. Zeeshan, "Measuring similarity of academic articles with semantic profile and joint word embedding," in Tsinghua Science and Technology, vol. 22, no. 6, pp. 619-632, December 2017.

[28] Olga Vechtomova, Murat Karamuftuoglu, Lexical cohesion and term proximity in document ranking, Information Processing & Management, Volume 44, Issue 4, 2008.

[29] Czesław Daniłowicz, Jarosław Baliński, Document ranking based upon Markov chains, Information Processing & Management, Volume 37, Issue 4, 2001.

[30] H. Shen, L. Xue, H. Wang, L. Zhang and J. Zhang, "B+-Tree Based Multi-Keyword Ranked Similarity Search Scheme Over Encrypted Cloud Data," in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
APPENDIX I - SOURCE CODE

1. Preprocessing

import spacy
from nltk.stem.porter import PorterStemmer
from gensim.utils import simple_preprocess

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Read the text to preprocess (the input file path was left blank in the original listing)
# text = "This is an example sentence. It contains multiple words."
with open("", "r") as f:
    text = f.read()
print(text)

# Tokenize the text using spaCy and print each token with its part of speech
doc = nlp(text)
for token in doc:
    print(f"Token: {token.text} POS: {token.pos_}")

# Remove punctuation tokens from the document
doc_without_punct = [token.text for token in doc if not token.is_punct]

# Join the remaining tokens back into a single string and print the result
text_without_punct = " ".join(doc_without_punct)
print(text_without_punct)

# Word folding: lemmatize each token and rebuild the text
doc1 = nlp(text_without_punct)
lemmatized_tokens = [token.lemma_ for token in doc1]
lemmatized_text = " ".join(lemmatized_tokens)
print(lemmatized_text)

# Remove the stop words from the lemmatized text
doc2 = nlp(lemmatized_text)
tokens_without_stopwords = [token.text for token in doc2 if not token.is_stop]
text_without_stopwords = " ".join(tokens_without_stopwords)
print(text_without_stopwords)

# Tokenize the text using Gensim's simple_preprocess function
tokens = simple_preprocess(text_without_stopwords)

# Stem each token with the Porter stemmer and rebuild the text
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
stemmed_text = " ".join(stemmed_tokens)
print(stemmed_text)
2. Gensim

from gensim import corpora, models, similarities

# Read the source document and the target documents
with open("/content/pf-source.txt", "r") as f:
    doc1 = f.read()
with open("/content/pf-2.txt", "r") as f:
    doc2 = f.read()
with open("/content/pf-3.txt", "r") as f:
    doc3 = f.read()
with open("/content/pf-4.txt", "r") as f:
    doc4 = f.read()
with open("/content/pf-5.txt", "r") as f:
    doc5 = f.read()
with open("/content/ps-6.txt", "r") as f:
    doc6 = f.read()

# Create a corpus of documents
documents = [doc1, doc2, doc3, doc4, doc5, doc6]
text_corpus = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(text_corpus)
corpus = [dictionary.doc2bow(text) for text in text_corpus]

# Train a TF-IDF model on the corpus
tfidf_model = models.TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]

# Create a similarity index for the TF-IDF corpus
similarity_index = similarities.MatrixSimilarity(tfidf_corpus)

# Define a sample query (here the source document itself is used as the query)
with open("/content/pf-source.txt", "r") as f:
    query = f.read()

# Convert the query to a bag-of-words vector using the corpus dictionary
query_vec = dictionary.doc2bow(query.lower().split())

# Calculate the similarity between the query vector and each document in the corpus
sims = similarity_index[tfidf_model[query_vec]]

# Sort the documents by similarity and print the ranked results
result_docs = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, sim_score in result_docs:
    print(f"Document {doc_id}: (Similarity score: {sim_score:.3f})")
3. Accuracy

# All six documents in the corpus are assumed to be relevant
relevant_docs = [doc1, doc2, doc3, doc4, doc5, doc6]
num_relevant_docs = len(relevant_docs)
num_correct = 0
for doc_id, sim_score in result_docs:
    if documents[doc_id] in relevant_docs:
        num_correct += 1
accuracy = num_correct / num_relevant_docs
print(f"Accuracy: {accuracy:.2f}")
APPENDIX II - SCREENSHOTS

Figure 01 Original Text

Figure 02 Tokenization

Figure 03 Removal of punctuation

Figure 04 Word folding

Figure 05 Stop word removal

Figure 06 Word Stemming

Figure 07 Document ranking based on the similarity score before stemming

Figure 08 Document ranking based on the similarity score after stemming