ABSTRACT
Accurately measuring the similarity between texts is crucial for numerous natural language processing tasks, from plagiarism detection to information retrieval. This paper delves into various approaches to calculating text similarity, exploring their strengths and limitations. We begin by analyzing character-based methods, including the Jaro and N-gram algorithms, which are suitable for detecting typos and minor edits. Semantic and corpus-based approaches are then addressed, offering deeper insights into meaning and context. This includes techniques such as the Dice coefficient, Euclidean distance, and cosine distance, which compare texts based on vector representations and set intersections. Finally, we introduce the statistically robust Lin-Wong similarity measure, which quantifies the commonality between probability distributions of words, providing a powerful tool for capturing semantic similarity. By comparing and contrasting these diverse methods, we highlight the importance of choosing the right measure for the specific task and dataset. Moving forward, the paper identifies promising avenues for future research, suggesting the potential of knowledge graphs and deep learning techniques to further refine and advance the field of text similarity measurement. This comprehensive exploration equips researchers and practitioners with valuable knowledge and insights for analyzing and comparing textual data.

Keywords: Lin-Wong Divergence, Similarity Measure, Editing Distance, Text Mining, Similarity Algorithm, Distance Measure.
1. INTRODUCTION
Researchers are interested in creating systems that can measure the similarity between phrases. Such systems rely on statistical calculations, word-level similarity measures, or machine learning. Similarity measures are used in natural language processing, search query correction, and plagiarism detection. Humans are still better at judging similarity than computers, because they draw on background knowledge and perception. Common techniques include exploiting synonyms, similar expressions, or patterns in the text. These systems can judge the similarity between sentences, and their performance is considered better the more closely it aligns with human judgment.
This paper discusses different ways of calculating similarity between two documents in text processing. The criteria
include string-based, semantic, and corpus-based methods. The research focuses on using string-based criteria and
algorithms like fingerprinting and hashing, as well as creating new criteria. A fuzzy inference system is used to evaluate
similarity, generating a measure from zero to one, where a higher value means more similarity. The system considers
small, medium, and large scale texts and uses various algorithms to calculate similarity based on the input scale. The
similarity is determined by overlapping, global reordering, and local reordering criteria [1]. Content similarity can be
found between various types of documents, such as audio, video, and text files. However, this research focuses specifically
on text similarity, which can be categorized as low, medium, or high. There are several applications of calculating content
similarity, including correcting typographical and writing errors, detecting plagiarism and copyright infringement,
identifying similarities in software codes, extracting similar answers in interactive question and answer systems,
classifying documents, and finding phrases in search engines [2].
As an example of a string-based measure, consider the editing distance. Given two strings S1 = "vldb" and S2 = "pvldb", the editing distance (ED) between S1 and S2 is 1: the first string can be transformed into the second by inserting the single character "p" at its beginning. Two strings are considered similar if their editing distance does not exceed a given threshold, denoted by t. In the following, we examine the details of these algorithms.
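To make this concrete, here is a minimal dynamic-programming sketch of the editing distance; the function name and the threshold check are ours, for illustration:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein editing distance: the minimum number of insertions,
    deletions, and substitutions needed to turn s1 into s2."""
    m, n = len(s1), len(s2)
    # prev[j] holds the distance between s1[:i-1] and s2[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# The example from the text: ED("vldb", "pvldb") = 1
t = 1  # similarity threshold
assert edit_distance("vldb", "pvldb") <= t
```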
The Longest Common Subsequence (LCS) algorithm finds the longest subsequence common to a set of sequences, typically two. It is a classic problem in computer science and underlies file comparison programs that show the differences between files. The goal is to compare two strings and quantify their similarity. The longest common subsequence is a sequence of letters that appears in both strings in the same order, though not necessarily consecutively. Dynamic programming can be used to find it, as sketched below.
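A standard dynamic-programming sketch (the function name is ours, for illustration):

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = LCS length of the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("vldb", "pvldb"))  # 4 ("vldb" appears in both, in order)
```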
The Jaro similarity between two strings S1 and S2 is defined as:

$$ d_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|S_1|} + \dfrac{m}{|S_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases} \quad (1) $$
Where:
* m = number of matched characters between the strings
* |S1| and |S2| = lengths of strings S1 and S2
* l = length of the common prefix, up to a maximum of 4 characters
* t = half the number of transpositions (swapped characters)
$$ \text{Jaro Similarity Coefficient} = \frac{m + t}{1 + \frac{1}{2}\left(|S_1| + |S_2| - m\right)} \quad (2) $$
The Jaro-Winkler algorithm improves upon the original Jaro algorithm by incorporating an additional term that rewards strings sharing a common prefix. The Jaro-Winkler similarity coefficient is calculated using the following formula:

$$ sim_w = d_j + l \cdot p \cdot (1 - d_j) \quad (3) $$

Where:
l is the length of the common prefix (at most 4 characters, as defined above) and p is a constant scaling factor, commonly set to 0.1.
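As a concrete illustration of equations (1) and (3), here is a minimal sketch of both measures. The function names are ours, and p = 0.1 is the scaling factor commonly used in the literature rather than a value fixed by this paper:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity following Eq. (1)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order.
    k = t = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler, Eq. (3): boost the Jaro score by the shared prefix."""
    sim = jaro(s1, s2)
    l = 0  # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim + l * p * (1 - sim)

print(round(jaro("vldb", "pvldb"), 3))  # ~0.933
```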
N-gram algorithms estimate the probability of a word or character appearing given the previous N-1 words or characters.
General formula for conditional probability:
$$ P(w_n \mid w_1, w_2, \ldots, w_{n-1}) = \frac{\mathrm{count}(w_1, w_2, \ldots, w_n)}{\mathrm{count}(w_1, w_2, \ldots, w_{n-1})} \quad (4) $$
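To illustrate equation (4), a minimal maximum-likelihood sketch for word n-grams follows; the toy corpus and function names are ours:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, history, word):
    """P(word | history) = count(history + word) / count(history).
    Recounts on every call; fine for a sketch, not for production."""
    n = len(history) + 1
    num = ngram_counts(tokens, n)[tuple(history) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(history)]
    return num / den if den else 0.0

corpus = "the cat sat on the mat the cat ran".split()
print(conditional_prob(corpus, ["the"], "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
```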
Euclidean distance, named after the ancient Greek mathematician Euclid, is a measure of the straight-line or shortest
distance between two points in a Euclidean space. Euclidean space is a mathematical space that consists of multiple
dimensions, such as two-dimensional (2D) or three-dimensional (3D) space. This concept is widely used in various fields,
including geometry, physics, and data science [7].
In a 2D space, the Euclidean distance between two points (x1, y1) and (x2, y2) can be calculated using the following
formula:
$$ D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \quad (6) $$
This formula helps to determine the shortest path or the straight-line distance between two points. In a 3D space, the
formula becomes:
$$ D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \quad (7) $$
In general, for n dimensions:
$$ D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (8) $$

Where x = (x_1, ..., x_n) and y = (y_1, ..., y_n) are the coordinates of the two points in n-dimensional space.
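Applied to text, one common approach, and our illustrative choice here rather than one prescribed above, is to treat each distinct word as a dimension and evaluate equation (8) over term-frequency vectors:

```python
import math
from collections import Counter

def euclidean_distance(text1: str, text2: str) -> float:
    """Euclidean distance between the term-frequency vectors of two
    texts, with one dimension per distinct word in either text."""
    f1 = Counter(text1.lower().split())
    f2 = Counter(text2.lower().split())
    vocab = set(f1) | set(f2)
    return math.sqrt(sum((f1[w] - f2[w]) ** 2 for w in vocab))

print(euclidean_distance("the cat sat", "the cat ran"))  # sqrt(2) ~ 1.414
```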
The LW similarity measure is a statistical measure derived from the divergence between two probability distributions. It provides a way to compare how similar or dissimilar two sets of data are, based on their respective probabilities. In essence, the LW similarity measure quantifies the similarity by considering the commonality between the two sets.
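For reference, the directed divergence introduced by Lin and Wong [10], on which the measure is built, can be written as

$$ D_{LW}(P \parallel Q) = \sum_{i} P_i \log \frac{2P_i}{P_i + Q_i} $$

where P_i and Q_i are the probabilities of occurrence of event i in the two probability distributions.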
To compute this measure, one first estimates the probability of each element in both sets and, from it, the information content of each element. The information content represents how much information an element carries within its set.
Therefore, a higher LW divergence value indicates greater dissimilarity between the two distributions, while a lower value suggests higher similarity. Conversely, a higher LW similarity value indicates greater similarity between the two distributions.
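The sketch below shows how the LW measure might be computed over the word distributions of two texts. The divergence follows the definition in [10]; normalizing the similarity to [0, 1] via 1 - D/log 2 (valid because the divergence is bounded above by log 2) is our illustrative choice, not a definition from this paper:

```python
import math
from collections import Counter

def word_distribution(text: str) -> dict:
    """Relative frequency of each word in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def lw_divergence(p: dict, q: dict) -> float:
    """Lin-Wong directed divergence: sum_i p_i * log(2*p_i / (p_i + q_i)).
    Words absent from p contribute nothing to the sum."""
    return sum(pi * math.log(2 * pi / (pi + q.get(w, 0.0)))
               for w, pi in p.items())

def lw_similarity(text1: str, text2: str) -> float:
    """Illustrative similarity: 1 - D/log(2). Identical texts give 1.0;
    texts with disjoint vocabularies give 0.0."""
    d = lw_divergence(word_distribution(text1), word_distribution(text2))
    return 1.0 - d / math.log(2.0)

print(lw_similarity("the cat sat on the mat",
                    "the cat sat on the rug"))  # ~0.833
```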
5. CONCLUSIONS
This paper has provided an exploration of different approaches to calculating text similarity. We discussed various character-based measures, such as the Jaro and N-gram algorithms, as well as the alternative perspectives offered by the Dice coefficient and Euclidean distance. Additionally, cosine distance and Jaccard similarity were introduced as measures that compare texts based on vector
representations and set intersections, revealing conceptual relationships. The paper also introduces the Lin-Wong Similarity measure,
which quantifies the commonality between probability distributions of words in texts and is useful for assessing semantic similarity.
However, the choice of the best measure depends on the specific task and dataset, and further research is needed to develop even more
sophisticated methods for capturing the complexities of textual meaning. We emphasize the practical applications of these methods in
plagiarism detection, information retrieval, and machine translation. We also suggest future research directions, such as incorporating
knowledge graphs or deep learning techniques, to advance text similarity measurement. Ultimately, the significance of text similarity
is connected to the broader field of natural language processing and its goal of achieving true human-like understanding of language.
6. REFERENCES
[1] Khoshnavataher, K., et al., Developing monolingual Persian corpus for extrinsic plagiarism detection using artificial
obfuscation. Notebook for PAN at CLEF, 2015.
[2] Singh, S., Statistical Measure to Compute the Similarity between Answers in Online Question-Answering Portal. International
Journal of Computer Applications, 2014. 103.
[3] Gomaa, W.H. and A.A. Fahmy, A survey of text similarity approaches. International Journal of Computer Applications, 2013.
pp. 13-18.
[4] Agbehadji, I. E., Yang, H., Fong, S., & Millham, R. (2018, August). The comparative analysis of Smith-
Waterman algorithm with Jaro-Winkler algorithm for the detection of duplicate health related records. In 2018
International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD)
(pp. 1-10). IEEE.
[5] Khoirunnisa, F., Yusliani, M. T., & Rodiah, M. T. (2020). Effect of N-Gram on Document Classification on the Naïve Bayes
Classifier Algorithm. Sriwijaya Journal of Informatics and Applications, 1(1).
http://sjia.ejournal.unsri.ac.id/index.php/sjia/article/view/13.
[6] Anuar, N., & Md Sultan, A. B. (2010). Validate conference paper using dice coefficient.
http://psasir.upm.edu.my/id/eprint/17570/
[7] Danielsson, P.-E., Euclidean distance mapping. Computer Graphics and image processing, 1980. 14(3): pp. 227-248.
[8] Sidorov, G., Gelbukh, A., Gómez-Adorno, H., & Pinto, D. (2014). Soft similarity and soft cosine measure: Similarity of
features in vector space model. Computación y Sistemas, 18(3), 491-504.
[9] Lo, G. S., & Dembele, S. (2015). Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to
Gospels comparison. arXiv preprint arXiv:1508.03772.
[10] Lin, J., Wong, S. K. M. (1990). A new directed divergence measure and its characterization. International Journal of General
Systems, 17(1), 73-81.
[11] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
[12] Pakgohar, A., Habibirad, A., & Yousefzadeh, F. (2019). Lin–Wong divergence and relations on type I
censored data. Communications in Statistics-Theory and Methods, 48(19), 4804-4819.
[13] Pakgohar, A. L. I. R. E. Z. A., Habibirad, A., & Yousefzadeh, F. (2020). Goodness of fit test using Lin-Wong divergence
based on Type-I censored data. Communications in Statistics-Simulation and Computation, 49(9), 2485-2504.
[14] Khalili, M., Habibirad, A., & Yousefzadeh, F. (2022). Testing Exponentiality Based on the Lin-Wong Divergence on the
Residual Lifetime Data. Journal of the Iranian Statistical Society, 18(2), 39-61.
[15] Khalili, M., Habibirad, A., & Yousefzadeh, F. (2018). Some properties of Lin–Wong divergence on
the past lifetime data. Communications in Statistics-Theory and Methods, 47(14), 3464-3476.
[16] Csiszar, I. (1972). A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1-4),
191-213.
[17] Dragomir, S. S. (2003). Bounds for f-divergences under likelihood ratio constraints. Applications of
Mathematics, 48(3), 205-223.