

An Analysis of Text Similarity Measures:
Introducing a Lin-Wong similarity measure
Alireza Pakgohar¹, Mehdi Fazli²
¹ Department of Statistics, Payame Noor University (PNU); [email protected]
² Department of Mathematics, Payame Noor University (PNU); [email protected]

ABSTRACT
Accurately measuring the similarity between texts is crucial for numerous natural language processing tasks, from plagiarism detection to information retrieval. This paper delves into various approaches to calculating text similarity, exploring their strengths and limitations. We begin by analyzing character-based methods, including the Jaro and N-gram algorithms, suitable for detecting typos and minor edits. Semantic and corpus-based approaches are then addressed, offering deeper insights into meaning and context. This includes techniques like the Dice coefficient, Euclidean distance, and Cosine distance, which compare texts based on vector representations and set intersections. Finally, we introduce the statistically robust Lin-Wong Similarity measure, which quantifies the commonality between probability distributions of words, providing a powerful tool for capturing semantic similarity. By comparing and contrasting these diverse methods, we highlight the importance of choosing the right measure for the specific task and dataset. Moving forward, the paper identifies promising avenues for future research, suggesting the potential of knowledge graphs and deep learning techniques to further refine and advance the field of text similarity measurement. This comprehensive exploration equips researchers and practitioners with valuable knowledge and insights for analyzing and comparing textual data.

Keywords: Lin-Wong Divergence, Similarity Measure, Editing Distance, Text Mining, Similarity Algorithm, Distance Measure.

1. INTRODUCTION
Researchers are interested in creating systems that can measure the similarity between phrases. They use statistical
calculations, similarity measures between words, or machine learning. Similarity measures are used in natural language
processing, search query correction, and plagiarism detection. Humans, who draw on background knowledge and perception, are still better at judging similarity than computers. Automatic techniques rely on synonyms, similar expressions, or patterns in the text. Such systems judge the similarity between sentences, and their performance is considered better the more closely it aligns with human judgment.
This paper discusses different ways of calculating similarity between two documents in text processing. The criteria
include string-based, semantic, and corpus-based methods. The research focuses on using string-based criteria and
algorithms like fingerprinting and hashing, as well as creating new criteria. A fuzzy inference system is used to evaluate
similarity, generating a measure from zero to one, where a higher value means more similarity. The system considers
small, medium, and large scale texts and uses various algorithms to calculate similarity based on the input scale. The
similarity is determined by overlapping, global reordering, and local reordering criteria [1]. Content similarity can be
found between various types of documents, such as audio, video, and text files. However, this research focuses specifically
on text similarity, which can be categorized as low, medium, or high. There are several applications of calculating content
similarity, including correcting typographical and writing errors, detecting plagiarism and copyright infringement,
identifying similarities in software codes, extracting similar answers in interactive question and answer systems,
classifying documents, and finding phrases in search engines [2].

2. AN OVERVIEW OF THE METHODS OF CALCULATING THE SIMILARITY OF TEXTS


Calculating similarity between texts is important in text processing research, with various applications such as information
retrieval, document classification, and machine translation. Initially, word similarity is addressed from both lexical and semantic perspectives. Lexical similarity is based on character sequences, while semantic similarity also considers word meanings. Different algorithms are used to evaluate similarity, including string-based, corpus-based, and
knowledge-based approaches. String-based methods focus on character patterns, corpus-based methods use information
from large corpora, and knowledge-based methods use semantic networks to determine semantic similarity of words [3].

3. CHARACTER-BASED SIMILARITY MEASURES


In this section, we examine some algorithms that calculate the similarity between two input phrases or strings based on
their individual characters. These metrics determine the similarity between two strings by adding, deleting, or changing
characters. These algorithms are commonly used to detect typographical errors. A representative character-based metric, the edit distance, evaluates the minimum number of editing operations required to convert one string into the other. The permissible operations for converting one string into another are insertion, deletion, and substitution of single characters.

Two strings are given: S1 = "vldb" and S2 = "pvldb". The editing distance (ED) between S1 and S2 is 1. This means that
you can transform the first string into the second by adding the character "p" to it. The two strings are considered similar
if the editing distance is not greater than a certain threshold, denoted by t. In the following, we will examine the details
of these algorithms.
The Longest Common Subsequence (LCS) algorithm is used to find the longest common subsequence between sets of sequences, typically two. It is a classic problem in computer science and underlies file comparison programs that show differences between files. The goal is to compare two strings and quantify their similarity. The longest common subsequence is a sequence of letters that is present in both strings in the same order, but not necessarily consecutively. Dynamic programming can be used to find the longest common subsequence.
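
To make the two notions concrete, the following short Python sketch (illustrative code, not part of the original paper; function names are our own) computes the edit distance and the longest-common-subsequence length with dynamic programming:

import itertools  # not strictly needed; kept minimal below

def edit_distance(s1, s2):
    # dp[i][j] = minimum number of insertions, deletions and substitutions
    # needed to turn s1[:i] into s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def lcs_length(s1, s2):
    # dp[i][j] = length of the longest common subsequence of s1[:i] and s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(edit_distance("vldb", "pvldb"))  # 1, as in the example above
print(lcs_length("vldb", "pvldb"))     # 4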

3.1. JARO ALGORITHM


The Jaro algorithm is a string metric used in computer science and data analysis to compare the similarity of two strings. It was developed by Matthew Jaro in 1989 and improved by William Winkler in 1990. The algorithm compares the strings based on matching characters, transpositions, and the proportion of matched characters relative to the string lengths. The Jaro similarity coefficient is used to calculate the similarity between the strings. The Jaro-Winkler algorithm, an improved version, additionally rewards strings that share a common prefix. It is widely used in tasks like fuzzy matching, record linkage, and string similarity analysis. It is efficient, accurate, and helps with tasks such as identifying duplicates in large databases, finding similar strings, and analyzing text similarity for tasks like sentiment analysis and text mining. Overall, the Jaro algorithm is a powerful tool for comparing strings and finding similarities in computer science and data analysis [4].

dj = 0 if m = 0; otherwise:

dj = (1/3) × (m / |S1| + m / |S2| + (m − t) / m) (1)

Where:
* m = number of matching characters between the strings
* |S1| and |S2| = lengths of strings S1 and S2
* t = half the number of transpositions (matching characters that appear in a different order in the two strings)

Two characters are counted as matching only if they are identical and their positions in the two strings differ by no more than:

⌊max(|S1|, |S2|) / 2⌋ − 1 (2)

The Jaro-Winkler algorithm improves upon the original Jaro algorithm by incorporating an additional term that rewards strings sharing a common prefix. The Jaro-Winkler similarity coefficient is calculated using the following formula:

Jaro-Winkler Similarity Coefficient = dj + l × p × (1 − dj) (3)

Where:
* l = length of the common prefix of the two strings, up to a maximum of 4 characters
* p = a constant scaling factor, commonly set to 0.1
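
As a rough illustration of Eqs. (1)-(3), here is a compact Python sketch of the Jaro and Jaro-Winkler coefficients (our own simplified implementation; production libraries may handle edge cases differently):

def jaro(s1, s2):
    # Jaro similarity as in Eq. (1): matches found within the window of Eq. (2),
    # penalized by half the number of transpositions.
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # count transpositions among matched characters
    k = 0
    transpositions = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # prefix bonus of Eq. (3); l is the common-prefix length capped at 4
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)

print(round(jaro("vldb", "pvldb"), 3))             # strings from the earlier example
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # classic example, ~0.961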

3.2. N-GRAM ALGORITHM


The N-Gram algorithm is a statistical method used in computational linguistics and data mining to analyze sequences of
words or symbols. It breaks down a text into continuous sequences of a specific length, called n-grams, and calculates
their frequency. This information is then used to understand text structure and patterns, as well as for applications like
language modeling, speech recognition, information retrieval, sentiment analysis, and text mining. N-Gram models have
been widely used for tasks such as predicting the next word in a sentence, recognizing spoken language, searching for
relevant documents, analyzing sentiment, and identifying patterns in large text datasets.
An N-gram is a contiguous sequence of N items from a given text or speech sample [5].
Common N-gram types include:
* Unigrams (N=1): Individual words or characters
* Bigrams (N=2): Pairs of words or characters
* Trigrams (N=3): Sequences of three words or characters
* Higher-order N-grams: Longer sequences
Figure 1. n-Gram algorithm

N-gram algorithms estimate the probability of a word or character appearing given the previous N-1 words or characters.
General formula for conditional probability:

P(wn | w1, w2, ..., w(n-1)) = count(w1, w2, ..., wn) / count(w1, w2, ..., w(n-1)). (4)

Example for bigram probability (N=2):


Bigram probability refers to the likelihood of a specific word occurrence given its preceding word in a sequence. In this
case, we are considering the bigram "the time" and aiming to calculate its probability using counts. To determine P(time
| the), we need to divide the count of occurrences of the bigram "the time" by the count of all occurrences of the word
"the." This allows us to measure how frequently the word "time" follows "the" within our data sample. By calculating
such probabilities, language models gain insights into word associations and can generate more accurate predictions. The
N=2 represents that we are looking at pairs of words or bigrams in this context, ensuring a more contextual understanding
of text rather than just individual words. This methodology divulges valuable information for various natural language
processing tasks like machine translation or text summarization and aids in providing accurate suggestions and meaningful
analyses.
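
A minimal Python sketch of this bigram estimate, assuming a whitespace-tokenized sample and using illustrative function names of our own:

from collections import Counter

def bigram_probability(tokens, w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev), as in Eq. (4) with N = 2
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "the time is the essence of the time".split()
print(bigram_probability(tokens, "the", "time"))  # 2 occurrences of "the time" / 3 of "the"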

3.3. DICE COEFFICIENT


The Dice coefficient is a measure used to assess the similarity between two sets, often used in fields like information
retrieval and machine learning. It is especially useful for comparing binary data, where elements are either present or
absent in the sets. The formula for calculating the Dice coefficient is 2 times the intersection of the sets divided by the
sum of the elements in both sets. The coefficient ranges from 0 to 1, with 0 indicating no similarity and 1 indicating
perfect similarity [6].
The formula for the Dice coefficient is given by:
Dice coefficient = 2×(|X ∩ Y|) / (|X| + |Y|). (5)
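
For instance, a small Python sketch of Eq. (5) applied to word sets (an illustrative choice; the coefficient can equally be computed over character n-grams):

def dice_coefficient(text1, text2):
    # Eq. (5): 2 * |X ∩ Y| / (|X| + |Y|), here on sets of word tokens
    x, y = set(text1.lower().split()), set(text2.lower().split())
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

print(dice_coefficient("text similarity measures", "semantic text similarity"))  # 2*2/(3+3) ≈ 0.67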

3.4. EUCLIDEAN DISTANCE

Euclidean distance, named after the ancient Greek mathematician Euclid, is a measure of the straight-line or shortest
distance between two points in a Euclidean space. Euclidean space is a mathematical space that consists of multiple
dimensions, such as two-dimensional (2D) or three-dimensional (3D) space. This concept is widely used in various fields,
including geometry, physics, and data science [7].
In a 2D space, the Euclidean distance between two points (x₁, y₁) and (x₂, y₂) can be calculated using the following formula:
D = √((x₂ − x₁)² + (y₂ − y₁)²) (6)
This formula determines the shortest path, or straight-line distance, between the two points. In a 3D space, the formula becomes:
D = √((x₂ − x₁)² + (y₂ − y₁)² + (z₂ − z₁)²) (7)
In general, for two points x = (x₁, x₂, ..., xₙ) and y = (y₁, y₂, ..., yₙ) in n dimensions:
D = √(∑ᵢ₌₁ⁿ (xᵢ − yᵢ)²) (8)
Where xᵢ and yᵢ are the coordinates of the two points in n-dimensional space.

Figure 2. Euclidean Distance Measure
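
A brief Python sketch of Eq. (8); in text applications the points would typically be term-frequency vectors of the documents being compared (the function name is ours):

import math

def euclidean_distance(p, q):
    # Eq. (8): straight-line distance between two points in n-dimensional space
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))        # 5.0 in 2D
print(euclidean_distance((1, 0, 2), (3, 4, 2)))  # ≈ 4.47 in 3D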


3.5. COSINE DISTANCE
Cosine distance is a commonly used mathematical tool in various fields, particularly in areas related to data analysis and machine learning. It quantifies the dissimilarity between two vectors by considering the angle between them: the cosine distance is defined as one minus the cosine of that angle. For non-negative vectors, such as term-frequency representations of texts, it ranges from 0 to 1, where 0 indicates vectors pointing in the same direction and 1 indicates complete dissimilarity (orthogonal vectors), making it useful for comparing patterns or shapes in high-dimensional spaces. This metric facilitates clustering, classification, and information retrieval tasks by enabling effective comparisons within a dataset. In addition, its geometric interpretation allows us to analyze the relationships between vectors based on their orientation rather than solely relying on their magnitude. Because it handles high-dimensional data effectively and is straightforward to calculate, cosine distance plays a crucial role in many applications such as image recognition, recommendation systems, text analysis, and more [8].

Figure 3. Cosine Distance Measure
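
A short Python sketch of cosine distance computed as one minus the cosine of the angle between two term-frequency vectors (illustrative code; the example vectors are arbitrary):

import math

def cosine_distance(u, v):
    # 1 - cos(theta); for non-negative term-frequency vectors the result lies in [0, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1 - dot / (norm_u * norm_v)

# term-frequency vectors of two short texts over a shared vocabulary
print(round(cosine_distance([1, 2, 0, 1], [1, 1, 1, 0]), 3))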

3.6. JACCARD SIMILARITY


Jaccard Similarity, also known as Jaccard Coefficient, is a statistical measure used to compare the similarity between sets.
It is commonly applied in the fields of information retrieval, data mining, and bioinformatics to analyze the similarity
between different datasets. The Jaccard Similarity index is defined as the size of the intersection of two sets divided by
the size of the union of these two sets [9].
Jaccard Similarity = |A ∩ B| / |A ∪ B| (9)
where A and B are the two sets being compared, |A ∩ B| represents the size of the intersection of A and B, and |A ∪ B|
represents the size of the union of A and B.
The Jaccard Similarity index ranges from 0 to 1, with 0 representing no similarity between the two sets and 1 representing
complete similarity.

Figure 4. Jaccard Similarity Measure
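
A small Python sketch of Eq. (9) on word sets (our illustrative implementation):

def jaccard_similarity(text1, text2):
    # Eq. (9): |A ∩ B| / |A ∪ B| on sets of word tokens
    a, b = set(text1.lower().split()), set(text2.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard_similarity("measuring text similarity", "text similarity measures"))  # 2/4 = 0.5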

4. LIN-WONG DIVERGENCE AND SIMILARITY MEASURE


Lin-Wong (LW) Divergence is a statistical concept used to measure the discrepancy between two probability distributions. It was introduced by Lin and Wong (1990) as a measure for comparing two sets of data [10] and is closely related to the family of divergences studied by Lin (1991) [11]. This divergence quantifies the difference between two probability distributions by computing the Kullback-Leibler divergence between one distribution and the average of the two, providing a more refined picture of how the distributions differ. The methodology has proven useful in various applications such as analyzing protein structures, identifying genes related to diseases, and distinguishing different ecological patterns. The Lin-Wong Divergence allows researchers and practitioners to extract valuable insights from complex datasets by precisely measuring dissimilarities between distributions at a fine-grained level, thereby enabling informed decision-making processes and advancing scientific knowledge across multiple disciplines.
This divergence has been studied in several settings. For example, Pakgohar et al. investigated its properties under the Type-I censoring scheme [12] and used it to construct goodness-of-fit tests based on Type-I censored data [13]. Khalili et al. studied this divergence on residual and past lifetime data [14-15].
This divergence measure can be converted into a similarity coefficient: it suffices to subtract the Lin-Wong divergence from 1, noting that in the Lin-Wong divergence the logarithm is taken to base 2, so that the divergence itself lies between 0 and 1.
The LW divergence measure is given by:
D_LW(P, Q) = ∑ᵢ Pᵢ log₂(2Pᵢ / (Pᵢ + Qᵢ)). (10)

Where Pᵢ and Qᵢ are the probabilities of occurrence of event i in the two probability distributions.

4.1. LIN-WONG SIMILARITY MEASURE


In a discrete space, the LW similarity measure between the probability distributions P and Q of two discrete variables X and Y can be calculated using the following formula:

S_LW(P, Q) = 1 + ∑ᵢ Pᵢ log₂((Pᵢ + Qᵢ) / (2Pᵢ)) = 1 − D_LW(P, Q). (11)

The LW similarity measure is a statistical measure derived from the divergence between two probability distributions. It provides a way to compare how similar or dissimilar two sets of data are, based on their respective probabilities. In essence, the LW similarity measure quantifies the commonality between the two sets.
To compute this measure, one needs the probability of each element in both sets; from these probabilities the information content of each element, that is, how much information it carries within its set, can be derived.
A higher LW divergence value therefore indicates a greater dissimilarity between the two distributions, while a lower value suggests a higher similarity. Correspondingly, a higher LW similarity value indicates a greater similarity between the two distributions.
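
The following Python sketch illustrates Eq. (11), assuming the distributions P and Q are estimated as relative word frequencies over the shared vocabulary of the two texts (an assumption of ours for illustration; other estimators could be used):

import math
from collections import Counter

def word_distribution(text, vocabulary):
    # relative frequency of each vocabulary word in the text
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total for w in vocabulary]

def lin_wong_similarity(p, q):
    # Eq. (11): S = 1 + sum_i P_i * log2((P_i + Q_i) / (2 * P_i)), i.e. 1 - D_LW
    s = 1.0
    for pi, qi in zip(p, q):
        if pi > 0:
            s += pi * math.log2((pi + qi) / (2 * pi))
    return s

text1 = "text similarity is useful for plagiarism detection"
text2 = "text similarity helps information retrieval and plagiarism detection"
vocab = sorted(set(text1.lower().split()) | set(text2.lower().split()))
p = word_distribution(text1, vocab)
q = word_distribution(text2, vocab)
print(round(lin_wong_similarity(p, q), 3))  # closer to 1 for more similar word distributions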

4.2. LIN-WONG SIMILARITY AS THE CSISZAR MEASURE


The Csiszar measure is an influential index that has gained significant recognition among practitioners in diverse fields such as information theory, statistics, and machine learning. Named after its creator, Imre Csiszár, this measure is widely used to quantify the amount of information shared between two random variables or sets of data. Its versatility lies in its ability to capture both the dependence and independence relationships between variables.
In information theory, the Csiszar measure plays a crucial role in understanding the mutual information between two random variables.
It provides a quantitative measure of how much one variable reveals about another, allowing us to assess the strength of their
relationship. This has applications in data compression, channel coding, and cryptography, where efficient encoding and decoding
schemes rely on accurately estimating mutual information.
Moreover, statisticians utilize the Csiszar measure to evaluate the similarity or dissimilarity between probability distributions. By
comparing their divergence using this index, researchers can assess how well a model fits observed data or compare different models'
goodness-of-fit.
Let φ: [0, ∞) → R be a convex function. Then Csiszar (1972) introduced the φ-divergence functional as a generalized measure of information on the set of probability distributions. The φ-divergence can be written as [16]:

D_φ(P, Q) = ∑ᵢ Qᵢ φ(tᵢ). (12)

Where tᵢ = Pᵢ / Qᵢ.
Some properties of the φ-divergence are φ″(t) ≥ 0 and φ(1) = 0. Based on the mentioned definition we can recover the LW divergence. If we choose φ(t) = t log₂(2t / (1 + t)), t > 0, then the LW divergence can be written as:
D_LW(P, Q) = ∑ᵢ Qᵢ φ(tᵢ) = D_φ(P, Q). (13)
Dragomir (2003) has proved upper and lower bounds for the φ-divergence under likelihood ratio constraints, which in particular yield bounds for the LW divergence [17].
Based on the given definition, the LW similarity measure can be obtained from the Csiszár measure as
S_LW(P, Q) = 1 − D_φ(P, Q) = 1 − ∑ᵢ Qᵢ φ(tᵢ), (14)
where the logarithm in φ is taken to base 2.
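
As a quick numerical check of this representation (illustrative code of ours, assuming strictly positive distributions), the φ-divergence form of Eqs. (12)-(13) reproduces the direct LW divergence of Eq. (10):

import math

def phi(t):
    # convex kernel phi(t) = t * log2(2t / (1 + t)), with phi(1) = 0
    return t * math.log2(2 * t / (1 + t)) if t > 0 else 0.0

def lw_via_csiszar(p, q):
    # Eqs. (12)-(13): D_phi(P, Q) = sum_i Q_i * phi(P_i / Q_i)
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

def lw_direct(p, q):
    # Eq. (10): D_LW = sum_i P_i * log2(2 P_i / (P_i + Q_i))
    return sum(pi * math.log2(2 * pi / (pi + qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(math.isclose(lw_via_csiszar(p, q), lw_direct(p, q)))  # True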

5. CONCLUSIONS

This paper provides an exploration of different approaches to calculating text similarity. In this paper we discuss various character-
based measures, such as Jaro and N-gram algorithms, as well as alternative perspectives offered by the Dice coefficient and Euclidean
distance. Additionally, cosine distance and Jaccard Similarity are introduced as measures that compare texts based on vector
representations and set intersections, revealing conceptual relationships. The paper also introduces the Lin-Wong Similarity measure,
which quantifies the commonality between probability distributions of words in texts and is useful for assessing semantic similarity.
However, the choice of the best measure depends on the specific task and dataset, and further research is needed to develop even more
sophisticated methods for capturing the complexities of textual meaning. We emphasize the practical applications of these methods in
plagiarism detection, information retrieval, and machine translation. We also suggest future research directions, such as incorporating
knowledge graphs or deep learning techniques, to advance text similarity measurement. Ultimately, the significance of text similarity
is connected to the broader field of natural language processing and its goal of achieving true human-like understanding of language.

6. REFERENCES
[1] Khoshnavataher, K., et al., Developing monolingual Persian corpus for extrinsic plagiarism detection using artificial
obfuscation. Notebook for PAN at CLEF, 2015
[2] Singh, S., Statistical Measure to Compute the Similarity between Answers in Online Question-Answering Portal. International
Journal of Computer Applications, 2014. 103.
[3] Gomaa, W.H. and A.A. Fahmy, A survey of text similarity approaches. International Journal of Computer Applications, 2013.
pp. 13-18.
[4] Agbehadji, I. E., Yang, H., Fong, S., & Millham, R. (2018, August). The comparative analysis of Smith-
Waterman algorithm with Jaro-Winkler algorithm for the detection of duplicate health related records. In 2018
International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD)
(pp. 1-10). IEEE.
[5] Khoirunnisa, F., Yusliani, M. T., & Rodiah, M. T. (2020). Effect of N-Gram on Document Classification on the Naïve Bayes
Classifier Algorithm. Sriwijaya Journal of Informatics and Applications; Vol 1, No 1 (2020).
http://sjia.ejournal.unsri.ac.id/index.php/sjia/article/view/13.
[6] Anuar, N., & Md Sultan, A. B. (2010). Validate conference paper using dice coefficient.
http://psasir.upm.edu.my/id/eprint/17570/
[7] Danielsson, P.-E., Euclidean distance mapping. Computer Graphics and image processing, 1980. 14(3): pp. 227-248.
[8] Sidorov, G., Gelbukh, A., Gómez-Adorno, H., & Pinto, D. (2014). Soft similarity and soft cosine measure: Similarity of
features in vector space model. Computación y Sistemas, 18(3), 491-504.
[9] Lo, G. S., & Dembele, S. (2015). Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to
Gospels comparison. arXiv preprint arXiv:1508.03772.
[10] Lin, J., & Wong, S. K. M. (1990). A new directed divergence measure and its characterization. International Journal of General Systems, 17(1), 73-81.
[11] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
[12] Pakgohar, A., Habibirad, A., & Yousefzadeh, F. (2019). Lin–Wong divergence and relations on type I censored data. Communications in Statistics-Theory and Methods, 48(19), 4804-4819.
[13] Pakgohar, A., Habibirad, A., & Yousefzadeh, F. (2020). Goodness of fit test using Lin-Wong divergence based on Type-I censored data. Communications in Statistics-Simulation and Computation, 49(9), 2485-2504.
[14] Khalili, M., Habibirad, A., & Yousefzadeh, F. (2022). Testing Exponentiality Based on the Lin-Wong Divergence on the
Residual Lifetime Data. Journal of the Iranian Statistical Society, 18(2), 39-61.
[15] Khalili, M., Habibirad, A., & Yousefzadeh, F. (2018). Some properties of Lin–Wong divergence on the past lifetime data. Communications in Statistics-Theory and Methods, 47(14), 3464-3476.
[16] Csiszar, I. (1972). A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1-4),
191-213.
[17] Dragomir, S. S. (2003). Bounds for f-divergences under likelihood ratio constraints. Applications of
Mathematics, 48(3), 205-223
