
Knowledge-Based Systems xxx (xxxx) xxx

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics✩

Jasmeet Singh a, Vishal Gupta b,∗
a Thapar Institute of Engineering and Technology, India
b University Institute of Engineering and Technology, Panjab University, Chandigarh, India

article info

Article history:
Received 30 August 2018
Received in revised form 14 May 2019
Accepted 15 May 2019
Available online xxxx

Keywords:
Stemming
Inflection
Morphology
Corpus
Information retrieval
Natural language processing

abstract

Word stemming is a widely used mechanism in the fields of Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient corpus without using any language-related rules. In this article, we propose a fully unsupervised language-independent text stemming technique that clusters morphologically related words from the corpus of the language using both lexical and co-occurrence features such as lexical similarity, suffix knowledge, and co-occurrence similarity. The method applies to a wide range of inflectional languages as it identifies morphological variants formed through different linguistic processes such as affixation, compounding, and conversion.
The proposed approach has been tested in an Information Retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. A significant improvement over word-based retrieval, five other corpus-based stemmers, and rule-based stemmers has been achieved in all the languages. Besides information retrieval, the proposed approach has also been tested in text classification and inflection removal tasks. Our algorithm excelled over other baseline methods in all the test scenarios. Thus, we successfully achieved the objective of developing a multipurpose stemming algorithm that can be used not only for the information retrieval task but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal.
© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Word stemming is a widely used mechanism in Information Retrieval (IR) and Natural Language Processing (NLP) systems to transform morphological variant word forms to their base forms. For example, the variant word forms abandons, abandoned, abandoning, abandonment, abandonments, etc. are transformed to their base form abandon through stemming.

In Information Retrieval systems, text stemming serves two main purposes. Firstly, stemming improves the ability of an IR system to retrieve more relevant documents by solving the problem of vocabulary mismatch between queries and documents at the time of indexing and searching. Stemming helps in increasing the within-document term frequency, which results in promoting more relevant documents to superior ranks. Naturally, the performance of a stemmer in improving retrieval accuracy increases with the increase in the morphological complexity of the language. Secondly, stemming reduces the vocabulary size by mapping the morphological variants to a single root. It helps in reducing the storage space required to store the indexing structures of an IR system and in improving the search efficiency.

In Natural Language Processing systems, stemming helps to reduce the dimensionality of the feature set or training data for statistical models. Stemming is employed at the text pre-processing stage to improve the performance of a number of linguistic processing applications such as Text Summarization [1,2], Text Classification and Clustering [3,4], Machine Translation [5], Opinion Mining [6], Sentiment Analysis [7,8], Word Sense Disambiguation [9], POS Tagging [10], etc. Hence, stemming is widely acceptable in terms of consistency and user acceptance.

In this article, we present a fully unsupervised language-independent stemming approach that uses both lexicon and corpus statistics to cluster morphologically related words from the ambient corpus. The lexicon-analysis based features like string similarity, substring frequency, etc. help to conflate the majority of morphological variants in the language produced through suffixation. But these features miss many other morphological variations caused by different linguistic processes such as conversion, compounding, etc. For example, the lexical

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.05.025.
∗ Corresponding author.
E-mail address: [email protected] (V. Gupta).

https://doi.org/10.1016/j.knosys.2019.05.025
0950-7051/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: J. Singh and V. Gupta, A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics, Knowledge-Based Systems
(2019), https://doi.org/10.1016/j.knosys.2019.05.025.

knowledge fails to group the morphological variants cat and catlike (where catlike is formed through the compounding process). The corpus-analysis based features such as co-occurrence frequency help to conflate all such variants. Moreover, the methods which only use lexicon statistics sometimes conflate many unrelated words by treating valid word endings as suffixes (as in the case of shin and shining). The corpus analysis helps in reducing all such erroneous conflation. Thus, our proposed method employs both lexicon and corpus-based features to effectively conflate morphological variations caused by various linguistic processes such as suffixation, conversion, and compounding, thereby making the method suitable for a wide range of inflectional languages.

Our proposed method works in two phases. During the first phase, the distinct words collected from the corpus are grouped into initial equivalence classes according to a common prefix of a specific length. We then proceed by computing a similarity score among all the word pairs in each equivalence class using both lexical and semantic features. In the second phase, the relationships between the word pairs in each class are converted into weighted undirected graphs. The nodes of a graph are the distinct words belonging to the class, and there are edges between the possibly related word pairs. The graph is then finally decomposed to conflate morphologically related words. We tested our proposed stemming algorithm on four languages (English, Hungarian, Bengali, and Marathi) belonging to different language families (Germanic, Uralic, and Indo-Aryan) and of varying complexity and origin.

This article makes the following important contributions:

(1) It presents a novel corpus-based fully unsupervised stemming method that uses both lexical and corpus-based features.
(2) It experimentally verifies that the proposed method outperforms the state-of-the-art baseline linguistic and corpus-based stemmers in the information retrieval task.
(3) The article experimentally confirms that the proposed stemming technique can be used as a multi-purpose tool in a number of applications such as text classification, inflection removal, sentiment analysis, etc.

The rest of the article is organized as follows. The related work is presented in Section 2. In Section 3, the proposed unsupervised corpus-based stemming algorithm is described. In Section 4, the performance of the proposed approach is compared with other baseline stemmers in information retrieval using standard TREC, CLEF, and FIRE test collections. In Section 5, the effect of the type of training on the performance of stemmers is presented. In Section 6, the experiments related to performance evaluation of the proposed algorithm in text classification and sentiment analysis tasks are presented. The inflection removal experiments are presented in Section 7. In Section 8, the efficiency of the proposed stemmer in terms of processing time and strength is compared with the baseline stemmers. Section 9 concludes this article.

2. Related work

In a natural language, new morphological variant words are formed from the basic word form through various linguistic processes. Generally, variant words are formed through the affixation process, i.e., the addition of prefixes and/or suffixes according to language-specific rules. Compounding is another linguistic process where two or more words are combined to form new words. The morphological variants of a language are divided into two main categories: inflectional and derivational variants. In inflectional morphology, the variations in the word do not affect its meaning or part-of-speech (e.g., drink to drinks). In derivational morphology, variations in word forms change the meaning and/or part-of-speech (e.g., drink to drinkable). Morphological analysis, lemmatization, and stemming are three main mechanisms that are used to normalize the variant words formed through these processes. The morphological analysis mechanism involves segmentation of words into valid linguistic components. Lemmatizers handle the variants formed through inflectional morphology and usually return a valid dictionary base form. Stemmers, on the other hand, handle both inflectional and derivational variants and do not necessarily return a valid linguistic word form. In this article, we focus on unsupervised automatic stemming techniques that are used to improve the performance of NLP and IR systems.

The existing stemming techniques are broadly classified into two main categories: rule-based (language-specific) and corpus-based (language-independent or statistical) techniques. Rule-based stemming techniques encode a set of language-specific rules to transform the morphological variants to their base forms. These stemming techniques may range from simple approaches such as removal of plurals and verb forms to complex approaches that remove a variety of suffixes. One notable feature of these techniques is their ease of use, that is, the rules once created can be applied to any corpus of the language without additional processing. But the creation of these rules is very time consuming and requires linguistic experts and resources like dictionaries and stem tables, so these stemming methods are not preferred for resource-scarce languages. Corpus-based stemmers, on the other hand, are based on semi-supervised or unsupervised learning of the morphology of the language from the ambient corpus. The major advantage of corpus-based stemmers over rule-based stemmers is that they obviate the need for any language expertise or language-specific resources. Moreover, these stemming techniques can find less frequent and less obvious cases during the processing of a large corpus and can handle languages that have complex morphology and sparse data. The corpus-based techniques incorporate new languages into the system with very little effort, and this is especially useful for applications related to information retrieval. In the following subsections, we review some of these methods.

2.1. Rule-based methods

The Porter stemming algorithm (1980) is the most popular and widely used stemming approach in English IR, possibly due to its balance between simplicity and efficiency. A number of other rule-based linguistic stemmers have been proposed in the literature, such as the Lovins stemmer [11], Dawson stemmer [12], Paice/Husk stemmer [13], Krovetz stemmer [9], XEROX stemmer [14], etc. The effectiveness of stemming in English IR has shown mixed results and has always been a debated affair. Lennon et al. [15] and Harman [16] experimented with English rule-based stemmers in the IR task without any positive results. But subsequent studies [9,14,17] reported positive outcomes in English IR experiments.

Stemming in English has shown mixed results, but for languages with complex morphology such as Hungarian, Finnish [18,19], Marathi, Bengali [20,21], Dutch [22], and Czech [23,24], stemming is found to be quite beneficial. A number of studies [18,25] have demonstrated the effectiveness of light stemming approaches in ad-hoc retrieval tasks for both European and Asian languages.

2.2. Corpus-based methods

Corpus-based stemming is a popular stemming approach as it discovers equivalence classes of the variant words from the ambient corpus without any linguistic knowledge [17,26]. These

stemming techniques reflect certain features of the ambient corpus and are sometimes found to be more efficient as compared to the rule-based stemmers [27]. Several studies [20,21,28] have revealed that the corpus-based stemmers are better substitutes for the rule-based stemmers, especially for resource-scarce languages. The corpus-based stemming methods are broadly classified into two main categories, namely lexicon analysis based approaches and corpus analysis based approaches. Lexicon analysis based approaches analyze the lexicon (words obtained from the corpus) to generate groups of morphologically related words. These methods discover potential stems and derivations using features like substring frequency, string distance, etc. Corpus analysis based approaches, on the other hand, analyze the words on the basis of their co-occurrences or context with the other words in a particular corpus. The corpus analysis based approaches require relatively larger corpora as compared to lexicon analysis based approaches to enhance the reliability of co-occurrence information. Our proposed approach uses both lexicon analysis and corpus analysis based features to learn morphological clusters from the unannotated corpus. Some popular lexicon and corpus analysis based methods proposed in the literature are discussed in the subsequent subsections.

2.2.1. Lexicon analysis based approaches

Oard et al. [29] proposed an unsupervised stemming approach that discovers potential suffixes from the ambient corpus and uses them in the suffix stripping process to obtain the stem of the word. In the proposed technique, suffixes of length one to four characters are identified from the first 5,000,000 words of the corpus. The threshold for the number of suffixes of each length is chosen as the point at which the second derivative of the count versus rank plot is maximized. The stem of each input word is then obtained by removing the suffixes on the basis of the longest match principle. Paik and Parui [26] also developed a suffix discovery based unsupervised stemming approach. The potential suffixes are first identified from the corpus on the basis of their frequency. The words in the corpus are then grouped into classes according to a common prefix of the same length. Finally, the strength of each class is computed using the potential suffix knowledge, which determines the root word of each class.

Goldsmith [30] proposed a morphological analyzer based on the frequency of substrings and the Minimum Description Length (MDL) principle. The method first computes the frequency of the stems and suffixes at every possible split-point for each word appearing in the corpus. The optimal split-point is then determined by minimizing the number of bits required to store the document collection. The authors also developed a software framework named Linguistica [31]. Creutz and Lagus [32] also proposed a recursive MDL based approach named Morfessor Baseline for unsupervised segmentation of words into morphemes. The model finds the optimal segmentation of the source vocabulary by minimizing an MDL based cost function that considers the cost of representation of both the source and the morph vocabulary. Two category-based extended versions of the baseline model, namely Morfessor Categories – Maximum a Posteriori (MAP) [33] and Morfessor Categories – Maximum Likelihood [34], have also been proposed that categorize the morphemes into various categories such as prefix, suffix, stem, or non-morpheme. Both category-based models refine the segments produced by the baseline model by correcting various under-segmentation and over-segmentation errors.

Melucci and Orio [35] also used the concept of substring frequency along with the Hidden Markov Model (HMM) to develop a corpus-based stemmer. In this technique, the letters of the words appearing in the corpus are modeled as the states of HMMs. These states are partitioned into prefix and suffix sets with some presumptions. For any given word, there are multiple paths or HMMs. The most probable path is chosen by maximizing the probabilities that are estimated from the substring frequency. The change from a state belonging to the prefix set to a state belonging to the suffix set in the chosen path determines the split-point. The series of letters before the split-point represents the stem of the word.

Bacchin et al. [36] proposed a probabilistic framework for unsupervised stemmer generation using the substring frequency and link analysis method. The proposed approach first identifies a set of substrings by splitting each source word in the corpus at every possible point. A directed graph is then formed whose nodes are the substrings, and there is an edge from node a to node b if there exists a word w = ab in the corpus. The Hyperlink-Induced Topic Search (HITS) method is then used to compute the prefix and suffix scores from the directed graph. Finally, the best split of the word is determined by maximizing the probability of the prefix and suffix pair.

Majumder et al. [20] proposed an unsupervised approach named Yet Another Suffix Stripper (YASS) that clusters morphologically related words occurring in the corpus on the basis of string distance. The authors proposed certain string distance measures that give a low dissimilarity score to morphologically related words. After calculating the dissimilarity score, the words are clustered using a graph-based complete linkage algorithm. The method was not found to be suitable for highly agglutinative languages where long suffixes are present. Baroni et al. [37] and Fernández et al. [38] also proposed similar stemming techniques that generate equivalence classes of morphologically related words on the basis of the well-known edit distance. Recently, we developed a stemming technique that uses a hierarchical agglomerative clustering algorithm to discover groups of morphologically related words on the basis of the well-known Jaro–Winkler edit distance [39]. Kasthuri et al. [40] also proposed a language-independent stemmer that uses string distance metrics and dynamic programming to find the stem of the word. Chavula and Suleman [41] pointed out that the current orthographical similarity measures used for conflating variant word forms do not fairly take into account the distribution of morphemes for the words in morphologically rich languages.

Paik et al. [21] proposed a stemming method named Graph based Stemmer (GRAS) that uses suffix pair information to generate equivalence classes of morphologically related words. The method works in two phases. In the first phase, the suffix knowledge is extracted from the corpus, and in the second phase, the words are clustered on the basis of the suffix knowledge using undirected graphs. Our proposed technique is similar in spirit to this stemming method but differs in many aspects. Firstly, our proposed approach uses lexical similarity, suffix knowledge, common prefix length, and co-occurrence similarity to cluster morphologically related variants from the corpus, so it is suitable for a wider range of languages and not just suffixing languages. Moreover, the proposed technique uses an effective and simple association measure to check whether the nodes in the graph are related or not. Iheanetu and Oha [42] applied frequent pattern theory to unsupervised stemming. The words in the language are mapped into patterns of consonants and vowels.

2.2.2. Corpus analysis based approaches

Xu and Croft [17] proposed a co-occurrence based corpus analysis technique to refine the equivalence classes already created by an aggressive stemmer or to develop a stemmer. The method first computes the association between the words in a group (provided by some aggressive baseline stemmer, or which share a common-length prefix) using a variation of the Expectation Mutual Information Measure (EMIM). The graph-based optimal partitioning algorithm


and/or connected component algorithm then conflates morphologically related words on the basis of EMIM scores. The major limitation of this stemmer is that the graph-based clustering algorithm used in the approach is not suitable for highly agglutinative languages where large classes of morphologically related words are present. Paik et al. [23] also proposed a similar co-occurrence based approach which uses a comparatively simple association measure and a nearest-neighbor based clustering method to group morphologically related words. Bhamidipati and Pal [4] also proposed a stemmer refinement technique using an advanced form of co-occurrence, i.e., the distributional similarity between the words. The proposed approach assumes that the documents in the collection belong to different categories, and the distributional similarity between the word pairs is computed from the frequency of occurrence of the words in different document categories.

Peng et al. [43] highlighted that traditional stemming approaches blindly stem each occurrence of a term without considering its context. The authors proposed a context-sensitive stemming method that performs context-based analysis both at the query and the document side to determine the appropriate morphological variant that must be used to replace the term. Paik et al. [27] proposed a query-based stemming method that provides morphological variants that are thematically coherent with the query words. The method significantly increases the retrieval performance by reducing the effect of those morphological variants which are not related to the original query intent.

Brychcín and Konopík [28] proposed a stemming method that significantly improves precision at the expense of a small decrease in recall. The method works in two stages. During the first stage, the words in the corpus are clustered on the basis of the normalized longest common prefix using modified maximum mutual information (MMI) clustering. The clusters formed in the first stage are used as training data for the maximum entropy classifier in the second stage to decide when and how to stem the word. Sakakini et al. [44] discovered the relation between roots, patterns, and suffixes using orthographic rules and semantic knowledge available in the words present in the corpus.

3. The proposed stemming approach

The main objective of our work is to propose an efficient fully unsupervised corpus-based stemming approach which can serve as a multi-purpose tool in a number of Information Retrieval and Natural Language Processing applications. The proposed stemming approach exploits the lexical, semantic, and co-occurrence knowledge of the words to group morphologically related words appearing in the corpus. The following characteristics make our stemming method attractive: (i) effectiveness in terms of performance in various tasks, (ii) computational cost, (iii) language-independent nature, and (iv) robustness in terms of lexical and corpus-based features.

Our algorithm works in two phases. During the first phase, the similarity between the word pairs appearing in the unannotated corpus is computed using lexical, co-occurrence, and suffix pair information. The use of lexical, co-occurrence, and common suffix pair features helps in identifying various types of morphological variations among the words (i.e., variants formed through affixation, compounding, and conversion). The similarity scores obtained in the first phase are used in the second phase by a graph-based clustering algorithm to group morphologically related words. The architecture of our proposed system is shown in Fig. 1.

3.1. Measuring string similarity

In this phase, the similarity between the words in the corpus is computed. The string similarity function maps a pair of words p and q to a real number r, where a higher value of r denotes greater similarity between the word pair p and q. We first collect the unique words from the corpus after removing the stop words and numbers. The words are then divided into a number of classes sharing a common prefix of a specific length. We chose the common prefix length as the average word length of the concerned language. The creation of initial classes according to common prefix length reduces the computational cost of the algorithm, as the similarity scores are computed between the word pairs belonging to the same class rather than all the words in the corpus. We define the similarity S between two words w1 and w2 in a corpus as follows:

S(w1, w2) = lexical_sim(w1, w2) + co-occurrence_sim(w1, w2) + potential_suffix_pair_freq(w1, w2)    (1)

Lexical Similarity: We define a lexical similarity function that gives a high score to morphologically related words without any knowledge of the language. The metric rewards the longest common prefix and penalizes any early mismatch while comparing the two strings. The metric considers the lengths of the strings, the common prefix length, and the number of common characters while computing the similarity between the strings.

For any two given strings w1 = x1x2x3 ... xn and w2 = y1y2y3 ... yn′ (n ≥ n′), we first pad null characters to the shorter string so as to make the lengths of both strings equal. We then compute a boolean function si (as given in Eq. (2)) which is 1 if the characters in the ith location of w1 and w2 are the same; otherwise, it is 0.

si = 1 if xi = yi (1 ≤ i ≤ n′); 0 otherwise    (2)

The lexical similarity between w1 and w2 is then computed as follows:

lexical_sim(w1, w2) = (p / n) × Σ_{i=1}^{n} (0.5)^i × si    (3)

where p is the common prefix length of strings w1 and w2.

Co-occurrence Similarity: Like Xu and Croft [17], we hypothesize that words which co-occur in a document quite frequently have a higher likelihood of proximity and are thus better candidates to be clustered in the same class. A document describing a particular topic very often uses terms which are related to the original document topic. For instance, a document about education frequently uses its morphological variants such as educate, educates, educated, etc. Moreover, the morphological variants of a language are semantically related to the meaning of the original word form, and their document-level co-occurrence knowledge provides strong evidence for conflating them together.

We used the Dice coefficient (as given in Eq. (4)) to compute the co-occurrence similarity between a pair of words. The metric provides a degree of association between word pairs on the basis of their joint frequency in the corpus.

co-occurrence_sim(w1, w2) = 2 × df(w1, w2) / (df(w1) + df(w2))    (4)

where df(w1, w2) denotes the document frequency of co-occurrence of words w1 and w2, and df(w1) and df(w2) denote the document frequencies of words w1 and w2 respectively.

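The first two components of Eq. (1) are simple to compute once document frequencies are available. The following is a minimal Python sketch of the lexical term (Eqs. (2)–(3)) and the Dice term (Eq. (4)); the document-frequency arguments and the suffix-pair lookup dictionary are illustrative placeholders, not the authors' implementation.

```python
def lexical_sim(w1: str, w2: str) -> float:
    """Eq. (3): (p / n) * sum over i of (0.5)**i * s_i.

    n is the length of the longer string; the shorter string is
    implicitly padded with nulls, so unmatched tail positions
    contribute s_i = 0 (Eq. (2)).
    """
    n = max(len(w1), len(w2))
    if n == 0:
        return 0.0
    m = min(len(w1), len(w2))
    p = 0  # p: length of the common prefix of w1 and w2
    while p < m and w1[p] == w2[p]:
        p += 1
    # sum of (0.5)**i over positions i (1-indexed) where characters agree
    weighted = sum(0.5 ** i for i in range(1, m + 1) if w1[i - 1] == w2[i - 1])
    return (p / n) * weighted


def co_occurrence_sim(df_pair: int, df1: int, df2: int) -> float:
    """Eq. (4): Dice coefficient over document frequencies."""
    return 2.0 * df_pair / (df1 + df2) if (df1 + df2) else 0.0


def similarity(w1, w2, df_pair, df1, df2, suffix_pair_freq=None):
    """Eq. (1): sum of the three components.

    `suffix_pair_freq` stands in for the normalized potential-suffix-pair
    table of Section 3.1 (a hypothetical dict keyed by word pairs).
    """
    suffix_term = (suffix_pair_freq or {}).get((w1, w2), 0.0)
    return lexical_sim(w1, w2) + co_occurrence_sim(df_pair, df1, df2) + suffix_term
```

For instance, lexical_sim("abandons", "abandon") is high because all seven shared prefix characters match at the heavily weighted early positions, while lexical_sim("cat", "dog") is 0 because the common prefix length p is 0.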

Fig. 1. Proposed system architecture.

Potential Suffix Pair Frequency: In most languages, morphological variants are formed by the addition of suffixes to the base word form. The number of potential suffix pairs induced by a word pair is thus an important feature to find the degree of association between them. We identified potential co-occurring suffix pairs rather than single suffixes from the corpus, as this helps in identifying linguistically valid suffixes.

A pair of suffixes s1 and s2 are said to be co-occurring if they can be added to a common prefix p to form linguistically valid words w1 and w2 (i.e., w1 = ps1 and w2 = ps2). The number of word pairs induced by a co-occurring suffix pair is termed the frequency of the suffix pair. It is not necessary that all the co-occurring suffix pairs are linguistically valid endings. So, a candidate co-occurring suffix pair is said to be a potential (valid) suffix pair if its frequency is greater than a pre-decided threshold (θ). We experimentally verified that a candidate suffix pair can be a potential suffix pair if the number of word pairs induced by the suffix pair is 5 or more. So, we use θ = 5 to decide potential suffix pairs. For example, we found ⟨s, ing⟩ to be one of the potential suffix pairs because its frequency is more than 5 (⟨sings, singing⟩, ⟨reads, reading⟩, ⟨repeats, repeating⟩, ⟨brings, bringing⟩, ⟨plays, playing⟩, ⟨jumps, jumping⟩, etc.).

We followed the following procedure to determine the potential suffix pairs and their frequency. We grouped the words obtained from the lexicon into different classes according to a common prefix of length 4 characters. We did not choose smaller prefix lengths (3 or fewer characters) as they may not return linguistically valid suffixes. Moreover, long prefix lengths may miss some infrequent potential suffix pairs. For every class, co-occurring suffixes are derived from each word pair in that class. After considering all the classes, the frequencies of all the suffix pairs are calculated, and a set of potential suffix pairs is generated. The frequencies of all potential suffix pairs are normalized between 0 and 1 by dividing each frequency value by the maximum suffix pair frequency in the corpus. This is done to avoid any bias, as the values of the other similarity measures (co-occurrence and lexical similarity) lie between 0 and 1. Thus, while comparing two words w1 and w2, we find the potential suffix pair that generates the word pair, and its corresponding normalized frequency is assigned as potential_suffix_pair_freq(w1, w2) in Eq. (1).

Fig. 2 shows the calculation of similarity scores between two

3.2. Morphological class formation

In this phase, the word relationships computed in the first phase are used to identify morphologically related words from the initial equivalence classes. The word relationships computed in the first phase for each class are modeled as a weighted undirected graph, whose nodes are the words of the class. There is an edge between a word pair (w1, w2) if there exists a potential suffix pair that induces the word pair (w1, w2), or if the document-level co-occurrence frequency of the word pair is greater than zero, i.e., df(w1, w2) > 0. The similarity score S(w1, w2) computed between the word pair in the previous step represents the weight of the edge, i.e., w(w1, w2). The number of edges incident on a node denotes the degree of the node, and Neighbor(p) denotes the set of nodes which have an edge from node p.

The graph so formed is then decomposed iteratively into classes of morphologically related words. We first determine the central or pivot node (say u), i.e., the node with the maximum degree. The nodes neighboring the central node are considered in decreasing order of their weights w(u, v), and the association between the nodes is computed according to Eq. (5). If the association score between nodes u and v is greater than or equal to a pre-decided threshold (α), then u and v are put in the same class. Otherwise, the edge between u and v is deleted from the graph. Therefore, at each step, a class of related words is identified and removed from the graph. This process continues until all the nodes of the graph are processed. The value of α is determined experimentally and is discussed in Section 4.5.

Association(u, v) = |Neighbor(u) ∩ Neighbor(v)| / |Neighbor(u) ∪ Neighbor(v)|    (5)

4. Information retrieval experiments

In this section, we describe a series of experiments for evaluating our proposed technique in the information retrieval task. The proposed technique has been tested on standard TREC (Text REtrieval Conference), CLEF (Cross-Language Evaluation Forum), and FIRE (Forum for Information Retrieval Evaluation) ad-hoc track data sets. As a part of our experiments, we compared the
distinct word pairs (conduct, conducted) and (conduct, condition). performance of our proposed stemmer with the state-of-the-art
The document level co-occurrence and suffix frequencies in the linguistic and corpus-based stemmers. In order to deeply analyze
figure have been computed from standard TREC (Text REtrieval the performance of various stemmers, we also studied the impact
Conference) collection consisting of 173,252 Wall Street Journal of stemmers on individual topics (queries).
documents. It is clear from the similarity scores that the words In the subsequent subsections, we describe the experimental
conduct and conducted are closely related as compared to the system and weighting model, test collection, baseline stemming
words conduct and condition. methods, evaluation metrics, evaluation results, and analysis.
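The suffix-pair mining procedure described above is mechanical enough to sketch in code. The following is a minimal illustration, not the authors' implementation: the function name is ours, and cutting each suffix pair at the longest common prefix of the two words (rather than at the fixed 4-character grouping prefix) is our assumption.

```python
import itertools
from collections import Counter, defaultdict

def potential_suffix_pairs(lexicon, prefix_len=4, theta=5):
    """Mine potential (valid) suffix pairs from a lexicon.

    Words are grouped on a common prefix of prefix_len characters;
    every word pair inside a group induces a candidate suffix pair
    (here: the remainders after the longest common prefix of the two
    words). Candidate pairs occurring at least theta times are kept,
    and their frequencies are normalized into [0, 1] by the maximum.
    """
    groups = defaultdict(set)
    for w in lexicon:
        if len(w) >= prefix_len:
            groups[w[:prefix_len]].add(w)

    freq = Counter()
    for words in groups.values():
        for w1, w2 in itertools.combinations(sorted(words), 2):
            # extend the shared prefix beyond the grouping prefix
            i = prefix_len
            while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
                i += 1
            freq[(w1[i:], w2[i:])] += 1

    potential = {pair: n for pair, n in freq.items() if n >= theta}
    peak = max(potential.values(), default=1)
    return {pair: n / peak for pair, n in potential.items()}
```

With a toy lexicon such as sings/singing, reads/reading, plays/playing, jumps/jumping and a lowered threshold (theta=2, since the sample is tiny), this returns {('ing', 's'): 1.0}, matching the ⟨s, ing⟩ example above.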

Please cite this article as: J. Singh and V. Gupta, A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics, Knowledge-Based Systems (2019), https://doi.org/10.1016/j.knosys.2019.05.025.

Fig. 2. Calculation of similarity scores between word pairs.
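The decomposition that consumes these similarity scores, i.e., the pivot selection and the association test of Eq. (5), can be sketched as follows. This is an illustrative reading of the procedure under our own data-structure choices (adjacency sets plus a frozenset-keyed edge-weight map); it is not the authors' code.

```python
def association(neighbors, u, v):
    """Eq. (5): Jaccard overlap of the neighborhoods of nodes u and v."""
    union = len(neighbors[u] | neighbors[v])
    return len(neighbors[u] & neighbors[v]) / union if union else 0.0

def decompose(edges, weights, alpha):
    """Iteratively peel classes of related words off the weighted graph.

    edges   -- dict: word -> set of adjacent words
    weights -- dict: frozenset({w1, w2}) -> edge weight S(w1, w2)
    alpha   -- association threshold
    """
    neighbors = {w: set(adj) for w, adj in edges.items()}
    classes = []
    while neighbors:
        # pivot: the node with maximum degree
        u = max(neighbors, key=lambda w: len(neighbors[w]))
        cluster = [u]
        # visit the pivot's neighbors in decreasing order of edge weight
        for v in sorted(neighbors[u],
                        key=lambda x: weights[frozenset((u, x))],
                        reverse=True):
            if association(neighbors, u, v) >= alpha:
                cluster.append(v)
            else:
                # association too low: delete the edge (u, v)
                neighbors[u].discard(v)
                neighbors[v].discard(u)
        # remove the identified class from the graph
        for w in cluster:
            for x in neighbors.pop(w, set()):
                if x in neighbors:
                    neighbors[x].discard(w)
        classes.append(cluster)
    return classes
```

On a toy graph over conduct, conducted, conducting, and condition with α = 0.2, the first peeled class is [conduct, conducted, conducting], leaving condition as a singleton.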


Table 1
Statistics of corpora used in the experiments.
Language Source No. of documents Mean words per document No. of queries No. of relevant documents
English TIPSTER disks 1 & 2 173,252 337 150 (51–200) 14697
Marathi FIRE 2010 99,275 273 125 (26–150) 1822
Hungarian CLEF 2004–2008 Test Suites 49,530 142 150 (CLEF 2005) 3158
Bengali FIRE 2012 500,122 336 125 (100–225) 5478

4.1. Experimental system and weighting model

We used the TERRIER1 open source search engine for performing indexing and retrieval experiments. TERRIER has been developed at the School of Computing Science, University of Glasgow, and is written in Java. TERRIER is a highly efficient, flexible and comprehensive system for performing indexing and retrieval experiments on large document collections. It employs UTF (Unicode Transformation Format) encoding internally and thus supports document collections written in languages other than English. TERRIER supports frequently used retrieval methods such as Okapi BM25, TF–IDF, language models and models from the probabilistic DFR (Divergence From Randomness) paradigm. In our experiments, we used the IFB2 weighting model from the DFR framework, as it is suggested to be one of the best performing models [28,45,46]. IFB2 uses tf–idf (term frequency–inverse collection term frequency) as the basic randomness model, with the Bernoulli normalization factor (given by the ratio of two Bernoulli processes) as the first normalization and normalization2 as the term frequency normalization (term frequency density is inversely proportional to the document length). The weighting formula for the IFB2 model is given by Eq. (6). The complete description and derivation of the weighting formula can be found in [46].

R(Q, D) = Σ_{t ∈ Q} qtf · ntf · ((ctf + 1) / (dtf · (ntf + 1))) · log2((n + 1) / (ctf + 0.5))    (6)

ntf = term_freq · log2(1 + mean_doc_length / doc_length)    (7)

where qtf, ntf, and ctf represent query term frequency, normalized term frequency, and collection term frequency respectively. dtf and n represent the document frequency of term t and the collection size respectively.

4.2. Description of corpora used in experiments

We conducted retrieval experiments in four languages, namely English, Marathi, Hungarian, and Bengali, using standard TREC, CLEF, and FIRE collections. The languages considered in the analysis are of varying origin and complexity. Among these languages, Marathi and Hungarian are highly inflectional and agglutinative languages, whereas English has the simplest morphology. The statistics of the corpora used in the experiments are presented in Table 1. Each test collection comprises queries which are formulated according to TREC ad hoc task guidelines. The queries have three sections: title, description, and narrative. The title part reflects short user queries (a few words) submitted to the search engine. The description part of the query provides a one or two sentence description of the user's requested data. The narrative part describes the condition of relevance for the requested data in a few sentences. A number of relevant documents from the collection are provided for each query. We used the title and description sections of the query in our experiments. The detailed description of the test collections used in the Information Retrieval experiments is as follows:

English: The English retrieval experiments have been conducted on 173,252 Wall Street Journal documents containing an average of 337 words per document. The collection is supplemented with 150 queries (51–200) from the TREC 1–2–3 ad hoc track.

Marathi: The Marathi retrieval experiments have been performed on 99,275 documents from the Sakal and Maharashtra Times newspapers provided by FIRE2 2010. The collection contains an average of 273 words per document. The document collection is supplemented with 125 queries (26–150) from the FIRE 2008, 2010, and 2011 ad hoc retrieval tracks.

Hungarian: The Hungarian Information Retrieval experiments have been carried out on the 2002 Magyar Hirlap document collection provided by CLEF.3 The collection contains 49,530 documents with an average of 142 words per document. The collection comprises 150 CLEF 2005 ad hoc style queries.

Bengali: The Bengali FIRE4 2012 document collection comprises 500,122 BDNews24 and Anandabazar Patrika news articles collected from the period 2002 to 2007. The corpus is supplemented by 125 queries (100–225) taken from the FIRE 2010, 2011, and 2012 ad hoc tracks.

4.3. Baseline stemming methods

In this subsection, we present an overview of the stemming methods tested in our experiments. We compared our proposed method with no stemming, five corpus-based stemmers, two language-specific stemmers for English, and one language-specific stemmer for each of the other languages under analysis. The baseline stemming methods have been chosen in such a way that they represent different classes of algorithms: language-specific stemmers and statistical stemmers based on lexicon analysis or co-occurrence analysis. The descriptions and settings of the baseline stemmers used in the experiments are as follows:

Language-Specific Stemmers: For English, we used the well-known Porter [47] and Lovins [11] stemmers. In the case of Hungarian, we used the Porter style stemmer developed through the Snowball framework. For Bengali and Marathi, the aggressive rule-based stemmers5 described in [25] have been used, which remove inflectional variations in nouns and adjectives and a few commonly occurring derivational suffixes.

XU Stemmer: The co-occurrence analysis based stemmer proposed by Xu and Croft [17] has been used. The equivalence classes of morphological variants have been created using a connected component algorithm rather than the optimal partitioning method due to the nature of the languages we are studying. The cutoff value for EMIM is set to 0.01 as recommended by the authors.

1 Available at http://terrier.org.
2 Available at http://fire.irsi.res.in/fire/static/data.
3 Available at http://clef-campaign.org/.
4 Available at http://fire.irsi.res.in/fire/static/data.
5 Available at http://members.unine.ch/jacques.savoy/clef/index.html.
6 Available at http://www.cis.hut.fi/projects/morpho/.

MORFESSOR: The unsupervised recursive Minimum Description Length based morphological analyzer has been used. The results have been produced from the publicly available author's code6


for the Morfessor Baseline model. The method does not require any parameter tuning.

Yet Another Suffix Stripper (YASS): An unsupervised stemmer that uses lexical information between word pairs to group morphologically related words has been used [20]. As per the recommendation of the authors, the cutoff for the complete linkage clustering algorithm has been set to 1.5 for all the languages. The results have been produced using the code7 made available by the authors.

GRAph-based Stemmer (GRAS): The graph-based clustering method for the generation of morphological classes described in [21] has been used. The threshold values for suffixes and cohesion (to decide the relatedness of two node pairs) have been set to 4 and 0.8 respectively, as per the recommendation of the authors.

High Precision Stemmer (HPS): An unsupervised stemming method that discovers morphologically related words from the corpus based on both lexical and semantic knowledge has also been tested [28]. The results have been produced by the code8 made available by the authors. The threshold value for the minimum lexical distance has been set to 0.7 for English and Bengali, and 0.6 for the highly inflectional languages, Marathi and Hungarian. The maximum suffix length has been set to 3 as suggested by the authors.

LEXSTEM: This baseline method uses the Jaro–Winkler distance to cluster words using a hierarchical clustering method [39]. The clustering cut-off has been set to 0.1 for English and 0.2 for the other languages.

7 Available at http://www.isical.ac.in/~clia/resources.html.
8 Available at http://liks.fav.zcu.cz/HPS/.

4.4. Evaluation measures

We compared the retrieval performance of the various stemmers using Mean Average Precision (MAP) as the primary evaluation measure. R-Precision (R–P) and Precision@10 (P@10) values have also been reported in order to analyze the precision enhancing capability of the stemmers. In order to measure the ranking quality, Normalized Discounted Cumulative Gain@10 (NDCG@10) has also been reported. The number of relevant documents retrieved (Rel. Ret.) by each stemming method has also been reported.

The query-by-query comparison of average precision values has also been performed for each language. The number of topics whose average precision values are greater or less than the no stemming baseline has been reported. The Robustness Index (as given in Eq. (8)) reports the number of queries improved and the extent of improvement by each stemmer for each language. The topic-wise average precision values have also been statistically compared using a paired t-test at the 95% confidence level.

RI = (N+ − N−) / |Q|    (8)

where N+ and N− are the numbers of topics whose average precision values are more or less than no stemming.

4.5. Parameter setting

Our proposed stemming approach depends upon only one corpus-dependent parameter, i.e., the association cut-off (α), which decides whether two nodes in the weighted graph are morphologically related or not. The value of this cut-off should be chosen carefully, as a large value of α (more than 0.5) rejects many neighbors for grouping with the central node, whereas a low value of α (nearer to zero) would create a lenient stemmer that considers every neighbor for grouping with the central node. To maintain a balance between both these cases, a value greater than 0.1 and smaller than 0.5 may be an appropriate choice.

We performed retrieval experiments at different cut-off values in all the four languages to select an appropriate value of the threshold. We used different query sets in the retrieval experiments for learning parameters (this section) and testing (Section 4.6) to avoid any bias. For learning the value of the parameter, we used 50 queries each for English (51–100), Bengali (176–225), Marathi (126–175), and Hungarian (CLEF 2005). The retrieval results for all the languages at the different cut-off values are shown in Table 2. The bold values in the table indicate the best MAP values, which are obtained at α = 0.2 for English and Bengali. But for the highly inflectional languages, Marathi and Hungarian, the maximum improvements in MAP are obtained at α = 0.3. So, we used α = 0.2 for English and Bengali, and α = 0.3 for Marathi and Hungarian.

Table 2
Effect of different values of association cut-off (α) on MAP values for training queries.
Threshold English Marathi Hungarian Bengali
0.1 0.2756 0.2886 0.3420 0.3013
0.2 0.2841 0.2918 0.3486 0.3027
0.3 0.2740 0.2956 0.3530 0.2973
0.4 0.2701 0.2903 0.3471 0.2947
0.5 0.2655 0.2896 0.3394 0.2892

4.6. Evaluation results

In this subsection, we compare the retrieval results of our proposed method with the other baseline stemming methods. We present two tables for each of the languages under analysis. The first table compares the retrieval performance of the various stemmers in terms of MAP, Precision@10, R-Precision, Normalized Discounted Cumulative Gain@10 (NDCG@10), and the number of relevant documents retrieved. The second table compares the query-wise and statistical performance of the various stemmers as compared to no stemming. The bold values in both tables indicate the best performance in the particular category. The values in brackets in the first table denote the percentage change (increase/decrease) as compared to the no stemming baseline. The recall–precision curve for each language has also been shown in order to better understand the performance of the stemmers. The recall–precision curve shows the performance of the proposed technique against the three baseline methods which attain the closest MAP to the proposed technique.

4.6.1. English results

The impact of the various stemmers on the retrieval performance for the English test collection is presented in Table 3. It is clear from the table that the performance differences of the various methods as compared to no stemming are much smaller for the English language, which has a simple morphology. The proposed method achieved the best MAP scores across all the approaches studied and showed an improvement of nearly 10.5% against the unstemmed run. PORTER, LEXSTEM, LOVIN, and GRAS also performed significantly well and showed an improvement of 9%–10% as compared to no stemming. MORFESSOR, YASS, and HPS performed almost equally but are found to be nearly 3% inferior to the proposed method. The XU stemmer performed the worst and managed to show an improvement of only 3.3% in MAP as compared to the unstemmed run. Out of 14697 relevant documents, the proposed method retrieved a maximum number of 10568 documents, thereby showing an increase of 6.4% as compared to no stemming.
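The Robustness Index of Eq. (8) and the paired t statistic behind the reported p-values reduce to a few lines; the helper names below are ours, and only the Python standard library is used (a library routine such as scipy.stats.ttest_rel would also return the p-value directly).

```python
import math
import statistics

def robustness_index(ap_stemmed, ap_baseline):
    """Eq. (8): RI = (N+ - N-) / |Q| over per-topic average precision."""
    n_plus = sum(s > b for s, b in zip(ap_stemmed, ap_baseline))
    n_minus = sum(s < b for s, b in zip(ap_stemmed, ap_baseline))
    return (n_plus - n_minus) / len(ap_baseline)

def paired_t_statistic(ap_stemmed, ap_baseline):
    """t statistic of the paired t-test on per-topic AP differences.

    Assumes at least two topics and non-identical score lists (the
    standard error is zero otherwise).
    """
    diffs = [s - b for s, b in zip(ap_stemmed, ap_baseline)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se
```

For example, per-topic AP lists [0.5, 0.4, 0.2, 0.3] against a baseline [0.4, 0.4, 0.3, 0.1] give N+ = 2, N− = 1, and hence RI = 0.25.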


Table 3
Retrieval results for English.
Total relevant documents: 14697 Total queries: 51–200 (150 queries)
Method MAP R–P P@10 NDCG@10 Rel. Ret.
NO STEM 0.2789 0.3166 0.4873 0.3120 9932
LOVIN 0.3042 (9.07) 0.3360 (6.13) 0.4907 (0.70) 0.3499 (12.15) 10460 (5.32)
PORTER 0.3079 (10.40) 0.3376 (6.63) 0.4880 (0.14) 0.3670 (17.63) 10494 (5.66)
XU 0.2881 (3.30) 0.3232 (2.08) 0.4867 (−0.12) 0.3614 (15.83) 10098 (1.67)
MORFESSOR 0.3011 (7.96) 0.3393 (7.17) 0.4887 (0.29) 0.3614 (15.83) 10491 (5.63)
YASS 0.2995 (7.39) 0.3332 (5.24) 0.4827 (−0.94) 0.3600 (15.38) 10488 (5.60)
GRAS 0.3045 (9.18) 0.3394 (7.20) 0.4913 (0.82) 0.3820 (22.44) 10540 (6.12)
HPS 0.2990 (7.21) 0.3284 (3.73) 0.4927 (1.11) 0.3942 (26.35) 10534 (6.06)
LEXSTEM 0.3075 (10.25) 0.3391 (7.10) 0.4998 (2.56) 0.3998 (28.14) 10546 (6.18)
Proposed 0.3084 (10.58) 0.3428 (8.28) 0.5040 (3.43) 0.4021 (28.88) 10568 (6.40)

Table 4
Query-wise performance and statistical significance test for English.
LOVIN PORTER XU MORF YASS GRAS HPS LEXSTEM PROP
Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer
No. of queries 84 66 87 63 84 66 83 66 84 66 87 62 82 68 89 59 91 59
RI 0.12 0.16 0.12 0.11 0.12 0.17 0.09 0.19 0.21
p-value 0.00265 0.00104 0.3194 0.00026 0.0025 0.00026 0.00249 0.00024 0.00022
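The NDCG@10 figures in Table 3 (and the later result tables) follow the usual discounted-gain definition; a minimal sketch with the common log2 discount (our helper; the exact gain/discount variant used by the evaluation toolkit may differ):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of (binary or graded) relevance values."""
    def dcg(rels):
        # rank i (0-based) contributes rel / log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

A perfect ranking scores 1.0, a ranking with no relevant documents scores 0.0, and placing relevant documents lower in the list yields an intermediate value.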

Table 5
Retrieval results for Marathi.
Total relevant documents: 1822 Total queries: 125 (26–150)
Method MAP R–P P@10 NDCG@10 Rel. Ret.
NO STEM 0.2692 0.2720 0.2693 0.1923 1561
RULE 0.2872 (6.69) 0.2715 (−0.18) 0.3019 (12.11) 0.2547 (32.45) 1548 (−0.83)
XU 0.3499 (30.00) 0.3287 (20.85) 0.3192 (18.53) 0.2447 (27.25) 1678 (7.50)
MORFESSOR 0.3707 (37.70) 0.3591 (32.02) 0.3374 (25.29) 0.2608 (35.62) 1692 (8.39)
YASS 0.3225 (19.80) 0.3067 (12.76) 0.3084 (14.52) 0.2443 (27.04) 1633 (4.61)
GRAS 0.3767 (39.93) 0.3533 (29.89) 0.3505 (30.15) 0.2954 (53.61) 1702 (9.03)
HPS 0.3375 (25.37) 0.3350 (23.16) 0.3308 (22.84) 0.2754 (43.21) 1666 (6.72)
LEXSTEM 0.3767 (39.93) 0.3600 (32.35) 0.3505 (30.15) 0.2998 (55.90) 1708 (9.41)
Proposed 0.3804 (41.31) 0.3631 (33.49) 0.3523 (30.82) 0.3008 (56.42) 1723 (10.38)

Table 6
Query-wise performance and statistical significance test for Marathi.
RULE XU MORF YASS GRAS HPS LEXSTEM PROP
Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer
No. of queries 51 49 71 30 76 26 62 40 77 26 73 30 75 30 78 23
RI 0.01 0.28 0.34 0.15 0.34 0.29 0.34 0.37
p-value 0.134 2.234E−07 8.271E−08 0.0057 1.667E−08 2.475E−06 5.52E−08 1.193E−08

GRAS and HPS also performed consistently well and reported an improvement of nearly 6% in the number of relevant documents retrieved as compared to no stemming.

The p-values shown in Table 4 indicate that the performance of all the stemming methods except XU is statistically significant as compared to the unstemmed run. The proposed method improved the average precision values of a maximum number of 91 queries out of 150 and achieved the maximum score of Robustness Index. The interpolated precision achieved by the eight stemming approaches at various recall points is shown in Fig. 3. It is clear from the figure that all the stemming approaches maintain a marginal improvement as compared to no stemming. The proposed method performed better than the other baseline methods in terms of precision at different recall levels.

4.6.2. Marathi results

Table 5 shows that all the stemming methods performed better than the unstemmed run. Among all the methods, the proposed method performed the best. The proposed method showed a relative improvement of 41.3% in MAP as compared to the unstemmed run, which is clearly quite significant. LEXSTEM, GRAS, and MORFESSOR also performed significantly better than the no stemming baseline and showed improvements of 39.9% and 37.7% respectively in MAP. Unlike English, XU performed consistently better in the case of Marathi and showed an improvement of 30% in MAP as compared to word-based indexing. YASS performed relatively inferior in Marathi and managed to show an improvement of 19.8% in MAP as compared to no stemming. The Marathi rule-based stemmer performed the worst among all the methods and showed a small improvement of 6.6% in MAP as compared to no stemming. Out of 1822 relevant documents, the proposed method retrieved a maximum of 1723 documents, thereby showing an increase of more than 10% as compared to no stemming. GRAS and MORFESSOR also increased the number of relevant documents retrieved by nearly 9% as compared to the unstemmed run.

The query-by-query analysis presented in Table 6 indicates that all the stemming methods (except the rule-based method) improved a substantial number of queries and achieved a high robustness index. The proposed method improved a maximum of 78 queries out of 125 and achieved a robustness index of 0.37. We can infer from the p values (Table 6) that all the stemming approaches (except the Marathi rule-based stemmer) performed statistically better than no stemming. The precision–recall curve in Fig. 4 depicts that the proposed method maintains a better


precision at all recall levels than the other stemming strategies under analysis.

Fig. 3. Precision–recall curve for English.

Fig. 4. Precision–recall curve for Marathi.

4.6.3. Hungarian results

The retrieval performance of the various methods for the Hungarian test collection is presented in Table 7. The proposed method performed the best and showed a large relative improvement of more than 54% in MAP as compared to the unstemmed run. GRAS and LEXSTEM performed almost equally and showed an improvement of more than 50% in MAP. RULE, MORFESSOR, and YASS also performed significantly well and improved MAP by nearly 45% as compared to the no stemming baseline. Like English, the XU stemmer performed relatively inferior on the Hungarian test collection and managed to show an improvement of 21.2% as compared to unstemmed word retrieval. Out of 3158 relevant documents, the proposed method retrieved 2699 relevant documents and showed an improvement of 37.5% as compared to the no stemming baseline. The other stemming methods also showed an improvement of more than 30% in the number of relevant documents retrieved as compared to no stemming.

The query-by-query analysis presented in Table 8 indicates that both the proposed method and GRAS improved the average precision of a substantial number of queries and achieved a high robustness index of 0.42. The p values in the table depict that all the stemming methods performed statistically better than the no stemming baseline. Fig. 5 shows the precision–recall curve for the Hungarian language. It is clear from the figure that all the methods maintain a significant improvement over no stemming at different recall levels. The proposed method performed better than all the other methods in terms of precision at various recall levels.

Fig. 5. Precision–recall curve for Hungarian.

4.6.4. Bengali results

The performance of all the stemming approaches under various metrics is summarized in Table 9. It is clear from the table that the proposed method resulted in a maximum improvement of nearly 19% in MAP as compared to word-based retrieval. LEXSTEM also performed consistently well and showed an improvement of more than 16% in MAP as compared to the unstemmed run. MORFESSOR, GRAS, and HPS also performed significantly well but are found to be nearly 4% inferior to the proposed method. The Bengali rule-based stemmer and YASS performed almost equally and showed an improvement of nearly 13.5% in MAP as compared to the unstemmed run. The XU stemmer managed to show an improvement of 11.1% in MAP as compared to no stemming. We can infer from the table that the various stemming strategies improved the recall significantly. The proposed method managed to retrieve 11% more relevant documents as compared to no stemming.

The performance differences of the various stemmers are statistically significant as compared to the no stemming baseline for the Bengali test collection, as is clear from the p values shown in Table 10. The query-by-query analysis in Table 10 reveals that the proposed method outperformed the other methods on a substantial number of queries. The precision achieved by the various methods at each recall point is depicted in Fig. 6. All the methods maintain a consistent performance improvement over word-based retrieval. The proposed method outperformed the other methods in terms of precision at various recall levels.

4.7. Analysis of retrieval results

The results presented in Section 4.6 provide useful insights into the performance of the stemmers in different language families. In this subsection, we discuss our critical findings on the basis of the analysis of the experimental results. It is noticeable from the results that the performance of the corpus-based stemmers is better than the linguistic stemmers in almost all the languages in


Table 7
Retrieval results for Hungarian.
Total relevant documents: 3158 Total queries: 150
Method MAP R–P P@10 NDCG@10 Rel. Ret.
NO STEM 0.2159 0.2312 0.2728 0.2041 1963
RULE 0.3147 (45.76) 0.3326 (43.85) 0.3654 (33.94) 0.2958 (44.93) 2516 (28.17)
XU 0.2617 (21.21) 0.2771 (19.85) 0.2940 (7.77) 0.2158 (5.73) 2577 (31.27)
MORFESSOR 0.3115 (44.27) 0.3310 (43.17) 0.3673 (34.64) 0.3021 (48.02) 2580 (31.43)
YASS 0.3149 (45.85) 0.3274 (41.61) 0.3716 (36.22) 0.3122 (52.96) 2587 (31.79)
GRAS 0.3245 (50.30) 0.3397 (46.93) 0.3692 (35.34) 0.3002 (47.08) 2689 (36.98)
HPS 0.3092 (43.21) 0.3334 (44.20) 0.3673 (34.64) 0.3158 (54.73) 2637 (34.34)
LEXSTEM 0.3229 (50.21) 0.3398 (46.97) 0.3720 (36.36) 0.3254 (59.43) 2689 (36.98)
Proposed 0.3341 (54.75) 0.3412 (47.58) 0.3789 (38.89) 0.3314 (62.37) 2699 (37.49)

Table 8
Query-wise performance and statistical significance test for Hungarian.
RULE XU MORF YASS GRAS HPS LEXSTEM PROP
Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer
No. of queries 98 50 80 64 90 45 100 45 105 42 94 55 107 42 111 37
RI 0.32 0.11 0.30 0.37 0.42 0.26 0.44 0.49
p-value 4.34E−08 8.82E−05 5.17E−08 3.37E−08 1.02E−09 6.67E−07 5.24E−09 8.24E−10

Table 9
Retrieval results for Bengali.
Total relevant documents: 5478 Total queries: 125 (100–225)
Method MAP R–P P@10 NDCG@10 Rel. Ret.
NO STEM 0.2847 0.2981 0.3968 0.2804 4019
RULE 0.3229 (13.42) 0.3302 (10.77) 0.4310 (8.62) 0.3233 (15.30) 4426 (10.13)
MORFESSOR 0.3293 (15.66) 0.3363 (12.81) 0.4429 (11.62) 0.3389 (20.86) 4391 (9.26)
XU 0.3164 (11.13) 0.3232 (8.42) 0.4238 (6.81) 0.3254 (16.05) 4292 (6.79)
YASS 0.3237 (13.70) 0.3278 (9.96) 0.4381 (10.41) 0.3200 (14.12) 4404 (9.58)
GRAS 0.3294 (15.70) 0.3339 (12.01) 0.4460 (12.40) 0.3412 (21.68) 4449 (10.70)
HPS 0.3297 (15.80) 0.3407 (14.29) 0.4429 (11.62) 0.3389 (20.86) 4461 (11.00)
LEXSTEM 0.3312 (16.33) 0.3408 (14.32) 0.4500 (13.40) 0.3459 (23.04) 4461 (11.00)
Proposed 0.3383 (18.83) 0.3420 (14.73) 0.4532 (14.21) 0.3468 (23.68) 4458 (10.92)

Table 10
Query-wise performance and statistical significance test for Bengali.
RULE XU MORF YASS GRAS HPS LEXSTEM PROP
Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer Better Poorer
No. of queries 80 45 77 46 84 41 81 43 86 37 86 38 87 35 88 35
RI 0.28 0.25 0.34 0.30 0.39 0.38 0.41 0.42
p-value 0.00024 0.00059 0.00015 0.00013 5.07E−05 6.89E−05 1.24E−05 1.428E−06

the information retrieval task. In some languages like English, the performance of the linguistic stemmers is at par with the corpus-based stemmers. Another important finding from the retrieval results is that the performance of stemmers increases with the increase in the morphological richness of the language. In the case of the highly agglutinative and inflectional languages, Marathi and Hungarian, stemming resulted in large improvements in MAP as compared to no stemming.

Among the Asian linguistic stemmers, the Bengali linguistic stemmer also performed fairly well in the retrieval task. The Marathi linguistic stemmer performed the worst among all the baseline stemmers. This is due to the reason that Marathi is a highly inflectional language and the Marathi rule-based stemmer only removes inflectional variations in nouns and adjectives. It misses variations in verbs and many frequently occurring derivational suffixes. The Hungarian and English (Lovins and Porter) linguistic-based stemmers performed consistently well both in terms of MAP and the number of relevant documents retrieved.

Among the corpus-based stemmers, XU performed relatively inferior in almost all the languages under study. The major limitation of the XU stemmer is that the connected component clustering method used in the XU algorithm forms chains that lead to performance degradation. MORFESSOR and HPS performed better for the morphologically and inflectionally rich languages, Marathi and Hungarian, as compared to English and Bengali.

Fig. 6. Precision–recall curve for Bengali.


Fig. 7. Percentage of topics which are better, equivalent and worse as compared to no stemming.

YASS, on the other hand, performed moderately well in almost all the languages except Marathi. This is probably because the clustering approach used in the YASS algorithm is not suitable for long suffixes, which are quite common in Marathi. Moreover, the complete linkage clustering method used in this algorithm produces distinct clusters when a different order of the elements is considered. GRAS performed significantly well, but it over-stems in some cases as it is highly dependent on the recall rate. Moreover, GRAS is suitable only for suffixing languages, as it clusters morphological variants from the corpus using suffix knowledge. The proposed method, on the other hand, considers lexical similarity, suffix knowledge, common prefix length, and co-occurrence similarity to cluster morphologically related variants from the unannotated corpus. Hence, the proposed method is suitable for a wide range of languages. The percentage of topics which are better, poorer, or equivalent as compared to no stemming using the different stemmers across all the languages studied is depicted in Fig. 7. It is clear from the figure that the proposed stemmer improved the maximum percentage of topics in all the languages, followed by LEXSTEM, MORFESSOR, and GRAS. All the linguistic stemmers under analysis (except Marathi) also improved a substantial percentage of queries.

5. Seen vs. unseen training

In the Information Retrieval experiments presented in Section 4, all the stemmers are trained using the same document collection which is used during indexing and searching. In real situations, document collections are continuously growing as new data emerge quite frequently. Thus, the stemming algorithms need to be retrained on the new document collections. The retraining of stemming techniques involves a lot of computation and takes a lot of time. So, stemming methods which can manage unseen document collections efficiently and do not require retraining are preferred. Thus, in this section, we analyze the performance of the stemmers on both seen and unseen data.

In the unseen run, the collections used for training of the stemmers and for indexing are different. The stemmers are trained on the document collection containing Wall Street Journal (WSJ), Associated Press (AP), and Information from Computer Select disks (Ziff) articles provided by TREC disks 1 and 2. The document collection used for indexing and searching contains Financial Times (FT), Foreign Broadcast Information Services (FBIS), and Los Angeles Times (LATimes) news articles provided by TREC disks 3 and 4. In the seen run, the document collection used for training of stemmers and indexing is the same. In this run, the stemmers

have been trained and indexed on the document collection containing FT, FBIS, and LATimes articles. The complete statistics of the data used in both runs are presented in Table 11. To avoid any bias, the size of the test collection used for training and testing in both cases is approximately the same.

The retrieval results for both seen and unseen runs are presented in Table 12. It is clear from the results that the performance difference between the two runs is very small. In fact, in the case of the LEXSTEM, YASS, and PORTER stemmers, the MAP and the number of relevant documents retrieved are even slightly higher than in the seen training run. The topic-wise average precision values of each stemming approach have been statistically compared for seen and unseen training using a paired t-test. The p-values of the statistical tests are reported in Table 13. It is clear from the p-values that the difference in performance between seen and unseen training is statistically insignificant for all the stemmers under analysis.

6. Text classification experiments

This section presents experiments related to the application of stemmers to the text classification task. The goal is to investigate the performance of the stemmers in yet another scenario, to gain a deeper insight into their quality.

Text classification is the process of assigning text documents to one or more pre-defined classes. Text classification techniques have been applied to a wide range of applications such as sentiment analysis, movie genre classification, spam filtering, language identification, and article triage. Researchers hold different opinions regarding the effectiveness of stemming for the task of text classification. Initial studies [48,49] which employed stemming at the text preprocessing stage did not find stemming to be advantageous, but more recent studies [4,50,51] found stemming methods quite effective. So the recent tendency is to employ stemming at the preprocessing stage, as it reduces the dimensionality of the feature set by mapping variant words to a single stem.

We now describe the procedure for evaluating the performance of the various stemming methods in the text classification task. We conducted the classification experiments using the RTextTools package [52] on the standard WebKB and Movies datasets. The WebKB dataset has been collected by the World Wide Knowledge Base project of the Carnegie Mellon University text learning group. The dataset comprises 8282 web pages from the computer science departments of different universities, manually classified into seven categories, namely student, faculty, staff, department, course, project, and miscellaneous. The movie dataset (available at http://ai.stanford.edu/~amaas/data/sentiment/) consists of 2500 movie reviews in both text and bag-of-words format. The classification experiments on the movie dataset actually analyze the performance of stemming in sentiment analysis. The document collections have been randomly divided into training and testing data, and the same split has been used for each stemming method in the analysis. Sixty percent of the documents in the collection are used in the training phase and the remaining forty percent in the testing phase.

We employed a Support Vector Machine (SVM) classifier [53] and a Maximum Entropy-based (MaxEnt) classifier [54] to compare the improvement in classification performance using the different stemmers. The improvement in the Precision and Recall values of both classifiers using the different stemmers is presented in Table 14. Precision refers to the proportion of correctly classified documents, and recall refers to the proportion of documents in a class that are correctly assigned to that class.

It is clear from Table 14 that all the stemming techniques improved the classification performance of the classifiers, both in terms of precision and recall. The improvement for the SVM classifier is larger than for the MaxEnt classifier for almost all the stemmers. The proposed stemming method reported a maximum improvement of nearly 14% in both precision and recall in the classification task using both classifiers in the case of the WebKB dataset. The proposed method excelled in the sentiment analysis task and improved precision by 26.60% and recall by 23.71%. GRAS, YASS, and the PORTER stemmer performed almost equally. The XU and LOVIN stemmers performed relatively inferior to the other methods and showed a small improvement of 5% in precision for both classifiers on the WebKB dataset.

7. Inflection removal experiments

The experiments presented in the previous sections investigate the performance of the different stemmers in two specific applications, namely Information Retrieval and Text Classification. In order to obtain a more detailed assessment of the stemmers, we decided to compare their performance directly, without any target application. In this section, we compared the output of the stemmers (words sharing the same stem) with gold standards, i.e., words manually annotated with lemmas. This test actually measures the ability of stemmers to replace lemmatizers in a number of NLP applications (POS tagging, named entity recognition, etc.) where lemmatizers perform well.

Due to the lack of manually annotated data for the Indian languages under study, we conducted these experiments in English and Hungarian only. The manually annotated English test corpus has been obtained from Open American National Corpus (OANC) texts (available at http://www.anc.org/data/oanc/) and contains 107,092 distinct words. The Hungarian test corpus contains 154,240 distinct words from the manually annotated Szeged Corpus (available at http://www.inf.u-szeged.hu/). The inflection removal results have been compared in terms of Precision, Recall, and F-score (as given in Eq. (9)). The F-score metric considers both precision and recall and is hence used as the primary evaluation metric to compare the performance of the stemmers in these experiments.

Precision = TP / (TP + FP),   Recall = TP / (TP + FN),
F-score = (2 × Precision × Recall) / (Precision + Recall)                    (9)

where TP denotes the number of times a word in the morphological class of the stemmer matches a word in the lemma group (as given by the manually annotated data), FP denotes the number of times the morphological class of the stemmer contains an incorrect word, and FN denotes the number of times the morphological class of the stemmer missed a correct word.

Table 15 presents the inflection removal results of the various stemmers for both English and Hungarian. In the case of English, the proposed method gave the best results and achieved an F-score of 72.2%. The Porter stemmer also performed significantly well and achieved an F-score of 70%. The YASS, GRAS, HPS, and LOVIN stemmers performed almost equally.

In the case of Hungarian as well, the proposed method, LEXSTEM, and the rule-based stemmer performed superior to the other methods and achieved F-scores of more than 65%. GRAS and MORFESSOR performed almost equally and achieved F-scores of nearly 60%. The results of HPS are found to be quite similar to those of the XU stemmer.
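As a concrete sketch of Eq. (9), the TP/FP/FN counts can be accumulated by intersecting each stemmer class with its corresponding gold lemma group. The alignment of classes to groups and the toy word sets below are assumptions for illustration, not the paper's actual evaluation data.

```python
def inflection_removal_scores(stem_classes, lemma_groups):
    """Precision, recall, and F-score as in Eq. (9), with TP/FP/FN
    accumulated over aligned (stemmer class, gold lemma group) pairs."""
    tp = fp = fn = 0
    for cls, gold in zip(stem_classes, lemma_groups):
        tp += len(cls & gold)   # words the stemmer grouped correctly
        fp += len(cls - gold)   # incorrect words in the stemmer class
        fn += len(gold - cls)   # correct words the stemmer missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical class: the stemmer wrongly pulls in "runner"
# and misses "running", so P = R = F = 2/3.
classes = [{"run", "runs", "runner"}]
gold = [{"run", "runs", "running"}]
print(inflection_removal_scores(classes, gold))
```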


Table 11
Statistics of training corpus used for seen and unseen training.
Type of training Training corpus Source Size (in GB) Number of documents Number of words
Unseen WSJ, ZIFF, AP TREC disks 1 and 2 1.5 469,949 518,471
Seen FBIS, FT, LATimes TREC disks 4 and 5 1.58 472,525 522,381

Table 12
Retrieval performance for various stemmers for seen and unseen training.
            Unseen (training and testing data differ)      Seen (training and testing data the same)
            MAP     R–P     P@10    Rel. Ret               MAP     R–P     P@10    Rel. Ret
LOVIN 0.2448 0.2871 0.4567 7355 0.2449 0.2868 0.4553 7360
PORTER 0.2617 0.3034 0.4813 7765 0.2596 0.3013 0.4827 7760
XU 0.2489 0.2904 0.4624 7480 0.2512 0.2912 0.4688 7495
MORF 0.2682 0.3034 0.4654 7552 0.2708 0.3035 0.4654 7608
YASS 0.2535 0.2995 0.4639 7672 0.2499 0.2917 0.4634 7652
GRAS 0.2690 0.3031 0.4713 7743 0.2698 0.3090 0.4792 7873
HPS 0.2580 0.3037 0.4693 7550 0.2608 0.3036 0.4693 7612
LEXSTEM 0.2716 0.3115 0.4856 7895 0.2714 0.3108 0.4840 7901
LCS 0.2712 0.3075 0.4850 7889 0.2730 0.3120 0.4892 7922

Table 13
Statistical significance test results for seen and unseen training.
Stemmer LOVIN PORTER XU MORF YASS GRAS SNS HPS LEXSTEM LCS
p-values 0.7213 0.6022 0.8457 0.6258 0.4470 0.9217 0.5658 0.4335 0.5041 0.6246
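The significance comparison in Table 13 is a paired t-test over per-topic average precision. A stdlib-only sketch of the test statistic follows; the per-topic AP values are made up for illustration, and in practice the p-value is then obtained from a Student's t distribution with n − 1 degrees of freedom (e.g. via scipy.stats.ttest_rel).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(ap_seen, ap_unseen):
    """t statistic of a paired t-test over per-topic average precision."""
    diffs = [a - b for a, b in zip(ap_seen, ap_unseen)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-topic AP values for one stemmer under both runs.
seen = [0.30, 0.25, 0.40, 0.35]
unseen = [0.28, 0.27, 0.39, 0.36]
print(paired_t_statistic(seen, unseen))  # near 0: no systematic difference
```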

Table 14
Classification results for the WebKB and Movies datasets (values in parentheses denote the percentage improvement over no stemming).
Technique        WebKB dataset (SVM classifier)    WebKB dataset (MaxEnt classifier)    Movies dataset (SVM classifier)
                 Precision   Recall                Precision   Recall                   Precision   Recall
NO STEMMING 0.3876 0.3789 0.3242 0.2517 0.2642 0.2458
LOVIN 0.4136 0.4050 0.3421 0.2689 0.2589 0.2554
(6.72) (6.91) (5.52) (6.83) (−2.06) (3.90)
PORTER 0.4381 0.4249 0.3614 0.2762 0.2942 0.2800
(13.05) (12.15) (11.47) (9.73) (11.35) (13.91)
XU 0.4064 0.4075 0.3408 0.2739 0.2908 0.2759
(4.86) (7.56) (5.12) (8.82) (10.08) (12.24)
MORFESSOR 0.4267 0.4222 0.3528 0.2794 0.3036 0.2987
(10.10) (11.43) (8.82) (11.01) (14.91) (21.52)
YASS 0.4373 0.4234 0.3614 0.2824 0.3052 0.2931
(12.84) (11.75) (11.47) (12.19) (15.51) (19.24)
GRAS 0.4345 0.4262 0.3629 0.2802 0.3060 0.2931
(12.11) (12.49) (11.94) (11.32) (15.82) (19.24)
HPS 0.4259 0.4186 0.3567 0.2777 0.3117 0.2899
(9.90) (10.50) (10.02) (10.33) (17.97) (17.94)
LEXSTEM 0.4391 0.4294 0.3629 0.2824 0.3212 0.2984
(13.30) (13.32) (11.94) (12.19) (21.57) (21.40)
Proposed 0.4429 0.4323 0.3675 0.2867 0.3345 0.3041
(14.28) (14.11) (13.35) (13.91) (26.60) (23.71)
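The parenthesized values in Table 14 are relative improvements over the no-stemming baseline. Recomputing them from the rounded table entries reproduces the reported figures to within rounding; the helper name below is ours.

```python
def pct_improvement(stemmed, baseline):
    """Relative improvement (%) over the no-stemming baseline,
    as reported in parentheses in Table 14."""
    return round(100 * (stemmed - baseline) / baseline, 2)

# PORTER precision on WebKB with the SVM classifier:
print(pct_improvement(0.4381, 0.3876))  # 13.03; Table 14 reports 13.05,
# presumably computed from unrounded scores.
```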

8. Efficiency

Our last set of experiments compares the performance of the stemmers directly in terms of stemmer strength and processing time. The stemmer strength measures the extent to which a stemmer changes the base word forms. We used a well-known metric, the mean number of words per conflation class [55], to measure the strength of a stemmer. The mean number of words per conflation class is defined as the ratio of the number of unique words before stemming to the number of unique stemmed words after stemming. A high value of this metric denotes a more aggressive stemmer.

Table 16 shows the stemmer strength values of the various stemming methods for each language used in our study. The best values of stemmer strength for each language are marked in bold in the table, indicating that the proposed method happens to be the most aggressive stemmer on almost all languages. LEXSTEM, GRAS, and YASS also appear to be quite aggressive on all the languages. XU, HPS, and the various rule-based stemmers are found to be comparatively light in terms of stemmer strength and produce smaller conflation classes.

We also calculated the processing time of all the stemming techniques. The retrieval runs on all the languages have been performed on a computer with an i5 processor and 8 GB RAM. The processing time (in minutes) of each stemmer for all the languages under analysis is presented in Table 17. It is clear from the table that the proposed method is faster than XU, MORFESSOR, and HPS (i.e., all the stemming techniques that use corpus-based knowledge). The proposed technique spends most of its time computing co-occurrence frequencies between words. LEXSTEM and GRAS take less processing time as they use only lexicon statistics to group morphologically related words. YASS, a lexicon-analysis-based approach, takes a long time because it does not divide the corpus into initial equivalence classes. Among all stemmers, HPS takes the maximum processing time.
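The strength metric used in Section 8, the mean number of words per conflation class [55], can be sketched as follows. The toy vocabulary and the crude fixed-prefix stem function are illustrative only; they are not any of the stemmers evaluated above.

```python
def stemmer_strength(words, stem):
    """Mean number of words per conflation class [55]:
    unique words before stemming / unique stems after stemming."""
    vocab = set(words)
    stems = {stem(w) for w in vocab}
    return len(vocab) / len(stems)

# Toy vocabulary with a crude fixed-prefix "stemmer" (illustration only).
words = ["connect", "connected", "connecting", "relate", "related"]
print(stemmer_strength(words, lambda w: w[:6]))  # 5 words / 2 stems = 2.5
```

Higher values, as reported for the proposed method in Table 16, indicate a more aggressive stemmer.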


Table 15
Inflection removal results.

English results
Method      Precision (%)  Recall (%)  F-Score (%)
LOVIN       60.2           67.4        63.5
PORTER      64.1           77.3        70.1
XU          54.8           63.2        58.7
MORFESSOR   52.5           61.5        56.6
YASS        57.9           63.7        60.7
GRAS        56.1           70.2        62.4
HPS         65.2           62.2        63.7
LEXSTEM     68.3           71.1        69.7
Proposed    69.5           75.1        72.2

Hungarian results
Method      Precision (%)  Recall (%)  F-Score (%)
RULE        68.4           62.2        65.2
XU          60.7           50.7        55.2
MORFESSOR   67.3           52.7        59.1
YASS        65.4           41.8        51.0
GRAS        68.2           55.2        61.1
HPS         62.3           51.3        56.3
LEXSTEM     69.8           61.7        65.5
Proposed    68.9           63.7        66.2

Table 16
Stemmer strength of various stemmers.
Language    RULE  XU    MORF  YASS  GRAS  HPS   LEXSTEM  Proposed
English     1.23  1.07  2.27  2.75  1.15  1.45  2.79     2.91
Marathi     1.16  2.58  2.09  3.21  3.43  1.77  3.76     3.78
Hungarian   2.01  1.97  3.67  3.65  3.97  2.54  3.75     3.97
Bengali     2.25  1.78  3.25  3.29  2.70  1.78  3.35     3.43

Table 17
Computation time (in minutes) taken by different methods.
Language    #words   XU  MORF  YASS  GRAS  HPS  LEXSTEM  Proposed
English     179,387  17  24    24    0.5   65   0.5      17
Marathi     854,324  28  42    71    1.2   172  1.4      25
Hungarian   528,315  25  31    32    2.0   119  2.3      22
Bengali     533,605  24  30    28    0.8   90   0.9      21

9. Conclusion

In this article, we successfully achieved the objective of developing a multi-purpose stemming algorithm that can be used not only for the information retrieval task but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal. Our approach discovers morphologically related words from the ambient corpus using lexicon and corpus-based features such as lexical similarity, co-occurrence similarity, suffix pair knowledge, and common prefix length. The proposed approach can be used for a wide range of inflectional languages, and not just suffixing languages, as it can discover variants formed through affixation, conversion, compounding, etc.

The performance of the proposed approach has been compared with word-based retrieval, five language-independent stemmers, and rule-based stemmers in three different test scenarios. In the first set of experiments, the proposed approach has been tested in the Information Retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. A significant improvement has been achieved in all the languages over the other baseline methods. In the case of Hungarian and Marathi, a high performance improvement of nearly 50% in mean average precision has been reported. The second set of experiments compared the performance of the proposed stemmer with the other baseline methods in the text classification task. The proposed method outperformed the other baseline stemmers by reporting the maximum improvement in the classification performance of the classifiers.

The last set of experiments compared the performance of the stemmers directly in the Inflection Removal task, without any target application. This test actually measures the ability of the stemmers to replace lemmatizers. Our proposed approach excelled in this test scenario as well, showing the maximum improvement in F-score for both English and Hungarian.

References

[1] V. Gupta, N. Kaur, A novel hybrid text summarization system for Punjabi text, Cogn. Comput. 8 (2016) 261–277.
[2] A. Louis, A. Nenkova, Automatically evaluating content selection in summarization without human models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009, pp. 306–314.
[3] M. Rosell, Improving clustering of Swedish newspaper articles using stemming and compound splitting, in: Proceedings of the Nordic Conference on Computational Linguistics, Reykjavik, Iceland, 2003, pp. 1–7.
[4] N.L. Bhamidipati, S.K. Pal, Stemming via distribution-based word segregation for classification and retrieval, IEEE Trans. Syst. Man Cybern. B 37 (2007) 350–360.
[5] K. Toutanova, H. Suzuki, A. Ruopp, Applying morphology generation models to machine translation, Association for Computational Linguistics, 2008, pp. 514–522.
[6] M. Hu, B. Liu, Mining opinion features in customer reviews, 2004.
[7] S.M. Arif, M. Mustapha, The effect of noise elimination and stemming in sentiment analysis for Malay documents, in: Proceedings of the International Conference on Computing, Mathematics and Statistics, ICMS 2015, 2017, pp. 93–102.
[8] K. Dashtipour, S. Poria, A. Hussain, et al., Multilingual sentiment analysis: State of the art and independent comparison of techniques, Cogn. Comput. (2016) 1–15.
[9] R. Krovetz, Viewing morphology as an inference process, in: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp. 191–202.
[10] M. Shrivastava, P. Bhattacharyya, Hindi POS tagger using naive stemming: Harnessing morphological information without extensive linguistic knowledge, in: Proceedings of the 6th International Conference on Natural Language Processing, ICON08, 2008.
[11] J.B. Lovins, Development of a stemming algorithm, Mech. Transl. Comput. Linguist. 11 (1968) 22–31.
[12] J.L. Dawson, Suffix removal for word conflation, Assoc. Lit. Linguist. Comput. Bull. 2 (1974) 33–46.
[13] C.D. Paice, Another stemmer, ACM SIGIR Forum 24 (1990) 56–61.
[14] D.A. Hull, Stemming algorithms: A case study for detailed evaluation, J. Am. Soc. Inf. Sci. 47 (1996) 70–84.
[15] M. Lennon, D.S. Pierce, B.D. Tarry, P. Willett, An evaluation of some conflation algorithms for information retrieval, in: Document Retrieval Systems, 1988, pp. 99–105.
[16] D. Harman, How effective is suffixing? J. Am. Soc. Inf. Sci. 42 (1991) 7–15, http://dx.doi.org/10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P.
[17] J. Xu, W.B. Croft, Corpus-based stemming using cooccurrence of word variants, ACM Trans. Inf. Syst. 16 (1998) 61–81.
[18] J. Savoy, Light stemming approaches for the French, Portuguese, German and Hungarian languages, in: Proceedings of the 2006 ACM Symposium on Applied Computing, 2006, pp. 1031–1035.
[19] P. Majumder, M. Mitra, D. Pal, Bulgarian, Hungarian and Czech stemming using YASS, in: Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, 2008.
[20] P. Majumder, M. Mitra, S.K. Parui, et al., YASS: Yet another suffix stripper, ACM Trans. Inf. Syst. 25 (18) (2007).
[21] J. Paik, M. Mitra, S. Parui, K. Jarvelin, GRAS: An effective and efficient stemming algorithm for information retrieval, ACM Trans. Inf. Syst. 29 (19) (2011).
[22] W. Kraaij, R. Pohlman, Porter's stemming algorithm for Dutch, New Rev. Doc. Text Manag. (1994) 167–180.
[23] J.H. Paik, D. Pal, S.K. Parui, A novel corpus-based stemming algorithm using co-occurrence statistics, in: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11, 2011, pp. 863–872.
[24] L. Dolamic, J. Savoy, Indexing and stemming approaches for the Czech language, Inf. Process. Manage. 45 (2009) 714–720.
[25] L. Dolamic, J. Savoy, Comparative study of indexing and search strategies for the Hindi, Marathi and Bengali languages, 2010.
[26] J.H. Paik, S.K. Parui, A fast corpus-based stemmer, ACM Trans. Asian Lang. Inf. Process. 10 (2011) 1–16.


[27] J.H. Paik, S.K. Parui, D. Pal, S.E. Robertson, Effective and robust query-based stemming, ACM Trans. Inf. Syst. 31 (2013) 1–29, http://dx.doi.org/10.1145/2536736.2536738.
[28] T. Brychcín, M. Konopík, HPS: High precision stemmer, Inf. Process. Manage. 51 (2015) 68–91.
[29] D. Oard, G. Levow, C. Cabezas, CLEF experiments at Maryland: Statistical stemming and backoff translation, in: Proceedings of the Workshop of the Cross-Language Evaluation Forum on Cross Language Information Retrieval and Evaluation, Springer-Verlag, Berlin, U.K., 2001, pp. 176–187.
[30] J. Goldsmith, Unsupervised learning of the morphology of a natural language, 2001.
[31] J. Goldsmith, An algorithm for the unsupervised learning of morphology, Nat. Lang. Eng. 12 (2006) 353–371.
[32] M. Creutz, K. Lagus, Unsupervised discovery of morphemes, in: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, vol. 6, 2002, pp. 21–30.
[33] M. Creutz, K. Lagus, Inducing the morphological lexicon of a natural language from unannotated text, in: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning, AKRR'05, 2005, pp. 51–59.
[34] M. Creutz, K. Lagus, Induction of a simple morphology for highly-inflecting languages, in: Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology, 2004, pp. 43–51.
[35] M. Melucci, N. Orio, A novel method for stemmer generation based on hidden Markov models, in: Proceedings of the Twelfth International Conference on Information and Knowledge Management, 2003, pp. 131–138.
[36] M. Bacchin, N. Ferro, M. Melucci, A probabilistic model for stemmer generation, Inf. Process. Manage. 41 (2005) 121–137, http://dx.doi.org/10.1016/j.ipm.2004.04.006.
[37] M. Baroni, J. Matiasek, H. Trost, Unsupervised discovery of morphologically related words based on orthographic and semantic similarity, in: Workshop on Morphological and Phonological Learning MPL'02, 2002, pp. 48–57.
[38] A. Fernández, J. Diaz, Y. Gutiérrez, R. Munoz, An unsupervised method to improve Spanish stemmer, in: Proceedings of the 16th International Conference on Natural Language Processing and Information Systems NLDB'11, Springer-Verlag, Berlin, 2011, pp. 221–224.
[39] J. Singh, V. Gupta, An efficient corpus-based stemmer, Cogn. Comput. (2017) 1–18, http://dx.doi.org/10.1007/s12559-017-9479-z.
[40] M. Kasthuri, R. Kumar, S. Khaddaj, PLIS: Proposed language independent stemmer for information retrieval systems using dynamic programming, in: World Congress on Computing and Communication Technologies, Tiruchirappalli, India, 2017, pp. 132–135.
[41] C. Chavula, H. Suleman, Morphological cluster induction of Bantu words using a weighted similarity measure, in: Proceedings of the South African Institute of Computer Scientists and Information Technologists, Thaba 'Nchu, South Africa, 2017, pp. 48–52.
[42] O.U. Iheanetu, O. Oha, Some salient issues in the unsupervised learning of Igbo morphology, in: Proceedings of the World Congress on Engineering and Computer Science, San Francisco, USA, vol. 1, 2017.
[43] F. Peng, N. Ahmed, X. Li, Y. Lu, Context sensitive stemming for web search, in: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, 2007, p. 639.
[44] T. Sakakini, S. Bhat, P. Viswanath, Fixing the infix: Unsupervised discovery of root-and-pattern morphology, arXiv preprint arXiv:1702.02211, 2017.
[45] J. Singh, V. Gupta, A systematic review of text stemming techniques, Artif. Intell. Rev. 48 (2017) 157–217.
[46] G. Amati, C.J. Van Rijsbergen, Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Trans. Inf. Syst. 20 (2002) 357–389.
[47] M.F. Porter, An algorithm for suffix stripping, Program: Electron. Libr. Inf. Syst. 14 (1980) 130–137.
[48] E. Riloff, Little words can make a big difference for text classification, in: Proceedings of the 18th ACM SIGIR Conference, Seattle, WA, 1995, pp. 130–136.
[49] D. Baker, A.K. McCallum, Distributional clustering of words for text classification, in: Proceedings of the 21st ACM SIGIR Conference, Melbourne, Australia, 1998, pp. 96–103.
[50] T. Gaustad, G. Bouma, R. Groningen, Accurate stemming of Dutch for text classification, Lang. Comput. 45 (2002) 104–117.
[51] M. Biba, E. Gjatu, Boosting text classification through stemming of composite words, Recent Adv. Intell. Inform. 235 (2014) 185–194.
[52] T.P. Jurka, L. Collingwood, A.E. Boydstun, et al., RTextTools: A supervised learning package for text classification, R J. 5 (2013) 6–12.
[53] D. Meyer, E. Dimitriadou, K. Hornik, et al., Misc functions of the Department of Statistics (e1071), TU Wien, R package 1 (2012) 5–24.
[54] K. Nigam, J. Lafferty, A. McCallum, Using maximum entropy for text classification, in: IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999, pp. 61–67.
[55] W.B. Frakes, C.J. Fox, Strength and similarity of affix removal stemming algorithms, ACM SIGIR Forum 37 (2003) 26–30.
