Machine Learning in Automated Text Categorization
FABRIZIO SEBASTIANI
Consiglio Nazionale delle Ricerche, Italy
Introduction: In the last 10 years content-based document management tasks (collectively known as
information retrieval—IR) have gained a prominent status in the information systems field, due to
the increased availability of documents in digital form and the ensuing need to access them in
flexible ways. Text categorization (TC—a.k.a. text classification, or topic spotting), the activity of
labeling natural language texts with thematic categories from a predefined set, is one such task. TC
dates back to the early ’60s, but only in the early ’90s did it become a major subfield of the
information systems discipline, thanks to increased applicative interest and to the availability of
more powerful hardware. TC is now being applied in many contexts, ranging from document
indexing based on a controlled vocabulary, to document filtering, automated metadata generation,
word sense disambiguation, population of hierarchical catalogues of Web resources, and in general
any application requiring document organization or selective and adaptive document dispatching.
Until the late ’80s the most popular approach to TC, at least in the “operational” (i.e., real-world
applications) community, was a knowledge engineering (KE) one, consisting in manually defining a
set of rules encoding expert knowledge on how to classify documents under the given categories. In
the ’90s this approach increasingly lost popularity (especially in the research community) in favor
of the machine learning (ML) paradigm, according to which a general inductive process automatically
builds an automatic text classifier by learning, from a set of preclassified documents, the
characteristics of the categories of interest. The advantages of this approach are an accuracy
comparable to that achieved by human experts, and a considerable savings in terms of expert labor
power, since no intervention from either knowledge engineers or domain experts is needed for the
construction of the classifier or for its porting to a different set of categories. It is the ML approach to
TC that this paper concentrates on. Current-day TC is thus a discipline at the crossroads of ML and IR,
and as such it shares a number of characteristics with other tasks such as information/knowledge
extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still considerable debate
on where the exact border between these disciplines lies, and the terminology is still evolving. “Text
mining” is increasingly being used to denote all the tasks that, by analyzing large quantities of text
and detecting usage patterns, try to extract probably useful (although only probably correct)
information. According to this view, TC is an instance of text mining. TC enjoys quite a rich literature
now, but this is still fairly scattered. Although two international journals have devoted special issues
to this topic [Joachims and Sebastiani 2002; Lewis and Hayes 1994], there are no systematic
treatments of the subject: there are neither textbooks nor journals entirely devoted to TC yet, and
Manning and Schütze [1999, Chapter 16] is the only chapter-length treatment of the subject. As a
note, we should warn the reader that the term “automatic text classification” has sometimes been
used in the literature to mean things quite different from the ones discussed here. Aside from (i) the
automatic assignment of documents to a predefined set of categories, which is the main topic of this
paper, the term has also been used to mean (ii) the automatic identification of such a set of
categories (e.g., Borko and Bernick [1963]), or (iii) the automatic identification of such a set of
categories and the grouping of documents under them (e.g., Merkl [1998]), a task usually called text
clustering, or (iv) any activity of placing text items into groups, a task that has thus both TC and text
clustering as particular instances [Manning and Schütze 1999]. This paper is organized as follows. In
Section 2 we formally define TC and its various subcases, and in Section 3 we review its most important
applications. Section 4 describes the main ideas underlying the ML approach to classification. Our discussion of
text classification starts in Section 5 by introducing text indexing, that is, the transformation of textual
documents into a form that can be interpreted by a classifier-building algorithm and by the classifier eventually
built by it. Section 6 tackles the inductive construction of a text classifier from a “training” set of preclassified
documents. Section 7 discusses the evaluation of text classifiers. Section 8 concludes, discussing open
issues and possible avenues of further research for TC.
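The indexing step (Section 5) and the inductive construction step (Section 6) outlined above can be sketched together as a tiny multinomial Naive Bayes learner over a bag-of-words representation. This is an illustrative sketch of the general ML paradigm, not any specific method from the paper; the function names and the four-document training set are invented:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude indexing step: a lowercase bag-of-words representation.
    return text.lower().split()

def train_nb(docs):
    """Inductively build a classifier from (text, category) pairs."""
    cat_docs = Counter()               # documents per category
    cat_words = defaultdict(Counter)   # word counts per category
    vocab = set()
    for text, cat in docs:
        cat_docs[cat] += 1
        for w in tokenize(text):
            cat_words[cat][w] += 1
            vocab.add(w)
    n_docs = sum(cat_docs.values())
    return cat_docs, cat_words, vocab, n_docs

def classify_nb(model, text):
    cat_docs, cat_words, vocab, n_docs = model
    best, best_lp = None, float("-inf")
    for cat in cat_docs:
        # log P(cat) + sum of log P(word | cat), with Laplace smoothing
        lp = math.log(cat_docs[cat] / n_docs)
        total = sum(cat_words[cat].values())
        for w in tokenize(text):
            lp += math.log((cat_words[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

# Invented toy "preclassified" training set:
training = [("wheat corn harvest prices", "grain"),
            ("oil barrel crude prices", "energy"),
            ("corn futures wheat export", "grain"),
            ("crude pipeline oil supply", "energy")]
model = train_nb(training)
print(classify_nb(model, "wheat export harvest"))  # -> grain
```

The Laplace (+1) smoothing keeps unseen words from driving a category's probability to zero, which is what makes the learned classifier usable on new documents.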
An Evaluation of Statistical Approaches to Text Categorization
Text categorization (TC) is the problem of assigning predefined categories to free text documents. A
growing number of statistical learning methods have been applied to this problem in recent years,
including regression models (Fuhr et al. 1991, Yang and Chute 1994), nearest neighbor classifiers
(Creecy et al. 1992, Yang 1994), Bayesian probabilistic classifiers (Tzeras and Hartman 1993, Lewis
and Ringuette 1994, Moulinier 1997), decision trees (Fuhr et al. 1991, Lewis and Ringuette 1994,
Moulinier 1997), inductive rule learning algorithms (Apte et al. 1994, Cohen and Singer 1996,
Moulinier et al. 1996), neural networks (Wiener et al. 1995, Ng et al. 1997) and on-line learning
approaches (Cohen and Singer 1996, Lewis et al. 1996). With more and more methods available,
cross-method evaluation becomes increasingly important for identifying the state of the art in text
categorization. However, without a unified methodology in empirical evaluations, objective
comparisons of different methods are difficult. Ideally, all researchers would like to use a common
collection and comparable performance measures to evaluate their systems, or would allow their
systems to be evaluated under carefully controlled conditions in a fashion similar to that of the Text
Retrieval Conference (TREC). The reality, however, is far from the ideal. Cross-method comparisons
have been attempted in the literature, but often only for two or three methods. The small scale of
these comparisons could either lead to overly general statements based on insufficient
observations, or provide little insight into a global comparison between a wide range of approaches.
An alternative to these small-scale comparisons would be to integrate the available results of
categorization methods into a global evaluation, carefully analyzing the test conditions and
evaluation measures used and establishing a common basis for cross-collection and cross-
experiment integration. This solution would lead to a TREC-like controlled evaluation for text
categorization, as well as contribute useful insights to individual studies. This paper is an effort in
that direction. The most serious problem in TC evaluations is the lack of standard data collections.
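One such source of inconsistency is how a collection is partitioned into training and test sets. A seeded, deterministic split at least makes a given partition reproducible across experiments; the sketch below is generic (the function name and parameters are invented, and it does not correspond to any official Reuters partition):

```python
import random

def fixed_split(doc_ids, test_fraction=0.3, seed=42):
    """Deterministic train/test split: the same seed and input
    always yield the same partition, so results stay comparable."""
    ids = sorted(doc_ids)          # canonical order before shuffling
    rng = random.Random(seed)      # local RNG, independent of global state
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)

train, test = fixed_split(range(10))
# Same seed, same input -> same partition on every run.
assert fixed_split(range(10)) == (train, test)
```

Publishing the seed and the splitting procedure alongside results is one cheap way to keep a corpus configuration from silently multiplying into incomparable versions.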
Even if a common collection is chosen, there are still many ways to introduce inconsistent variations.
The commonly used Reuters news story corpus, for example, has at least five different versions,
depending on how the training/test sets are divided, which subsets of categories or documents are
used or not used for evaluation, and so forth. The number of different configurations of this corpus
is still growing. It is often unclear whether or not the reported results on the different versions of
Reuters are comparable to each other. In this paper we examine the impact of corpus configuration
variations on the performance of classifiers, using carefully controlled experiments with several
categorization systems on five different versions of Reuters. As will be shown in Section 5.2,
variations between certain versions of Reuters do have a strong impact, while the variations
between other versions do not. The underlying reason for this will be analyzed. Another important
issue in cross-experiment evaluation is the comparability between different performance measures
used in individual experiments. Many measures have been used, including recall and precision,
accuracy or error, break-even point or F-measure, micro-average and macro-average for binary
categorization, 11-point average precision for category ranking, and so forth (see Section 3 for
definitions). Each of these measures is designed to evaluate some aspect of the categorization
performance of a system; however, no two of them convey the same information. Which of these
measures are more suitable for text categorization? How can published results of text categorization
methods be best compared when they were evaluated using different performance measures?
These questions are addressed in this paper by applying a variety of performance measures to
several classifiers, including both measures for category ranking evaluation and measures for binary
category assignment. We will show that both types of measures are informative and complementary
to each other. We will also show that with carefully chosen performance measures and a baseline
classifier, one can reasonably (if indirectly) compare the relative performance of classifiers across experiments, based on their performance with respect to the baseline classifier. This paper
is divided into six sections in addition to the introduction. Section 2 describes the classifiers and the
Reuters corpus we will use in this paper. Section 3 introduces and analyzes performance measures
for category ranking evaluation and binary categorization evaluation. Section 4 describes the novel
experiments we conducted with WORD, kNN, and LLSF. Section 5 reports the results of our classifiers
and evaluates them together with published results of other classifiers. Finally, we summarize our
conclusions in Section 6.
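As an illustration of why the choice of measure matters, micro- and macro-averaged F1 can disagree sharply when category frequencies are skewed: macro-averaging weights every category equally, while micro-averaging pools the contingency counts and is dominated by frequent categories. The sketch below uses invented per-category (tp, fp, fn) counts:

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, with 0/0 guarded as 0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_category):
    """per_category: list of (tp, fp, fn) contingency counts, one per category."""
    # Macro-averaging: F1 per category, then average (all categories equal).
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    # Micro-averaging: pool the counts first (frequent categories dominate).
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = f1(tp, fp, fn)
    return micro, macro

# A frequent category classified well, a rare one classified poorly:
micro, macro = micro_macro_f1([(90, 10, 10), (1, 4, 4)])
print(round(micro, 3), round(macro, 3))  # -> 0.867 0.55
```

Here the micro-average hides the poor performance on the rare category, while the macro-average exposes it, which is exactly why reporting only one of the two can mislead cross-experiment comparisons.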