Machine Learning in Automated Text Categorization
FABRIZIO SEBASTIANI
Consiglio Nazionale delle Ricerche, Italy
Introduction: In the last 10 years content-based document management tasks (collectively known as
information retrieval—IR) have gained a prominent status in the information systems field, due to
the increased availability of documents in digital form and the ensuing need to access them in
flexible ways. Text categorization (TC—a.k.a. text classification, or topic spotting), the activity of
labeling natural language texts with thematic categories from a predefined set, is one such task. TC
dates back to the early ’60s, but only in the early ’90s did it become a major subfield of the
information systems discipline, thanks to increased applicative interest and to the availability of
more powerful hardware. TC is now being applied in many contexts, ranging from document
indexing based on a controlled vocabulary, to document filtering, automated metadata generation,
word sense disambiguation, population of hierarchical catalogues of Web resources, and in general
any application requiring document organization or selective and adaptive document dispatching.
Until the late ’80s the most popular approach to TC, at least in the “operational” (i.e., real-world
applications) community, was a knowledge engineering (KE) one, consisting in manually defining a
set of rules encoding expert knowledge on how to classify documents under the given categories. In
the ’90s this approach increasingly lost popularity (especially in the research community) in favor
of the machine learning (ML) paradigm, according to which a general inductive process automatically
builds an automatic text classifier by learning, from a set of preclassified documents, the
characteristics of the categories of interest. The advantages of this approach are an accuracy
comparable to that achieved by human experts, and a considerable savings in terms of expert labor
power, since no intervention from either knowledge engineers or domain experts is needed for the
construction of the classifier or for its porting to a different set of categories. It is the ML approach to
TC that this paper concentrates on. Current-day TC is thus a discipline at the crossroads of ML and IR,
and as such it shares a number of characteristics with other tasks such as information/knowledge
extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still considerable debate
on where the exact border between these disciplines lies, and the terminology is still evolving. “Text
mining” is increasingly being used to denote all the tasks that, by analyzing large quantities of text
and detecting usage patterns, try to extract probably useful (although only probably correct)
information. According to this view, TC is an instance of text mining. TC enjoys quite a rich literature
now, but this is still fairly scattered. Although two international journals have devoted special issues
to this topic [Joachims and Sebastiani 2002; Lewis and Hayes 1994], there are no systematic
treatments of the subject: there are neither textbooks nor journals entirely devoted to TC yet, and
Manning and Schütze [1999, Chapter 16] is the only chapter-length treatment of the subject. As a
note, we should warn the reader that the term “automatic text classification” has sometimes been
used in the literature to mean things quite different from the ones discussed here. Aside from (i) the
automatic assignment of documents to a predefined set of categories, which is the main topic of this
paper, the term has also been used to mean (ii) the automatic identification of such a set of
categories (e.g., Borko and Bernick [1963]), or (iii) the automatic identification of such a set of
categories and the grouping of documents under them (e.g., Merkl [1998]), a task usually called text
clustering, or (iv) any activity of placing text items into groups, a task that has thus both TC and text
clustering as particular instances [Manning and Schütze 1999]. This paper is organized as follows. In
Section 2 we formally define TC and its various subcases, and in Section 3 we review its most important
applications. Section 4 describes the main ideas underlying the ML approach to classification. Our discussion of
text classification starts in Section 5 by introducing text indexing, that is, the transformation of textual
documents into a form that can be interpreted by a classifier-building algorithm and by the classifier eventually
built by it. Section 6 tackles the inductive construction of a text classifier from a “training” set of preclassified
documents. Section 7 discusses the evaluation of text classifiers. Section 8 concludes, discussing open
issues and possible avenues of further research for TC.
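The indexing step (Section 5) and the inductive construction step (Section 6) outlined above can be sketched together as a tiny multinomial Naive Bayes learner over a bag-of-words representation. This is an illustrative sketch of the general ML paradigm, not any specific method from the paper; the function names and the four-document training set are invented:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude indexing step: a lowercase bag-of-words representation.
    return text.lower().split()

def train_nb(docs):
    """Inductively build a classifier from (text, category) pairs."""
    cat_docs = Counter()               # documents per category
    cat_words = defaultdict(Counter)   # word counts per category
    vocab = set()
    for text, cat in docs:
        cat_docs[cat] += 1
        for w in tokenize(text):
            cat_words[cat][w] += 1
            vocab.add(w)
    n_docs = sum(cat_docs.values())
    return cat_docs, cat_words, vocab, n_docs

def classify_nb(model, text):
    cat_docs, cat_words, vocab, n_docs = model
    best, best_lp = None, float("-inf")
    for cat in cat_docs:
        # log P(cat) + sum of log P(word | cat), with Laplace smoothing
        lp = math.log(cat_docs[cat] / n_docs)
        total = sum(cat_words[cat].values())
        for w in tokenize(text):
            lp += math.log((cat_words[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

# Invented toy "preclassified" training set:
training = [("wheat corn harvest prices", "grain"),
            ("oil barrel crude prices", "energy"),
            ("corn futures wheat export", "grain"),
            ("crude pipeline oil supply", "energy")]
model = train_nb(training)
print(classify_nb(model, "wheat export harvest"))  # -> grain
```

The Laplace (+1) smoothing keeps unseen words from driving a category's probability to zero, which is what makes the learned classifier usable on new documents.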
An Evaluation of Statistical Approaches to Text Categorization
Text categorization (TC) is the problem of assigning predefined categories to free text documents. A
growing number of statistical learning methods have been applied to this problem in recent years,
including regression models (Fuhr et al. 1991, Yang and Chute 1994), nearest neighbor classifiers
(Creecy et al. 1992, Yang 1994), Bayesian probabilistic classifiers (Tzeras and Hartman 1993, Lewis
and Ringuette 1994, Moulinier 1997), decision trees (Fuhr et al. 1991, Lewis and Ringuette 1994,
Moulinier 1997), inductive rule learning algorithms (Apte et al. 1994, Cohen and Singer 1996,
Moulinier et al. 1996), neural networks (Wiener et al. 1995, Ng et al. 1997) and on-line learning
approaches (Cohen and Singer 1996, Lewis et al. 1996). With more and more methods available,
cross-method evaluation becomes increasingly important for identifying the state of the art in text
categorization. However, without a unified methodology in empirical evaluations, objective
comparisons of different methods are difficult. Ideally, all researchers would like to use a common
collection and comparable performance measures to evaluate their systems, or would allow their
systems to be evaluated under carefully controlled conditions in a fashion similar to that of the Text
Retrieval Conference (TREC). The reality, however, is far from the ideal. Cross-method comparisons
have been attempted in the literature, but often only for two or three methods. The small scale of
these comparisons could either lead to overly general statements based on insufficient
observations, or provide little insight into a global comparison between a wide range of approaches.
An alternative to these small-scale comparisons would be to integrate the available results of
categorization methods into a global evaluation, carefully analyzing the test conditions and
evaluation measures used and establishing a common basis for cross-collection and cross-
experiment integration. This solution would lead to a TREC-like controlled evaluation for text
categorization, as well as contribute useful insights to individual studies. This paper is an effort in
that direction. The most serious problem in TC evaluations is the lack of standard data collections.
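One such source of inconsistency is how a collection is partitioned into training and test sets. A seeded, deterministic split at least makes a given partition reproducible across experiments; the sketch below is generic (the function name and parameters are invented, and it does not correspond to any official Reuters partition):

```python
import random

def fixed_split(doc_ids, test_fraction=0.3, seed=42):
    """Deterministic train/test split: the same seed and input
    always yield the same partition, so results stay comparable."""
    ids = sorted(doc_ids)          # canonical order before shuffling
    rng = random.Random(seed)      # local RNG, independent of global state
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)

train, test = fixed_split(range(10))
# Same seed, same input -> same partition on every run.
assert fixed_split(range(10)) == (train, test)
```

Publishing the seed and the splitting procedure alongside results is one cheap way to keep a corpus configuration from silently multiplying into incomparable versions.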
Even if a common collection is chosen, there are still many ways to introduce inconsistent variations.
The commonly used Reuters news story corpus, for example, has at least five different versions,
depending on how the training/test sets are divided, which subsets of categories or documents are
used or not used for evaluation, and so forth. The number of different configurations of this corpus
is still growing. It is often unclear whether or not the reported results on the different versions of
Reuters are comparable to each other. In this paper we examine the impact of corpus configuration
variations on the performance of classifiers, using carefully controlled experiments with several
categorization systems on five different versions of Reuters. As will be shown in Section 5.2,
variations between certain versions of Reuters do have a strong impact, while the variations
between other versions do not. The underlying reason for this will be analyzed. Another important
issue in cross-experiment evaluation is the comparability between different performance measures
used in individual experiments. Many measures have been used, including recall and precision,
accuracy or error, break-even point or F-measure, micro-average and macro-average for binary
categorization, 11-point average precision for category ranking, and so forth (see Section 3 for
definitions). Each of these measures is designed to evaluate some aspect of the categorization
performance of a system; however, no two of them convey the same information. Which of these
measures are more suitable for text categorization? How can published results of text categorization
methods be best compared when they were evaluated using different performance measures?
These questions are addressed in this paper by applying a variety of performance measures to
several classifiers, including both measures for category ranking evaluation and measures for binary
category assignment. We will show that both types of measures are informative and complementary
to each other. We will also show that with carefully chosen performance measures and a baseline
classifier, one can reasonably (if indirectly) compare the relative performance of classifiers across experiments, based on their performance with respect to the baseline classifier. This paper
is divided into six sections in addition to the introduction. Section 2 describes the classifiers and the
Reuters corpus we will use in this paper. Section 3 introduces and analyzes performance measures
for category ranking evaluation and binary categorization evaluation. Section 4 describes the novel
experiments we conducted with WORD, kNN, and LLSF. Section 5 reports the results of our classifiers
and evaluates them together with published results of other classifiers. Finally, we summarize our
conclusions in Section 6.
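As an illustration of why the choice of measure matters, micro- and macro-averaged F1 can disagree sharply when category frequencies are skewed: macro-averaging weights every category equally, while micro-averaging pools the contingency counts and is dominated by frequent categories. The sketch below uses invented per-category (tp, fp, fn) counts:

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, with 0/0 guarded as 0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_category):
    """per_category: list of (tp, fp, fn) contingency counts, one per category."""
    # Macro-averaging: F1 per category, then average (all categories equal).
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    # Micro-averaging: pool the counts first (frequent categories dominate).
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = f1(tp, fp, fn)
    return micro, macro

# A frequent category classified well, a rare one classified poorly:
micro, macro = micro_macro_f1([(90, 10, 10), (1, 4, 4)])
print(round(micro, 3), round(macro, 3))  # -> 0.867 0.55
```

Here the micro-average hides the poor performance on the rare category, while the macro-average exposes it, which is exactly why reporting only one of the two can mislead cross-experiment comparisons.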