Classification of Text, Automatic
F Sebastiani, Università di Padova, Padova, Italy
© 2006 Elsevier Ltd. All rights reserved.

Introduction

In the last two decades, the production of textual documents in digital form has increased exponentially, owing to the increased availability of inexpensive hardware and software for generating digital text (e.g., personal computers, word processors) and for digitizing textual data not yet in digital form (e.g., scanners, optical character recognition software). As a consequence, there is an ever-increasing need for mechanized solutions for organizing the vast quantity of digital texts being produced, with an eye toward their future use. The design of such solutions has traditionally been the object of study of information retrieval (IR), the discipline that, broadly speaking, is concerned with computer-mediated access to data with poorly specified semantics.

There are two main directions for providing convenient access to a large, unstructured repository of text:

- Providing powerful tools for searching for relevant documents within the repository. This is the aim of text search (see Document Retrieval, Automatic), a subdiscipline of IR concerned with building systems that take as input a natural language query and return a list of documents ranked according to their estimated degree of relevance to the user's information need. Nowadays the tip of the iceberg of text search is represented by Web search engines (see Web Searching), but commercial solutions to the text search problem were being delivered decades before the very birth of the Web.
- Providing powerful tools for turning this unstructured repository into a structured one, thereby easing storage, search, and browsing. This is the aim of text classification (TC), a discipline at the crossroads of IR, machine learning (ML), and (statistical) natural language processing, concerned with building systems that partition an unstructured collection of documents into meaningful groups (Sebastiani, 2002).
Text Clustering and Text Categorization

There are two main variants of TC. The first is text clustering, which is characterized by the fact that only the desired number of groups (or clusters) is known in advance: no indication as to the semantics of these groups is given as input. The second variant is text categorization, whereby the input to the system consists not only of the number of categories (or classes), but also of some specification of their semantics. In the most frequent case, this specification consists in a set of labels, one for each category, usually consisting of a noun or other short natural language expression, and in a set of example labeled texts, i.e., texts whose membership or nonmembership in each of the categories is known. Clustering may thus be seen as the task of finding a latent but as yet undetected group structure in the repository, while categorization can be seen as the task of structuring the repository according to a group structure known in advance. In logical-philosophical terms, we can see clustering as the task of determining both the extensional and the intensional level (see Extensionality and Intensionality) of a previously unknown group structure, and categorization as determining only the extensional level of a group structure whose intensional level is known.

It is the latter task that will be the focus of this article (text clustering is covered elsewhere in this volume – see Text Mining). From now on we will thus use the expressions 'text classification' and 'text categorization' interchangeably (abbreviated as TC), and the expression '(text) classifier' to denote a system capable of performing automatic TC.

Note that the central notion of TC, that of the membership of a document dj in a class ci based on the semantics of dj and ci, is an inherently subjective notion, since the semantics of dj and ci cannot be formally specified. Different classifiers (be they humans or machines) might thus disagree on whether dj belongs to ci. This means that membership cannot be determined with certainty, which in turn means that any classifier (be it human or machine) will be prone to misclassification errors. As a consequence, it is customary to evaluate automatic text classifiers by applying them to a set of labeled (i.e., preclassified) documents (a set that here plays the role of a gold standard), so that the accuracy (or effectiveness) of the classifier can be measured by the degree of coincidence between its classification decisions and the labels originally attached to the preclassified documents.

Single-Label and Multi-Label Text Categorization

TC itself admits of two important variants: single-label TC and multi-label TC. Given as input the set of categories C = {c1, ..., cm}, single-label TC is the task of attributing, to each document dj in the repository, the one category to which it belongs. Multi-label TC, instead, deals with the case in which each document dj may in principle belong to zero, one, or more than one category; it thus comes down to deciding, for each category ci in C, whether a given document dj belongs or does not belong to ci.

The technologies for coping with single-label and multi-label TC are slightly different (the former problem often being somewhat more challenging), especially concerning the phases of feature selection, classifier learning, and classifier evaluation (see below). In a real application, it is thus of fundamental importance to identify from the beginning whether the application requires single-label or multi-label TC.

Hard or Soft Text Categorization

Taking a binary decision, yes or no, as to whether a document dj belongs to a category ci is sometimes referred to as a 'hard' categorization decision. This is the kind of decision taken by autonomous text classifiers, i.e., software systems that need to decide and act accordingly without human supervision. A different type of decision, sometimes referred to as a 'soft' categorization decision, consists of attributing a numeric score (e.g., between 0 and 1) to the pair (dj, ci), reflecting the degree of confidence of the classifier in the fact that dj belongs to ci. This allows, for instance, ranking a set of documents in terms of their estimated appropriateness for a category ci, or ranking a set of categories in terms of their estimated appropriateness for a document dj. Such rankings are often useful for nonautonomous, interactive classifiers, i.e., systems whose goal is to recommend a categorization decision to a human expert, who is responsible for making the final decision. For instance, in a single-label TC task, a human expert in charge of the final classification decision may take advantage of a system that preranks the categories in terms of their estimated appropriateness to a given document dj.

Again, the technologies for coping with soft and hard categorization decisions are slightly different, especially concerning the phases of classifier learning and classifier evaluation (see below). In any real-world application, it is thus important to establish from the beginning whether the task requires soft or hard decisions.
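The difference between soft scores and hard decisions, and the reduction of multi-label TC to a set of per-category binary decisions, can be sketched in a few lines of code. This is an illustrative sketch only: the category names, score values, and the 0.5 threshold are invented for the example, not taken from the article.

```python
# Sketch: turning soft (graded) categorization decisions into hard ones.
# The scores dict and the 0.5 threshold are illustrative assumptions.

def hard_decisions(scores, threshold=0.5):
    """Multi-label 'hard' TC: for each category ci, decide yes/no
    by thresholding the classifier's confidence score for (dj, ci)."""
    return {cat: score >= threshold for cat, score in scores.items()}

def rank_categories(scores):
    """'Soft' TC output: rank categories by estimated appropriateness
    for a document dj (useful for interactive classifiers)."""
    return sorted(scores, key=scores.get, reverse=True)

# Confidence scores a soft classifier might assign to one document:
scores = {"wheat": 0.82, "corn": 0.35, "trade": 0.61}

print(hard_decisions(scores))   # {'wheat': True, 'corn': False, 'trade': True}
print(rank_categories(scores))  # ['wheat', 'trade', 'corn']
```

A hard classifier can thus be obtained from a soft one by fixing a threshold, whereas an interactive system would present the ranking itself to the human expert.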
Applications

Maron's seminal paper (Maron, 1961) is usually taken to mark the official birth date of TC, which at the time was called automatic indexing; this name reflected the fact that the main (or only) application then envisaged for TC was automatically indexing (i.e., generating internal representations for) scientific articles for Boolean IR systems (see Indexing, Automatic). In fact, since the index terms for these representations were drawn from a fixed, predefined set of such terms, we can regard this type of indexing as an instance of TC (with index terms playing the role of categories). The importance of TC increased in the late 1980s and early 1990s with the need to organize the increasingly large quantities of digital text being handled in organizations at all levels. Since then, frequently pursued applications of TC technology have included newswire filtering, i.e., the grouping, according to thematic classes of interest, of news stories produced by news agencies, thus allowing personalized delivery of information to customers according to their profiles of interest (Hayes and Weinstein, 1990); patent classification, i.e., the organization of patents and patent applications into specialized taxonomies, so as to ease the detection of existing patents related to a new patent application (Fall et al., 2003); and Web page classification, i.e., the grouping of Web pages (or sites) according to the taxonomic classification schemes typical of Web portals (Dumais and Chen, 2000).

The applications above all have a certain thematic flavor, in the sense that categories tend to coincide with topics, or themes. However, TC technology has also been applied to real-world problems that are not thematic in nature, among them spam filtering, i.e., the grouping of personal e-mail messages into the two classes LEGITIMATE and SPAM, so as to provide effective user shields against unsolicited bulk mailings (Drucker et al., 1999); authorship attribution, i.e., the automatic identification of the author of a text among a predefined set of candidates (Diederich et al., 2003) (see Authorship Attribution: Statistical and Computational Methods); author gender detection, a special case of the previous task in which the issue is deciding whether the author of the text is MALE or FEMALE (Koppel et al., 2002); genre classification, i.e., the identification of the nontopical communicative goal of the text (such as determining whether a product description is a PRODUCT REVIEW or an ADVERTISEMENT) (Stamatatos et al., 2000); survey coding, i.e., the classification of respondents to a survey based on the textual answers they have returned to an open-ended question (Giorgetti and Sebastiani, 2003); and even sentiment classification, as in deciding whether a product review is a THUMBS UP or a THUMBS DOWN (Turney and Littman, 2003).

Techniques

Approaches

In the 1980s, the most popular approach to TC was one based on knowledge engineering, whereby a knowledge engineer and a domain expert working together would build an expert system capable of automatically classifying text. Typically, such an expert system would consist of a set of 'if ... then ...' rules, to the effect that a document was assigned to the class specified in the 'then' clause only if the linguistic expressions (typically: words) specified in the 'if' part occurred in the document. The drawback of this approach was the high cost in terms of human power required for (i) defining the rule set and (ii) maintaining it, i.e., updating the rule set as a result of subsequent additions or deletions of classes, or of shifts in the meaning of the existing classes.

In the 1990s, this approach was superseded by the machine-learning approach, whereby a general inductive process (the learner) is fed a set of example (training) documents preclassified according to the categories of interest. By observing the characteristics of the training documents, the learner generates a model (the classifier) of the conditions that are satisfied by the documents belonging to the categories considered. This model can subsequently be applied to new, unlabeled documents in order to classify them according to these categories.

This approach has several advantages over the knowledge engineering approach. First of all, a higher degree of automation is introduced: the engineer needs to build not a text classifier, but an automatic builder of text classifiers (the learner). Once built, the learner can be applied to generating many different classifiers, for many different domains and applications: one only needs to feed it the appropriate sets of training documents. By the same token, the above-mentioned problem of maintaining a classifier is solved by feeding the learner new training documents appropriate for the revised set of classes. Many inductive learners are available off the shelf; if one of these is used, the only human power needed in setting up a TC system is that required for manually classifying the training documents. Performing this latter task requires less skilled labor than building an expert system, which is a further advantage. It should also be noted that if an organization has previously relied on manual work for classifying documents, then many preclassified documents are already available for use as training documents when the organization decides to automate the process.
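This train-then-classify workflow can be illustrated with a minimal sketch of a learner and the classifier it produces. The model below is a tiny multinomial naive Bayes with Laplace smoothing (one of the supervised methods surveyed later in this article); the toy corpus and category names are invented for the example.

```python
# Sketch of the ML approach: a learner is fed preclassified (training)
# documents and generates a classifier that can label new, unseen text.
# Minimal multinomial naive Bayes; toy corpus and categories are invented.
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learner: collect per-category word counts, document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    cat_counts = Counter()              # category -> number of training documents
    vocab = set()
    for text, cat in labeled_docs:
        words = text.lower().split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(text, model):
    """Classifier: return the category with the highest posterior log-probability."""
    word_counts, cat_counts, vocab = model
    n_docs = sum(cat_counts.values())
    best_cat, best_logp = None, -math.inf
    for cat in cat_counts:
        logp = math.log(cat_counts[cat] / n_docs)  # category prior
        total = sum(word_counts[cat].values())
        for w in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities for unseen words
            logp += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_cat, best_logp = cat, logp
    return best_cat

training = [("wheat prices rise", "GRAIN"),
            ("corn wheat harvest", "GRAIN"),
            ("bank interest rates", "FINANCE"),
            ("stock market rates", "FINANCE")]
model = train(training)
print(classify("wheat harvest report", model))  # GRAIN
```

Feeding the same learner a different set of training documents yields a classifier for a different domain, which is exactly the maintainability advantage described above.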
Most importantly, one of the advantages of the ML approach is that the accuracy of classifiers built by these techniques now often rivals that of human professionals, and usually exceeds that of classifiers built by knowledge engineering methods. This has brought about an ever wider acceptance of learning methods, even outside academia. While for certain applications, such as spam filtering, a combination of ML and knowledge engineering still lies at the basis of several commercial systems, it is fair to say that in most other TC applications (especially of the thematic type) the adoption of ML technology has been widespread.

Note that the ML approach is especially suited to the case in which no additional knowledge (of a procedural or declarative nature) of the meaning of the categories is available, since in this case the classification rules can be determined only on the basis of knowledge extracted from the training documents. This case is the most frequent one, and is thus the usual focus of TC research. Solutions devised for the case in which no additional knowledge is available are extremely general, since they do not presuppose the existence of, e.g., additional lexicosemantic resources that, in real-life situations, might be either unavailable or expensive to create (see Computational Lexicons and Dictionaries). A further reason why TC research rarely tackles the case of additionally available external knowledge is that these sources of knowledge may vary widely in type and format, thereby making each instance of their application to TC a case of its own, from which any lesson learned can hardly be exported to different application contexts. When external knowledge of some kind is available in a given application, heuristic techniques of any nature may be adopted in order to leverage these data, either in combination with or in isolation from the IR and ML techniques discussed here. It should be noted, however, that past research has not been able to show any substantial benefit from the use of external resources (such as lexicons, thesauri, or ontologies) in TC.

As previously noted, the meaning of categories is subjective. The ML techniques used for TC, rather than trying to learn a supposedly perfect classifier (a gold standard of dubious existence), strive to reproduce the subjective judgment of the expert who has labeled the training documents, and do this by examining the manifestations of this judgment, i.e., the documents that the expert has manually classified. The kind of learning that these ML techniques engage in is usually called supervised learning, since it is supervised, or facilitated, by knowledge of the preclassified data.

Learning Text Classifiers

Many different types of supervised learners have been used in TC (Sebastiani, 2002), including probabilistic 'naive Bayesian' methods, Bayesian networks, regression methods, decision trees, Boolean decision rules, neural networks, incremental or batch methods for learning linear classifiers, example-based methods, classifier ensembles (including boosting methods), and support vector machines. While all of these techniques retain their popularity, it is fair to say that in recent years support vector machines (Joachims, 1998) and boosting (Schapire and Singer, 2000) have been the two dominant learning methods in TC. This seems attributable to a combination of two factors: (i) these two methods have strong justifications in terms of computational learning theory, and (ii) in comparative experiments on widely accepted benchmarks, they have outperformed all other competing approaches. An additional factor in their success is the free availability, at least for research purposes, of well-known software packages based on these methods, such as SVMlight and BoosTexter.

Building Internal Representations for Documents

The learners discussed above cannot operate on documents as they are, but require the documents to be given internal representations that the learners can make sense of. The same is true of the classifiers, once they have been built by the learners. It is thus customary to transform all the documents (i.e., those used in the training phase, the testing phase, or the operational phase of the classifier) into internal representations by means of methods used in text search, where the same need is also present (see Indexing, Automatic). Accordingly, a document is usually represented by a vector lying in a vector space whose dimensions correspond to the terms that occur in the training set, with the value of each individual entry corresponding to the weight that the term in question has for the document.

In TC applications of the thematic kind, the set of terms is usually made to coincide with the set of content-bearing words (that is, all words but topic-neutral ones such as articles, prepositions, etc.), possibly reduced to their morphological roots (stems – see Stemming) so as to avoid excessive stochastic dependence among the different dimensions of the vector. Weights for these words are meant to reflect the importance that a word has in determining the semantics of the document it occurs in, and are automatically computed by weighting functions. These functions usually rely on intuitions of a statistical kind, such as: (i) the more often a term occurs in a document, the more important it is for that document; and (ii) the more documents a term appears in, the less important it is in characterizing the semantics of a document it occurs in.
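These two intuitions are what the classical tf × idf family of weighting functions encodes. The following is a sketch of one common log-based variant; the toy corpus is invented, and real systems typically add length normalization on top of this.

```python
# Sketch of a tf-idf weighting function encoding the two intuitions above:
# (i) the weight grows with the term's frequency in the document (tf), and
# (ii) shrinks as more training documents contain the term (idf).
# This log-based variant is one common choice among many; corpus is invented.
import math

def tfidf(term, doc_words, corpus):
    tf = doc_words.count(term)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["wheat", "prices", "rise"],
          ["the", "wheat", "harvest"],
          ["the", "bank", "rates"]]
doc = ["wheat", "wheat", "harvest"]

print(round(tfidf("wheat", doc, corpus), 3))  # frequent in doc, in 2 of 3 docs
print(round(tfidf("the", doc, corpus), 3))    # absent from this doc: weight 0
```

A content-bearing word that is frequent in the document but rare in the collection thus receives a high weight, while a ubiquitous topic-neutral word receives a low one.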
In TC applications of a nonthematic nature, the opposite is often true. For instance, it is the frequency of use of articles, prepositions, and punctuation (together with many other stylistic features) that may be a helpful clue in authorship attribution, while it is rather unlikely that the frequencies of use of content-bearing words will be of help (see Computational Stylistics). This shows that choosing the right dimensions of the vector space for a given classification task requires a deep understanding, on the part of the engineer, of the nature of the task.

It is fairly evident from the above discussion that the internal representations used in TC applications are, from the standpoint of linguistic analysis, extremely primitive: with the possible exception of applications in sentiment classification (Turney and Littman, 2003), hardly any sophisticated linguistic analysis is usually attempted in order to provide a more faithful rendition of the semantics of the text. This is because previous attempts at applying state-of-the-art natural language processing techniques (including techniques for parsing text robustly (Moschitti and Basili, 2004), extracting collocations (Koster and Seutter, 2003), performing word sense disambiguation (Kehagias et al., 2003), etc.) have not shown any substantial benefit with respect to the basic representations outlined above.

Reducing the Dimensionality of the Vectors

The techniques described in the previous section tend to generate very large vectors, with sizes in the tens of thousands. This situation is problematic in TC, since the efficiency of many learning devices (e.g., neural networks) tends to degrade rapidly with the size of the vectors. In TC applications, it is thus customary to run a dimensionality reduction pass before building the internal representations of the documents. Basically, this means identifying a new vector space, with a much smaller number of dimensions than the original one, in which to represent the documents. Several techniques for dimensionality reduction have been devised within TC or, more often, borrowed from the fields of ML and pattern recognition.

An important class of such techniques is that of feature extraction methods (examples of which are term clustering methods and latent semantic indexing – see Latent Semantic Analysis). Feature extraction methods define a new vector space in which each dimension is a combination of some (or all) of the original dimensions; their effect is usually a reduction of both the dimensionality of the vectors and the overall stochastic dependence among dimensions.

An even more important class of dimensionality reduction techniques is that of feature selection methods, which do not attempt to generate new terms but instead try to select the best ones from the original set. The measure of quality for a term is its expected impact on the accuracy of the resulting classifier. To measure this, feature selection functions are employed for scoring each term according to this expected impact, so that the highest scoring terms can be retained for the new vector space. These functions mostly come from statistics (e.g., chi-square) or information theory (e.g., mutual information, also known as information gain), and tend to encode, each in its own way, the intuition that the best terms for classification purposes are the ones that are distributed most differently across the different categories.

Challenges

TC, especially in its ML incarnation, is today a fairly mature technology that has delivered working solutions in a number of applicative contexts. Interest in TC has grown exponentially in the last 10 years, among researchers and developers alike.

For IR researchers, this interest is one particular aspect of a general movement toward leveraging user data to tame the inherent subjectivity of the IR task, i.e., the fact that it is the user, and only the user, who can say whether a given item of information is relevant to a query she has issued to a Web search engine, or relevant to a private folder of hers in which documents should be filed according to content. Wherever there are predefined classes, documents previously (and manually) classified by the user are often available; as a consequence, these data can be exploited for automatically learning the (extensional) meaning that the user attributes to the categories, thereby reaching accuracy levels that would be unthinkable if these data were unavailable.

For ML researchers, this interest is due to the fact that TC applications provide a challenging benchmark for their newly developed techniques, since these applications usually feature extremely high-dimensional vector spaces and provide large quantities of test data. In the last 5 years, this has resulted in more and more ML researchers adopting TC as one of their benchmark applications of choice, which means that cutting-edge ML techniques are applied to TC with minimal delay after their original invention.

For application developers, this interest is mainly due to the enormously increased need to handle larger and larger quantities of documents, a need emphasized by increased connectivity and by the availability of document bases of all types at all levels in the information chain. But this interest also results from TC techniques having reached accuracy levels that often rival the performance of trained professionals, levels that can be achieved with high efficiency on standard hardware and software resources. This means that more and more organizations are automating all of their activities that can be cast as TC tasks.

Still, a number of challenges remain for TC research.
The first and foremost challenge is to deliver high accuracy in all applicative contexts. While highly effective classifiers have been produced for applicative domains such as the thematic classification of professionally authored text (such as newswire stories), in other domains the reported accuracies are far from satisfying. Such applicative contexts include the classification of Web pages (where the use of text is more varied and obeys rules different from those of linear verbal communication), spam filtering (a task of an adversarial nature, in that spammers adapt their spamming strategies so as to circumvent the latest spam filtering technologies), authorship attribution (where current technology is not yet able to tackle the inherent stylistic variability among texts written by the same author), and sentiment classification (which requires much more sophisticated linguistic analysis than classification by topic).

A second important challenge is to bypass the document labeling bottleneck, i.e., to tackle the facts that labeled documents for use in the training phase are not always available, and that labeling (i.e., manually classifying) documents is costly. To this end, semisupervised methods have been proposed that allow building classifiers from a small sample of labeled documents and a (usually larger) sample of unlabeled documents (Nigam and Ghani, 2000; Nigam et al., 2000). However, the problem of learning text classifiers mainly from unlabeled data unfortunately remains open.

See also: Authorship Attribution: Statistical and Computational Methods; Computational Lexicons and Dictionaries; Computational Stylistics; Document Retrieval, Automatic; Extensionality and Intensionality; Indexing, Automatic; Latent Semantic Analysis; Stemming; Text Mining; Web Searching.

Bibliography

Diederich J, Kindermann J, Leopold E & Paaß G (2003). 'Authorship attribution with support vector machines.' Applied Intelligence 19(1/2), 109–123.

Drucker H, Vapnik V & Wu D (1999). 'Support vector machines for spam categorization.' IEEE Transactions on Neural Networks 10(5), 1048–1054.

Dumais ST & Chen H (2000). 'Hierarchical classification of Web content.' In Belkin NJ, Ingwersen P & Leong M-K (eds.) Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval. Athens: ACM Press. 256–263.

Fall CJ, Törcsvári A, Benzineb K & Karetka G (2003). 'Automated categorization in the International Patent Classification.' SIGIR Forum 37(1), 10–25.

Giorgetti D & Sebastiani F (2003). 'Automating survey coding by multiclass text categorization techniques.' Journal of the American Society for Information Science and Technology 54(12), 1269–1277.

Hayes PJ & Weinstein SP (1990). 'CONSTRUE/TIS: a system for content-based indexing of a database of news stories.' In Rappaport A & Smith R (eds.) Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence. Menlo Park: AAAI Press. 49–66.

Joachims T (1998). 'Text categorization with support vector machines: learning with many relevant features.' In Nédellec C & Rouveirol C (eds.) Proceedings of ECML-98, 10th European Conference on Machine Learning. Lecture Notes in Computer Science series, no. 1398. Heidelberg: Springer Verlag. 137–142.

Kehagias A, Petridis V, Kaburlasos VG & Fragkou P (2003). 'A comparison of word- and sense-based text categorization using several classification algorithms.' Journal of Intelligent Information Systems 21(3), 227–247.

Koppel M, Argamon S & Shimoni AR (2002). 'Automatically categorizing written texts by author gender.' Literary and Linguistic Computing 17(4), 401–412.

Koster CH & Seutter M (2003). 'Taming wild phrases.' In Sebastiani F (ed.) Proceedings of ECIR-03, 25th European Conference on Information Retrieval. Pisa: Springer Verlag. 161–176.

Maron M (1961). 'Automatic indexing: an experimental inquiry.' Journal of the Association for Computing Machinery 8(3), 404–417.

Moschitti A & Basili R (2004). 'Complex linguistic features for text classification: a comprehensive study.' In McDonald S & Tait J (eds.) Proceedings of ECIR-04, 26th European Conference on Information Retrieval Research. Lecture Notes in Computer Science series, no. 2997. Heidelberg: Springer Verlag. 181–196.

Nigam K & Ghani R (2000). 'Analyzing the applicability and effectiveness of co-training.' In Agah A, Callan J & Rundensteiner E (eds.) Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management. McLean: ACM Press. 86–93.

Nigam K, McCallum AK, Thrun S & Mitchell TM (2000). 'Text classification from labeled and unlabeled documents using EM.' Machine Learning 39(2/3), 103–134.

Schapire RE & Singer Y (2000). 'BoosTexter: a boosting-based system for text categorization.' Machine Learning 39(2/3), 135–168.

Sebastiani F (2002). 'Machine learning in automated text categorization.' ACM Computing Surveys 34(1), 1–47.

Stamatatos E, Fakotakis N & Kokkinakis G (2000). 'Automatic text categorization in terms of genre and author.' Computational Linguistics 26(4), 471–495.

Turney PD & Littman ML (2003). 'Measuring praise and criticism: inference of semantic orientation from association.' ACM Transactions on Information Systems 21(4), 315–346.

Relevant Websites

http://svmlight.joachims.org – SVMlight website.
http://www.research.att.com/~schapire/BoosTexter/ – BoosTexter website.
http://www.math.unipd.it/~fabseb60 – F. Sebastiani's website.