82 Taeho Jo
5 Text Categorization: Conceptual View
This chapter is concerned with the conceptual view of text categorization, which is the process of assigning a category or several categories to each text. Text categorization is divided into hard text categorization and soft text categorization, depending on whether each text may be assigned more than one category. Depending on whether the categories are predefined as a list or as a tree, it is divided into flat text categorization and hierarchical text categorization. In this chapter, we describe text categorization tasks with respect to their types and real examples.
Keywords: Text Categorization, Hard Text Categorization, Soft Text Categorization, Hierarchical Text Categorization
We give an overview of text categorization in Section 5.1 and explain classification in its conceptual view in Section 5.2. In Section 5.3, we explore the types of text categorization under different dichotomies. We mention typical real tasks which are derived from text categorization in Section 5.4, and provide a summary and further discussion of this chapter in Section 5.5.
5.1 Definition of Text Categorization
Text categorization is defined as the process of assigning one or some of the predefined categories to each text, basically through the three steps shown in Figure 48. Text categorization, also called text classification, is an instance of the classification task, where a text is given as the classification target. Texts are encoded into numerical vectors by the processes described in Chapters 2 and 3. We mainly use the machine learning algorithms described in the next chapter as the approaches to text categorization. In this section, we present an overview of text categorization before covering it in detail.
Preliminary tasks are required for building a text categorization system, even in the simplest version, which will be mentioned in Chapter 7. It is required to predefine a list or a tree of categories as the frame for classifying data items. Texts should be allocated to each category as samples. All sample texts are indexed into a list of words, called feature candidates, and some of them are selected as features. As additional preliminary tasks, we may decide the classification algorithm and the type of classification, such as exclusive or overlapping classification and flat or hierarchical classification.
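As a minimal sketch of these preliminary tasks, the fragment below predefines a flat category list, allocates labeled sample texts, indexes them into word candidates, and selects the most frequent words as features; the category names and toy texts are hypothetical illustrations, not examples from this chapter:

```python
from collections import Counter

# Predefined flat classification frame (hypothetical categories)
categories = ["politics", "sports"]

# Sample texts allocated to each category (toy examples)
samples = [
    ("the senate passed the budget bill", "politics"),
    ("the election campaign starts today", "politics"),
    ("the team won the championship game", "sports"),
    ("the player scored in the final game", "sports"),
]

# Index all sample texts into a list of word candidates
candidates = Counter()
for text, _ in samples:
    candidates.update(set(text.split()))  # count document frequency

# Select the top-k candidates as features
k = 5
features = [word for word, _ in candidates.most_common(k)]
```

Real systems would add stemming, stop-word removal, and a more principled feature selection criterion, as described in Chapters 2 and 3.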
Once the above preliminary tasks are accomplished, including the decision of which classification algorithm to adopt, the classification capacity is constructed from the sample texts. The sample texts which were allocated to categories in the preliminary task are encoded into numerical vectors whose attributes are the selected features.

Text Mining: Concepts, Implementation, and Big Data Challenge 83

Fig. 48 Text Categorization Steps

Using the training examples encoded from the sample texts, the classification capacity, which takes one of various forms such as equations, symbolic rules, or optimized parameters depending on the classification algorithm, is constructed; this is called the learning process. Optionally, the results of the learning process may be validated using a separate set of examples, called the validation set. In this case, the set of sample texts is divided into two sets, the pure training set and the validation set; the validation set is not involved in the learning process.
After the learning process, texts which are given separately from the sample texts are classified. The classification capacity is constructed through the learning process mentioned above, including the optional validation process. The classification performance is evaluated using the test set, which is initially left separated from the training set. Depending on the classification performance, we decide whether or not to adopt the classification algorithm. The set of texts which is given in a real field after adopting the classifier is called the real set.
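The division of the labeled sample texts into training, validation, and test sets described above can be sketched as follows; the split ratios are illustrative assumptions, not prescriptions from the text:

```python
def split_samples(samples, train_ratio=0.6, validation_ratio=0.2):
    """Partition labeled samples into training, validation, and test sets.

    The test set is left untouched during learning; the validation set
    is used only to check the learned classification capacity.
    """
    n = len(samples)
    n_train = int(n * train_ratio)
    n_valid = int(n * validation_ratio)
    training = samples[:n_train]
    validation = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    return training, validation, test

# Example: ten labeled items split 6/2/2
items = [(f"text {i}", "category") for i in range(10)]
train, valid, test = split_samples(items)
```

In practice the samples would be shuffled before splitting so that each subset reflects the category distribution.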
Let us consider some issues in implementing text categorization systems. A list or tree of topics, called a classification frame, is predefined, subjectively, before gathering sample texts. It is not easy to gather labeled sample texts; it is very tedious to decide the target categories of texts manually. Independence among categories is not guaranteed; some categories are correlated with others. The classification frame for maintaining texts is usually kept fixed; changing it while maintaining a text categorization system is very costly.
Fig. 49 Binary Classification
5.2 Data Classification
This section is concerned with data classification in its conceptual view and consists of four subsections. In Section 5.2.1, we describe binary classification conceptually as the simplest classification task. In Section 5.2.2, we cover multiple classification, which is expanded from binary classification. In Section 5.2.3, we explain the process of decomposing multiple classification into as many binary classification tasks as there are categories. In Section 5.2.4, we describe regression, to which the supervised learning algorithms mentioned in Chapter 6 are also applied.
5.2.1 Binary Classification
Binary classification refers to the simplest classification task, where each item is classified into one of two categories, as illustrated in Figure 49. The assumption underlying binary classification is that each item belongs to exactly one of the two classes. We need to define criteria in order to decide to which of the two classes each item belongs. Some items may belong to both categories in real classification tasks; such a classification task is called overlapping or fuzzy binary classification. In this section, we describe the binary classification task as the entrance to understanding the classification task.
Let us present a simple example of binary classification. Points are plotted in the two dimensional space; each point is expressed as (x1, x2). The points belonging to the positive class are plotted in the area where x1 ≥ 0 and x2 ≥ 0. The points belonging to the negative class are plotted in the area where x1 < 0 and x2 < 0. With the points of the two classes plotted in the two dimensional space, let us consider a dichotomy which separates the two classes from each other.
Let us define symbolic classification rules for classifying the points into one of the two classes. The rule for classifying a point into the positive class is defined as: if x1 ≥ 0 and x2 ≥ 0, then the point belongs to the positive class. The rule for classifying a point into the negative class is defined as: if x1 < 0 and x2 < 0, then the point belongs to the negative class. According to the above rules, the points where x1 ≥ 0 and x2 < 0, or x1 < 0 and x2 ≥ 0, are rejected; they fall outside both rules. We may consider alternative dichotomies to the above rules for classifying a point into one of the two classes.

Fig. 50 Multiple Classification
Machine learning algorithms are considered as an alternative kind of classification tool to the rule based approaches. Instead of defining the above rules, we gather examples which are labeled with the positive or negative class; they are called training examples. By analyzing the training examples, the classification capacity is constructed and given in various forms such as equations, symbolic rules, or neural network models. The examples in a separate set, called the test set, are classified by the classification capacity. In Chapter 6, we will describe in detail the classification capacity which is generated from the training examples.
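As a minimal sketch of this machine-learning alternative, the fragment below trains a simple perceptron on labeled points from the two quadrants instead of using hand-written rules; the perceptron is one concrete choice of learning algorithm, used here only for illustration:

```python
def train_perceptron(examples, epochs=10):
    """Learn weights (w1, w2) and bias b from labeled 2-D points.

    Each example is ((x1, x2), y) with y = +1 or -1.
    """
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in examples:
            if y * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified: update
                w1 += y * x1
                w2 += y * x2
                b += y
    return w1, w2, b

def classify(point, w1, w2, b):
    """Assign +1 (positive class) or -1 (negative class) to a point."""
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else -1

# Training examples: positive quadrant vs. negative quadrant
examples = [((1, 2), 1), ((3, 1), 1), ((0.5, 0.5), 1),
            ((-1, -2), -1), ((-3, -1), -1), ((-0.5, -0.5), -1)]
w1, w2, b = train_perceptron(examples)
```

Because the two quadrants are linearly separable, the perceptron converges to a separating line without any rule being written by hand.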
Even if binary classification looks like a very simple toy task, it exists as a real task. Spam mail filtering, where junk emails are automatically filtered from incoming ones, is a typical instance of binary classification. Detecting whether a report about a traffic accident is true or false is a real example which is used in insurance companies. The information retrieval task may be viewed as a binary classification which decides whether each text is relevant or irrelevant to a given query. Keyword extraction and text summarization are also instances of binary classification.
5.2.2 Multiple Classification
Multiple classification is illustrated as a block diagram in Figure 50. Each item is classified into one of two classes in binary classification, as illustrated in Figure 49. Multiple classification refers to the classification task where each item is classified into one of at least three categories. Multiple classification may be decomposed into as many binary classification tasks as there are categories, as will be described in Section 5.2.3. In this section, we describe multiple classification in its conceptual view.
In order to understand it easily, we present a simple example of multiple classification in the two dimensional space. It is assumed that four classes are given: class 1, class 2, class 3, and class 4. The constraints for class 1 are given as x1 ≥ 0 and x2 ≥ 0, and those for class 2 are given as x1 < 0 and x2 ≥ 0. The constraints x1 ≥ 0 and x2 < 0 are considered for class 3, and x1 < 0 and x2 < 0 for class 4. A point (x1, x2) in the two dimensional space is classified into one of the four classes by the above constraints.
The if-then rules for classifying points in the two dimensional space into
one of the four classes are defined as follows:
if x1 ≥ 0 and x2 ≥ 0 then class 1
if x1 < 0 and x2 ≥ 0 then class 2
if x1 ≥ 0 and x2 < 0 then class 3
if x1 < 0 and x2 < 0 then class 4
For example, the point (5, 4) is classified into class 1 by the first rule, because both x1 and x2 are greater than or equal to zero. As another example, the point (−7, 2) is classified into class 2 by applying the second rule to it. The point (6, −3) is classified into class 3, according to the third rule. In real tasks, which are much more complicated than the current multiple classification, it is possible that no rule is applicable to an input.
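The four if-then rules above translate directly into code; this sketch returns the class of a point in the two dimensional space:

```python
def classify_point(x1, x2):
    """Classify a 2-D point into one of four classes by the if-then rules."""
    if x1 >= 0 and x2 >= 0:
        return 1
    if x1 < 0 and x2 >= 0:
        return 2
    if x1 >= 0 and x2 < 0:
        return 3
    return 4  # remaining case: x1 < 0 and x2 < 0
```

Here the four rules happen to cover the whole plane, so every point receives exactly one class; in more complicated real tasks a rule base may leave some inputs unmatched.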
Let us consider soft classification, where an item is allowed to be classified into more than one class. The classification mentioned above belongs to hard classification, where no overlapping among the four classes is allowed by the rules. Suppose the first rule is modified as follows: if x1 ≥ −2 and x2 ≥ −2, then class 1. For example, the point (−1, −1.5) may then be classified into both class 1 and class 4, because the first and the fourth rules apply to the input. Therefore, the area where −2 ≤ x1 < 0 and −2 ≤ x2 < 0 overlaps between class 1 and class 4.
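With the first rule relaxed as above, a soft classifier collects every class whose rule matches, so a point may receive more than one class:

```python
def classify_soft(x1, x2):
    """Return the list of all classes whose rule matches the point."""
    classes = []
    if x1 >= -2 and x2 >= -2:  # relaxed rule for class 1 (overlapping)
        classes.append(1)
    if x1 < 0 and x2 >= 0:
        classes.append(2)
    if x1 >= 0 and x2 < 0:
        classes.append(3)
    if x1 < 0 and x2 < 0:
        classes.append(4)
    return classes
```

A point in the overlapping area receives both class 1 and class 4, while a point deep in the positive quadrant still receives class 1 alone.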
Let us present some real instances of multiple classification. The case of classifying news articles into one or some of the predefined sections may be mentioned as a typical case. Optical character recognition, which refers to the process of classifying a character image into one of the ASCII codes, is mentioned as another case. POS (Part of Speech) tagging, which is the process of classifying a word by its grammatical function, is a multiple classification instance in the area of natural language processing. Sentiment analysis, which is a special type of text classification, is the process of classifying a text into one of three categories: positive, neutral, and negative.
5.2.3 Classification Decomposition
This section is concerned with the process of decomposing the multiple classification task into binary classification tasks, as illustrated in Figure 51. Applying a single classifier directly to a multiple classification task, where each item is classified into one of multiple categories, carries a high risk of misclassifying items. The multiple classification is divided into as many binary classifications as there are categories; each item is classified as positive or negative with respect to the corresponding category; the intention is to reduce the misclassification risk. The category or categories whose binary classifiers produce positives are assigned to each item. In this section, we describe the process of decomposing a multiple classification into binary classifications.

Fig. 51 Decomposition of Multiple Classification into Binary Classifications
Let us mention the binary classification tasks which are derived from a multiple classification task by the decomposition. The task which is initially defined is the multiple classification, where each item is classified into one of M categories. The M binary classification tasks corresponding to the M categories are derived by decomposing the initial task, and two labels, positive and negative, are defined in each binary classification. The positive class indicates that the item belongs to the corresponding category, while the negative class indicates that it does not. The multiple classification task is thus interpreted as a group of binary classification tasks through the decomposition.
Let us mention the process of decomposing the multiple classification into the M binary classification tasks. M classifiers are allocated to the M categories as binary classifiers. The training examples which belong to the corresponding category are labeled with the positive class, and as many of the training examples which do not belong to it as there are positive examples are selected and labeled with the negative class. The classifier which corresponds to each category learns its own positive and negative examples. If all of the training examples which do not belong to the corresponding category are used as negative examples, the classifier tends strongly to classify novel examples into the negative class, because of the unbalanced distribution which is biased toward it.
Let us explain the process of classifying a novel item after learning from the sample examples. A novel item is submitted as the classification target to the classifiers which correspond to the predefined categories. It is classified by the classifiers into a list of positive and negative classes. The categories whose classifiers classify the novel example into the positive class are assigned to it. If the given classification is exclusive, the category with the maximum categorical score among them is selected.
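The decomposition with balanced negative sampling and the subsequent classification of a novel item can be sketched as follows; the nearest-centroid scorer is an assumption made for illustration, standing in for whatever binary classifier is actually adopted:

```python
from math import dist

def decompose(examples, category):
    """Build a balanced binary training set for one category.

    examples: list of (vector, label). Positives are the members of the
    category; an equal number of non-members are taken as negatives.
    """
    positives = [x for x, c in examples if c == category]
    negatives = [x for x, c in examples if c != category][:len(positives)]
    return positives, negatives

def make_binary_classifier(positives, negatives):
    """Score an item by its distance to the negative vs. positive centroid."""
    def centroid(points):
        return tuple(sum(v) / len(points) for v in zip(*points))
    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: dist(x, neg_c) - dist(x, pos_c)  # > 0 means positive

def classify(item, classifiers):
    """Assign every category whose binary classifier scores positive."""
    scores = {c: f(item) for c, f in classifiers.items()}
    assigned = [c for c, s in scores.items() if s > 0]
    # For an exclusive classification, take the maximum score instead:
    # assigned = [max(scores, key=scores.get)]
    return assigned

# Four categories of 2-D points, one per quadrant (toy data)
examples = [((2, 2), "1"), ((3, 1), "1"), ((1, 3), "1"),
            ((-2, 2), "2"), ((-3, 1), "2"), ((-1, 3), "2"),
            ((2, -2), "3"), ((3, -1), "3"), ((1, -3), "3"),
            ((-2, -2), "4"), ((-3, -1), "4"), ((-1, -3), "4")]
classifiers = {c: make_binary_classifier(*decompose(examples, c))
               for c in ["1", "2", "3", "4"]}
```

Taking only the first non-members as negatives keeps the sketch deterministic; a real system would sample them at random while still matching the number of positives.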
Note that much overhead is caused by decomposing the multiple classification into binary classifications. The training examples which are labeled with one or some of the predefined categories should be relabeled with the positive class or the negative class and delivered to each classifier. As many classifiers as there are categories should learn the training examples, in parallel. We must wait until all of the classifiers make their decisions, instead of taking the output of a single classifier. Because the classifiers are independent of each other with respect to their learning and classification, it is possible to implement them as parallel and distributed computing.
5.2.4 Regression
Regression refers to the process of estimating a continuous output value or values by analyzing input values. The supervised learning algorithms are applied to this task as well as to classification. In classification, each output value is given as a discrete value, whereas in regression it is given as a continuous one. Regression may be mapped into classification in some areas by discretizing each continuous value into a finite number of intervals. In this section, we describe regression in its functional view and map it into classification.
As mentioned above, it is possible to map regression into classification. The output value is discretized into a finite number of intervals; the intervals are given as the predefined categories. The target output value is replaced by its corresponding interval for each training example; the examples are then learned by the given supervised machine learning algorithm. A novel item is classified into one of the predefined intervals; the classified one is the label which indicates the interval within which its continuous value lies. Regression is mapped into a binary classification by discretizing the output value into only two intervals; if the number of intervals is more than two, it is mapped into a multiple classification.
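The discretization of a continuous target into interval labels can be sketched with the standard library's bisect module; the interval boundaries below are illustrative assumptions:

```python
from bisect import bisect_right

def discretize(y, edges):
    """Map a continuous target value to the index of its interval.

    edges: sorted interior boundaries; k boundaries define k + 1 intervals.
    """
    return bisect_right(edges, y)

# Three boundaries define four interval categories over the output range
edges = [0.25, 0.5, 0.75]

# Replace each continuous target with its interval label
labeled = [(x, discretize(y, edges)) for x, y in [((1, 2), 0.1),
                                                  ((3, 4), 0.6),
                                                  ((5, 6), 0.9)]]
```

The resulting interval indices then serve as the predefined categories for any supervised classifier.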
Fig. 52 Decomposition of Regression into Binary Classifications

Figure 52 illustrates the process of decomposing regression into binary classification tasks. Regression is mapped into a multiple classification by discretizing the continuous output value into several intervals. The multiple classification is decomposed into binary classifications by the process which was described in Section 5.2.3. The classifiers are allocated to the corresponding intervals and learn their corresponding training examples; the process of preparing the training examples will be mentioned in the next paragraph. The classification mapped from the regression belongs to the exclusive kind, where each item is classified into only one category; only one classifier is allowed to classify the item into the positive class.
The training examples which are prepared for the regression task are labeled with their own continuous values. In the mapping, the label which is given as a continuous value is changed into its corresponding interval. By changing the discrete label once more into positive or negative, the training examples are prepared for each classifier in mapping the multiple classification into binary classifications. The classifiers are allocated to the intervals and trained with the prepared examples. In Section 5.2.3, we already explained the meaning of positive and negative for the binary classifiers.
Classification and regression may be compared with each other in terms of their differences and shared points. As a shared point, the supervised learning algorithms are applied to both kinds of tasks. The difference between them is that classification generates a discrete value or values as its output, whereas regression generates a continuous value or values, as mentioned above. Classification instances are spam mail filtering, text categorization, and image classification, and regression instances are nonlinear function approximation and time series prediction. The error rate, which serves as the risk function, is defined as the ratio of misclassified items to the total items in classification, and as the average over differences between desired and computed values in regression.
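The two risk functions mentioned above can be written down directly; the squared difference used in the regression case is one common choice of "difference between desired and computed values":

```python
def classification_error_rate(desired, computed):
    """Ratio of misclassified items to the total number of items."""
    wrong = sum(1 for d, c in zip(desired, computed) if d != c)
    return wrong / len(desired)

def regression_risk(desired, computed):
    """Average squared difference between desired and computed values."""
    return sum((d - c) ** 2 for d, c in zip(desired, computed)) / len(desired)
```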
5.3 Classification Types
This section is concerned with the types of text categorization, depending on the dichotomy criterion. In Section 5.3.1, we explain hard classification and soft classification, depending on whether each item is allowed to belong to more than one category. In Section 5.3.2, we mention flat classification and hierarchical classification, depending on whether a nested category inside a particular one is allowed. In Section 5.3.3, we describe single viewed classification and multiple viewed classification, depending on whether multiple classification frames are accommodated. In Section 5.3.4, we consider independent classification and dependent classification, depending on whether the current classification is influenced by the results of classifying previous items.

Fig. 53 Hard vs Soft Classification
5.3.1 Hard vs Soft Classification
Hard classification and soft classification are illustrated in Figure 53. Hard classification is one where no overlapping exists among categories; every item belongs to only one category. Soft classification is one where overlapping exists; at least one item belongs to two or more categories. The criterion for deciding between hard and soft classification is whether overlapping is allowed. In this section, we describe the two classification types in the conceptual view.
Hard classification is defined as the classification where every item is classified into only one category, as shown in the left part of Figure 53. In this classification type, training examples which are labeled with only one category are prepared. When a small number of categories is predefined, we may use a single classifier without decomposing the task into binary classifications. For each test item, one of a fixed number of predefined categories is decided by the classifier. Optical digit recognition and spam mail filtering belong to hard classification.
Soft classification refers to one where at least one item is classified into more than one category. Some or almost all of the training examples are initially labeled with more than one category. This type of classification should be decomposed into binary classifications as a requirement for applying machine learning algorithms. The F1 measure is used as the evaluation metric in this case, rather than accuracy. News article classification and image classification are instances of this classification type.
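A micro-averaged F1 measure over predicted and true category sets, one common way of evaluating soft classification, can be sketched as follows:

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over per-item category sets."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, predicted_sets):
        tp += len(true & pred)   # categories correctly assigned
        fp += len(pred - true)   # categories wrongly assigned
        fn += len(true - pred)   # categories missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike accuracy, this metric gives partial credit when only some of an item's categories are recovered, which is why it suits the soft classification setting.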
Fig. 54 Example of Hard and Soft Classification
Both classification types are demonstrated by a simple example in Figure 54, where eight items and four categories are prepared. In the hard classification, the eight items are each labeled with only one of the four categories, as shown in the left part of Figure 54. In the soft classification, six of the items are labeled with more than one category, as shown in the right part of Figure 54. It is also possible to assign category membership values to each item, instead of discrete categories.

Text categorization belongs to the soft classification more frequently than to the hard classification. Spam mail filtering in an email system is the typical case of hard classification. Because each text tends to cover more than one topic, text categorization in general belongs to the soft classification. The text collection Reuters-21578, where almost all of the texts are labeled with more than one topic, is the most standard one used for evaluating text categorization systems. We may therefore consider segmenting a text into subtexts based on its topics, which will be mentioned in Chapter 14.
5.3.2 Flat vs Hierarchical Classification
Figure 55 illustrates two kinds of classification: flat classification and hierarchical classification. The dichotomy criterion for dividing classification into these types is whether a nested category is allowed inside a particular one. Flat classification is one where the predefined categories are given as a list and no nested category is allowed, whereas hierarchical classification is one where the predefined categories are given as a tree and nested categories are allowed. In hierarchical classification, we need to consider the case where a data item is classified correctly at the abstract level, but incorrectly at the specific level. In this section, we describe the two types of classification in detail.
Fig. 55 Flat vs Hierarchical Classification

Flat classification is illustrated in the left part of Figure 55. It is one where no nested category is available in any category. In this classification type, the categories are predefined as a list. Among the predefined categories, one or some are assigned to each item. Digit recognition, optical character recognition, and spam mail filtering are instances of flat classification.
Hierarchical classification is shown in the right part of Figure 55. It is one where nested categories are allowed in a particular category. In this classification type, the categories are predefined as a tree. An item is classified in the top-down direction along the classification tree. The scheme of evaluating the performance is more complicated in hierarchical classification than in the flat one.
Figure 56 illustrates the process of applying classifiers to the hierarchical classification task. In the first level, there are two categories, A and B; the category A has the nested categories A-1 and A-2, and the category B has B-1 and B-2, as shown in the left part of Figure 56. A classifier is prepared in the first level for classifying an item into A or B. In the second level, two classifiers are needed: one which classifies the items assigned to A into A-1 or A-2, and the other which classifies the items assigned to B into B-1 or B-2. In hierarchical classification, classifiers are allocated to the root node and the interior nodes as the basic scheme of applying them.
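The top-down scheme above can be sketched as a small routing function; the threshold classifiers standing in for the three real classifiers are assumptions made for illustration:

```python
def classify_hierarchically(item, tree):
    """Route an item down the category tree, one classifier per node."""
    node = "root"
    while node in tree:                 # interior node: keep descending
        classifier, children = tree[node]
        node = children[classifier(item)]
    return node                         # leaf category

# Tree with a classifier at the root and at each interior node.
# Each classifier returns 0 or 1, selecting one of the two children.
tree = {
    "root": (lambda x: 0 if x[0] >= 0 else 1, ["A", "B"]),
    "A":    (lambda x: 0 if x[1] >= 0 else 1, ["A-1", "A-2"]),
    "B":    (lambda x: 0 if x[1] >= 0 else 1, ["B-1", "B-2"]),
}
```

Note that a misclassification at the root cannot be repaired at the second level, which is exactly the evaluation complication discussed above.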
In implementing classification systems, there is a trend of expanding flat classification into hierarchical classification. Optical character recognition and spam mail filtering typically belong to flat classification. The classification modules in digital library systems and patent classification systems tended to be implemented as complicated hierarchical classification systems in the early 2000s [15][17]. Because we need to consider the case of classifying items correctly at the higher level, but incorrectly at the lower level, the process of evaluating the results becomes much more complicated. Various schemes of applying classifiers to hierarchical classification exist other than the one mentioned above.

Fig. 56 Process of Applying Classifiers to Hierarchical Classification

Fig. 57 Different Classification Categories
5.3.3 Single vs Multiple Viewed Classification
Figure 57 illustrates several different classification frames which are defined differently, and subjectively, even within the same domain. A fixed single classification frame is required for implementing classification systems; deciding on a single classification frame in this process is very costly. It is very tedious to update and maintain the current classification frame continually and dynamically. Even if the classification frame is updated and changed, there is no guarantee that the new one is more suitable to the current collection of texts than the previous one. In this section, we propose the multiple viewed classification and compare it with the traditional classification type.
Single viewed classification refers to the classification type where only one group of categories is predefined as a list or a tree. Until now, it has been the classification paradigm which inherently underlies existing classification programs. Only one group of categories is predefined in the standard text collections which have been used for evaluating text categorization systems, such as Reuters-21578, 20NewsGroups, and OHSUMED. In the hard and single viewed classification, only one category is assigned to each item. Coordinating different opinions about the category predefinition is required for keeping this classification type.
Multiple viewed classification is one where at least two groups of classification categories, each of which is called a view, are allowed as trees or lists. The training examples, even in the hard classification, are labeled with multiple categories corresponding to the groups of predefined categories. The classifiers are allocated to the groups of predefined categories, assuming that the machine learning algorithms are applied to the classification tasks without any decomposition. Novel items are classified independently in each view. The multiple viewed classification is interpreted as independent classifications, as many as there are groups of predefined categories.
A simple example of multiple viewed classification is illustrated in Figure 58. Two views are given in Figure 58 as two groups of predefined categories: view 1, where classes A and B are predefined, and view 2, where classes C and D are predefined. Each of the eight texts is labeled with exactly two categories; one is from view 1 and the other is from view 2. We need two binary classifiers: one for classifying an item into class A or B, and the other for classifying it into class C or D. The two binary classifiers correspond to the two views over a single set of training examples.
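The two-view setting above can be sketched as independent classifiers applied to the same item; the two threshold decision rules are illustrative assumptions standing in for trained classifiers:

```python
def classify_multi_view(item, views):
    """Classify one item independently under each view's classifier."""
    return {view: classifier(item) for view, classifier in views.items()}

# One binary classifier per view (hypothetical decision rules)
views = {
    "view1": lambda x: "A" if x[0] >= 0 else "B",
    "view2": lambda x: "C" if x[1] >= 0 else "D",
}
```

Each item thus receives exactly one label per view, and the views never consult each other.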
The differences between hierarchical classification and multiple viewed classification are illustrated in Figure 59. The hierarchical classification is shown in the left part of Figure 59; there are three categories in the first level, and each of them has its own two nested categories. The multiple viewed classification is presented in the right part of Figure 59; there are three different independent binary classifications. In the hierarchical classification, for example, a text is classified into the category B-1 by way of the category B, whereas in the multiple viewed classification, it is classified into the three independent categories A-2, B-1, and C-2. The predefined category structure is given as a tree in the hierarchical classification, whereas the structure is given as a forest which consists of more than one independent tree or list in the multiple viewed classification.
Fig. 58 Example of Multiple Viewed Classification

Fig. 59 Hierarchical vs Multiple Viewed Classification

5.3.4 Independent vs Dependent Classification

This section is concerned with two types of classification, depending on whether the results of previously classified items influence the current classification. The two types, the independent classification and the dependent one, are illustrated in Figure 60. The independent classification is one where the results of previous and current classifications are independent of each other, whereas the dependent classification is one where the current classification results depend on the results of previously classified data items. Three independent binary classifications are presented in the left part of Figure 60, whereas two binary classifications which depend on another binary classification are shown in the right side of Figure 60. In this section, we describe the two kinds of classification in detail and compare them with each other.
Fig. 60 Independent vs Dependent Classification

The independent classification is one where classifying an item into a particular category has no influence on the decision about another category. The flat and exclusive classification belongs strictly to the independent classification, whereas the hierarchical classification belongs to the dependent classification, in that the scope of the specific categories depends on their higher category. There are two conditions required for the independent classification: no prior classification is required for classifying the current item, and the category into which an item was classified before does not influence the current classification. Spam mail filtering, optical character recognition, and digit recognition belong to the independent classification, which does not require prior classification. The flat classification may belong to either the independent classification or the dependent one, depending on the correlation among categories.
The dependent classification is one where a data item is classified into a category under the influence of a prior classification or prior classifications. The hierarchical classification typically belongs to the dependent classification, as mentioned above. The dependent classification may also be applicable to the flat overlapping classification, where, if an item is classified into a particular category, it may be classified into related categories with higher probabilities. For example, if a news article is classified into business, it may be classified more easily into the topic IT than into the topic sports. The text classification where texts tend to be classified into related categories becomes a typical example of the dependent classification.
Fig. 61 Examples of Independent and Dependent Classification
Figure 61 presents specific examples of the two kinds of classification. The
left part of Figure 61 presents an example of the independent classification,
where four classifiers classify a given text independently into the positive class
or the negative class. The right part of Figure 61 presents an example of the
dependent classification, where the classifier corresponding to the category,
business, depends on the classifier corresponding to the category, politics, and
the classifier corresponding to the category, IT, depends on the two classifiers
corresponding to the categories, politics, and science, respectively. Only when
the classifier corresponding to the category, politics, classifies a text into the
positive class is the classifier corresponding to the category, business, able to
classify the text into one of its two classes. Only when the classifier
corresponding to the category, politics, classifies the text into the negative
class and the classifier corresponding to the category, science, classifies it into
the positive class is the classifier corresponding to the category, IT, able to
classify it.
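The dependency structure in Figure 61 can be sketched as a chain of binary classifiers, where a later classifier runs only after the earlier ones have decided; the keyword rules below are hypothetical stand-ins for trained models, used only to make the control flow concrete.

```python
# Sketch of dependent binary classification, following Figure 61: the
# business classifier runs only after the politics classifier decides
# positively, and the IT classifier runs only after the politics classifier
# decides negatively and the science classifier decides positively.
# The keyword rules are hypothetical stand-ins for trained classifiers.

def classify_politics(text):
    return "positive" if "election" in text else "negative"

def classify_science(text):
    return "positive" if "research" in text else "negative"

def classify_business(text):
    return "positive" if "market" in text else "negative"

def classify_it(text):
    return "positive" if "software" in text or "data" in text else "negative"

def dependent_pipeline(text):
    results = {"politics": classify_politics(text),
               "science": classify_science(text)}
    if results["politics"] == "positive":
        # The business classifier depends on the politics classifier.
        results["business"] = classify_business(text)
    elif results["science"] == "positive":
        # The IT classifier depends on both prior classifiers.
        results["IT"] = classify_it(text)
    return results
```

In an independent classification, by contrast, all four classifiers would be invoked unconditionally on every text.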
We need to consider the relations among categories in performing classification
tasks. Independent relations among classes have been assumed in traditional
classification tasks, such as the digit recognition, the optical character
recognition, the spam mail filtering, and other single binary classification
tasks. More than ten categories exist in the standard text collections, such as
Reuter21578 and 20NewsGroups, which are mentioned in Chapter 8. The
structure of categories is hierarchical in the collection, 20NewsGroups, and
more than 100 categories in the collection, Reuter21578, are related
semantically to each other. So, the text classification usually belongs to the
dependent classification.
Fig. 62 Spam Mail Filtering
5.4 Variants of Text Categorization
This section is concerned with the tasks which are derived from the text
categorization, in their conceptual views. In Section 5.4.1, we cover the spam
mail filtering, which decides whether each email is junk or sound. In Section
5.4.2, we study the sentimental analysis, which classifies an opinion into one
of the three categories: negative, positive, and neutral. In Section 5.4.3, we
describe the information filtering, which decides whether a text is interesting
or not. In Section 5.4.4, we mention the topic routing, which is the reverse
task to the text categorization.
5.4.1 Spam Mail Filtering
The spam mail filtering is an instance of binary classification where each email
is classified into ham (sound) or spam (junk), as shown in Figure 62. Users of
email accounts tend to spend much time removing their unnecessary emails.
The task may be automated by the spam mail filtering, which is a special
instance of text categorization. An email is assumed to be a text, and the two
categories, spam and ham, are predefined. In this section, we describe the
spam mail filtering as an instance of text categorization, in detail.
It is necessary to clear junk emails in managing email accounts. Users usually
have more than one email account. Too many junk emails tend to arrive every
day, and it takes too much time to remove them by scanning them
individually. Junk emails need to be removed automatically before they reach
users. So, the spam mail filtering system should be installed in almost all
email accounts.
We consider applying the machine learning algorithms to the spam mail
filtering, viewing it as a binary classification task. The sample emails which
are labeled with spam or ham are prepared as training examples. The classifier
learns from the labeled sample emails. An email which arrives subsequently is
classified into ham or spam. In the real version, the filtering is often done by
symbolic rules, rather than by machine learning algorithms.
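The learning scheme described above can be sketched with a minimal multinomial Naive Bayes filter; the toy emails, the whitespace tokenization, and the add-one smoothing are assumptions for illustration, not the method of any cited work.

```python
import math
from collections import Counter

# Minimal Naive Bayes spam filter (a sketch, not a production system).
# The training emails and their spam/ham labels are toy examples.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

# Count words per class and class frequencies over the training emails.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    scores = {}
    for label in ("spam", "ham"):
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

A subsequently arriving email is classified by comparing its log scores under the two classes, exactly the learn-then-classify scheme described above.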
Fig. 63 Sentimental Analysis
Let us review some representative cases of applying the machine learning
algorithms to the spam mail filtering. In 2003, the Naive Bayes was applied
to the spam mail filtering by Schneider [103]. In 2003, neural networks were
proposed as the approach to the email classification by Clark et al. [10]. In
2007, the four main machine learning algorithms, the Naive Bayes, the K
Nearest Neighbor, the decision tree, and the support vector machine, were
compared with each other by Youn and McLeod [119]. In 2010, using
resampling, the KNN based spam mail filtering system was improved by
Loredana et al. [78].
Let us consider some issues in implementing the spam mail filtering systems.
We need to discriminate between the two types of misclassification:
misclassification of spam into ham and vice versa. Emails are classified into
spam or ham depending on the sample emails which were previously gathered.
Spam mails tend to differ from previous ones; such alien mails tend to be
misclassified. An email is given as a very short and colloquial text, so it is not
easy to analyze it.
5.4.2 Sentimental Analysis
The sentimental analysis refers to the process of classifying an opinion into
the positive, the neutral, or the negative class, as shown in Figure 63. An
opinion is given as a textual input, and one of the three attitudes is generated
as the output. The sentimental analysis is used for classifying opinions on
commercial products or political issues automatically, based on their
attitudes. It should be distinguished from the topic spotting, where a text is
classified by one or some of the topics. In this section, we describe the
sentimental analysis as a specialized instance of text categorization.
Let us mention the three categories in the sentimental analysis. The positive
means an opinion which describes something or somebody with positive
expressions, such as good, excellent, and great. The neutral means one which
describes something objectively, with neither positive nor negative
expressions, or with a mixture of both of them. The negative means one which
describes something with negative expressions, such as bad, terrible, and
poor. The neutral may be divided into the two types: one with no sentimental
expression and one with a mixture of positive and negative.
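The three-way decision can be sketched with a tiny lexicon-based scorer that counts the positive and negative expressions mentioned above; the word lists are illustrative assumptions, not an exhaustive sentiment lexicon.

```python
# Sketch of a lexicon-based sentimental analysis (illustrative word lists).
POSITIVE = {"good", "excellent", "great"}
NEGATIVE = {"bad", "terrible", "poor"}

def analyze(opinion):
    words = opinion.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # Covers both neutral cases: no sentimental expression at all,
    # or a balanced mixture of positive and negative.
    return "neutral"
```

Note that this sketch maps both neutral types to the same label; a finer-grained system could report them separately.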
The machine learning algorithms could be applied to the sentimental analysis
by viewing the task as a classification task. The texts which are labeled with
one of the three categories are collected and encoded into numerical vectors.
The machine learning algorithm learns the numerical vectors which are
encoded from the labeled sample texts. If a novel text is given, it is encoded
into a numerical vector and classified into one of the three classes. Even if
the sentimental analysis is an instance of text classification, its classification
criteria differ from those of the topic based one.
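The encoding step in this pipeline can be sketched as a bag-of-words mapping from an opinion to a numerical vector; the vocabulary below is a toy assumption, whereas in practice it would be extracted from the training corpus as described in earlier chapters.

```python
# Sketch of encoding an opinion into a numerical vector (bag of words).
# The vocabulary is an illustrative assumption.
vocabulary = ["good", "great", "bad", "poor", "service", "product"]

def encode(text):
    words = text.lower().split()
    # One dimension per vocabulary word, holding its frequency in the text.
    return [words.count(v) for v in vocabulary]
```

The resulting vectors are what the machine learning algorithm actually learns from and classifies.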
Let us introduce some previous approaches to the sentimental analysis. In
2004, the support vector machine was applied to the sentimental analysis
based on diverse information sources by Mullen and Collier [89]. In 2008,
Pang and Lee explained the sentimental analysis and the feature extraction
in detail [93]. In 2009, both learning algorithms, the support vector machine
and the Naive Bayes, were applied by Boiy and Moens to the sentimental
analysis of web texts which are written in English, Dutch, and French [6]. In
2011, words were defined as features and the support vector machine was
applied to the task by Maas et al. [81].
We need to consider some issues in the sentimental analysis which distinguish
it from the topic based text categorization. The sentimental analysis tends to
depend strongly on the positive and negative terms. Because an opinion is
given as a very short text, the information is not sufficient for distinguishing
positive and negative opinions from each other. An opinion or a thread tends
to include colloquial expressions and slang. Negative opinions which are
expressed softly in neutral words may be misclassified into the neutral
category.
5.4.3 Information Filtering
The information filtering refers to the process of providing interesting texts
for users among incoming ones, as shown in Figure 64. It is assumed that
texts arrive continually to users. A classifier is allocated to each user and
learns sample texts which are labeled with interesting or not. Each incoming
text is classified into one of the two labels, and the texts which are classified
as interesting are transferred to the users. In this section, we describe the
information filtering which is derived from the text categorization.
Fig. 64 Information Filtering
Let us mention the web crawling before describing the information filtering.
The web crawling means the process of collecting web documents which are
interesting to users based on their profiles, where a user profile means the
historical record of making queries and accessing web documents. The user
profiles are gathered, and relevant documents are retrieved through the
internet. The information retrieval is the process of retrieving relevant
documents only for the given query, whereas the web crawling is the process
of retrieving them continually by collecting the user profiles. An important
issue is how to schedule the web crawling.
The task of information filtering is viewed as a binary classification task.
For each user, sample texts which are labeled with interesting or not are
collected. The sample texts are encoded into numerical vectors, and the
allocated machine learning algorithm learns them. After the learning,
incoming web documents are classified into interesting or not. The web
documents which are labeled as uninteresting are discarded, and only the
ones which are labeled as interesting are delivered to the user.
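The per-user pipeline can be sketched as follows, with a simple word-overlap scorer standing in for a trained classifier; the profile construction and the overlap threshold are assumptions for illustration only.

```python
# Sketch of per-user information filtering. Each user has sample texts
# labeled interesting/uninteresting; incoming texts are scored by word
# overlap with the interesting samples (a stand-in for a learned model).

def build_profile(labeled_samples):
    interesting_words = set()
    for text, label in labeled_samples:
        if label == "interesting":
            interesting_words.update(text.lower().split())
    return interesting_words

def filter_stream(incoming_texts, profile, threshold=2):
    # Deliver only texts sharing enough words with the user's profile;
    # the rest are discarded, as described above.
    delivered = []
    for text in incoming_texts:
        overlap = len(set(text.lower().split()) & profile)
        if overlap >= threshold:
            delivered.append(text)
    return delivered
```

Each user would hold a separate profile, so the same incoming text may be delivered to one user and discarded for another.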
Let us explore the previous works on the information filtering system. In
1995, some heuristic algorithms of information filtering were proposed by
Shardanand and Maes [106]. In 1996, the Rocchio algorithm, which is an
instance of machine learning algorithm, was applied to the information
filtering by Kroon and Kerckhoffs [72]. In 2001, schemes of extracting features
and various kinds of approaches to the information filtering were mentioned
by Hanani et al. [20]. In 2010, the bag of words matching algorithm was
applied to the information filtering by Sriram et al. [107].
Let us mention some issues in implementing the information filtering system.
We need to consider the scheme of gathering sample texts from each user
before deciding on the approach. Depending on the user, even the same text
may need to be labeled differently. If a user changes his or her interest, some
of the sample texts should be relabeled. As well as texts, images may be
considered for performing the information filtering.
5.4.4 Topic Routing
This section is concerned with the topic routing, which is derived from the
text categorization. The text classification which allows the soft classification
is called the topic spotting, whereas the topic routing is its inverse task. The
topic routing is defined as the process of retrieving the texts which belong to
the given topic; a topic or topics are given as the input, and the texts which
belong to it or them are retrieved as the output. Because a topic or topics are
given as the query in the topic routing, it looks similar to the information
retrieval. So, in this section, we describe the topic routing in the functional
view.
The overlapping text categorization has been called the topic spotting as its
nickname, where more than one category is allowed to be assigned to each
text [114]. The process of assigning more than one category is viewed as the
process of spotting topics in a text. In [114], the Perceptron and the MLP
(Multiple Layer Perceptron) were proposed as the approaches, and the task
was decomposed into binary classifications. In the topic spotting, a single
text is given as the input, and its relevant topics are retrieved as the output.
The topic spotting has been mentioned in the subsequent literature
[60][90][76].
Let us consider the process of doing the topic routing. A list of categories is
predefined, and the texts should be tagged with their topic or topics. A topic
is given as a query, and the texts whose tags match it are retrieved. If a topic
outside the predefined ones is given, the texts whose contents match it are
retrieved. The difference from the information retrieval is to predefine the
categories in advance and to label the texts with one or some of them.
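The tag-matching step can be sketched as a lookup from a predefined topic to its tagged texts; the topics and texts below are toy assumptions.

```python
# Sketch of topic routing: texts are tagged with predefined topics,
# and a topic query retrieves all texts carrying that tag.
tagged_texts = [
    ("election results announced", {"politics"}),
    ("new smartphone released", {"IT", "business"}),
    ("quarterly earnings report", {"business"}),
]

def route(topic):
    return [text for text, tags in tagged_texts if topic in tags]
```

A topic outside the predefined set simply retrieves nothing here; a full system would fall back to content matching, as described above.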
The topic routing may be evaluated like the information retrieval task. The
categories are predefined as mentioned above, and the labeled texts are
prepared. The recall and the precision are computed from the texts which are
retrieved for the topic. The two metrics are combined into the F1 measure.
The process of computing the F1 measure is described in detail in Chapter
8.
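The evaluation can be made concrete with a small worked computation of the precision, the recall, and the F1 measure from the retrieved and relevant sets; the document identifiers below are illustrative.

```python
# Precision, recall, and F1 measure for one topic.
def f1_measure(retrieved, relevant):
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)   # correct among retrieved
    recall = true_positives / len(relevant)       # retrieved among relevant
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, retrieving {d1, d2, d3, d4} when {d1, d2, d5} are relevant gives a precision of 1/2, a recall of 2/3, and an F1 of 4/7.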
We need to point out some issues in implementing the topic routing system.
If a topic outside the predefined ones is given as a query, the system either
ignores it or is converted into an information retrieval system. When the
topics are predefined as a hierarchical structure, we need to refine the
abstract topic which is given as a query into its specific ones. We need to
consider the data structures for storing the associations of topics to their
relevant texts. If a text consists of subtexts which cover their own different
topics, heterogeneously, it is considered to retrieve the subtexts, instead of
the full texts.
5.5 Summary and Further Discussions
This chapter is characterized as the functional description of the text
categorization. The classification task is defined as a text mining task, and
the multiple classification is decomposed into binary classifications.
Depending on the dichotomy views, we surveyed the various kinds of
classification. We covered the spam mail filtering, the sentimental analysis,
the information filtering, and the topic routing, as the variant tasks which are
derived from the text categorization. In this section, we make the further
discussion about what we study in this chapter.
The clustering, which is covered from Chapter 9, plays the role of automating
the preliminary tasks for the text categorization. The preliminary tasks are
to predefine the categories as a list or a tree and to allocate the sample texts.
The results of the preliminary tasks are accomplished by the clustering task.
In 2006, Jo proposed the combination of the text categorization and the text
clustering with each other for managing texts automatically, based on this
idea [25]. It will be mentioned in detail in Chapter 16.
Let us characterize the text categorization in the functional view. The manual
preliminary tasks which were mentioned above are required for executing the
main task. The results of the text categorization are evaluated by the
accuracy or the F1 measure. The supervised learning algorithms which are
described in Chapter 6 are applied to the task. In the text categorization, it
is assumed that the frame of organizing texts is given as a list or a tree of
predefined categories.
The text categorization by itself is actually the semi-automated management
of texts. It requires the preliminary manual tasks: the category predefinition
and the sample text preparation. The learning of the sample texts and the
classification of novel texts belong to the automatic portion of the text
categorization. As texts are added and deleted, we need to update the
categories. The full automation of text management is implemented by
combining the text categorization with the text clustering, and it will be
described in detail in Chapter 16.
The machine learning algorithms are adopted as the approaches to the text
categorization in this study. The K Nearest Neighbor and its variants are
adopted as the simple and practical approaches to the task. The probabilistic
learning algorithms, such as the Bayes classifier, the Naive Bayes, and the
Bayesian learning, are considered as the typical approaches. The support
vector machine is the most popular tool for any kind of classification task, as
well as for the text categorization. The machine learning algorithms will be
described in Chapter 6.