
Chapter 11
Classifying documents by language

Overview
In this chapter we will show how linear optimization can be used in machine learning. Ma-
chine learning is a branch of artificial intelligence that deals with algorithms that identify
(‘learn’) complex relationships in empirical data. These relationships can then be used to
make predictions based on new data. Applications of machine learning include spam email
detection, face recognition, speech recognition, webpage ranking in internet search engines,
natural language processing, medical diagnosis based on patients’ symptoms, fraud detection
for credit cards, control of robots, and games such as chess and backgammon.
An important drive behind the development of machine learning algorithms has been the
commercialization of the internet in the past two decades. Large internet businesses, such
as search engine and social network operators, process large amounts of data from around
the world. To make sense of these data, a wide range of machine learning techniques are
employed. One important application is ad click prediction; see, e.g., McMahan et al. (2013).
In this chapter, we will study the problem of automated language detection of text docu-
ments, such as newspaper articles and emails. We will develop a technique called a support
vector machine for this purpose. For an elaborate treatment of support vector machines in
machine learning, we refer to, e.g., Cristianini and Shawe-Taylor (2000).

11.1 Machine learning


Machine learning is a branch of applied mathematics and computer science that aims at de-
veloping computational methods that use experience in the form of data to make accurate
predictions. Usually, the ‘experience’ comes in the form of examples. In general, a machine
learning algorithm uses these examples to ‘learn’ about the relationships between the ex-
amples. Broadly speaking, there are a few different but related types of such algorithms.
The present chapter is about a so-called classification algorithm. A classification algorithm requires a set of examples, each of which has a label. An example of a classification problem
is the problem faced by email providers to classify incoming email messages into spam and
non-spam messages. The examples for such a classification problem would be a number of
email messages, each labeled as either ‘spam’ or ‘not spam’. The goal of the classification
algorithm is to find a way to accurately predict labels of new observations. The labels for
the examples are usually provided by persons. This could be someone who explicitly looks
at the examples and classifies them as ‘spam’ or ‘not spam’, or this could be provided by
the users of the email service. For example, when a user clicks on the ‘Mark this message
as spam’ button in an email application, the message at hand is labeled as ‘spam’, and this
information can be used for future predictions.
Other areas of machine learning include: regression, where the goal is to predict a number
rather than a label (e.g., tax spending based on income, see also Section 1.6.2); ranking, where
the goal is to learn how to rank objects (for example in internet search engines); clustering,
which means determining groups of objects that are similar to each other. In the cases of
classification and regression, the examples usually have labels attached to them, and these
labels are considered to be the ‘correct’ labels. The goal of the algorithms is then to ‘learn’ to
predict these labels. Such problems are sometimes categorized as supervised machine learning.
In contrast, in ranking and clustering, the examples usually do not have labels, and they are
therefore categorized as unsupervised machine learning.
The current chapter is a case study of a (supervised) classification algorithm. To make a classifi-
cation algorithm successful, certain features of the messages are determined that (hopefully)
carry predictive knowledge. For example, for spam classification, it can be helpful to count
the number of words in the message that relate to pharmaceutical products, or whether or
not the email message is addressed to many recipients rather than to one particular recipient.
Such features are indicative of the message being spam. Other words, such as the names of the recipient’s friends, may be indicative of the message not being spam.
The features are represented as numbers, and the features of a single example can hence be
grouped together as a vector. Usually, it is not one particular feature that determines the
label of an example, but it is rather the combination of them. For example, an email message
that is sent to many different recipients and that contains five references to pharmaceutical
products can probably be classified as spam, whereas a message that has multiple recipients
and includes ten of the main recipient’s friends should probably be classified as non-spam. A
classification algorithm attempts to make sense of the provided features and labels, and uses
these to classify new, unlabeled, examples.
Clearly, the design of features is crucial and depends on the problem at hand. Features should
be chosen that have predictive power, and hence the design uses prior knowledge about the
problem. In many cases it may not be immediately clear how to choose the features.

11.2 Classifying documents using separating hyperplanes


In automated text analysis, a basic problem is to determine the language in which a given text document (e.g., a newspaper article, or an email) is written. In this chapter, we will
show how linear optimization can be used to classify text documents into two languages,
English and Dutch.
Suppose that we have a (potentially very large) set D of text documents, some of which are
written in English and the others are written in Dutch. For each document d in D, we
calculate m so-called features, denoted by f_1^d, . . . , f_m^d. We will use as features the relative
frequency of each letter, i.e., the number of times each letter appears divided by the total
number of letters in the document. Of course, one can think of many more features that
may be relevant, e.g., the average word length, the relative frequency of words of length
1, 2, 3, . . ., or the relative frequency of certain letter combinations. We will restrict ourselves
to the relative frequency of individual letters. For simplicity, we will also treat capitalized and lowercase letters equally. So, in our case we have that m = 26. Thus, for each document,
we construct an m-dimensional vector containing these features. Such a vector is called a
feature vector. Since we have |D| documents, we have |D| of these m-dimensional feature
vectors. For each document d ∈ D, let f^d (∈ R^m) be the feature vector for d. For any subset D′ ⊆ D, define F(D′) = { f^d | d ∈ D′ }.
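As an illustration, the feature computation just described can be sketched in a few lines of Python (the function name `feature_vector` is ours; the book itself gives no code at this point):

```python
import string

def feature_vector(text: str) -> list[float]:
    """Return the m = 26 relative letter frequencies of a document.

    Capitalized and lowercase letters are treated equally, as in the text.
    """
    text = text.lower()
    counts = [0] * 26
    total = 0
    for ch in text:
        if ch in string.ascii_lowercase:
            counts[string.ascii_lowercase.index(ch)] += 1
            total += 1
    # Relative frequency: occurrences of each letter divided by the
    # total number of letters (non-letter characters are ignored).
    return [c / total for c in counts] if total > 0 else [0.0] * 26

f = feature_vector("The quick brown fox jumps over the lazy dog.")
assert abs(sum(f) - 1.0) < 1e-12  # relative frequencies sum to one
```

Note that Table 11.1 reports these frequencies as percentages, whereas the sketch returns fractions.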
As an example, we have taken 31 English and 39 Dutch newspaper articles from the internet
and calculated the letter frequencies. Table 11.1 shows the relative frequencies of the 26
letters for six English and six Dutch newspaper articles. The columns of the table are the
twelve corresponding feature vectors f d (d = 1, . . . , 12).
Our goal is to construct a function g : Rm → R, a so-called classifier (also called a support
vector machine), which, for each document d ∈ D, assigns to the feature vector f d a real
number that will serve as a tool for deciding in which language document d was written.
The interpretation of the value g(f d ) is as follows. For any document d ∈ D, if g(f d ) > 0,
then we conclude that the text was written in English; if g(f d ) < 0, then we conclude that
the text was written in Dutch.
To construct such a classifier, we assume that for a small subset of the documents, the lan-
guage is known in advance (for example, the articles have been read and classified by a
person). We partition this subset into two subsets, L and V . The subset L is called the
learning set, and it will be used to construct a classifier. The subset V is called the validation
set, and it will be used to check that the classifier constructed from the learning set correctly
predicts the language of given documents. If the classifier works satisfactorily for the valida-
tion set, then it is accepted as a valid classifier, and it may be used to determine the language
of the documents that are not in L ∪ V (i.e., for the documents for which the language is
currently unknown). Let L1 be the subset of L consisting of the documents known to be written in English and, similarly, let L2 be the subset of L consisting of the documents known to be written in Dutch. Define V1
and V2 analogously. In our example of newspaper articles, we will use the data in Table 11.1
as the learning set.

Articles written in English Articles written in Dutch


Letter 1 2 3 4 5 6 7 8 9 10 11 12
A 10.40 9.02 9.48 7.89 8.44 8.49 8.68 9.78 12.27 7.42 8.60 10.22
B 1.61 1.87 1.84 1.58 1.41 1.55 2.03 1.08 0.99 1.82 1.79 2.62
C 2.87 2.95 4.86 2.78 3.85 3.13 0.80 1.37 1.10 1.03 2.15 1.83
D 4.29 3.52 3.16 4.18 3.91 5.04 5.50 6.05 6.13 7.11 5.97 6.46
E 12.20 11.75 12.69 11.24 11.88 11.82 18.59 17.83 17.74 17.85 15.29 18.25
F 2.12 2.31 2.97 2.00 2.55 1.90 0.76 0.89 0.77 1.26 1.19 1.48
G 2.23 1.67 2.17 2.16 1.79 2.58 2.75 2.99 2.96 2.21 2.39 4.10
H 4.45 4.36 4.25 5.56 4.45 5.43 1.69 3.29 2.96 1.66 2.27 2.62
I 9.20 7.61 7.59 7.62 8.33 8.25 5.25 6.89 6.24 8.14 7.29 5.68
J 0.13 0.22 0.14 0.25 0.22 0.40 1.35 1.30 0.88 2.05 1.55 1.57
K 0.75 0.64 0.57 0.91 0.33 0.91 3.90 1.90 1.86 2.13 1.91 2.10
L 4.05 3.28 4.81 4.74 4.31 3.77 4.11 4.44 3.50 3.40 3.82 3.76
M 2.41 3.46 2.55 3.10 2.74 2.58 2.50 2.21 3.40 3.00 1.67 1.75
N 7.03 7.70 6.60 7.02 7.16 7.34 10.63 8.80 10.30 11.45 9.32 11.53
O 5.85 6.82 8.11 6.74 6.76 6.54 6.48 5.51 4.27 4.82 4.78 4.28
P 1.53 2.51 1.79 1.93 2.41 1.43 1.31 1.23 1.42 1.11 2.51 0.52
Q 0.11 0.02 0.14 0.12 0.19 0.08 0.00 0.05 0.00 0.00 0.00 0.09
R 6.44 7.00 6.04 5.82 6.35 6.07 7.75 6.24 5.04 6.24 6.69 6.20
S 7.35 7.68 5.28 7.22 6.35 6.66 3.43 4.66 3.18 5.13 6.57 3.14
T 8.50 8.10 8.92 9.03 8.98 7.61 5.42 6.77 7.12 4.82 6.33 5.94
U 2.25 3.17 2.12 2.87 2.88 3.09 1.74 1.32 1.20 1.58 2.75 1.40
V 0.80 1.01 1.08 0.89 1.22 0.95 3.30 2.56 3.61 4.19 2.75 2.45
W 1.26 1.21 1.51 1.61 1.47 2.62 0.97 1.51 1.53 0.55 1.31 0.87
X 0.05 0.22 0.09 0.05 0.19 0.28 0.04 0.00 0.00 0.00 0.00 0.00
Y 1.72 1.87 1.04 2.56 1.71 1.43 0.08 0.14 0.11 0.00 0.24 0.09
Z 0.40 0.02 0.19 0.14 0.14 0.08 0.93 1.18 1.42 1.03 0.84 1.05

Table 11.1: Relative letter frequencies (in percentages) of several newspaper articles.

We will restrict our attention to linear classifiers, i.e., the classifier g is restricted to have the form

    g(f) = Σ_{j=1}^m w_j f_j + b = w^T f + b   for f = [f_1 . . . f_m]^T ∈ R^m,

where w (∈ R^m \ {0}) is called the weight vector of the classifier, b (∈ R) is the intercept, and f is any feature vector. Note that we exclude the possibility that w = 0, because the
corresponding classifier does not take into account any feature, and therefore is not of any
use to predict the language of a document. Our goal is to construct a weight vector w and
an intercept b such that:

d ∈ L1 ⟹ w^T f^d + b > 0, and
d ∈ L2 ⟹ w^T f^d + b < 0.        (11.1)
Linear classifiers have the following geometric interpretation. For any w ∈ R^m \ {0} and b ∈ R, define the hyperplane H(w, b) = { f ∈ R^m | w^T f + b = 0 }, and the two (strict) halfspaces H^+(w, b) = { f ∈ R^m | w^T f + b > 0 } and H^−(w, b) = { f ∈ R^m | w^T f + b < 0 } corresponding to H(w, b). Hence, (11.1) is equivalent to:

F(L1) = { f^d | d ∈ L1 } ⊆ H^+(w, b), and
F(L2) = { f^d | d ∈ L2 } ⊆ H^−(w, b).        (11.2)

Figure 11.1: Separable learning set with 40 documents. The solid and the dashed lines are separating hyperplanes. One of the feature vectors is marked with (∗).

Figure 11.2: Nonseparable learning set with 40 documents. The convex hulls of the learning sets intersect.

So, we want to construct a hyperplane in Rm such that the feature vectors corresponding
to documents in L1 lie in the halfspace H + (w, b), and the vectors corresponding to L2 in
H − (w, b).
If there exist a weight vector w and an intercept b such that the conditions of (11.2) are
satisfied, then F (L1 ) and F (L2 ) are said to be separable; they are called nonseparable oth-
erwise. The corresponding hyperplane H(w, b) is called a separating hyperplane for F (L1 )
and F (L2 ), and the function wT f d + b is called a separator for F (L1 ) and F (L2 ); see also
Appendix D. We make the following observations (see Exercise 11.8.2):
▸ H^+(−w, −b) = H^−(w, b) for w ∈ R^m \ {0}, b ∈ R.
▸ H(λw, λb) = H(w, b) for w ∈ R^m \ {0}, b ∈ R, and λ ≠ 0.
▸ If w and b define a separating hyperplane for F(L1) and F(L2) such that F(L1) ⊆ H^+(w, b) and F(L2) ⊆ H^−(w, b), then we also have that conv(F(L1)) ⊆ H^+(w, b) and conv(F(L2)) ⊆ H^−(w, b); therefore, w and b also define a separating hyperplane for conv(F(L1)) and conv(F(L2)).
Note that even for a small learning set L, it is not clear beforehand whether or not F(L1)
and F (L2 ) are separable. So the first question that needs to be addressed is: does there
exist a separating hyperplane for F (L1 ) and F (L2 )? Figure 11.1 shows an example of a
separable learning set with (m =) 2 features. The squares correspond to the feature vectors
in F (L1 ), and the circles to the feature vectors in F (L2 ). Also, the convex hulls of square
points and the circle points are shown. The solid and the dashed lines represent two possible
hyperplanes. Figure 11.2 shows a learning set which is not separable.
Figure 11.1 illustrates another important fact. Suppose that we discard feature f2 and only
consider feature f1. Let F′(L1) (⊂ R) and F′(L2) (⊂ R) be the sets of feature ‘vectors’ obtained by discarding feature f2. Then, the vectors in F′(L1) and F′(L2) are one-dimensional and can be plotted on a line; see Figure 11.3. (This graph can also be constructed
by moving all points in Figure 11.1 straight down onto the horizontal axis.) A hyperplane

Figure 11.3: The learning set of Figure 11.1 after discarding feature f2.

in one-dimensional Euclidean space is a point. Hence, the one-dimensional learning set is separable if and only if there exists a point P on the line such that all vectors in F′(L1) are strictly to the left of P, and all vectors in F′(L2) are strictly to the right of P. From
this figure, it is clear that such a point does not exist. Hence, the learning set has become
nonseparable.
This fact can also be seen immediately from Figure 11.1 as follows. Discarding feature f2 is
equivalent to requiring that the weight w2 assigned to f2 is zero. This, in turn, is equivalent
to requiring that the separating hyperplane is ‘vertical’ in the figure. Clearly, there is no
vertical separating hyperplane for the learning set drawn in Figure 11.1. The same holds
when discarding feature f1 and only considering f2 , which is equivalent to requiring that
the separating hyperplane is horizontal.
In general, the following observations hold (see Exercise 11.8.1):
▸ If a learning set with a certain set of features is separable, then adding a feature keeps the learning set separable. However, removing a feature may make the learning set nonseparable.
▸ If a learning set with a certain set of features is nonseparable, then removing a feature keeps the learning set nonseparable. However, adding a feature may make the learning set separable.
In the context of the English and Dutch documents, consider the frequencies of the letters
A, B, and N. Figure 11.4 shows scatter plots of the letter frequencies of the pairs (A, B),
(A, N), and (B, N). It is clear from the plots that if the letter frequencies of only two of
the three letters A, B, N are used as features, then the corresponding learning sets are not
separable. However, if the letter frequencies of all three letters are taken into account, the
learning set is separable. Therefore, a question to consider is: given a learning set, what is a
minimal set of features that makes the learning set separable?

11.3 LO-model for finding separating hyperplanes


Constructing a separating hyperplane for the learning set L1 ∪ L2 can be done by designing
and solving an LO-model as follows. The decision variables of the LO-model will be the
entries w1 , . . . , wm of the weight vector w, and the intercept b. Since the value of the
classifier should be strictly positive if document d was written in English (i.e., d ∈ L1 ), and
strictly negative if document d was written in Dutch (i.e., d ∈ L2 ), we have the constraints:

w^T f^d + b > 0    for d ∈ L1, and
w^T f^d + b < 0    for d ∈ L2.        (11.3)

Figure 11.4: Scatter plots of relative letter frequencies (in percentages): (a) letters A and B; (b) letters A and N; (c) letters B and N. The squares represent the vectors in F(L1) and the circles are the vectors in F(L2). Here, L1 is the set of English documents, and L2 is the set of Dutch documents.

Because these inequalities are strict inequalities, they cannot be used in an LO-model. To
circumvent this ‘limitation’, we will show that it suffices to use the following ‘≥’ and ‘≤’
inequalities instead:

w^T f^d + b ≥ 1     for d ∈ L1, and
w^T f^d + b ≤ −1    for d ∈ L2.        (11.4)

Clearly, the solution set (in terms of w and b) of (11.4) is in general a strict subset of the
solution set of (11.3). However, the sets of hyperplanes defined by (11.3) and (11.4) coincide.
To be precise, let H1 = {H(w, b) | w and b satisfy (11.3)}, i.e., H1 is the collection of hy-
perplanes defined by the solutions of (11.3). Let H2 = {H(w, b) | w and b satisfy (11.4)}.
We claim that H1 = H2 . It is easy to check that H2 ⊆ H1 . To see that H1 ⊆ H2 ,
take any w and b that satisfy (11.3). Then, because L1 and L2 are finite sets, there exists
ε > 0 such that w^T f^d + b ≥ ε for d ∈ L1 and w^T f^d + b ≤ −ε for d ∈ L2. Define ŵ = (1/ε)w and b̂ = (1/ε)b. Then, it is straightforward to check that ŵ and b̂ satisfy (11.4) and that H(ŵ, b̂) = H(w, b), as required.
From now on, we will only consider the inequalities of (11.4). For each w ∈ Rm \ {0}
and b ∈ R, define the following halfspaces:

H^{+1}(w, b) = { f ∈ R^m | w^T f + b ≥ 1 }, and
H^{−1}(w, b) = { f ∈ R^m | w^T f + b ≤ −1 }.
Then, (11.4) is equivalent to:

F (L1 ) ⊆ H +1 (w, b), and F (L2 ) ⊆ H −1 (w, b). (11.5)

If the halfspaces H^{+1}(w, b) and H^{−1}(w, b) satisfy the conditions of (11.5), then the set { f ∈ R^m | −1 ≤ w^T f + b ≤ 1 } is called a separation for F(L1) and F(L2), because it ‘separates’ F(L1) from F(L2). Figure 11.5 illustrates this concept.



Figure 11.5: Separation for a learning set. The dashed lines are w^T f + b = 1 (bounding H^{+1}(w, b)) and w^T f + b = −1 (bounding H^{−1}(w, b)); the area between the dashed lines is the separation.

It follows from the discussion above that, in order to find a separating hyperplane for F (L1 )
and F (L2 ), the system of inequalities (11.4) needs to be solved. This can be done by solving
the following LO-model:

min  0
s.t. w_1 f_1^d + . . . + w_m f_m^d + b ≥ 1     for d ∈ L1
     w_1 f_1^d + . . . + w_m f_m^d + b ≤ −1    for d ∈ L2        (11.6)
     w_1, . . . , w_m, b free.
In this LO-model, the decision variables are the weights w_1, . . . , w_m and the intercept b of the classifier. The values of f_i^d with i ∈ {1, . . . , m} and d ∈ L1 ∪ L2 are parameters of the model.
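Section 11.7 gives GMPL code for this model. As an alternative sketch, model (11.6) can be handed to any general LP solver; here we assume Python with `scipy.optimize.linprog` is available, and we use a tiny made-up two-feature learning set (the vectors in `F1` and `F2` are illustrative, not the newspaper data):

```python
import numpy as np
from scipy.optimize import linprog

F1 = np.array([[2.0, 2.0], [3.0, 1.0]])   # feature vectors f^d for d in L1 (toy data)
F2 = np.array([[0.0, 0.0], [-1.0, 1.0]])  # feature vectors f^d for d in L2 (toy data)
m = F1.shape[1]

# Decision variables x = (w_1, ..., w_m, b), all free.
# linprog wants A_ub @ x <= b_ub, so:
#   w^T f^d + b >= 1   becomes  -(w^T f^d) - b <= -1   for d in L1,
#   w^T f^d + b <= -1  stays     (w^T f^d) + b <= -1   for d in L2.
A_ub = np.vstack([np.hstack([-F1, -np.ones((len(F1), 1))]),
                  np.hstack([ F2,  np.ones((len(F2), 1))])])
b_ub = -np.ones(len(F1) + len(F2))
c = np.zeros(m + 1)  # the zero objective of (11.6): any feasible point is optimal

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (m + 1))
w, b = res.x[:m], res.x[m]
print(res.status, w, b)  # status 0 means a separating hyperplane was found
```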
Once a classifier (equivalently, a separating hyperplane) for the learning set L1 ∪ L2 has
been constructed by solving the LO-model (11.6), this classifier may be used to predict
the language of any given document d ∈ D. This prediction is done as follows. Let w_1^∗, . . . , w_m^∗, b^∗ be an optimal solution of model (11.6). This optimal solution defines the classifier value w_1^∗ f_1^d + . . . + w_m^∗ f_m^d + b^∗ for document d, based on the feature values of
that document. If the classifier value is ≥ 1, then the document is classified as an English
document; if the value is ≤ −1, then the document is classified as a Dutch document. If
the value lies between −1 and 1, then the classifier does not clearly determine the language
of the document. In that case, the closer the value lies to 1, the more confident we can be
that d is an English document. Similarly, the closer the value lies to −1, the more confident
we can be that d is a Dutch document.
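This three-way decision rule can be written as a small function (a sketch; the weights and feature values in the example call are made up, not an optimal solution of the model):

```python
def predict(w, b, f):
    """Classify a document from its feature vector f, given weights w and intercept b."""
    value = sum(wi * fi for wi, fi in zip(w, f)) + b
    if value >= 1:
        return "English", value
    if value <= -1:
        return "Dutch", value
    # Between -1 and 1 the classifier is inconclusive; the sign of the
    # value still indicates which language the classifier leans towards.
    return "inconclusive", value

print(predict([1.0, -2.0], 0.5, [2.0, 0.25]))  # ('English', 2.0)
```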
Example 11.3.1. Consider the learning set of Table 11.1, where L1 is the set of the six newspaper
articles written in English, and L2 is the set of the six newspaper articles written in Dutch. Solving
model (11.6) using a computer package (e.g., the online solver for this book) yields the following optimal

solution:

b^∗ = 0,  w_8^∗ = 0.296,  w_15^∗ = 0.116,  w_17^∗ = 1.978,  w_21^∗ = −0.163,  w_26^∗ = −2.116.

(See Section 11.7 for the GMPL code for this model.) All other decision variables have value zero at this optimal solution. The corresponding classifier is:

g(f) = 0.296 f_8 + 0.116 f_15 + 1.978 f_17 − 0.163 f_21 − 2.116 f_26.

The weights correspond to the letters H, O, Q, U, and Z, respectively. Thus, the classifier bases its
calculations only on the relative frequencies of the letters H, O, Q, U, and Z. Note that the weight w_17^∗ assigned to the letter Q is positive and relatively large compared to the other positive weights. This means that, for any given document d ∈ D, the expression w_1^∗ f_1^d + . . . + w_m^∗ f_m^d + b^∗ tends to
be more positive if the document contains relatively many occurrences of the letter Q. This means that
such a document is more likely to be classified as an English newspaper article. On the other hand, the weight w_26^∗ assigned to the letter Z is negative, and so a document containing relatively many
occurrences of the letter Z is likely to be classified as a Dutch newspaper article.

11.4 Validation of a classifier


Recall that the set of documents for which the language is known is partitioned into two
parts, namely the learning set L and the validation set V . Before using a classifier to predict
the language of any document d ∈ D, it is good practice to validate it by comparing its
predictions to the expected results for the validation set V . This validation step is done
in order to check that the classifier found by the LO-model makes sensible predictions for
documents that are not in the learning set. We illustrate this process with an example.
Example 11.4.1. For the validation of the classifier found in Example 11.3.1, we have a validation
set consisting of twenty-five English and thirty-three Dutch newspaper articles, i.e., |V1 | = 25 and
|V2 | = 33. Table 11.2 lists the relevant letter frequencies, the classifier value w_1^∗ f_1^d + . . . + w_m^∗ f_m^d + b^∗, and the language prediction found for a number of documents d in L ∪ V . The row ‘Predicted
b , and the language prediction found for a number of documents d in L ∪ V . The row ‘Predicted
language’ indicates the language of the article predicted by the classifier. A question mark indicates that
the classifier is inconclusive about the language; in that case the sign of the classifier value determines
whether the classifier leans towards one of the two languages.
The documents 1, 2, 7, and 8, which are in the learning set, are correctly predicted. This should come
as no surprise, as the constraints of model (11.6) ensure this fact. For the validation set, the results are
not as clear-cut. The classifier correctly predicts the language of most newspaper articles in the validation
set; these cases have been omitted from Table 11.2 (except article 21). The classifier is inconclusive about
articles 30 and 66, but at least the sign of the classifier value is correct, meaning that the classifier leans
towards the correct language. However, for articles 57 and 67, even the sign of the classifier is wrong.

The above example illustrates the fact that the validation step may reveal problems with the
classifier constructed using model (11.6). One way to improve the classifier is to increase

English newspaper articles Dutch newspaper articles


Letter 1 2 21 30 7 8 57 66 67
H 4.45 4.36 5.14 4.47 1.69 3.29 1.68 3.40 2.33
O 5.85 6.82 8.00 7.35 6.48 5.51 6.15 3.93 5.37
Q 0.11 0.02 0.06 0.00 0.00 0.05 0.00 0.00 0.00
U 2.25 3.17 2.69 2.32 1.74 1.32 2.24 1.17 2.15
Z 0.40 0.02 0.00 0.46 0.93 1.18 0.34 0.85 0.18
Classifier value 1.00 1.56 2.13 0.82 −1.00 −1.00 0.13 −0.53 0.58
Predicted language Eng Eng Eng Eng? Dut Dut Eng? Dut? Eng?

Table 11.2: Validation results for the classifier. The articles 1, 2, 7, and 8 are in the learning set; the articles
21, 30, 57, 66, and 67 are in the validation set. The question marks in the row ‘Predicted
language’ indicate that the classifier is inconclusive about the language.

the learning set. In the example, we used only six documents per language. In real-life
applications the learning set is usually taken to be much larger.
In the next sections, we present another way to improve the classification results. Note that
the objective function of model (11.6) is the zero function, which means that any feasible
solution of the model is optimal. So, the objective is in a sense ‘redundant’, because it can be
replaced by maximizing or minimizing any constant objective function. In fact, in general,
the model has multiple optimal solutions. Hence, there are in general multiple separating
hyperplanes. Figure 11.1 shows two hyperplanes corresponding to two feasible solutions,
namely a dotted line and a solid line. In the next section, we study the ‘quality’ of the
hyperplanes.

11.5 Robustness of separating hyperplanes; separation width


The LO-model (11.6) generally has multiple optimal solutions. In fact, since the objective
function is the zero function, any feasible solution (i.e., any choice of separating hyperplane)
is optimal. Recall that the goal of constructing a classifier is to use this function to auto-
matically classify the documents that are not in the learning set into English and Dutch
documents.
Figure 11.1 shows two hyperplanes, H1 and H2, corresponding to two feasible solutions. Suppose that we find as an optimal solution the hyperplane H1 (drawn as a solid line), and suppose that we encounter a text document d̂ (∈ D) whose feature vector f(d̂) = [ f_1(d̂)  f_2(d̂) ]^T is very close to the feature vector marked with (∗), but just on the other side of H1. Based on H1, we would conclude that the new text document is written in Dutch. However, the feature vector f(d̂) is much closer to a vector corresponding to an English document than to a vector corresponding to a Dutch document. So it makes more sense to conclude that document d̂ was written in English. Observe also that hyperplane H2 does not suffer as much from this problem. In other words, hyperplane H2 is more robust with respect to perturbations than hyperplane H1.

To measure the robustness of a given separating hyperplane, we calculate its so-called separation width. Informally speaking, the separation width is the m-dimensional generalization of the width of the band between the dashed lines in Figure 11.5. For given w ∈ R^m \ {0} and b ∈ R, the separation width of the hyperplane H(w, b) = { f ∈ R^m | w^T f + b = 0 } is defined as the distance between the halfspaces H^{+1}(w, b) and H^{−1}(w, b), i.e.,

width(w, b) = min { ‖f − f′‖ | f ∈ H^{+1}(w, b), f′ ∈ H^{−1}(w, b) },

where ‖f − f′‖ is the Euclidean distance between the vectors f and f′ (∈ R^m). Note that, for any w ∈ R^m \ {0} and b ∈ R, width(w, b) is well-defined because the minimum in the right hand side of the above expression is attained. In fact, the following theorem gives an explicit formula for the separation width.

Theorem 11.5.1.
For any w ∈ R^m \ {0} and b ∈ R, it holds that width(w, b) = 2/‖w‖.

Proof. Take any point f̂ ∈ R^m such that w^T f̂ + b = −1. Note that f̂ ∈ H^{−1}(w, b). Define f̂′ = f̂ + w^∗, with w^∗ = (2/‖w‖²) w. Then, we have that ‖w^∗‖ = 2/‖w‖. It follows that:

w^T f̂′ + b = w^T ( f̂ + (2/‖w‖²) w ) + b = w^T f̂ + b + 2 w^T w/‖w‖² = −1 + 2 = 1,

where we have used the fact that w^T w = ‖w‖². Therefore, f̂′ ∈ H^{+1}(w, b). So, we have that f̂ ∈ H^{−1}(w, b) and f̂′ ∈ H^{+1}(w, b). Hence, width(w, b) ≤ ‖f̂ − f̂′‖ = ‖w^∗‖ = 2/‖w‖.

To show that width(w, b) ≥ 2/‖w‖, take any f̂ ∈ H^{+1}(w, b) and f̂′ ∈ H^{−1}(w, b). By the definitions of H^{+1}(w, b) and H^{−1}(w, b), we have that:

w^T f̂ + b ≥ 1, and w^T f̂′ + b ≤ −1.

Subtracting the second inequality from the first one gives the inequality w^T ( f̂ − f̂′ ) ≥ 2. The cosine rule (see Appendix B) implies that:

cos θ = w^T ( f̂ − f̂′ ) / ( ‖w‖ ‖f̂ − f̂′‖ ) ≥ 2 / ( ‖w‖ ‖f̂ − f̂′‖ ),

where θ is the angle between the vectors w and f̂ − f̂′. Since cos θ ≤ 1, we have that:

2 / ( ‖w‖ ‖f̂ − f̂′‖ ) ≤ 1.

Rearranging, this yields that ‖f̂ − f̂′‖ ≥ 2/‖w‖ for all f̂ ∈ H^{+1}(w, b) and all f̂′ ∈ H^{−1}(w, b). Hence, also min { ‖f − f′‖ | f ∈ H^{+1}(w, b), f′ ∈ H^{−1}(w, b) } ≥ 2/‖w‖, i.e., width(w, b) ≥ 2/‖w‖, as required. □

We conclude that the separation direction is determined by the direction of the vector w
and that, according to Theorem 11.5.1, the separation width is inversely proportional to
the length of w. Figure 11.1 depicts two separating hyperplanes. From this figure, we can

see that the separation width corresponding to hyperplane H1 is much smaller than the separation width corresponding to hyperplane H2.
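The formula of Theorem 11.5.1 can be checked numerically by reproducing the construction from its proof; the particular w and b below are arbitrary illustrative values:

```python
import numpy as np

w = np.array([3.0, 4.0])  # ||w|| = 5, so the theorem predicts width 2/5 = 0.4
b = -2.0

# A point on the boundary w^T f + b = -1 of H^{-1}(w, b), obtained by
# scaling w appropriately.
f_hat = ((-1.0 - b) / (w @ w)) * w
assert abs(w @ f_hat + b - (-1.0)) < 1e-12

# The proof's companion point f_hat2 = f_hat + (2/||w||^2) w lies on the
# boundary w^T f + b = +1, at distance 2/||w|| from f_hat.
f_hat2 = f_hat + (2.0 / (w @ w)) * w
assert abs(w @ f_hat2 + b - 1.0) < 1e-12
assert abs(np.linalg.norm(f_hat2 - f_hat) - 2.0 / np.linalg.norm(w)) < 1e-12
print(2.0 / np.linalg.norm(w))  # 0.4
```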

11.6 Models that maximize the separation width


In order to find a hyperplane that is as robust as possible with respect to the separation, the values of the weight vector w and the intercept b should be chosen so as to maximize the separation width. According to Theorem 11.5.1, the separation width is 2/‖w‖. Note that minimizing ‖w‖ yields the same optimal values for w and b as maximizing 2/‖w‖. Hence,
minimizing kwk yields the same optimal values for w and b as maximizing kwk 2
. Hence,
it suffices to solve the following optimization model in order to find a maximum-width
separation for the learning set:

min  ‖w‖
s.t. w^T f^d + b ≥ 1     for d ∈ L1
     w^T f^d + b ≤ −1    for d ∈ L2        (11.7)
     b, w_j free for j = 1, . . . , m.
The objective function ‖w‖ = √(w_1² + . . . + w_m²) is obviously a nonlinear function of the decision variables w_1, . . . , w_m, so that (11.7) is a nonlinear optimization model. Such models
may be hard to solve, especially when the number of documents (and, hence, the number
of variables) is very large. Therefore, we look for a linear objective function. In general,
this will result in a classifier of less quality, i.e., the hyperplane corresponding to the result-
ing (w, b) has smaller separation width than the optimal hyperplane corresponding to an
optimal solution (w∗ , b∗ ) of (11.7).
The objective function of the above (nonlinear) optimization model is the Euclidean norm
(see Appendix B.1) of the vector w. A generalization of the Euclidean norm is the so-called p-norm. The p-norm of a vector w = [w_1 . . . w_m]^T ∈ R^m is denoted and defined as (for integer p ≥ 1):

‖w‖_p = ( Σ_{i=1}^m |w_i|^p )^{1/p}.
Clearly, the Euclidean norm corresponds to the special case p = 2, i.e., the Euclidean
norm is the 2-norm. Since the 2-norm is a nonlinear function, it cannot be included in an
LO-model. Below, however, we will see that two other choices for p lead to LO-models,
namely the choices p = 1 and p = ∞. In the remainder of this section, we consecutively
discuss LO-models that minimize the 1-norm and the ∞-norm of the weight vector.

11.6.1 Minimizing the 1-norm of the weight vector


In this section, we consider minimizing the 1-norm of the weight vector. The 1-norm of a vector w ∈ Rᵐ is defined as:

‖w‖₁ = ∑ᵢ₌₁ᵐ |wᵢ|.
So, we replace the objective min √(w₁² + . . . + wₘ²) of model (11.7) by the objective:

min ∑ⱼ₌₁ᵐ |wⱼ|.
The function ∑ⱼ₌₁ᵐ |wⱼ| is still not a linear function, but we already saw in Section 1.5.2 how to deal with absolute values in the context of linear optimization. In order to turn the objective into a linear objective, as in Section 1.5.2, define wⱼ = wⱼ⁺ − wⱼ⁻ for j = 1, . . . , m. Hence, |wⱼ| = wⱼ⁺ + wⱼ⁻, with wⱼ⁺ ≥ 0 and wⱼ⁻ ≥ 0. This leads to the following LO-model:

min  ∑ⱼ₌₁ᵐ (wⱼ⁺ + wⱼ⁻)
s.t. ∑ⱼ₌₁ᵐ (wⱼ⁺ − wⱼ⁻) fⱼᵈ + b ≥ 1    for d ∈ L1
     ∑ⱼ₌₁ᵐ (wⱼ⁺ − wⱼ⁻) fⱼᵈ + b ≤ −1   for d ∈ L2          (11.8)
     wⱼ⁺ ≥ 0, wⱼ⁻ ≥ 0 for j = 1, . . . , m; b free.
The constraints are still linear, because the values of the fⱼᵈ's are parameters of the model, and hence the left-hand sides of the constraints are linear functions of the decision variables b, wⱼ⁺, and wⱼ⁻ (j = 1, . . . , m).
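Model (11.8) can be handed to any LO-solver. The sketch below does this with SciPy's `linprog`; the function name `fit_l1_svm`, the variable ordering [w⁺, w⁻, b], and the toy data in the usage note are our own choices, not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1_svm(F1, F2):
    """Solve model (11.8): minimize the 1-norm of w subject to
    w^T f + b >= 1 for the rows f of F1 and w^T f + b <= -1 for the
    rows of F2. Variable order: [wp (m entries), wm (m entries), b]."""
    F1, F2 = np.asarray(F1, float), np.asarray(F2, float)
    m = F1.shape[1]
    c = np.concatenate([np.ones(2 * m), [0.0]])        # sum of wp_j + wm_j
    # linprog uses A_ub x <= b_ub, so the >= constraints are negated.
    A1 = np.hstack([-F1, F1, -np.ones((len(F1), 1))])  # -(w^T f + b) <= -1
    A2 = np.hstack([F2, -F2, np.ones((len(F2), 1))])   #   w^T f + b  <= -1
    b_ub = -np.ones(len(F1) + len(F2))
    bounds = [(0, None)] * (2 * m) + [(None, None)]    # b is free
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=b_ub, bounds=bounds)
    if not res.success:
        raise ValueError("learning set is not separable")
    return res.x[:m] - res.x[m:2 * m], res.x[-1]       # (w, b)
```

For a separable learning set, the returned classifier g(f) = wᵀf + b is at least 1 on the first class and at most −1 on the second; for example, the made-up feature rows F1 = [[2, 0], [3, 1]] and F2 = [[0, 2], [−1, 1]] are separated this way.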

11.6.2 Minimizing the ∞-norm of the weight vector


We now consider minimizing the ∞-norm of the weight vector. Mathematically, the ∞-
norm of a vector is defined as the limit of its p-norm as p goes to infinity. The following
theorem states that this limit is well-defined, and it is in fact equal to the entry with the
largest absolute value.

Theorem 11.6.1.
Let w = [w₁ . . . wₘ]ᵀ be a vector. Then, lim_{p→∞} ‖w‖ₚ = max{|w₁|, . . . , |wₘ|}.

Proof. Define M = max{|wᵢ| | i = 1, . . . , m}, and let p be any positive integer. We have that:

‖w‖ₚ = ( ∑ᵢ₌₁ᵐ |wᵢ|ᵖ )^(1/p) ≥ (Mᵖ)^(1/p) = M.

On the other hand, we have that:

‖w‖ₚ = ( ∑ᵢ₌₁ᵐ |wᵢ|ᵖ )^(1/p) ≤ (mMᵖ)^(1/p) = m^(1/p) M.

It follows that M ≤ ‖w‖ₚ ≤ m^(1/p) M. Letting p → ∞ in this expression, we find that M ≤ lim_{p→∞} ‖w‖ₚ ≤ M, which is equivalent to lim_{p→∞} ‖w‖ₚ = M, as required. □
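Theorem 11.6.1 can also be illustrated numerically: for a fixed vector, the p-norms shrink towards the largest absolute entry as p grows. (The example vector below is arbitrary.)

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])
print(np.max(np.abs(w)))            # 4.0, the limit value max{|w_i|}
for p in (1, 2, 8, 32, 128):
    print(p, np.linalg.norm(w, p))  # non-increasing in p, approaching 4.0
```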

So, according to Theorem 11.6.1, we should consider the following objective:

min max{|w₁|, . . . , |wₘ|}.

The objective function max{|w₁|, . . . , |wₘ|} is clearly not linear. However, it can be incorporated in an LO-model by using the following ‘trick’. First, a new decision variable x is introduced, which will represent max{|w₁|, . . . , |wₘ|}. The objective is then replaced by ‘min x’, and the following constraints are added:

|wⱼ| ≤ x for j = 1, . . . , m.

Because the value of the variable x is minimized at any optimal solution, the optimal value x∗ will be as small as possible, while satisfying x∗ ≥ |wⱼ∗| for j = 1, . . . , m. This means that in fact x∗ = max{|w₁∗|, . . . , |wₘ∗|} at any optimal solution. Combining this ‘trick’ with the treatment of absolute values as in model (11.8), we find the following LO-model:
min  x
s.t. ∑ⱼ₌₁ᵐ (wⱼ⁺ − wⱼ⁻) fⱼᵈ + b ≥ 1    for d ∈ L1
     ∑ⱼ₌₁ᵐ (wⱼ⁺ − wⱼ⁻) fⱼᵈ + b ≤ −1   for d ∈ L2          (11.9)
     wⱼ⁺ + wⱼ⁻ ≤ x for j = 1, . . . , m
     x ≥ 0, wⱼ⁺ ≥ 0, wⱼ⁻ ≥ 0 for j = 1, . . . , m; b free.

The values of the fⱼᵈ's are parameters of the model, and the decision variables are b, x, wⱼ⁺, and wⱼ⁻ for j = 1, . . . , m.
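Model (11.9) can be solved in the same way as (11.8). A sketch with SciPy's `linprog` follows; the function name `fit_linf_svm` and the variable ordering [w⁺, w⁻, x, b] are our own implementation choices.

```python
import numpy as np
from scipy.optimize import linprog

def fit_linf_svm(F1, F2):
    """Solve model (11.9): minimize x = max_j |w_j| subject to the
    classification constraints. Variable order: [wp (m), wm (m), x, b]."""
    F1, F2 = np.asarray(F1, float), np.asarray(F2, float)
    m, n1, n2 = F1.shape[1], len(F1), len(F2)
    c = np.zeros(2 * m + 2)
    c[2 * m] = 1.0                                       # minimize x
    A1 = np.hstack([-F1, F1, np.zeros((n1, 1)), -np.ones((n1, 1))])
    A2 = np.hstack([F2, -F2, np.zeros((n2, 1)), np.ones((n2, 1))])
    # wp_j + wm_j - x <= 0 for every feature j:
    Ax = np.hstack([np.eye(m), np.eye(m), -np.ones((m, 1)), np.zeros((m, 1))])
    b_ub = np.concatenate([-np.ones(n1 + n2), np.zeros(m)])
    bounds = [(0, None)] * (2 * m + 1) + [(None, None)]  # only b is free
    res = linprog(c, A_ub=np.vstack([A1, A2, Ax]), b_ub=b_ub, bounds=bounds)
    if not res.success:
        raise ValueError("learning set is not separable")
    return res.x[:m] - res.x[m:2 * m], res.x[-1]         # (w, b)
```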

11.6.3 Comparing the two models


We have constructed two LO-models that ‘approximately’ solve the problem of maximizing
the separation width. It is interesting to compare the results of the two models. To do so, we
have used a computer package to find optimal solutions of models (11.8) and (11.9) for the
learning set of Table 11.1. (See Section 11.7 for the GMPL code corresponding to model
(11.8).)
Model (11.8), corresponding to minimizing the 1-norm of the weight vector, has the optimal solution (for the learning set of Table 11.1):

w₅∗ = −0.602, w₁₅∗ = 0.131, b∗ = 7.58.

All other decision variables have optimal value zero. Therefore, the corresponding linear classifier g₁(f) for this learning set is:

g₁(f) = −0.602f₅ + 0.131f₁₅ + 7.58.

Note that this classifier uses very little information from the feature vector f. Only two features are taken into account, namely the relative frequencies of occurrences of the letters E and O. Because w₁₅∗ > 0, a newspaper article with relatively many occurrences of the letter O is more likely to be categorized as written in English, whereas the letter E is considered an indication that the article is written in Dutch.
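As a concrete check of g₁, the letter frequencies listed in Section 11.7 can be plugged in directly: document 1 (English) has E- and O-frequencies f₅ = 12.20 and f₁₅ = 5.85, while document 7 (Dutch) has f₅ = 18.59 and f₁₅ = 6.48. (The helper name `g1` is ours; the coefficients are the rounded optimal values, so the computed values are approximate.)

```python
def g1(f5, f15):
    """1-norm classifier found above; only the E- and O-frequencies matter."""
    return -0.602 * f5 + 0.131 * f15 + 7.58

print(g1(12.20, 5.85))  # ~ 1.00, positive: predicted English
print(g1(18.59, 6.48))  # ~ -2.76, negative: predicted Dutch
```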
In contrast, consider model (11.9), corresponding to minimizing the ∞-norm of the weight
vector. This model has the optimal solution:

wj∗ = 0.0765 for j = 1, 3, 6, 8, 9, 13, 15, 17, 19, 20, 21, 23, 24, 25,
wj∗ = −0.0765 for j = 2, 4, 5, 10, 11, 12, 14, 16, 18, 22, 26,
w7∗ = 0.0463,
b∗ = −0.5530.
Let g∞ (f ) be the corresponding linear classifier. As opposed to g1 (f ), the classifier g∞ (f )
takes into account all features to make a prediction for the language of a given article. The
first set of weights in the solution above corresponds to the letters A, C, F, H, I, M, O, Q,
S, T, U, W, X, and Y. Since these weights are all positive, the classifier treats a relatively high
frequency of occurrences of any of these letters in a given article as evidence that the article
may be written in English. On the other hand, the second set of weights, corresponding
to the letters B, D, E, J, K, L, N, P, R, V, and Z, are negative. This means that a relatively
high frequency of occurrences of any of these is treated as evidence that the article may be
written in Dutch. Note that weight w7∗ corresponds to the letter G.
It is left to the reader to carry out the validation steps (see Section 11.4) for these classifiers,
i.e., to verify that these classifiers correctly predict the language of all newspaper articles in
the learning set and in the validation set, although both classifiers have values between −1
and 1 for some newspaper articles. (The data are available on the website of this book.) Note
that both models assign a negative weight to the letter E, and a positive weight to the letter
O. Hence, both classifiers treat frequent occurrences of the letter E as evidence towards the
article being written in Dutch, and frequent occurrences of the letter O as evidence towards
it being written in English.

[Figure 11.6 shows a scatter plot of the documents d ∈ D, with g₁(fᵈ) on the horizontal axis and g∞(fᵈ) on the vertical axis.]

Figure 11.6: Comparison of classifiers based on minimizing the 1-norm of the weight vector, versus minimizing the ∞-norm.

An interesting question is: is one of the two classifiers significantly better than the other? To
answer this question, we have calculated the values of the two classifiers for all documents in
the learning set and in the validation set. Figure 11.6 shows the results of these calculations.
Each point in the figure represents a newspaper article. On the horizontal axis we have
plotted the values of g1 (f d ), and on the vertical axis we have plotted the values of g∞ (f d )
(d ∈ D). From the figure, we see that the ‘north west’ and ‘south east’ quadrants contain
no points at all. This means that the two classifiers have the same sign for each d ∈ D:
whenever g1 (f d ) is positive, g∞ (f d ) is positive, and vice versa. The ‘north east’ quadrant
contains points for which both classifiers are positive, i.e., these are the newspaper articles
that are predicted to be in English by both classifiers. Similarly, the ‘south west’ quadrant
contains the newspaper articles that are predicted to be in Dutch. It can be seen from the
figure that the values of the two classifiers are more or less linearly related, meaning that
they result in roughly the same predictions.
The horizontal gray band in the figure is the area in which the classifier g1 (f ) has a value
between −1 and 1, i.e., the area in which the classifier does not give a clear-cut prediction.
Similarly, the vertical gray band is the area in which the classifier g∞ (f ) does not give a
clear-cut prediction. The horizontal gray band contains 25 points, whereas the vertical gray
band contains only 9 points. From this, we can conclude that the classifier g∞ (f ) tends to
give more clear-cut predictions than g1 (f ). So, in that sense, the classifier g∞ (f ) is a better
classifier than g1 (f ).
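The band counts above can be automated. The helper below is our own (not from the text); it computes the fraction of documents on which a given linear classifier is clear-cut, i.e., for which |g(fᵈ)| ≥ 1:

```python
import numpy as np

def clearcut_fraction(w, b, F):
    """Fraction of rows f of F with |w^T f + b| >= 1 (outside the gray band)."""
    g = np.asarray(F, float) @ np.asarray(w, float) + b
    return float(np.mean(np.abs(g) >= 1.0))
```

For example, with w = (1, 0), b = 0 and the (made-up) feature rows (2, 0), (0.5, 0), (−3, 0), two of the three classifier values lie outside the interval (−1, 1), so the fraction is 2/3.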

11.7 GMPL model code


The following listing gives the GMPL model code for model (11.8).

param N1; # number of English documents in learning set
param N2; # number of Dutch documents in learning set

set DOCUMENTS := 1..(N1+N2);   # set of all documents
set L1 := 1..N1;               # set of English documents
set L2 := (N1+1)..(N1+N2);     # set of Dutch documents
set FEATURES;                  # set of features

param f{FEATURES, DOCUMENTS}; # values of the feature vectors

var wp{FEATURES} >= 0; # positive part of weights
var wm{FEATURES} >= 0; # negative part of weights
var b;                 # intercept

minimize obj: # objective
  sum {j in FEATURES} (wp[j] + wm[j]);

subject to cons_L1{i in L1}: # constraints for English documents
  sum {j in FEATURES} (wp[j] - wm[j]) * f[j, i] + b >= 1;

subject to cons_L2{i in L2}: # constraints for Dutch documents
  sum {j in FEATURES} (wp[j] - wm[j]) * f[j, i] + b <= -1;

data;

param N1 := 6;
param N2 := 6;

set FEATURES := A B C D E F G H I J K L M N O P Q R S T U V W X Y Z;

param f :
      1     2     3     4     5     6     7     8     9    10    11    12 :=
A 10.40  9.02  9.48  7.89  8.44  8.49  8.68  9.78 12.27  7.42  8.60 10.22
B  1.61  1.87  1.84  1.58  1.41  1.55  2.03  1.08  0.99  1.82  1.79  2.62
C  2.87  2.95  4.86  2.78  3.85  3.13  0.80  1.37  1.10  1.03  2.15  1.83
D  4.29  3.52  3.16  4.18  3.91  5.04  5.50  6.05  6.13  7.11  5.97  6.46
E 12.20 11.75 12.69 11.24 11.88 11.82 18.59 17.83 17.74 17.85 15.29 18.25
F  2.12  2.31  2.97  2.00  2.55  1.90  0.76  0.89  0.77  1.26  1.19  1.48
G  2.23  1.67  2.17  2.16  1.79  2.58  2.75  2.99  2.96  2.21  2.39  4.10
H  4.45  4.36  4.25  5.56  4.45  5.43  1.69  3.29  2.96  1.66  2.27  2.62
I  9.20  7.61  7.59  7.62  8.33  8.25  5.25  6.89  6.24  8.14  7.29  5.68
J  0.13  0.22  0.14  0.25  0.22  0.40  1.35  1.30  0.88  2.05  1.55  1.57
K  0.75  0.64  0.57  0.91  0.33  0.91  3.90  1.90  1.86  2.13  1.91  2.10
L  4.05  3.28  4.81  4.74  4.31  3.77  4.11  4.44  3.50  3.40  3.82  3.76
M  2.41  3.46  2.55  3.10  2.74  2.58  2.50  2.21  3.40  3.00  1.67  1.75
N  7.03  7.70  6.60  7.02  7.16  7.34 10.63  8.80 10.30 11.45  9.32 11.53
O  5.85  6.82  8.11  6.74  6.76  6.54  6.48  5.51  4.27  4.82  4.78  4.28
P  1.53  2.51  1.79  1.93  2.41  1.43  1.31  1.23  1.42  1.11  2.51  0.52
Q  0.11  0.02  0.14  0.12  0.19  0.08  0.00  0.05  0.00  0.00  0.00  0.09
R  6.44  7.00  6.04  5.82  6.35  6.07  7.75  6.24  5.04  6.24  6.69  6.20
S  7.35  7.68  5.28  7.22  6.35  6.66  3.43  4.66  3.18  5.13  6.57  3.14
T  8.50  8.10  8.92  9.03  8.98  7.61  5.42  6.77  7.12  4.82  6.33  5.94
U  2.25  3.17  2.12  2.87  2.88  3.09  1.74  1.32  1.20  1.58  2.75  1.40
V  0.80  1.01  1.08  0.89  1.22  0.95  3.30  2.56  3.61  4.19  2.75  2.45
W  1.26  1.21  1.51  1.61  1.47  2.62  0.97  1.51  1.53  0.55  1.31  0.87
X  0.05  0.22  0.09  0.05  0.19  0.28  0.04  0.00  0.00  0.00  0.00  0.00
Y  1.72  1.87  1.04  2.56  1.71  1.43  0.08  0.14  0.11  0.00  0.24  0.09
Z  0.40  0.02  0.19  0.14  0.14  0.08  0.93  1.18  1.42  1.03  0.84  1.05 ;

end;

11.8 Exercises

Exercise 11.8.1.
(a) Show that if a learning set with a certain set of features is separable, then adding a feature
keeps the learning set separable.
(b) Give an example of a separable learning set with the property that removing a feature
makes the learning set nonseparable.

Exercise 11.8.2. Prove the following statements:

(a) H⁺(−w, −b) = H⁻(w, b) for w ∈ Rᵐ \ {0}, b ∈ R.
(b) H(λw, λb) = H(w, b) for w ∈ Rᵐ \ {0}, b ∈ R, and λ ≠ 0.
(c) If w and b define a separating hyperplane for F(L1) and F(L2) such that F(L1) ⊆ H⁺(w, b) and F(L2) ⊆ H⁻(w, b), then also conv(F(L1)) ⊆ H⁺(w, b) and conv(F(L2)) ⊆ H⁻(w, b); therefore, w and b also define a separating hyperplane for conv(F(L1)) and conv(F(L2)).

Exercise 11.8.3. Let n ≥ 1 and i ∈ {1, 2}. For Sᵢ ⊆ Rⁿ, define:

S′ᵢ = { [x₁ . . . xₙ₋₁]ᵀ ∈ Rⁿ⁻¹ | [x₁ . . . xₙ₋₁ xₙ]ᵀ ∈ Sᵢ for some xₙ ∈ R }.

(a) Prove that if Sᵢ is convex, then S′ᵢ is convex.
(b) Prove that if S′₁ and S′₂ are separable, then S₁ and S₂ are separable.
(c) Is it true that if S₁ and S₂ are separable, then S′₁ and S′₂ are separable? Prove this or give a counterexample.

Exercise 11.8.4. In real-life applications, the learning set is usually not separable. One way to deal with this problem is to allow some points in the data set to lie ‘on the wrong side’ of the hyperplane. Whenever this happens, however, it should be heavily penalized. How can model (11.8) be generalized to take this into account?
