Chapter 11
Classifying documents by language
Overview
In this chapter we will show how linear optimization can be used in machine learning. Ma-
chine learning is a branch of artificial intelligence that deals with algorithms that identify
(‘learn’) complex relationships in empirical data. These relationships can then be used to
make predictions based on new data. Applications of machine learning include spam email
detection, face recognition, speech recognition, webpage ranking in internet search engines,
natural language processing, medical diagnosis based on patients’ symptoms, fraud detection
for credit cards, control of robots, and games such as chess and backgammon.
An important driving force behind the development of machine learning algorithms has been the
commercialization of the internet in the past two decades. Large internet businesses, such
as search engine and social network operators, process large amounts of data from around
the world. To make sense of these data, a wide range of machine learning techniques are
employed. One important application is ad click prediction; see, e.g., McMahan et al. (2013).
In this chapter, we will study the problem of automated language detection of text docu-
ments, such as newspaper articles and emails. We will develop a technique called a support
vector machine for this purpose. For an elaborate treatment of support vector machines in
machine learning, we refer to, e.g., Cristianini and Shawe-Taylor (2000).
A classification algorithm requires a set of examples, each of which has a label. An example of a classification problem
is the problem faced by email providers to classify incoming email messages into spam and
non-spam messages. The examples for such a classification problem would be a number of
email messages, each labeled as either ‘spam’ or ‘not spam’. The goal of the classification
algorithm is to find a way to accurately predict labels of new observations. The labels for
the examples are usually provided by humans. This could be someone who explicitly inspects
the examples and classifies them as 'spam' or 'not spam', or the labels could be provided by
the users of the email service. For example, when a user clicks on the ‘Mark this message
as spam’ button in an email application, the message at hand is labeled as ‘spam’, and this
information can be used for future predictions.
Other areas of machine learning include: regression, where the goal is to predict a number
rather than a label (e.g., tax spending based on income, see also Section 1.6.2); ranking, where
the goal is to learn how to rank objects (for example in internet search engines); clustering,
which means determining groups of objects that are similar to each other. In the cases of
classification and regression, the examples usually have labels attached to them, and these
labels are considered to be the ‘correct’ labels. The goal of the algorithms is then to ‘learn’ to
predict these labels. Such problems are sometimes categorized as supervised machine learning.
In contrast, in ranking and clustering, the examples usually do not have labels, and they are
therefore categorized as unsupervised machine learning.
The current chapter is a case study of a (supervised) classification algorithm. To make a classifi-
cation algorithm successful, certain features of the messages are determined that (hopefully)
carry predictive knowledge. For example, for spam classification, it can be helpful to count
the number of words in the message that relate to pharmaceutical products, or whether or
not the email message is addressed to many recipients rather than to one particular recipient.
Such features are indicative of the message being spam. Other features, such
as the names of the recipient's friends, may be indicative of the message not being spam.
The features are represented as numbers, and the features of a single example can hence be
grouped together as a vector. Usually, it is not one particular feature that determines the
label of an example, but it is rather the combination of them. For example, an email message
that is sent to many different recipients and that contains five references to pharmaceutical
products can probably be classified as spam, whereas a message that has multiple recipients
and includes ten of the main recipient’s friends should probably be classified as non-spam. A
classification algorithm attempts to make sense of the provided features and labels, and uses
these to classify new, unlabeled, examples.
Clearly, the design of features is crucial and depends on the problem at hand. Features should
be chosen that have predictive power, and hence the design uses prior knowledge about the
problem. In many cases it may not be immediately clear how to choose the features.
11.2 Classifying documents using separating hyperplanes
Table 11.1: Relative letter frequencies (in percentages) of several newspaper articles.
We will restrict our attention to linear classifiers, i.e., the classifier g is restricted to have the form

    g(f) = Σ_{j=1}^{m} w_j f_j + b = w^T f + b   for f = [f_1 ... f_m]^T ∈ R^m,

where w (∈ R^m \ {0}) is called the weight vector of the classifier, b (∈ R) is the intercept,
and f is any feature vector. Note that we exclude the possibility that w = 0, because the
corresponding classifier does not take any feature into account, and is therefore of no
use for predicting the language of a document. Our goal is to construct a weight vector w and
an intercept b such that:

    d ∈ L1 ⟹ w^T f^d + b > 0, and
    d ∈ L2 ⟹ w^T f^d + b < 0.        (11.1)
Linear classifiers have the following geometric interpretation. For any w ∈ R^m \ {0} and
b ∈ R, define the hyperplane H(w, b) = {f ∈ R^m | w^T f + b = 0}, and the two (strict)
halfspaces H^+(w, b) = {f ∈ R^m | w^T f + b > 0} and H^−(w, b) = {f ∈ R^m | w^T f + b < 0}.
Figure 11.1: Separable learning set with 40 documents. The solid and the dashed lines are
separating hyperplanes.

Figure 11.2: Nonseparable learning set with 40 documents. The convex hulls of the learning
sets intersect.
So, we want to construct a hyperplane in Rm such that the feature vectors corresponding
to documents in L1 lie in the halfspace H + (w, b), and the vectors corresponding to L2 in
H − (w, b).
If there exist a weight vector w and an intercept b such that the conditions of (11.2) are
satisfied, then F (L1 ) and F (L2 ) are said to be separable; they are called nonseparable oth-
erwise. The corresponding hyperplane H(w, b) is called a separating hyperplane for F (L1 )
and F (L2 ), and the function wT f d + b is called a separator for F (L1 ) and F (L2 ); see also
Appendix D. We make the following observations (see Exercise 11.8.2):
▶ H^+(−w, −b) = H^−(w, b) for w ∈ R^m \ {0}, b ∈ R.
▶ H(λw, λb) = H(w, b) for w ∈ R^m \ {0}, b ∈ R, and λ ≠ 0.
▶ If w and b define a separating hyperplane for F(L1) and F(L2) such that F(L1) ⊆
  H^+(w, b) and F(L2) ⊆ H^−(w, b), then we also have that conv(F(L1)) ⊆ H^+(w, b)
  and conv(F(L2)) ⊆ H^−(w, b); therefore, w and b also define a separating hyperplane
  for conv(F(L1)) and conv(F(L2)).
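The first two observations can be checked numerically. The following Python sketch is ours, with arbitrary hypothetical values for w, b, and f:

```python
# Numerical check of the first two observations, with hypothetical
# values for w, b, and f (any w != 0 would do).
w = [1.0, -2.0]
b = 0.5
f = [3.0, 1.0]

def side(w, b, f):
    """+1 if f lies in H+(w, b), -1 if f lies in H-(w, b), 0 if f is on H(w, b)."""
    v = sum(wi * fi for wi, fi in zip(w, f)) + b
    return (v > 0) - (v < 0)

# H+(-w, -b) = H-(w, b): negating w and b swaps the two halfspaces.
assert side([-wi for wi in w], -b, f) == -side(w, b, f)

# H(lam*w, lam*b) = H(w, b): for lam > 0 every point also stays on the
# same side; for lam < 0 the hyperplane is unchanged but the sides swap.
lam = 7.5
assert side([lam * wi for wi in w], lam * b, f) == side(w, b, f)
```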
Note that even for a small learning set L, it is not clear beforehand whether or not F (L1 )
and F (L2 ) are separable. So the first question that needs to be addressed is: does there
exist a separating hyperplane for F (L1 ) and F (L2 )? Figure 11.1 shows an example of a
separable learning set with (m =) 2 features. The squares correspond to the feature vectors
in F (L1 ), and the circles to the feature vectors in F (L2 ). Also, the convex hulls of square
points and the circle points are shown. The solid and the dashed lines represent two possible
hyperplanes. Figure 11.2 shows a learning set which is not separable.
Figure 11.1 illustrates another important fact. Suppose that we discard feature f2 and only
consider feature f1. Let F′(L1) (⊂ R) and F′(L2) (⊂ R) be the feature 'vectors'
obtained from discarding feature f2. Then, the vectors in F′(L1) and F′(L2) are one-
dimensional and can be plotted on a line; see Figure 11.3. (This graph can also be constructed
by moving all points in Figure 11.1 straight down onto the horizontal axis.) A hyperplane
Figure 11.3: The learning set of Figure 11.1 after discarding feature f2.
Figure 11.4: Scatter plots of relative letter frequencies (in percentages). The squares represent the vectors
in F (L1 ) and the circles are the vectors in F (L2 ). Here, L1 is the set of English documents,
and L2 is the set of Dutch documents.
Because these inequalities are strict inequalities, they cannot be used in an LO-model. To
circumvent this ‘limitation’, we will show that it suffices to use the following ‘≥’ and ‘≤’
inequalities instead:
    w^T f^d + b ≥  1   for d ∈ L1, and
    w^T f^d + b ≤ −1   for d ∈ L2.        (11.4)
Clearly, the solution set (in terms of w and b) of (11.4) is in general a strict subset of the
solution set of (11.3). However, the sets of hyperplanes defined by (11.3) and (11.4) coincide.
To be precise, let H1 = {H(w, b) | w and b satisfy (11.3)}, i.e., H1 is the collection of hy-
perplanes defined by the solutions of (11.3). Let H2 = {H(w, b) | w and b satisfy (11.4)}.
We claim that H1 = H2 . It is easy to check that H2 ⊆ H1 . To see that H1 ⊆ H2 ,
take any w and b that satisfy (11.3). Then, because L1 and L2 are finite sets, there exists
ε > 0 such that w^T f^d + b ≥ ε for d ∈ L1 and w^T f^d + b ≤ −ε for d ∈ L2. Define
ŵ = (1/ε)w and b̂ = (1/ε)b. Then, it is straightforward to check that ŵ and b̂ satisfy (11.4) and
that H(ŵ, b̂) = H(w, b), as required.
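The scaling argument can be illustrated numerically. The following Python sketch uses a small hypothetical learning set; the data and names are ours:

```python
# Numerical illustration of the scaling argument, on a small
# hypothetical learning set (the data and names are ours).
F_L1 = [[2.0, 2.0], [3.0, 1.0]]   # feature vectors of documents in L1
F_L2 = [[0.0, 0.0], [-1.0, 1.0]]  # feature vectors of documents in L2

w, b = [1.0, 1.0], -2.0  # satisfies the strict inequalities (11.3)

def value(w, b, f):
    """Classifier value w^T f + b."""
    return sum(wi * fi for wi, fi in zip(w, f)) + b

# Because the learning set is finite, the smallest margin is positive;
# it plays the role of epsilon in the argument above.
eps = min(min(value(w, b, f) for f in F_L1),
          min(-value(w, b, f) for f in F_L2))
assert eps > 0

w_hat = [wi / eps for wi in w]  # w_hat = (1/eps) w
b_hat = b / eps                 # b_hat = (1/eps) b

# w_hat and b_hat satisfy (11.4), and define the same hyperplane.
assert all(value(w_hat, b_hat, f) >= 1 - 1e-9 for f in F_L1)
assert all(value(w_hat, b_hat, f) <= -1 + 1e-9 for f in F_L2)
```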
From now on, we will only consider the inequalities of (11.4). For each w ∈ R^m \ {0}
and b ∈ R, define the following halfspaces:

    H^{+1}(w, b) = {f ∈ R^m | w^T f + b ≥  1}, and
    H^{−1}(w, b) = {f ∈ R^m | w^T f + b ≤ −1}.
If the halfspaces H^{+1}(w, b) and H^{−1}(w, b) satisfy the conditions of (11.5), then the set
{f ∈ R^m | −1 ≤ w^T f + b ≤ 1} is called a separation for F (L1 ) and F (L2 ), because it
separates F (L1 ) from F (L2 ); see Figure 11.5.
Figure 11.5: Separation for a learning set. The area between the dashed lines is the separation.
It follows from the discussion above that, in order to find a separating hyperplane for F (L1 )
and F (L2 ), the system of inequalities (11.4) needs to be solved. This can be done by solving
the following LO-model:
    min  0
    s.t. w_1 f_1^d + ... + w_m f_m^d + b ≥  1   for d ∈ L1
         w_1 f_1^d + ... + w_m f_m^d + b ≤ −1   for d ∈ L2        (11.6)
         w_1, ..., w_m, b free.
In this LO-model, the decision variables are the weights w1 , . . . , wm and the intercept b of
the classifier. The values of fid with i ∈ {1, . . . , m} and d ∈ L1 ∪ L2 are parameters of
the model.
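Any LP solver can be used for model (11.6). As an illustration (not the book's own code), here is a sketch using SciPy's linprog on a hypothetical two-feature learning set:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical learning set with m = 2 features (not the letter
# frequencies of Table 11.1).
F_L1 = np.array([[2.0, 2.0], [3.0, 1.0]])   # documents in L1
F_L2 = np.array([[0.0, 0.0], [-1.0, 1.0]])  # documents in L2
m = F_L1.shape[1]

# Decision variables: (w_1, ..., w_m, b); zero objective (pure feasibility).
c = np.zeros(m + 1)

# w^T f^d + b >=  1 for d in L1  ->  -(w^T f^d) - b <= -1
# w^T f^d + b <= -1 for d in L2  ->    w^T f^d  + b <= -1
A_ub = np.vstack([np.hstack([-F_L1, -np.ones((len(F_L1), 1))]),
                  np.hstack([ F_L2,  np.ones((len(F_L2), 1))])])
b_ub = -np.ones(len(F_L1) + len(F_L2))

# All variables are free, as in (11.6).
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (m + 1))
w, b = res.x[:m], res.x[m]
```

If the learning set is separable, `res.success` is true and (w, b) defines a separating hyperplane; if not, the solver reports infeasibility.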
Once a classifier (equivalently, a separating hyperplane) for the learning set L1 ∪ L2 has
been constructed by solving the LO-model (11.6), this classifier may be used to predict
the language of any given document d ∈ D. This prediction is done as follows. Let
w_1^*, ..., w_m^*, b^* be an optimal solution of model (11.6). This optimal solution defines the
classifier value w_1^* f_1^d + ... + w_m^* f_m^d + b^* for document d, based on the feature values of
that document. If the classifier value is ≥ 1, then the document is classified as an English
document; if the value is ≤ −1, then the document is classified as a Dutch document. If
the value lies between −1 and 1, then the classifier does not clearly determine the language
of the document. In that case, the closer the value lies to 1, the more confident we can be
that d is an English document. Similarly, the closer the value lies to −1, the more confident
we can be that d is a Dutch document.
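The prediction rule can be summarized in a short Python sketch (ours; the output format for the inconclusive case is a hypothetical choice):

```python
def predict(w, b, f):
    """Predict the language of a document with feature vector f, given
    an optimal solution (w, b) of model (11.6)."""
    value = sum(wj * fj for wj, fj in zip(w, f)) + b
    if value >= 1:
        return "English"
    if value <= -1:
        return "Dutch"
    # Between -1 and 1 the classifier is inconclusive; we report the
    # value so that its closeness to +1 or -1 can be judged.
    return "inconclusive ({:+.2f})".format(value)

# With a hypothetical solution w = (0.5, 0.5), b = -1 on two features:
print(predict([0.5, 0.5], -1.0, [4.0, 0.0]))  # English
```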
Example 11.3.1. Consider the learning set of Table 11.1, where L1 is the set of the six newspaper
articles written in English, and L2 is the set of the six newspaper articles written in Dutch. Solving
model (11.6) using a computer package (e.g., the online solver for this book) yields the following optimal
solution:
(See Section 11.7 for the GMPL code for this model.) All other decision variables have value zero at
this optimal solution. The corresponding classifier is:
The weights correspond to the letters H, O, Q, U, and Z, respectively. Thus, the classifier bases its
calculations only on the relative frequencies of the letters H, O, Q, U, and Z. Note that the weight
w_17^* assigned to the letter Q is positive and relatively large compared to the other positive weights. This
means that, for any given document d ∈ D, the expression w_1^* f_1^d + ... + w_m^* f_m^d + b^* tends to
be more positive if the document contains relatively many occurrences of the letter Q. Consequently,
such a document is more likely to be classified as an English newspaper article. On the other hand,
the weight w_26^* assigned to the letter Z is negative, and so a document containing relatively many
occurrences of the letter Z is likely to be classified as a Dutch newspaper article.
The above example illustrates the fact that the validation step may reveal problems with the
classifier constructed using model (11.6). One way to improve the classifier is to increase
Table 11.2: Validation results for the classifier. The articles 1, 2, 7, and 8 are in the learning set; the articles
21, 30, 57, 66, and 67 are in the validation set. The question marks in the row ‘Predicted
language’ indicate that the classifier is inconclusive about the language.
the learning set. In the example, we used only six documents per language. In real-life
applications the learning set is usually taken to be much larger.
In the next sections, we present another way to improve the classification results. Note that
the objective function of model (11.6) is the zero function, which means that any feasible
solution of the model is optimal. So, the objective is in a sense ‘redundant’, because it can be
replaced by maximizing or minimizing any constant objective function. In fact, in general,
the model has multiple optimal solutions. Hence, there are in general multiple separating
hyperplanes. Figure 11.1 shows two hyperplanes corresponding to two feasible solutions,
namely a dashed line and a solid line. In the next section, we study the 'quality' of the
hyperplanes.
To measure the robustness of a given separating hyperplane, we calculate its so-called separa-
tion width. Informally speaking, the separation width is the m-dimensional generalization
of the width of the band between the dashed lines in Figure 11.5. For given w ∈ Rm \ {0}
and b ∈ R, the separation width of the hyperplane H(w, b) = {f ∈ R^m | w^T f + b = 0}
is defined as the distance between the halfspaces H^{+1}(w, b) and H^{−1}(w, b), i.e.,

    width(w, b) = min{ ‖f − f′‖ | f ∈ H^{+1}(w, b), f′ ∈ H^{−1}(w, b) },

where ‖f − f′‖ is the Euclidean distance between the vectors f and f′ (∈ R^m). Note that,
for any w ∈ R^m \ {0} and b ∈ R, width(w, b) is well-defined because the minimum in
the right hand side in the above expression is attained. In fact, the following theorem gives
an explicit formula for the separation width.
Theorem 11.5.1.
For any w ∈ R^m \ {0} and b ∈ R, it holds that width(w, b) = 2/‖w‖.
Proof. Take any point f̂ ∈ R^m such that w^T f̂ + b = −1. Note that f̂ ∈ H^{−1}(w, b). Define
f̂′ = f̂ + w^*, with w^* = (2/‖w‖²)w. Then, we have that ‖w^*‖ = 2/‖w‖. It follows that:

    w^T f̂′ + b = w^T (f̂ + (2/‖w‖²)w) + b = w^T f̂ + b + 2(w^T w)/‖w‖² = −1 + 2 = 1,

where we have used the fact that w^T w = ‖w‖². Therefore, f̂′ ∈ H^{+1}(w, b). So, we have
that f̂ ∈ H^{−1}(w, b) and f̂′ ∈ H^{+1}(w, b). Hence, width(w, b) ≤ ‖f̂ − f̂′‖ = ‖w^*‖ = 2/‖w‖.

To show that width(w, b) ≥ 2/‖w‖, take any f̂ ∈ H^{+1}(w, b) and f̂′ ∈ H^{−1}(w, b). By the
definitions of H^{+1}(w, b) and H^{−1}(w, b), we have that:

    w^T f̂ + b ≥ 1, and w^T f̂′ + b ≤ −1.

Subtracting the second inequality from the first one gives the inequality w^T (f̂ − f̂′) ≥ 2. The
cosine rule (see Appendix B) implies that:

    cos θ = (w^T (f̂ − f̂′)) / (‖w‖ ‖f̂ − f̂′‖) ≥ 2 / (‖w‖ ‖f̂ − f̂′‖),

where θ is the angle between the vectors w and f̂ − f̂′. Since cos θ ≤ 1, we have that:

    2 / (‖w‖ ‖f̂ − f̂′‖) ≤ 1.

Rearranging, this yields that ‖f̂ − f̂′‖ ≥ 2/‖w‖ for all f̂ ∈ H^{+1}(w, b) and all f̂′ ∈ H^{−1}(w, b).
Hence, also min{ ‖f − f′‖ | f ∈ H^{+1}(w, b), f′ ∈ H^{−1}(w, b) } ≥ 2/‖w‖, i.e., width(w, b) ≥
2/‖w‖, as required. □
We conclude that the separation direction is determined by the direction of the vector w
and that, according to Theorem 11.5.1, the separation width is inversely proportional to
the length of w. Figure 11.1 depicts two separating hyperplanes. From this figure, we can
see that the separation width corresponding to hyperplane H2 is much smaller than the
separation width corresponding to hyperplane H1 .
    min  ‖w‖
    s.t. w^T f^d + b ≥  1   for d ∈ L1
         w^T f^d + b ≤ −1   for d ∈ L2        (11.7)
         b, w_j free for j = 1, ..., m.
The objective function ‖w‖ = (w_1^2 + ... + w_m^2)^{1/2} is obviously a nonlinear function of the deci-
sion variables w_1, ..., w_m, so that (11.7) is a nonlinear optimization model. Such models
may be hard to solve, especially when the number of documents (and, hence, the number
of constraints) is very large. Therefore, we look for a linear objective function. In general,
this will result in a classifier of lower quality, i.e., the hyperplane corresponding to the result-
ing (w, b) has a smaller separation width than the hyperplane corresponding to an
optimal solution (w^*, b^*) of (11.7).
The objective function of the above (nonlinear) optimization model is the Euclidean norm
(see Appendix B.1) of the vector w. A generalization of the Euclidean norm is the so-called
p-norm. The p-norm of a vector w = [w_1 ... w_m]^T ∈ R^m is denoted and defined as
(p ≥ 1 and integer):

    ‖w‖_p = (Σ_{i=1}^m |w_i|^p)^{1/p}.
Clearly, the Euclidean norm corresponds to the special case p = 2, i.e., the Euclidean
norm is the 2-norm. Since the 2-norm is a nonlinear function, it cannot be included in an
LO-model. Below, however, we will see that two other choices for p lead to LO-models,
namely the choices p = 1 and p = ∞. In the remainder of this section, we consecutively
discuss LO-models that minimize the 1-norm and the ∞-norm of the weight vector.
11.6 Models that maximize the separation width
Theorem 11.6.1.
Let w = [w_1 ... w_m]^T be a vector. Then, lim_{p→∞} ‖w‖_p = max{|w_1|, ..., |w_m|}.
Proof. Define M = max{|w_i| | i = 1, ..., m}, and let p be any positive integer. We have
that:

    ‖w‖_p = (Σ_{i=1}^m |w_i|^p)^{1/p} ≥ (M^p)^{1/p} = M.

On the other hand, we have that:

    ‖w‖_p = (Σ_{i=1}^m |w_i|^p)^{1/p} ≤ (m M^p)^{1/p} = m^{1/p} M.

It follows that M ≤ ‖w‖_p ≤ m^{1/p} M. Letting p → ∞ in this expression, we find that
M ≤ lim_{p→∞} ‖w‖_p ≤ M, which is equivalent to lim_{p→∞} ‖w‖_p = M, as required. □
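The sandwich bound used in the proof can be observed numerically; a small Python sketch (with a hypothetical vector w):

```python
# Numerical illustration of Theorem 11.6.1, for a hypothetical vector w.
w = [1.0, -3.0, 2.0]
M = max(abs(wi) for wi in w)  # = 3.0

def p_norm(w, p):
    """The p-norm: (sum of |w_i|^p) ** (1/p)."""
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)

# The sandwich M <= ||w||_p <= m^(1/p) * M from the proof above.
for p in (1, 2, 10, 100):
    assert M - 1e-9 <= p_norm(w, p) <= len(w) ** (1.0 / p) * M + 1e-9

# For large p the p-norm is already very close to M = max_i |w_i|.
assert abs(p_norm(w, 100) - M) < 1e-2
```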
The objective function max{|w_1|, ..., |w_m|} is clearly not linear. However, it can be
incorporated in an LO-model by using the following 'trick'. First, a new decision variable
x is introduced, which will represent max{|w_1|, ..., |w_m|}. The objective is then replaced
by 'min x', and the following constraints are added:

    |w_j| ≤ x for j = 1, ..., m.
Because the value of the variable x is minimized at any optimal solution, the
optimal value x^* will be as small as possible, while satisfying x^* ≥ |w_j^*| for j = 1, ..., m.
This means that in fact x^* = max{|w_1^*|, ..., |w_m^*|} at any optimal solution. Combining
this 'trick' with the treatment of absolute values as in model (11.8), we find the following
LO-model:
    min  x
    s.t. Σ_{j=1}^m (w_j^+ − w_j^−) f_j^d + b ≥  1   for d ∈ L1
         Σ_{j=1}^m (w_j^+ − w_j^−) f_j^d + b ≤ −1   for d ∈ L2        (11.9)
         w_j^+ + w_j^− ≤ x                          for j = 1, ..., m
         x ≥ 0, w_j^+ ≥ 0, w_j^− ≥ 0 for j = 1, ..., m, b free.
The values of the f_j^d's with j ∈ {1, ..., m} and d ∈ L1 ∪ L2 are parameters of the model, and
the decision variables are b, x, w_j^+, and w_j^− for j = 1, ..., m.
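As an illustration (again not the book's own code), model (11.9) can be fed to SciPy's linprog on a hypothetical toy learning set:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy learning set with m = 2 features.
F_L1 = np.array([[2.0, 2.0], [3.0, 1.0]])
F_L2 = np.array([[0.0, 0.0], [-1.0, 1.0]])
m = F_L1.shape[1]

# Variable order: (wp_1..wp_m, wm_1..wm_m, b, x); the objective is min x.
c = np.zeros(2 * m + 2)
c[-1] = 1.0

rows, rhs = [], []
for f in F_L1:       # sum_j (wp_j - wm_j) f_j + b >= 1, written as a '<=' row
    rows.append(np.concatenate([-f, f, [-1.0, 0.0]]))
    rhs.append(-1.0)
for f in F_L2:       # sum_j (wp_j - wm_j) f_j + b <= -1
    rows.append(np.concatenate([f, -f, [1.0, 0.0]]))
    rhs.append(-1.0)
for j in range(m):   # wp_j + wm_j <= x
    row = np.zeros(2 * m + 2)
    row[j] = row[m + j] = 1.0
    row[-1] = -1.0
    rows.append(row)
    rhs.append(0.0)

bounds = [(0, None)] * (2 * m) + [(None, None), (0, None)]  # b free, x >= 0
res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
w = res.x[:m] - res.x[m:2 * m]  # recover the weight vector
b, x = res.x[2 * m], res.x[2 * m + 1]
```

At an optimal solution, x equals the ∞-norm max_j |w_j| of the recovered weight vector, as argued above.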
Both models (11.8) and (11.9) were solved for the learning set of Table 11.1. (See Section 11.7 for the GMPL code corresponding to model
(11.8).)
Model (11.8), corresponding to minimizing the 1-norm of the weight vector, has the optimal
solution (for the learning set of Table 11.1):
All other decision variables have optimal value zero. Therefore, the corresponding linear
classifier g1 (f ) for this learning set is:
Note that this classifier uses very little information from the feature vector f . Only two
features are taken into account, namely the relative frequencies of occurrences of the letters
E and O. Because w_15^* > 0, a newspaper article with relatively many occurrences of the letter
O is more likely to be categorized as written in English, whereas the letter E is considered
an indication that the article is written in Dutch.
In contrast, consider model (11.9), corresponding to minimizing the ∞-norm of the weight
vector. This model has the optimal solution:
    w_j^* =  0.0765 for j = 1, 3, 6, 8, 9, 13, 15, 17, 19, 20, 21, 23, 24, 25,
    w_j^* = −0.0765 for j = 2, 4, 5, 10, 11, 12, 14, 16, 18, 22, 26,
    w_7^* =  0.0463,
    b^*   = −0.5530.
Let g∞ (f ) be the corresponding linear classifier. As opposed to g1 (f ), the classifier g∞ (f )
takes into account all features to make a prediction for the language of a given article. The
first set of weights in the solution above corresponds to the letters A, C, F, H, I, M, O, Q,
S, T, U, W, X, and Y. Since these weights are all positive, the classifier treats a relatively high
frequency of occurrences of any of these letters in a given article as evidence that the article
may be written in English. On the other hand, the second set of weights, corresponding
to the letters B, D, E, J, K, L, N, P, R, V, and Z, are negative. This means that a relatively
high frequency of occurrences of any of these is treated as evidence that the article may be
written in Dutch. Note that the weight w_7^* corresponds to the letter G.
It is left to the reader to carry out the validation steps (see Section 11.4) for these classifiers,
i.e., to verify that these classifiers correctly predict the language of all newspaper articles in
the learning set and in the validation set, although both classifiers have values between −1
and 1 for some newspaper articles. (The data are available on the website of this book.) Note
that both models assign a negative weight to the letter E, and a positive weight to the letter
O. Hence, both classifiers treat frequent occurrences of the letter E as evidence towards the
article being written in Dutch, and frequent occurrences of the letter O as evidence towards
it being written in English.
Figure 11.6: Comparison of classifiers based on minimizing the 1-norm of the weight vector, versus
minimizing the ∞-norm.
An interesting question is: is one of the two classifiers significantly better than the other? To
answer this question, we have calculated the values of the two classifiers for all documents in
the learning set and in the validation set. Figure 11.6 shows the results of these calculations.
Each point in the figure represents a newspaper article. On the horizontal axis we have
plotted the values of g1 (f d ), and on the vertical axis we have plotted the values of g∞ (f d )
(d ∈ D). From the figure, we see that the ‘north west’ and ‘south east’ quadrants contain
no points at all. This means that the two classifiers have the same sign for each d ∈ D:
whenever g1 (f d ) is positive, g∞ (f d ) is positive, and vice versa. The ‘north east’ quadrant
contains points for which both classifiers are positive, i.e., these are the newspaper articles
that are predicted to be in English by both classifiers. Similarly, the ‘south west’ quadrant
contains the newspaper articles that are predicted to be in Dutch. It can be seen from the
figure that the values of the two classifiers are more or less linearly related, meaning that
they result in roughly the same predictions.
The horizontal gray band in the figure is the area in which the classifier g1 (f ) has a value
between −1 and 1, i.e., the area in which the classifier does not give a clear-cut prediction.
Similarly, the vertical gray band is the area in which the classifier g∞ (f ) does not give a
clear-cut prediction. The horizontal gray band contains 25 points, whereas the vertical gray
band contains only 9 points. From this, we can conclude that the classifier g∞ (f ) tends to
give more clear-cut predictions than g1 (f ). So, in that sense, the classifier g∞ (f ) is a better
classifier than g1 (f ).
param N1; # number of documents in L1 (declaration missing from the extracted listing; restored)
param N2; # number of documents in L2

set DOCUMENTS := 1..(N1+N2); # set of all documents
set L1 := 1..N1; # set of English documents
set L2 := (N1+1)..(N1+N2); # set of Dutch documents
set FEATURES; # set of features

param f{FEATURES, DOCUMENTS}; # values of the feature vectors

var wp{FEATURES} >= 0; # positive part of weights
var wm{FEATURES} >= 0; # negative part of weights
var b; # intercept

minimize obj: # objective
  sum {j in FEATURES} (wp[j] + wm[j]);

subject to cons_L1{i in L1}: # constraints for English documents
  sum {j in FEATURES} (wp[j] - wm[j]) * f[j, i] + b >= 1;

subject to cons_L2{i in L2}: # constraints for Dutch documents
  sum {j in FEATURES} (wp[j] - wm[j]) * f[j, i] + b <= -1;

data;

param N1 := 6;
param N2 := 6;

set FEATURES := A B C D E F G H I J K L M N O P Q R S T U V W X Y Z;

param f :
      1     2     3     4     5     6     7     8     9    10    11    12 :=
A 10.40  9.02  9.48  7.89  8.44  8.49  8.68  9.78 12.27  7.42  8.60 10.22
B  1.61  1.87  1.84  1.58  1.41  1.55  2.03  1.08  0.99  1.82  1.79  2.62
C  2.87  2.95  4.86  2.78  3.85  3.13  0.80  1.37  1.10  1.03  2.15  1.83
D  4.29  3.52  3.16  4.18  3.91  5.04  5.50  6.05  6.13  7.11  5.97  6.46
E 12.20 11.75 12.69 11.24 11.88 11.82 18.59 17.83 17.74 17.85 15.29 18.25
F  2.12  2.31  2.97  2.00  2.55  1.90  0.76  0.89  0.77  1.26  1.19  1.48
G  2.23  1.67  2.17  2.16  1.79  2.58  2.75  2.99  2.96  2.21  2.39  4.10
H  4.45  4.36  4.25  5.56  4.45  5.43  1.69  3.29  2.96  1.66  2.27  2.62
I  9.20  7.61  7.59  7.62  8.33  8.25  5.25  6.89  6.24  8.14  7.29  5.68
J  0.13  0.22  0.14  0.25  0.22  0.40  1.35  1.30  0.88  2.05  1.55  1.57
K  0.75  0.64  0.57  0.91  0.33  0.91  3.90  1.90  1.86  2.13  1.91  2.10
L  4.05  3.28  4.81  4.74  4.31  3.77  4.11  4.44  3.50  3.40  3.82  3.76
M  2.41  3.46  2.55  3.10  2.74  2.58  2.50  2.21  3.40  3.00  1.67  1.75
N  7.03  7.70  6.60  7.02  7.16  7.34 10.63  8.80 10.30 11.45  9.32 11.53
O  5.85  6.82  8.11  6.74  6.76  6.54  6.48  5.51  4.27  4.82  4.78  4.28
P  1.53  2.51  1.79  1.93  2.41  1.43  1.31  1.23  1.42  1.11  2.51  0.52
Q  0.11  0.02  0.14  0.12  0.19  0.08  0.00  0.05  0.00  0.00  0.00  0.09
R  6.44  7.00  6.04  5.82  6.35  6.07  7.75  6.24  5.04  6.24  6.69  6.20
S  7.35  7.68  5.28  7.22  6.35  6.66  3.43  4.66  3.18  5.13  6.57  3.14
T  8.50  8.10  8.92  9.03  8.98  7.61  5.42  6.77  7.12  4.82  6.33  5.94
U  2.25  3.17  2.12  2.87  2.88  3.09  1.74  1.32  1.20  1.58  2.75  1.40
V  0.80  1.01  1.08  0.89  1.22  0.95  3.30  2.56  3.61  4.19  2.75  2.45
W  1.26  1.21  1.51  1.61  1.47  2.62  0.97  1.51  1.53  0.55  1.31  0.87
X  0.05  0.22  0.09  0.05  0.19  0.28  0.04  0.00  0.00  0.00  0.00  0.00
Y  1.72  1.87  1.04  2.56  1.71  1.43  0.08  0.14  0.11  0.00  0.24  0.09
Z  0.40  0.02  0.19  0.14  0.14  0.08  0.93  1.18  1.42  1.03  0.84  1.05;

end;
11.8 Exercises
Exercise 11.8.1.
(a) Show that if a learning set with a certain set of features is separable, then adding a feature
keeps the learning set separable.
(b) Give an example of a separable learning set with the property that removing a feature
makes the learning set nonseparable.
Exercise 11.8.4. In real-life applications, the learning set is usually not separable. One
way to deal with this problem is to allow some points in the data set to lie 'on the wrong
side' of the hyperplane, while penalizing each such violation heavily.
How can model (11.8) be generalized to take this into account?