
By Marti A. Hearst, University of California, Berkeley, [email protected]

Bernhard Scholkopf, GMD First

Is there anything worthwhile to learn about the new SVM algorithm, or does it fall into the category of "yet-another-algorithm," in which case readers should stop here and save their time for something more useful? In this short overview, I will try to argue that studying support-vector learning is very useful in two respects. First, it is quite satisfying from a theoretical point of view: SV learning is based on some beautifully simple ideas and provides a clear intuition of what learning from examples is about. Second, it can lead to high performances in practical applications.

In the following sense can the SV algorithm be considered as lying at the intersection of learning theory and practice: for certain simple types of algorithms, statistical learning theory can identify rather precisely the factors that need to be taken into account to learn successfully. Real-world applications, however, often mandate the use of more complex models and algorithms (such as neural networks) that are much harder to analyze theoretically. The SV algorithm achieves both. It constructs models that are complex enough: it contains a large class of neural nets, radial basis function (RBF) nets, and polynomial classifiers as special cases. Yet it is simple enough to be analyzed mathematically, because it can be shown to correspond to a linear method in a high-dimensional feature space nonlinearly related to input space. Moreover, even though we can think of it as a linear algorithm in a high-dimensional space, in practice it does not involve any computations in that high-dimensional space. By the use of kernels, all necessary computations are performed directly in input space. This is the characteristic twist of SV methods: we are dealing with complex algorithms for nonlinear pattern recognition,1 regression,2 or feature extraction,3 but for the sake of analysis and algorithmics, we can pretend that we are working with a simple linear algorithm.

I will explain the gist of SV methods by describing their roots in learning theory, the optimal hyperplane algorithm, the kernel trick, and SV function estimation. For details and further references, see Vladimir Vapnik's authoritative treatment,2 the collection my colleagues and I have put together,4 and the SV Web page at http://svm.first.gmd.de.

Learning pattern recognition from examples

For pattern recognition, we try to estimate a function f: R^N -> {+1, -1} using training data, that is, N-dimensional patterns x_i and class labels y_i, such that f will correctly classify new examples (x, y); that is, f(x) = y for examples (x, y) that were generated from the same underlying probability distribution P(x, y) as the training data. If we put no restriction on the class of functions from which we choose our estimate f, however, even a function that does well on the training data (for example, by satisfying f(x_i) = y_i; here and below, the index i is understood to run over 1, ..., l) need not generalize well to unseen examples. Suppose we know nothing additional about f (for example, about its smoothness). Then the values on the training patterns carry no information whatsoever about values on novel patterns. Hence learning is impossible, and minimizing the training error does not imply a small expected test error.

Statistical learning theory, or VC (Vapnik-Chervonenkis) theory, shows that it is crucial to restrict the class of functions that the learning machine can implement to one with a capacity that is suitable for the amount of available training data.

Hyperplane classifiers

To design learning algorithms, we thus must come up with a class of functions whose capacity can be computed. SV classifiers are based on the class of hyperplanes

(w . x) + b = 0,  w in R^N, b in R,  (2)

corresponding to decision functions

f(x) = sign((w . x) + b).  (3)

We can show that the optimal hyperplane, defined as the one with the maximal margin of separation between the two classes (see Figure 1), has the lowest capacity. It can be uniquely constructed by solving a constrained quadratic optimization problem whose solution w has an expansion w = sum_i v_i x_i in terms of a subset of training patterns that lie on the margin (see Figure 1). These training patterns, called support vectors, carry all relevant information about the classification problem.

Figure 1. A separable classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway. There is a weight vector w and a threshold b such that y_i ((w . x_i) + b) > 0. Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w . x_i) + b| = 1, we obtain a form (w, b) of the hyperplane with y_i((w . x_i) + b) >= 1. Note that the margin, measured perpendicularly to the hyperplane, equals 2/||w||. To maximize the margin, we thus have to minimize ||w|| subject to y_i((w . x_i) + b) >= 1.

Omitting the details of the calculations, there is just one crucial property of the algorithm that we need to emphasize: both the quadratic programming problem and the final decision function f(x) = sign(sum_i v_i (x . x_i) + b) depend only on dot products between patterns. This is precisely what lets us generalize to the nonlinear case.

Feature spaces and kernels

Figure 2 shows the basic idea of SV machines, which is to map the data into some other dot product space (called the feature space) F via a nonlinear map

Phi: R^N -> F,  (4)

and perform the above linear algorithm in F. As I've noted, this only requires the evaluation of dot products,

k(x, y) := (Phi(x) . Phi(y)).  (5)

Figure 2. The idea of SV machines: map the training data nonlinearly into a higher-dimensional feature space via Phi, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

Clearly, if F is high-dimensional, the right-hand side of Equation 5 will be very expensive to compute. In some cases, however, there is a simple kernel k that can be evaluated efficiently. For instance, the polynomial kernel

k(x, y) = (x . y)^d  (6)

can be shown to correspond to a map Phi into the space spanned by all products of exactly d dimensions of R^N. For d = 2 and x, y in R^2, for example, we have

(x . y)^2 = (Phi(x) . Phi(y)),  (7)

defining Phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2). More generally, we can prove that for every kernel that gives rise to a positive matrix (k(x_i, x_j)), we can construct a map Phi such that Equation 5 holds.

Besides Equation 6, SV practitioners use radial basis function kernels

k(x, y) = exp(-||x - y||^2 / (2 sigma^2))  (8)

and sigmoid kernels (with gain kappa and offset Theta)

k(x, y) = tanh(kappa (x . y) + Theta).  (9)
SVMs

We now have all the tools to construct nonlinear classifiers (see Figure 2). To this end, we substitute Phi(x_i) for each training example x_i, and perform the optimal hyperplane algorithm in F. Because we are using kernels, we will thus end up with a nonlinear decision function of the form

f(x) = sign(sum_{i=1}^{l} v_i k(x, x_i) + b).  (10)

The parameters v_i are computed as the solution of a quadratic programming problem.

In input space, the hyperplane corresponds to a nonlinear decision function whose form is determined by the kernel (see Figures 3 and 4).

The algorithm I've described thus far has a number of astonishing properties:

* It is based on statistical learning theory,
* It is practical (as it reduces to a quadratic programming problem with a unique solution), and
* It contains a number of more or less heuristic algorithms as special cases: by the choice of different kernel functions, we obtain different architectures (Figure 4), such as polynomial classifiers (Equation 6), RBF classifiers (Equation 8 and Figure 3), and three-layer neural nets (Equation 9).

The most important restriction up to now has been that we were only considering the case of classification. However, a generalization to regression estimation, that is, to y in R, can be given. In this case, the algorithm tries to construct a linear function in the feature space such that the training points lie within a distance epsilon > 0. Similar to the pattern-recognition case, we can write this as a quadratic programming problem in terms of kernels. The nonlinear regression estimate takes the form

f(x) = sum_{i=1}^{l} v_i k(x_i, x) + b.  (11)

To apply the algorithm, we either specify epsilon a priori, or we specify an upper bound on the fraction of training points allowed to lie outside of a distance epsilon from the regression estimate (asymptotically, the number of SVs), and the corresponding epsilon is computed automatically.5

Figure 3. Example of an SV classifier found by using a radial basis function kernel (Equation 8). Circles and disks are two classes of training examples; the solid line is the decision surface; the support vectors found by the algorithm lie on, or between, the dashed lines. Colors code the modulus of the argument sum_i v_i k(x, x_i) + b of the decision function in Equation 10.

Figure 4. Architecture of SV methods. The input x and the support vectors x_i (in this example: digits) are nonlinearly mapped (by Phi) into a feature space F, where dot products are computed. By the use of the kernel k, these two layers are in practice computed in one single step. The results are linearly combined by weights v_i, found by solving a quadratic program (in pattern recognition, v_i = y_i alpha_i; in regression estimation, v_i = alpha_i* - alpha_i)2 or an eigenvalue problem (in kernel PCA3). The linear combination is fed into the function sigma (in pattern recognition, sigma(x) = sign(x + b); in regression estimation, sigma(x) = x + b; in kernel PCA, sigma(x) = x).

Current developments and open issues

Chances are that those readers who are still with me might be interested to hear how researchers have built on the above, applied the algorithm to real-world problems, and developed extensions. In this respect, several fields have emerged.

* Training methods for speeding up the quadratic program, such as the one described later in this installment of Trends & Controversies by John Platt.
* Speeding up the evaluation of the decision function is of interest in a variety of applications, such as optical-character recognition.6
* The choice of kernel functions, and hence of the feature space to work in, is of both theoretical and practical interest. It determines both the functional form of the estimate and, via the objective function of the quadratic program, the type of regularization that is used to constrain the estimate.7,8 However, even though different kernels lead to different types of learning machines, the choice of kernel seems to be less crucial than it may appear at first sight. In OCR applications, the kernels (Equations 6, 9, and 8) lead to very similar performance and to strongly overlapping sets of support vectors.
* Although the use of SV methods in applications has only recently begun, application developers have already reported state-of-the-art performances in a variety of applications in pattern recognition, regression estimation, and time series prediction. However, it is probably fair to say that we are still missing an application where SV methods significantly outperform any other available algorithm or solve a problem that has so far been impossible to tackle. For the latter, SV methods for solving inverse problems are a promising candidate.9 Sue Dumais and Edgar Osuna describe promising applications in this discussion.
* Using kernels for other algorithms emerges as an exciting opportunity for developing new learning techniques. The kernel method for computing dot products in feature spaces is not restricted to SV machines. We can use it to derive nonlinear generalizations of any algorithm that can be cast in terms of dot products. As a mere start, we decided to apply this idea to one of the most widely used algorithms for data analysis, principal component analysis. This leads to kernel PCA,3 an algorithm that performs nonlinear PCA by carrying out linear PCA in feature space.

The method consists of solving a linear eigenvalue problem for a matrix whose elements are computed using the kernel function. The resulting feature extractors have the same architecture as SV machines (see Figure 4). A number of researchers have since started to "kernelize" various other linear algorithms.
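To make the kernel-PCA recipe concrete, here is a minimal sketch in Python/NumPy. The centering and normalization steps follow the standard formulation implied by the description above, and the data are made up; this is an illustration, not the authors' code.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    l = X.shape[0]
    K = rbf_kernel_matrix(X, sigma)
    # Center the data in feature space (equivalent to centering the kernel matrix).
    one = np.ones((l, l)) / l
    Kc = K - one @ K - K @ one + one @ K @ one
    # Linear eigenvalue problem for the centered kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lambdas = eigvecs[:, idx], eigvals[idx]
    # Normalize so the corresponding feature-space eigenvectors have unit length.
    alphas = alphas / np.sqrt(lambdas)
    # Projections of the training points onto the nonlinear principal components.
    return Kc @ alphas

X = np.random.default_rng(0).normal(size=(30, 2))   # made-up 2-D data
print(kernel_pca(X).shape)                           # (30, 2)
```

The resulting feature extractor has the same layered structure as Figure 4: kernel evaluations against the training points, followed by a linear combination with the eigenvector coefficients.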

References
1. B.E. Boser, I.M. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. Fifth Ann. Workshop Computational Learning Theory, ACM Press, New York, 1992, pp. 144-152.
2. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
3. B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, 1998, pp. 1299-1319.
4. B. Scholkopf, C.J.C. Burges, and A.J. Smola, Advances in Kernel Methods - Support Vector Learning, to appear, MIT Press, Cambridge, Mass., 1998.
5. B. Scholkopf et al., "Support Vector Regression with Automatic Accuracy Control," to be published in Proc. Eighth Int'l Conf. Artificial Neural Networks, Perspectives in Neural Computing, Springer-Verlag, Berlin, 1998.
6. C.J.C. Burges, "Simplified Support Vector Decision Rules," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 71-77.
7. A. Smola and B. Scholkopf, "From Regularization Operators to Support Vector Kernels," Advances in Neural Information Processing Systems 10, M. Jordan, M. Kearns, and S. Solla, eds., MIT Press, 1998.
8. F. Girosi, An Equivalence between Sparse Approximation and Support Vector Machines, AI Memo No. 1606, MIT, Cambridge, Mass., 1997.
9. J. Weston et al., Density Estimation Using Support Vector Machines, Tech. Report CSD-TR-97-23, Royal Holloway, Univ. of London, 1997.

Susan Dumais, Decision Theory and Adaptive Systems Group, Microsoft Research

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization, the assignment of natural-language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Machine-learning methods, including SVMs, have tremendous potential for helping people more effectively organize electronic resources.

Today, most text categorization is done by people. We all save hundreds of files, e-mail messages, and URLs in folders every day. We are often asked to choose keywords from an approved set of indexing terms for describing our technical publications. On a much larger scale, trained specialists assign new items to categories in large taxonomies such as the Dewey Decimal or Library of Congress subject headings, Medical Subject Headings (MeSH), or Yahoo!'s Internet directory. Between these two extremes, people organize objects into categories to support a wide variety of information-management tasks, including information routing/filtering/push, identification of objectionable materials or junk mail, structured search and browsing, and topic identification for topic-specific processing operations.

Human categorization is very time-consuming and costly, thus limiting its applicability, especially for large or rapidly changing collections. Consequently, interest is growing in developing technologies for (semi)automatic text categorization. Rule-based approaches similar to those employed in expert systems have been used, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive-learning techniques to automatically construct classifiers using labeled training data. Researchers have applied a growing number of learning techniques to text categorization, including multivariate regression, nearest-neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks.1,2 Recently, my colleagues and I and others have used SVMs for text categorization with very promising results.3,4 In this essay, I briefly describe the results of experiments in which we use SVMs to classify newswire stories from Reuters.4

Learning text categorizers

The goal of automatic text-categorization systems is to assign new items to one or more of a set of predefined categories on the basis of their textual content. Optimal categorization functions can be learned from labeled training examples.

Inductive learning of classifiers. A classifier is a function that maps an input attribute vector, x = (x1, x2, x3, ..., xn), to the confidence that the input belongs to a class, that is, f(x) = confidence(class). In the case of text classification, the attributes are words in the document and the classes are the categories of interest (for example, Reuters categories include "interest," "earnings," and "grain").

Example classifiers for the Reuters category interest are

* if (interest AND rate) OR (quarterly), then confidence ("interest" category) = 0.9
* confidence ("interest" category) = 0.3*interest + 0.4*rate + 0.7*quarterly

The key idea behind SVMs and other inductive-learning approaches is to use a training set of labeled instances to learn the classification function automatically. SVM classifiers resemble the second example above: a vector of learned feature weights. The resulting classifiers are easy to construct and update, depend only on information that is easy for people to provide (that is, examples of items that are in or out of categories), and allow users to smoothly trade off precision and recall depending on their task.

Text representation and feature selection. Each document is represented as a vector of words, as is typically done in information retrieval.5 For most text-retrieval applications, the entries in the vector are weighted to reflect the frequency of terms in documents and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (a word either occurs or does not occur in a document) are often used instead.

Text collections containing millions of unique terms are quite common. Thus, for both efficiency and efficacy, feature selection is widely used when applying machine-learning methods to text categorization. To reduce the number of features, we first remove features based on overall frequency counts, and then select a small number of features based on their fit to categories. We use the mutual information between each feature and a category to further reduce the feature space. These much smaller document descriptions then serve as input to the SVM.

Learning SVMs. We used simple linear SVMs because they provide good generalization accuracy and are fast to learn. Thorsten Joachims has explored two classes of nonlinear SVMs, polynomial classifiers and radial basis functions, and observed only small benefits compared to linear models.3 We used John Platt's Sequential Minimal Optimization method6 (described in a later essay) to learn the vector of feature weights, w. Once the weights are learned, new items are classified by computing x . w, where w is the vector of learned weights and x is the binary vector representing a new document. We also learned two parameters of a sigmoid function to transform the output of the SVM to probabilities.

An example: Reuters

The Reuters collection is a popular one for text-categorization research and is publicly available at http://www.research.att.com/~lewis/reuters21578.html. We used the 12,902 Reuters stories that have been classified into 118 categories. Following the ModApte split, we used 75% of the stories (9,603 stories) to build classifiers and the remaining 25% (3,299 stories) to test the accuracy of the resulting models in reproducing the manual category assignments. Stories can be assigned to more than one category.

Text files are automatically processed to produce a vector of words for each document. Eliminating words that appear in only one document and then selecting the 300 words with highest mutual information with each category reduces the number of features. These 300-element binary feature vectors serve as input to the SVM.
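The frequency-then-mutual-information selection described above can be sketched as follows (Python/NumPy; the tiny corpus, the MI formulation, and the value of k are illustrative assumptions, not the authors' code):

```python
import numpy as np

def mutual_information(term_present, in_category):
    """MI between a binary term-occurrence feature and a binary category label."""
    mi = 0.0
    for t in (0, 1):
        for c in (0, 1):
            p_tc = np.mean((term_present == t) & (in_category == c))
            p_t, p_c = np.mean(term_present == t), np.mean(in_category == c)
            if p_tc > 0:
                mi += p_tc * np.log2(p_tc / (p_t * p_c))
    return mi

# Made-up binary document-term matrix (rows: documents, columns: terms) and labels.
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([1, 1, 0, 0])          # in the "interest" category or not

# Keep the k terms with the highest mutual information with the category.
k = 2
scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
selected = np.argsort(scores)[::-1][:k]
X_reduced = X[:, selected]          # reduced binary vectors fed to the SVM
print(selected, X_reduced)
```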

A separate classifier w is learned for each category. Using SMO to train the linear SVM takes an average of 0.26 CPU seconds per category (averaged over 118 categories) on a 266-MHz Pentium II running Windows NT. Other learning methods are 20 to 50 times slower. New instances are classified by computing a score for each document (x . w) and comparing the score with a learned threshold. New documents exceeding the threshold are said to belong to the category.
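A hedged sketch of that classification rule (Python/NumPy): the weights, threshold, and sigmoid parameters below are invented, and the sigmoid is the standard form implied by the text, not the authors' fitted values.

```python
import numpy as np

w = np.array([0.70, 0.67, 0.63, -0.33, -0.71])    # learned per-category weights (illustrative)
threshold = 0.0                                    # learned decision threshold
A, B = -2.0, 0.5                                   # learned sigmoid parameters (illustrative)

x = np.array([1, 0, 1, 0, 0])                      # binary vector for a new document

score = x @ w                                      # SVM output for this document
in_category = score > threshold                    # hard categorization decision
probability = 1.0 / (1.0 + np.exp(A * score + B))  # sigmoid-calibrated probability
print(score, in_category, probability)
```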
The learned SVM classifiers are intuitively reasonable. The weight vector for the category "interest" includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46), with large positive weights, and the words group (-.24), year (-.25), sees (-.33), world (-.35), and dlrs (-.71), with large negative weights.

As is typical in evaluating text categorization, we measure classification accuracy using the average of precision and recall (the so-called breakeven point). Precision is the proportion of items placed in the category that are really in the category, and recall is the proportion of items in the category that are actually placed in the category. Table 1 summarizes microaveraged breakeven performance for five learning algorithms my colleagues and I explored, for the 10 most frequent categories as well as the overall score for all 118 categories.4

Table 1. Break-even performance (%) for five learning algorithms.

Category     Findsim  Naive Bayes  Bayes Nets  Trees  LinearSVM
earn         92.9     95.9         95.8        97.8   98.0
acq          64.7     87.8         88.3        89.7   93.6
money-fx     46.7     56.6         58.8        66.2   74.5
grain        67.5     78.8         81.4        85.0   94.6
crude        70.1     79.5         79.6        85.0   88.9
trade        65.1     63.9         69.0        72.5   75.9
interest     63.4     64.9         71.3        67.1   77.7
ship         49.2     85.4         84.4        74.2   85.6
wheat        68.9     69.7         82.7        92.5   91.8
corn         48.2     65.3         76.4        91.8   90.3
Avg. top 10  64.6     81.5         85.0        88.4   92.0
Avg. all     61.7     75.2         80.0        N/A    87.0

Linear SVMs were the most accurate method, averaging 91.3% for the 10 most frequent categories and 85.5% over all 118 categories. These results are consistent with Joachims' results in spite of substantial differences in text preprocessing, term weighting, and parameter selection, suggesting that the SVM approach is quite robust and generally applicable for text-categorization problems.3

Figure 5 shows a representative ROC curve for the category "grain." We generate this curve by varying the decision threshold to produce higher precision or higher recall, depending on the task. The advantages of the SVM can be seen over the entire recall-precision space.

Figure 5. ROC curve.
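The breakeven measure used in Table 1 can be made concrete with a short sketch (Python; the predicted and actual document sets are invented, and "breakeven" is computed here simply as the average of precision and recall, as the text describes):

```python
def precision_recall_breakeven(predicted, actual):
    """predicted/actual are sets of document ids assigned to a category."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall, (precision + recall) / 2.0

predicted = {1, 2, 3, 5}      # documents the classifier placed in "grain"
actual = {1, 2, 4, 5, 6}      # documents that truly belong to "grain"
print(precision_recall_breakeven(predicted, actual))   # (0.75, 0.6, 0.675)
```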
MIT Press, Cambridge, Mass., 1998.
Linear SVMs were the most accurate common structuralfeatures such as glasses or
method, averaging 91.3% for the 10 most a moustache, and light-source distribution.
frequent categories and 85.5% over all 118 This system works by testing candidate
categories. These results are consistent image locations for local patterns that ap-
with Joachims’ results in spite of substan- pear like faces, using a classification proce-
tial differences in text preprocessing, term Edgar Osuna, MIT Centerfor Biologicaland dure that determines whether a given local
weighting, and parameter selection, sug- Computational Learning and Operations Re- image pattern is a face. Therefore, our ap-
gesting that the SVM approach is quite search Center proach comes at the face-detection problem
robust and generally applicable for text- This essay introduces an SMV applica- as a classification problem given by exam-
categorization problem^.^ tion for detecting vertically oriented and ples of two classes: faces and nonfaces.
Figure 5 shows a representative ROC unoccluded frontal views of human faces in
curve for the category “grain.” We generate gray-level images. This application handles Previous systems
this curve by varying the decision threshold faces over a wide range of scales and works Researchers have approached the face-
to produce higher precision or higher re- under different lighting conditions, even detection problem with different techniques
call, depending on the task. The advantages with moderately strong shadows. in the last few years, including neural net-
of the SVM can be seen over the entire We can define the face-detection prob- detection of face features and use
recall-precision space. lem as follows. Given as input an arbitrary of geometrical constraint^,^ density estima-
image, which could be a digitized video tion of the training data,4labeled graphs:
Summary signal or a scanned photograph, determine and clustering and distribution-based mod-
In summary, inductive learning methods whether there are any human faces in the eling. 6,7 The results of Kah-Kay Sung and

The results of Kah-Kay Sung and Tomaso Poggio6,7 and Henry Rowley2 reflect systems with very high detection rates and low false-positive detection rates.

Sung and Poggio use clustering and distance metrics to model the distribution of the face and nonface manifold and a neural network to classify a new pattern given the measurements. The key to the quality of their result is the clustering and use of combined Mahalanobis and Euclidean metrics to measure the distance from a new pattern and the clusters. Other important features of their approach are the use of nonface clusters and a bootstrapping technique to collect important nonface patterns. However, this approach does not provide a principled way to choose some important free parameters such as the number of clusters it uses.

Similarly, Rowley and his colleagues have used problem information in the design of a retinally connected neural network trained to classify face and nonface patterns. Their approach relies on training several neural networks emphasizing sets of the training data to obtain different sets of weights. Then, their approach uses different schemes of arbitration between them to reach a final answer.

Our SVM approach to the face-detection system uses no prior information to obtain the decision surface, this being an interesting property that can be exploited in using the same approach for detecting other objects in digital images.

The SVM face-detection system

This system detects faces by exhaustively scanning an image for face-like patterns at many possible scales, by dividing the original image into overlapping subimages and classifying them using an SVM to determine the appropriate class: face or nonface. The system handles multiple scales by examining windows taken from scaled versions of the original image.
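A minimal sketch of that exhaustive multi-scale scan (Python/NumPy; the window stride, scale factors, rescaling method, and the placeholder classifier are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def classify_window(window):
    """Placeholder for the trained face/nonface SVM; returns +1 for 'face'."""
    return 1 if window.mean() > 0.5 else -1

def rescale(image, factor):
    """Crude nearest-neighbor rescaling, enough for a sketch."""
    rows = (np.arange(int(image.shape[0] * factor)) / factor).astype(int)
    cols = (np.arange(int(image.shape[1] * factor)) / factor).astype(int)
    return image[np.ix_(rows, cols)]

def detect_faces(image, window=19, stride=2, scales=(1.0, 0.8, 0.64)):
    detections = []
    for s in scales:                       # examine windows from scaled versions
        scaled = rescale(image, s)
        for r in range(0, scaled.shape[0] - window + 1, stride):
            for c in range(0, scaled.shape[1] - window + 1, stride):
                patch = scaled[r:r + window, c:c + window]
                if classify_window(patch) == 1:
                    # report the bounding box in original-image coordinates
                    detections.append((int(r / s), int(c / s), int(window / s)))
    return detections

image = np.random.default_rng(1).random((60, 80))   # made-up gray-level image
print(len(detect_faces(image)))
```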
Clearly, the major use of SVMs is in the classification step, which is the most critical part of this work. Figure 6 gives a geometrical interpretation of the way SVMs work in the context of face detection.

Figure 6. Geometrical interpretation of how the SVM separates the face and nonface classes. The patterns are real support vectors obtained after training the system. Notice the small number of total support vectors and the fact that a higher proportion of them correspond to nonfaces.

More specifically, this system works as follows. We train on a database of face and nonface 19x19 pixel patterns, assigned to classes +1 and -1, respectively, using the support vector algorithm. This process uses a second-degree homogeneous polynomial kernel function and an upper bound C = 200 to obtain a perfect training error.

To compensate for certain sources of image variation, we perform some preprocessing of the data (a brief sketch of these steps follows the list):

* Masking: A binary pixel mask removes some pixels close to the window-pattern boundary, allowing a reduction in the dimensionality of the input space from 19 x 19 = 361 to 283. This step reduces background patterns that introduce unnecessary noise in the training process.
* Illumination gradient correction: The process subtracts a best-fit brightness plane from the unmasked window pixel values, allowing reduction of light and heavy shadows.

* Histogram equalization: Our process performs a histogram equalization over the patterns to compensate for differences in illumination brightness, different cameras' response curves, and so on.
Once the process obtains a decision surface through training, it uses the runtime system over images that do not contain faces, storing misclassifications so that they can be used as negative examples in subsequent training phases. Images of landscapes, trees, buildings, and rocks, for example, are good sources of false positives because of the many different textured patterns they contain. This bootstrapping step, which Sung and Poggio6 successfully used, is very important in the context of a face detector that learns from examples:

* Although negative examples are abundant, negative examples that are useful from a learning standpoint are very difficult to characterize and define.
* By approaching the problem of object detection, and in this case of face detection, by using the paradigm of binary pattern classification, the two classes, object and nonobject, are not equally complex. The nonobject class is broader and richer, and therefore needs more examples to get an accurate definition that separates it from the object class.

Figure 7 shows an image used for bootstrapping with some misclassifications that later served as negative examples.

Figure 7. False detections obtained with the first version of the system. These false positives later served as nonface examples in the training process.

After training the SVM, using an implementation of the algorithm my colleagues and I describe elsewhere,8 we incorporate it as the classifier in a runtime system very similar to the one used by Sung and Poggio.6,7 It performs the following operations:

* Rescale the input image several times;
* Cut 19x19 window patterns out of the scaled image;
* Preprocess the window using masking, light correction, and histogram equalization;
* Classify the pattern using the SVM; and
* If the class corresponds to a face, draw a rectangle around the face in the output image.

Figure 8 reflects the system's architecture at runtime.

Figure 8. System architecture at runtime: the input image is rescaled, windows are extracted and preprocessed (light correction and histogram equalization), a quick SVM discard stage filters candidates, and the complete face/nonface SVM classifier makes the final decision. (Used with permission.)

Experimental results on static images

To test the runtime system, we used two sets of images. Set A contained 313 high-quality images with the same number of faces. Set B contained 23 images of mixed quality, with a total of 155 faces. We tested both sets, first using our system and then the one by Sung and Poggio.6,7 To give true meaning to the number of false positives obtained, note that set A involved 4,669,960 pattern windows, while set B involved 5,383,682. Table 2 compares the two systems.

Table 2. Performance of the SVM face-detection system.

        Test set A                      Test set B
        Detect rate (%)  False alarms   Detect rate (%)  False alarms
SVM     97.1             4              74.2             20
Sung    94.6             2              74.2             11

Figure 9 presents some output images of our system, which were not used during the training phase of the system.

Figure 9. Results from our face-detection system.

Extension to a real-time system

The system I've discussed so far spends approximately 6 seconds (SparcStation 20) on a 320x240 pixel gray-level image. Although this is faster than most previous systems, it is not fast enough for use as a runtime system. To build a runtime version of the system, we took the following steps:

* We ported the C code developed on the Sun environment to a Windows NT Pentium 200-MHz computer and added a Matrox RGB frame grabber and a Hitachi three-chip color camera. We used no special hardware to speed up the computational burden of the system.
* We collected several color images with faces, from which we extracted areas with skin and nonskin pixels. We collected a dataset of 6,000 examples.

* We trained an SVM classifier using the skin and nonskin data. The input variables were normalized green and red values, g/(r+g+b) and r/(r+g+b), respectively. Figure 10 presents an image captured by the system and its corresponding skin-detection output.
* We coded a very primitive motion detector based on thresholded frame differencing to identify areas of movement and use them as the focus of attention. Motion was not a requirement to be detected by the system, because every so many frames (20 in the current implementation) we skipped this step and scanned the whole image.
* We put together a hierarchical system using as a first step the motion-detection module. We used the SVM skin-detection system as a second layer to identify candidate locations of faces. We used the face/nonface SVM classifier described earlier over the gray-level version of the candidate locations.

The whole system achieves rates of 4 to 5 frames per second. Figure 11 presents a couple of images captured by our PC-based Color Real-Time face-detection system.

Figure 10. An example of the skin-detection module implemented using SVMs.

Figure 11. Face detection on the PC-based Color Real-Time system.

References
1. G. Burel and D. Carel, "Detection and Localization of Faces on Digital Images," Pattern Recognition Letters, Vol. 15, 1994, pp. 963-967.
2. H. Rowley, S. Baluja, and T. Kanade, Human Face Detection in Visual Scenes, Tech. Report 95-158, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, 1995.
3. G. Yang and T. Huang, "Human Face Detection in a Complex Background," Pattern Recognition, Vol. 27, 1994, pp. 53-63.
4. B. Moghaddam and A. Pentland, Probabilistic Visual Learning for Object Detection, Tech. Report 326, MIT Media Laboratory, Cambridge, Mass., 1995.
5. N. Kruger, M. Potzsch, and C. von der Malsburg, Determination of Face Position and Pose with Learned Representation Based on Labeled Graphs, Tech. Report 96-03, Ruhr-Universitat, 1996.
6. K. Sung, Learning and Example Selection for Object and Pattern Detection, PhD thesis, MIT AI Lab and Center for Biological and Computational Learning, 1995.
7. K. Sung and T. Poggio, Example-Based Learning for View-Based Human Face Detection, A.I. Memo 1521, C.B.C.L. Paper 112, Dec. 1994.
8. E. Osuna, R. Freund, and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. IEEE Workshop on Neural Networks and Signal Processing, IEEE Press, Piscataway, N.J., 1997.
John Platt, Microsoft Research

In the past few years, SVMs have proven to be very effective in real-world classification tasks. This installment of Trends & Controversies describes two of these tasks: face recognition and text categorization. However, many people have found the numerical implementation of SVMs to be intimidating. In this essay, I will attempt to demystify the implementation of SVMs. As a first step, if you are interested in implementing an SVM, I recommend reading Chris Burges' tutorial on SVMs,2 available at http://svm.research.bell-labs.com/SVMdoc.html.

An SVM is a parameterized function whose functional form is defined before training. Training an SVM requires a labeled training set, because the SVM will fit the function from a set of examples. The training set consists of a set of N examples. Each example consists of an input vector, x_i, and a label, y_i, which describes whether the input vector is in a predefined category. There are N free parameters in an SVM trained with N examples. These parameters are called alpha_i. To find these parameters, you must solve a quadratic programming (QP) problem:

minimize  (1/2) sum_{i,j=1}^{N} alpha_i Q_{ij} alpha_j - sum_{i=1}^{N} alpha_i,
subject to  0 <= alpha_i <= C  and  sum_{i=1}^{N} y_i alpha_i = 0,

where Q is an N x N matrix that depends on the training inputs x_i, the labels y_i, and the functional form of the SVM. We call this problem quadratic programming because the function to be minimized (called the objective function) depends on the alpha_i quadratically, while alpha_i only appears linearly in the constraints (see http://www-c.mcs.anl.gov/home/otc/Guide/OptWeb/continuous/constrained/qprog). Definitions and applications of x_i, y_i, alpha_i, and Q appear in the tutorial by Burges.2
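As a concrete, hedged illustration of what Q and the objective function look like, here is a small Python/NumPy sketch for a kernel classifier, using the standard dual form given in Burges' tutorial (Q_ij = y_i y_j k(x_i, x_j)); the data, C, and the alpha values are invented:

```python
import numpy as np

def linear_kernel(a, b):
    return np.dot(a, b)

# Made-up training set: inputs x_i and labels y_i in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j k(x_i, x_j): the standard dual form described in Burges' tutorial.
Q = np.array([[y[i] * y[j] * linear_kernel(X[i], X[j]) for j in range(N)]
              for i in range(N)])

def objective(alpha):
    """The QP objective: 0.5 * alpha^T Q alpha - sum(alpha)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

C = 1.0
alpha = np.array([0.2, 0.0, 0.1, 0.1])            # a feasible point, not the optimum
assert np.all((alpha >= 0) & (alpha <= C))        # box constraint (the "hypercube")
assert abs(np.dot(y, alpha)) < 1e-12              # equality constraint (the "hyperplane")
print(objective(alpha))
```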

Conceptually, the SVM QP problem is to find a minimum of a bowl-shaped objective function. The search for the minimum is constrained to lie within a cube and on a plane. The search occurs in a high-dimensional space, so that the bowl is high-dimensional, the cube is a hypercube, and the plane is a hyperplane. For most typical SVM functional forms, the matrix Q has special properties, so that the objective function is either bowl-shaped (positive definite) or has flat-bottomed troughs (positive semidefinite), but is never saddle-shaped (indefinite). Thus, there is either a unique minimum or a connected set of equivalent minima. An SVM QP has a definite termination (or optimality) condition that describes these minima. We call these optimality conditions the Karush-Kuhn-Tucker (KKT) conditions, and they simply describe the set of alpha_i that are constrained minima.3

The values of alpha_i also have an intuitive explanation. There is one alpha_i for each training example. Each alpha_i determines how much each training example influences the SVM function. Most of the training examples do not affect the SVM function, so most of the alpha_i are 0.

Because of its simple form, you might expect the solution to the SVM QP problem to be quite simple. Unfortunately, for real-world problems, the matrix Q can be enormous: it has a dimension equal to the number of training examples. A training set of 60,000 examples will yield a Q matrix with 3.6 billion elements, which cannot easily fit into the memory of a standard computer.

We have at least two different ways of solving such gigantic QP problems. First, there are QP methods that use sophisticated data structures.4 These QP methods do not require the storage of the entire Q matrix, because they do not need to access the rows or columns of Q that correspond to those alpha_i that are at 0 or at C. Deep in the inner loop, these methods only perform dot products between rows or columns of Q and a vector, rather than performing an entire matrix-vector multiplication.

Decomposing the QP problem

The other method for attacking the large-scale SVM QP problem is to decompose the large QP problem into a series of smaller QP problems. Thus, the selection of submatrices of Q happens outside of the QP package, rather than inside. Consequently, the decomposition method is compatible with standard QP packages.

Vapnik first suggested the decomposition approach in a method that has since been known as chunking.1 The chunking algorithm exploits the fact that the value of the objective function is the same if you remove the rows and columns of the matrix Q that correspond to zero alpha_i. Therefore, the large QP problem can break down into a series of smaller QP problems, whose ultimate goal is to identify all of the nonzero alpha_i and discard all of the zero alpha_i. At every step, chunking solves a QP problem that consists of the following alpha_i: every nonzero alpha_i from the last step, and the alpha_i that correspond to the M worst violations of the KKT conditions, for some value of M (see Figure 12a). The size of the QP subproblem tends to grow with time. At the last step, the chunking approach has identified the entire set of nonzero alpha_i; hence, the last step solves the overall QP problem.

Chunking reduces the Q matrix's dimension from the number of training examples to approximately the number of nonzero alpha_i. However, chunking still might not handle large-scale training problems, because even this reduced matrix might not fit into memory. Of course, we can combine chunking with the sophisticated QP methods described above, which do not require full storage of a matrix.

In 1997, Edgar Osuna and his colleagues suggested a new strategy for solving the SVM QP problem.5 Osuna showed that the large QP problem can be broken down into a series of smaller QP subproblems. As long as at least one alpha_i that violates the KKT conditions is added to the previous subproblem, each step reduces the objective function and maintains all of the constraints. Therefore, a sequence of QP subproblems that always add at least one KKT violator will asymptotically converge.

Osuna suggests keeping a constant-size matrix for every QP subproblem, which implies adding and deleting the same number of examples at every step5 (see Figure 12b). Using a constant-size matrix allows the training of arbitrarily sized datasets. The algorithm in Osuna's paper suggests adding one example and deleting one example at every step. Such an algorithm converges, although it might not be the fastest possible algorithm. In practice, researchers add and delete multiple examples according to various unpublished heuristics. Typically, these heuristics add KKT violators at each step and delete those alpha_i that are either 0 or C. Joachims has published an algorithm for adding and deleting examples from the QP steps, which rapidly decreases the objective function.6

All of these decomposition methods require a numerical QP package. Such packages might be expensive for commercial users (see the "Where to get the programs" section). Writing your own efficient QP package is difficult without a numerical-analysis background.

Sequential minimal optimization

Sequential minimal optimization is an alternative method that can decompose the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps.3,7 SMO decomposes the overall QP problem into QP subproblems, identically to Osuna's method. Unlike the previous decomposition heuristics, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two elements of alpha_i, because the alpha_i must obey one linear equality constraint. At every step, SMO chooses two alpha_i to jointly optimize, finds the optimal values for these alpha_i, and updates the SVM to reflect the new optimal values (see Figure 12c).

Figure 12. Three alternative methods for training SVMs: (a) chunking, (b) Osuna's algorithm, and (c) SMO. There are three steps for each method. The horizontal thin line at every step represents the training set, while the thick boxes represent the alpha_i being optimized at that step. A given group of three lines corresponds to three training iterations, with the first iteration at the top.

SMO can solve for the two alpha_i analytically, thus avoiding numerical QP optimization entirely. The inner loop can be expressed in a short amount of C code, rather than by invoking an entire QP library routine. Even though more optimization subproblems are solved in the course of the algorithm, each subproblem is so fast that the overall QP problem can be solved quickly.
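A hedged sketch of that analytical two-variable update (Python/NumPy rather than C, for consistency with the other sketches in this document). The pair-selection heuristics and the update of the threshold b, which the full algorithm also needs, are omitted, and the tiny dataset is invented; see Platt's technical report for the complete pseudocode.

```python
import numpy as np

def kernel(a, b):
    return np.dot(a, b)

def svm_output(i, X, y, alpha, b):
    """f(x_i) = sum_j alpha_j y_j k(x_j, x_i) + b."""
    return sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(y))) + b

def smo_pair_update(i1, i2, X, y, alpha, b, C):
    """Analytically optimize alpha[i1] and alpha[i2], holding the others fixed."""
    a1, a2, y1, y2 = alpha[i1], alpha[i2], y[i1], y[i2]
    E1 = svm_output(i1, X, y, alpha, b) - y1
    E2 = svm_output(i2, X, y, alpha, b) - y2
    # Bounds L, H keep the pair inside the box while preserving sum_i y_i alpha_i = 0.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = kernel(X[i1], X[i1]) + kernel(X[i2], X[i2]) - 2 * kernel(X[i1], X[i2])
    if eta <= 0 or L == H:
        return alpha          # degenerate pair; the real algorithm picks another pair
    a2_new = np.clip(a2 + y2 * (E1 - E2) / eta, L, H)
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    alpha = alpha.copy()
    alpha[i1], alpha[i2] = a1_new, a2_new
    return alpha

# Tiny invented problem: two positive and two negative points.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b, C = np.zeros(4), 0.0, 1.0
alpha = smo_pair_update(0, 2, X, y, alpha, b, C)
print(alpha)                  # the equality constraint y . alpha = 0 still holds
```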
Because there are so many possible combinations of QP packages, decomposition heuristics, code optimizations, data structures, and benchmark problems, it is very difficult to determine which SVM algorithm (if any) is the most efficient. SMO has been compared to the standard chunking algorithm suggested by Burges in his tutorial.2,3 The QP algorithm used by this version of chunking is projected conjugate gradient (PCG). Table 3 compares the results for SMO versus PCG chunking. Both algorithms are coded in C++, share SVM evaluation code, are compiled with Microsoft Visual C++ version 5.0, and are run on a 266-MHz Pentium II with Windows NT and 128 Mbytes of memory. Both algorithms have inner loops that take advantage of input vectors that contain mostly zero entries (that is, sparse vectors).

Table 3. Five experiments comparing SMO to PCG chunking. The functional form of the SVM, training set size, CPU times, and scaling exponents are shown.

Experiment      Kernel      Training  SMO training     PCG chunking      SMO scaling  PCG chunking
                            set size  CPU time (sec.)  CPU time (sec.)   exponent     scaling exponent
Adult linear    Linear      11,221    17.0             20,711.3          1.9          3.1
Web linear      Linear      49,749    268.3            17,164.7          1.6          2.5
Adult Gaussian  Gaussian    11,221    781.4            11,910.6          2.1          2.9
Web Gaussian    Gaussian    49,749    3,863.5          23,877.6          1.7          2.0
MNIST           Polynomial  60,000    29,'1.0          33,109.0          N/A          N/A

For more details on this comparison, and for more experiments on synthetic datasets, please consult my upcoming publication.3 The Adult experiment is an income-prediction task and is derived from the UCI machine-learning benchmark.8 The Web experiment is a text-categorization task. The Adult and Web datasets are available at http://www.research.microsoft.com/~jplatt/smo.html. The MNIST experiment is an OCR benchmark available at http://www.research.att.com/~yann/ocr/mnist. The training CPU time is listed for both SMO and PCG chunking for the training set size shown in the table. The scaling exponent is the slope of a linear fit to a log-log plot of the training time versus the training set size. This scaling exponent varies with the dataset used. The empirical worst-case scaling for SMO is quadratic, while the empirical worst-case scaling for PCG chunking is cubic.

For a linear problem with sparse inputs, SMO can be more than 1,000 times faster than PCG chunking.

Joachims has compared his algorithm (SVMlight version 2) and SMO on the same datasets.6 His algorithm and SMO have comparable scaling with training set size. The CPU time of Joachims' algorithm seems roughly comparable to SMO; different code optimizations make exact comparison between the two algorithms difficult.

Where to get the programs

The pseudocode for SMO is currently in a technical report available at http://www.research.microsoft.com/~jplatt/smo.html.7 SMO can be quickly implemented in the programming language of your choice using this pseudocode. I would recommend SMO if you are planning on using linear SVMs, if your data is sparse, or if you want to write your own end-to-end code.

If you decide to use a QP-based system, be careful about writing QP code yourself: there are many subtle numerical precision issues involved, and you can find yourself in a quagmire quite rapidly. Also, be wary of freeware QP packages available on the Web: in my experience, such packages tend to run slowly and might not work well for ill-conditioned or very large problems. Purchasing a QP package from a well-known numerical analysis source is the best bet, unless you have an extensive numerical analysis background, in which case you can create your own QP package. Osuna and his colleagues use MINOS for their QP package, which has licensing information at http://www-leland.stanford.edu/~saunders/brochure/brochure.html.5 LOQO is another robust, large-scale interior-point package suitable for QP and available for a fee at http://www.princeton.edu/~rvdb. Finally, a program that implements Joachims' version of Osuna's algorithm,6 called SVMlight, is available free, for scientific purposes only, at http://www-ai.informatik.uni-dortmund.de/FORSCHUNG/VERFAHREN/SVM-LIGHT/svm-light.eng.html.

References
1. V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag, New York, 1982.
2. C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," submitted to Data Mining and Knowledge Discovery, 1998.
3. J.C. Platt, "Fast Training of SVMs Using Sequential Minimal Optimization," to be published in Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.
4. L. Kaufman, "Solving the Quadratic Programming Problem Arising in Support Vector Classification," to be published in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1998.
5. E. Osuna, R. Freund, and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. IEEE Neural Networks for Signal Processing VII Workshop, IEEE Press, Piscataway, N.J., 1997, pp. 276-285.
6. T. Joachims, "Making Large-Scale SVM Learning Practical," to be published in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1998.
7. J.C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research Tech. Report MSR-TR-98-14, Microsoft, Redmond, Wash., 1998.
8. C.J. Merz and P.M. Murphy, UCI Repository of Machine Learning Databases, Univ. of California, Irvine, Dept. Information and Computer Science, Irvine, Calif.; http://www.ics.uci.edu/~mlearn/MLRepository.html.

