Marti A. Hearst
University of California, Berkeley
[email protected]
and sigmoid kernels (with gain κ and offset Θ),

k(x, y) = tanh(κ(x · y) + Θ).   (9)
Figure 1. A separable classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway. There is a weight vector w and a threshold b such that y_i((w · x_i) + b) > 0. Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w · x_i) + b| = 1, we obtain a form (w, b) of the hyperplane with y_i((w · x_i) + b) ≥ 1. Note that the margin, measured perpendicularly to the hyperplane, equals 2/||w||. To maximize the margin, we thus have to minimize ||w|| subject to y_i((w · x_i) + b) ≥ 1.

SVMs

We now have all the tools to construct nonlinear classifiers (see Figure 2). To this end, we substitute Φ(x_i) for each training example x_i and perform the optimal hyperplane algorithm in F. Because we are using kernels, we will thus end up with a nonlinear decision function of the form
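The formula itself is cut off in this fragment; as a hedged reconstruction, the usual SV kernel expansion that this sentence introduces is

f(x) = sgn( Σ_i v_i k(x, x_i) + b ),

where the sum runs over the training examples, the coefficients v_i and the threshold b are determined by the training procedure, and k is any admissible kernel, such as the sigmoid kernel of Equation 9.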
In this respect, several fields have emerged.
The method consists of solving a linear eigenvalue problem for a matrix whose elements are computed using the kernel function. The resulting feature extractors have the same architecture as SV machines (see Figure 4). A number of researchers have since started to "kernelize" various other linear algorithms.
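As a concrete illustration of the linear eigenvalue computation described above, here is a minimal kernel-PCA sketch (Python with NumPy; the Gaussian kernel, its width sigma, and the data matrix X are illustrative assumptions, not choices taken from the essay):

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared distances, then the Gaussian kernel matrix.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    K = rbf_kernel(X, sigma)
    n = K.shape[0]
    # Center the kernel matrix in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Linear eigenvalue problem on the (centered) kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lambdas = eigvecs[:, idx], eigvals[idx]
    # Normalize so each feature extractor has unit norm in feature space.
    alphas = alphas / np.sqrt(np.maximum(lambdas, 1e-12))
    # Projections of the training points onto the nonlinear components.
    return Kc @ alphas

Each column of alphas plays the role of one feature extractor built from kernel evaluations, mirroring the SV-machine architecture mentioned above.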
References

1. B.E. Boser, I.M. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. Fifth Ann. Workshop Computational Learning Theory, ACM Press, New York, 1992, pp. 144-152.
2. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
3. B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, 1998, pp. 1299-1319.
4. B. Scholkopf, C.J.C. Burges, and A.J. Smola, Advances in Kernel Methods-Support Vector Learning, to appear, MIT Press, Cambridge, Mass., 1998.
5. B. Scholkopf et al., "Support Vector Regression with Automatic Accuracy Control," to be published in Proc. Eighth Int'l Conf. Artificial Neural Networks, Perspectives in Neural Computing, Springer-Verlag, Berlin, 1998.
6. C.J.C. Burges, "Simplified Support Vector Decision Rules," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 71-77.
7. A. Smola and B. Scholkopf, "From Regularization Operators to Support Vector Kernels," Advances in Neural Information Processing Systems 10, M. Jordan, M. Kearns, and S. Solla, eds., MIT Press, 1998.
8. F. Girosi, An Equivalence between Sparse Approximation and Support Vector Machines, AI Memo No. 1606, MIT, Cambridge, Mass., 1997.
9. J. Weston et al., Density Estimation Using Support Vector Machines, Tech. Report CSD-TR-97-23, Royal Holloway, Univ. of London, 1997.

Susan Dumais, Decision Theory and Adaptive Systems Group, Microsoft Research

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization, the assignment of natural-language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Machine-learning methods, including SVMs, have tremendous potential for helping people more effectively organize electronic resources.

Today, most text categorization is done by people. We all save hundreds of files, e-mail messages, and URLs in folders every day. We are often asked to choose keywords from an approved set of indexing terms for describing our technical publications. On a much larger scale, trained specialists assign new items to categories in large taxonomies such as the Dewey Decimal or Library of Congress subject headings, Medical Subject Headings (MeSH), or Yahoo!'s Internet directory. Between these two extremes, people organize objects into categories to support a wide variety of information-management tasks, including information routing/filtering/push, identification of objectionable materials or junk mail, structured search and browsing, and topic identification for topic-specific processing operations.

Human categorization is very time-consuming and costly, thus limiting its applicability, especially for large or rapidly changing collections. Consequently, interest is growing in developing technologies for (semi)automatic text categorization. Rule-based approaches similar to those employed in expert systems have been used, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive-learning techniques to automatically construct classifiers using labeled training data. Researchers have applied a growing number of learning techniques to text categorization, including multivariate regression, nearest-neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks. Recently, my colleagues and I and others have used SVMs for text categorization with very promising results.3,4 In this essay, I briefly describe the results of experiments in which we use SVMs to classify newswire stories from Reuters.4
Table 1. Break-even performance (%) for five learning algorithms.

Category      Findsim   Naive Bayes   Bayes Nets   Trees   Linear SVM
earn            92.9        95.9          95.8      97.8       98.0
acq             64.7        87.8          88.3      89.7       93.6
money-fx        46.7        56.6          58.8      66.2       74.5
grain           67.5        78.8          81.4      85.0       94.6
crude           70.1        79.5          79.6      85.0       88.9
trade           65.1        63.9          69.0      72.5       75.9
interest        63.4        64.9          71.3      67.1       77.7
ship            49.2        85.4          84.4      74.2       85.6
wheat           68.9        69.7          82.7      92.5       91.8
corn            48.2        65.3          76.4      91.8       90.3
Avg. top 10     64.6        81.5          85.0      88.4       92.0
Avg. all        61.7        75.2          80.0      N/A        87.0

For classification, simpler binary feature values (a word either occurs or does not occur in a document) are often used instead. Text collections containing millions of unique terms are quite common. Thus, for both efficiency and efficacy, feature selection is widely used when applying machine-learning methods to text categorization. To reduce the number of features, we first remove features based on overall frequency counts, and then select a small number of features based on their fit to categories. We use the mutual information between each feature and a category to further reduce the feature space. These much smaller document descriptions then serve as input to the SVM.
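To make the feature-selection step concrete, here is a small sketch of ranking binary word features by their mutual information with a category label (Python with NumPy; the binary document-term matrix, labels, and the cutoff k are illustrative assumptions, not the exact recipe used in the experiments):

import numpy as np

def mutual_information(term_occurs, in_category, eps=1e-12):
    # MI (in bits) between a binary term-occurrence vector and a binary category label.
    mi = 0.0
    for t in (0, 1):            # term absent / present
        for c in (0, 1):        # document outside / inside the category
            p_tc = np.mean((term_occurs == t) & (in_category == c))
            p_t = np.mean(term_occurs == t)
            p_c = np.mean(in_category == c)
            if p_tc > 0:
                mi += p_tc * np.log2(p_tc / (p_t * p_c + eps))
    return mi

def select_features(doc_term, labels, k=300):
    # Keep the k binary term features with the highest MI for this category.
    scores = [mutual_information(doc_term[:, j], labels) for j in range(doc_term.shape[1])]
    return np.argsort(scores)[::-1][:k]

The selected columns of the document-term matrix would then form the reduced document descriptions fed to the SVM.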
A classifier (w, b) is learned for each category. Using SMO to train the linear SVM takes an average of 0.26 CPU seconds per category (averaged over 118 categories) on a 266-MHz Pentium II running Windows NT. Other learning methods are 20 to 50 times slower. New instances are classified by computing a score for each document (w · x_d) and comparing the score with a learned threshold. New documents exceeding the threshold are said to belong to the category.

The learned SVM classifiers are intuitively reasonable. The weight vector for the category "interest" includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46), with large positive weights, and the words group (-.24), year (-.25), sees (-.33), world (-.35), and dlrs (-.71), with large negative weights.

As is typical in evaluating text categorization, we measure classification accuracy using the average of precision and recall (the so-called breakeven point). Precision is the proportion of items placed in the category that are really in the category, and recall is the proportion of items in the category that are actually placed in the category.
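A small sketch of the breakeven measurement under these definitions (Python with NumPy; the score and label arrays are hypothetical inputs, and sweeping the decision threshold is just one simple way to trace out the precision-recall trade-off):

import numpy as np

def precision_recall(scores, labels, threshold):
    # labels is a boolean array: True if the document truly belongs to the category.
    predicted = scores >= threshold
    tp = np.sum(predicted & labels)
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / max(np.sum(labels), 1)
    return precision, recall

def breakeven_point(scores, labels):
    # Sweep the threshold and return the point where precision and recall are closest.
    best, best_gap = 0.0, np.inf
    for t in np.unique(scores):
        p, r = precision_recall(scores, labels, t)
        if abs(p - r) < best_gap:
            best, best_gap = (p + r) / 2.0, abs(p - r)
    return best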
Table 1 summarizes microaveraged breakeven performance for the five learning algorithms my colleagues and I explored, for the 10 most frequent categories as well as the overall score for all 118 categories.4

Linear SVMs were the most accurate method, averaging 91.3% for the 10 most frequent categories and 85.5% over all 118 categories. These results are consistent with Joachims' results in spite of substantial differences in text preprocessing, term weighting, and parameter selection, suggesting that the SVM approach is quite robust and generally applicable for text-categorization problems.3

Figure 5 shows a representative ROC curve for the category "grain." We generate this curve by varying the decision threshold to produce higher precision or higher recall, depending on the task. The advantages of the SVM can be seen over the entire recall-precision space.

Summary

In summary, inductive learning methods offer great potential to support flexible, dynamic, and personalized information access and management in a wide variety of tasks.

References

1. D.D. Lewis and P. Hayes, special issue of ACM Trans. Information Systems, Vol. 12, No. 1, July 1994.
2. Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," to be published in J. Information Retrieval, 1998.
3. T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," to be published in Proc. 10th European Conf. Machine Learning (ECML), Springer-Verlag, 1998; http://www-ai.cs.uni-dortmund.de/joachims.html/Joachims_97b.ps.gz.
4. S. Dumais et al., "Inductive Learning Algorithms and Representations for Text Categorization," to be published in Proc. Conf. Information and Knowledge Management, 1998; http://research.microsoft.com/~sdumais/cikm98.doc.
5. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
6. J. Platt, "Fast Training of SVMs Using Sequential Minimal Optimization," to be published in Advances in Kernel Methods-Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.

Edgar Osuna, MIT Center for Biological and Computational Learning and Operations Research Center

This essay introduces an SVM application for detecting vertically oriented and unoccluded frontal views of human faces in gray-level images. This application handles faces over a wide range of scales and works under different lighting conditions, even with moderately strong shadows.

We can define the face-detection problem as follows. Given as input an arbitrary image, which could be a digitized video signal or a scanned photograph, determine whether there are any human faces in the image, and if there are, return an encoding of their location. The encoding in this system is to fit each face in a bounding box defined by the image coordinates of the corners.

Face detection as a computer-vision task has many applications. It has direct relevance to the face-recognition problem, because the first important step of a fully automatic human face recognizer is usually identifying and locating faces in an unknown image. Face detection also has potential application in human-computer interfaces, surveillance systems, and census systems, for example.

For this discussion, face detection is also interesting as an example of a natural and challenging problem for demonstrating and testing the potentials of SVMs. Many other real-world object classes and phenomena share similar characteristics, for example, tumor anomalies in MRI scans and structural defects in manufactured parts. A successful and general methodology for finding faces using SVMs should generalize well for other spatially well-defined pattern- and feature-detection problems.

Face detection, like most object-detection problems, is a difficult task because of the significant pattern variations that are hard to parameterize analytically. Some common sources of pattern variations are facial appearance, expression, presence or absence of common structural features such as glasses or a moustache, and light-source distribution.

This system works by testing candidate image locations for local patterns that appear like faces, using a classification procedure that determines whether a given local image pattern is a face. Therefore, our approach comes at the face-detection problem as a classification problem given by examples of two classes: faces and nonfaces.

Previous systems

Researchers have approached the face-detection problem with different techniques in the last few years, including neural networks, detection of face features and use of geometrical constraints,3 density estimation of the training data,4 labeled graphs,5 and clustering and distribution-based modeling.6,7 The results of Kah-Kay Sung and
Tomaso Poggio and Henry Rowley2 reflect systems with very high detection rates and low false-positive detection rates. Sung and Poggio use clustering and distance metrics to model the distribution of the face and nonface manifold and a neural network to classify a new pattern given the measurements. The key to the quality of their result is the clustering and use of combined Mahalanobis and Euclidean metrics to measure the distance from a new pattern and the clusters. Other important features of their approach are the use of nonface clusters and a bootstrapping technique to collect important nonface patterns. However, this approach does not provide a

Our SVM approach to the face-detection system uses no prior information to obtain the decision surface, this being an interesting property that can be exploited in using the same approach for detecting other objects in digital images.

The SVM face-detection system

This system detects faces by exhaustively scanning an image for face-like patterns at many possible scales, by dividing the original image into overlapping subimages and classifying them using an SVM to determine the appropriate class (face or nonface). The system handles multiple scales by examining windows taken from scaled versions of the original image.

Clearly, the major use of SVMs is in the classification step, which is the most critical part of this work. Figure 6 gives a geometrical interpretation of the way SVMs work in the context of face detection.

Figure 6. Geometrical interpretation of how the SVM separates the face and nonface classes. The patterns are real support vectors obtained after training the system. Notice the small number of total support vectors and the fact that a higher proportion of them correspond to nonfaces.

More specifically, this system works as follows. We train on a database of face and nonface 19 x 19 pixel patterns, assigned to classes +1 and -1, respectively, using the support vector algorithm. This process uses a second-degree homogeneous polynomial kernel function and an upper bound C = 200 to obtain a perfect training error.
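A rough modern equivalent of this training setup, as a sketch only (it uses scikit-learn, which obviously postdates the essay; the random arrays stand in for the masked face and nonface windows and their +1/-1 labels):

import numpy as np
from sklearn.svm import SVC

# Stand-ins for the training data described in the essay: flattened windows
# (283 pixels after masking) with labels +1 (face) and -1 (nonface).
rng = np.random.default_rng(0)
patterns = rng.normal(size=(500, 283))
labels = np.where(rng.random(500) < 0.5, 1, -1)

# Second-degree homogeneous polynomial kernel k(x, y) = (x . y)^2 with C = 200,
# matching the setup described above.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=200.0)
clf.fit(patterns, labels)
print("support vectors per class:", clf.n_support_)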
To compensate for certain sources of image variation, we perform some preprocessing of the data:

• Masking. A binary pixel mask removes some pixels close to the window-pattern boundary, allowing a reduction in the dimensionality of the input space from 19 x 19 = 361 to 283. This step reduces background patterns that introduce unnecessary noise in the training process.
• Illumination gradient correction. The process subtracts a best-fit brightness plane from the unmasked window pixel values, allowing reduction of light and heavy shadows.
• Histogram equalization. Our process performs a histogram equalization over the patterns to compensate for differences in illumination brightness, different cameras' response curves, and so on.
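A minimal sketch of these three preprocessing steps (Python with NumPy; the mask construction, window shape, and numeric details are illustrative guesses rather than the essay's exact recipe):

import numpy as np

def border_mask(size=19, border=2):
    # Hypothetical binary mask dropping pixels near the window boundary.
    m = np.zeros((size, size), dtype=bool)
    m[border:size - border, border:size - border] = True
    return m

def correct_illumination(window):
    # Fit a brightness plane a*x + b*y + c to the window and subtract it.
    h, w = window.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, window.ravel(), rcond=None)
    return window - (A @ coeffs).reshape(h, w)

def equalize_histogram(window, levels=256):
    # Map gray values through the scaled cumulative histogram.
    hist, bin_edges = np.histogram(window, bins=levels)
    cdf = hist.cumsum().astype(float)
    cdf = (levels - 1) * cdf / cdf[-1]
    return np.interp(window.ravel(), bin_edges[:-1], cdf).reshape(window.shape)

def preprocess(window, mask):
    corrected = correct_illumination(window.astype(float))
    equalized = equalize_histogram(corrected)
    # Keep only unmasked pixels (the essay keeps 283 of 361; this toy mask keeps fewer).
    return equalized[mask]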
Once the process obtains a decision sur- Classify the pattern using the SVM; and
face through training, it uses the runtime If the class corresponds to a face, draw
system over images that do not contain a regtangle aroung the face in the output
faces, storing misclassifications so that image.
they can be used as negative examples in
subsequent training phases. Images of Figure 8 reflects the system's architec-
landscapes, trees, buildings, and rocks, for ture at runtime.
example, are good sources of false posi-
tives because of the many different textured Experimental results on static
patterns they contain. This bootstrapping images
step, which Sung and Poggio6 successfully To test the runtime system, we used two
used, is very important in the context of a sets of images. Set A contained 313 high-
face detector that learns from examples: quality images with the same number of
faces. Set B contained 23 images of mixed
Although negative examples are abun- quality, with a total of 155 faces. We tested
dant, negative examples that are useful both sets, first using our system and then
from a learning standpoint are very dif- the one by Sung and Poggio. 5,6 To give ~
• We trained an SVM classifier using the skin and nonskin data. The input variables were normalized green and red values, g/(r+g+b) and r/(r+g+b), respectively (a small sketch of these features follows after this list). Figure 10 presents an image captured by the system and its corresponding skin-detection output.
• We coded a very primitive motion detector based on thresholded frame differencing to identify areas of movement and use them as the focus of attention. Motion was not a requirement to be detected by the system because every so many frames (20 in the current implementation), we skipped this step and scanned the whole image.
• We put together a hierarchical system using as a first step the motion-detection module. We used the SVM skin-detection system as a second layer to identify candidate locations of faces. We used the face/nonface SVM classifier described earlier over the gray-level version of the candidate locations.

The whole system achieves rates of 4 to 5 frames per second. Figure 11 presents a couple of images captured by our PC-based Color Real-Time face-detection system.

Figure 10. An example of the skin-detection module implemented using SVMs.

Figure 11. Face detection on the PC-based Color Real-Time system.
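For concreteness, a tiny sketch of the normalized-color features used by the skin classifier (Python; the toy training pixels, the RBF kernel, and the random test image are stand-ins, not the essay's data or settings):

import numpy as np
from sklearn.svm import SVC

def chromaticity_features(rgb_image):
    # Normalized green and red, g/(r+g+b) and r/(r+g+b), one pair per pixel.
    rgb = rgb_image.astype(float)
    total = rgb.sum(axis=2) + 1e-6
    feats = np.stack([rgb[..., 1] / total, rgb[..., 0] / total], axis=-1)
    return feats.reshape(-1, 2)

# Toy stand-ins for skin/nonskin training pixels in (g_norm, r_norm) space.
X = np.array([[0.31, 0.45], [0.30, 0.48], [0.33, 0.44],
              [0.33, 0.33], [0.40, 0.25], [0.25, 0.40]])
y = np.array([1, 1, 1, -1, -1, -1])
skin_svm = SVC(kernel="rbf").fit(X, y)

# Classify every pixel of an image (here a random stand-in image).
image = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3))
skin_mask = (skin_svm.predict(chromaticity_features(image)) == 1).reshape(32, 32)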
References

1. G. Burel and D. Carel, "Detection and Localization of Faces on Digital Images," Pattern Recognition Letters, Vol. 15, 1994, pp. 963-967.
2. H. Rowley, S. Baluja, and T. Kanade, Human Face Detection in Visual Scenes, Tech. Report 95-158, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, 1995.
3. G. Yang and T. Huang, "Human Face Detection in a Complex Background," Pattern Recognition, Vol. 27, 1994, pp. 53-63.
4. B. Moghaddam and A. Pentland, Probabilistic Visual Learning for Object Detection, Tech. Report 326, MIT Media Laboratory, Cambridge, Mass., 1995.
5. N. Kruger, M. Potzsch, and C. v.d. Malsburg, Determination of Face Position and Pose with Learned Representation Based on Labeled Graphs, Tech. Report 96-03, Ruhr-Universitat, 1996.
6. K. Sung, Learning and Example Selection for Object and Pattern Detection, PhD thesis, MIT AI Lab and Center for Biological and Computational Learning, 1995.
7. K. Sung and T. Poggio, Example-Based Learning for View-Based Human Face Detection, A.I. Memo 1521, C.B.C.L. Paper 112, Dec. 1994.
8. E. Osuna, R. Freund, and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. IEEE Workshop on Neural Networks and Signal Processing, IEEE Press, Piscataway, N.J., 1997.

John Platt, Microsoft Research

In the past few years, SVMs have proven to be very effective in real-world classification tasks. This installment of Trends & Controversies describes two of these tasks: face recognition and text categorization. However, many people have found the numerical implementation of SVMs to be intimidating. In this essay, I will attempt to demystify the implementation of SVMs. As a first step, if you are interested in implementing an SVM, I recommend reading Chris Burges' tutorial on SVMs,2 available at http://svm.research.bell-labs.com/SVMdoc.html.

An SVM is a parameterized function whose functional form is defined before training. Training an SVM requires a labeled training set, because the SVM will fit the function from a set of examples. The training set consists of a set of N examples. Each example consists of an input vector, x_i, and a label, y_i, which describes whether the input vector is in a predefined category. There are N free parameters in an SVM trained with N examples. These parameters are called a_i. To find these parameters, you must solve a quadratic programming (QP) problem:

minimize  (1/2) Σ_{i,j=1..N} a_i Q_ij a_j  −  Σ_{i=1..N} a_i,

where Q is an N x N matrix that depends on the training inputs x_i, the labels y_i, and the functional form of the SVM. We call this problem quadratic programming because the function to be minimized (called the objective function) depends on the a_i quadratically, while a_i only appears linearly in the constraints (see http://www-c.mcs.anl.gov/home/otc/Guide/OptWeb/continuous/constrained/qprog). Definitions and applications of x_i, y_i, a_i, and Q appear in the tutorial by Burges.2
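A short sketch of how such a Q and objective can be evaluated (Python with NumPy; the formula Q_ij = y_i y_j k(x_i, x_j) is the standard form for SVM classification, and the linear kernel and tiny data are illustrative choices):

import numpy as np

def build_Q(X, y):
    # Q[i, j] = y_i * y_j * k(x_i, x_j); here k(x, z) = x . z (linear kernel).
    K = X @ X.T
    return (y[:, None] * y[None, :]) * K

def dual_objective(alpha, Q):
    # The quantity the QP minimizes: (1/2) a'Qa - sum(a).
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

# Tiny stand-in problem; the constraints (not enforced here) are
# 0 <= alpha_i <= C and sum_i alpha_i * y_i = 0.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
print(dual_objective(np.zeros(3), build_Q(X, y)))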
Conceptually, the SVM QP problem is to find a minimum of a bowl-shaped objective function. The search for the minimum is constrained to lie within a cube and on a plane. The search occurs in a high-dimensional space, so that the bowl is high dimensional, the cube is a hypercube, and the
plane is a hyperplane. For most typical SVM functional forms, the matrix Q has special properties, so that the objective function is either bowl-shaped (positive definite) or has flat-bottomed troughs (positive semidefinite), but is never saddle-shaped (indefinite). Thus, there is either a unique minimum or a connected set of equivalent minima. An SVM QP has a definite termination (or optimality) condition that describes these minima. We call these optimality conditions the Karush-Kuhn-Tucker (KKT) conditions, and they simply describe the set of a_i that are constrained minima.3

Figure 12. Three alternative methods for training SVMs: (a) Chunking, (b) Osuna's algorithm, and (c) SMO. There are three steps for each method. The horizontal thin line at every step represents the training set, while the thick boxes represent the a_i being optimized at that step. A given group of three lines corresponds to three training iterations, with the first iteration at the top.

The values of a_i also have an intuitive explanation. There is one a_i for each training example. Each a_i determines how much each training example influences the SVM function. Most of the training examples do not affect the SVM function, so most of the a_i are 0.

Because of its simple form, you might expect the solution to the SVM QP problem to be quite simple. Unfortunately, for real-world problems, the matrix Q can be enormous: it has a dimension equal to the number of training examples. A training set of 60,000 examples will yield a Q matrix with 3.6 billion elements, which cannot easily fit into the memory of a standard computer.

We have at least two different ways of solving such gigantic QP problems. First, there are QP methods that use sophisticated data structures.4 These QP methods do not require the storage of the entire Q matrix, because they do not need to access the rows or columns of Q that correspond to those a_i that are at 0 or at C. Deep in the inner loop, these methods only perform dot products between rows or columns of Q and a vector, rather than performing an entire matrix-vector multiplication.

Decomposing the QP problem

The other method for attacking the large-scale SVM QP problem is to decompose the large QP problem into a series of smaller QP problems. Thus, the selection of submatrices of Q happens outside of the QP package, rather than inside. Consequently, the decomposition method is compatible with standard QP packages.

Vapnik first suggested the decomposition approach in a method that has since been known as chunking.1 The chunking algorithm exploits the fact that the value of the objective function is the same if you remove the rows and columns of the matrix Q that correspond to zero a_i. Therefore, the large QP problem can break down into a series of smaller QP problems, whose ultimate goal is to identify all of the nonzero a_i and discard all of the zero a_i. At every step, chunking solves a QP problem that consists of the following a_i: every nonzero a_i from the last step, and the a_i that correspond to the M worst violations of the KKT conditions, for some value of M (see Figure 12a). The size of the QP subproblem tends to grow with time. At the last step, the chunking approach has identified the entire set of nonzero a_i; hence, the last step solves the overall QP problem.

Chunking reduces the Q matrix's dimension from the number of training examples to approximately the number of nonzero a_i. However, chunking still might not handle large-scale training problems, because even this reduced matrix might not fit into memory. Of course, we can combine chunking with the sophisticated QP methods described above, which do not require full storage of a matrix.

In 1997, Edgar Osuna and his colleagues suggested a new strategy for solving the SVM QP problem.5 Osuna showed that the large QP problem can be broken down into a series of smaller QP subproblems. As long as at least one a_i that violates the KKT conditions is added to the previous subproblem, each step reduces the objective function and maintains all of the constraints. Therefore, a sequence of QP subproblems that always add at least one KKT violator will asymptotically converge.

Osuna suggests keeping a constant-size matrix for every QP subproblem, which implies adding and deleting the same number of examples at every step5 (see Figure 12b). Using a constant-size matrix allows the training of arbitrarily sized datasets. The algorithm in Osuna's paper suggests adding one example and deleting one example at every step. Such an algorithm converges, although it might not be the fastest possible algorithm. In practice, researchers add and delete multiple examples according to various unpublished heuristics. Typically, these heuristics add KKT violators at each step and delete those a_i that are either 0 or C. Joachims has published an algorithm for adding and deleting examples from the QP steps, which rapidly decreases the objective function.6

All of these decomposition methods require a numerical QP package. Such packages might be expensive for commercial users (see the "Where to get the programs" section). Writing your own efficient QP package is difficult without a numerical-analysis background.

Sequential minimal optimization

Sequential minimal optimization is an alternative method that can decompose the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps.3,7 SMO decomposes the overall QP problem into QP subproblems, identically to Osuna's method. Unlike the previous decomposition heuristics, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two elements of a_i, because the a_i must obey one linear equality constraint. At every step, SMO chooses two a_i to jointly optimize, finds the optimal values for these a_i, and updates the SVM to reflect the new optimal values (see Figure 12c).
SMO can solve for the two a_i analytically, thus avoiding numerical QP optimization entirely. The inner loop can be expressed in a short amount of C code, rather than by invoking an entire QP library routine. Even though more optimization subproblems are solved in the course of the algorithm, each subproblem is so fast that the overall QP problem can be solved quickly.
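To give a feel for why the two-variable subproblem needs no QP library, here is a sketch of the analytic update at the heart of SMO (the essay describes the inner loop as a short piece of C; this is an equivalent sketch in Python, where K is the kernel matrix, y the labels, alpha the current multipliers, E the current prediction errors, and the surrounding SMO loop that maintains them is omitted):

import numpy as np

def smo_pair_update(i, j, alpha, y, K, E, C):
    # Jointly optimize alpha[i] and alpha[j]; return True if they changed.
    if i == j:
        return False
    # Bounds L, H keep the pair on the constraint line and inside the box [0, C].
    if y[i] == y[j]:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    else:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    if L >= H:
        return False
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if eta <= 0:
        return False                           # skip the degenerate case in this sketch
    a_j = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    if abs(a_j - alpha[j]) < 1e-12:
        return False
    # Move alpha[i] the opposite way to preserve the equality constraint.
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
    alpha[i], alpha[j] = a_i, a_j
    return True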
Because there are so many possible combinations of QP packages, decomposition heuristics, code optimizations, data structures, and benchmark problems, it is very difficult to determine which SVM algorithm (if any) is the most efficient. SMO has been compared to the standard chunking algorithm suggested by Burges in his tutorial.2,7 The QP algorithm used by this version of chunking is projected conjugate gradient (PCG). Table 3 compares the results for SMO versus PCG chunking. Both algorithms are coded in C++, share SVM evaluation code, are compiled with Microsoft Visual C++ version 5.0, and are run on a 266-MHz Pentium II with Windows NT and 128 Mbytes of memory. Both algorithms have inner loops that take advantage of input vectors that contain mostly zero entries (that is, sparse vectors).

Table 3. Five experiments comparing SMO to PCG chunking. The functional form of the SVM, training set size, CPU times, and scaling exponents are shown.

Experiment       Kernel       Training   SMO CPU      PCG chunking     SMO scaling   PCG chunking
                              set size   time (sec.)  CPU time (sec.)  exponent      scaling exponent
Adult linear     Linear       11,221        17.0         20,711.3         1.9            3.1
Web linear       Linear       49,749       268.3         17,164.7         1.6            2.5
Adult Gaussian   Gaussian     11,221       781.4         11,910.6         2.1            2.9
Web Gaussian     Gaussian     49,749     3,863.5         23,877.6         1.7            2.0
MNIST            Polynomial   60,000     29,?1.0         33,109.0         N/A            N/A

For more details on this comparison, and for more experiments on synthetic datasets, please consult my upcoming publication.7 The Adult experiment is an income-prediction task and is derived from the UCI machine-learning benchmark.8 The Web experiment is a text-categorization task. The Adult and Web datasets are available at http://www.research.microsoft.com/~jplatt/smo.html. The MNIST experiment is an OCR benchmark available at http://www.research.att.com/~yann/ocr/mnist. The training CPU time is listed for both SMO and PCG chunking for the training set size shown in the table. The scaling exponent is the slope of a linear fit to a log-log plot of the training time versus the training set size. This scaling exponent varies with the dataset used. The empirical worst-case scaling for SMO is quadratic, while the empirical worst-case scaling for PCG chunking is cubic.

For a linear problem with sparse inputs, SMO can be more than 1,000 times faster than PCG chunking.

Joachims has compared his algorithm (SVMlight version 2) and SMO on the same datasets.6 His algorithm and SMO have comparable scaling with training set size. The CPU time of Joachims' algorithm seems roughly comparable to SMO; different code optimizations make exact comparison between the two algorithms difficult.

Where to get the programs

The pseudocode for SMO is currently in a technical report available at http://www.research.microsoft.com/~jplatt/smo.html.7 SMO can be quickly implemented in the programming language of your choice using this pseudocode. I would recommend SMO if you are planning on using linear SVMs, if your data is sparse, or if you want to write your own end-to-end code.

If you decide to use a QP-based system, be careful about writing QP code yourself: there are many subtle numerical-precision issues involved, and you can find yourself in a quagmire quite rapidly. Also, be wary of freeware QP packages available on the Web: in my experience, such packages tend to run slowly and might not work well for ill-conditioned or very large problems. Purchasing a QP package from a well-known numerical analysis source is the best bet, unless you have an extensive numerical analysis background, in which case you can create your own QP package. Osuna and his colleagues use MINOS for their QP package, which has licensing information at http://www-leland.stanford.edu/~saunders/brochure/brochure.html.5 LOQO is another robust, large-scale interior-point package suitable for QP and available for a fee at http://www.princeton.edu/~rvdb. Finally, a program that implements Joachims' version of Osuna's algorithm,6 called SVMlight, is available free, for scientific purposes only, at http://www-ai.informatik.uni-dortmund.de/FORSCHUNG/VERFAHREN/SVM-LIGHT/svm-light.eng.html.

References

1. V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag, New York, 1982.
2. C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," submitted to Data Mining and Knowledge Discovery, 1998.
3. J.C. Platt, "Fast Training of SVMs Using Sequential Minimal Optimization," to be published in Advances in Kernel Methods-Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.
4. L. Kaufman, "Solving the Quadratic Programming Problem Arising in Support Vector Classification," to be published in Advances in Kernel Methods-Support Vector Learning, MIT Press, 1998.
5. E. Osuna, R. Freund, and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. IEEE Neural Networks for Signal Processing VII Workshop, IEEE Press, Piscataway, N.J., 1997, pp. 276-285.
6. T. Joachims, "Making Large-Scale SVM Learning Practical," to be published in Advances in Kernel Methods-Support Vector Learning, MIT Press, 1998.
7. J.C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research Tech. Report MSR-TR-98-14, Microsoft, Redmond, Wash., 1998.
8. C.J. Merz and P.M. Murphy, UCI Repository of Machine Learning Databases, Univ. of California, Irvine, Dept. Information and Computer Science, Irvine, Calif.; http://www.ics.uci.edu/~mlearn/MLRepository.html.