
CloudSVM: Training an SVM Classifier in Cloud Computing Systems

F. Ozgur CATAK
National Research Institute of Electronics and Cryptology (UEKAE), TUBITAK

M. Erdal BALABAN
Industrial Engineering, Isik University

Abstract
In conventional methods, distributed support vector machine (SVM) algorithms are trained over pre-configured intranet/internet environments to find an optimal classifier. These methods are very complicated and costly for large datasets. Hence, we propose a method referred to as the Cloud SVM training mechanism (CloudSVM), set in a cloud computing environment with the MapReduce technique for distributed machine learning applications. Accordingly, (i) the SVM algorithm is trained on distributed cloud storage servers that work concurrently; (ii) the support vectors from every trained cloud node are merged; and (iii) these two steps are iterated until the SVM converges to the optimal classifier function. Large-scale data sets cannot be trained with the SVM algorithm on a single computer. The results of this study are therefore important for training large-scale data sets in machine learning applications. We prove that iterative training of the split data set in a cloud computing environment using SVM converges to a globally optimal classifier in a finite number of iterations.

1 Introduction
Machine learning applications generally require large amounts of computation time and storage space. Learning algorithms have to be scaled up to handle extremely large data sets. When the training set is large, not all of the examples can be loaded into memory in one step during the training phase of the machine learning algorithm. It is therefore necessary to distribute the computation and memory requirements among several connected computers.

In the machine learning field, support vector machines (SVMs) offer one of the most robust and accurate classification methods due to their good generalization properties. With their solid theoretical foundation and proven effectiveness, SVMs have contributed to researchers' success in many fields. However, SVMs suffer from a widely recognized scalability problem in both memory requirements and computational time [1]. The SVM algorithm's computation and memory requirements increase rapidly with the number of instances in the data set, and many data sets are therefore not suitable for classification [14]. The SVM algorithm is formulated as a quadratic optimization problem. The quadratic optimization problem has O(m^3) time and O(m^2) space complexity, where m is the training set size [2]. The computation time of SVM training is thus at least quadratic in the number of training instances.

The first approach to overcoming large-scale data set training is to reduce the feature vector size. Feature selection and feature transformation methods are the basic approaches for reducing vector size [3]. Feature selection algorithms choose a subset of the features from the original feature set, and feature transformation algorithms map the original feature space to a new space with reduced dimensionality. Several methods exist in the literature: Singular Value Decomposition (SVD) [4], Principal Component Analysis (PCA) [5], Independent Component Analysis (ICA) [6], Correlation-based Feature Selection (CFS) [7], and sampling-based data set selection. All of these methods can harm the generalization of the final machine learning model.

The second approach to large-scale data set training is chunking [13]. Collobert et al. [12] proposed a parallel SVM training algorithm in which each subset of the whole dataset is trained with an SVM and the resulting classifiers are combined into a final single classifier. Lu et al. [8] proposed a distributed support vector machine (DSVM) algorithm that finds support vectors (SVs) on strongly connected networks. Each site within a strongly connected network classifies subsets of training data locally via SVM, passes the calculated SVs to its descendant sites, receives SVs from its ancestor sites, recalculates the SVs, and passes them on, and so on. Rüping [9] proposed incremental learning with support vector machines, in which an error on the old support vectors (which represent the old learning set) is made more costly than an error on a new example. Syed et al. [10] proposed a distributed support vector machine (DSVM) algorithm that finds SVs locally and processes them altogether in a central processing center. Caragea et al. [11] improved this algorithm in 2005 by allowing the data processing center to send support vectors back to the distributed data sources and iteratively achieve the global optimum. Graf et al. [14] proposed an algorithm that arranges distributed processors in a cascaded top-down network topology, namely the Cascade SVM, in which the bottom node of the network is the central processing center. The distributed SVM methods in these works converge and increase test accuracy, but they share similar problems: they require a pre-defined network topology and a fixed number of computers, and the training performance depends on this specific network configuration. The main idea of current distributed SVM methods is data chunking first, followed by a parallel implementation of SVM training. Global synchronization overheads are not considered in these approaches.

In this paper, we propose a cloud computing based SVM method that uses the MapReduce [18] technique for the distributed training phase of the algorithm. By splitting the training set over a cloud computing system's data nodes, each subset is optimized iteratively to find a single global classifier. The basic idea behind this approach is to collect the SVs from every optimized subset of the training set at each cloud node and then merge them as the global support vectors. Computers in the cloud computing system exchange only a minimal number of training set samples. Our CloudSVM algorithm is analysed with various public datasets. CloudSVM is built on LibSVM and implemented using the Hadoop implementation of MapReduce.

This paper is organized as follows. Section 2 provides an overview of the SVM formulation. Section 3 presents the MapReduce pattern in detail. Section 4 explains the system model with our implementation of the MapReduce pattern for SVM training. Section 5 explains the convergence of CloudSVM. Section 6 shows simulation results with various UCI datasets. Section 7 gives concluding remarks.

2 Support Vector Machine


The support vector machine is a supervised learning method, used in statistics and computer science to analyse data and recognize patterns, for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, to which of two possible classes the input belongs, making the SVM a non-probabilistic binary linear classifier. If the training data are linearly separable, as shown in Figure 1, we can select the two hyperplanes of the margin so that there are no points between them and then try to maximize their distance. By simple geometry, the distance between these two hyperplanes is 2/||w||. Given some training data D, a set of n points of the form

$$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^m,\ y_i \in \{-1, 1\}\}_{i=1}^{n} \quad (1)$$

where x_i is an m-dimensional real vector and y_i is either -1 or 1, denoting the class to which the point x_i belongs. SVMs aim to find a hyperplane in the Reproducing Kernel Hilbert Space (RKHS) that maximizes the margin between the two classes of data in D with the smallest training error [13]. This problem can be formulated as the following quadratic optimization problem:
$$\begin{aligned}
\text{minimize: } & P(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to: } & y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
\end{aligned} \quad (2)$$

for i = 1, ..., m, where ξ_i are slack variables and C is a constant denoting the cost of each slack. C is a trade-off parameter that controls the balance between maximizing the margin and minimizing the training error.

Figure 1: Binary classification with an SVM: the maximum-margin hyperplane trained with samples from two classes. Samples on the margin are called the support vectors.

The decision function of the SVM is f(x) = w^T φ(x) + b, where w and b are obtained by solving the optimization problem P in (2). By using Lagrange multipliers, the optimization problem P in (2) can be expressed as
$$\begin{aligned}
\min_{\alpha}: \ & F(\alpha) = \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \\
\text{subject to: } & 0 \le \alpha \le C, \quad y^T \alpha = 0
\end{aligned} \quad (3)$$

where [Q]_{ij} = y_i y_j φ^T(x_i) φ(x_j) and α is the vector of Lagrange multipliers. It is not necessary to know φ explicitly; it suffices to know how to compute the modified inner product, which is called the kernel function and is represented as K(x_i, x_j) = φ^T(x_i) φ(x_j). Thus, [Q]_{ij} = y_i y_j K(x_i, x_j). If a positive definite kernel K is chosen, then by Mercer's theorem the optimization problem P is a convex quadratic programming (QP) problem with linear constraints and can be solved in polynomial time.
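As a concrete illustration of the dual problem in (3), the short sketch below builds the label-weighted kernel matrix [Q]_{ij} = y_i y_j K(x_i, x_j) and trains a binary classifier with scikit-learn's LIBSVM-based solver. The use of scikit-learn and the toy data are our assumptions for the sketch, not part of the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC  # thin wrapper around LIBSVM

def q_matrix(X, y, kernel=lambda a, b: a @ b.T):
    """[Q]_{ij} = y_i y_j K(x_i, x_j); a linear kernel is the default."""
    return np.outer(y, y) * kernel(X, X)

# Toy data: two linearly separable classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

Q = q_matrix(X, y)                      # the matrix appearing in problem (3)
model = SVC(kernel="linear", C=1.0).fit(X, y)
print(Q)
print(model.support_vectors_)           # support vectors found by the QP solver
print(model.dual_coef_)                 # y_i * alpha_i for the support vectors
```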

3 MapReduce
MapReduce is a programming model derived from the map and reduce function combination in functional programming. The MapReduce model is widely used to run parallel applications for large-scale data set processing. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [18]. MapReduce is divided into two major phases called map and reduce, separated by an internal shuffle phase of the intermediate results. The framework automatically executes those functions in parallel over any number of processors [19]. Simply put, a MapReduce job executes three basic operations on a data set distributed across many shared-nothing cluster nodes. The first task is the Map function, which each node executes in parallel without transferring any data to other nodes. In the next operation, the data produced by the Map function is repartitioned across all nodes of the cluster. Lastly, the Reduce task is executed in parallel by each node on its partition of the data.

Figure 2: Overview of MapReduce System

A file in the distributed file system (DFS) is split into multiple chunks, and each chunk is stored on a different data node. A map function takes a key/value pair as input from the input chunks and produces a list of key/value pairs as output. The types of the output key and value can differ from those of the input:

map(key1, value1) ⇒ list(key2, value2)

A reduce function takes a key and an associated value list as input and generates a list of new values as output:

reduce(key2, list(value2)) ⇒ list(value3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list. The main advantage of the MapReduce system is that it allows distributed processing of the submitted job on subsets of the whole dataset across the network.
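As a minimal, hedged illustration of these signatures, the toy mrjob job below (mrjob is the streaming library used in Section 4) counts class labels in a comma-separated training file; the file layout and the class name are assumptions for the sketch, not part of the CloudSVM implementation.

```python
from mrjob.job import MRJob

class MRLabelCount(MRJob):
    """Toy MapReduce job: count how many samples carry each class label."""

    def mapper(self, _, line):
        # map(key1, value1) => list(key2, value2)
        label = line.strip().split(",")[-1]   # assume the last column is the class label
        yield label, 1

    def reducer(self, label, counts):
        # reduce(key2, list(value2)) => list(value3)
        yield label, sum(counts)

if __name__ == "__main__":
    MRLabelCount.run()
```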

4 System Model

Figure 3: Schematic of the CloudSVM architecture.

CloudSVM is a MapReduce-based SVM training algorithm that runs in parallel on multiple commodity computers with Hadoop. As shown in Figure 3, the training set is split into subsets and each subset is evaluated individually to obtain the α values (i.e., the support vectors). In the Map stage of the MapReduce job, each subset of the training set is combined with the global support vectors. In the Reduce step, the merged subset of training data is evaluated, and the resulting new support vectors are combined with the global support vectors. The CloudSVM algorithm with MapReduce can be summarized as follows. First, each computer in the cloud computing system reads the global support vectors, merges them with its subset of the local training data, and trains an SVM. Then all the support vectors computed on the cloud computers are merged, and the algorithm saves them as the new global SVs. The algorithm consists of the following steps; a driver-loop sketch follows the list.
1. Initialization: set t = 0 and the global support vector set SV^0_Global = ∅.
2. t = t + 1.
3. Each computer l, l = 1, ..., L, reads the global SVs and merges them with its subset of the training data.
4. Train the SVM algorithm with the merged data set.
5. Find the support vectors.
6. After all computers in the cloud system complete their training phase, merge all calculated SVs and save the result as the global SVs.
7. If h^t = h^(t-1), stop; otherwise go to step 2.
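The driver-loop sketch below mirrors these seven steps in plain Python. It is a local simulation, not the authors' code: run_iteration is a stand-in for one MapReduce pass (Algorithms 1 and 2 below), and the floating-point tolerance on the stopping test is our own convention.

```python
def cloud_svm_driver(subsets, run_iteration):
    """Steps 1-7 of CloudSVM as a control-flow sketch.
    `subsets` is a list of per-node training chunks (lists of samples).
    `run_iteration(merged)` returns (new_global_svs, empirical_risk)."""
    sv_global = []                       # step 1: SV_Global = empty set, t = 0
    prev_risk = None
    while True:                          # step 2: next iteration
        merged = [chunk + sv_global for chunk in subsets]   # step 3: merge global SVs
        sv_global, risk = run_iteration(merged)             # steps 4-6: train, collect SVs
        if prev_risk is not None and abs(risk - prev_risk) < 1e-9:
            break                        # step 7: h^t == h^(t-1), stop
        prev_risk = risk
    return sv_global
```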
Pseudocode for the CloudSVM algorithm's Map and Reduce functions is given in Algorithm 1 and Algorithm 2.

Algorithm 1 Map function of the CloudSVM algorithm

SV_Global ← ∅   // empty global support vector set
while h^t ≠ h^(t-1) do
    for l ∈ L do   // for each subset
        D_l^t ← D_l^t ∪ SV_Global^t
    end for
end while

Algorithm 2 Reduce function of the CloudSVM algorithm

while h^t ≠ h^(t-1) do
    for l ∈ L do
        (SV_l, h^t) ← svm(D_l)   // train on the merged data set to obtain support vectors and hypothesis
    end for
    for l ∈ L do
        SV_Global ← SV_Global ∪ SV_l
    end for
end while

For training the SVM classifier functions, we used LibSVM with various kernels. Appropriate values of the parameters C and γ were found by cross-validation. The whole system is implemented with Hadoop streaming and the Python mrjob library.
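A minimal sketch of one CloudSVM iteration as an mrjob job is shown below. It assumes comma-separated feature values with the class label in the last column, uses scikit-learn's LIBSVM-backed SVC in place of raw LibSVM, and fixes the number of logical subsets at four; none of these choices are taken from the paper. The driver loop that re-injects the emitted global SVs and checks the stopping criterion of Section 5 would wrap this job.

```python
import zlib
import numpy as np
from mrjob.job import MRJob
from sklearn.svm import SVC   # LIBSVM-based solver standing in for LibSVM

class CloudSVMIteration(MRJob):
    """One iteration: map assigns samples to logical subsets, reduce trains
    an SVM per subset and emits the local support vectors."""

    def mapper(self, _, line):
        # Assumed format per line: "x1,x2,...,xm,label".
        values = [float(v) for v in line.strip().split(",")]
        subset_id = zlib.crc32(line.encode()) % 4    # assumed L = 4 subsets
        yield subset_id, values

    def reducer(self, subset_id, rows):
        data = np.array(list(rows))
        X, y = data[:, :-1], data[:, -1]
        # In the full algorithm, X is already merged with the current SV_Global.
        model = SVC(kernel="linear", C=1.0).fit(X, y)
        for idx in model.support_:                   # indices of the support vectors
            yield "SV", [float(v) for v in data[idx]]

if __name__ == "__main__":
    CloudSVMIteration.run()
```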

5 Convergence of CloudSVM
Let S denote a subset of the training set D, let F(S) be the optimal objective function over the data set S, and let h* be the globally optimal hypothesis, i.e., the one with minimal empirical risk R_emp(h). Our algorithm starts with SV^0_Global = ∅ and generates a sequence of support vector sets SV^t_Global, where SV^t_Global is the set of support vectors at the t-th iteration. We used the hinge loss for testing models trained with the CloudSVM algorithm. The hinge loss works well for the SVM as a classifier, since the more the margin is violated, the higher the penalty is [20]. The hinge loss function is the following:

$$\ell(f(x), y) = \max\{0,\ 1 - y \cdot f(x)\}$$

The empirical risk can be computed with the approximation:

$$R_{emp}(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i)$$

According to the empirical risk minimization principle, the learning algorithm should choose a hypothesis ĥ which minimizes the empirical risk:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} R_{emp}(h).$$
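To make the stopping quantity concrete, the small sketch below computes the hinge loss and the empirical risk estimate for a trained model; the decision_function call assumes a scikit-learn-style classifier and is our convention, not the paper's.

```python
import numpy as np

def hinge_loss(scores, y):
    """l(f(x), y) = max(0, 1 - y * f(x)), computed element-wise."""
    return np.maximum(0.0, 1.0 - y * scores)

def empirical_risk(model, X, y):
    """R_emp(h) = (1/n) * sum_i l(h(x_i), y_i); CloudSVM stops when this
    value no longer changes between iterations (equation (5))."""
    scores = model.decision_function(X)   # f(x) = w^T phi(x) + b
    return float(hinge_loss(scores, y).mean())
```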

A hypothesis is found at every cloud node. Let X be the subset of training data at cloud node l, where X ∈ R^(m×n), let SV^t_Global be the set of support vectors at the t-th iteration, and let h^(t,l) be the hypothesis at node l at iteration t. Then the optimization problem in equation (3) becomes
$$\max_{\alpha}\ h^{t,l} = -\frac{1}{2}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}^{T}
\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}
+
\begin{pmatrix} \mathbf{1} \\ \mathbf{1} \end{pmatrix}^{T}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \quad (4)$$
$$\text{subject to: } 0 \le \alpha_i \le C\ \ \forall i, \qquad \sum_i \alpha_i y_i = 0$$

where Q_{12} and Q_{21} are kernel matrices with entries

$$Q_{12} = \left\{ K_{i,j}\left(x_i, SV^{t}_{Global,j}\right) \ \middle|\ i = 1, \ldots, m,\ j = 1, \ldots, n \right\}.$$

Here α_1 and α_2 are the solutions estimated by node l from the data set X and SV_Global. By Mercer's theorem, the kernel matrix Q is symmetric positive definite, so the sub-matrices satisfy Q_{21} = Q_{12}^T.
We can define the matrices Q_{11} and Q_{22} at iteration t as

$$Q_{11} = \left\{ K_{i,j}(x_i, x_j) \mid x_i, x_j \in X \right\}, \qquad Q_{22} = \left\{ K_{i,j}\left(SV^{t}_{Global,i}, SV^{t}_{Global,j}\right) \right\}.$$
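The block structure of Q in (4) can be made explicit with a short numpy sketch; the linear kernel matches the experiments in Section 6, and the helper names are ours.

```python
import numpy as np

def weighted_kernel(Xa, ya, Xb, yb):
    """[Q]_{ij} = y_i y_j K(x_i, x_j) with a linear kernel K(a, b) = a . b."""
    return np.outer(ya, yb) * (Xa @ Xb.T)

def block_q(X, y, SV, y_sv):
    """Assemble the block matrix of equation (4):
       Q11 = local data vs. itself, Q12 = local data vs. global SVs,
       Q21 = Q12^T (Q is symmetric), Q22 = global SVs vs. themselves."""
    Q11 = weighted_kernel(X, y, X, y)
    Q12 = weighted_kernel(X, y, SV, y_sv)
    Q22 = weighted_kernel(SV, y_sv, SV, y_sv)
    return np.block([[Q11, Q12], [Q12.T, Q22]])
```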
Algorithm’s stop point is reached when the hypothesis’ empirical risk is same
with previous iteration. That is:

Remp (ht ) = Remp (ht−1 ) (5)

Lemma: The accuracy of the decision function of the CloudSVM classifier at iteration t is always greater than or equal to the maximum accuracy of the decision function of the SVM classifier at iteration t - 1. That is,

$$R_{emp}(h^t) \le \min_{h \in \mathcal{H}^{t-1}} R_{emp}(h) \quad (6)$$

Proof: Without loss of generality, the iterated CloudSVM monotonically converges to the optimal classifier. At iteration t,

$$SV^{t}_{Global} = SV^{t-1}_{Global} \cup \left\{ SV^{t-1}_{i} \mid i = 1, \ldots, n \right\}$$

where n is the number of data set splits (i.e., the number of cloud nodes). Then the training set for the SVM algorithm at node i is

$$d = X \cup SV^{t}_{Global}.$$

Adding more samples cannot decrease the optimal value, so the accuracy of the sub-problem at each node monotonically increases at each step.

6 Simulation Results

Table 1: The data sets used in the experiments

Dataset Name    Training Size    Dimension
German          1000             24
Heart           270              13
Ionosphere      351              34
Satellite       4435             36

We selected several data sets from the UCI Machine Learning Repository, namely German, Heart, Ionosphere, Hand Digit, and Satellite. The data set sizes and input dimensions are shown in Table 1. We test our algorithm on real-world data sets to demonstrate convergence. Linear kernels were used with optimal parameters (γ, C); the parameters were estimated by cross-validation.
We used 10-fold cross-validation, dividing the set of samples at random into 10 approximately equal-sized parts. The 10 parts were roughly balanced, ensuring that the classes were distributed uniformly across the parts. Ten-fold cross-validation works as follows: we fit the model on 90% of the samples and then predict the class labels of the remaining 10% (the test samples). This procedure is repeated 10 times, with each part playing the role of the test samples, and the errors on all 10 parts are added together to compute the overall error.

Table 2: Performance results of the CloudSVM algorithm with various UCI data sets

Dataset Name    γ      C    No. of Iterations    No. of SVs    Accuracy    Kernel Type
German          100    1    5                    606           0.7728      Linear
Heart           100    1    3                    137           0.8259      Linear
Ionosphere      108    1    3                    160           0.8423      Linear
Satellite       100    1    2                    1384          0.9064      Linear
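A sketch of the 10-fold procedure described above, using scikit-learn's stratified splitter and its LIBSVM-backed SVC; the library choice and parameter values are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validated_accuracy(X, y, C=1.0, n_splits=10):
    # Stratified folds keep the class proportions roughly equal in each part.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = 0
    for train_idx, test_idx in skf.split(X, y):
        model = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        errors += np.sum(model.predict(X[test_idx]) != y[test_idx])
    return 1.0 - errors / len(y)   # overall accuracy from the pooled errors
```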
To analyse CloudSVM, we randomly distributed all the training data to a cloud computing system of 10 computers running pseudo-distributed Hadoop. The prediction accuracy over the iterations and the total number of SVs are shown in Table 3.

Table 3: Data set prediction accuracy with iterations

When the number of iterations reaches 3 to 5, the test accuracy of all data sets reaches its highest value. As the number of iterations is increased further, the test accuracy settles into a steady state; it does not change for a sufficiently large number of iterations.

As the number of iterations increases, the number of global support vectors also reaches a steady state. As a result, the CloudSVM algorithm is suitable for training on large data sets.

7 Conclusion and Further Research


We have proposed a distributed support vector machine implementation for cloud computing systems with the MapReduce technique that improves the scalability and parallelism of split data set training. The performance and generalization properties of our algorithm are evaluated on Hadoop. Our algorithm is able to work on cloud computing systems without knowing how many computers are connected to run in parallel. The algorithm is designed to deal with large-scale data set training problems. It is empirically shown that the generalization performance and risk minimization of our algorithm are better than previous results.

References
[1] Chang, E.Y., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., Cui, H.: PSVM: Parallelizing Support Vector Machines on Distributed Computers. Advances in Neural Information Processing Systems 20 (2007)


[2] Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core Vector Machines: Fast SVM Training on Very Large Data Sets. J. Mach. Learn. Res. 6, 363-392 (2005)

[3] Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. Advances in Neural Information Processing Systems 13, 668-674 (2000)
[4] Golub, G., Reinsch, C.: Singular value decomposition and least squares solutions. Numerische Mathematik 14, 403-420 (1970)
[5] Jolliffe, I.T.: Principal Component Analysis. Springer Series in Statistics, 2nd ed., Springer, New York (2002)
[6] Comon, P.: Independent Component Analysis, a new concept? Signal Processing 36, 287-314 (1994)

[7] Hall M.A.: Correlation-based Feature Selection for Discrete and Nu-
meric Class Machine Learning. In: Proceedings of the Seventeenth
International Conference on Machine Learning, pp. 359-366. Morgan
Kaufmann Publishers Inc., San Francisco, CA (2000)
[8] Lu, Y., Roychowdhury, V., Vandenberghe, L.: Distributed parallel sup-
port vector machines in strongly connected networks. IEEE Trans. Neu-
ral Networks, 19, 1167-1178 (2008)
[9] Rüping, S.: Incremental Learning with Support Vector Machines. In: Proceedings of the IEEE International Conference on Data Mining, p. 641. IEEE Computer Society, Los Alamitos, CA (2001)
[10] Syed, N.A., Liu, H., Sung, K.: Incremental learning with support vector machines. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Diego, California (1999)
[11] Caragea, C., Caragea, D., Honavar, V.: Learning support vector machine classifiers from distributed data sources. In: Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI), Student Abstract and Poster Program, pp. 1602-1603. AAAI Press, Pittsburgh, Pennsylvania (2005)
[12] Collobert, R., Bengio, S., Bengio, Y.: A parallel mixture of SVMs for
very large scale problems. Neural Computation, 14, 1105-1114 (2002)
[13] Vapnik, V.N.: The nature of statistical learning theory. Springer, NY
(1995)
[14] Graf, H.P., Cosatto, E., Bottou, L., Durdanovic, I., Vapnik, V.: Parallel support vector machines: The Cascade SVM. In: Proceedings of the Eighteenth Annual Conference on Neural Information Processing Systems (NIPS), pp. 521-528. MIT Press, Vancouver (2004)
[15] Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1-27:27 (2011)
[16] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278-2324 (1998)
[17] Bertsekas, D.P.: Nonlinear Programming (Second ed.). Athena Scientific, Cambridge (1999)
[18] Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI), pp. 10-10. USENIX Association, Berkeley (2004)

[19] Schatz, M.C.: CloudBurst: highly sensitive read mapping with MapRe-
duce. Bioinformatics (Oxford, England), 25, 1363-1369 (2009)
[20] Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., Verri, A.: Are loss functions all the same? Neural Computation 16, 1063-1076 (2004)

