
--------------------------------------
--- Python interface of LIBSVM ---
--------------------------------------

Table of Contents
=================

- Introduction
- Installation
- Quick Start
- Design Description
- Data Structures
- Utility Functions
- Additional Information

Introduction
============
Python (http://www.python.org/) is a programming language suitable for rapid
development. This tool provides a simple Python interface to LIBSVM, a library
for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/libsvm). The
interface is very easy to use as the usage is the same as that of LIBSVM. The
interface is developed with the built-in Python library "ctypes."

Installation
============
On Unix systems, type
> make
The interface needs only the LIBSVM shared library, which is generated by
the above command. We assume that the shared library is in the LIBSVM
main directory or in the system path.

For Windows, the shared library libsvm.dll for 32-bit Python is ready
in the directory `..\windows'. You can also copy it to the system
directory (e.g., `C:\WINDOWS\system32\' for Windows XP). To regenerate
the shared library, please follow the instructions for building Windows
binaries in the LIBSVM README.
Quick Start
===========
There are two levels of usage. The high-level one uses utility functions
in svmutil.py and the usage is the same as the LIBSVM MATLAB interface.
>>> from svmutil import *
# Read data in LIBSVM format
>>> y, x = svm_read_problem('../heart_scale')
>>> m = svm_train(y[:200], x[:200], '-c 4')
>>> p_label, p_acc, p_val = svm_predict(y[200:], x[200:], m)
# Construct problem in python format
# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
>>> prob = svm_problem(y, x)
>>> param = svm_parameter('-t 0 -c 4 -b 1')
>>> m = svm_train(prob, param)
# Precomputed kernel data (-t 4)
# Dense data
>>> y, x = [1,-1], [[1, 2, -2], [2, -2, 2]]
# Sparse data
>>> y, x = [1,-1], [{0:1, 1:2, 2:-2}, {0:2, 1:-2, 2:2}]
# isKernel=True must be set for precomputed kernel
>>> prob = svm_problem(y, x, isKernel=True)
>>> param = svm_parameter('-t 4 -c 4 -b 1')
>>> m = svm_train(prob, param)
# For the format of precomputed kernel, please read LIBSVM README.
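# A minimal sketch (not part of the interface): the precomputed rows above
# are just the linear kernel of the earlier dense data, with the instance ID
# prepended at position 0.
>>> x_orig = [[1, 0, 1], [-1, 0, -1]]
>>> K = [[sum(a*b for a, b in zip(xi, xj)) for xj in x_orig] for xi in x_orig]
>>> x = [[i + 1] + K[i] for i in range(len(x_orig))] # [ID, K(xi,x1), ..., K(xi,xL)]
>>> x
[[1, 2, -2], [2, -2, 2]]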
# Other utility functions
>>> svm_save_model('heart_scale.model', m)
>>> m = svm_load_model('heart_scale.model')
>>> p_label, p_acc, p_val = svm_predict(y, x, m, '-b 1')
>>> ACC, MSE, SCC = evaluations(y, p_label)
# Getting online help
>>> help(svm_train)
The low-level use directly calls C interfaces imported by svm.py. Note that
all arguments and return values are in ctypes format. You need to handle
them carefully.
>>> from svm import *
>>> prob = svm_problem([1,-1], [{1:1, 3:1}, {1:-1,3:-1}])
>>> param = svm_parameter('-c 4')
>>> m = libsvm.svm_train(prob, param) # m is a ctypes pointer to an svm_model
# Convert a Python-format instance to svm_nodearray, a ctypes structure
>>> x0, max_idx = gen_svm_nodearray({1:1, 3:1})
>>> label = libsvm.svm_predict(m, x0)
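
The following is a further sketch of the low-level use, reusing m and x0 from
above. It assumes the ctypes prototypes that svm.py declares for
libsvm.svm_get_nr_class and libsvm.svm_predict_values; the decision-value
array has length nr_class*(nr_class-1)/2, as in LIBSVM.
>>> from ctypes import c_double
>>> nr_class = libsvm.svm_get_nr_class(m)
>>> dec_values = (c_double * (nr_class*(nr_class-1)//2))()
>>> label = libsvm.svm_predict_values(m, x0, dec_values)
>>> list(dec_values)
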
Design Description
==================
There are two files svm.py and svmutil.py, which respectively correspond to
low-level and high-level use of the interface.

In svm.py, we adopt the Python built-in library "ctypes," so that
Python can directly access C structures and interface functions defined
in svm.h.

While advanced users can use structures/functions in svm.py, to
avoid handling ctypes structures, in svmutil.py we provide some easy-to-use
functions. The usage is similar to that of the LIBSVM MATLAB interface.
Data Structures
===============
Four data structures derived from svm.h are svm_node, svm_problem,
svm_parameter, and svm_model. They all contain fields with the same names as
in svm.h. Access these fields carefully because you directly use a C structure
instead of a Python object. For svm_model, accessing the fields directly is
not recommended. Programmers should use the interface functions or methods of
the svm_model class in Python to get the values. The following description
introduces additional fields and methods.

Before using the data structures, execute the following command to load the
LIBSVM shared library:
>>> from svm import *
- class svm_node:
Construct an svm_node.
>>> node = svm_node(idx, val)
idx: an integer indicating the feature index.
val: a float indicating the feature value.
Show the index and the value of a node.
>>> print(node)
- Function: gen_svm_nodearray(xi [,feature_max=None [,isKernel=False]])
Generate a feature vector from a Python list/tuple or a dictionary:
>>> xi, max_idx = gen_svm_nodearray({1:1, 3:1, 5:-2})
xi: the returned svm_nodearray (a ctypes structure)

max_idx: the maximal feature index of xi

feature_max: if feature_max is assigned, features with indices larger than
             feature_max are removed.

isKernel: if isKernel == True, the list index starts from 0 for
          precomputed kernel. Otherwise, the list index starts from 1.
          The default value is False.
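
For example (a minimal sketch of these options), a dense precomputed-kernel
row can be converted with isKernel=True, and features above index 2 can be
removed with feature_max:
>>> xi, max_idx = gen_svm_nodearray([1, 4, 6, 1], isKernel=True)
>>> xi, max_idx = gen_svm_nodearray({1:1, 3:1, 5:-2}, feature_max=2)
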
- class svm_problem:
Construct an svm_problem instance
>>> prob = svm_problem(y, x)
y: a Python list/tuple of l labels (type must be int/double).
x: a Python list/tuple of l data instances. Each element of x must be
an instance of list/tuple/dictionary type.
Note that if your x contains sparse data (i.e., dictionary), the internal
ctypes data format is still sparse.
For pre-computed kernel, the isKernel flag should be set to True:
>>> prob = svm_problem(y, x, isKernel=True)
Please read LIBSVM README for more details of pre-computed kernel.
- class svm_parameter:
Construct an svm_parameter instance
>>> param = svm_parameter('training_options')
If 'training_options' is empty, LIBSVM default values are applied.
Set param to LIBSVM default values.
>>> param.set_to_default_values()
Parse a string of options.
>>> param.parse_options('training_options')
Show values of parameters.
>>> print(param)
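
For example (a small sketch), a concrete option string can be parsed into an
existing instance:
>>> param = svm_parameter('-c 4 -b 1')
>>> param.parse_options('-t 2 -g 0.5')
>>> print(param)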

- class svm_model:
There are two ways to obtain an instance of svm_model:
>>> model = svm_train(y, x)
>>> model = svm_load_model('model_file_name')
Note that the returned structure of interface functions
libsvm.svm_train and libsvm.svm_load_model is a ctypes pointer of
svm_model, which is different from the svm_model object returned
by svm_train and svm_load_model in svmutil.py. We provide a
function toPyModel for the conversion:
>>> model_ptr = libsvm.svm_train(prob, param)
>>> model = toPyModel(model_ptr)
If you obtain a model in a way other than the above approaches,
handle it carefully to avoid memory leak or segmentation fault.
Some interface functions to access LIBSVM models are wrapped as
members of the class svm_model:
>>> svm_type = model.get_svm_type()
>>> nr_class = model.get_nr_class()
>>> svr_probability = model.get_svr_probability()
>>> class_labels = model.get_labels()
>>> sv_indices = model.get_sv_indices()
>>> nr_sv = model.get_nr_sv()
>>> is_prob_model = model.is_probability_model()
>>> support_vector_coefficients = model.get_sv_coef()
>>> support_vectors = model.get_SV()
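
A short usage sketch, assuming y and x hold a classification problem as in
the Quick Start section:
>>> from svmutil import *
>>> m = svm_train(y, x, '-c 4')
>>> nr_class = m.get_nr_class()
>>> labels = m.get_labels()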

Utility Functions
=================
To use utility functions, type
>>> from svmutil import *
The above command loads
   svm_train()        : train an SVM model
   svm_predict()      : predict testing data
   svm_read_problem() : read the data from a LIBSVM-format file.
   svm_load_model()   : load a LIBSVM model.
   svm_save_model()   : save model to a file.
   evaluations()      : evaluate prediction results.

- Function: svm_train
There are three ways to call svm_train()
>>> model = svm_train(y, x [, 'training_options'])
>>> model = svm_train(prob [, 'training_options'])
>>> model = svm_train(prob, param)

y: a list/tuple of l training labels (type must be int/double).

x: a list/tuple of l training instances. The feature vector of
   each training instance is an instance of list/tuple or dictionary.

training_options: a string in the same form as that for LIBSVM command mode.

prob: an svm_problem instance generated by calling
      svm_problem(y, x).
      For pre-computed kernel, you should use
      svm_problem(y, x, isKernel=True)

param: an svm_parameter instance generated by calling
       svm_parameter('training_options')

model: the returned svm_model instance. See svm.h for details of this
       structure. If '-v' is specified, cross validation is
       conducted and the returned model is just a scalar: cross-validation
       accuracy for classification and mean-squared error for regression.

To train the same data many times with different
parameters, the second and the third ways should be faster.
Examples:
>>> y, x = svm_read_problem('../heart_scale')
>>> prob = svm_problem(y, x)
>>> param = svm_parameter('-s 3 -c 5 -h 0')
>>> m = svm_train(y, x, '-c 5')
>>> m = svm_train(prob, '-t 2 -c 5')
>>> m = svm_train(prob, param)
>>> CV_ACC = svm_train(y, x, '-v 3')

- Function: svm_predict
To predict testing data with a model, use
>>> p_labs, p_acc, p_vals = svm_predict(y, x, model [,'predicting_options'])

y: a list/tuple of l true labels (type must be int/double). It is used
   for calculating the accuracy. Use [0]*len(x) if true labels are
   unavailable.

x: a list/tuple of l predicting instances. The feature vector of
   each predicting instance is an instance of list/tuple or dictionary.

predicting_options: a string of predicting options in the same format as
                    that of LIBSVM.

model: an svm_model instance.

p_labels: a list of predicted labels

p_acc: a tuple including accuracy (for classification), mean
       squared error, and squared correlation coefficient (for
       regression).

p_vals: a list of decision values or probability estimates (if '-b 1'
        is specified). If k is the number of classes in training data,
        for decision values, each element includes results of predicting
        k(k-1)/2 binary-class SVMs. For classification, k = 1 is a
        special case. Decision value [+1] is returned for each testing
        instance, instead of an empty list.
        For probabilities, each element contains k values indicating
        the probability that the testing instance is in each class.
        Note that the order of classes is the same as the 'model.label'
        field in the model structure.
Example:
>>> m = svm_train(y, x, '-c 5')
>>> p_labels, p_acc, p_vals = svm_predict(y, x, m)
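
As noted above, a placeholder label list can be used when true labels are
unknown (a small sketch; the reported accuracy is then meaningless):
>>> p_labels, p_acc, p_vals = svm_predict([0]*len(x), x, m)
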
- Functions: svm_read_problem/svm_load_model/svm_save_model

See the usage by examples:
>>> y, x = svm_read_problem('data.txt')
>>> m = svm_load_model('model_file')
>>> svm_save_model('model_file', m)
- Function: evaluations
Calculate some evaluations using the true values (ty) and predicted
values (pv):
>>> (ACC, MSE, SCC) = evaluations(ty, pv)
ty: a list of true values.
pv: a list of predicted values.
ACC: accuracy.
MSE: mean squared error.

SCC: squared correlation coefficient.
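
A minimal call sketch:
>>> ty = [1, -1, 1, 1]
>>> pv = [1, 1, 1, -1]
>>> ACC, MSE, SCC = evaluations(ty, pv)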


Additional Information
======================
This interface was written by Hsiang-Fu Yu from Department of Computer
Science, National Taiwan University. If you find this tool useful, please
cite LIBSVM as follows
Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support
vector machines. ACM Transactions on Intelligent Systems and
Technology, 2:27:1--27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
For any question, please contact Chih-Jen Lin <cjlin@csie.ntu.edu.tw>,
or check the FAQ page:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html

This directory includes some useful codes:


1. subset selection tools.
2. parameter selection tools.
3. LIBSVM format checking tools
Part I: Subset selection tools
Introduction
============
Training large data is time consuming. Sometimes one should work on a
smaller subset first. The python script subset.py randomly selects a
specified number of samples. For classification data, we provide a
stratified selection to ensure the same class distribution in the
subset.
Usage: subset.py [options] dataset number [output1] [output2]
This script selects a subset of the given data set.
options:
-s method : method of selection (default 0)
0 -- stratified selection (classification only)
1 -- random selection
output1 : the subset (optional)
output2 : the rest of data (optional)
If output1 is omitted, the subset will be printed on the screen.
Example
=======
> python subset.py heart_scale 100 file1 file2
From heart_scale 100 samples are randomly selected and stored in
file1. All remaining instances are stored in file2.
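
To use random (non-stratified) selection instead, e.g., for regression data,
the -s option can be set explicitly (a sketch with the same files):

> python subset.py -s 1 heart_scale 100 file1 file2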
Part II: Parameter Selection Tools
Introduction
============
grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses cross validation (CV)
technique to estimate the accuracy of each parameter combination in
the specified range and helps you to decide the best parameters for
your problem.
grid.py directly executes libsvm binaries (so no Python binding is needed)
for cross validation and then draws the contour of CV accuracy using gnuplot.
You must have libsvm and gnuplot installed before using it. The package
gnuplot is available at http://www.gnuplot.info/

On Mac OS X, the precompiled gnuplot file needs the library AquaTerm,
which thus must be installed as well. In addition, this version of
gnuplot does not support png, so you need to change "set term png
transparent small" and use other image formats. For example, you may
have "set term pbm small color".
Usage: grid.py [grid_options] [svm_options] dataset
grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
"null"
-- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
"null"
-- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
pathname -- set gnuplot executable path and name
"null"
-- do not plot
-out {pathname | "null"} : (default dataset.out)
pathname -- set output file path and name
"null"
-- do not output file
-png pathname : set graphic output file path and name (default
dataset.png)
-resume [pathname] : resume the grid task using an existing output file
(default pathname is dataset.out)
Use this option only if some parameters have been checked for the
SAME data.
svm_options : additional options for svm-train
The program conducts v-fold cross validation using parameter C (and gamma)
= 2^begin, 2^(begin+step), ..., 2^end.
You can specify where the libsvm executable and gnuplot are using the
-svmtrain and -gnuplot parameters.
For Windows users, please use pgnuplot.exe. If you are using gnuplot
3.7.1, please upgrade to version 3.7.3 or higher. The version 3.7.1
has a bug. If you use cygwin on Windows, please use gnuplot-x11.
If the task is terminated accidentally or you would like to change the
range of parameters, you can apply '-resume' to save time by re-using
previous results. You may specify the output file of a previous run
or use the default (i.e., dataset.out) without giving a name. Please
note that the same condition must be used in two runs. For example,
you cannot use '-v 10' earlier and resume the task with '-v 5'.
The value of some options can be "null." For example, `-log2c -1,0,1

-log2 "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma


value. That is, you do not conduct parameter selection on gamma.
Example
=======
> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale
Users (in particular MS Windows users) may need to specify the path of
executable files. You can either change paths in the beginning of
grid.py or specify them in the command line. For example,
> grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svmtrain.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale
Output: two files
dataset.png: the CV accuracy contour plot generated by gnuplot
dataset.out: the CV accuracy at each (log2(C),log2(gamma))
The following example saves running time by loading the output file of a
previous run.
> python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale
Parallel grid search
====================
You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, you add
machine names in grid.py:
ssh_workers = ["linux1", "linux5", "linux5"]
and then set up your ssh so that the authentication works without
asking for a password.
The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also enlarge the nr_local_worker. For example:
nr_local_worker = 2
Example:
> python grid.py heart_scale
[local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
[linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
[linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
[linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
.
.
.

If -log2c, -log2g, or -v is not specified, default values are used.


If your system uses telnet instead of ssh, you list the computer names
in telnet_workers.
Calling grid in Python
======================
In addition to using grid.py as a command-line tool, you can use it as a
Python module.
>>> rate, param = find_parameters(dataset, options)
You need to specify `dataset' and `options' (default ''). See the
following example.
> python
>>> from grid import *
>>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
[local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
[local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
.
.
[local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
.
.
>>> rate
78.8889
>>> param
{'c': 0.5, 'g': 0.5}
Part III: LIBSVM format checking tools
Introduction
============
`svm-train' conducts only a simple check of the input data. To do a
detailed check, we provide a python script `checkdata.py.'
Usage: checkdata.py dataset
Exit status (returned value): 1 if there are errors, 0 otherwise.
This tool is written by Rong-En Fan at National Taiwan University.
Example
=======
> cat bad_data
1 3:1 2:4
> python checkdata.py bad_data

line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
Found 1 lines with error.

Libsvm is a simple, easy-to-use, and efficient software for SVM
classification and regression. It solves C-SVM classification, nu-SVM
classification, one-class-SVM, epsilon-SVM regression, and nu-SVM
regression. It also provides an automatic model selection tool for
C-SVM classification. This document explains the use of libsvm.
Libsvm is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Please read the COPYRIGHT file before using libsvm.
Table of Contents
=================
- Quick Start
- Installation and Data Format
- `svm-train' Usage
- `svm-predict' Usage
- `svm-scale' Usage
- Tips on Practical Use
- Examples
- Precomputed Kernels
- Library Usage
- Java Version
- Building Windows Binaries
- Additional Tools: Sub-sampling, Parameter Selection, Format checking, etc.
- MATLAB/OCTAVE Interface
- Python Interface
- Additional Information
Quick Start
===========
If you are new to SVM and if the data is not large, please go to the
`tools' directory and use easy.py after installation. It does
everything automatically -- from data scaling to parameter selection.
Usage: easy.py training_file [testing_file]
More information about parameter selection can be found in
`tools/README.'
Installation and Data Format
============================
On Unix systems, type `make' to build the `svm-train' and `svm-predict'
programs. Run them without arguments to show their usage.
On other systems, consult `Makefile' to build them (e.g., see
'Building Windows binaries' in this file) or use the pre-built
binaries (Windows binaries are in the directory `windows').
The format of training and testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...


.
.
.
Each line contains an instance and is ended by a '\n' character. For
classification, <label> is an integer indicating the class label
(multi-class is supported). For regression, <label> is the target
value which can be any real number. For one-class SVM, it's not used
so can be any number. The pair <index>:<value> gives a feature
(attribute) value: <index> is an integer starting from 1 and <value>
is a real number. The only exception is the precomputed kernel, where
<index> starts from 0; see the section of precomputed kernels. Indices
must be in ASCENDING order. Labels in the testing file are only used
to calculate accuracy or errors. If they are unknown, just fill the
first column with any numbers.
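
For example, an illustrative training line (not taken from any bundled data
set) for a positive instance with three nonzero features could be:

+1 1:0.708 3:-1 7:2.5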
A sample classification data included in this package is
`heart_scale'. To check if your data is in a correct form, use
`tools/checkdata.py' (details in `tools/README').
Type `svm-train heart_scale', and the program will read the training
data and output the model file `heart_scale.model'. If you have a test
set called heart_scale.t, then type `svm-predict heart_scale.t
heart_scale.model output' to see the prediction accuracy. The `output'
file contains the predicted class labels.
For classification, if training data are in only one class (i.e., all
labels are the same), then `svm-train' issues a warning message:
`Warning: training data in only one class. See README for details,'
which means the training data is very unbalanced. The label in the
training data is directly returned when testing.
There are some other useful programs in this package.
svm-scale:
This is a tool for scaling input data file.
svm-toy:
This is a simple graphical interface which shows how SVM
separates data in a plane. You can click in the window to
draw data points. Use "change" button to choose class
1, 2 or 3 (i.e., up to three classes are supported), "load"
button to load data from a file, "save" button to save data to
a file, "run" button to obtain an SVM model, and "clear"
button to clear the window.
You can enter options in the bottom of the window; the syntax of
options is the same as for `svm-train'.
Note that "load" and "save" consider dense data format both in
classification and the regression cases. For classification,

each data point has one label (the color) that must be 1, 2,
or 3 and two attributes (x-axis and y-axis values) in
[0,1). For regression, each data point has one target value
(y-axis) and one attribute (x-axis values) in [0, 1).
Type `make' in respective directories to build them.
You need Qt library to build the Qt version.
(available from http://www.trolltech.com)
You need GTK+ library to build the GTK version.
(available from http://www.gtk.org)
The pre-built Windows binaries are in the `windows'
directory. We use Visual C++ on a 32-bit machine, so the
maximal cache size is 2GB.
`svm-train' Usage
=================
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC        (multi-class classification)
    1 -- nu-SVC       (multi-class classification)
    2 -- one-class SVM
    3 -- epsilon-SVR  (regression)
    4 -- nu-SVR       (regression)
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
    3 -- sigmoid: tanh(gamma*u'*v + coef0)
    4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)

num_features in the -g option means the number of attributes in the input data.

The -v option randomly splits the data into n parts and calculates cross
validation accuracy/mean squared error on them.
See libsvm FAQ for the meaning of outputs.
`svm-predict' Usage
===================
Usage: svm-predict [options] test_file model_file output_file
options:
-b probability_estimates: whether to predict probability estimates, 0 or 1
   (default 0); for one-class SVM only 0 is supported
model_file is the model file generated by svm-train.
test_file is the test data you want to predict.
svm-predict will produce output in the output_file.
`svm-scale' Usage
=================
Usage: svm-scale [options] data_filename
options:
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename
See 'Examples' in this file for examples.
Tips on Practical Use
=====================
* Scale your data. For example, scale each attribute to [0,1] or [-1,+1].
* For C-SVC, consider using the model selection tool in the tools
directory.
* nu in nu-SVC/one-class-SVM/nu-SVR approximates the fraction of training
errors and support vectors.
* If data for classification are unbalanced (e.g. many positive and
few negative), try different penalty parameters C by -wi (see
examples below).
* Specify larger cache size (i.e., larger -m) for huge problems.
Examples
========
> svm-scale -l -1 -u 1 -s range train > train.scale
> svm-scale -r range test > test.scale
Scale each feature of the training data to be in [-1,1]. Scaling
factors are stored in the file range and then used for scaling the

test data.
> svm-train -s 0 -c 5 -t 2 -g 0.5 -e 0.1 data_file
Train a classifier with RBF kernel exp(-0.5|u-v|^2), C=5, and
stopping tolerance 0.1.
> svm-train -s 3 -p 0.1 -t 0 data_file
Solve SVM regression with linear kernel u'v and epsilon=0.1
in the loss function.
> svm-train -c 10 -w1 1 -w-2 5 -w4 2 data_file
Train a classifier with penalty 10 = 1 * 10 for class 1, penalty 50 =
5 * 10 for class -2, and penalty 20 = 2 * 10 for class 4.
> svm-train -s 0 -c 100 -g 0.1 -v 5 data_file
Do five-fold cross validation for the classifier using
the parameters C = 100 and gamma = 0.1
> svm-train -s 0 -b 1 data_file
> svm-predict -b 1 test_file data_file.model output_file
Obtain a model with probability information and predict test data with
probability estimates
Precomputed Kernels
===================
Users may precompute kernel values and input them as training and
testing files. Then libsvm does not need the original
training/testing sets.
Assume there are L training instances x1, ..., xL.
Let K(x, y) be the kernel value of two instances x and y. The input formats
are:
New training instance for xi:
<label> 0:i 1:K(xi,x1) ... L:K(xi,xL)
New testing instance for any x:
<label> 0:? 1:K(x,x1) ... L:K(x,xL)
That is, in the training file the first column must be the "ID" of
xi. In testing, ? can be any value.
All kernel values including ZEROs must be explicitly provided. Any
permutation or random subsets of the training/testing files are also
valid (see examples below).

Note: the format is slightly different from the precomputed kernel
package released in libsvmtools earlier.
Examples:

    Assume the original training data has three four-feature
    instances and testing data has one instance:

    15  1:1 2:1 3:1 4:1
    45      2:3     4:3
    25          3:1

    15  1:1     3:1

    If the linear kernel is used, we have the following new
    training/testing sets:

    15  0:1 1:4 2:6  3:1
    45  0:2 1:6 2:18 3:0
    25  0:3 1:1 2:0  3:1

    15  0:? 1:2 2:0  3:1

    ? can be any value.

    Any subset of the above training file is also valid. For example,

    25  0:3 1:1 2:0  3:1
    45  0:2 1:6 2:18 3:0

    implies that the kernel matrix is

        [K(2,2) K(2,3)] = [18 0]
        [K(3,2) K(3,3)] = [0  1]
Library Usage
=============
These functions and structures are declared in the header file
`svm.h'. You need to #include "svm.h" in your C/C++ source files and
link your program with `svm.cpp'. You can see `svm-train.c' and
`svm-predict.c' for examples showing how to use them. We define
LIBSVM_VERSION and declare `extern int libsvm_version; ' in svm.h, so
you can check the version number.
Before you classify test data, you need to construct an SVM model
(`svm_model') using training data. A model can also be saved in
a file for later use. Once an SVM model is available, you can use it
to classify new data.
- Function: struct svm_model *svm_train(const struct svm_problem *prob,
            const struct svm_parameter *param);

This function constructs and returns an SVM model according to
the given training data and parameters.
struct svm_problem describes the problem:
struct svm_problem
{
int l;
double *y;
struct svm_node **x;
};
where `l' is the number of training data, and `y' is an array containing
their target values. (integers in classification, real numbers in
regression) `x' is an array of pointers, each of which points to a sparse
representation (array of svm_node) of one training vector.
For example, if we have the following training data:

    LABEL    ATTR1    ATTR2    ATTR3    ATTR4    ATTR5
    -----    -----    -----    -----    -----    -----
      1        0      0.1      0.2      0        0
      2        0      0.1      0.3     -1.2      0
      1        0.4    0        0        0        0
      2        0      0.1      0        1.4      0.5
      3       -0.1   -0.2      0.1      1.1      0.1

then the components of svm_problem are:

    l = 5

    y -> 1 2 1 2 3

    x -> [ ] -> (2,0.1) (3,0.2) (-1,?)
         [ ] -> (2,0.1) (3,0.3) (4,-1.2) (-1,?)
         [ ] -> (1,0.4) (-1,?)
         [ ] -> (2,0.1) (4,1.4) (5,0.5) (-1,?)
         [ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (-1,?)

where (index,value) is stored in the structure `svm_node':


struct svm_node
{
int index;
double value;
};
index = -1 indicates the end of one vector. Note that indices must
be in ASCENDING order.
struct svm_parameter describes the parameters of an SVM model:

struct svm_parameter
{
    int svm_type;
    int kernel_type;
    int degree;         /* for poly */
    double gamma;       /* for poly/rbf/sigmoid */
    double coef0;       /* for poly/sigmoid */

    /* these are for training only */
    double cache_size;  /* in MB */
    double eps;         /* stopping criteria */
    double C;           /* for C_SVC, EPSILON_SVR, and NU_SVR */
    int nr_weight;      /* for C_SVC */
    int *weight_label;  /* for C_SVC */
    double* weight;     /* for C_SVC */
    double nu;          /* for NU_SVC, ONE_CLASS, and NU_SVR */
    double p;           /* for EPSILON_SVR */
    int shrinking;      /* use the shrinking heuristics */
    int probability;    /* do probability estimates */
};
svm_type can be one of C_SVC, NU_SVC, ONE_CLASS, EPSILON_SVR, NU_SVR.

    C_SVC:       C-SVM classification
    NU_SVC:      nu-SVM classification
    ONE_CLASS:   one-class-SVM
    EPSILON_SVR: epsilon-SVM regression
    NU_SVR:      nu-SVM regression

kernel_type can be one of LINEAR, POLY, RBF, SIGMOID, PRECOMPUTED.

    LINEAR:      u'*v
    POLY:        (gamma*u'*v + coef0)^degree
    RBF:         exp(-gamma*|u-v|^2)
    SIGMOID:     tanh(gamma*u'*v + coef0)
    PRECOMPUTED: kernel values in training_set_file
cache_size is the size of the kernel cache, specified in megabytes.
C is the cost of constraints violation.
eps is the stopping criterion. (we usually use 0.00001 in nu-SVC,
0.001 in others). nu is the parameter in nu-SVM, nu-SVR, and
one-class-SVM. p is the epsilon in epsilon-insensitive loss function
of epsilon-SVM regression. shrinking = 1 means shrinking is conducted;
= 0 otherwise. probability = 1 means model with probability
information is obtained; = 0 otherwise.
nr_weight, weight_label, and weight are used to change the penalty
for some classes (If the weight for a class is not changed, it is
set to 1). This is useful for training classifier using unbalanced
input data or with asymmetric misclassification cost.
nr_weight is the number of elements in the array weight_label and
weight. Each weight[i] corresponds to weight_label[i], meaning that
the penalty of class weight_label[i] is scaled by a factor of weight[i].

If you do not want to change penalty for any of the classes,
just set nr_weight to 0.
*NOTE* Because svm_model contains pointers to svm_problem, you can
not free the memory used by svm_problem if you are still using the
svm_model produced by svm_train().
*NOTE* To avoid wrong parameters, svm_check_parameter() should be
called before svm_train().
struct svm_model stores the model obtained from the training
procedure.
It is not recommended to directly access entries in this structure.
Programmers should use the interface functions to get the values.
struct svm_model
{
    struct svm_parameter param; /* parameter */
    int nr_class;               /* number of classes, = 2 in regression/one class svm */
    int l;                      /* total #SV */
    struct svm_node **SV;       /* SVs (SV[l]) */
    double **sv_coef;           /* coefficients for SVs in decision functions (sv_coef[k-1][l]) */
    double *rho;                /* constants in decision functions (rho[k*(k-1)/2]) */
    double *probA;              /* pairwise probability information */
    double *probB;
    int *sv_indices;            /* sv_indices[0,...,nSV-1] are values in [1,...,num_training_data]
                                   to indicate SVs in the training set */

    /* for classification only */

    int *label;                 /* label of each class (label[k]) */
    int *nSV;                   /* number of SVs for each class (nSV[k]) */
                                /* nSV[0] + nSV[1] + ... + nSV[k-1] = l */
    /* XXX */
    int free_sv;                /* 1 if svm_model is created by svm_load_model */
                                /* 0 if svm_model is created by svm_train */
};
param describes the parameters used to obtain the model.
nr_class is the number of classes. It is 2 for regression and one-class SVM.

l is the number of support vectors. SV and sv_coef are support
vectors and the corresponding coefficients, respectively. Assume there are
k classes. For data in class j, the corresponding sv_coef includes
(k-1) y*alpha vectors,
where alpha's are solutions of the following two class problems:
1 vs j, 2 vs j, ..., j-1 vs j, j vs j+1, j vs j+2, ..., j vs k
and y=1 for the first j-1 vectors, while y=-1 for the remaining k-j
vectors. For example, if there are 4 classes, sv_coef and SV are
like:
    +-+-+-+--------------------+
    |1|1|1|                    |
    |v|v|v|  SVs from class 1  |
    |2|3|4|                    |
    +-+-+-+--------------------+
    |1|2|2|                    |
    |v|v|v|  SVs from class 2  |
    |2|3|4|                    |
    +-+-+-+--------------------+
    |1|2|3|                    |
    |v|v|v|  SVs from class 3  |
    |3|3|4|                    |
    +-+-+-+--------------------+
    |1|2|3|                    |
    |v|v|v|  SVs from class 4  |
    |4|4|4|                    |
    +-+-+-+--------------------+
See svm_train() for an example of assigning values to sv_coef.
rho is the bias term (-b). probA and probB are parameters used in
probability outputs. If there are k classes, there are k*(k-1)/2
binary problems as well as rho, probA, and probB values. They are
aligned in the order of binary problems:
1 vs 2, 1 vs 3, ..., 1 vs k, 2 vs 3, ..., 2 vs k, ..., k-1 vs k.
sv_indices[0,...,nSV-1] are values in [1,...,num_training_data] to
indicate support vectors in the training set.
label contains labels in the training data.
nSV is the number of support vectors in each class.
free_sv is a flag used to determine whether the space of SV should
be released in free_model_content(struct svm_model*) and
free_and_destroy_model(struct svm_model**). If the model is
generated by svm_train(), then SV points to data in svm_problem
and should not be removed. For example, free_sv is 0 if svm_model
is created by svm_train, but is 1 if created by svm_load_model.
- Function: double svm_predict(const struct svm_model *model,
const struct svm_node *x);
This function does classification or regression on a test vector x
given a model.

For a classification model, the predicted class for x is returned.
For a regression model, the function value of x calculated using
the model is returned. For a one-class model, +1 or -1 is returned.
- Function: void svm_cross_validation(const struct svm_problem *prob,
const struct svm_parameter *param, int nr_fold, double *target);
This function conducts cross validation. Data are separated to
nr_fold folds. Under given parameters, sequentially each fold is
validated using the model from training the remaining. Predicted
labels (of all prob's instances) in the validation process are
stored in the array called target.
The format of svm_prob is the same as that for svm_train().
- Function: int svm_get_svm_type(const struct svm_model *model);
This function gives svm_type of the model. Possible values of
svm_type are defined in svm.h.
- Function: int svm_get_nr_class(const svm_model *model);
For a classification model, this function gives the number of
classes. For a regression or a one-class model, 2 is returned.
- Function: void svm_get_labels(const svm_model *model, int* label)
For a classification model, this function outputs the name of
labels into an array called label. For regression and one-class
models, label is unchanged.
- Function: void svm_get_sv_indices(const struct svm_model *model, int *sv_indices)

This function outputs indices of support vectors into an array called
sv_indices. The size of sv_indices is the number of support vectors and
can be obtained by calling svm_get_nr_sv.
Each sv_indices[i] is in the range of [1, ..., num_training_data].
- Function: int svm_get_nr_sv(const struct svm_model *model)
This function gives the total number of support vectors.
- Function: double svm_get_svr_probability(const struct svm_model *model);
For a regression model with probability information, this function
outputs a value sigma > 0. For test data, we consider the
probability model: target value = predicted value + z, z: Laplace
distribution e^(-|z|/sigma)/(2sigma)
If the model is not for svr or does not contain required information,
0 is returned.
- Function: double svm_predict_values(const svm_model *model,
const svm_node *x, double* dec_values)
This function gives decision values on a test vector x given a
model, and returns the predicted label (classification) or
the function value (regression).
For a classification model with nr_class classes, this function
gives nr_class*(nr_class-1)/2 decision values in the array
dec_values, where nr_class can be obtained from the function
svm_get_nr_class. The order is label[0] vs. label[1], ...,
label[0] vs. label[nr_class-1], label[1] vs. label[2], ...,
label[nr_class-2] vs. label[nr_class-1], where label can be
obtained from the function svm_get_labels. The returned value is
the predicted class for x. Note that when nr_class = 1, this
function does not give any decision value.
For a regression model, dec_values[0] and the returned value are
both the function value of x calculated using the model. For a
one-class model, dec_values[0] is the decision value of x, while
the returned value is +1/-1.
- Function: double svm_predict_probability(const struct svm_model *model,
const struct svm_node *x, double* prob_estimates);
This function does classification or regression on a test vector x
given a model with probability information.
For a classification model with probability information, this
function gives nr_class probability estimates in the array
prob_estimates. nr_class can be obtained from the function
svm_get_nr_class. The class with the highest probability is
returned. For regression/one-class SVM, the array prob_estimates
is unchanged and the returned value is the same as that of
svm_predict.
- Function: const char *svm_check_parameter(const struct svm_problem *prob,
            const struct svm_parameter *param);
This function checks whether the parameters are within the feasible
range of the problem. This function should be called before calling
svm_train() and svm_cross_validation(). It returns NULL if the
parameters are feasible, otherwise an error message is returned.
- Function: int svm_check_probability_model(const struct svm_model
*model);
This function checks whether the model contains required
information to do probability estimates. If so, it returns
+1. Otherwise, 0 is returned. This function should be called
before calling svm_get_svr_probability and svm_predict_probability.
- Function: int svm_save_model(const char *model_file_name,
const struct svm_model *model);
This function saves a model to a file; returns 0 on success, or -1
if an error occurs.
- Function: struct svm_model *svm_load_model(const char
*model_file_name);
This function returns a pointer to the model read from the file,
or a null pointer if the model could not be loaded.
- Function: void svm_free_model_content(struct svm_model *model_ptr);
This function frees the memory used by the entries in a model
structure.
- Function: void svm_free_and_destroy_model(struct svm_model
**model_ptr_ptr);
This function frees the memory used by a model and destroys the model
structure. It is equivalent to svm_destroy_model, which
is deprecated after version 3.0.
- Function: void svm_destroy_param(struct svm_parameter *param);
This function frees the memory used by a parameter set.
- Function: void svm_set_print_string_function(void (*print_func)(const
char *));
Users can specify their output format by a function. Use
svm_set_print_string_function(NULL);
for default printing to stdout.
Java Version
============
The pre-compiled java class archive `libsvm.jar' and its source files are
in the java directory. To run the programs, use
    java -classpath libsvm.jar svm_train <arguments>
    java -classpath libsvm.jar svm_predict <arguments>
    java -classpath libsvm.jar svm_toy
    java -classpath libsvm.jar svm_scale <arguments>

Note that you need Java 1.5 (5.0) or above to run it.
You may need to add Java runtime library (like classes.zip) to the
classpath.
You may need to increase maximum Java heap size.

Library usages are similar to the C version. These functions are available:

public class svm {
    public static final int LIBSVM_VERSION=318;
    public static svm_model svm_train(svm_problem prob, svm_parameter param);
    public static void svm_cross_validation(svm_problem prob, svm_parameter param, int nr_fold, double[] target);
    public static int svm_get_svm_type(svm_model model);
    public static int svm_get_nr_class(svm_model model);
    public static void svm_get_labels(svm_model model, int[] label);
    public static void svm_get_sv_indices(svm_model model, int[] indices);
    public static int svm_get_nr_sv(svm_model model);
    public static double svm_get_svr_probability(svm_model model);
    public static double svm_predict_values(svm_model model, svm_node[] x, double[] dec_values);
    public static double svm_predict(svm_model model, svm_node[] x);
    public static double svm_predict_probability(svm_model model, svm_node[] x, double[] prob_estimates);
    public static void svm_save_model(String model_file_name, svm_model model) throws IOException
    public static svm_model svm_load_model(String model_file_name) throws IOException
    public static String svm_check_parameter(svm_problem prob, svm_parameter param);
    public static int svm_check_probability_model(svm_model model);
    public static void svm_set_print_string_function(svm_print_interface print_func);
}
The library is in the "libsvm" package.
Note that in Java version, svm_node[] is not ended with a node whose
index = -1.
Users can specify their output format by
your_print_func = new svm_print_interface()
{
public void print(String s)
{
// your own format
}
};
svm.svm_set_print_string_function(your_print_func);
Building Windows Binaries
=========================
Windows binaries are in the directory `windows'. To build them via
Visual C++, use the following steps:

1. Open a DOS command box (or Visual Studio Command Prompt) and change
to libsvm directory. If environment variables of VC++ have not been
set, type
"C:\Program Files\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
You may have to modify the above command according to which version of
VC++ you use and where it is installed.
2. Type
nmake -f Makefile.win clean all
3. (optional) To build shared library libsvm.dll, type
nmake -f Makefile.win lib
4. (optional) To build 64-bit windows binaries, you must
(1) Setup vcvars64.bat instead of vcvars32.bat
(2) Change CFLAGS in Makefile.win: /D _WIN32 to /D _WIN64
Another way is to build them from Visual C++ environment. See details
in libsvm FAQ.
Additional Tools: Sub-sampling, Parameter Selection, Format checking, etc.
==========================================================================
See the README file in the tools directory.
MATLAB/OCTAVE Interface
=======================
Please check the file README in the directory `matlab'.
Python Interface
================
See the README file in python directory.
Additional Information
======================
If you find LIBSVM helpful, please cite it as
Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support
vector machines. ACM Transactions on Intelligent Systems and
Technology, 2:27:1--27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBSVM implementation document is available at
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

For any questions and comments, please email cjlin@csie.ntu.edu.tw


Acknowledgments:
This work was supported in part by the National Science
Council of Taiwan via the grant NSC 89-2213-E-002-013.
The authors thank their group members and users
for many helpful discussions and comments. They are listed in
http://www.csie.ntu.edu.tw/~cjlin/libsvm/acknowledgements

Libsvm is a simple, easy-to-use, and efficient software for SVM


classification and regression. It solves C-SVM classification, nu-SVM
classification, one-class-SVM, epsilon-SVM regression, and nu-SVM
regression. It also provides an automatic model selection tool for
C-SVM classification. This document explains the use of libsvm.

Libsvm is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Please read the COPYRIGHT file before using libsvm.

Table of Contents
=================

- Quick Start
- Installation and Data Format
- `svm-train' Usage
- `svm-predict' Usage
- `svm-scale' Usage
- Tips on Practical Use
- Examples
- Precomputed Kernels
- Library Usage
- Java Version
- Building Windows Binaries
- Additional Tools: Sub-sampling, Parameter Selection, Format checking, etc.
- MATLAB/OCTAVE Interface

- Python Interface
- Additional Information

Quick Start
===========

If you are new to SVM and if the data is not large, please go to
`tools' directory and use easy.py after installation. It does
everything automatic -- from data scaling to parameter selection.

Usage: easy.py training_file [testing_file]

More information about parameter selection can be found in


`tools/README.'

Installation and Data Format


============================

On Unix systems, type `make' to build the `svm-train' and `svm-predict'


programs. Run them without arguments to show the usages of them.

On other systems, consult `Makefile' to build them (e.g., see


'Building Windows binaries' in this file) or use the pre-built
binaries (Windows binaries are in the directory `windows').

The format of training and testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...


.
.
.

Each line contains an instance and is ended by a '\n' character. For


classification, <label> is an integer indicating the class label
(multi-class is supported). For regression, <label> is the target
value which can be any real number. For one-class SVM, it's not used
so can be any number. The pair <index>:<value> gives a feature
(attribute) value: <index> is an integer starting from 1 and <value>
is a real number. The only exception is the precomputed kernel, where
<index> starts from 0; see the section of precomputed kernels. Indices
must be in ASCENDING order. Labels in the testing file are only used
to calculate accuracy or errors. If they are unknown, just fill the
first column with any numbers.

A sample classification data included in this package is


`heart_scale'. To check if your data is in a correct form, use
`tools/checkdata.py' (details in `tools/README').

Type `svm-train heart_scale', and the program will read the training
data and output the model file `heart_scale.model'. If you have a test
set called heart_scale.t, then type `svm-predict heart_scale.t
heart_scale.model output' to see the prediction accuracy. The `output'

file contains the predicted class labels.

For classification, if training data are in only one class (i.e., all
labels are the same), then `svm-train' issues a warning message:
`Warning: training data in only one class. See README for details,'
which means the training data is very unbalanced. The label in the
training data is directly returned when testing.

There are some other useful programs in this package.

svm-scale:

This is a tool for scaling input data file.

svm-toy:

This is a simple graphical interface which shows how SVM


separate data in a plane. You can click in the window to
draw data points. Use "change" button to choose class
1, 2 or 3 (i.e., up to three classes are supported), "load"
button to load data from a file, "save" button to save data to
a file, "run" button to obtain an SVM model, and "clear"
button to clear the window.

You can enter options in the bottom of the window, the syntax of
options is the same as `svm-train'.

Note that "load" and "save" consider dense data format both in
classification and the regression cases. For classification,
each data point has one label (the color) that must be 1, 2,
or 3 and two attributes (x-axis and y-axis values) in
[0,1). For regression, each data point has one target value
(y-axis) and one attribute (x-axis values) in [0, 1).

Type `make' in respective directories to build them.

You need Qt library to build the Qt version.


(available from http://www.trolltech.com)

You need GTK+ library to build the GTK version.


(available from http://www.gtk.org)

The pre-built Windows binaries are in the `windows'


directory. We use Visual C++ on a 32-bit machine, so the
maximal cache size is 2GB.

`svm-train' Usage
=================

Usage: svm-train [options] training_set_file [model_file]


options:
-s svm_type : set type of SVM (default 0)

0 -- C-SVC

(multi-class classification)

1 -- nu-SVC

(multi-class classification)

2 -- one-class SVM
3 -- epsilon-SVR

(regression)

4 -- nu-SVR

(regression)

-t kernel_type : set type of kernel function (default 2)


0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates,
0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)

The k in the -g option means the number of attributes in the input data.

option -v randomly splits the data into n parts and calculates cross
validation accuracy/mean squared error on them.

See libsvm FAQ for the meaning of outputs.

`svm-predict' Usage
===================

Usage: svm-predict [options] test_file model_file output_file


options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for
one-class SVM only 0 is supported

model_file is the model file generated by svm-train.


test_file is the test data you want to predict.
svm-predict will produce output in the output_file.

`svm-scale' Usage
=================

Usage: svm-scale [options] data_filename


options:
-l lower : x scaling lower limit (default -1)

-u upper : x scaling upper limit (default +1)


-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename

See 'Examples' in this file for examples.

Tips on Practical Use


=====================

* Scale your data. For example, scale each attribute to [0,1] or [-1,+1].
* For C-SVC, consider using the model selection tool in the tools directory.
* nu in nu-SVC/one-class-SVM/nu-SVR approximates the fraction of training
errors and support vectors.
* If data for classification are unbalanced (e.g. many positive and
few negative), try different penalty parameters C by -wi (see
examples below).
* Specify larger cache size (i.e., larger -m) for huge problems.

Examples
========

> svm-scale -l -1 -u 1 -s range train > train.scale


> svm-scale -r range test > test.scale

Scale each feature of the training data to be in [-1,1]. Scaling

factors are stored in the file range and then used for scaling the
test data.

> svm-train -s 0 -c 5 -t 2 -g 0.5 -e 0.1 data_file

Train a classifier with RBF kernel exp(-0.5|u-v|^2), C=10, and


stopping tolerance 0.1.

> svm-train -s 3 -p 0.1 -t 0 data_file

Solve SVM regression with linear kernel u'v and epsilon=0.1


in the loss function.

> svm-train -c 10 -w1 1 -w-2 5 -w4 2 data_file

Train a classifier with penalty 10 = 1 * 10 for class 1, penalty 50 =


5 * 10 for class -2, and penalty 20 = 2 * 10 for class 4.

> svm-train -s 0 -c 100 -g 0.1 -v 5 data_file

Do five-fold cross validation for the classifier using


the parameters C = 100 and gamma = 0.1

> svm-train -s 0 -b 1 data_file


> svm-predict -b 1 test_file data_file.model output_file

Obtain a model with probability information and predict test data with
probability estimates

Precomputed Kernels
===================

Users may precompute kernel values and input them as training and
testing files. Then libsvm does not need the original
training/testing sets.

Assume there are L training instances x1, ..., xL and.


Let K(x, y) be the kernel
value of two instances x and y. The input formats
are:

New training instance for xi:

<label> 0:i 1:K(xi,x1) ... L:K(xi,xL)

New testing instance for any x:

<label> 0:? 1:K(x,x1) ... L:K(x,xL)

That is, in the training file the first column must be the "ID" of
xi. In testing, ? can be any value.

All kernel values including ZEROs must be explicitly provided. Any


permutation or random subsets of the training/testing files are also
valid (see examples below).

Note: the format is slightly different from the precomputed kernel


package released in libsvmtools earlier.

Examples:

Assume the original training data has three four-feature


instances and testing data has one instance:

15 1:1 2:1 3:1 4:1


45

2:3

25

15 1:1

4:3

3:1

3:1

If the linear kernel is used, we have the following new


training/testing sets:

15 0:1 1:4 2:6 3:1


45 0:2 1:6 2:18 3:0
25 0:3 1:1 2:0 3:1

15 0:? 1:2 2:0 3:1

? can be any value.

Any subset of the above training file is also valid. For example,

25 0:3 1:1 2:0 3:1


45 0:2 1:6 2:18 3:0

implies that the kernel matrix is

[K(2,2) K(2,3)] = [18 0]


[K(3,2) K(3,3)] = [0 1]

Library Usage
=============

These functions and structures are declared in the header file


`svm.h'. You need to #include "svm.h" in your C/C++ source files and
link your program with `svm.cpp'. You can see `svm-train.c' and
`svm-predict.c' for examples showing how to use them. We define
LIBSVM_VERSION and declare `extern int libsvm_version; ' in svm.h, so
you can check the version number.

Before you classify test data, you need to construct an SVM model
(`svm_model') using training data. A model can also be saved in
a file for later use. Once an SVM model is available, you can use it

to classify new data.

- Function: struct svm_model *svm_train(const struct svm_problem *prob,


const struct svm_parameter *param);

This function constructs and returns an SVM model according to


the given training data and parameters.

struct svm_problem describes the problem:

struct svm_problem
{
int l;
double *y;
struct svm_node **x;
};

where `l' is the number of training data, and `y' is an array containing
their target values. (integers in classification, real numbers in
regression) `x' is an array of pointers, each of which points to a sparse
representation (array of svm_node) of one training vector.

For example, if we have the following training data:

LABEL

ATTR1 ATTR2 ATTR3 ATTR4 ATTR5

-----

----- -----

----- ----- -----

0.1

0.2

0.1

0.3

-1.2

0.4

0.1

1.4

0.5

-0.1

-0.2

0.1

1.1

0.1

then the components of svm_problem are:

l=5

y -> 1 2 1 2 3

x -> [ ] -> (2,0.1) (3,0.2) (-1,?)


[ ] -> (2,0.1) (3,0.3) (4,-1.2) (-1,?)
[ ] -> (1,0.4) (-1,?)
[ ] -> (2,0.1) (4,1.4) (5,0.5) (-1,?)
[ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (-1,?)

where (index,value) is stored in the structure `svm_node':

struct svm_node
{
int index;
double value;
};

index = -1 indicates the end of one vector. Note that indices must
be in ASCENDING order.

struct svm_parameter describes the parameters of an SVM model:

struct svm_parameter
{
	int svm_type;
	int kernel_type;
	int degree;		/* for poly */
	double gamma;		/* for poly/rbf/sigmoid */
	double coef0;		/* for poly/sigmoid */

	/* these are for training only */
	double cache_size;	/* in MB */
	double eps;		/* stopping criteria */
	double C;		/* for C_SVC, EPSILON_SVR, and NU_SVR */
	int nr_weight;		/* for C_SVC */
	int *weight_label;	/* for C_SVC */
	double* weight;		/* for C_SVC */
	double nu;		/* for NU_SVC, ONE_CLASS, and NU_SVR */
	double p;		/* for EPSILON_SVR */
	int shrinking;		/* use the shrinking heuristics */
	int probability;	/* do probability estimates */
};

svm_type can be one of C_SVC, NU_SVC, ONE_CLASS, EPSILON_SVR, NU_SVR.

	C_SVC:		C-SVM classification
	NU_SVC:		nu-SVM classification
	ONE_CLASS:	one-class-SVM
	EPSILON_SVR:	epsilon-SVM regression
	NU_SVR:		nu-SVM regression

kernel_type can be one of LINEAR, POLY, RBF, SIGMOID, PRECOMPUTED.

	LINEAR:		u'*v
	POLY:		(gamma*u'*v + coef0)^degree
	RBF:		exp(-gamma*|u-v|^2)
	SIGMOID:	tanh(gamma*u'*v + coef0)
	PRECOMPUTED:	kernel values in training_set_file

cache_size is the size of the kernel cache, specified in megabytes.
C is the cost of constraint violation. eps is the stopping criterion
(we usually use 0.00001 in nu-SVC, 0.001 in others). nu is the
parameter in nu-SVM, nu-SVR, and one-class-SVM. p is the epsilon in
the epsilon-insensitive loss function of epsilon-SVM regression.
shrinking = 1 means shrinking is conducted; = 0 otherwise.
probability = 1 means a model with probability information is
obtained; = 0 otherwise.

nr_weight, weight_label, and weight are used to change the penalty
for some classes (if the weight for a class is not changed, it is
set to 1). This is useful for training a classifier using unbalanced
input data or with asymmetric misclassification costs.

nr_weight is the number of elements in the arrays weight_label and
weight. Each weight[i] corresponds to weight_label[i], meaning that
the penalty of class weight_label[i] is scaled by a factor of weight[i].

If you do not want to change the penalty for any of the classes,
just set nr_weight to 0.

*NOTE* Because svm_model contains pointers to svm_problem, you can
not free the memory used by svm_problem if you are still using the
svm_model produced by svm_train().

*NOTE* To avoid wrong parameters, svm_check_parameter() should be
called before svm_train().
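
Putting the pieces together, a typical training sequence looks roughly like
the sketch below; the parameter values are only illustrative defaults, the
file name is arbitrary, and error handling is reduced to a single check.

#include <stdio.h>
#include <string.h>
#include "svm.h"

/* Train C-SVC with an RBF kernel on an already-filled svm_problem. */
struct svm_model *train_rbf_csvc(const struct svm_problem *prob)
{
	struct svm_parameter param;
	memset(&param, 0, sizeof(param));
	param.svm_type = C_SVC;
	param.kernel_type = RBF;
	param.gamma = 0.5;        /* illustrative value          */
	param.C = 1;
	param.cache_size = 100;   /* MB                          */
	param.eps = 1e-3;         /* stopping tolerance          */
	param.nr_weight = 0;      /* do not change any penalty   */
	param.shrinking = 1;
	param.probability = 0;

	/* always validate the parameters before training */
	const char *msg = svm_check_parameter(prob, &param);
	if (msg != NULL) {
		fprintf(stderr, "parameter error: %s\n", msg);
		return NULL;
	}
	struct svm_model *model = svm_train(prob, &param);
	svm_save_model("example.model", model);
	svm_destroy_param(&param);  /* frees weight arrays; none were set here */
	return model;
}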

struct svm_model stores the model obtained from the training procedure.
It is not recommended to directly access entries in this structure.
Programmers should use the interface functions to get the values.

struct svm_model
{
	struct svm_parameter param;	/* parameter */
	int nr_class;		/* number of classes, = 2 in regression/one class svm */
	int l;			/* total #SV */
	struct svm_node **SV;	/* SVs (SV[l]) */
	double **sv_coef;	/* coefficients for SVs in decision functions (sv_coef[k-1][l]) */
	double *rho;		/* constants in decision functions (rho[k*(k-1)/2]) */
	double *probA;		/* pairwise probability information */
	double *probB;
	int *sv_indices;	/* sv_indices[0,...,nSV-1] are values in [1,...,num_training_data]
				   to indicate SVs in the training set */

	/* for classification only */

	int *label;		/* label of each class (label[k]) */
	int *nSV;		/* number of SVs for each class (nSV[k]) */
				/* nSV[0] + nSV[1] + ... + nSV[k-1] = l */
	/* XXX */
	int free_sv;		/* 1 if svm_model is created by svm_load_model */
				/* 0 if svm_model is created by svm_train */
};

param describes the parameters used to obtain the model.

nr_class is the number of classes. It is 2 for regression and one-class SVM.

l is the number of support vectors. SV and sv_coef are support
vectors and the corresponding coefficients, respectively. Assume there are
k classes. For data in class j, the corresponding sv_coef includes (k-1) y*alpha vectors,
where alpha's are solutions of the following two-class problems:

1 vs j, 2 vs j, ..., j-1 vs j, j vs j+1, j vs j+2, ..., j vs k

and y=1 for the first j-1 vectors, while y=-1 for the remaining k-j
vectors. For example, if there are 4 classes, sv_coef and SV are like:

    +-+-+-+--------------------+
    |1|1|1|                    |
    |v|v|v|  SVs from class 1  |
    |2|3|4|                    |
    +-+-+-+--------------------+
    |1|2|2|                    |
    |v|v|v|  SVs from class 2  |
    |2|3|4|                    |
    +-+-+-+--------------------+
    |1|2|3|                    |
    |v|v|v|  SVs from class 3  |
    |3|3|4|                    |
    +-+-+-+--------------------+
    |1|2|3|                    |
    |v|v|v|  SVs from class 4  |
    |4|4|4|                    |
    +-+-+-+--------------------+

See svm_train() for an example of assigning values to sv_coef.

rho is the bias term (-b). probA and probB are parameters used in
probability outputs. If there are k classes, there are k*(k-1)/2
binary problems as well as rho, probA, and probB values. They are
aligned in the order of binary problems:
1 vs 2, 1 vs 3, ..., 1 vs k, 2 vs 3, ..., 2 vs k, ..., k-1 vs k.

sv_indices[0,...,nSV-1] are values in [1,...,num_training_data] to
indicate support vectors in the training set.

label contains labels in the training data.

nSV is the number of support vectors in each class.

free_sv is a flag used to determine whether the space of SV should
be released in free_model_content(struct svm_model*) and
free_and_destroy_model(struct svm_model**). If the model is
generated by svm_train(), then SV points to data in svm_problem
and should not be removed. For example, free_sv is 0 if svm_model
is created by svm_train, but is 1 if created by svm_load_model.

- Function: double svm_predict(const struct svm_model *model, const struct svm_node *x);

This function does classification or regression on a test vector x
given a model.

For a classification model, the predicted class for x is returned.
For a regression model, the function value of x calculated using
the model is returned. For a one-class model, +1 or -1 is
returned.
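
For example, predicting the class of a single sparse vector with a loaded
model looks like the sketch below (the feature values are arbitrary):

#include <stdio.h>
#include "svm.h"

void predict_one(const char *model_file)
{
	struct svm_model *model = svm_load_model(model_file);
	if (model == NULL)
		return;

	/* test vector 2:0.1 3:0.2, terminated by index = -1 */
	struct svm_node x[] = {{2,0.1},{3,0.2},{-1,0}};
	double label = svm_predict(model, x);
	printf("predicted label = %g\n", label);

	svm_free_and_destroy_model(&model);
}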

- Function: void svm_cross_validation(const struct svm_problem *prob, const struct svm_parameter *param, int nr_fold, double *target);

This function conducts cross validation. Data are separated into
nr_fold folds. Under the given parameters, each fold is sequentially
validated using the model obtained from training on the remaining
folds. Predicted labels (of all prob's instances) in the validation
process are stored in the array called target.

The format of svm_prob is the same as that for svm_train().
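
A sketch of computing the cross-validation accuracy for classification from
the target array (assuming prob and param have already been set up as above):

#include <stdlib.h>
#include "svm.h"

double cv_accuracy(const struct svm_problem *prob,
                   const struct svm_parameter *param, int nr_fold)
{
	double *target = malloc(prob->l * sizeof(double));
	int correct = 0;

	/* target[i] receives the predicted label of instance i */
	svm_cross_validation(prob, param, nr_fold, target);
	for (int i = 0; i < prob->l; i++)
		if (target[i] == prob->y[i])
			correct++;
	free(target);
	return 100.0 * correct / prob->l;
}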

- Function: int svm_get_svm_type(const struct svm_model *model);

This function gives svm_type of the model. Possible values of
svm_type are defined in svm.h.

- Function: int svm_get_nr_class(const svm_model *model);

For a classification model, this function gives the number of
classes. For a regression or a one-class model, 2 is returned.

- Function: void svm_get_labels(const svm_model *model, int* label)

For a classification model, this function outputs the name of
labels into an array called label. For regression and one-class
models, label is unchanged.

- Function: void svm_get_sv_indices(const struct svm_model *model, int *sv_indices)

This function outputs indices of support vectors into an array called sv_indices.
The size of sv_indices is the number of support vectors and can be obtained by calling
svm_get_nr_sv.
Each sv_indices[i] is in the range of [1, ..., num_training_data].

- Function: int svm_get_nr_sv(const struct svm_model *model)

This function gives the total number of support vectors.

- Function: double svm_get_svr_probability(const struct svm_model *model);

For a regression model with probability information, this function
outputs a value sigma > 0. For test data, we consider the
probability model: target value = predicted value + z, where z follows
the Laplace distribution with density e^(-|z|/sigma)/(2*sigma).

If the model is not for SVR or does not contain the required
information, 0 is returned.

- Function: double svm_predict_values(const svm_model *model, const svm_node *x, double* dec_values)

This function gives decision values on a test vector x given a
model, and returns the predicted label (classification) or
the function value (regression).

For a classification model with nr_class classes, this function
gives nr_class*(nr_class-1)/2 decision values in the array
dec_values, where nr_class can be obtained from the function
svm_get_nr_class. The order is label[0] vs. label[1], ...,
label[0] vs. label[nr_class-1], label[1] vs. label[2], ...,
label[nr_class-2] vs. label[nr_class-1], where label can be
obtained from the function svm_get_labels. The returned value is
the predicted class for x. Note that when nr_class = 1, this
function does not give any decision value.

For a regression model, dec_values[0] and the returned value are
both the function value of x calculated using the model. For a
one-class model, dec_values[0] is the decision value of x, while
the returned value is +1/-1.
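
The sketch below shows the bookkeeping for the decision values of a
classification model described above: dec_values must have room for
nr_class*(nr_class-1)/2 entries, ordered over the label pairs.

#include <stdio.h>
#include <stdlib.h>
#include "svm.h"

void show_decision_values(const struct svm_model *model, const struct svm_node *x)
{
	int k = svm_get_nr_class(model);
	double *dec_values = malloc(k * (k - 1) / 2 * sizeof(double));

	double pred = svm_predict_values(model, x, dec_values);
	printf("predicted label = %g\n", pred);

	/* pairwise values follow the order label[0] vs label[1], label[0] vs label[2], ... */
	for (int i = 0, p = 0; i < k; i++)
		for (int j = i + 1; j < k; j++, p++)
			printf("pair (%d,%d): %g\n", i, j, dec_values[p]);
	free(dec_values);
}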

- Function: double svm_predict_probability(const struct svm_model *model, const struct svm_node *x, double* prob_estimates);

This function does classification or regression on a test vector x
given a model with probability information.

For a classification model with probability information, this
function gives nr_class probability estimates in the array
prob_estimates. nr_class can be obtained from the function
svm_get_nr_class. The class with the highest probability is
returned. For regression/one-class SVM, the array prob_estimates
is unchanged and the returned value is the same as that of
svm_predict.

- Function: const char *svm_check_parameter(const struct svm_problem *prob, const struct svm_parameter *param);

This function checks whether the parameters are within the feasible
range of the problem. This function should be called before calling
svm_train() and svm_cross_validation(). It returns NULL if the
parameters are feasible, otherwise an error message is returned.

- Function: int svm_check_probability_model(const struct svm_model *model);

This function checks whether the model contains the required
information to do probability estimates. If so, it returns
+1. Otherwise, 0 is returned. This function should be called
before calling svm_get_svr_probability and
svm_predict_probability.

- Function: int svm_save_model(const char *model_file_name, const struct svm_model *model);

This function saves a model to a file; it returns 0 on success, or -1
if an error occurs.

- Function: struct svm_model *svm_load_model(const char *model_file_name);

This function returns a pointer to the model read from the file,
or a null pointer if the model could not be loaded.

- Function: void svm_free_model_content(struct svm_model *model_ptr);

This function frees the memory used by the entries in a model structure.

- Function: void svm_free_and_destroy_model(struct svm_model **model_ptr_ptr);

This function frees the memory used by a model and destroys the model
structure. It is equivalent to svm_destroy_model, which
is deprecated after version 3.0.

- Function: void svm_destroy_param(struct svm_parameter *param);

This function frees the memory used by a parameter set.

- Function: void svm_set_print_string_function(void (*print_func)(const char *));

Users can specify their output format by a function. Use

    svm_set_print_string_function(NULL);

for default printing to stdout.
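
For instance, redirecting all LIBSVM training messages to stderr, or silencing
them entirely, can be done with a small callback (a sketch):

#include <stdio.h>
#include "svm.h"

/* send LIBSVM's training messages to stderr instead of stdout */
static void print_to_stderr(const char *s)
{
	fputs(s, stderr);
}

/* swallow all messages (useful for a "quiet" mode) */
static void print_nothing(const char *s)
{
	(void)s;
}

void set_output(int quiet)
{
	if (quiet)
		svm_set_print_string_function(print_nothing);
	else
		svm_set_print_string_function(print_to_stderr);
}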

Java Version
============

The pre-compiled java class archive `libsvm.jar' and its source files are
in the java directory. To run the programs, use

java -classpath libsvm.jar svm_train <arguments>
java -classpath libsvm.jar svm_predict <arguments>
java -classpath libsvm.jar svm_toy
java -classpath libsvm.jar svm_scale <arguments>

Note that you need Java 1.5 (5.0) or above to run it.

You may need to add the Java runtime library (like classes.zip) to the
classpath, and you may need to increase the maximum Java heap size.

Library usages are similar to the C version. These functions are available:

public class svm {

public static final int LIBSVM_VERSION=318;


public static svm_model svm_train(svm_problem prob, svm_parameter param);
public static void svm_cross_validation(svm_problem prob, svm_parameter param,
int nr_fold, double[] target);
public static int svm_get_svm_type(svm_model model);
public static int svm_get_nr_class(svm_model model);
public static void svm_get_labels(svm_model model, int[] label);
public static void svm_get_sv_indices(svm_model model, int[] indices);
public static int svm_get_nr_sv(svm_model model);
public static double svm_get_svr_probability(svm_model model);
public static double svm_predict_values(svm_model model, svm_node[] x,
double[] dec_values);
public static double svm_predict(svm_model model, svm_node[] x);
public static double svm_predict_probability(svm_model model, svm_node[] x,
double[] prob_estimates);
public static void svm_save_model(String model_file_name, svm_model model)
throws IOException
public static svm_model svm_load_model(String model_file_name) throws
IOException
public static String svm_check_parameter(svm_problem prob, svm_parameter
param);
public static int svm_check_probability_model(svm_model model);
public static void svm_set_print_string_function(svm_print_interface print_func);
}

The library is in the "libsvm" package.

Note that in the Java version, svm_node[] is not ended with a node whose index = -1.

Users can specify their output format by

your_print_func = new svm_print_interface()
{
	public void print(String s)
	{
		// your own format
	}
};
svm.svm_set_print_string_function(your_print_func);

Building Windows Binaries
=========================

Windows binaries are in the directory `windows'. To build them via
Visual C++, use the following steps:

1. Open a DOS command box (or Visual Studio Command Prompt) and change
to libsvm directory. If environment variables of VC++ have not been
set, type

"C:\Program Files\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"

You may have to modify the above command according to which version of
VC++ you use and where it is installed.

2. Type

nmake -f Makefile.win clean all

3. (optional) To build shared library libsvm.dll, type

nmake -f Makefile.win lib

4. (optional) To build 64-bit windows binaries, you must
	(1) Set up vcvars64.bat instead of vcvars32.bat
	(2) Change CFLAGS in Makefile.win: /D _WIN32 to /D _WIN64

Another way is to build them from Visual C++ environment. See details
in libsvm FAQ.

Additional Tools: Sub-sampling, Parameter Selection, Format checking, etc.
==========================================================================

See the README file in the tools directory.

MATLAB/OCTAVE Interface
=======================

Please check the file README in the directory `matlab'.

Python Interface
================

See the README file in python directory.

Additional Information
======================

If you find LIBSVM helpful, please cite it as

Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support
vector machines. ACM Transactions on Intelligent Systems and
Technology, 2:27:1--27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm

The LIBSVM implementation document is available at
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

For any questions and comments, please email cjlin@csie.ntu.edu.tw

Acknowledgments:
This work was supported in part by the National Science
Council of Taiwan via the grant NSC 89-2213-E-002-013.
The authors thank their group members and users
for many helpful discussions and comments. They are listed in

http://www.csie.ntu.edu.tw/~cjlin/libsvm/acknowledgements

This directory includes some useful codes:

1. subset selection tools
2. parameter selection tools
3. LIBSVM format checking tools

Part I: Subset selection tools

Introduction
============

Training large data is time consuming. Sometimes one should work on a
smaller subset first. The python script subset.py randomly selects a
specified number of samples. For classification data, we provide a
stratified selection to ensure the same class distribution in the
subset.

Usage: subset.py [options] dataset number [output1] [output2]

This script selects a subset of the given data set.

options:
-s method : method of selection (default 0)
0 -- stratified selection (classification only)
1 -- random selection

output1 : the subset (optional)
output2 : the rest of data (optional)

If output1 is omitted, the subset will be printed on the screen.

Example
=======

> python subset.py heart_scale 100 file1 file2

From heart_scale, 100 samples are randomly selected and stored in
file1. All remaining instances are stored in file2.

Part II: Parameter Selection Tools

Introduction
============

grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses the cross validation (CV)
technique to estimate the accuracy of each parameter combination in
the specified range and helps you to decide the best parameters for
your problem.

grid.py directly executes libsvm binaries (so no python binding is needed)
for cross validation and then draws the contour of CV accuracy using gnuplot.
You must have libsvm and gnuplot installed before using it. The package
gnuplot is available at http://www.gnuplot.info/

On Mac OSX, the precompiled gnuplot file needs the library AquaTerm,
which thus must be installed as well. In addition, this version of
gnuplot does not support png, so you need to change "set term png
transparent small" and use other image formats. For example, you may
use "set term pbm small color".

Usage: grid.py [grid_options] [svm_options] dataset

grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
    pathname -- set gnuplot executable path and name
    "null"   -- do not plot
-out {pathname | "null"} : (default dataset.out)
    pathname -- set output file path and name
    "null"   -- do not output file
-png pathname : set graphic output file path and name (default dataset.png)
-resume [pathname] : resume the grid task using an existing output file
    (default pathname is dataset.out)
    Use this option only if some parameters have been checked for the SAME data.

svm_options : additional options for svm-train

The program conducts v-fold cross validation using parameter C (and gamma)
= 2^begin, 2^(begin+step), ..., 2^end.

You can specify where the libsvm executable and gnuplot are using the
-svmtrain and -gnuplot parameters.

For windows users, please use pgnuplot.exe. If you are using gnuplot
3.7.1, please upgrade to version 3.7.3 or higher. The version 3.7.1
has a bug. If you use cygwin on windows, please use gnuplot-x11.

If the task is terminated accidentally or you would like to change the
range of parameters, you can apply '-resume' to save time by re-using
previous results. You may specify the output file of a previous run
or use the default (i.e., dataset.out) without giving a name. Please
note that the same condition must be used in two runs. For example,
you cannot use '-v 10' earlier and resume the task with '-v 5'.

The value of some options can be "null". For example, `-log2c -1,0,1
-log2g "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma
value. That is, you do not conduct parameter selection on gamma.

Example
=======

> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale

Users (in particular MS Windows users) may need to specify the path of
executable files. You can either change paths in the beginning of
grid.py or specify them in the command line. For example,

> grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale

Output: two files
dataset.png: the CV accuracy contour plot generated by gnuplot
dataset.out: the CV accuracy at each (log2(C),log2(gamma))

The following example saves running time by loading the output file of a previous run.

> python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale

Parallel grid search
====================

You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, you add
machine names in grid.py:

ssh_workers = ["linux1", "linux5", "linux5"]

and then set up your ssh so that authentication works without
asking for a password.

The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also enlarge the nr_local_worker. For example:

nr_local_worker = 2

Example:

> python grid.py heart_scale
[local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
[linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
[linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
[linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
.
.
.

If -log2c, -log2g, or -v is not specified, default values are used.

If your system uses telnet instead of ssh, you list the computer names
in telnet_workers.

Calling grid in Python
======================

In addition to using grid.py as a command-line tool, you can use it as a
Python module.

>>> rate, param = find_parameters(dataset, options)

You need to specify `dataset' and `options' (default ''). See the following example.

> python

>>> from grid import *
>>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
[local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
[local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
.
.
[local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
.
.

>>> rate
78.8889
>>> param
{'c': 0.5, 'g': 0.5}

Part III: LIBSVM format checking tools

Introduction
============

`svm-train' conducts only a simple check of the input data. To do a
detailed check, we provide a python script `checkdata.py'.

Usage: checkdata.py dataset

Exit status (returned value): 1 if there are errors, 0 otherwise.

This tool is written by Rong-En Fan at National Taiwan University.

Example
=======

> cat bad_data
1 3:1 2:4
> python checkdata.py bad_data

line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
Found 1 lines with error.

A Practical Guide to Support Vector Classification

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin
Department of Computer Science
National Taiwan University, Taipei 106, Taiwan
http://www.csie.ntu.edu.tw/~cjlin
Initial version: 2003 Last updated: April 15, 2010
Abstract
The support vector machine (SVM) is a popular classification technique.
However, beginners who are not familiar with SVM often get unsatisfactory
results since they miss some easy but significant steps. In this guide, we propose
a simple procedure which usually gives reasonable results.

Introduction

SVMs (Support Vector Machines) are a useful technique for data classification. Although SVM is considered easier to use than Neural Networks, users not familiar with
it often get unsatisfactory results at first. Here we outline a cookbook approach
which usually gives reasonable results.
Note that this guide is not for SVM researchers nor do we guarantee you will
achieve the highest accuracy. Also, we do not intend to solve challenging or difficult problems. Our purpose is to give SVM novices a recipe for rapidly obtaining
acceptable results.
Although users do not need to understand the underlying theory behind SVM, we
briefly introduce the basics necessary for explaining our procedure. A classification
task usually involves separating data into training and testing sets. Each instance
in the training set contains one target value (i.e. the class labels) and several
attributes (i.e. the features or observed variables). The goal of SVM is to produce
a model (based on the training data) which predicts the target values of the test data
given only the test data attributes.
Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and
y ∈ {1, −1}^l, the support vector machines (SVM) (Boser et al., 1992; Cortes and
Vapnik, 1995) require the solution of the following optimization problem:

\[
\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^T\phi(x_i)+b\bigr) \ge 1-\xi_i, \\
& \xi_i \ge 0.
\end{aligned}
\tag{1}
\]

Table 1: Problem characteristics and performance comparisons.

Applications        #training  #testing  #features  #classes  Accuracy   Accuracy by
                    data       data                           by users   our procedure
Astroparticle[1]    3,089      4,000     4          2         75.2%      96.9%
Bioinformatics[2]   391        0[4]      20         3         36%        85.2%
Vehicle[3]          1,243      41        21         2         4.88%      87.8%

Here training vectors x_i are mapped into a higher (maybe infinite) dimensional space
by the function φ. SVM finds a linear separating hyperplane with the maximal margin
in this higher dimensional space. C > 0 is the penalty parameter of the error term.
Furthermore, K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is called the kernel function. Though new
kernels are being proposed by researchers, beginners may find in SVM books the
following four basic kernels:

- linear: K(x_i, x_j) = x_i^T x_j.
- polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0.
- radial basis function (RBF): K(x_i, x_j) = exp(−γ ||x_i − x_j||^2), γ > 0.
- sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r).

Here, γ, r, and d are kernel parameters.

1.1 Real-World Examples

Table 1 presents some real-world examples. These data sets are supplied by our users
who could not obtain reasonable accuracy in the beginning. Using the procedure
illustrated in this guide, we help them to achieve better performance. Details are in
Appendix A.

These data sets are at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/

[1] Courtesy of Jan Conrad from Uppsala University, Sweden.
[2] Courtesy of Cory Spencer from Simon Fraser University, Canada (Gardy et al., 2003).
[3] Courtesy of a user from Germany.
[4] As there are no testing data, cross-validation instead of testing accuracy is presented here.
Details of cross-validation are in Section 3.2.

1.2 Proposed Procedure

Many beginners use the following procedure now:

- Transform data to the format of an SVM package
- Randomly try a few kernels and parameters
- Test

We propose that beginners try the following procedure first:

- Transform data to the format of an SVM package
- Conduct simple scaling on the data
- Consider the RBF kernel K(x, y) = e^{−γ||x−y||^2}
- Use cross-validation to find the best parameters C and γ
- Use the best parameters C and γ to train the whole training set[5]
- Test

We discuss this procedure in detail in the following sections.

Data Preprocessing

2.1 Categorical Feature

SVM requires that each data instance is represented as a vector of real numbers.
Hence, if there are categorical attributes, we first have to convert them into numeric
data. We recommend using m numbers to represent an m-category attribute. Only
one of the m numbers is one, and others are zero. For example, a three-category
attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
Our experience indicates that if the number of values in an attribute is not too large,
this coding might be more stable than using a single number.
[5] The best parameter might be affected by the size of data set but in practice the one obtained
from cross-validation is already suitable for the whole training set.

2.2 Scaling

Scaling before applying SVM is very important. Part 2 of Sarle's Neural Networks
FAQ (Sarle, 1997) explains the importance of this, and most of the considerations also
apply to SVM. The main advantage of scaling is to avoid attributes in greater numeric
ranges dominating those in smaller numeric ranges. Another advantage is to avoid
numerical difficulties during the calculation. Because kernel values usually depend on
the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel,
large attribute values might cause numerical problems. We recommend linearly
scaling each attribute to the range [−1, +1] or [0, 1].

Of course we have to use the same method to scale both training and testing
data. For example, suppose that we scaled the first attribute of training data from
[−10, +10] to [−1, +1]. If the first attribute of testing data lies in the range [−11, +8],
we must scale the testing data to [−1.1, +0.8]. See Appendix B for some real examples.
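
The mapping itself is a simple affine transformation. The C helper below is
only a sketch of the idea (it is not svm-scale); the point is that the same
training-set minimum and maximum are reused when scaling the testing data.

/* Linearly map x from [train_min, train_max] to [lower, upper], e.g. [-1, +1].
 * The same train_min/train_max must be used for both training and testing
 * data, so testing values outside the training range may map outside
 * [lower, upper]. */
double scale_value(double x, double train_min, double train_max,
                   double lower, double upper)
{
	return lower + (upper - lower) * (x - train_min) / (train_max - train_min);
}

With train_min = -10, train_max = +10, lower = -1, and upper = +1, a testing
value of -11 is mapped to -1.1, matching the example above.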

Model Selection

Though there are only four common kernels mentioned in Section 1, we must decide
which one to try first. Then the penalty parameter C and kernel parameters are
chosen.

3.1 RBF Kernel

In general, the RBF kernel is a reasonable first choice. This kernel nonlinearly maps
samples into a higher dimensional space so it, unlike the linear kernel, can handle the
case when the relation between class labels and attributes is nonlinear. Furthermore,
the linear kernel is a special case of RBF (Keerthi and Lin, 2003) since the linear
kernel with a penalty parameter C has the same performance as the RBF kernel with
some parameters (C, γ). In addition, the sigmoid kernel behaves like RBF for certain
parameters (Lin and Lin, 2003).

The second reason is the number of hyperparameters which influences the complexity
of model selection. The polynomial kernel has more hyperparameters than
the RBF kernel.

Finally, the RBF kernel has fewer numerical difficulties. One key point is
0 < K_ij <= 1, in contrast to polynomial kernels, whose kernel values may go to infinity
(γ x_i^T x_j + r > 1) or zero (γ x_i^T x_j + r < 1) while the degree is large. Moreover, we
must note that the sigmoid kernel is not valid (i.e. not the inner product of two
vectors) under some parameters (Vapnik, 1995).

There are some situations where the RBF kernel is not suitable. In particular,
when the number of features is very large, one may just use the linear kernel. We
discuss details in Appendix C.

3.2 Cross-validation and Grid-search

There are two parameters for an RBF kernel: C and γ. It is not known beforehand
which C and γ are best for a given problem; consequently some kind of model selection
(parameter search) must be done. The goal is to identify good (C, γ) so that the
classifier can accurately predict unknown data (i.e. testing data). Note that it may
not be useful to achieve high training accuracy (i.e. a classifier which accurately
predicts training data whose class labels are indeed known). As discussed above, a
common strategy is to separate the data set into two parts, of which one is considered
unknown. The prediction accuracy obtained from the unknown set more precisely
reflects the performance on classifying an independent data set. An improved version
of this procedure is known as cross-validation.
In v-fold cross-validation, we first divide the training set into v subsets of equal
size. Sequentially one subset is tested using the classifier trained on the remaining
v−1 subsets. Thus, each instance of the whole training set is predicted once so the
cross-validation accuracy is the percentage of data which are correctly classified.
The cross-validation procedure can prevent the overfitting problem. Figure 1
represents a binary classification problem to illustrate this issue. Filled circles and
triangles are the training data while hollow circles and triangles are the testing data.
The testing accuracy of the classifier in Figures 1a and 1b is not good since it overfits
the training data. If we think of the training and testing data in Figure 1a and 1b
as the training and validation sets in cross-validation, the accuracy is not good. On
the other hand, the classifier in 1c and 1d does not overfit the training data and gives
better cross-validation as well as testing accuracy.
We recommend a grid-search on C and γ using cross-validation. Various pairs
of (C, γ) values are tried and the one with the best cross-validation accuracy is
picked. We found that trying exponentially growing sequences of C and γ is a
practical method to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15,
γ = 2^-15, 2^-13, ..., 2^3).
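
To make the search loop concrete (independently of grid.py), the following C
sketch evaluates the exponential grid with LIBSVM's svm_cross_validation.
It assumes prob already holds the scaled training data and that param holds
the fixed settings (C-SVC, RBF kernel, cache size, stopping tolerance, ...);
it is an illustration, not one of the guide's tools.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "svm.h"

/* Coarse grid search over (C, gamma) = (2^c_exp, 2^g_exp) using 5-fold CV. */
void coarse_grid_search(const struct svm_problem *prob, struct svm_parameter param)
{
	double *target = malloc(prob->l * sizeof(double));
	double best_rate = -1, best_c = 0, best_g = 0;

	for (int c_exp = -5; c_exp <= 15; c_exp += 2)
		for (int g_exp = -15; g_exp <= 3; g_exp += 2) {
			param.C = pow(2.0, c_exp);
			param.gamma = pow(2.0, g_exp);
			if (svm_check_parameter(prob, &param) != NULL)
				continue;

			svm_cross_validation(prob, &param, 5, target);
			int correct = 0;
			for (int i = 0; i < prob->l; i++)
				if (target[i] == prob->y[i])
					correct++;
			double rate = 100.0 * correct / prob->l;
			if (rate > best_rate) {
				best_rate = rate;
				best_c = param.C;
				best_g = param.gamma;
			}
		}
	printf("best C=%g, gamma=%g, CV rate=%g%%\n", best_c, best_g, best_rate);
	free(target);
}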
The grid-search is straightforward but seems naive. In fact, there are several
advanced methods which can save computational cost by, for example, approximating
the cross-validation rate. However, there are two motivations why we prefer the simple
grid-search approach.

Figure 1: An overfitting classifier and a better classifier (filled circles and
triangles: training data; hollow circles and triangles: testing data).
(a) Training data and an overfitting classifier. (b) Applying an overfitting
classifier on testing data. (c) Training data and a better classifier.
(d) Applying a better classifier on testing data.
One is that, psychologically, we may not feel safe to use methods which avoid
doing an exhaustive parameter search by approximations or heuristics. The other
reason is that the computational time required to find good parameters by gridsearch is not much more than that by advanced methods since there are only two
parameters. Furthermore, the grid-search can be easily parallelized because each
(C, ) is independent. Many of advanced methods are iterative processes, e.g. walking
along a path, which can be hard to parallelize.
Since doing a complete grid-search may still be time-consuming, we recommend
using a coarse grid first. After identifying a better region on the grid, a finer grid
search on that region can be conducted. To illustrate this, we do an experiment on
the problem german from the Statlog collection (Michie et al., 1994). After scaling
this set, we first use a coarse grid (Figure 2) and find that the best (C, γ) is (2^3, 2^-5)
with the cross-validation rate 77.5%. Next we conduct a finer grid search on the
neighborhood of (2^3, 2^-5) (Figure 3) and obtain a better cross-validation rate 77.6%
at (2^3.25, 2^-5.25). After the best (C, γ) is found, the whole training set is trained again
to generate the final classifier.

Figure 2: Loose grid search on C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3.

Figure 3: Fine grid-search on C = 2^1, 2^1.25, ..., 2^5 and γ = 2^-7, 2^-6.75, ..., 2^-3.
The above approach works well for problems with thousands or more data points.
For very large data sets a feasible approach is to randomly choose a subset of the
data set, conduct grid-search on them, and then do a better-region-only grid-search
on the complete data set.

Discussion

In some situations the above proposed procedure is not good enough, so other techniques such as feature selection may be needed. These issues are beyond the scope of
this guide. Our experience indicates that the procedure works well for data which do
not have many features. If there are thousands of attributes, there may be a need to
choose a subset of them before giving the data to SVM.

Acknowledgments
We thank all users of our SVM software LIBSVM and BSVM, who helped us to
identify possible difficulties encountered by beginners. We also thank some users (in
particular, Robert Campbell) for proofreading the paper.

Examples of the Proposed Procedure

In this appendix we compare accuracy by the proposed procedure with that often
used by general beginners. Experiments are on the three problems mentioned in
Table 1 by using the software LIBSVM (Chang and Lin, 2001). For each problem, we
first list the accuracy by direct training and testing. Secondly, we show the difference
in accuracy with and without scaling. From what has been discussed in Section 2.2,
the range of training set attributes must be saved so that we are able to restore
them while scaling the testing set. Thirdly, the accuracy by the proposed procedure

(scaling and then model selection) is presented. Finally, we demonstrate the use
of a tool in LIBSVM which does the whole procedure automatically. Note that a
similar parameter selection tool like the grid.py presented below is available in the
R-LIBSVM interface (see the function tune).

A.1 Astroparticle Physics

Original sets with default parameters

$ ./svm-train svmguide1
$ ./svm-predict svmguide1.t svmguide1.model svmguide1.t.predict
Accuracy = 66.925%

Scaled sets with default parameters
$ ./svm-scale -l -1 -u 1 -s range1 svmguide1 > svmguide1.scale
$ ./svm-scale -r range1 svmguide1.t > svmguide1.t.scale
$ ./svm-train svmguide1.scale
$ ./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.15%
Scaled sets with parameter selection (change to the directory tools, which
contains grid.py)
$ python grid.py svmguide1.scale

2.0 2.0 96.8922
(Best C=2.0, γ=2.0 with five-fold cross-validation rate=96.8922%)

$ ./svm-train -c 2 -g 2 svmguide1.scale
$ ./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.875%
Using an automatic script

$ python easy.py svmguide1 svmguide1.t
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Accuracy = 96.875% (3875/4000) (classification)

A.2 Bioinformatics

Original sets with default parameters

$ ./svm-train -v 5 svmguide2
Cross Validation Accuracy = 56.5217%
Scaled sets with default parameters
$ ./svm-scale -l -1 -u 1 svmguide2 > svmguide2.scale
$ ./svm-train -v 5 svmguide2.scale
Cross Validation Accuracy = 78.5166%
Scaled sets with parameter selection
$ python grid.py svmguide2.scale

2.0 0.5 85.1662
Cross Validation Accuracy = 85.1662%
(Best C=2.0, γ=0.5 with five-fold cross-validation rate=85.1662%)
Using an automatic script
$ python easy.py svmguide2
Scaling training data...
Cross validation...
Best c=2.0, g=0.5
Training...


A.3 Vehicle

Original sets with default parameters

$ ./svm-train svmguide3
$ ./svm-predict svmguide3.t svmguide3.model svmguide3.t.predict
Accuracy = 2.43902%
Scaled sets with default parameters
$ ./svm-scale -l -1 -u 1 -s range3 svmguide3 > svmguide3.scale
$ ./svm-scale -r range3 svmguide3.t > svmguide3.t.scale
$ ./svm-train svmguide3.scale
$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict
Accuracy = 12.1951%
Scaled sets with parameter selection
$ python grid.py svmguide3.scale

128.0 0.125 84.8753
(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)
$ ./svm-train -c 128 -g 0.125 svmguide3.scale
$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict
Accuracy = 87.8049%
Using an automatic script
$ python easy.py svmguide3 svmguide3.t
Scaling training data...
Cross validation...
Best c=128.0, g=0.125
Training...
Scaling testing data...
Testing...
Accuracy = 87.8049% (36/41) (classification)

Common Mistakes in Scaling Training and Testing Data

Section 2.2 stresses the importance of using the same scaling factors for training and
testing sets. We give a real example on classifying traffic light signals (courtesy of an
anonymous user) It is available at LIBSVM Data Sets.
If training and testing sets are separately scaled to [0, 1], the resulting accuracy is
lower than 70%.
$ ../svm-scale -l 0 svmguide4 > svmguide4.scale
$ ../svm-scale -l 0 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 69.2308% (216/312) (classification)
Using the same scaling factors for training and testing sets, we obtain much better
accuracy.
$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 89.4231% (279/312) (classification)
With the correct setting, the 10 features in svmguide4.t.scale have the following
maximal values:
0.7402, 0.4421, 0.6291, 0.8583, 0.5385, 0.7407, 0.3982, 1.0000, 0.8218, 0.9874
Clearly, the earlier way to scale the testing set to [0, 1] generates an erroneous set.

When to Use Linear but not RBF Kernel

If the number of features is large, one may not need to map data to a higher dimensional space. That is, the nonlinear mapping does not improve the performance.
Using the linear kernel is good enough, and one only searches for the parameter C.
While Section 3.1 describes that RBF is at least as good as linear, the statement is
true only after searching the (C, γ) space.
Next, we split our discussion to three parts:


C.1 Number of instances << number of features

Many microarray data in bioinformatics are of this type. We consider the Leukemia
data from the LIBSVM data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/
datasets). The training and testing sets have 38 and 34 instances, respectively. The
number of features is 7,129, much larger than the number of instances. We merge the
two files and compare the cross validation accuracy of using the RBF and the linear
kernels:
RBF kernel with parameter selection
$ cat leu leu.t > leu.combined
$ python grid.py leu.combined

8.0 3.0517578125e-05 97.2222
(Best C=8.0, γ=0.000030518 with five-fold cross-validation rate=97.2222%)
Linear kernel with parameter selection
$ python grid.py -log2c -1,2,1 -log2g 1,1,1 -t 0 leu.combined

0.5 2.0 98.6111
(Best C=0.5 with five-fold cross-validation rate=98.6111%)

Though grid.py was designed for the RBF kernel, the above way checks various
C using the linear kernel (-log2g 1,1,1 sets a dummy γ).
The cross-validation accuracy of using the linear kernel is comparable to that of using
the RBF kernel. Apparently, when the number of features is very large, one may not
need to map the data.
In addition to LIBSVM, the LIBLINEAR software mentioned below is also effective
for data in this case.

C.2 Both numbers of instances and features are large

Such data often occur in document classification. LIBSVM is not particularly good for
this type of problems. Fortunately, we have another software LIBLINEAR (Fan et al.,
2008), which is very suitable for such data. We illustrate the difference between

LIBSVM and LIBLINEAR using a document problem rcv1_train.binary from the
LIBSVM data sets. The numbers of instances and features are 20,242 and 47,236,
respectively.
$ time libsvm-2.85/svm-train -c 4 -t 0 -e 0.1 -m 800 -v 5 rcv1_train.binary
Cross Validation Accuracy = 96.8136%
345.569s
$ time liblinear-1.21/train -c 4 -e 0.1 -v 5 rcv1_train.binary
Cross Validation Accuracy = 97.0161%
2.944s
For five-fold cross validation, LIBSVM takes around 350 seconds, but LIBLINEAR uses
only 3. Moreover, LIBSVM consumes more memory as we allocate some spaces to
store recently used kernel elements (see -m 800). Clearly, LIBLINEAR is much faster
than LIBSVM to obtain a model with comparable accuracy.
LIBLINEAR is efficient for large-scale document classification. Let us consider a
large set rcv1_test.binary with 677,399 instances.
$ time liblinear-1.21/train -c 0.25 -v 5 rcv1_test.binary
Cross Validation Accuracy = 97.8538%
68.84s
Note that reading the data takes most of the time. The training of each training/validation split is less than four seconds.

C.3 Number of instances >> number of features

As the number of features is small, one often maps data to higher dimensional spaces
(i.e., using nonlinear kernels). However, if you really would like to use the linear
kernel, you may use LIBLINEAR with the option -s 2. When the number of features
is small, it is often faster than the default -s 1. Consider the data http://www.csie.
ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.scale.
bz2. The number of instances 581,012 is much larger than the number of features 54.
We run LIBLINEAR with -s 1 (default) and -s 2.
$ time liblinear-1.21/train -c 4 -v 5 -s 2 covtype.libsvm.binary.scale
Cross Validation Accuracy = 75.67%
67.224s
$ time liblinear-1.21/train -c 4 -v 5 -s 1 covtype.libsvm.binary.scale
Cross Validation Accuracy = 75.6711%
452.736s
Clearly, using -s 2 leads to shorter training time.

References

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin
classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning
Theory, pages 144-152. ACM Press, 1992.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273-297, 1995.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A
library for large linear classification. Journal of Machine Learning Research,
9:1871-1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

J. L. Gardy, C. Spencer, K. Wang, M. Ester, G. E. Tusnady, I. Simon, S. Hua,
K. deFays, C. Lambert, K. Nakai, and F. S. Brinkman. PSORT-B: improving
protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids
Research, 31(13):3613-3617, 2003.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with
Gaussian kernel. Neural Computation, 15(7):1667-1689, 2003.

H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM and the training of
non-PSD kernels by SMO-type methods. Technical report, Department of Computer
Science, National Taiwan University, 2003. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf.

D. Michie, D. J. Spiegelhalter, C. C. Taylor, and J. Campbell, editors. Machine
Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River,
NJ, USA, 1994. ISBN 0-13-106360-X. Data available at
http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/.

W. S. Sarle. Neural Network FAQ, 1997. URL ftp://ftp.sas.com/pub/neural/FAQ.html.
Periodic posting to the Usenet newsgroup comp.ai.neural-nets.

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York,
NY, 1995.


LIBSVM: A Library for Support Vector Machines

Chih-Chung Chang and Chih-Jen Lin
Department of Computer Science
National Taiwan University, Taipei, Taiwan
Email: cjlin@csie.ntu.edu.tw
Initial version: 2001     Last updated: March 4, 2013

Abstract

LIBSVM is a library for Support Vector Machines (SVMs). We have been
actively developing this package since the year 2000. The goal is to help users
to easily apply SVM to their applications. LIBSVM has gained wide popularity
in machine learning and many other areas. In this article, we present all
implementation details of LIBSVM. Issues such as solving SVM optimization
problems, theoretical convergence, multi-class classification, probability
estimates, and parameter selection are discussed in detail.

Keywords: Classification, LIBSVM, optimization, regression, support vector
machines, SVM

Introduction

Support Vector Machines (SVMs) are a popular machine learning method for classification, regression, and other learning tasks. Since the year 2000, we have been developing the package LIBSVM as a library for support vector machines. The Web address
of the package is at http://www.csie.ntu.edu.tw/~cjlin/libsvm. LIBSVM is currently one of the most widely used SVM software. In this article,[1] we present all
implementation details of LIBSVM. However, this article does not intend to teach
the practical use of LIBSVM. For instructions of using LIBSVM, see the README file
included in the package, the LIBSVM FAQ,[2] and the practical guide by Hsu et al.
(2003). An earlier version of this article was published in Chang and Lin (2011).
LIBSVM supports the following learning tasks.
[1] This LIBSVM implementation document was created in 2001 and has been maintained at
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf.
[2] LIBSVM FAQ: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html.

Table 1: Representative works in some domains that have successfully used LIBSVM.

Domain                        Representative works
Computer vision               LIBPMK (Grauman and Darrell, 2005)
Natural language processing   Maltparser (Nivre et al., 2007)
Neuroimaging                  PyMVPA (Hanke et al., 2009)
Bioinformatics                BDVal (Dorff et al., 2010)
1. SVC: support vector classification (two-class and multi-class).
2. SVR: support vector regression.
3. One-class SVM.
A typical use of LIBSVM involves two steps: first, training a data set to obtain a
model and second, using the model to predict information of a testing data set. For
SVC and SVR, LIBSVM can also output probability estimates. Many extensions of
LIBSVM are available at libsvmtools.[3]
The LIBSVM package is structured as follows.
1. Main directory: core C/C++ programs and sample data. In particular, the file
svm.cpp implements training and testing algorithms, where details are described
in this article.
2. The tool sub-directory: this sub-directory includes tools for checking data
format and for selecting SVM parameters.
3. Other sub-directories contain pre-built binary files and interfaces to other languages/software.
LIBSVM has been widely used in many areas. From 2000 to 2010, there were
more than 250,000 downloads of the package. In this period, we answered more than
10,000 emails from users. Table 1 lists representative works in some domains that
have successfully used LIBSVM.
This article is organized as follows. In Section 2, we describe SVM formulations
supported in LIBSVM: C-support vector classification (C-SVC), ν-support vector
classification (ν-SVC), distribution estimation (one-class SVM), ε-support vector
regression (ε-SVR), and ν-support vector regression (ν-SVR). Section 3 then discusses
performance measures, basic usage, and code organization.

[3] LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools.

All SVM formulations
supported in LIBSVM are quadratic minimization problems. We discuss the optimization algorithm in Section 4. Section 5 describes two implementation techniques
to reduce the running time for minimizing SVM quadratic problems: shrinking and
caching. LIBSVM provides some special settings for unbalanced data; details are
in Section 6. Section 7 discusses our implementation for multi-class classification.
Section 8 presents how to transform SVM decision values into probability values. Parameter selection is important for obtaining good SVM models. Section 9 presents a
simple and useful parameter selection tool in LIBSVM. Finally, Section 10 concludes
this work.

SVM Formulations

LIBSVM supports various SVM formulations for classification, regression, and distribution estimation. In this section, we present these formulations and give corresponding references. We also show performance measures used in LIBSVM.

2.1 C-Support Vector Classification

Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and an indicator vector
y ∈ R^l such that y_i ∈ {1, −1}, C-SVC (Boser et al., 1992; Cortes and Vapnik, 1995)
solves the following primal optimization problem.

\[
\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^T\phi(x_i)+b\bigr) \ge 1-\xi_i, \\
& \xi_i \ge 0,\ i=1,\dots,l,
\end{aligned}
\tag{1}
\]

where φ(x_i) maps x_i into a higher-dimensional space and C > 0 is the regularization
parameter. Due to the possible high dimensionality of the vector variable w, usually
we solve the following dual problem.

\[
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\alpha^T Q\alpha - e^T\alpha \\
\text{subject to}\quad & y^T\alpha = 0, \\
& 0 \le \alpha_i \le C,\ i=1,\dots,l,
\end{aligned}
\tag{2}
\]

where e = [1, ..., 1]^T is the vector of all ones, Q is an l by l positive semidefinite
matrix, Q_ij ≡ y_i y_j K(x_i, x_j), and K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel function.

After problem (2) is solved, using the primal-dual relationship, the optimal w
satisfies

\[
w = \sum_{i=1}^{l} y_i\alpha_i\phi(x_i)
\tag{3}
\]

and the decision function is

\[
\operatorname{sgn}\bigl(w^T\phi(x)+b\bigr) = \operatorname{sgn}\Bigl(\sum_{i=1}^{l} y_i\alpha_i K(x_i,x) + b\Bigr).
\]

We store y_i α_i for all i, b, label names,[4] support vectors, and other information such as
kernel parameters in the model for prediction.

[4] In LIBSVM, any integer can be a label name, so we map label names to ±1 by assigning
the first training instance to have y_1 = +1.

2.2 ν-Support Vector Classification

The ν-support vector classification (Scholkopf et al., 2000) introduces a new parameter
ν ∈ (0, 1]. It is proved that ν is an upper bound on the fraction of training errors and
a lower bound of the fraction of support vectors.

Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and a vector y ∈ R^l
such that y_i ∈ {1, −1}, the primal optimization problem is

\[
\begin{aligned}
\min_{w,b,\xi,\rho}\quad & \tfrac{1}{2}w^T w - \nu\rho + \tfrac{1}{l}\sum_{i=1}^{l}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^T\phi(x_i)+b\bigr) \ge \rho-\xi_i, \\
& \xi_i \ge 0,\ i=1,\dots,l,\quad \rho \ge 0.
\end{aligned}
\tag{4}
\]

The dual problem is

\[
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\alpha^T Q\alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le 1/l,\ i=1,\dots,l, \\
& e^T\alpha \ge \nu,\quad y^T\alpha = 0,
\end{aligned}
\tag{5}
\]

where Q_ij = y_i y_j K(x_i, x_j). Chang and Lin (2001) show that problem (5) is feasible
if and only if

\[
\nu \le \frac{2\min\bigl(\#\{y_i=+1\},\ \#\{y_i=-1\}\bigr)}{l} \le 1,
\]

so the usable range of ν is smaller than (0, 1].

The decision function is

\[
\operatorname{sgn}\Bigl(\sum_{i=1}^{l} y_i\alpha_i K(x_i,x) + b\Bigr).
\]

It is shown that e^T α ≥ ν can be replaced by e^T α = ν (Crisp and Burges, 2000;
Chang and Lin, 2001). In LIBSVM, we solve a scaled version of problem (5) because
numerically α_i may be too small due to the constraint α_i ≤ 1/l.

\[
\begin{aligned}
\min_{\bar\alpha}\quad & \tfrac{1}{2}\bar\alpha^T Q\bar\alpha \\
\text{subject to}\quad & 0 \le \bar\alpha_i \le 1,\ i=1,\dots,l, \\
& e^T\bar\alpha = \nu l,\quad y^T\bar\alpha = 0.
\end{aligned}
\tag{6}
\]

If α is optimal for the dual problem (5) and ρ is optimal for the primal problem
(4), Chang and Lin (2001) show that α/ρ is an optimal solution of C-SVM with
C = 1/(ρl). Thus, in LIBSVM, we output (α/ρ, b/ρ) in the model.[5]

[5] More precisely, solving (6) obtains ρ̄ = lρ. Because ᾱ = lα, we have ᾱ/ρ̄ = α/ρ.
Hence, in LIBSVM, we calculate ᾱ/ρ̄.

2.3 Distribution Estimation (One-class SVM)

One-class SVM was proposed by Scholkopf et al. (2001) for estimating the support of
a high-dimensional distribution. Given training vectors x_i ∈ R^n, i = 1, ..., l, without
any class information, the primal problem of one-class SVM is

\[
\begin{aligned}
\min_{w,\xi,\rho}\quad & \tfrac{1}{2}w^T w - \rho + \tfrac{1}{\nu l}\sum_{i=1}^{l}\xi_i \\
\text{subject to}\quad & w^T\phi(x_i) \ge \rho - \xi_i, \\
& \xi_i \ge 0,\ i=1,\dots,l.
\end{aligned}
\]

The dual problem is

\[
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\alpha^T Q\alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le 1/(\nu l),\ i=1,\dots,l, \\
& e^T\alpha = 1,
\end{aligned}
\tag{7}
\]

where Q_ij = K(x_i, x_j) ≡ φ(x_i)^T φ(x_j). The decision function is

\[
\operatorname{sgn}\Bigl(\sum_{i=1}^{l} \alpha_i K(x_i,x) - \rho\Bigr).
\]

Similar to the case of ν-SVC, in LIBSVM, we solve a scaled version of (7).

\[
\begin{aligned}
\min_{\bar\alpha}\quad & \tfrac{1}{2}\bar\alpha^T Q\bar\alpha \\
\text{subject to}\quad & 0 \le \bar\alpha_i \le 1,\ i=1,\dots,l, \\
& e^T\bar\alpha = \nu l.
\end{aligned}
\tag{8}
\]

2.4 ε-Support Vector Regression (ε-SVR)

Consider a set of training points, {(x_1, z_1), ..., (x_l, z_l)}, where x_i ∈ R^n is a feature
vector and z_i ∈ R^1 is the target output. Under given parameters C > 0 and ε > 0,
the standard form of support vector regression (Vapnik, 1998) is

\[
\begin{aligned}
\min_{w,b,\xi,\xi^*}\quad & \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}\xi_i + C\sum_{i=1}^{l}\xi_i^* \\
\text{subject to}\quad & w^T\phi(x_i) + b - z_i \le \epsilon + \xi_i, \\
& z_i - w^T\phi(x_i) - b \le \epsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \ge 0,\ i=1,\dots,l.
\end{aligned}
\]

The dual problem is

\[
\begin{aligned}
\min_{\alpha,\alpha^*}\quad & \tfrac{1}{2}(\alpha-\alpha^*)^T Q(\alpha-\alpha^*)
+ \epsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{l} z_i(\alpha_i-\alpha_i^*) \\
\text{subject to}\quad & e^T(\alpha-\alpha^*) = 0, \\
& 0 \le \alpha_i, \alpha_i^* \le C,\ i=1,\dots,l,
\end{aligned}
\tag{9}
\]

where Q_ij = K(x_i, x_j) ≡ φ(x_i)^T φ(x_j).

After solving problem (9), the approximate function is

\[
\sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b.
\]

In LIBSVM, we output α* − α in the model.

2.5 ν-Support Vector Regression (ν-SVR)

Similar to ν-SVC, for regression, Scholkopf et al. (2000) use a parameter ν ∈ (0, 1]
to control the number of support vectors. The parameter ε in ε-SVR becomes a
parameter here. With (C, ν) as parameters, ν-SVR solves

\[
\begin{aligned}
\min_{w,b,\xi,\xi^*,\epsilon}\quad & \tfrac{1}{2}w^T w
+ C\Bigl(\nu\epsilon + \tfrac{1}{l}\sum_{i=1}^{l}(\xi_i+\xi_i^*)\Bigr) \\
\text{subject to}\quad & (w^T\phi(x_i) + b) - z_i \le \epsilon + \xi_i, \\
& z_i - (w^T\phi(x_i) + b) \le \epsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \ge 0,\ i=1,\dots,l,\quad \epsilon \ge 0.
\end{aligned}
\]

The dual problem is

\[
\begin{aligned}
\min_{\alpha,\alpha^*}\quad & \tfrac{1}{2}(\alpha-\alpha^*)^T Q(\alpha-\alpha^*) + z^T(\alpha-\alpha^*) \\
\text{subject to}\quad & e^T(\alpha-\alpha^*) = 0,\quad e^T(\alpha+\alpha^*) \le C\nu, \\
& 0 \le \alpha_i, \alpha_i^* \le C/l,\ i=1,\dots,l.
\end{aligned}
\tag{10}
\]

The approximate function is

\[
\sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b.
\]

Similar to ν-SVC, Chang and Lin (2002) show that the inequality e^T(α+α*) ≤ Cν
can be replaced by an equality. Moreover, C/l may be too small because users often
choose C to be a small constant like one. Thus, in LIBSVM, we treat the user-specified
regularization parameter as C/l. That is, C̄ = C/l is what users specify and LIBSVM
solves the following problem.

\[
\begin{aligned}
\min_{\alpha,\alpha^*}\quad & \tfrac{1}{2}(\alpha-\alpha^*)^T Q(\alpha-\alpha^*) + z^T(\alpha-\alpha^*) \\
\text{subject to}\quad & e^T(\alpha-\alpha^*) = 0,\quad e^T(\alpha+\alpha^*) = \bar{C}l\nu, \\
& 0 \le \alpha_i, \alpha_i^* \le \bar{C},\ i=1,\dots,l.
\end{aligned}
\]

Chang and Lin (2002) prove that ε-SVR with parameters (C̄, ε) has the same solution
as ν-SVR with parameters (lC̄, ν).

Performance Measures, Basic Usage, and Code Organization

This section describes LIBSVM's evaluation measures, shows some simple examples
of running LIBSVM, and presents the code structure.

3.1 Performance Measures

After solving the optimization problems listed in previous sections, users can apply
decision functions to predict labels (target values) of testing data. Let x_1, ..., x_l be
the testing data and f(x_1), ..., f(x_l) be decision values (target values for regression)
predicted by LIBSVM. If the true labels (true target values) of testing data are known
and denoted as y_1, ..., y_l, we evaluate the prediction results by the following measures.

3.1.1 Classification

\[
\text{Accuracy} = \frac{\#\ \text{correctly predicted data}}{\#\ \text{total testing data}} \times 100\%
\]

3.1.2 Regression

LIBSVM outputs MSE (mean squared error) and r² (squared correlation coefficient):

\[
\text{MSE} = \frac{1}{l}\sum_{i=1}^{l} \bigl(f(x_i)-y_i\bigr)^2,
\]
\[
r^2 = \frac{\Bigl(l\sum_{i=1}^{l} f(x_i)y_i - \sum_{i=1}^{l} f(x_i)\sum_{i=1}^{l} y_i\Bigr)^2}
{\Bigl(l\sum_{i=1}^{l} f(x_i)^2 - \bigl(\sum_{i=1}^{l} f(x_i)\bigr)^2\Bigr)
 \Bigl(l\sum_{i=1}^{l} y_i^2 - \bigl(\sum_{i=1}^{l} y_i\bigr)^2\Bigr)}.
\]
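
A direct translation of these two formulas into C (a sketch, with the predicted
and true values given as plain arrays):

/* mean squared error and squared correlation coefficient of predictions f
 * against true target values y, both of length l */
void regression_measures(const double *f, const double *y, int l,
                         double *mse, double *r2)
{
	double se = 0, sf = 0, sy = 0, sff = 0, syy = 0, sfy = 0;
	for (int i = 0; i < l; i++) {
		se  += (f[i] - y[i]) * (f[i] - y[i]);
		sf  += f[i];
		sy  += y[i];
		sff += f[i] * f[i];
		syy += y[i] * y[i];
		sfy += f[i] * y[i];
	}
	*mse = se / l;
	*r2  = (l * sfy - sf * sy) * (l * sfy - sf * sy)
	     / ((l * sff - sf * sf) * (l * syy - sy * sy));
}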

3.2 A Simple Example of Running LIBSVM

While detailed instructions of using LIBSVM are available in the README file of the
package and the practical guide by Hsu et al. (2003), here we give a simple example.
LIBSVM includes a sample data set heart_scale of 270 instances. We split the
data to a training set heart_scale.tr (170 instances) and a testing set heart_scale.te.

$ python tools/subset.py heart_scale 170 heart_scale.tr heart_scale.te

The command svm-train solves an SVM optimization problem to produce a model.[6]

$ ./svm-train heart_scale.tr
*
optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88

Next, the command svm-predict uses the obtained model to classify the testing set.

$ ./svm-predict heart_scale.te heart_scale.tr.model output
Accuracy = 83% (83/100) (classification)

The file output contains predicted class labels.

[6] The default solver is C-SVC using the RBF kernel (48) with C = 1 and γ = 1/n.

3.3   Code Organization

All LIBSVM's training and testing algorithms are implemented in the file svm.cpp.
The two main sub-routines are svm_train and svm_predict. The training procedure
is more sophisticated, so we give the code organization in Figure 1.
From Figure 1, for classification, svm_train decouples a multi-class problem into
two-class problems (see Section 7) and calls svm_train_one several times. For
regression and one-class SVM, it directly calls svm_train_one. The probability outputs
for classification and regression are also handled in svm_train. Then, according
to the SVM formulation, svm_train_one calls a corresponding sub-routine such as
solve_c_svc for C-SVC and solve_nu_svc for ν-SVC. All solve_* sub-routines call
the solver Solve after preparing suitable input values. The sub-routine Solve minimizes
a general form of SVM optimization problems; see (11) and (22). Details of the
sub-routine Solve are described in Sections 4-6.

4   Solving the Quadratic Problems

This section discusses algorithms used in LIBSVM to solve the dual quadratic problems
listed in Section 2. We split the discussion into two parts. The first part considers
optimization problems with one linear constraint, while the second part checks those
with two linear constraints.

[Figure 1: LIBSVM's code organization for training. All sub-routines are in svm.cpp.
 svm_train (main training sub-routine) -> svm_train_one (two-class SVC, SVR,
 one-class SVM) -> solve_c_svc, solve_nu_svc, ... (various SVM formulations)
 -> Solve (solving problems (11) and (22)).]

4.1   Quadratic Problems with One Linear Constraint: C-SVC, ε-SVR, and One-class SVM

We consider the following general form of C-SVC, ε-SVR, and one-class SVM.

    min_α        f(α)

    subject to   y^T α = Δ,                                                      (11)
                 0 ≤ α_t ≤ C,  t = 1, ..., l,

where

    f(α) ≡ (1/2) α^T Q α + p^T α

and y_t = ±1, t = 1, ..., l. The constraint y^T α = Δ is called a linear constraint. It can
be clearly seen that C-SVC and one-class SVM are already in the form of problem
(11). For ε-SVR, we use the following reformulation of Eq. (9).

    min_{α,α*}   (1/2) [α^T, (α*)^T] [ Q  −Q; −Q  Q ] [α; α*] + [ε e^T + z^T, ε e^T − z^T] [α; α*]

    subject to   y^T [α; α*] = 0,
                 0 ≤ α_t, α_t* ≤ C,  t = 1, ..., l,

where

    y = [1, ..., 1, −1, ..., −1]^T

with l ones followed by l minus ones.
We do not assume that Q is positive semi-definite (PSD) because sometimes non-PSD
kernel matrices are used.

4.1.1   Decomposition Method for Dual Problems

The main difficulty for solving problem (11) is that Q is a dense matrix and may be
too large to be stored. In LIBSVM, we consider a decomposition method to conquer
this difficulty. Some earlier works on decomposition methods for SVM include, for
example, Osuna et al. (1997a); Joachims (1998); Platt (1998); Keerthi et al. (2001);
Hsu and Lin (2002b). Subsequent developments include, for example, Fan et al.
(2005); Palagi and Sciandrone (2005); Glasmachers and Igel (2006). A decomposition
method modifies only a subset of α per iteration, so only some columns of Q are
needed. This subset of variables, denoted as the working set B, leads to a smaller
optimization sub-problem. An extreme case of the decomposition methods is the
Sequential Minimal Optimization (SMO) (Platt, 1998), which restricts B to have only
two elements. Then, at each iteration, we solve a simple two-variable problem without
needing any optimization software. LIBSVM considers an SMO-type decomposition
method proposed in Fan et al. (2005).

Algorithm 1 (An SMO-type decomposition method in Fan et al., 2005)

1. Find α^1 as the initial feasible solution. Set k = 1.

2. If α^k is a stationary point of problem (11), stop. Otherwise, find a two-element
   working set B = {i, j} by WSS 1 (described in Section 4.1.2). Define N ≡
   {1, ..., l}\B. Let α^k_B and α^k_N be sub-vectors of α^k corresponding to B and N,
   respectively.

3. If a_ij ≡ K_ii + K_jj − 2K_ij > 0 (we abbreviate K(x_i, x_j) to K_ij), solve the
   following sub-problem with the variable α_B = [α_i, α_j]^T:

       min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij; Q_ij  Q_jj ] [α_i; α_j] + (p_B + Q_BN α^k_N)^T [α_i; α_j]

       subject to      0 ≤ α_i, α_j ≤ C,                                         (12)
                       y_i α_i + y_j α_j = Δ − y^T_N α^k_N.

   Else, let τ be a small positive constant and solve

       min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij; Q_ij  Q_jj ] [α_i; α_j] + (p_B + Q_BN α^k_N)^T [α_i; α_j]
                       + ((τ − a_ij)/4) ((α_i − α^k_i)² + (α_j − α^k_j)²)        (13)

       subject to      constraints of problem (12).

4. Set α^{k+1}_B to be the optimal solution of sub-problem (12) or (13), and set
   α^{k+1}_N ≡ α^k_N. Set k ← k + 1 and go to Step 2.

Note that B is updated at each iteration, but for simplicity, we use B instead of
B^k. If Q is PSD, then a_ij > 0. Thus sub-problem (13) is used only to handle the
situation where Q is non-PSD.
4.1.2   Stopping Criteria and Working Set Selection

The Karush-Kuhn-Tucker (KKT) optimality condition of problem (11) implies that
a feasible α is a stationary point of (11) if and only if there exists a number b and
two nonnegative vectors λ and μ such that

    ∇f(α) + by = λ − μ,
    λ_i α_i = 0,  μ_i (C − α_i) = 0,  λ_i ≥ 0,  μ_i ≥ 0,  i = 1, ..., l,          (14)

where ∇f(α) ≡ Qα + p is the gradient of f(α). Note that if Q is PSD, from the
primal-dual relationship, α, b, and w generated by Eq. (3) form an optimal solution
of the primal problem. The condition (14) can be rewritten as

    ∇_i f(α) + b y_i   ≥ 0  if α_i < C,
                       ≤ 0  if α_i > 0.                                          (15)

Since y_i = ±1, condition (15) is equivalent to the existence of b such that

    m(α) ≤ b ≤ M(α),

where

    m(α) ≡ max_{i ∈ I_up(α)} −y_i ∇_i f(α)   and   M(α) ≡ min_{i ∈ I_low(α)} −y_i ∇_i f(α),

and

    I_up(α) ≡ {t | α_t < C, y_t = 1  or  α_t > 0, y_t = −1},  and
    I_low(α) ≡ {t | α_t < C, y_t = −1  or  α_t > 0, y_t = 1}.

That is, a feasible α is a stationary point of problem (11) if and only if

    m(α) ≤ M(α).                                                                 (16)

From (16), a suitable stopping condition is

    m(α^k) − M(α^k) ≤ ε,                                                         (17)

where ε is the tolerance.

For the selection of the working set B, we use the following procedure from Section
II of Fan et al. (2005).

WSS 1

1. For all t, s, define

       a_ts ≡ K_tt + K_ss − 2K_ts,   b_ts ≡ −y_t ∇_t f(α^k) + y_s ∇_s f(α^k) > 0,   (18)

   and

       ā_ts ≡ a_ts   if a_ts > 0,
              τ      otherwise.

   Select

       i ∈ arg max_t { −y_t ∇_t f(α^k) | t ∈ I_up(α^k) },
       j ∈ arg min_t { −b_it² / ā_it | t ∈ I_low(α^k), −y_t ∇_t f(α^k) < −y_i ∇_i f(α^k) }.   (19)

2. Return B = {i, j}.

The procedure selects a pair {i, j} approximately minimizing the function value; see
the term −b_it²/ā_it in Eq. (19).
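
To make WSS 1 concrete, the following Python sketch (ours, not LIBSVM's
implementation) selects a working set from the current gradient G = ∇f(α^k), the
labels y, the bound C, and a full kernel matrix K. The names select_working_set and
TAU are illustrative only; LIBSVM itself computes kernel columns on demand rather
than storing the whole matrix K.

    import numpy as np

    TAU = 1e-12  # small positive constant used when a_it <= 0

    def select_working_set(G, y, alpha, C, K):
        """WSS 1 sketch: return (i, j), or None if no violating pair exists."""
        minus_yG = -y * G                           # -y_t * grad_t f(alpha)
        I_up = [t for t in range(len(y))
                if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
        I_low = [t for t in range(len(y))
                 if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
        if not I_up or not I_low:
            return None
        i = max(I_up, key=lambda t: minus_yG[t])    # the arg max in Eq. (19)
        best_j, best_obj = None, 0.0
        for t in I_low:
            if minus_yG[t] < minus_yG[i]:
                b_it = minus_yG[i] - minus_yG[t]    # b_it > 0
                a_it = K[i, i] + K[t, t] - 2 * K[i, t]
                a_bar = a_it if a_it > 0 else TAU
                obj = -(b_it ** 2) / a_bar          # the term minimized in Eq. (19)
                if obj < best_obj:
                    best_obj, best_j = obj, t
        if best_j is None:
            return None                             # m(alpha) <= M(alpha): stationary
        return i, best_j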
4.1.3   Solving the Two-variable Sub-problem

Details of solving the two-variable sub-problem in Eqs. (12) and (13) are deferred to
Section 6, where a more general sub-problem is discussed.
4.1.4   Maintaining the Gradient

From the discussion in Sections 4.1.1 and 4.1.2, the main operations per iteration
are on finding Q_BN α^k_N + p_B for constructing the sub-problem (12), and calculating
∇f(α^k) for the working set selection and the stopping condition. These two operations
can be considered together because

    Q_BN α^k_N + p_B = ∇_B f(α^k) − Q_BB α^k_B                                   (20)

and

    ∇f(α^{k+1}) = ∇f(α^k) + Q_{:,B} (α^{k+1}_B − α^k_B),                         (21)

where |B| << |N| and Q_{:,B} is the sub-matrix of Q including the columns in B. If at
the kth iteration we already have ∇f(α^k), then Eq. (20) can be used to construct the
sub-problem. After the sub-problem is solved, Eq. (21) is employed to obtain the next
∇f(α^{k+1}). Therefore, LIBSVM maintains the gradient throughout the decomposition
method.
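
As a small illustration of Eqs. (20)-(21), the following sketch (ours) uses a dense
numpy matrix instead of LIBSVM's cached kernel columns.

    import numpy as np

    def update_gradient(G, Q, B, alpha_B_new, alpha_B_old):
        """Eq. (21): G <- G + Q[:, B] (alpha_B_new - alpha_B_old)."""
        return G + Q[:, B] @ (alpha_B_new - alpha_B_old)

    def subproblem_linear_term(G, Q, B, alpha_B):
        """Eq. (20): p_B + Q_BN alpha_N = grad_B f(alpha) - Q_BB alpha_B."""
        return G[B] - Q[np.ix_(B, B)] @ alpha_B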
4.1.5   The Calculation of b or ρ

After the solution α of the dual optimization problem is obtained, the variables b or ρ
must be calculated, as they are used in the decision function.
Note that b of C-SVC and ε-SVR plays the same role as −ρ in one-class SVM, so
we define ρ = −b and discuss how to find ρ. If there exists α_i such that 0 < α_i < C,
then from the KKT condition (15), ρ = y_i ∇_i f(α). In LIBSVM, for numerical stability,
we average all these values:

    ρ = ( Σ_{i: 0 < α_i < C} y_i ∇_i f(α) ) / |{i | 0 < α_i < C}|.

For the situation that no α_i satisfies 0 < α_i < C, the KKT condition (16) becomes

    −M(α) = max{ y_i ∇_i f(α) | α_i = 0, y_i = −1  or  α_i = C, y_i = 1 }
          ≤ ρ ≤
    −m(α) = min{ y_i ∇_i f(α) | α_i = 0, y_i = 1  or  α_i = C, y_i = −1 }.

We take the midpoint of the preceding range.
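
The following sketch (ours) computes ρ exactly as described: average over the free
variables when they exist, otherwise the midpoint of the feasible range. Dense numpy
arrays are assumed.

    import numpy as np

    def compute_rho(G, y, alpha, C, tol=1e-12):
        """rho for C-SVC, epsilon-SVR, and one-class SVM (Section 4.1.5 sketch)."""
        yG = y * G                                   # y_i * grad_i f(alpha)
        free = (alpha > tol) & (alpha < C - tol)
        if free.any():
            return yG[free].mean()
        lower = yG[((alpha <= tol) & (y == -1)) | ((alpha >= C - tol) & (y == 1))]
        upper = yG[((alpha <= tol) & (y == 1)) | ((alpha >= C - tol) & (y == -1))]
        return (lower.max() + upper.min()) / 2       # midpoint of [-M(alpha), -m(alpha)]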
4.1.6   Initial Values

Algorithm 1 requires an initial feasible α. For C-SVC and ε-SVR, because the zero
vector is feasible, we select it as the initial α.
For one-class SVM, the scaled form (8) requires that

    0 ≤ α_i ≤ 1   and   Σ_{i=1}^{l} α_i = νl.

We let the first ⌊νl⌋ elements have α_i = 1 and the (⌊νl⌋ + 1)st element have
α_i = νl − ⌊νl⌋.
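
A minimal sketch (ours) of this initialization:

    import math

    def one_class_initial_alpha(l, nu):
        """Initial feasible alpha for the scaled one-class SVM problem (8)."""
        n_full = int(math.floor(nu * l))
        alpha = [1.0] * n_full + [nu * l - n_full] + [0.0] * (l - n_full - 1)
        return alpha[:l]                             # guard against nu * l == l

    # sum(alpha) equals nu * l and every entry is in [0, 1].
    print(sum(one_class_initial_alpha(10, 0.35)))    # 3.5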
4.1.7   Convergence of the Decomposition Method

Fan et al. (2005, Section III) and Chen et al. (2006) discuss the convergence of
Algorithm 1 in detail. For the rate of linear convergence, List and Simon (2009) prove
a result without making the assumption used in Chen et al. (2006).

4.2   Quadratic Problems with Two Linear Constraints: ν-SVC and ν-SVR

From problems (6) and (10), both ν-SVC and ν-SVR can be written as the following
general form.

    min_α        (1/2) α^T Q α + p^T α

    subject to   y^T α = Δ_1,                                                    (22)
                 e^T α = Δ_2,
                 0 ≤ α_t ≤ C,  t = 1, ..., l.

The main difference between problems (11) and (22) is that (22) has two linear
constraints y^T α = Δ_1 and e^T α = Δ_2. The optimization algorithm is very similar
to that for (11), so we describe only the differences.
4.2.1   Stopping Criteria and Working Set Selection

Let f(α) be the objective function of problem (22). By the same derivation as in
Section 4.1.2, the KKT condition of problem (22) implies that there exist b and ρ
such that

    ∇_i f(α) − ρ + b y_i   ≥ 0  if α_i < C,
                           ≤ 0  if α_i > 0.                                      (23)

Define

    r_1 ≡ ρ − b   and   r_2 ≡ ρ + b.                                             (24)

If y_i = 1, (23) becomes

    ∇_i f(α) − r_1   ≥ 0  if α_i < C,
                     ≤ 0  if α_i > 0.                                            (25)

If y_i = −1, (23) becomes

    ∇_i f(α) − r_2   ≥ 0  if α_i < C,
                     ≤ 0  if α_i > 0.                                            (26)

Hence, given a tolerance ε > 0, the stopping condition is

    max( m_p(α) − M_p(α),  m_n(α) − M_n(α) ) < ε,                                (27)

where

    m_p(α) ≡ max_{i ∈ I_up(α), y_i = 1} −y_i ∇_i f(α),    M_p(α) ≡ min_{i ∈ I_low(α), y_i = 1} −y_i ∇_i f(α),   and
    m_n(α) ≡ max_{i ∈ I_up(α), y_i = −1} −y_i ∇_i f(α),   M_n(α) ≡ min_{i ∈ I_low(α), y_i = −1} −y_i ∇_i f(α).

The following working set selection is extended from WSS 1.


WSS 2 (Extension of WSS 1 for -SVM)
1. Find
ip arg mp (k ),
(
jp arg min
t

b2ip t
a
ip t

| yt = 1, t Ilow (k ), yt t f (k ) < yip ip f (k ) .

2. Find
in arg mn (k ),
 2

bi n t
k
k
k
jn arg min
| yt = 1, t Ilow ( ), yt t f ( ) < yin in f ( ) .
t
a
in t
aij .
3. Return {ip , jp } or {in , jn } depending on which one gives smaller b2ij /
4.2.2   The Calculation of b and ρ

We have shown that the KKT condition of problem (22) implies Eqs. (25) and (26)
according to y_i = 1 and −1, respectively. Now we consider the case of y_i = 1. If
there exists α_i such that 0 < α_i < C, then we obtain r_1 = ∇_i f(α). In LIBSVM, for
numerical stability, we average these values:

    r_1 = ( Σ_{i: 0 < α_i < C, y_i = 1} ∇_i f(α) ) / |{i | 0 < α_i < C, y_i = 1}|.

If there is no α_i such that 0 < α_i < C, then r_1 satisfies

    max_{α_i = C, y_i = 1} ∇_i f(α)  ≤  r_1  ≤  min_{α_i = 0, y_i = 1} ∇_i f(α).

We take r_1 to be the midpoint of the previous range.
For the case of y_i = −1, we can calculate r_2 in a similar way.
After r_1 and r_2 are obtained, from Eq. (24),

    ρ = (r_1 + r_2)/2   and   b = −(r_1 − r_2)/2.

Initial Values

For -SVC, the scaled form (6) requires that


X
l
0 i 1,
i = , and
2
i:y =1
i

i =

i:yi =1

l
.
2

We let the first l/2 elements of i with yi = 1 to have the value one.8 The situation
for yi = 1 is similar. The same setting is applied to -SVR.

5   Shrinking and Caching

This section discusses two implementation tricks (shrinking and caching) for the
decomposition method and investigates the computational complexity of Algorithm 1.

5.1   Shrinking

An optimal solution α of the SVM dual problem may contain some bounded elements
(i.e., α_i = 0 or C). These elements may have already been bounded in the middle of
the decomposition iterations. To save training time, the shrinking technique tries
to identify and remove some bounded elements, so a smaller optimization problem is
solved (Joachims, 1998). The following theorem theoretically supports the shrinking
technique by showing that at the final iterations of Algorithm 1 in Section 4.1.2, only
a small set of variables is still changed.

Theorem 5.1 (Theorem IV in Fan et al., 2005) Consider problem (11) and assume
Q is positive semi-definite.

1. The following set is independent of any optimal solution ᾱ:

       I ≡ {i | −y_i ∇_i f(ᾱ) > M(ᾱ)  or  −y_i ∇_i f(ᾱ) < m(ᾱ)}.

   Further, for every i ∈ I, problem (11) has a unique and bounded optimal solution
   at ᾱ_i.

2. Assume Algorithm 1 generates an infinite sequence {α^k}. There exists k̄ such
   that after k ≥ k̄, every α^k_i, i ∈ I, has reached the unique and bounded optimal
   solution. That is, α^k_i remains the same in all subsequent iterations. In addition,
   for all k ≥ k̄:

       i ∉ {t | M(α^k) ≤ −y_t ∇_t f(α^k) ≤ m(α^k)}.

If we denote A as the set containing elements not shrunk at the kth iteration,
then instead of solving problem (11), the decomposition method works on a smaller
problem:

    min_{α_A}    (1/2) α^T_A Q_AA α_A + (p_A + Q_AN α^k_N)^T α_A

    subject to   0 ≤ α_i ≤ C,  i ∈ A,                                            (28)
                 y^T_A α_A = Δ − y^T_N α^k_N,

where N = {1, ..., l}\A is the set of shrunk variables. Note that in LIBSVM, we
always rearrange the elements of α, y, and p to maintain A = {1, ..., |A|}. Details
of the index rearrangement are in Section 5.4.
After solving problem (28), we may find that some elements are wrongly shrunk.
When that happens, the original problem (11) is reoptimized from a starting point
α = [α_A, α_N], where α_A is optimal for problem (28) and α_N corresponds to the
shrunk bounded variables.
In LIBSVM, we start the shrinking procedure in an early stage. The procedure is
as follows.

1. After every min(l, 1000) iterations, we try to shrink some variables. Note that
   throughout the iterative process, we have

       m(α^k) > M(α^k)                                                           (29)

   because the condition (17) is not satisfied yet. Following Theorem 5.1, we
   conjecture that variables in the following set can be shrunk:

       {t | −y_t ∇_t f(α^k) > m(α^k), t ∈ I_low(α^k), α^k_t is bounded} ∪
       {t | −y_t ∇_t f(α^k) < M(α^k), t ∈ I_up(α^k), α^k_t is bounded}
     = {t | −y_t ∇_t f(α^k) > m(α^k), α^k_t = C, y_t = 1  or  α^k_t = 0, y_t = −1} ∪     (30)
       {t | −y_t ∇_t f(α^k) < M(α^k), α^k_t = 0, y_t = 1  or  α^k_t = C, y_t = −1}.

   Thus, the size of the set A is gradually reduced every min(l, 1000) iterations.
   The problem (28), and the way of calculating m(α^k) and M(α^k), are adjusted
   accordingly.

2. The preceding shrinking strategy is sometimes too aggressive. Hence, when the
   decomposition method achieves the following condition for the first time,

       m(α^k) ≤ M(α^k) + 10ε,                                                    (31)

   where ε is the specified stopping tolerance, we reconstruct the gradient (details
   in Section 5.3). Then, the shrinking procedure can be performed based on more
   accurate information.

3. Once the stopping condition

       m(α^k) ≤ M(α^k) + ε                                                       (32)

   of the smaller problem (28) is reached, we must check if the stopping condition
   of the original problem (11) has been satisfied. If not, then we reactivate all
   variables by setting A = {1, ..., l} and start the same shrinking procedure on
   the problem (28).
   Note that in solving the shrunk problem (28), we only maintain its gradient
   Q_AA α_A + Q_AN α_N + p_A (see also Section 4.1.4). Hence, when we reactivate
   all variables to reoptimize the problem (11), we must reconstruct the whole
   gradient ∇f(α). Details are discussed in Section 5.3.

For ν-SVC and ν-SVR, because the stopping condition (27) is different from (17),
the variables being shrunk are different from those in (30). For y_t = 1, we shrink
elements in the following set:

    {t | −y_t ∇_t f(α^k) > m_p(α^k), α_t = C, y_t = 1} ∪
    {t | −y_t ∇_t f(α^k) < M_p(α^k), α_t = 0, y_t = 1}.

For y_t = −1, we consider the following set:

    {t | −y_t ∇_t f(α^k) > m_n(α^k), α_t = 0, y_t = −1} ∪
    {t | −y_t ∇_t f(α^k) < M_n(α^k), α_t = C, y_t = −1}.
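
As an illustration of the set (30), the following sketch (ours) returns the candidate
indices given the current gradient; m_k and M_k denote m(α^k) and M(α^k) from
Section 4.1.2.

    import numpy as np

    def shrink_candidates(G, y, alpha, C, m_k, M_k, tol=1e-12):
        """Indices conjectured shrinkable by (30), given the gradient G = grad f(alpha)."""
        minus_yG = -y * G
        at_C = alpha >= C - tol
        at_0 = alpha <= tol
        above = (minus_yG > m_k) & ((at_C & (y == 1)) | (at_0 & (y == -1)))
        below = (minus_yG < M_k) & ((at_0 & (y == 1)) | (at_C & (y == -1)))
        return np.where(above | below)[0]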

5.2   Caching

Caching is an effective technique for reducing the computational time of the
decomposition method. Because Q may be too large to be stored in the computer memory,
Q_ij elements are calculated as needed. We can use the available memory (called the
kernel cache) to store some recently used Q_ij (Joachims, 1998). Then, some kernel
elements may not need to be recalculated. Theorem 5.1 also supports the use of caching
because in final iterations, only certain columns of the matrix Q are still needed. If the
cache already contains these columns, we can save kernel evaluations in final iterations.
In LIBSVM, we consider a simple least-recently-used caching strategy. We use a
circular list of structures, where each structure is defined as follows.

struct head_t
{
    head_t *prev, *next;    // a circular list
    Qfloat *data;
    int len;                // data[0,len) is cached in this entry
};

A structure stores the first len elements of a kernel column. Using the pointers prev
and next, it is easy to insert or delete a column. The circular list is maintained so that
structures are ordered from the least-recently-used one to the most-recently-used one.
Because of shrinking, columns cached in the computer memory may have different
lengths. Assume the ith column is needed and Q_{1:t,i} have been cached. If t < |A|, we
calculate Q_{t+1:|A|,i} and store Q_{1:|A|,i} in the cache. If t ≥ |A|, the desired Q_{1:|A|,i} are
already in the cache. In this situation, we do not change the cached contents of the
ith column.
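
The same least-recently-used idea can be sketched in a few lines of Python (ours; the
callback kernel_column and the class name are illustrative, and LIBSVM's C++
implementation additionally manages a byte budget rather than a column count).

    from collections import OrderedDict

    class ColumnCache:
        """LRU cache for kernel column prefixes, mimicking the head_t list above."""
        def __init__(self, max_columns, kernel_column):
            self.max_columns = max_columns
            self.kernel_column = kernel_column       # callable: (i, length) -> list
            self.data = OrderedDict()                # column index -> cached prefix

        def get(self, i, length):
            cached = self.data.get(i)
            if cached is None or len(cached) < length:
                # a real implementation would compute only the missing tail
                cached = self.kernel_column(i, length)
                self.data[i] = cached
            self.data.move_to_end(i)                 # mark as most recently used
            if len(self.data) > self.max_columns:
                self.data.popitem(last=False)        # evict the least recently used
            return cached[:length]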

5.3   Reconstructing the Gradient

If condition (31) or (32) is satisfied, LIBSVM reconstructs the gradient. Because
∇_i f(α), i = 1, ..., |A|, have been maintained in solving the smaller problem (28),
what we need is to calculate ∇_i f(α), i = |A| + 1, ..., l. To decrease the cost of this
reconstruction, throughout the iterations we maintain a vector Ḡ ∈ R^l:

    Ḡ_i = C Σ_{j: α_j = C} Q_ij,   i = 1, ..., l.                                (33)

Then, for i ∉ A,

    ∇_i f(α) = Σ_{j=1}^{l} Q_ij α_j + p_i = Ḡ_i + p_i + Σ_{j ∈ A, 0 < α_j < C} Q_ij α_j.   (34)

Note that we use the fact that if j ∉ A, then α_j = 0 or C.

The calculation of ∇f(α) via Eq. (34) involves a two-level loop over i and j.
Using i or j first may result in a very different number of Q_ij evaluations. We discuss
the differences next.

1. i first: for |A| + 1 ≤ i ≤ l, calculate Q_{i,1:|A|}. Although from Eq. (34), only
   {Q_ij | 0 < α_j < C, j ∈ A} are needed, our implementation obtains all Q_{i,1:|A|}
   (i.e., {Q_ij | j ∈ A}). Hence, this case needs at most

       (l − |A|) × |A|                                                           (35)

   kernel evaluations. Note that LIBSVM uses a column-based caching implementation.
   Due to the symmetry of Q, Q_{i,1:|A|} is part of Q's ith column and may have been
   cached. Thus, Eq. (35) is only an upper bound.

2. j first: let

       F ≡ {j | 1 ≤ j ≤ |A| and 0 < α_j < C}.

   For each j ∈ F, calculate Q_{1:l,j}. Though only Q_{|A|+1:l,j} is needed in calculating
   ∇_i f(α), i = |A| + 1, ..., l, we must get the whole column because of our cache
   implementation (we always store the first |A| elements of a column). Thus, this
   strategy needs no more than

       l × |F|                                                                   (36)

   kernel evaluations. This is an upper bound because certain kernel columns (e.g.,
   Q_{1:|A|,j}, j ∈ A) may already be in the cache and do not need to be recalculated.

We may choose a method by comparing (35) and (36). However, the decision depends
on whether Q's elements have been cached. If the cache is large enough, then elements
of Q's first |A| columns tend to be in the cache because they have been used recently.
In contrast, Q_{i,1:|A|}, i ∉ A, needed by method 1 may be less likely in the cache
because columns not in A are not used to solve problem (28). In such a situation,
method 1 may require almost (l − |A|) × |A| kernel evaluations, while method 2 needs
much fewer evaluations than l × |F|.
Because method 2 takes advantage of the cache implementation, we slightly lower
the estimate in Eq. (36) and use the following rule to decide the method of calculating
Eq. (34):

    If (l/2) × |F| > (l − |A|) × |A|
        use method 1
    Else
        use method 2

This rule may not give the optimal choice because we do not take the cache contents
into account. However, we argue that in the worst scenario, the method selected by
the preceding rule is only slightly slower than the other method. This result can be
proved by making the following assumptions.

- A LIBSVM training procedure involves only two gradient reconstructions. The
  first is performed when the 10ε tolerance is achieved; see Eq. (31). The second
  is at the end of the training procedure.

- Our rule assigns the same method to perform the two gradient reconstructions.
  Moreover, these two reconstructions cost a similar amount of time.

We refer to the total training time of method x as the whole LIBSVM training time
(where method x is used for reconstructing gradients), and the reconstruction time of
method x as the time of one single gradient reconstruction via method x. We then
consider two situations.

1. Method 1 is chosen, but method 2 is better.
   We have

       Total time of method 1
       ≤ (Total time of method 2) + 2 × (Reconstruction time of method 1)
       ≤ 2 × (Total time of method 2).                                           (37)

   We explain the second inequality in detail. Method 2 for gradient reconstruction
   requires l × |F| kernel elements; however, the number of kernel evaluations may
   be smaller because some elements have been cached. Therefore,

       l × |F| ≤ Total time of method 2.                                         (38)

   Because method 1 is chosen and Eq. (35) is an upper bound,

       2 × (Reconstruction time of method 1) ≤ 2 × (l − |A|) × |A| < l × |F|.    (39)

   Combining inequalities (38) and (39) leads to (37).


2. Method 2 is chosen, but method 1 is better.
   We consider the worst situation where Q's first |A| columns are not in the cache.
   As |A| + 1, ..., l are indices of shrunk variables, most likely the remaining l − |A|
   columns of Q are not in the cache either, and (l − |A|) × |A| kernel evaluations
   are needed for method 1. Because l × |F| ≤ 2 × (l − |A|) × |A|,

       (Reconstruction time of method 2) ≤ 2 × (Reconstruction time of method 1).

   Therefore,

       Total time of method 2
       ≤ (Total time of method 1) + 2 × (Reconstruction time of method 1)
       ≤ 2 × (Total time of method 1).

Table 2 compares the number of kernel evaluations in reconstructing the gradient.
We consider the problems a7a and ijcnn1 (available at http://www.csie.ntu.edu.tw/
~cjlin/libsvmtools/datasets). Clearly, the proposed rule selects the better method
for both problems. We implemented this technique after version 2.88 of LIBSVM.

Table 2: A comparison between the two gradient reconstruction methods. The
decomposition method reconstructs the gradient twice after satisfying conditions (31)
and (32). We show in each row the number of kernel evaluations of a reconstruction.
We check two cache sizes to reflect the situations with/without enough cache. The last
two rows give the total training time (gradient reconstructions and other operations)
in seconds. We use the RBF kernel K(x_i, x_j) = exp(−γ ||x_i − x_j||²).

(a) a7a: C = 1, γ = 4, ε = 0.001, l = 16,100.

                                  Cache = 1,000 MB          Cache = 10 MB
Reconstruction   |F|      |A|     Method 1    Method 2      Method 1      Method 2
First            10,597   12,476  0           21,470,526    45,213,024    170,574,272
Second           10,630   12,476  0           0             45,213,024    171,118,048
Training time                     102s        108s          341s          422s
                                  No shrinking: 111s        No shrinking: 381s

(b) ijcnn1: C = 16, γ = 4, ε = 0.5, l = 49,990.

                                  Cache = 1,000 MB            Cache = 10 MB
Reconstruction   |F|     |A|      Method 1      Method 2      Method 1      Method 2
First            1,767   43,678   274,297,840   5,403,072     275,695,536   88,332,330
Second           2,308   6,023    263,843,538   28,274,195    264,813,241   115,346,805
Training time                     189s          46s           203s          116s
                                  No shrinking: 42s           No shrinking: 87s

5.4   Index Rearrangement

In solving the smaller problem (28), we need only indices in A (e.g., α_i, y_i, and x_i,
where i ∈ A). Thus, a naive implementation does not access array contents in a
continuous manner. Alternatively, we can maintain A = {1, ..., |A|} by rearranging
the array contents. This approach allows continuous access of array contents, but
incurs costs for the rearrangement. We decide to rearrange elements in arrays because,
throughout the discussion in Sections 5.2-5.3, we assume that a cached ith kernel
column contains elements from the first to the tth (i.e., Q_{1:t,i}), where t ≤ l. If we
do not rearrange indices so that A = {1, ..., |A|}, then the whole column Q_{1:l,i} must
be cached because l may be an element in A.
We rearrange indices by sequentially swapping pairs of indices. If t_1 is going to be
shrunk, we find an index t_2 that should stay and then swap them. Swapping two
elements in a vector α or y is easy, but swapping kernel elements in the cache is more
expensive. That is, we must swap (Q_{t_1,i}, Q_{t_2,i}) for every cached kernel column i. To
make the number of swapping operations small, we use the following implementation.
Starting from the first and the last indices, we identify the smallest t_1 that should
leave and the largest t_2 that should stay. Then, (t_1, t_2) are swapped and we continue
the same procedure to identify the next pair.
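
A small sketch (ours) of this two-pointer swapping scheme; should_stay marks the
indices that remain in A.

    def rearrangement_swaps(should_stay):
        """Return the pairs (t1, t2) to swap so that all staying indices come first."""
        swaps = []
        t1, t2 = 0, len(should_stay) - 1
        while True:
            while t1 < t2 and should_stay[t1]:       # smallest index that must leave
                t1 += 1
            while t1 < t2 and not should_stay[t2]:   # largest index that must stay
                t2 -= 1
            if t1 >= t2:
                break
            swaps.append((t1, t2))                   # also swap alpha, y, p and cached rows
            should_stay[t1], should_stay[t2] = True, False
            t1 += 1
            t2 -= 1
        return swaps

    # Example: indices 1 and 3 are shrunk; only one swap is needed.
    print(rearrangement_swaps([True, False, True, False, True]))   # [(1, 4)]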

5.5   A Summary of the Shrinking Procedure

We summarize the shrinking procedure in Algorithm 2.

Algorithm 2 (Extending Algorithm 1 to include the shrinking procedure)

Initialization
1. Let α^1 be an initial feasible solution.
2. Calculate the initial ∇f(α^1) and Ḡ in Eq. (33).
3. Initialize a counter so that shrinking is conducted every min(l, 1000) iterations.
4. Let A = {1, ..., l}.

For k = 1, 2, ...
1. Decrease the shrinking counter.
2. If the counter is zero, then shrinking is conducted.
   (a) If condition (31) is satisfied for the first time, reconstruct the gradient.
   (b) Shrink A by removing elements in the set (30). The implementation
       described in Section 5.4 ensures that A = {1, ..., |A|}.
   (c) Reset the shrinking counter.
3. If α^k_A satisfies the stopping condition (32)
   (a) Reconstruct the gradient.
   (b) If α^k satisfies the stopping condition (32)
           Return α^k
       Else
           Reset A = {1, ..., l} and set the counter to one (that is, shrinking is
           performed at the next iteration).
4. Find a two-element working set B = {i, j} by WSS 1.
5. Obtain Q_{1:|A|,i} and Q_{1:|A|,j} from the cache or by calculation.
6. Solve sub-problem (12) or (13) by the procedures in Section 6. Update α^k to α^{k+1}.
7. Update the gradient by Eq. (21) and update the vector Ḡ.

5.6   Is Shrinking Always Better?

We found that if the number of iterations is large, then shrinking can shorten the
training time. However, if we loosely solve the optimization problem (e.g., by using
a large stopping tolerance ε), the code without using shrinking may be much faster.
In this situation, because of the small number of iterations, the time spent on all
decomposition iterations can be even less than one single gradient reconstruction.
Table 2 compares the total training time with/without shrinking. For a7a, we use
the default ε = 0.001. Under the parameters C = 1 and γ = 4, the number of
iterations is more than 30,000. Then shrinking is useful. However, for ijcnn1, we
deliberately use a loose tolerance ε = 0.5, so the number of iterations is only around
4,000. Because our shrinking strategy is quite aggressive, before the first gradient
reconstruction, only Q_{A,A} is in the cache. Then, we need many kernel evaluations for
reconstructing the gradient, so the implementation with shrinking is slower.
If enough iterations have been run, most elements in A correspond to free α_i
(0 < α_i < C); i.e., A ≈ F. In contrast, if the number of iterations is small (e.g.,
ijcnn1 in Table 2), many bounded elements have not been shrunk and |F| << |A|.
Therefore, we can check the relation between |F| and |A| to conjecture whether
shrinking is useful. In LIBSVM, if shrinking is enabled and 2 × |F| < |A| when
reconstructing the gradient, we issue a warning message to indicate that the code may
be faster without shrinking.

5.7   Computational Complexity

While Section 4.1.7 has discussed the asymptotic convergence and the local
convergence rate of the decomposition method, in this section, we investigate the
computational complexity.
From Section 4, two places consume most operations at each iteration: finding the
working set B by WSS 1 and calculating Q_{:,B} (α^{k+1}_B − α^k_B) in Eq. (21). (Because
|B| = 2, once the sub-problem has been constructed, solving it takes only a constant
number of operations; see details in Section 6.) Each place requires O(l) operations.
However, if Q_{:,B} is not available in the cache and each kernel evaluation costs O(n),
the cost becomes O(ln) for calculating a column of kernel elements. Therefore, the
complexity of Algorithm 1 is

1. #Iterations × O(l) if most columns of Q are cached throughout the iterations.

2. #Iterations × O(nl) if columns of Q are not cached and each kernel evaluation
   costs O(n).

Several works have studied the number of iterations of decomposition methods; see,
for example, List and Simon (2007). However, the algorithms studied in these works
are slightly different from LIBSVM's, so there is no theoretical result yet on LIBSVM's
number of iterations. Empirically, it is known that the number of iterations may grow
more than linearly with the number of training data. Thus, LIBSVM may take
considerable training time for huge data sets. Many techniques, for example, Fine
and Scheinberg (2001); Lee and Mangasarian (2001); Keerthi et al. (2006); Segata and
Blanzieri (2010), have been developed to obtain an approximate model, but these are
beyond the scope of our discussion. In LIBSVM, we provide a simple sub-sampling
tool, so users can quickly train on a small subset.

6   Unbalanced Data and Solving the Two-variable Sub-problem

For some classification problems, numbers of data in different classes are unbalanced.
Some researchers (e.g., Osuna et al., 1997b, Section 2.5; Vapnik, 1998, Chapter 10.9)

have proposed using different penalty parameters in the SVM formulation. For
example, the C-SVM problem becomes

    min_{w,b,ξ}   (1/2) w^T w + C^+ Σ_{y_i = 1} ξ_i + C^- Σ_{y_i = -1} ξ_i

    subject to    y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,                                (40)
                  ξ_i ≥ 0,  i = 1, ..., l,

where C^+ and C^- are regularization parameters for the positive and negative classes,
respectively. LIBSVM supports this setting, so users can choose weights for classes.
The dual problem of problem (40) is

    min_α        (1/2) α^T Q α − e^T α

    subject to   0 ≤ α_i ≤ C^+,  if y_i = 1,
                 0 ≤ α_i ≤ C^-,  if y_i = −1,
                 y^T α = 0.

A more general setting is to assign each instance x_i a regularization parameter C_i.
If C is replaced by C_i, i = 1, ..., l, in problem (11), most results discussed in earlier
sections can be extended without problems. (This feature of using C_i, ∀i, is not
included in LIBSVM, but is available as an extension at libsvmtools.) The major
change of Algorithm 1 is on solving the sub-problem (12), which now becomes

    min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij; Q_ji  Q_jj ] [α_i; α_j]
                    + (Q_{i,N} α_N + p_i) α_i + (Q_{j,N} α_N + p_j) α_j

    subject to      y_i α_i + y_j α_j = Δ − y^T_N α^k_N,                         (41)
                    0 ≤ α_i ≤ C_i,  0 ≤ α_j ≤ C_j.

Let α_i = α^k_i + d_i and α_j = α^k_j + d_j. The sub-problem (41) can be written as

    min_{d_i,d_j}   (1/2) [d_i, d_j] [ Q_ii  Q_ij; Q_ij  Q_jj ] [d_i; d_j]
                    + [∇_i f(α^k), ∇_j f(α^k)] [d_i; d_j]

    subject to      y_i d_i + y_j d_j = 0,
                    −α^k_i ≤ d_i ≤ C_i − α^k_i,   −α^k_j ≤ d_j ≤ C_j − α^k_j.

Define a_ij and b_ij as in Eq. (18), and d̂_i ≡ y_i d_i, d̂_j ≡ y_j d_j. Using d̂_i = −d̂_j, the
objective function can be written as

    (1/2) ā_ij d̂_j² + b_ij d̂_j.

Minimizing the previous quadratic function leads to

    α_i^new = α^k_i + y_i b_ij / ā_ij,
    α_j^new = α^k_j − y_j b_ij / ā_ij.                                           (42)

These two values may need to be modified because of the bound constraints. We first
consider the case of y_i ≠ y_j and re-write Eq. (42) as

    α_i^new = α^k_i + (−∇_i f(α^k) − ∇_j f(α^k)) / ā_ij,
    α_j^new = α^k_j + (−∇_i f(α^k) − ∇_j f(α^k)) / ā_ij.

A box is generated according to the bound constraints, and an infeasible
(α_i^new, α_j^new) must be in one of the four regions outside the following box.

[Figure: the (α_i, α_j) plane with the box [0, C_i] × [0, C_j] and the lines
 α_i − α_j = C_i − C_j and α_i − α_j = 0. Region I lies to the right of the box
 (α_i > C_i), region II above it (α_j > C_j), region III below it (α_j < 0), and
 region IV to its left (α_i < 0); the corner areas are marked NA.]

Note that (α_i^new, α_j^new) does not appear in the NA regions because (α^k_i, α^k_j) is in
the box and

    α_i^new − α_j^new = α^k_i − α^k_j.

If (α_i^new, α_j^new) is in region I, we set

    α^{k+1}_i = C_i   and   α^{k+1}_j = C_i − (α^k_i − α^k_j).

Of course, we must identify the region in which (α_i^new, α_j^new) resides. For region I,
we have

    α^k_i − α^k_j > C_i − C_j   and   α_i^new ≥ C_i.

Other cases are similar. We have the following pseudo code to identify which region
(α_i^new, α_j^new) is in and to modify (α_i^new, α_j^new) so that the bound constraints
are satisfied.
if(y[i]!=y[j])
{
    double quad_coef = Q_i[i]+Q_j[j]+2*Q_i[j];
    if (quad_coef <= 0)
        quad_coef = TAU;
    double delta = (-G[i]-G[j])/quad_coef;
    double diff = alpha[i] - alpha[j];
    alpha[i] += delta;
    alpha[j] += delta;
    if(diff > 0)
    {
        if(alpha[j] < 0) // in region III
        {
            alpha[j] = 0;
            alpha[i] = diff;
        }
    }
    else
    {
        if(alpha[i] < 0) // in region IV
        {
            alpha[i] = 0;
            alpha[j] = -diff;
        }
    }
    if(diff > C_i - C_j)
    {
        if(alpha[i] > C_i) // in region I
        {
            alpha[i] = C_i;
            alpha[j] = C_i - diff;
        }
    }
    else
    {
        if(alpha[j] > C_j) // in region II
        {
            alpha[j] = C_j;
            alpha[i] = C_j + diff;
        }
    }
}

If y_i = y_j, the derivation is the same.

7   Multi-class Classification

LIBSVM implements the "one-against-one" approach (Knerr et al., 1990) for multi-class
classification. Some early works applying this strategy to SVM include, for example,
Kressel (1998). If k is the number of classes, then k(k − 1)/2 classifiers are constructed
and each one trains data from two classes. For training data from the ith and the jth
classes, we solve the following two-class classification problem.

    min_{w^{ij}, b^{ij}, ξ^{ij}}   (1/2) (w^{ij})^T w^{ij} + C Σ_t ξ^{ij}_t

    subject to   (w^{ij})^T φ(x_t) + b^{ij} ≥ 1 − ξ^{ij}_t,    if x_t is in the ith class,
                 (w^{ij})^T φ(x_t) + b^{ij} ≤ −1 + ξ^{ij}_t,   if x_t is in the jth class,
                 ξ^{ij}_t ≥ 0.

In classification we use a voting strategy: each binary classifier casts one vote for every
data point x, and in the end a point is designated to be in the class with the maximum
number of votes.
In case two classes have identical votes, though it may not be a good strategy, we
simply choose the class appearing first in the array storing the class names.
Many other methods are available for multi-class SVM classification. Hsu and
Lin (2002a) give a detailed comparison and conclude that "one-against-one" is a
competitive approach.
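
The voting step can be sketched as follows (ours; binary_predict stands for an
already-trained pairwise classifier and is not a LIBSVM function).

    from collections import defaultdict
    from itertools import combinations

    def one_against_one_predict(x, classes, binary_predict):
        """Predict the class of x by pairwise voting over k(k-1)/2 classifiers."""
        votes = defaultdict(int)
        for i, j in combinations(classes, 2):
            votes[binary_predict(i, j, x)] += 1      # winner of the (i, j) classifier
        # ties go to the class appearing first in `classes`, as described above
        return max(classes, key=lambda c: votes[c])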

8   Probability Estimates

SVM predicts only class labels (target values for regression) without probability
information. This section discusses the LIBSVM implementation for extending SVM
to give probability estimates. More details are in Wu et al. (2004) for classification
and in Lin and Weng (2004) for regression.
Given k classes of data, for any x, the goal is to estimate

    p_i = P(y = i | x),   i = 1, ..., k.

Following the setting of the one-against-one (i.e., pairwise) approach for multi-class
classification, we first estimate pairwise class probabilities

    r_ij ≈ P(y = i | y = i or j, x)

using an improved implementation (Lin et al., 2007) of Platt (2000). If f̂ is the
decision value at x, then we assume

    r_ij ≈ 1 / (1 + e^{A f̂ + B}),                                                (43)

where A and B are estimated by minimizing the negative log likelihood of the training
data (using their labels and decision values). It has been observed that decision values
from training may overfit the model (43), so we conduct five-fold cross-validation to
obtain decision values before minimizing the negative log likelihood. Note that if
some classes contain five or even fewer data points, the resulting model may not be
good. You can duplicate the data set so that each fold of cross-validation gets more
data.
After collecting all r_ij values, Wu et al. (2004) propose several approaches to
obtain p_i, ∀i. In LIBSVM, we consider their second approach and solve the following
optimization problem.

    min_p        (1/2) Σ_{i=1}^{k} Σ_{j: j≠i} (r_ji p_i − r_ij p_j)²

    subject to   p_i ≥ 0, ∀i,   Σ_{i=1}^{k} p_i = 1.                             (44)

The objective function in problem (44) comes from the equality

    P(y = j | y = i or j, x) P(y = i | x) = P(y = i | y = i or j, x) P(y = j | x)

and can be reformulated as

    min_p   (1/2) p^T Q p,

where

    Q_ij = Σ_{s: s≠i} r_si²   if i = j,
           −r_ji r_ij         if i ≠ j.

Wu et al. (2004) prove that the non-negativity constraints p_i ≥ 0, ∀i, in problem (44)
are redundant. After removing these constraints, the optimality condition implies
that there exists a scalar b (the Lagrange multiplier of the equality constraint
Σ_{i=1}^{k} p_i = 1) such that

    [ Q    e ] [ p ]   [ 0 ]
    [ e^T  0 ] [ b ] = [ 1 ],                                                    (45)

where e is the k × 1 vector of all ones and 0 is the k × 1 vector of all zeros.
Instead of solving the linear system (45) by a direct method such as Gaussian
elimination, Wu et al. (2004) derive a simple iterative method. Because (45) gives
Qp = −be and hence

    p^T Q p = −b p^T e = −b,

the optimal solution p satisfies

    (Qp)_t − p^T Q p = Q_tt p_t + Σ_{j: j≠t} Q_tj p_j − p^T Q p = 0,   ∀t.       (46)

Using Eq. (46), we consider Algorithm 3.

Algorithm 3

1. Start with an initial p satisfying p_i ≥ 0, ∀i, and Σ_{i=1}^{k} p_i = 1.

2. Repeat (t = 1, ..., k, 1, ...)

       p_t ← (1/Q_tt) [ −Σ_{j: j≠t} Q_tj p_j + p^T Q p ]                         (47)

       normalize p

   until Eq. (45) is satisfied.

Eq. (47) can be simplified to

    p_t ← p_t + (1/Q_tt) [ −(Qp)_t + p^T Q p ].

Algorithm 3 is guaranteed to converge globally to the unique optimum of problem (44).
Using some tricks, we do not need to recalculate p^T Q p at each iteration. More
implementation details are in Appendix C of Wu et al. (2004). We consider a relative
stopping condition for Algorithm 3:

    ||Qp − (p^T Q p) e||_∞ = max_t |(Qp)_t − p^T Q p| < 0.005/k.

When k (the number of classes) is large, some elements of p may be very close to
zero. Thus, we use a stricter stopping condition by decreasing the tolerance by a
factor of k.
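
The following Python sketch implements Algorithm 3 as stated (ours, not LIBSVM's
C++ code; the iteration cap and the dense recomputation of Qp are simplifications).

    import numpy as np

    def pairwise_coupling(r, max_sweeps=100):
        """Estimate p from pairwise probabilities r[i, j] ~ P(y=i | y=i or j, x)."""
        k = r.shape[0]
        Q = np.zeros((k, k))
        for i in range(k):
            for j in range(k):
                if i == j:
                    Q[i, i] = sum(r[s, i] ** 2 for s in range(k) if s != i)
                else:
                    Q[i, j] = -r[j, i] * r[i, j]
        p = np.full(k, 1.0 / k)                      # feasible starting point
        for _ in range(max_sweeps):
            Qp = Q @ p
            pQp = p @ Qp
            if np.max(np.abs(Qp - pQp)) < 0.005 / k: # relative stopping condition
                break
            for t in range(k):
                p[t] += (-Qp[t] + pQp) / Q[t, t]     # simplified update of Eq. (47)
                p /= p.sum()                         # normalize p
                Qp = Q @ p                           # refresh for the next component
                pQp = p @ Qp
        return p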
Next, we discuss SVR probability inference. For a given set of training data
D = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, ..., l}, we assume that the data are collected
from the model

    y_i = f(x_i) + δ_i,

where f(x) is the underlying function and the δ_i's are independent and identically
distributed random noises. Given test data x, the distribution of y given x and D,
P(y | x, D), allows us to draw probabilistic inferences about y; for example, we can
estimate the probability that y is in an interval such as [f̂(x) − Δ, f̂(x) + Δ] for some
Δ > 0. Denoting f̂ as the estimated function based on D using SVR, ζ = ζ(x) ≡
y − f̂(x) is the out-of-sample residual (or prediction error). We propose modeling the
distribution of ζ based on the cross-validation residuals {ζ_i}_{i=1}^{l}. The ζ_i's are
generated by first conducting a five-fold cross-validation to get f̂_j, j = 1, ..., 5, and
then setting ζ_i ≡ y_i − f̂_j(x_i) for (x_i, y_i) in the jth fold. It is conceptually clear that
the distribution of the ζ_i's may resemble that of the prediction error ζ.

Figure 2 illustrates ζ_i's from a data set. Basically, a discretized distribution like a
histogram can be used to model the data; however, it is complex because all ζ_i's must
be retained. On the contrary, distributions like Gaussian and Laplace, commonly
used as noise models, require only location and scale parameters. In Figure 2, we
plot the fitted curves using these two families and the histogram of ζ_i's. The figure
shows that the distribution of ζ_i's seems symmetric about zero and that both Gaussian
and Laplace reasonably capture the shape of ζ_i's. Thus, we propose to model ζ_i by
zero-mean Gaussian and Laplace, or equivalently, to model the conditional distribution
of y given f̂(x) by Gaussian and Laplace with mean f̂(x).

[Figure 2: Histogram of ζ_i's and the models via Laplace and Gaussian distributions.
The x-axis is ζ_i using five-fold cross-validation and the y-axis is the normalized
number of data in each bin of width 1.]

Lin and Weng (2004) discuss a method to judge whether a Laplace or a Gaussian
distribution should be used. Moreover, they experimentally show that in all cases
they have tried, Laplace is better. Thus, in LIBSVM, we consider the zero-mean
Laplace with the density function

    p(z) = (1/(2σ)) exp(−|z|/σ).

Assuming that the ζ_i's are independent, we can estimate the scale parameter σ by
maximizing the likelihood. For Laplace, the maximum likelihood estimate is

    σ̂ = Σ_{i=1}^{l} |ζ_i| / l.

Lin and Weng (2004) point out that some very extreme ζ_i's may cause inaccurate
estimation of σ. Thus, they propose estimating the scale parameter by discarding
ζ_i's which exceed ±5 times the standard deviation of the Laplace distribution. For
any new data x, we consider that

    y = f̂(x) + z,

where z is a random variable following the Laplace distribution with parameter σ.
In theory, the distribution of ζ may depend on the input x, but here we assume that
it is free of x. Such an assumption works well in practice and leads to a simple
model.
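
A sketch of this estimation from cross-validation residuals (ours; it follows the
description above, while LIBSVM's internal implementation may differ in details):

    import numpy as np

    def laplace_scale(residuals):
        """Estimate the Laplace scale sigma, discarding residuals beyond 5 standard
        deviations before taking the maximum likelihood estimate."""
        z = np.asarray(residuals, dtype=float)
        sigma = np.mean(np.abs(z))                   # ML estimate: sum |zeta_i| / l
        std = np.sqrt(2.0) * sigma                   # std of a Laplace(sigma) variable
        kept = z[np.abs(z) <= 5.0 * std]
        return np.mean(np.abs(kept))

    # Example: estimate the Laplace scale from cross-validation residuals.
    print(laplace_scale([0.1, -0.2, 0.05, -0.1, 0.15]))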

9   Parameter Selection

To train SVM problems, users must specify some parameters. LIBSVM provides a
simple tool to check a grid of parameters. For each parameter setting, LIBSVM obtains
the cross-validation (CV) accuracy. Finally, the parameters with the highest CV
accuracy are returned. The parameter selection tool assumes that the RBF (Gaussian)
kernel is used, although extensions to other kernels and to SVR can easily be made.
The RBF kernel takes the form

    K(x_i, x_j) = e^{−γ ||x_i − x_j||²},                                         (48)

so (C, γ) are the parameters to be decided. Users can provide a possible interval of C
(or γ) with the grid spacing. Then, all grid points of (C, γ) are tried to find the one
giving the highest CV accuracy. Users then use the best parameters to train the
whole training set and generate the final model.
We do not consider more advanced parameter selection methods because, for only
two parameters (C and γ), the number of grid points is not too large. Further, because
SVM problems under different (C, γ) parameters are independent, LIBSVM provides
a simple tool so that jobs can be run in a parallel (multi-core, shared memory, or
distributed) environment.
For multi-class classification, under a given (C, γ), LIBSVM uses the one-against-one
method to obtain the CV accuracy. Hence, the parameter selection tool suggests
the same (C, γ) for all k(k − 1)/2 decision functions. Chen et al. (2005, Section 8)
discuss issues of using the same or different parameters for the k(k − 1)/2 two-class
problems.
LIBSVM outputs the contour plot of cross-validation accuracy. An example is in
Figure 3.

[Figure 3: Contour plot of running the parameter selection tool in LIBSVM. The data
set heart_scale (included in the package) is used. The x-axis is log2(C) and the y-axis
is log2(γ).]
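
A sketch of the grid search (ours; cross_validation_accuracy stands for any routine
returning the CV accuracy for a given (C, γ), and the log2 ranges below are merely
illustrative defaults). LIBSVM ships tools/grid.py, which additionally runs the grid
points in parallel and draws the contour plot.

    import itertools

    def grid_search(cross_validation_accuracy,
                    log2C_range=range(-5, 16, 2),
                    log2g_range=range(-15, 4, 2)):
        """Return the (C, gamma) pair with the highest cross-validation accuracy."""
        best = (None, None, -1.0)
        for log2C, log2g in itertools.product(log2C_range, log2g_range):
            C, gamma = 2.0 ** log2C, 2.0 ** log2g
            acc = cross_validation_accuracy(C, gamma)
            if acc > best[2]:
                best = (C, gamma, acc)
        return best                              # (best C, best gamma, best accuracy)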

10   Conclusions

When we released the first version of LIBSVM in 2000, only two-class C-SVC was
supported. Gradually, we added other SVM variants and supported functions such
as multi-class classification and probability estimates. LIBSVM has since become a
complete SVM package. We add a function only if it is needed by enough users. By
keeping the system simple, we strive to ensure good system reliability.
In summary, this article gives implementation details of LIBSVM. We are still
actively updating and maintaining this package. We hope the community will benefit
more from our continuing development of LIBSVM.


Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the
grants NSC 89-2213-E-002-013 and NSC 89-2213-E-002-106. The authors thank their
group members and users for many helpful comments. A list of acknowledgments is
at http://www.csie.ntu.edu.tw/~cjlin/libsvm/acknowledgements.

References

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin
classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning
Theory, pages 144-152. ACM Press, 1992.

C.-C. Chang and C.-J. Lin. Training ν-support vector classifiers: Theory and
algorithms. Neural Computation, 13(9):2119-2147, 2001.

C.-C. Chang and C.-J. Lin. Training ν-support vector regression: Theory and
algorithms. Neural Computation, 14(8):1959-1977, 2002.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on ν-support vector machines.
Applied Stochastic Models in Business and Industry, 21:111-136, 2005. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtoturial.pdf.

P.-H. Chen, R.-E. Fan, and C.-J. Lin. A study on SMO-type decomposition methods
for support vector machines. IEEE Transactions on Neural Networks, 17:893-908,
July 2006. URL http://www.csie.ntu.edu.tw/~cjlin/papers/generalSMO.pdf.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297,
1995.

D. J. Crisp and C. J. C. Burges. A geometric interpretation of ν-SVM classifiers.
In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information
Processing Systems, volume 12, Cambridge, MA, 2000. MIT Press.

K. C. Dorff, N. Chambwe, M. Srdanovic, and F. Campagne. BDVal: reproducible
large-scale predictive model development and validation in high-throughput datasets.
Bioinformatics, 26(19):2472-2473, 2010.

R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order
information for training SVM. Journal of Machine Learning Research, 6:1889-1918,
2005. URL http://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel
representations. Journal of Machine Learning Research, 2:243-264, 2001.

T. Glasmachers and C. Igel. Maximum-gain working set selection for support vector
machines. Journal of Machine Learning Research, 7:1437-1466, 2006.

K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification
with sets of image features. In Proceedings of IEEE International Conference on
Computer Vision, 2005.

M. Hanke, Y. O. Halchenko, P. B. Sederberg, S. J. Hanson, J. V. Haxby, and
S. Pollmann. PyMVPA: A Python toolbox for multivariate pattern analysis of fMRI
data. Neuroinformatics, 7(1):37-53, 2009. ISSN 1539-2791.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector
machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002a.

C.-W. Hsu and C.-J. Lin. A simple decomposition method for support vector
machines. Machine Learning, 46:291-314, 2002b.

C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector
classification. Technical report, Department of Computer Science, National Taiwan
University, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C.
Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector
Learning, pages 169-184, Cambridge, MA, 1998. MIT Press.

S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements
to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13:637-649,
2001.

S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with
reduced classifier complexity. Journal of Machine Learning Research, 7:1493-1515,
2006.

S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise
procedure for building and training a neural network. In J. Fogelman, editor,
Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

U. H.-G. Kressel. Pairwise classification and support vector machines. In B. Schölkopf,
C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support
Vector Learning, pages 255-268, Cambridge, MA, 1998. MIT Press.

Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In
Proceedings of the First SIAM International Conference on Data Mining, 2001.

C.-J. Lin and R. C. Weng. Simple probabilistic predictions for support vector
regression. Technical report, Department of Computer Science, National Taiwan
University, 2004. URL http://www.csie.ntu.edu.tw/~cjlin/papers/svrprob.pdf.

H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for
support vector machines. Machine Learning, 68:267-276, 2007. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/plattprob.pdf.

N. List and H. U. Simon. General polynomial time decomposition algorithms. Journal
of Machine Learning Research, 8:303-321, 2007.

N. List and H. U. Simon. SVM-optimization and steepest-descent line search. In
Proceedings of the 22nd Annual Conference on Computational Learning Theory, 2009.

J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and
E. Marsi. MaltParser: A language-independent system for data-driven dependency
parsing. Natural Language Engineering, 13(2):95-135, 2007.

E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application
to face detection. In Proceedings of IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR), pages 130-136, 1997a.

E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and
applications. AI Memo 1602, Massachusetts Institute of Technology, 1997b.

L. Palagi and M. Sciandrone. On the convergence of a modified version of the SVMlight
algorithm. Optimization Methods and Software, 20(2-3):315-332, 2005.

J. C. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in
Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

J. C. Platt. Probabilistic outputs for support vector machines and comparison to
regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and
D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000.
MIT Press.

B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector
algorithms. Neural Computation, 12:1207-1245, 2000.

B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson.
Estimating the support of a high-dimensional distribution. Neural Computation,
13(7):1443-1471, 2001.

N. Segata and E. Blanzieri. Fast and scalable local kernel machines. Journal of
Machine Learning Research, 11:1883-1926, 2010.

V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class
classification by pairwise coupling. Journal of Machine Learning Research, 5:975-1005,
2004. URL http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.
