
Estimation of the Regularization Parameter

for Support Vector Regression


E.M. Jordaan*, G.F. Smits†
*Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
†Materials Sciences and Information Research, Dow Benelux B.V., Terneuzen, The Netherlands

Abstract: Support Vector Machines use a regularization parameter C to regulate the trade-off between the complexity of the model and the empirical risk of the model. Most of the techniques available for determining the optimal value of C are very time consuming. For industrial applications of the SVM method, there is a need for a fast and robust method to estimate C. In this paper a method based on the characteristics of the kernel, the range of the output values and the size of the ε-insensitive zone is proposed.

INTRODUCTION

The Support Vector Machine as a learning machine was first suggested by Vapnik in the early 1990s [7]. Originally it was derived for classification applications, but since the mid 1990s it has been applied to regression and feature selection problems as well. The SVM formulation in the case of regression, for a given learning or training data set $x_i \in X \subseteq \mathbb{R}^n$, $y_i \in \mathbb{R}$, $i = 1, \ldots, \ell$, is to minimize

$$\min_{w,\,b,\,\xi,\,\hat{\xi}} \ \frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^{\ell}\left(\xi_i^2 + \hat{\xi}_i^2\right) \qquad (1)$$

subject to

$$\left((w \cdot x_i) + b\right) - y_i \le \varepsilon + \xi_i, \quad i = 1, \ldots, \ell,$$
$$y_i - \left((w \cdot x_i) + b\right) \le \varepsilon + \hat{\xi}_i, \quad i = 1, \ldots, \ell,$$
$$\xi_i,\ \hat{\xi}_i \ge 0, \quad i = 1, \ldots, \ell.$$

Since the optimization problem in (1) is a quadratic programming problem, it has the dual formulation: maximise

$$-\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j)\left(K(x_i, x_j) + \frac{1}{C}\,\delta_{i,j}\right) + \sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i)\,y_i - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \hat{\alpha}_i) \qquad (2)$$

subject to

$$\sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i) = 0, \quad \alpha_i \ge 0,\ \hat{\alpha}_i \ge 0, \quad i = 1, \ldots, \ell.$$

The SVM model, in terms of the Lagrange multipliers $(\alpha, \hat{\alpha})$, is defined as

$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i)\,K(x_i, x_{\mathrm{new}}) + b, \qquad (3)$$

where the bias $b$ is determined by using the constraints in (1). The input data vectors that correspond with positive Lagrange multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of a linear loss $\frac{1}{\ell}\sum_{i=1}^{\ell}(\hat{\xi}_i + \xi_i)$; in that case the Lagrange multipliers in (2) are bounded from above by C. More information on Support Vector Machines can be found in [1] and [7].

The parameter C in (1) controls the trade-off between the complexity of the model ($\frac{1}{2}\|w\|^2$) and the training error ($\frac{1}{\ell}\sum_{i=1}^{\ell}(\hat{\xi}_i + \xi_i)$) [7]. C is also called the regularization parameter, since it corresponds to the parameter $\gamma$ of the regularization method for solving ill-posed problems, with $C = 1/\gamma$.
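To make the role of C concrete, the following sketch (illustrative only, not from the paper) fits SVR models for increasing values of C on a noisy toy function and reports the two competing terms: the model complexity $\|w\|^2 = \sum_{i,j}(\alpha_i-\hat{\alpha}_i)(\alpha_j-\hat{\alpha}_j)K(x_i,x_j)$ and the empirical error. Note that scikit-learn's SVR implements the linear $\varepsilon$-insensitive loss rather than the quadratic loss of (1); the qualitative behaviour is the same.

```python
# Illustrative sketch of the trade-off controlled by C in epsilon-SVR.
# Note: SVR's "gamma" below is the RBF kernel coefficient, not the
# regularization parameter gamma = 1/C used in the text.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.05, size=200)

for C in (0.1, 10.0, 1000.0):
    model = SVR(kernel="rbf", gamma=2.0, C=C, epsilon=0.05).fit(X, y)
    beta = model.dual_coef_.ravel()                    # beta_i = alpha_i - alpha_hat_i
    K_sv = rbf_kernel(model.support_vectors_, gamma=2.0)
    complexity = beta @ K_sv @ beta                    # ||w||^2 in feature space
    train_err = np.mean((y - model.predict(X)) ** 2)   # empirical error on the training set
    print(f"C={C:>8}: ||w||^2={complexity:10.3f}  train MSE={train_err:.4f}  #SV={beta.size}")
```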


Finding the optimal value for C still remains a problem. Many researchers have suggested that C should be varied through a wide range of values, with the optimal performance then measured by using a separate validation set or other techniques such as cross-validation or bootstrapping [1], [5]. Vapnik mentioned in [7] three methods for choosing the optimal regularization parameter, namely the L-Curve method [2], the method of the effective number of parameters [4] and the effective VC-dimension method [8]. Each of these methods uses a different approach for measuring the performance and complexity of the model and originates from a different theory. One common problem with many of the suggested approaches is that they are not suitable for large-scale problems. The computational effort of determining the eigenvalues of large matrices or of using resampling limits their use in online applications.

In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method or whether a given kernel function is an appropriate choice, a fast estimation method is extremely useful. Furthermore, since the C parameter is known to be a rather robust parameter, determining the true optimal value is often not worth the effort. In the SVM literature it is often suggested that C should be chosen sufficiently large. But what value is large enough? If an estimation method can give a good indication of the magnitude of C, one can at least start from an informed guess.
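For reference, the conventional but expensive route mentioned above can be sketched in a few lines (an illustration, not from the paper): a cross-validated grid search over C fits one model per candidate value and per fold.

```python
# The time-consuming approach the paper contrasts against:
# cross-validated grid search over a wide, logarithmically spaced range of C.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)

search = GridSearchCV(
    SVR(kernel="rbf", gamma=5.0, epsilon=0.05),
    param_grid={"C": np.logspace(-2, 5, 15)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best C:", search.best_params_["C"])   # 15 values x 5 folds = 75 SVR fits
```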
It is known that the scale of the regularization parameter is affected by several factors. It has been shown by Smola [6] that the optimal regularization parameter depends on the value of ε. Since ε is used to control the complexity of the model and depends on the noise level in the data, the choice of the optimal value of C assumes some knowledge about the underlying noise distribution as well as the inherent complexity of the model. Often, this knowledge is not available. In [1] the authors indicate that the regularization parameter C is also affected by the choice of feature space. The consequence of this is very significant, since the feature space is determined by the specified kernel, which is in fact an operator associated with smoothness. Therefore, the choice of the regularization parameter cannot be based on one factor alone, but on their combined influence. None of the estimation heuristics in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level and some other contributing factors.
The rest of the paper consists of four sections. In
section two, useful results from the L-Curve method
are discussed. In the third section a method is derived
that estimates the value of C from a priori parameters. The performance of this method is shown in
section four.

II RESULTS FROM THE L-CURVE METHOD

The L-Curve method is derived from the theory of solving ill-posed problems [2]. It is a well-established method and one of the few approaches in regularization theory that takes into account both the norm of the solution and the norm of the error [3]. Vapnik has shown in [7] that the L-Curve method can be applied to Support Vector Machines for regression with a quadratic loss function. The resulting terms for the norm of the solution and the norm of the error are then given by the following two functions,

$$\eta(\gamma) = \sum_{i,j \in N} \beta_i(\gamma)\,\beta_j(\gamma)\,K(x_i, x_j) \qquad (4)$$

and

$$\rho(\gamma) = \frac{1}{\ell} \sum_{i=1}^{\ell} \Big( y_i - \sum_{k \in N} \beta_k(\gamma)\,K(x_k, x_i) \Big)^2, \qquad (5)$$

where $\beta_i = \alpha_i - \hat{\alpha}_i$ and $N$ is the index set of the support vectors.
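For a fitted model, (4) and (5) can be evaluated directly from the dual coefficients. The fragment below is an illustrative sketch using scikit-learn's SVR (linear $\varepsilon$-insensitive loss): its dual_coef_ attribute holds $\beta_k = \alpha_k - \hat{\alpha}_k$ for the support vectors, which is exactly the index set $N$.

```python
# Evaluate eta(gamma) and rho(gamma) of (4) and (5) for one fitted SVR model.
# Sketch only: beta_k is nonzero exactly on the support vectors, so the sums
# over N reduce to the arrays exposed by the fitted model.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

def l_curve_point(model, X, y, rbf_gamma):
    beta = model.dual_coef_.ravel()                   # beta_k, k in N
    K_sv = rbf_kernel(model.support_vectors_, gamma=rbf_gamma)
    eta = beta @ K_sv @ beta                          # (4): ||w||^2
    rho = np.mean((y - model.predict(X)) ** 2)        # (5): mean squared error (incl. bias b)
    return eta, rho

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)   # noisy product function, cf. Section IV

model = SVR(kernel="rbf", gamma=5.0, C=100.0, epsilon=0.05).fit(X, y)
eta, rho = l_curve_point(model, X, y, rbf_gamma=5.0)
print(f"eta = {eta:.3f}, rho = {rho:.5f}")
```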

The L-Curve is the log-log plot of $\eta(\gamma)$ against $\rho(\gamma)$. The distinct L-shape of the curve is shown in Figure 1. The L-Curve method is a very useful graphical tool which is used to display the trade-off between the complexity and the error. If too little complexity is used, the right leg of the L-Curve is dominant and the model typically underfits (see Figure 1(b)). When the left leg of the L-Curve is dominant, the model uses too much complexity and starts to overfit, as seen in Figure 1(d). The corner point of the L-Curve corresponds to the optimal value of the regularization parameter, for which the model has the right balance between complexity and the error term.

Figure 1: The form of the L-Curve is shown in graph (a). Graphs (b), (c) and (d) show models for various values of C.
Finding the corner point of the L-Curve involves finding the minimum of the functional

$$H(\gamma) = \rho(\gamma)\,\eta(\gamma). \qquad (6)$$

In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, an important relation between the derivatives of $\rho(\gamma)$ and $\eta(\gamma)$ emerged, and it is this relation we are interested in.

Consider the following minimization problem

$$x_\gamma = \arg\min_x \left\{ \|Ax - b\|_2^2 + \gamma\,\|x\|_2^2 \right\}, \qquad (7)$$

where $A$ is a symmetric positive (semi)definite coefficient matrix and $b$ the given output data. Using the SVD decomposition of $A$, the norms of the solution and of the error can be written as

$$\eta(\gamma) = \sum_{i=1}^{n} f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \qquad (8)$$

and

$$\rho(\gamma) = \sum_{i=1}^{n} (1 - f_i)^2\,(u_i^T b)^2, \qquad (9)$$

where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, which depend on $\sigma_i$ and $\gamma$ as follows:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}.$$

The derivatives of $\eta(\gamma)$ and $\rho(\gamma)$ with respect to $\gamma$ are then given by

$$\eta'(\gamma) = -\frac{2}{\gamma}\sum_{i=1}^{n} (1 - f_i)\,f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \qquad (10)$$

and

$$\rho'(\gamma) = \frac{2}{\gamma}\sum_{i=1}^{n} (1 - f_i)^2\,f_i\,(u_i^T b)^2. \qquad (11)$$

(Note that in [3] these equations were derived using $\gamma^2$, which resulted in a factor 4 instead of 2 in each equation.) Rewriting $\rho'(\gamma)$ and using the fact that

$$\frac{1 - f_i}{\gamma} = \frac{f_i}{\sigma_i^2} \qquad (12)$$

leads to a very important relation:

$$\rho'(\gamma) = -\gamma\,\eta'(\gamma). \qquad (13)$$

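Relation (13) is easy to check numerically. The following sketch (an illustration, not part of the paper) sets up a small Tikhonov problem of the form (7), evaluates $\eta(\gamma)$ and $\rho(\gamma)$ through the filter factors as in (8) and (9), and compares finite-difference derivatives with $\rho'(\gamma) = -\gamma\,\eta'(\gamma)$. The matrix, right-hand side and grid of $\gamma$ values are arbitrary choices.

```python
# Numerical check of relation (13): rho'(gamma) = -gamma * eta'(gamma).
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 30))
A = M @ M.T + 1e-3 * np.eye(30)        # symmetric positive definite test matrix
b = rng.normal(size=30)

U, s, _ = np.linalg.svd(A)             # for symmetric A the SVD gives A = U S U^T
c = U.T @ b                            # coefficients u_i^T b

def eta(g):                            # (8): squared norm of the solution
    f = s**2 / (s**2 + g)
    return np.sum(f**2 * c**2 / s**2)

def rho(g):                            # (9): squared norm of the error
    f = s**2 / (s**2 + g)
    return np.sum((1.0 - f)**2 * c**2)

for g in (0.1, 1.0, 10.0):
    h = 1e-6 * g
    d_eta = (eta(g + h) - eta(g - h)) / (2 * h)
    d_rho = (rho(g + h) - rho(g - h)) / (2 * h)
    print(f"gamma={g:5}: rho'={d_rho:12.5e}   -gamma*eta'={-g * d_eta:12.5e}")
```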

III ESTIMATE FOR C

The relation (13) also applies to the Support Vector Machine formulation with quadratic loss when the implicit feature space, defined by the kernel, is considered. In this section the relation (13), combined with (6), will be used to derive an estimate.¹

First, consider the functional (6). In order to find the optimal regularization parameter $\gamma$, (6) has to be minimized, that is, to set $H'(\gamma) = 0$. The derivative of $H(\gamma)$ is given by

$$H'(\gamma) = \rho'(\gamma)\,\eta(\gamma) + \rho(\gamma)\,\eta'(\gamma). \qquad (14)$$

Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone and using (14), leads to

$$\gamma = \frac{\rho(\gamma)}{\eta(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = \frac{\eta(\gamma)}{\rho(\gamma)}. \qquad (15)$$

This equation forms the basis of the estimate.² Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution $\|w\|^2 < R^2$, where $R$ is the radius of the ball centred at the origin in the feature space, which can be computed as $R^2 \le \max_{1 \le i \le \ell} K(x_i, x_i)$. Therefore,

$$\eta(\gamma) = \|w\|^2 < \max_{1 \le i \le \ell} K(x_i, x_i). \qquad (16)$$

Now, consider the term for the norm of the error, $\rho$. Let $\hat{y}_i$ be the predicted output value of $y_i$ of the SVM model. Since the SVM for regression uses an $\varepsilon$-insensitive loss function,

$$\rho(\gamma) = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big( y_i - \sum_{k=1}^{\ell}\beta_k(\gamma)\,K(x_k, x_i) \Big)^2 = \frac{1}{\ell}\sum_{i=1}^{\ell}(y_i - \hat{y}_i)^2. \qquad (17)$$

It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data and the value of $\varepsilon$. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and experimental results.

Let us assume that the resulting model will be a relatively good model, such that the $\varepsilon$-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the $\varepsilon$-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{2}\Big( \tfrac{1}{2}\,\mathrm{Range}(y) - 2\varepsilon \Big)^2. \qquad (18)$$

From experimental observations, it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate (19).

¹ It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve, which uses the SVD decomposition, and the use of the eigenvalues and eigenvectors in the method of the Effective Number of Parameters that was suggested in statistics for estimating parameters for ridge regression [4].

² Vapnik derived in Chapter 7 of [7] a similar relation for $\gamma$ as in (15) as part of the proof of a theorem. Vapnik used, however, an entirely different approach. The relation $\rho_2(Af_\varepsilon, Af) \le 2d\sqrt{\gamma_\varepsilon}$ can be rewritten as $\gamma_\varepsilon \ge \rho_2^2(Af_\varepsilon, Af)/(4d^2)$. $A$ is an operator in a Hilbert space and the function $\rho_2$ is a metric measuring the distance between the true output $Af = F$ and the predicted output $Af_\varepsilon$ of the optimal solution $f_\varepsilon$. Finally, $d$ is such that $\|f\| \le d$.
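The a priori quantities that enter the bounds (16) and (18) are inexpensive to compute. The sketch below is illustrative only: it evaluates the ingredients $\max_i K(x_i, x_i)$, $\mathrm{Range}(y)$ and $\tfrac{1}{2}\mathrm{Range}(y) - 2\varepsilon$, not the authors' final combination (19).

```python
# A priori quantities used in the bounds (16) and (18): an upper bound on the
# norm of the solution and the term (Range(y)/2 - 2*eps) from the error bound.
# Illustrative sketch only; it does not reproduce the final estimate (19).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def a_priori_quantities(X, y, eps, kernel, **kernel_params):
    K_diag = np.diag(kernel(X, **kernel_params))
    R2_upper = K_diag.max()                   # (16): eta < R^2 <= max_i K(x_i, x_i)
    range_y = y.max() - y.min()
    error_term = 0.5 * range_y - 2.0 * eps    # enters the loose lower bound (18)
    return R2_upper, range_y, error_term

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)

print(a_priori_quantities(X, y, eps=0.05, kernel=rbf_kernel, gamma=5.0))
print(a_priori_quantities(X, y, eps=0.05, kernel=polynomial_kernel, degree=2))
```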


IV EXPERIMENTAL RESULTS

In this section the estimated value of C, obtained using (19), is compared to the value of C determined by using the L-Curve. Several data sets with varying sizes, noise levels and dimensions were used. The results for $f(x_1, x_2) = x_1 x_2 + 1$ with $(x_1, x_2) \in [-1, 1]^2$ (equivalent to a continuous version of the 2D XOR problem) are presented in this section. The learning data consisted of a random sampling of this function after noise of $N(0, 0.05)$ was added.

In Figure 2 the results from the L-Curve approach are compared to the estimated value of C, using an RBF kernel with a width of 0.2 and an ε of 0.05. The L-Curve approach requires the building of several models for increasing values of C. The range of values of C considered needs to be large enough, otherwise the corner point of the L-Curve cannot be seen. Therefore, the resolution of the C-values being used was chosen on a logarithmic scale. Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point of the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimation only needs to predict a value of C close to the corner of the L-Curve.
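The scan just described is straightforward to script. The sketch below is illustrative (scikit-learn's SVR uses the linear $\varepsilon$-insensitive loss rather than the quadratic loss assumed in the derivation): it builds models for C on a logarithmic grid, evaluates $\eta$ and $\rho$ as in (4) and (5), and takes as corner estimate the C minimizing the product $\rho\,\eta$, cf. (6).

```python
# Sketch of the L-Curve approach: fit SVR models for C on a logarithmic grid,
# compute eta and rho for each, and pick the C that minimizes rho*eta as a
# simple corner criterion (cf. (6)).
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)   # noisy continuous XOR

rbf_gamma, eps = 5.0, 0.05
C_grid = np.logspace(-1, 5, 25)                           # logarithmic scale for C
criterion = []
for C in C_grid:
    m = SVR(kernel="rbf", gamma=rbf_gamma, C=C, epsilon=eps).fit(X, y)
    beta = m.dual_coef_.ravel()
    eta = beta @ rbf_kernel(m.support_vectors_, gamma=rbf_gamma) @ beta
    rho = np.mean((y - m.predict(X)) ** 2)
    criterion.append(rho * eta)

C_corner = C_grid[int(np.argmin(criterion))]
print(f"corner of the L-Curve near C = {C_corner:.1f}")
```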

Figure 2: Results for an RBF kernel of width 0.2. The (near) optimal value of C is indicated by an asterisk and the estimated value of C by a circle. (a) Error statistics for each iteration step in the L-Curve method. The L-Curve is shown in (b) and the corner point of the L-Curve in (c). In (d) and (e) the performances of the model using the optimal C and the model using the estimated C are shown, respectively.

Figure 3 shows the results of the SVM with a polynomial kernel for various values of ε, ranging from 0 to 0.125. For each value of ε, the C value was generated by the L-Curve method and also estimated by (19).

Figure 3: Results for a polynomial kernel of degree 2. (a) The L-Curve-determined C and the estimated C are plotted against the percentage of support vectors of each model. (b) The R² statistic of predictions made by the models. (c) The Root Mean Square Error of Prediction of the models.


Figure 3(a) shows the determined value of C from the L-Curve method and the estimated value plotted against the resulting percentage of support vectors of each model. The performance of the models using the optimal C and the estimated C is compared by using the R² statistic and the Root Mean Square Error of Prediction (RMSEP).³ In Figure 3(b) and Figure 3(c) it is clear that the estimated value of C produces models with error statistics that compare well with the error statistics of a model using the optimal value of C.

The CPU time for determining the (near) optimal value of C through the L-Curve method was on average around 90 seconds. For the estimation method, the CPU time was less than 1 second. The computational advantage speaks for itself. The estimated value of C can also help to speed up the L-Curve method, since one can get a good initial guess for a starting point of the algorithm.

³ The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.

CONCLUSIONS

A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the
analysis of the L-Curve method. It was mentioned in
the introduction that choosing a value for C should
involve taking into account several factors, including
the kernel function and the noise level. These factors
are all present in the heuristic proposed.
Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimated C-values are in close proximity to the optimal C. Furthermore, the difference in performance
between a model using the C-value determined by
the L-Curve and a model using the C estimated by
the method, is very small and often negligible.
The computation time needed to determine a good
estimate of the optimal C is a fraction of the time
needed to determine the (near) optimal value of C
by means of the L-Curve method. Therefore, the
proposed estimation method can be used for online

applications in industry. In particular, if one needs


to make a quick assessment whether a given data set
can be solved with the SVM method or if a given
kernel function is an appropriate choice, the fast and
robust estimation method is extremely useful.
In this paper, only the ε-Support Vector Machine was considered, with quadratic loss and assuming that ε is known a priori. Future work includes deriving similar estimates for the ε-SVM with linear loss as well as the ν-Support Vector Machine [1], where the expected ratio of support vectors is used instead of ε.

References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.

[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic
Publishers, Hingham, MA, 1996.
[3] P. C. Hansen, The L-Curve and its use in the
numerical treatment of inverse problems, invited
paper for P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142,
WIT Press, Southampton, 2001.
[4] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, UK, 1990.
[5] B. Scholkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.

[6] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's Thesis, TU Berlin, 1996.
[7] V. N. Vapnik, Statistical Learning Theory, John
Wiley & Sons, 1998.

[8] V. N. Vapnik, E. Levin, and Y. Le Cun, Measuring the VC Dimension of a Learning Machine, Neural Computation, Vol. 6:5, 1994.
