
Estimation of the Regularization Parameter

for Support Vector Regression


E.M. Jordaan*, G.F. Smits†
*Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
†Materials Sciences and Information Research, Dow Benelux B.V., Terneuzen, The Netherlands

Abstract: Support Vector Machines use a regularization parameter C to regulate the trade-off between the complexity of the model and the empirical risk of the model. Most of the techniques available for determining the optimal value of C are very time consuming. For industrial applications of the SVM method, there is a need for a fast and robust method to estimate C. In this paper a method based on the characteristics of the kernel, the range of the output values and the size of the ε-insensitive zone is proposed.

INTRODUCTION

The Support Vector Machine as a learning machine was first suggested by Vapnik in the early 1990s [7]. Originally it was derived for classification applications, but since the mid 1990s it has been applied to regression and feature selection problems as well. The SVM formulation in the case of regression, for a given learning or training data set $x_i \in X \subseteq \mathbb{R}^n$, $y_i \in \mathbb{R}$, $i = 1, \ldots, \ell$, is to minimize

$$\min_{w,\,b,\,\xi,\,\hat{\xi}} \ \frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^{\ell}\left(\xi_i^2 + \hat{\xi}_i^2\right) \qquad (1)$$

subject to

$$\left((w \cdot x_i) + b\right) - y_i \le \varepsilon + \xi_i, \quad i = 1, \ldots, \ell,$$
$$y_i - \left((w \cdot x_i) + b\right) \le \varepsilon + \hat{\xi}_i, \quad i = 1, \ldots, \ell,$$
$$\xi_i,\ \hat{\xi}_i \ge 0, \quad i = 1, \ldots, \ell.$$

Since the optimization problem in (1) is a quadratic programming problem, it has the dual formulation: maximise

$$-\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j)\left(K(x_i, x_j) + \frac{1}{C}\,\delta_{i,j}\right) + \sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i)\,y_i - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \hat{\alpha}_i) \qquad (2)$$

subject to

$$\sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i) = 0, \quad \alpha_i \ge 0,\ \hat{\alpha}_i \ge 0, \quad i = 1, \ldots, \ell.$$

The SVM model, in terms of the Lagrange multipliers $(\alpha, \hat{\alpha})$, is defined as

$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell}(\alpha_i - \hat{\alpha}_i)\,K(x_i, x_{\mathrm{new}}) + b, \qquad (3)$$

where the bias $b$ is determined by using the constraints in (1). The input data vectors that correspond with positive Lagrange multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of a linear loss $\frac{1}{\ell}\sum_{i=1}^{\ell}(\hat{\xi}_i + \xi_i)$; in that case the Lagrange multipliers in (2) are bounded from above by C. More information on Support Vector Machines can be found in [1] and [7].

The parameter C in (1) controls the trade-off between the complexity of the model ($\frac{1}{2}\|w\|^2$) and the training error ($\frac{1}{\ell}\sum_{i=1}^{\ell}(\hat{\xi}_i + \xi_i)$) [7]. C is also called the regularization parameter, since it corresponds to the parameter $\gamma$ of the regularization method for solving ill-posed problems, with $C = 1/\gamma$.
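To make the role of C concrete, the following sketch (illustrative only, not from the paper) fits SVR models for increasing values of C on a noisy toy function and reports the two competing terms: the model complexity $\|w\|^2 = \sum_{i,j}(\alpha_i-\hat{\alpha}_i)(\alpha_j-\hat{\alpha}_j)K(x_i,x_j)$ and the empirical error. Note that scikit-learn's SVR implements the linear $\varepsilon$-insensitive loss rather than the quadratic loss of (1); the qualitative behaviour is the same.

```python
# Illustrative sketch of the trade-off controlled by C in epsilon-SVR.
# Note: SVR's "gamma" below is the RBF kernel coefficient, not the
# regularization parameter gamma = 1/C used in the text.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.05, size=200)

for C in (0.1, 10.0, 1000.0):
    model = SVR(kernel="rbf", gamma=2.0, C=C, epsilon=0.05).fit(X, y)
    beta = model.dual_coef_.ravel()                    # beta_i = alpha_i - alpha_hat_i
    K_sv = rbf_kernel(model.support_vectors_, gamma=2.0)
    complexity = beta @ K_sv @ beta                    # ||w||^2 in feature space
    train_err = np.mean((y - model.predict(X)) ** 2)   # empirical error on the training set
    print(f"C={C:>8}: ||w||^2={complexity:10.3f}  train MSE={train_err:.4f}  #SV={beta.size}")
```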


Finding the optimal value for C still remains a problem. Many researchers have suggested that C should be varied through a wide range of values, with the optimal performance then measured by using a separate validation set or other techniques such as cross-validation or bootstrapping [1], [5]. Vapnik mentioned in [7] three methods for choosing the optimal regularization parameter, namely the L-Curve method [2], the method of the effective number of parameters [4] and the effective VC-dimension method [8]. Each of these methods uses a different approach for measuring the performance and complexity of the model and originates from a different theory. One common problem with many of the suggested approaches is that they are not suitable for large-scale problems. The computational effort of determining the eigenvalues of large matrices or of using resampling limits their use in online applications.

In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method or whether a given kernel function is an appropriate choice, a fast estimation method is extremely useful. Furthermore, since the C parameter is known to be a rather robust parameter, determining the true optimal value is often not worth the effort. In the SVM literature it is often suggested that C should be chosen sufficiently large. But what value is large enough? If an estimation method can give a good indication of the magnitude of C, one can at least start from an informed guess.
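For reference, the conventional but expensive route mentioned above can be sketched in a few lines (an illustration, not from the paper): a cross-validated grid search over C fits one model per candidate value and per fold.

```python
# The time-consuming approach the paper contrasts against:
# cross-validated grid search over a wide, logarithmically spaced range of C.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)

search = GridSearchCV(
    SVR(kernel="rbf", gamma=5.0, epsilon=0.05),
    param_grid={"C": np.logspace(-2, 5, 15)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best C:", search.best_params_["C"])   # 15 values x 5 folds = 75 SVR fits
```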
It is known that the scale of the regularization parameter is affected by several factors. It has been shown by Smola [6] that the optimal regularization parameter depends on the value of ε. Since ε is used to control the complexity of the model and depends on the noise level in the data, the choice of the optimal value of C assumes some knowledge about the underlying noise distribution as well as the inherent complexity of the model. Often, this knowledge is not available. In [1] the authors indicate that the regularization parameter C is also affected by the choice of feature space. The consequence of this is very significant, since the feature space is determined by the specified kernel, which is in fact an operator associated with smoothness. Therefore, the choice of the regularization parameter cannot be based on one factor alone, but on their combined influence. None of the estimation heuristics in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level and some other contributing factors.
The rest of the paper consists of four sections. In
section two, useful results from the L-Curve method
are discussed. In the third section a method is derived
that estimates the value of C from a priori parameters. The performance of this method is shown in
section four.

II RESULTS FROM THE L-CURVE METHOD

The L-Curve method is derived from the theory of solving ill-posed problems [2]. It is a well-established method and one of the few approaches in regularization theory that takes into account both the norm of the solution and the norm of the error [3]. Vapnik has shown in [7] that the L-Curve method can be applied to Support Vector Machines for regression with a quadratic loss function. The resulting terms for the norm of the solution and the norm of the error are then given by the following two functions,

$$\eta(\gamma) = \sum_{i,j \in N} \beta_i(\gamma)\,\beta_j(\gamma)\,K(x_i, x_j) \qquad (4)$$

and

$$\rho(\gamma) = \frac{1}{\ell} \sum_{i=1}^{\ell} \Big( y_i - \sum_{k \in N} \beta_k(\gamma)\,K(x_k, x_i) \Big)^2, \qquad (5)$$

where $\beta_i = \alpha_i - \hat{\alpha}_i$ and $N$ is the index set of the support vectors.
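For a fitted model, (4) and (5) can be evaluated directly from the dual coefficients. The fragment below is an illustrative sketch using scikit-learn's SVR (linear $\varepsilon$-insensitive loss): its dual_coef_ attribute holds $\beta_k = \alpha_k - \hat{\alpha}_k$ for the support vectors, which is exactly the index set $N$.

```python
# Evaluate eta(gamma) and rho(gamma) of (4) and (5) for one fitted SVR model.
# Sketch only: beta_k is nonzero exactly on the support vectors, so the sums
# over N reduce to the arrays exposed by the fitted model.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

def l_curve_point(model, X, y, rbf_gamma):
    beta = model.dual_coef_.ravel()                   # beta_k, k in N
    K_sv = rbf_kernel(model.support_vectors_, gamma=rbf_gamma)
    eta = beta @ K_sv @ beta                          # (4): ||w||^2
    rho = np.mean((y - model.predict(X)) ** 2)        # (5): mean squared error (incl. bias b)
    return eta, rho

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)   # noisy product function, cf. Section IV

model = SVR(kernel="rbf", gamma=5.0, C=100.0, epsilon=0.05).fit(X, y)
eta, rho = l_curve_point(model, X, y, rbf_gamma=5.0)
print(f"eta = {eta:.3f}, rho = {rho:.5f}")
```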

The L-Curve is the log-log plot of $\eta(\gamma)$ against $\rho(\gamma)$. The distinct L-shape of the curve is shown in Figure 1. The L-Curve method is a very useful graphical tool which is used to display the trade-off between the complexity and the error. If too little complexity is used, the right leg of the L-Curve is dominant and the model typically underfits (see Figure 1(b)). When the left leg of the L-Curve is dominant, the model uses too much complexity and starts to overfit, as seen in Figure 1(d). The corner point of the L-Curve corresponds to the optimal value of the regularization parameter, for which the model has the right balance between complexity and the error term.

Figure 1: The form of the L-Curve is shown in graph (a). Graphs (b), (c) and (d) show models for various values of C.
Finding the corner point of the L-Curve involves finding the minimum of the functional

$$H(\gamma) = \rho(\gamma)\,\eta(\gamma). \qquad (6)$$

In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, an important relation between the derivatives of $\rho(\gamma)$ and $\eta(\gamma)$ emerged, and it is this relation we are interested in.

Consider the following minimization problem

$$x_\gamma = \arg\min_x \left\{ \|Ax - b\|_2^2 + \gamma\,\|x\|_2^2 \right\}, \qquad (7)$$

where $A$ is a symmetric positive (semi)definite coefficient matrix and $b$ the given output data. Using the SVD decomposition of $A$, the norms of the solution and of the error can be written as

$$\eta(\gamma) = \sum_{i=1}^{n} f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \qquad (8)$$

and

$$\rho(\gamma) = \sum_{i=1}^{n} (1 - f_i)^2\,(u_i^T b)^2, \qquad (9)$$

where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, which depend on $\sigma_i$ and $\gamma$ as follows:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}.$$

The derivatives of $\eta(\gamma)$ and $\rho(\gamma)$ with respect to $\gamma$ are then given by

$$\eta'(\gamma) = -\frac{2}{\gamma}\sum_{i=1}^{n} (1 - f_i)\,f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \qquad (10)$$

and

$$\rho'(\gamma) = \frac{2}{\gamma}\sum_{i=1}^{n} (1 - f_i)^2\,f_i\,(u_i^T b)^2. \qquad (11)$$

(Note that in [3] these equations were derived using $\gamma^2$, which resulted in a factor 4 instead of 2 in each equation.) Rewriting $\rho'(\gamma)$ and using the fact that

$$\frac{1 - f_i}{\gamma} = \frac{f_i}{\sigma_i^2} \qquad (12)$$

leads to a very important relation:

$$\rho'(\gamma) = -\gamma\,\eta'(\gamma). \qquad (13)$$

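Relation (13) is easy to check numerically. The following sketch (an illustration, not part of the paper) sets up a small Tikhonov problem of the form (7), evaluates $\eta(\gamma)$ and $\rho(\gamma)$ through the filter factors as in (8) and (9), and compares finite-difference derivatives with $\rho'(\gamma) = -\gamma\,\eta'(\gamma)$. The matrix, right-hand side and grid of $\gamma$ values are arbitrary choices.

```python
# Numerical check of relation (13): rho'(gamma) = -gamma * eta'(gamma).
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 30))
A = M @ M.T + 1e-3 * np.eye(30)        # symmetric positive definite test matrix
b = rng.normal(size=30)

U, s, _ = np.linalg.svd(A)             # for symmetric A the SVD gives A = U S U^T
c = U.T @ b                            # coefficients u_i^T b

def eta(g):                            # (8): squared norm of the solution
    f = s**2 / (s**2 + g)
    return np.sum(f**2 * c**2 / s**2)

def rho(g):                            # (9): squared norm of the error
    f = s**2 / (s**2 + g)
    return np.sum((1.0 - f)**2 * c**2)

for g in (0.1, 1.0, 10.0):
    h = 1e-6 * g
    d_eta = (eta(g + h) - eta(g - h)) / (2 * h)
    d_rho = (rho(g + h) - rho(g - h)) / (2 * h)
    print(f"gamma={g:5}: rho'={d_rho:12.5e}   -gamma*eta'={-g * d_eta:12.5e}")
```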

III ESTIMATE FOR C

The relation (13) also applies to the Support Vector Machine formulation with quadratic loss when the implicit feature space, defined by the kernel, is considered. In this section the relation (13), combined with (6), will be used to derive an estimate.¹

First, consider the functional (6). In order to find the optimal regularization parameter $\gamma$, (6) has to be minimized, that is, to set $H'(\gamma) = 0$. The derivative of $H(\gamma)$ is given by

$$H'(\gamma) = \rho'(\gamma)\,\eta(\gamma) + \rho(\gamma)\,\eta'(\gamma). \qquad (14)$$

Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone and using (14), leads to

$$\gamma = \frac{\rho(\gamma)}{\eta(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = \frac{\eta(\gamma)}{\rho(\gamma)}. \qquad (15)$$

This equation forms the basis of the estimate.² Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution $\|w\|^2 < R^2$, where $R$ is the radius of the ball centred at the origin in the feature space, which can be computed as $R^2 \le \max_{1 \le i \le \ell} K(x_i, x_i)$. Therefore,

$$\eta(\gamma) = \|w\|^2 < \max_{1 \le i \le \ell} K(x_i, x_i). \qquad (16)$$

Now, consider the term for the norm of the error, $\rho$. Let $\hat{y}_i$ be the predicted output value of $y_i$ of the SVM model. Since the SVM for regression uses an $\varepsilon$-insensitive loss function,

$$\rho(\gamma) = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big( y_i - \sum_{k=1}^{\ell}\beta_k(\gamma)\,K(x_k, x_i) \Big)^2 = \frac{1}{\ell}\sum_{i=1}^{\ell}(y_i - \hat{y}_i)^2. \qquad (17)$$

It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data and the value of $\varepsilon$. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and experimental results.

Let us assume that the resulting model will be a relatively good model, such that the $\varepsilon$-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the $\varepsilon$-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{2}\Big( \tfrac{1}{2}\,\mathrm{Range}(y) - 2\varepsilon \Big)^2. \qquad (18)$$

From experimental observations, it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate (19).

¹ It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve, which uses the SVD decomposition, and the use of the eigenvalues and eigenvectors in the method of the Effective Number of Parameters that was suggested in statistics for estimating parameters for ridge regression [4].

² Vapnik derived in Chapter 7 of [7] a similar relation for $\gamma$ as in (15) as part of the proof of a theorem. Vapnik used, however, an entirely different approach. The relation $\rho_2(Af_\varepsilon, Af) \le 2d\sqrt{\gamma_\varepsilon}$ can be rewritten as $\gamma_\varepsilon \ge \rho_2^2(Af_\varepsilon, Af)/(4d^2)$. $A$ is an operator in a Hilbert space and the function $\rho_2$ is a metric measuring the distance between the true output $Af = F$ and the predicted output $Af_\varepsilon$ of the optimal solution $f_\varepsilon$. Finally, $d$ is such that $\|f\| \le d$.
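The a priori quantities that enter the bounds (16) and (18) are inexpensive to compute. The sketch below is illustrative only: it evaluates the ingredients $\max_i K(x_i, x_i)$, $\mathrm{Range}(y)$ and $\tfrac{1}{2}\mathrm{Range}(y) - 2\varepsilon$, not the authors' final combination (19).

```python
# A priori quantities used in the bounds (16) and (18): an upper bound on the
# norm of the solution and the term (Range(y)/2 - 2*eps) from the error bound.
# Illustrative sketch only; it does not reproduce the final estimate (19).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def a_priori_quantities(X, y, eps, kernel, **kernel_params):
    K_diag = np.diag(kernel(X, **kernel_params))
    R2_upper = K_diag.max()                   # (16): eta < R^2 <= max_i K(x_i, x_i)
    range_y = y.max() - y.min()
    error_term = 0.5 * range_y - 2.0 * eps    # enters the loose lower bound (18)
    return R2_upper, range_y, error_term

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)

print(a_priori_quantities(X, y, eps=0.05, kernel=rbf_kernel, gamma=5.0))
print(a_priori_quantities(X, y, eps=0.05, kernel=polynomial_kernel, degree=2))
```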


IV EXPERIMENTAL RESULTS

In this section the estimated value of C, obtained using (19), is compared to the value of C determined by using the L-Curve. Several data sets with varying sizes, noise levels and dimensions were used. The results for $f(x_1, x_2) = x_1 x_2 + 1$ with $(x_1, x_2) \in [-1, 1]^2$ (equivalent to a continuous version of the 2D XOR problem) are presented in this section. The learning data consisted of a random sampling of this function after noise of $N(0, 0.05)$ was added.

In Figure 2 the results from the L-Curve approach are compared to the estimated value of C, using an RBF kernel with a width of 0.2 and an ε of 0.05. The L-Curve approach requires the building of several models for increasing values of C. The range of values of C considered needs to be large enough, otherwise the corner point of the L-Curve cannot be seen. Therefore, the resolution of the C-values being used was chosen on a logarithmic scale. Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point of the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimation only needs to predict a value of C close to the corner of the L-Curve.
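The scan just described is straightforward to script. The sketch below is illustrative (scikit-learn's SVR uses the linear $\varepsilon$-insensitive loss rather than the quadratic loss assumed in the derivation): it builds models for C on a logarithmic grid, evaluates $\eta$ and $\rho$ as in (4) and (5), and takes as corner estimate the C minimizing the product $\rho\,\eta$, cf. (6).

```python
# Sketch of the L-Curve approach: fit SVR models for C on a logarithmic grid,
# compute eta and rho for each, and pick the C that minimizes rho*eta as a
# simple corner criterion (cf. (6)).
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=300)   # noisy continuous XOR

rbf_gamma, eps = 5.0, 0.05
C_grid = np.logspace(-1, 5, 25)                           # logarithmic scale for C
criterion = []
for C in C_grid:
    m = SVR(kernel="rbf", gamma=rbf_gamma, C=C, epsilon=eps).fit(X, y)
    beta = m.dual_coef_.ravel()
    eta = beta @ rbf_kernel(m.support_vectors_, gamma=rbf_gamma) @ beta
    rho = np.mean((y - m.predict(X)) ** 2)
    criterion.append(rho * eta)

C_corner = C_grid[int(np.argmin(criterion))]
print(f"corner of the L-Curve near C = {C_corner:.1f}")
```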

Figure 2: Results for an RBF kernel of width 0.2. The (near) optimal value of C is indicated by an asterisk and the estimated value of C by a circle. (a) Error statistics for each iteration step in the L-Curve method. The L-Curve is shown in (b) and the corner point of the L-Curve in (c). In (d) and (e) the performances of the model using the optimal C and the model using the estimated C are shown, respectively.

Figure 3 shows the results of the SVM with a polynomial kernel for various values of ε, ranging from 0 to 0.125. For each value of ε, the C value was generated by the L-Curve method and also estimated by (19).

Figure 3: Results for a polynomial kernel of degree 2. (a) The L-Curve-determined C and the estimated C are plotted against the percentage of support vectors of each model. (b) The R² statistic of predictions made by the models. (c) The Root Mean Square Error of Prediction of the models.


Figure 3(a) shows the determined value of C from the L-Curve method and the estimated value plotted against the resulting percentage of support vectors of each model. The performance of the models using the optimal C and the estimated C is compared by using the R² statistic and the Root Mean Square Error of Prediction (RMSEP).³ In Figure 3(b) and Figure 3(c) it is clear that the estimated value of C produces models with error statistics that compare well with the error statistics of a model using the optimal value of C.

The CPU time for determining the (near) optimal value of C through the L-Curve method was on average around 90 seconds. For the estimation method, the CPU time was less than 1 second. The computational advantage speaks for itself. The estimated value of C can also help to speed up the L-Curve method, since one can get a good initial guess for a starting point of the algorithm.

³ The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.

CONCLUSIONS

A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the
analysis of the L-Curve method. It was mentioned in
the introduction that choosing a value for C should
involve taking into account several factors, including
the kernel function and the noise level. These factors
are all present in the heuristic proposed.
Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimated C-values are in close proximity to the optimal C. Furthermore, the difference in performance
between a model using the C-value determined by
the L-Curve and a model using the C estimated by
the method, is very small and often negligible.
The computation time needed to determine a good
estimate of the optimal C is a fraction of the time
needed to determine the (near) optimal value of C
by means of the L-Curve method. Therefore, the
proposed estimation method can be used for online

applications in industry. In particular, if one needs


to make a quick assessment whether a given data set
can be solved with the SVM method or if a given
kernel function is an appropriate choice, the fast and
robust estimation method is extremely useful.
In this paper, only the ε-Support Vector Machine was considered, with quadratic loss and assuming that ε is known a priori. Future work includes deriving similar estimates for the ε-SVM with linear loss as well as the ν-Support Vector Machine [1], where the expected ratio of support vectors is used instead of ε.

References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.

[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic
Publishers, Hingham, MA, 1996.
[3] P. C. Hansen, The L-Curve and its use in the
numerical treatment of inverse problems, invited
paper for P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142,
WIT Press, Southampton, 2001.
[4] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, UK, 1990.
[5] B. Scholkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.

[6] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's Thesis, TU Berlin, 1996.
[7] V. N. Vapnik, Statistical Learning Theory, John
Wiley & Sons, 1998.

[8] V. N. Vapnik, E. Levin, and Y. Le Cun, Measuring the VC Dimension of a Learning Machine, Neural Computation, Vol. 6:5, 1994.
