
Computational Statistics and Data Analysis 55 (2011) 1897–1908


Gene selection and prediction for cancer classification using support vector machines with a reject option

Hosik Choi a, Donghwa Yeo b, Sunghoon Kwon c, Yongdai Kim c,∗

a Department of Informational Statistics, Hoseo University, Asan, Chungnam 336-795, Republic of Korea
b Institute of Mathematical Sciences, Ewha University, Seoul 120-750, Republic of Korea
c Department of Statistics, Seoul National University, Seoul 151-742, Republic of Korea

Article history:
Received 23 August 2009
Received in revised form 16 August 2010
Accepted 1 December 2010
Available online 9 December 2010

Keywords:
Classification
Reject option
Support vector machines
Lasso

Abstract

In cancer classification based on gene expression data, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning for observations that are difficult to classify. In this paper, we consider the problem of gene selection with a reject option. Typically, gene expression data comprise expression levels of several thousands of candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. We propose a machine learning approach in which we apply the l1 penalty to the SVM with a reject option. This method is referred to as the l1 SVM with a reject option. We develop a novel optimization algorithm for this SVM, which is sufficiently fast and stable to analyze gene expression data. The proposed algorithm realizes an entire solution path with respect to the regularization parameter. Results of numerical studies show that, in comparison with the standard l1 SVM, the proposed method efficiently reduces prediction errors without hampering gene selectivity.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Statistical and machine learning approaches are popularly used to construct a predictive model for classifying cancer
patients from normal ones based on gene expression data. A few of such approaches include the support vector machine
(SVM, Terrence et al., 2000; Guyon et al., 2002; Ambroise and McLachlan, 2002), logistic regression (Shevade and Keerthi,
2003; Liao and Chin, 2007), and boosting (Hong et al., 2005).
A standard approach is to classify all future observations. However, in some cases it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning for observations that are difficult to classify.
Many empirical studies in the engineering community support the conjecture that the use of a reject option effectively
reduces misclassification error rates. For further examples, refer to Chow (1970), Tortorella (2000), and Lendgrebe et al.
(2006). In addition, see the references in McLachlan (1992). Recently, Bartlett and Wegkamp (2008) proposed a learning algorithm with a reject option based on the SVM (this approach is referred to as the l2 SVM with a reject option) and studied its theoretical properties.

∗ Corresponding author. Tel.: +82 2 880 9091; fax: +82 2 883 6144.
E-mail addresses: choi.hosik@gmail.com (H. Choi), ydkim0903@gmail.com (Y. Kim).

0167-9473/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2010.12.001

Typically, gene expression data consist of expression levels of several thousands of candidate genes. In such cases, an
effective gene selection procedure is necessary to provide better understanding of the underlying biological system that
generates data and to improve prediction performance. For literature on gene selection, refer to Guyon et al. (2002). In this
paper, we propose a machine learning approach for gene selection with a reject option; in this approach, the l1 penalty is
applied to the SVM with a reject option. The proposed method is referred to as the l1 SVM with a reject option. The l1 penalty
is widely used for variable selection in several contexts (Tibshirani, 1996; Shevade and Keerthi, 2003; Zhu et al., 2004) since
it provides a sparse solution (i.e., some estimated coefficients are exactly zero). For the SVM with a reject option, Wegkamp
(2007) proved that the l1 penalty provides desirable large-sample properties. However, a crucial difficulty in using the l1
penalty in gene expression data is computation since the l1 penalty is not differentiable. Various efficient optimization
algorithms for computing the l1 penalty, such as the fixed-point algorithm and block coordinate descent algorithm, have
been proposed by Shevade and Keerthi (2003), Park and Hastie (2007), Wu and Lange (2008), Genkin et al. (2007), Koh et al.
(2007), Balakrishnan and Madigan (2008), Hale et al. (2008), Meier et al. (2008), Rosset and Zhu (2007) and Kim et al. (2008).
These algorithms, however, are not applicable to the l1 SVM with a reject option since neither the loss function nor the penalty is differentiable. The optimization problem for the l1 SVM with a reject option is similar to that of the standard l1 SVM (Zhu et al., 2004), although the former is much more difficult because there are multiple nondifferentiable points in its loss function. In this paper, we develop a novel
optimization algorithm for the l1 SVM with a reject option, which is sufficiently fast and stable to analyze gene expression
data. In particular, the proposed algorithm provides an entire solution path with respect to the regularization parameter.
The paper is organized as follows. In Section 2, we review the SVM and l1 SVM with a reject option. A novel optimization
algorithm for the l1 SVM with a reject option is developed in Section 3. Numerical results on simulated data, four publicly
available gene expression data sets, and a mass spectrometry protein data set are presented in Section 4. Concluding remarks
follow in Section 5.

2. Methodology

2.1. SVM with a reject option: review

Let (x_1, y_1), ..., (x_n, y_n) be the input-output pairs of the given data, where x_i ∈ R^p is a vector of gene expression levels and y_i ∈ {−1, 1} denotes the class label (−1 for normal and 1 for cancer). We assume that the data are n independent copies of a random vector (X, Y). Traditional learning algorithms try to find the optimal hyperplane that minimizes the misclassification error rate E(I(Yf(X) < 0)) among all linear hyperplanes f(x) = β_0 + xᵀβ, β_0 ∈ R, β ∈ R^p. The optimal hyperplane can be estimated by minimizing a regularized empirical risk given as ∑_{i=1}^n φ(y_i(β_0 + x_iᵀβ)) + λJ(β), where φ(z) is a convex surrogate loss function of the 0-1 loss I(z < 0) and J is a penalty function controlling the misclassification error and the classifier's complexity. Various surrogate loss functions φ(z) correspond to various learning algorithms, such as logistic regression, boosting, and the SVM (Hastie et al., 2001).
A learning algorithm with a reject option constructs a classifier t : R^p → {−1, 1, r}, where the interpretation of the output r is that of being in doubt and hence of deferring the decision. Herbei and Wegkamp (2006) introduced the misclassification error rate of a classifier with a reject option as L_d(t) = d Pr(t(X) = r) + Pr(t(X) ≠ Y, t(X) ≠ r), where d ∈ [0, 1/2) is the cost of a reject. For a given real-valued function f(x) and δ > 0, Bartlett and Wegkamp (2008) proposed constructing a classifier with a reject option t_f(x) as t_f(x) = sign(f(x)) I(|f(x)| > δ) + r I(|f(x)| ≤ δ). Then, the misclassification error rate of t_f becomes E(l_{d,δ}(Yf(X))), where

l_{d,δ}(z) = 1 if z < −δ,
             d if |z| ≤ δ,
             0 otherwise.

Fig. 1(a) and (b) shows a comparison between the 0-1 loss and the 0-1 loss with a reject option (l_{d,δ}).
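As a concrete illustration, the rule t_f and the loss l_{d,δ} can be coded directly. This is a minimal sketch with our own function names; in particular, encoding the reject decision r as 0 is our convention, not the paper's.

```python
import numpy as np

def reject_classifier(f, delta):
    """t_f(x) = sign(f(x)) if |f(x)| > delta, reject otherwise.
    The reject decision r is encoded here as 0."""
    f = np.asarray(f, dtype=float)
    out = np.sign(f)
    out[np.abs(f) <= delta] = 0.0  # 0 stands for the reject decision r
    return out

def loss_01_reject(z, d, delta):
    """l_{d,delta}(z): 1 if z < -delta, d if |z| <= delta, 0 otherwise,
    evaluated at the functional margin z = y * f(x)."""
    z = np.asarray(z, dtype=float)
    return np.where(z < -delta, 1.0, np.where(np.abs(z) <= delta, d, 0.0))
```

For d = 0.2 and δ = 0.5, a score of 0.3 is rejected, while scores of −1.2 and 0.8 are classified as −1 and +1, respectively.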
Bartlett and Wegkamp (2008) proposed the SVM with a reject option by replacing l_{d,δ} with a convex surrogate loss and applying the l2 norm of the coefficients as a penalty function. In this approach, the SVM with a reject option estimates a classifier by minimizing the following regularized empirical risk:

∑_{i=1}^n φ_d(y_i(β_0 + x_iᵀβ)) + λ‖β‖₂²,   (1)

where ‖β‖₂² = ∑_{j=1}^p β_j² and the surrogate loss function φ_d, the hinge loss with a reject option, is expressed as

φ_d(z) = 1 − ((1 − d)/d) z  if z < 0,
         1 − z              if 0 ≤ z < 1,   (2)
         0                  otherwise.
Bartlett and Wegkamp (2008) showed that φ_d is a reasonable surrogate loss for l_{d,δ} by proving Fisher consistency. Furthermore, they noted that the optimization problem (1) can be solved by quadratic programming.

Fig. 1. The 0-1 loss and the 0-1 loss with a reject option: (a) the 0-1 loss; (b) the 0-1 loss with a reject option.

Fig. 2. Comparison between the hinge loss φ_d with a reject option when d = 0.2 and the other loss functions.

Fig. 2 shows a comparison between l_{d,δ}, φ_d, and the hinge loss H(z) = (1 − z)₊, which is the surrogate loss for the SVM; here (z)₊ = max{z, 0}. Note that φ_d is piecewise linear, as is the hinge loss. However, φ_d has two nondifferentiable points, one at z = 1 and the other at z = 0, while the hinge loss has only one nondifferentiable point, at z = 1.
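Numerically, φ_d in (2) agrees with the decomposition (1 − z)₊ + a(−z)₊, a = (1 − 2d)/d, that is exploited later in Section 3.1. A small sketch checking this identity (function names are ours):

```python
import numpy as np

def phi_d(z, d):
    """Hinge loss with a reject option, Eq. (2)."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, 1.0 - (1.0 - d) / d * z,
                    np.where(z < 1, 1.0 - z, 0.0))

def phi_d_decomposed(z, d):
    """Equivalent decomposition (1 - z)_+ + a(-z)_+ with a = (1 - 2d)/d,
    valid for 0 < d < 1/2 (so that a > 0)."""
    z = np.asarray(z, dtype=float)
    a = (1.0 - 2.0 * d) / d
    return np.maximum(1.0 - z, 0.0) + a * np.maximum(-z, 0.0)
```

For d = 0.2, both forms give φ_d(−1) = 5 (slope (1 − d)/d = 4 on the negative side) and φ_d(0.5) = 0.5.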

2.2. l1 SVM with a reject option

In the l1 SVM with a reject option, the l2 norm in the regularized empirical risk (1) of the SVM with a reject option is replaced with the l1 norm. In other words, we construct a classifier by minimizing the following regularized empirical risk:

∑_{i=1}^n φ_d(y_i(β_0 + x_iᵀβ)) + λ‖β‖₁,   (3)

where ‖β‖₁ = ∑_{j=1}^p |β_j| and λ > 0 is a regularization parameter controlling the complexity and sparsity of the classifier. An important advantage of the l1 penalty is that some of the estimated coefficients are exactly zero, which allows automatic gene selection. In addition, it is known that when the number of genes is much larger than the number of observations, learning procedures with the l1 penalty produce highly accurate classifiers (Greenstein, 2006).
Computation for the l1 SVM with a reject option is difficult. Since the penalty as well as the surrogate loss are piecewise linear, the optimization problem can in principle be solved by a linear program. However, a standard linear program becomes computationally demanding when the dimension of the input (i.e., the number of genes) is very large. The situation becomes worse when an optimal regularization parameter λ (and possibly d) is to be selected, because a linear program must be solved for various values of λ. In this paper, we propose a novel efficient optimization algorithm for the l1 SVM with a reject option that yields an entire solution path with respect to λ. A solution path is a function (β̂_0(λ), β̂(λ)) of λ, where (β̂_0(λ), β̂(λ)) is the minimizer of (3) for a given λ. There are several solution-path algorithms for the l1 penalty, such as LARS proposed by Efron et al. (2004), the l1 SVM path by Zhu et al. (2004), general differentiable piecewise linear losses by Rosset and Zhu (2007), and generalized linear models by Park and Hastie (2007). However, such algorithms are not directly applicable to the l1 SVM with a reject option, as there are multiple nondifferentiable points in the surrogate loss function φ_d.
When p is much larger than n, linear classifiers frequently perform better than nonlinear ones in many applications. Hall
et al. (2005) have provided some explanations for this phenomenon. Since gene expression data are high-dimensional and
have low sample size, we only consider linear classifiers in this paper. In addition, nonlinear models can be approximated
by employing a linear model using a basis expansion technique such as a spline (Zhang et al., 2004).
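The remark above that (3) can be solved by a linear program can be made concrete. The following is our own sketch (not the path algorithm of Section 3), assuming SciPy's `linprog` is available: it solves (3) for a single (λ, d) by introducing slack variables ξ, ζ for the two hinge pieces and the split β = u − v with u, v ≥ 0; the variable layout [β₀, u, v, ξ, ζ] and the function name are ours.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_reject_lp(X, y, lam, d):
    """Solve the l1 SVM with a reject option, Eq. (3), as a linear program.
    Decision variables: [beta0, u, v, xi, zeta] with beta = u - v.
    Requires 0 < d < 1/2 so that a = (1 - 2d)/d > 0."""
    n, p = X.shape
    a = (1.0 - 2.0 * d) / d
    # objective: lam * (u + v) + sum(xi) + a * sum(zeta)
    c = np.concatenate([[0.0], lam * np.ones(2 * p),
                        np.ones(n), a * np.ones(n)])
    Yx = y[:, None] * X
    # xi_i >= 1 - y_i f(x_i)   <=>   -y_i f(x_i) - xi_i <= -1
    A1 = np.hstack([-y[:, None], -Yx, Yx, -np.eye(n), np.zeros((n, n))])
    # zeta_i >= -y_i f(x_i)    <=>   -y_i f(x_i) - zeta_i <= 0
    A2 = np.hstack([-y[:, None], -Yx, Yx, np.zeros((n, n)), -np.eye(n)])
    A_ub = np.vstack([A1, A2])
    b_ub = np.concatenate([-np.ones(n), np.zeros(n)])
    bounds = [(None, None)] + [(0, None)] * (2 * p + 2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    beta0 = res.x[0]
    beta = res.x[1:1 + p] - res.x[1 + p:1 + 2 * p]
    return beta0, beta
```

On a tiny separable one-dimensional data set with a small λ, the resulting classifier attains margin at least 1 on every training point. This one-shot LP illustrates why a path algorithm is needed: the whole program must be re-solved for every candidate λ.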

3. Algorithm

In this section, we develop an optimization algorithm for (3) that provides an entire solution path with respect to λ. The entire solution path saves significant computing time when searching for an optimal regularization parameter.

3.1. Computation

Since φ_d(z) of (2) decomposes as a sum of two functions, φ_d(z) = (1 − z)₊ + a(−z)₊, where a = (1 − 2d)/d, the problem of minimizing (3) is equivalent to the following constrained optimization problem:

min_{β_0, β} Q(β_0, β) = ∑_{i=1}^n (1 − y_i f(x_i))₊ + a ∑_{i=1}^n (−y_i f(x_i))₊,   (4)

subject to ‖β‖₁ ≤ s for some s > 0, where f(x_i) = β_0 + x_iᵀβ. Here, the complexity as well as the sparsity are controlled by s. Since there is a one-to-one relation between s and λ, we find an entire solution path with respect to s. That is, we obtain {(β̂_0(s), β̂(s)) : 0 ≤ s < ∞}, where (β̂_0(s), β̂(s)) is the minimizer of Q(β_0, β) subject to ‖β‖₁ ≤ s.

The main portion of the algorithm is dedicated to identifying the kinks in the path. The algorithm finds the kinks 0 = s_0 < s_1 < ··· < s_m < ∞ and solutions (β̂_0(s_0), β̂(s_0)), (β̂_0(s_1), β̂(s_1)), ..., (β̂_0(s_m), β̂(s_m)) such that for s ∈ (s_{k−1}, s_k], the solution (β̂_0(s), β̂(s)) is obtained by linear interpolation of (β̂_0(s_{k−1}), β̂(s_{k−1})) and (β̂_0(s_k), β̂(s_k)). In practice, we set s_m as the smallest s value such that all coefficients of the corresponding solution are nonzero.
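The interpolation step between kinks is elementary; a minimal sketch (our own function name and array layout, with (β₀, β) stacked into one vector per kink):

```python
import numpy as np

def interpolate_path(s_knots, beta_knots, s):
    """Piecewise-linear solution path: given solutions at the kinks
    s_0 < s_1 < ... < s_m, return the interpolated solution at s.
    beta_knots has shape (m + 1, p + 1), each row holding (beta0, beta)."""
    s_knots = np.asarray(s_knots, dtype=float)
    beta_knots = np.asarray(beta_knots, dtype=float)
    k = np.searchsorted(s_knots, s)           # locate s in (s_{k-1}, s_k]
    k = int(np.clip(k, 1, len(s_knots) - 1))
    w = (s - s_knots[k - 1]) / (s_knots[k] - s_knots[k - 1])
    return (1.0 - w) * beta_knots[k - 1] + w * beta_knots[k]
```

So once the kinks are found, the solution for any intermediate s comes at negligible cost, which is what makes the path algorithm attractive for tuning.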
The optimization problem equivalent to (4) is

min_{β_0, β, ξ, ζ} ∑_{i=1}^n ξ_i + a ∑_{i=1}^n ζ_i,   (5)

subject to ξ_i ≥ 1 − y_i f(x_i), ζ_i ≥ −y_i f(x_i), ξ_i ≥ 0, and ζ_i ≥ 0 for i = 1, ..., n, and ‖β‖₁ ≤ s, where ξ = (ξ_1, ..., ξ_n)ᵀ and ζ = (ζ_1, ..., ζ_n)ᵀ. The primal Lagrange function (Bertsekas, 2003) for (5) is

L = ∑_{i=1}^n (ξ_i + aζ_i) + ∑_{i=1}^n α_i (1 − y_i f(x_i) − ξ_i) + ∑_{i=1}^n γ_i (−y_i f(x_i) − ζ_i) − ∑_{i=1}^n ρ_i ξ_i − ∑_{i=1}^n τ_i ζ_i + λ (∑_{j=1}^p |β_j| − s),

where α_i ≥ 0, γ_i ≥ 0, ρ_i ≥ 0, τ_i ≥ 0, and λ ≥ 0 are Lagrange multipliers. Taking derivatives of L with respect to β_0, β, ξ, and ζ, we have

∂L/∂β_0 = 0  ⟹  ∑_{i=1}^n (α_i + γ_i) y_i = 0,
∂L/∂β_j = 0  ⟹  −∑_{i=1}^n (α_i + γ_i) y_i x_{ij} + λ sign(β_j) = 0,  j ∈ V(s),
∂L/∂ξ_i = 0  ⟹  1 − α_i − ρ_i = 0,  i = 1, ..., n, and
∂L/∂ζ_i = 0  ⟹  a − γ_i − τ_i = 0,  i = 1, ..., n,

considering the Karush–Kuhn–Tucker (KKT) conditions

α_i (1 − y_i f(x_i) − ξ_i) = 0,  γ_i (−y_i f(x_i) − ζ_i) = 0,
ρ_i ξ_i = 0,  τ_i ζ_i = 0,  and  λ (∑_{j=1}^p |β_j| − s) = 0,
Fig. 3. The locations of margins of samples in 5 subsets LL(s), LE (s), LR(s), E (s), and R(s).

where V(s) = {j : β̂_j(s) ≠ 0}. We divide the samples into 5 subsets: the left-left set LL(s) = {i : y_i f(x_i) < 0}, the left elbow set LE(s) = {i : y_i f(x_i) = 0}, the left-right set LR(s) = {i : 0 < y_i f(x_i) < 1}, the elbow set E(s) = {i : y_i f(x_i) = 1}, and the right set R(s) = {i : y_i f(x_i) > 1}. The margins of the samples in these subsets are shown in Fig. 3. The KKT conditions imply that

LL(s) = {i : y_i f(x_i) < 0}      ⟹  α_i = 1 and γ_i = a,
LE(s) = {i : y_i f(x_i) = 0}      ⟹  α_i = 1 and γ_i + τ_i = a,
LR(s) = {i : 0 < y_i f(x_i) < 1}  ⟹  α_i = 1 and γ_i = 0,   (6)
E(s) = {i : y_i f(x_i) = 1}       ⟹  α_i + ρ_i = 1 and γ_i = 0, and
R(s) = {i : y_i f(x_i) > 1}       ⟹  α_i = 0 and γ_i = 0.
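The five subsets depend only on the margins y_i f(x_i) and can be computed directly; in a floating-point implementation a small tolerance stands in for the exact equalities defining LE(s) and E(s). A minimal sketch (function name and tolerance are ours):

```python
import numpy as np

def partition_samples(margins, tol=1e-8):
    """Partition sample indices by the margin z_i = y_i f(x_i) into the
    five subsets LL, LE, LR, E, R of Section 3.1."""
    z = np.asarray(margins, dtype=float)
    return {
        "LL": np.where(z < -tol)[0],                 # y f < 0
        "LE": np.where(np.abs(z) <= tol)[0],         # y f = 0
        "LR": np.where((z > tol) & (z < 1 - tol))[0],  # 0 < y f < 1
        "E":  np.where(np.abs(z - 1.0) <= tol)[0],   # y f = 1
        "R":  np.where(z > 1 + tol)[0],              # y f > 1
    }
```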
Hence, the minimizer of (5) can be found by solving the following system of linear equations

∑_{i=1}^n (α_i + γ_i) y_i x_{ij} − λ sign(β_j) = 0  for j ∈ V(s),

∑_{i=1}^n (α_i + γ_i) y_i = 0,

y_i (β_0 + ∑_{j∈V(s)} β_j x_{ij}) = 0  for i ∈ LE(s),   (7)

y_i (β_0 + ∑_{j∈V(s)} β_j x_{ij}) = 1  for i ∈ E(s), and

‖β‖₁ = ∑_{j=1}^p sign(β_j) β_j = s,

with the KKT conditions (6). Now, taking the right derivatives in (7) with respect to s, we have

∑_{i∈E(s)} (Δα_i/Δs) y_i x_{ij} + ∑_{i∈LE(s)} (Δγ_i/Δs) y_i x_{ij} − (Δλ/Δs) sign(β_j) = 0  for j ∈ V(s),

∑_{i∈E(s)} (Δα_i/Δs) y_i + ∑_{i∈LE(s)} (Δγ_i/Δs) y_i = 0,

Δβ_0/Δs + ∑_{j∈V(s)} (Δβ_j/Δs) x_{ij} = 0  for i ∈ LE(s) or E(s), and   (8)

∑_{j∈V(s)} sign(β_j) (Δβ_j/Δs) = 1.

Solving the system of linear equations (8), we obtain Δβ_j/Δs for j ∈ V(s) and grow (β̂_0(s), β̂(s)) by β̂_0(s + h) = β̂_0(s) + h Δβ_0/Δs and β̂_j(s + h) = β̂_j(s) + h Δβ_j/Δs for j ∈ V(s). It is not difficult to show that (β̂_0(s + h), β̂(s + h)) is the solution of (7) with ‖β‖₁ = s + h unless either E(s + h) or LE(s + h) changes. In other words, we grow (β̂_0(s + h), β̂(s + h)) until E(s + h)

Table 1
The algorithm for the l1 SVM with a reject option (pseudo-code).

1. Find LL(0+), LE(0+), LR(0+), E(0+), R(0+), and V(0+).
2. Repeat until all coefficients corresponding to the solution are nonzero:
   (a) Compute the increase in s.
   (b) Grow the active coefficients linearly until the l1 norm becomes s.
   (c) Update the sets LL(s), LE(s), LR(s), E(s), R(s), and V(s).
   (d) Compute the new right derivatives using the linear system (8).

Table 2
Computational cost (CPU time in seconds) with varying p (standard errors in parentheses).

p      L1SVM CPU time   L1SVM-R CPU time
25     0.246 (0.005)    0.381 (0.016)
50     0.334 (0.009)    0.529 (0.035)
100    0.430 (0.014)    0.742 (0.040)
200    0.648 (0.025)    1.224 (0.098)
1000   2.458 (0.077)    5.647 (0.702)
2000   4.869 (0.170)    15.802 (1.763)

or LE(s + h) is changed. Next, we update V(s + h) by either adding a new coefficient that is not in V(s) but satisfies (7) or deleting a coefficient that is in V(s) and becomes zero; then, we re-solve (8) and further grow (β̂_0, β̂) in a similar manner. Starting this procedure with V(0+), which is obtained by minimizing (4) with only one coefficient via a line search, we obtain the entire solution path. We summarize the proposed algorithm for the l1 SVM with a reject option in Table 1.

3.2. Computational complexity

The computational complexity of the proposed algorithm is the complexity of solving (8) multiplied by the number of kinks. Note that the number of equations in (8) is |E(s)| + |LE(s)| + |V(s)| + 2. The standard l1 SVM algorithm developed by Zhu et al. (2004) solves a system of |E(s)| + |V(s)| + 2 equations whenever E(s) changes. The proposed algorithm deals with two nondifferentiable points; that is, the number of nondifferentiable points is exactly twice that of the standard l1 SVM algorithm of Zhu et al. (2004). Suppose that |E(s)| = |LE(s)|. Since the complexity of solving a system of linear equations is proportional to the square of the number of equations, the computing time of the proposed algorithm at each kink is approximately 4 times that of the standard l1 SVM. Since the number of kinks equals the sample size in the worst case, we conclude that the overall computational complexity of the l1 SVM with a reject option is approximately 4 times larger than that of the standard l1 SVM.
Now, we compare the computational cost of the l1 SVM with a reject option (L1SVM-R) with that of the standard l1 SVM (L1SVM) algorithm using a small simulation. The experiment was conducted in a Windows environment with a 2.33 GHz Intel Core 2 Duo processor and 2 GB RAM. Training observations of size n = 50 were generated from the model with the same parameters as in Section 4.1. The elapsed computing times (in seconds) were recorded using the system.time() function in R. We replicated this process 20 times and computed the average elapsed times and corresponding standard errors.
Table 2 shows the results for various p. The numbers reported in Table 2 denote the average CPU times in seconds with
their standard errors in parentheses. In summary, our path algorithm, L1SVM-R, is about 2 to 4 times slower than the L1SVM,
which confirms our theoretical calculations.

4. Numerical results

In this section, we compare L1SVM-R with the L1SVM, which does not have a reject option, by analyzing simulated as well as real data sets. In particular, we study how the reject option affects prediction accuracy and gene selectivity. Further,
the prediction accuracy of the l1 SVM with a reject option is compared with that of the l2 SVM with a reject option by
analyzing real data sets.

4.1. Simulation I

For the simulation, we generate a sample of size n as follows. Let μ₊ be a p-dimensional vector whose first q entries are D and whose other p − q entries are zero, and let μ₋ = −μ₊. Then, we generate x from N_p(μ₊, Σ) and assign y = 1 for the first n/2 observations, and generate x from N_p(μ₋, Σ) and assign y = −1 for the last n/2 observations. We let the (k, l) entry of Σ be r^{|k−l|}, where r ∈ [0, 1). It should be noted that all input variables except the first q are noise.
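The data-generating scheme can be sketched as follows (assuming NumPy; the function name is ours):

```python
import numpy as np

def simulate(n, p, q, D, r, rng):
    """Generate data as in Simulation I: class means +/- mu with first q
    entries equal to D, and covariance Sigma[k, l] = r**|k - l|."""
    mu = np.zeros(p)
    mu[:q] = D
    idx = np.arange(p)
    Sigma = r ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type structure
    X_pos = rng.multivariate_normal(mu, Sigma, size=n // 2)   # y = +1
    X_neg = rng.multivariate_normal(-mu, Sigma, size=n // 2)  # y = -1
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
    return X, y
```

With r = 0, Σ reduces to the identity, so all inputs are independent and only the first q carry signal.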
It is standard procedure to fix the reject cost d in advance. However, the choice of d affects the estimated decision boundary. Fig. 4 compares the prediction scores f̂(x_i) of the L1SVM-R for various d values based on a simulated data set with λ fixed. The figure shows that as the value of d decreases, the variation in the prediction scores increases. This is because a classifier with a smaller reject cost tends to push observations far away from the boundary, since φ_d assigns more weight to misclassified observations. When the purpose of using a reject option is to improve the prediction accuracy by not using observations close to the decision boundary, it is natural to choose an optimal d instead of fixing it in advance. In this section, we use this approach and choose an optimal d using either validation samples or a model selection criterion.

Fig. 4. Box-plots of the prediction scores f̂(x) according to d values.

Fig. 5. Comparison of the empirical margins of the L1SVM and L1SVM-R: (a) the scatter plot of margins; (b) the histogram of p-values of the permutation test.
Fig. 5(a) compares the empirical margins yi f (xi ) of the L1SVM and L1SVM-R calculated on a simulated data set with
p = 100, q = 5, D = 1.0, r = 0.0, and n = 100. We can observe that the margins of L1SVM-R are larger than those
of L1SVM, particularly for large margin values. To assess the statistical significance of this difference, we applied a permutation test. Fig. 5(b) shows the histogram of 100 p-values of the permutation test obtained on 100 simulated data sets of size 100, which indicates that the difference in margins is statistically significant.
Table 3 compares the prediction accuracy as well as the variable selectivity. The training sample size is 100, and the regularization parameters (d, λ) are selected on the basis of an independent validation sample of size 100. Next, we choose the optimal δ, which minimizes the empirical risk of the 0-1 loss with a reject option. Misclassification errors are calculated on the basis of another independent test sample of size 2000. In the table, Total MIS denotes the misclassification error rates obtained based on all observations in a test sample; Accept MIS, based only on observations accepted by an L1SVM-R classifier; and Reject MIS, based on rejected observations. Reject denotes the proportion of rejected observations. We repeat the simulation 20 times and report the average misclassification errors with their standard errors in parentheses.
The p-values are obtained by the paired t-test with 20 paired error rates of L1SVM-R and L1SVM. In the table, Czeros and
Cnzeros are, respectively, the average numbers of correct zero coefficients (true is zero and estimated as zero) and correct
nonzero coefficients (true is nonzero and estimated as nonzero) in the estimated models, Count represents the frequencies
of the first q coefficients being estimated as nonzero among 20 simulations, and Others represents the frequencies of
Table 3
Comparison of the prediction accuracy and variable selectivity of the L1SVM and L1SVM-R: average misclassification errors and average numbers of coefficients (standard errors in parentheses).

p     r    Method    Total MIS      Accept MIS     Reject MIS     Reject         p-value  Czeros           Cnzeros        Count           Others
100   0    L1SVM     0.151 (0.003)  0.134 (0.003)  0.505 (0.017)  -              -        90.90 (0.976)    5 (0)          20 20 20 20 20  0.863
           L1SVM-R   0.147 (0.003)  0.131 (0.003)  0.483 (0.013)  0.048 (0.007)  0.003    90.35 (1.164)    5 (0)          20 20 20 20 20  0.979
      0.3  L1SVM     0.215 (0.003)  0.193 (0.003)  0.480 (0.013)  -              -        92.45 (0.634)    4.300 (0.147)  20 17 17 17 15  0.537
           L1SVM-R   0.211 (0.003)  0.191 (0.004)  0.470 (0.024)  0.078 (0.010)  0.002    90.95 (0.752)    4.500 (0.154)  20 18 18 17 17  0.853
      0.6  L1SVM     0.258 (0.004)  0.236 (0.004)  0.493 (0.013)  -              -        89.15 (1.186)    3.550 (0.198)  19 10 11 12 19  1.232
           L1SVM-R   0.251 (0.004)  0.230 (0.004)  0.515 (0.024)  0.087 (0.013)  0.001    89.25 (1.015)    3.750 (0.176)  18 14 13 12 18  1.211
2000  0    L1SVM     0.185 (0.007)  0.160 (0.008)  0.478 (0.021)  -              -        1981.65 (7.156)  4.650 (0.263)  20 20 18 20 15  0.108
           L1SVM-R   0.178 (0.006)  0.153 (0.006)  0.465 (0.018)  0.077 (0.008)  0.003    1984.25 (6.159)  4.650 (0.263)  20 20 17 20 16  0.118
      0.3  L1SVM     0.231 (0.006)  0.208 (0.006)  0.492 (0.016)  -              -        1988.65 (3.966)  4.100 (0.287)  18 18 15 13 18  0.052
           L1SVM-R   0.224 (0.005)  0.200 (0.005)  0.460 (0.023)  0.084 (0.012)  0.004    1989.80 (2.598)  4.050 (0.340)  17 18 14 14 18  0.052
      0.6  L1SVM     0.279 (0.005)  0.254 (0.006)  0.459 (0.014)  -              -        1992.30 (1.920)  2.650 (0.567)  12 12 9 8 12    0.031
           L1SVM-R   0.277 (0.005)  0.250 (0.006)  0.465 (0.013)  0.119 (0.016)  0.068    1991.90 (1.360)  2.800 (0.573)  12 12 9 10 13   0.031

Table 4
Prediction results of the L1SVM, L1SVM-R, and L1SVM-R (remeasure): average misclassification errors (standard errors).
σ_e   r    L1SVM          L1SVM-R        L1SVM-R (remeasure)  pv1    pv2    pv3
0.5   0    0.217 (0.007)  0.212 (0.007)  0.208 (0.007)        0.001  0.000  0.000
      0.3  0.259 (0.007)  0.253 (0.007)  0.249 (0.007)        0.000  0.000  0.000
      0.6  0.312 (0.010)  0.311 (0.010)  0.307 (0.010)        0.447  0.016  0.000
1     0    0.309 (0.009)  0.302 (0.009)  0.292 (0.009)        0.002  0.000  0.000
      0.3  0.325 (0.010)  0.319 (0.010)  0.312 (0.010)        0.017  0.000  0.000
      0.6  0.365 (0.011)  0.360 (0.011)  0.355 (0.012)        0.005  0.000  0.000

Table 5
Real data sets.

Reference                    Data      (n, p)         (+1, −1)
Bhattacharjee et al. (2001)  Breast    (110, 2379)    (98, 12)
Iwao et al. (2002)           Lung      (203, 12602)   (186, 17)
Yukinawa et al. (2006)       Thyroid   (168, 2000)    (128, 40)
Petricoin et al. (2002)      Ovarian   (252, 15154)   (162, 90)
Singh et al. (2002)          Prostate  (102, 12600)   (52, 50)

selected noisy variables. In all simulations, we set D = 0.5 and q = 5. In the table, the smallest misclassification error rates
are highlighted in bold face.
L1SVM-R has lower misclassification errors than the L1SVM in all cases except two (Reject MIS with p = 100, r = 0.6 and with p = 2000, r = 0.6), and the improvements are statistically significant in most cases. Further, the misclassification error rates on rejected observations are around 0.5, which indicates that the L1SVM-R successfully selects observations near the decision boundary. One reason for the superior prediction performance of L1SVM-R is its use of only high-quality samples (samples far away from the decision boundary), which makes the corresponding classifier robust to less informative samples, which are usually located near the decision boundary, and hence yields better prediction accuracy. Using only high-quality data is not easy to realize directly, since the decision boundary is unknown before it is constructed; L1SVM-R implements this idea in a suitable way. In terms of variable selectivity, L1SVM-R and L1SVM are clearly competitive. The results of the simulations show that L1SVM-R significantly improves prediction accuracy without hampering variable selectivity.

4.2. Simulation II

The reject option can be used to further improve accuracy when measurement errors exist in x. For a given y ∈ {−1, 1}, let z follow a multivariate normal distribution with mean μ_y and variance Σ. Here, D in μ_y and the structure of Σ are the same as those in Simulation I. However, instead of observing z, suppose that we observe x = z + ε, where ε follows a multivariate normal distribution with mean 0 and variance σ_e² I. We consider the following classification strategy. Let R be the index set of rejected observations in the test samples. Then, for i ∈ R, we measure x_i one more time and denote the new measurement by x̃_i. Next, we replace the original x_i by (x_i + x̃_i)/2 and make the prediction based on (x_i + x̃_i)/2.
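The benefit of remeasuring is easy to see: averaging two independent measurements halves the measurement-noise variance. A small Monte Carlo check of this fact (our own illustration, with the true signal fixed at zero):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_e = 1.0
n = 100_000

z = np.zeros(n)                              # true (noise-free) values
x = z + rng.normal(0.0, sigma_e, n)          # first noisy measurement
x_tilde = z + rng.normal(0.0, sigma_e, n)    # independent remeasurement
x_avg = (x + x_tilde) / 2                    # noise variance sigma_e**2 / 2
```

The empirical variance of `x_avg` is close to σ_e²/2, half that of a single measurement, which is why remeasuring only the rejected (hardest) observations is an economical way to sharpen their predictions.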
We perform the simulation using three methods, including the new strategy: (1) no rejection for the training and test data sets (L1SVM), (2) a reject option for the training and test data sets (L1SVM-R), and (3) a reject option plus prediction with remeasurement of a test observation when it is rejected (L1SVM-R (remeasure)). Table 4 shows the results for two measurement noise levels, σ_e = 0.5 and σ_e = 1, with p = 2000. In the table, pv1, pv2, and pv3 are p-values obtained via paired t-tests between L1SVM and L1SVM-R, L1SVM and L1SVM-R (remeasure), and L1SVM-R and L1SVM-R (remeasure), respectively. We can see that remeasurement with a reject option improves the prediction accuracy significantly in almost all cases.

4.3. Analysis of gene expression data

In this section, we analyze four gene expression data sets and a mass spectrometry data set; all of these data sets are publicly available. Basic information on the data sets is given in Table 5, and detailed descriptions are given below.
The Breast cancer data set contains 98 female breast cancer samples and 12 normal samples. These data were obtained using adapter-tagged competitive PCR instead of conventional cDNA microarrays.
The Lung cancer data set consists of a total of 203 snap-frozen lung tumor samples and normal lung specimens. Of these,
125 adenocarcinoma samples were associated with clinical data and with histological slides from adjacent sections.
The 203 specimens include histologically defined lung adenocarcinomas, squamous cell lung carcinomas, pulmonary
carcinoids, SCLC cases, and normal lung specimens. Other adenocarcinomas were suspected to be extrapulmonary
metastases based on clinical history.
The third data set pertains to Thyroid cancer, a relatively common cancer accounting for roughly 1% of total cancer incidence. There are two main types of thyroid cancer: papillary carcinoma (PC) and follicular carcinoma (FC). In addition to these malignant types, a benign tumor, follicular adenoma (FA), is also prevalent.

Table 6
Test error rates of L2SVM and L2SVM-R.
Method Breast Lung Thyroid Ovarian Prostate

L2SVM 0.090 (0.003) 0.033 (0.008) 0.256 (0.014) 0.011 (0.002) 0.154 (0.015)
L2SVM-R 0.070 (0.002) 0.028 (0.003) 0.230 (0.008) 0.010 (0.003) 0.137 (0.011)

Table 7
Prediction accuracies, reject proportions, p-values, and mean model sizes (Vsize, |V|), with the corresponding standard errors in parentheses, of the L1SVM and L1SVM-R on 5 real data sets.

Data      Method    Total MIS      Accept MIS     Reject MIS     Reject         p-value  Vsize
Breast    L1SVM     0.017 (0.007)  0.017 (0.007)  0 (0)          -              -        1.400 (0.197)
          L1SVM-R   0.003 (0.001)  0.002 (0.001)  0.050 (0.050)  0.001 (0.001)  0.041    1.150 (0.082)
Lung      L1SVM     0.011 (0.001)  0.010 (0.001)  0.150 (0.180)  -              -        3.400 (0.440)
          L1SVM-R   0.011 (0.001)  0.009 (0.001)  0.150 (0.180)  0.001 (0.000)  0.165    2.600 (0.180)
Thyroid   L1SVM     0.254 (0.018)  0.235 (0.018)  0.512 (0.071)  -              -        3.450 (0.872)
          L1SVM-R   0.235 (0.009)  0.218 (0.011)  0.481 (0.068)  0.059 (0.011)  0.097    2.850 (0.386)
Ovarian   L1SVM     0.011 (0.004)  0.009 (0.003)  0.389 (0.200)  -              -        7.550 (0.682)
          L1SVM-R   0.007 (0.001)  0.005 (0.001)  0.661 (0.200)  0.003 (0.001)  0.140    5.950 (0.540)
Prostate  L1SVM     0.153 (0.019)  0.150 (0.018)  0.615 (0.104)  -              -        6.500 (0.605)
          L1SVM-R   0.139 (0.013)  0.126 (0.013)  0.339 (0.098)  0.026 (0.008)  0.158    6.900 (0.692)

The fourth data set corresponds to mass spectrometry data and consists of 90 normal samples and 162 Ovarian cancer samples. A normal sample refers to a blood (more precisely, serum) sample collected from a healthy patient, while a cancer sample refers to a blood sample collected from an ovarian cancer patient. Each sample is described by 15 154 mass/charge identities of proteins in the blood sample.
Finally, the Prostate cancer data set has 102 samples (52 prostate cancer samples and 50 normal samples) and 12 600
genes.

To calculate misclassification errors, we divide each data set into two parts, a training set and a test set, by randomly
selecting 2/3 and 1/3 of the observations, respectively. We construct a classifier on the training data and select the
regularization parameters (d, λ) by minimizing the following BIC-type criterion (Schwarz, 1978) calculated on the training data:
$$\frac{1}{n}\sum_{i=1}^{n}\phi_d\big(y_i f(x_i)\big) + \frac{\log n}{2n}\,|V|,$$
where |V| is the number of nonzero coefficients in β. Refer to Zou et al. (2007) for a justification of the BIC-type criterion.
The optimal value of the rejection threshold δ is selected by minimizing the empirical average of the 0-1 loss with a reject option. Next, we measure
the misclassification error on the test data. We repeat this procedure 20 times and summarize the results in Tables 6 and 7.
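As a concrete illustration of the selection procedure above, the sketch below computes the BIC-type criterion and the empirical 0-1 loss with a reject option. The generalized-hinge surrogate phi_d used here (negative-side slope (1-d)/d, after Bartlett and Wegkamp, 2008) and all function and variable names are assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def generalized_hinge(z, d):
    """Surrogate loss phi_d(z); the (1-d)/d slope on z < 0 follows
    Bartlett and Wegkamp (2008) and is an assumption of this sketch."""
    return np.where(z < 0.0, 1.0 - (1.0 - d) / d * z, np.maximum(0.0, 1.0 - z))

def bic_type_criterion(y, f, beta, d):
    """(1/n) * sum_i phi_d(y_i f(x_i)) + (log n / (2n)) * |V|,
    where |V| counts the nonzero coefficients in beta."""
    n = len(y)
    V = np.count_nonzero(beta)
    z = np.asarray(y, dtype=float) * np.asarray(f, dtype=float)
    return generalized_hinge(z, d).mean() + np.log(n) / (2.0 * n) * V

def reject_01_loss(y, f, delta, d):
    """Empirical 0-1 loss with a reject option: cost d for a rejected
    sample (|f(x)| <= delta), cost 1 for a misclassified accepted sample."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    reject = np.abs(f) <= delta
    error = (np.sign(f) != y) & ~reject
    return (error + d * reject).mean()
```

On a grid of candidate (d, λ) values one would fit the classifier, evaluate `bic_type_criterion` on the training data, and then pick δ by minimizing `reject_01_loss`.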
To illustrate the importance of gene selection in real data analysis, we compare l2 SVM (Hastie et al., 2004, denoted by
L2SVM) and l2 SVM with a reject option (Bartlett and Wegkamp, 2008, denoted by L2SVM-R) with the L1SVM and L1SVM-R.
L2SVM-R is implemented using quadratic programming. The results are presented in Table 6. L2SVM-R is always superior
to L2SVM; this fact, when considered with the results of Table 7 (Total MIS), indicates that the methods with the l1 penalty
show better performance than those with the l2 penalty. Therefore, gene selection is indispensable for constructing accurate
predictive models.
We now compare L1SVM-R with L1SVM in detail. The results in Table 7 show that L1SVM-R performs consistently better than
L1SVM in terms of prediction. In terms of Total MIS and Accept MIS, L1SVM-R performs better than L1SVM in 4 out of 5 data
sets (the Breast, Thyroid, Ovarian, and Prostate data sets), while their performances are similar for the Lung data
set. Further, in most cases L1SVM-R has lower error rates even for rejected samples. The numbers of selected genes (Vsize in
the table) are similar.
The improvements of L1SVM-R over L1SVM are not statistically significant in many cases. However, it should be noted
that L1SVM is already a highly accurate classifier, and it is very difficult to significantly improve highly accurate classifiers on
real data sets. Many recent studies on the development of new learning algorithms have instead investigated consistency, that
is, whether a proposed algorithm consistently improves an old but highly accurate method on various data sets. For further
examples, refer to Zhang et al. (2006) and Wang et al. (2008). In this context, we can conclude that the proposed l1 SVM
with a reject option is a useful new tool for gene expression data.
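The paper does not state how the p-values in Table 7 were computed. One natural way to compare paired error rates over the 20 random splits is a sign-flip permutation test on the per-split differences; the function below and the hypothetical error vectors are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def paired_signflip_pvalue(errs_a, errs_b, n_perm=10000, seed=0):
    """Two-sided p-value for the mean paired difference under random
    sign flips of the per-split differences (hypothetical procedure)."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(errs_a, dtype=float) - np.asarray(errs_b, dtype=float)
    observed = abs(diff.mean())
    # Randomly flip the sign of each paired difference to build the null.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Hypothetical per-split test errors for two methods over 20 random splits.
rng = np.random.default_rng(1)
errs_a = 0.15 + 0.01 * rng.standard_normal(20)
errs_b = errs_a - 0.02  # method B uniformly better by 0.02
```

Since method B is better on every split in this toy example, the resulting p-value is small; for identical error vectors the p-value is exactly 1.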
To assess the stability of gene selection, we investigated the most frequently selected genes in the Ovarian cancer data
set by L1SVM-R and L1SVM among 20 random partitions and found that the top 5 genes matched. Table 8 presents the
frequencies of the selection of the 5 most important genes by L1SVM-R and L1SVM. Note that L1SVM-R never misses the
first two of the most important genes while L1SVM misses some. This result suggests that L1SVM-R is more stable in gene
selection than L1SVM.

Table 8
Frequencies of the selection of the 5 most important genes identified by L1SVM and L1SVM-R in the Ovarian cancer data.

Method     X508022   X702461   X8832   X180015   X657213
L1SVM      18        16        16      12        11
L1SVM-R    20        20        16      10        8

Fig. 6. Various surrogate losses.

5. Conclusion

In this paper, we proposed a learning method that can simultaneously select signal genes and produce a highly accurate
predictive model by incorporating a reject option. In addition, we developed an efficient computational algorithm that
realizes an entire solution path and is able to work with high-dimensional gene expression data without much difficulty.
Analysis of simulated and real data sets suggested the strong potential of the proposed method for gene
expression data analysis. A disadvantage of the proposed method is that there are multiple regularization parameters (λ, d);
this issue was resolved by employing the BIC-type criterion.
There are various applications of the reject option. For example, we can remeasure gene expression levels of rejected
samples. Refer to Section 4.2 for an example. Remeasurement further improves prediction accuracy, particularly in cases of
measurement errors in gene expression levels. We examined this possibility only by means of simulation since we do not
have real data for remeasurement. We aim to further pursue this issue in the near future.
An opposite approach to using the reject option is to downgrade misclassified observations located far away from
the decision boundary. With this approach, we can construct a classifier that is robust to outliers. An example is the ψ-learning method
proposed by Shen et al. (2003). The surrogate loss of ψ-learning is the same as the hinge loss on the positive side but
is smaller than the hinge loss on the negative side. This contrasts with the surrogate loss for the l1 SVM with a reject option, which
is larger than the hinge loss on the negative side. We can consider a modification of the ψ-learning method in which
the surrogate loss has a negative slope on the negative side; see Fig. 6. Interestingly, this surrogate loss corresponds to the
surrogate loss of the l1 SVM with a reject option with d > 1/2. This suggests that two seemingly different approaches,
the reject option and ψ-learning, may have some interesting relationship. We will attempt to tackle this issue in future
research.
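To make these loss shapes concrete, the following sketch evaluates the hinge loss and a generalized-hinge reject-option loss with negative-side slope (1-d)/d. This parameterization (after Bartlett and Wegkamp, 2008) and the illustrative d values are assumptions of the sketch: for d < 1/2 the reject-option loss lies above the hinge on the negative side, while for d > 1/2 it lies below it, closer to the ψ-learning-like shape discussed above.

```python
import numpy as np

def hinge(z):
    # Standard hinge loss max(0, 1 - z).
    return np.maximum(0.0, 1.0 - z)

def reject_hinge(z, d):
    # Generalized hinge: slope (1-d)/d on the negative side (assumed form).
    return np.where(z < 0.0, 1.0 - (1.0 - d) / d * z, hinge(z))

z = -1.0
# d < 1/2: the reject-option loss exceeds the hinge on the negative side.
above = reject_hinge(z, d=0.25)   # 1 + 3*1 = 4.0, versus hinge(-1) = 2.0
# d > 1/2: it falls below the hinge, downweighting far-away errors less steeply.
below = reject_hinge(z, d=0.75)   # 1 + (1/3)*1 ~ 1.33, versus hinge(-1) = 2.0
```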

Acknowledgements

We are grateful to the anonymous referees and the co-editor for their helpful comments. Choi's work was supported by
the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of
Education, Science and Technology (2010-0003377). Yeo's work was supported by the Priority Research Centers Program
through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology
(2009-0093827). Kim's work was supported by the Korea Research Foundation Grant funded by the Korean Government
(KRF-2008-314-C00046).

References

Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene expression data. Proceedings of the National Academy of Sciences USA 99, 6562–6566.
Balakrishnan, S., Madigan, D., 2008. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research 9, 313–337.
Bartlett, P., Wegkamp, M.H., 2008. Classification with a reject option using a hinge loss. Journal of Machine Learning Research 9, 1823–1840.
Bertsekas, D.P., 2003. Nonlinear Programming, second ed. Athena Scientific.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., Meyerson, M., 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences USA 98, 13790–13795.
Chow, C.K., 1970. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16, 41–46.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32, 407–499.
Genkin, A., Lewis, D.D., Madigan, D., 2007. Large-scale Bayesian logistic regression for text categorization. Technometrics 49, 291–304.
Greenshtein, E., 2006. Best subset selection, persistence in high-dimensional statistical learning and optimization under l1 constraint. The Annals of Statistics 34, 2367–2386.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 389–422.
Hale, E., Yin, W., Zhang, Y., 2008. Fixed-point continuation for l1-minimization: methodology and convergence. SIAM Journal on Optimization 19, 1107–1130.
Hall, P., Marron, J.S., Neeman, A., 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society. Series B 67, 427–444.
Hastie, T., Rosset, S., Tibshirani, R., Zhu, J., 2004. The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning, first ed. Springer-Verlag, New York.
Herbei, R., Wegkamp, M.H., 2006. Classification with reject option. The Canadian Journal of Statistics 34, 709–721.
Hong, P., Liu, S., Zhou, Q., Lu, X., Liu, S., Wong, H., 2005. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 21, 2636–2643.
Iwao, K., Matoba, R., Ueno, N., Ando, A., Miyoshi, Y., Matsubara, K., Noguchi, S., Kato, K., 2002. Molecular classification of primary breast tumors possessing distinct prognostic properties. Human Molecular Genetics 11, 199–206.
Kim, J., Kim, Y., Kim, Y., 2008. A gradient-based optimization algorithm for lasso. Journal of Computational and Graphical Statistics 17, 994–1009.
Koh, K., Kim, S.J., Boyd, S., 2007. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research 8, 1519–1555.
Landgrebe, T.C.W., Tax, D.M.J., Paclik, P., Duin, R.P.W., 2006. The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters 27, 908–917.
Liao, J.G., Chin, K.V., 2007. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23, 1945–1951.
McLachlan, G.J., 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
Meier, L., van de Geer, S., Bühlmann, P., 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society. Series B 70, 53–71.
Park, M.Y., Hastie, T., 2007. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society. Series B 69, 659–677.
Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.A., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A., 2002. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577.
Rosset, S., Zhu, J., 2007. Piecewise linear regularized solution paths. The Annals of Statistics 35, 1012–1030.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
Shen, X., Tseng, G.C., Zhang, X., Wong, W.H., 2003. On ψ-learning. Journal of the American Statistical Association 98, 724–734.
Shevade, S.K., Keerthi, S.S., 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19, 2246–2253.
Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., Sellers, W.R., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.
Terrence, S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B 58, 267–288.
Tortorella, F., 2000. An optimal reject rule for binary classifiers. Lecture Notes in Computer Science 1876, 611–620.
Wang, L., Zhu, J., Zou, H., 2008. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24, 412–419.
Wegkamp, M.H., 2007. Lasso type classifiers with a reject option. Electronic Journal of Statistics 1, 155–168.
Wu, T.T., Lange, K., 2008. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 2, 224–244.
Yukinawa, N., Oba, S., Kato, K., Taniguchi, K., Iwao-Koizumi, K., Tamaki, Y., Noguchi, S., Ishii, S., 2006. A multi-class predictor based on a probabilistic model: application to gene expression profiling-based diagnosis of thyroid tumors. BMC Bioinformatics 7.
Zhang, H.H., Ahn, J., Lin, X., Park, C., 2006. Gene selection using support vector machines with nonconvex penalty. Bioinformatics 22, 88–95.
Zhang, H.H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R., Klein, B., 2004. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association 99, 659–672.
Zhu, J., Rosset, S., Hastie, T., Tibshirani, R., 2004. 1-norm support vector machines. In: Thrun, S., et al. (Eds.), Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge, MA.
Zou, H., Hastie, T., Tibshirani, R., 2007. On the degrees of freedom of the lasso. The Annals of Statistics 35, 2173–2192.
