Part 1

Introduction
Necessary mathematical concepts
Support vector machines for binary classification: classical formulation
Basic principles of statistical machine learning
Introduction

[Figure: classification example. A classifier model takes a patient's gene expression profile, obtained from a biopsy, and predicts whether the cancer is metastatic.]

[Figure: regression example. A regression model takes a patient's biomarkers and clinical and demographic data and outputs, e.g., "optimal dosage is 5 IU/kg/week".]

[Figure: clustering example. Unlabeled data are partitioned into Cluster #1, Cluster #2, Cluster #3, and Cluster #4.]
All objects before the coast line are boats, and all objects after the coast line are houses. The coast line serves as a decision surface that separates the two classes.

[Figures: boats and houses plotted by their geographic coordinates (latitude on one axis); the coast line is the decision surface, and new unlabeled objects ("?") are classified according to the side of the coast line on which they fall.]
[Figure: normal patients and cancer patients plotted by the expression of Gene X; a threshold on Gene X separates the two classes.]

[Figure: when normal and cancer patients are not separable by a linear decision surface in the original space, a kernel maps the data to a space in which a linear decision surface separates the two classes.]
If such a linear decision surface does not exist, the data are mapped into a much higher-dimensional space (the "feature space"), where a separating decision surface is found.
The feature space is constructed via a clever mathematical projection known as the "kernel trick".
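To make the idea concrete, here is a minimal sketch using scikit-learn (whose SVC class wraps LIBSVM, listed in the Software section); the concentric-circles dataset and the parameter values are illustrative assumptions, not part of the tutorial:

```python
# Linear SVM vs. kernel SVM on data with no linear decision surface.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)  # tries to separate in input space
rbf_svm = SVC(kernel="rbf").fit(X, y)        # kernel trick: implicit feature space

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor (~0.5)
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # near 1.0
```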
[Figure: number of SVM-related publications per year in the general sciences and in biomedicine, 1998 to 2007; both counts grow rapidly, e.g. biomedical publications rise from 4 in 1998 to about 1,190 in 2007.]
[Figure: patients represented as points/vectors in the space spanned by cholesterol, systolic BP, and age; every vector has its tail at the origin (0, 0, 0).]

Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1 | 150 | 110 | 35 | (0,0,0) | (150, 110, 35)
2 | 250 | 120 | 30 | (0,0,0) | (250, 120, 30)
3 | 140 | 160 | 65 | (0,0,0) | (140, 160, 65)
4 | 300 | 180 | 45 | (0,0,0) | (300, 180, 45)
A decision surface in R3 is a plane (in R2 it would be a line).

[Figure: a decision surface in R3 separating the patients.]
Vector scaling: multiplying a vector by a scalar stretches, shrinks, or flips it.

[Figure: a = (1,2); with c = 2, ca = (2,4); with c = -1, ca = (-1,-2).]
Vector addition:

[Figure: a = (1,2), b = (3,0), a + b = (4,2).]
Vector subtraction:

[Figure: a = (1,2), b = (3,0), a - b = (-2,2).]
Define the L2-norm:

$\|a\|_2 = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$

Example: for a = (1,2), $\|a\|_2 = \sqrt{5} \approx 2.24$; the length of this vector is about 2.24.
The law of cosines gives $a \cdot b = \|a\|_2 \|b\|_2 \cos\theta$, where $\theta$ is the angle between the vectors; hence a and b are perpendicular exactly when a · b = 0.

[Figure: a = (1,2), b = (3,0): a · b = 3, not perpendicular; a = (0,2), b = (3,0): a · b = 0, perpendicular.]
Property: $a \cdot a = a_1 a_1 + a_2 a_2 + \cdots + a_n a_n = \|a\|_2^2$.

In the classical regression equation y = w · x + b, the response variable y is just a dot product of the weight vector w with the vector of predictor values x, shifted by the constant b.
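The vector operations above can be checked numerically; a small NumPy sketch using the same example vectors:

```python
# Scaling, addition, norms, and dot products of the tutorial's example vectors.
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(2 * a)              # scaling: [2. 4.]
print(a + b, a - b)       # addition/subtraction: [4. 2.] [-2.  2.]
print(np.linalg.norm(a))  # L2-norm: sqrt(5) ~ 2.236

print(np.dot(a, b))                        # 3.0 -> not perpendicular
print(np.dot(np.array([0.0, 2.0]), b))     # 0.0 -> perpendicular
print(np.dot(a, a), np.linalg.norm(a)**2)  # a.a equals ||a||_2^2
```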
A hyperplane in R3 is a plane.

[Figure: a plane in R3.]
Equation of a hyperplane

First, we illustrate the definition of a hyperplane with an interactive demonstration:
http://www.dsl-lab.org/svm_tutorial/planedemo.html
Source: http://www.math.umn.edu/~nykamp/
Equation of a hyperplane

Consider the case of R3: a hyperplane through the point P0 (with position vector x0) and with normal vector w consists of all points x such that

$w \cdot (x - x_0) = 0 \;\Rightarrow\; w \cdot x - w \cdot x_0 = 0 \;\Rightarrow\; w \cdot x + b = 0, \quad b = -w \cdot x_0$

The above equations also hold for Rn when n > 3.
Equation of a hyperplane

Example: w = (4, -1, 6), P0 = (0, 1, -7). Then

$b = -w \cdot x_0 = -(0 - 1 - 42) = 43$
$w \cdot x + 43 = 0$
$(4, -1, 6) \cdot (x_{(1)}, x_{(2)}, x_{(3)}) + 43 = 0$
$4x_{(1)} - x_{(2)} + 6x_{(3)} + 43 = 0$

[Figure: the family of parallel hyperplanes w · x + 50 = 0, w · x + 43 = 0, w · x + 10 = 0; the normal w at P0 points in the positive direction.]

The distance between two parallel hyperplanes w · x + b1 = 0 and w · x + b2 = 0 is equal to $D = |b_1 - b_2| / \|w\|$.
[Figure: x1 on the hyperplane w · x + b1 = 0 and x2 = x1 + tw on the hyperplane w · x + b2 = 0.]

Take a point x1 on the first hyperplane and move along the normal to x2 = x1 + tw on the second; then $D = \|tw\| = |t|\,\|w\|$. Now:

$w \cdot x_2 + b_2 = 0$
$w \cdot (x_1 + tw) + b_2 = 0$
$w \cdot x_1 + t\|w\|^2 + b_2 = 0$
$(w \cdot x_1 + b_1) - b_1 + t\|w\|^2 + b_2 = 0$
$-b_1 + t\|w\|^2 + b_2 = 0$
$t = (b_1 - b_2) / \|w\|^2$
$D = |t|\,\|w\| = |b_1 - b_2| / \|w\|$
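A minimal sketch of the distance formula as a function; the example reuses w = (4, -1, 6) from above, while the b values are arbitrary:

```python
# Distance between parallel hyperplanes, D = |b1 - b2| / ||w||.
import numpy as np

def parallel_hyperplane_distance(w, b1, b2):
    """Distance between the hyperplanes w.x + b1 = 0 and w.x + b2 = 0."""
    return abs(b1 - b2) / np.linalg.norm(w)

w = np.array([4.0, -1.0, 6.0])
print(parallel_hyperplane_distance(w, 43.0, 10.0))  # |43 - 10| / ||w||
```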
Recap

We know:
- How to represent patients (as vectors)
- How to define a linear decision surface (hyperplane)

We need to know:
- How to efficiently compute the hyperplane that separates two classes with the largest gap

[Figure: cancer patients and normal patients in the Gene X / Gene Y plane, separated by a gap.]
Basics of optimization:
Convex functions

A function is called convex if, for any two points in its domain, the function lies below the straight-line segment connecting those two points.

Property: any local minimum is a global minimum!

[Figure: a convex function has a single global minimum; a non-convex function can have local minima in addition to the global minimum.]
Basics of optimization:
Quadratic programming (QP)

Quadratic programming (QP) is a special optimization problem: the function to optimize (the "objective") is quadratic, subject to linear constraints.

Convex QP problems have convex objective functions.

These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).
Basics of optimization:
Example QP problem

Consider x = (x1, x2).

Minimize $\tfrac{1}{2}\|x\|_2^2$ subject to $x_1 + x_2 - 1 \ge 0$
(quadratic objective, linear constraints)

Equivalently: minimize $\tfrac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 - 1 \ge 0$
(quadratic objective, linear constraints)
Basics of optimization:
Example QP problem

[Figure: the surface f(x1, x2) = (1/2)(x1^2 + x2^2) over the (x1, x2) plane, together with the constraint region x1 + x2 - 1 >= 0.]

The solution is x1 = 1/2 and x2 = 1/2.
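As a sketch, the same toy QP can be solved numerically with SciPy's general-purpose SLSQP solver (a dedicated QP solver would work equally well); the starting point is an assumption:

```python
# Minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 - 1 >= 0.
import numpy as np
from scipy.optimize import minimize

objective = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)             # quadratic objective
constraint = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1}  # "ineq" means fun(x) >= 0

result = minimize(objective, x0=np.array([0.0, 0.0]), constraints=[constraint])
print(result.x)  # -> approximately [0.5, 0.5]
```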
Quiz

1) Consider the hyperplane shown in white, defined by the equation w · x + 10 = 0. Which of the three other hyperplanes can be defined by the equation w · x + 3 = 0?
- Orange
- Green
- Yellow

[Figure: four parallel hyperplanes, with the normal vector w drawn at the point P0.]
Quiz

4) What is the length of the vector shown?

[Figure: a vector drawn in the plane.]

5) Which of the four functions is/are convex?

[Figure: four functions.]
SVMs for binary classification: classical formulation

Training data: $x_1, x_2, \ldots, x_N \in R^n$ with labels $y_1, y_2, \ldots, y_N \in \{-1, +1\}$.

Consider the two parallel hyperplanes
$w \cdot x + b = -1$ and $w \cdot x + b = +1$,
or equivalently
$w \cdot x + (b + 1) = 0$ and $w \cdot x + (b - 1) = 0$.
Negative instances (y = -1) lie on one side, positive instances on the other.

We know that $D = |b_1 - b_2| / \|w\|$; therefore the gap between the two hyperplanes is $D = 2 / \|w\|$. To maximize the gap we need to minimize $\|w\|$, or equivalently minimize $\tfrac{1}{2}\|w\|^2$.

In addition we need to impose constraints that all instances are correctly classified. In our case:
$w \cdot x_i + b \le -1$ if $y_i = -1$
$w \cdot x_i + b \ge +1$ if $y_i = +1$
Equivalently: $y_i (w \cdot x_i + b) \ge 1$.

In summary, we want to minimize $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for i = 1, ..., N.

Then, given a new instance x, the classifier is $f(x) = \mathrm{sign}(w \cdot x + b)$.
SVM optimization problem (primal formulation):

Minimize $\tfrac{1}{2}\|w\|^2$ (objective function)
subject to $y_i (w \cdot x_i + b) - 1 \ge 0$ for i = 1, ..., N (constraints).
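A sketch of this formulation on an assumed toy dataset: scikit-learn's SVC is soft-margin, so a very large C is used here to approximate the hard-margin problem:

```python
# Train a (nearly) hard-margin linear SVM and inspect w, b, and the gap.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, " b =", b)
print("gap D = 2/||w|| =", 2 / np.linalg.norm(w))
print("margins y_i(w.x_i + b):", y * (X @ w + b))  # all >= ~1
```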
SVM optimization problem (dual formulation):

Maximize $\Lambda_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ (objective function)
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ (constraints).

And the solution becomes: $f(x) = \mathrm{sign}\big(\sum_{i=1}^{N} \alpha_i y_i \, x_i \cdot x + b\big)$
How is the dual derived? Start from the primal:

Minimize $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) - 1 \ge 0$ for i = 1, ..., N.

Define the Lagrangian:

$\Lambda_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big( y_i (w \cdot x_i + b) - 1 \big)$

We minimize $\Lambda_P$ over w and b (i.e., require that the derivatives with respect to w and b vanish) and maximize it over the multipliers, all subject to the constraints $\alpha_i \ge 0$.
$\frac{\partial \Lambda_P(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i$
$\frac{\partial \Lambda_P(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$

We substitute the above into the equation for $\Lambda_P(w, b, \alpha)$ and obtain the dual formulation of linear SVMs:

$\Lambda_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
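The stationarity condition w = Σ αi yi xi can be verified on a fitted model; in scikit-learn, dual_coef_ holds the products αi yi for the support vectors. Continuing the toy data from the previous sketch:

```python
# Reconstruct w from the dual coefficients and check the dual constraint.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_  # sum_i (alpha_i y_i) x_i
print(np.allclose(w_from_dual, clf.coef_[0]))           # True
print("sum_i alpha_i y_i =", clf.dual_coef_.sum())      # ~0 (the constraint)
```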
Soft-margin linear SVM: when the data are not perfectly separable, introduce slack variables $\xi_i \ge 0$.

[Figure: correctly classified points have slack $\xi_i = 0$; points inside the margin or misclassified have $\xi_i > 0$.]

Want to minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ for i = 1, ..., N.

Then, given a new instance x, the classifier is $f(x) = \mathrm{sign}(w \cdot x + b)$.
Soft-margin SVM (primal formulation):

Minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ (objective function)
subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for i = 1, ..., N (constraints).

Dual formulation:

Maximize $\Lambda_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ (objective function)
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ for i = 1, ..., N (constraints).
[Figure: effect of the parameter C when minimizing $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$: decision boundaries for C = 100, C = 1, C = 0.15, and C = 0.1. Larger C penalizes margin violations more heavily and yields a narrower margin.]
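A sketch of the effect of C, using the values from the figure on assumed overlapping data:

```python
# Larger C -> narrower margin, fewer tolerated violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])  # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in [100, 1, 0.15, 0.1]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.2f}, "
          f"support vectors = {len(clf.support_)}")
```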
[Figure: tumor vs. normal samples that are not linearly separable in the input space (Gene 1 / Gene 2) become linearly separable after the mapping $\Phi: R^N \to H$ into a feature space H.]
Kernel trick

$f(x) = \mathrm{sign}(w \cdot \Phi(x) + b)$

In the input space $w = \sum_{i=1}^{N} \alpha_i y_i x_i$; in the feature space $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$. Therefore

$f(x) = \mathrm{sign}\big(\sum_{i=1}^{N} \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) + b\big) = \mathrm{sign}\big(\sum_{i=1}^{N} \alpha_i y_i \, K(x_i, x) + b\big)$
Popular kernels

A kernel is a dot product in some feature space: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$.

Examples:
Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
Gaussian kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
Exponential kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|)$
Polynomial kernel: $K(x_i, x_j) = (p + x_i \cdot x_j)^q$
Hybrid kernel: $K(x_i, x_j) = (p + x_i \cdot x_j)^q \exp(-\gamma \|x_i - x_j\|^2)$
Sigmoidal kernel: $K(x_i, x_j) = \tanh(k \, x_i \cdot x_j - \delta)$
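Several of these kernels can be evaluated directly as Gram matrices; a sketch using scikit-learn's pairwise kernels with illustrative parameter values:

```python
# Gram matrices for a few of the kernels listed above.
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, rbf_kernel,
                                      polynomial_kernel, sigmoid_kernel)

X = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 2.0]])

print(linear_kernel(X))                           # K = x_i . x_j
print(rbf_kernel(X, gamma=0.5))                   # K = exp(-gamma ||x_i - x_j||^2)
print(polynomial_kernel(X, degree=2, coef0=1.0))  # K = (gamma x_i.x_j + coef0)^2
print(sigmoid_kernel(X, gamma=0.1, coef0=0.0))    # K = tanh(gamma x_i.x_j + coef0)
```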
The resulting mapping function is a combination of bumps and cavities.
[Figure: a kernel maps the 2-dimensional input $(x_{(1)}, x_{(2)})$ into the 10-dimensional feature space $(1,\; x_{(1)},\; x_{(2)},\; x_{(1)}^2,\; x_{(2)}^2,\; x_{(1)} x_{(2)},\; x_{(1)}^3,\; x_{(2)}^3,\; x_{(1)} x_{(2)}^2,\; x_{(1)}^2 x_{(2)})$.]
Example: consider the kernel $K(x, z) = (x \cdot z)^2$ for $x, z \in R^2$ and the four points x1, x2, x3, x4 shown in the input space:

$K(x, z) = (x \cdot z)^2 = (x_{(1)} z_{(1)} + x_{(2)} z_{(2)})^2 = x_{(1)}^2 z_{(1)}^2 + 2 x_{(1)} z_{(1)} x_{(2)} z_{(2)} + x_{(2)}^2 z_{(2)}^2$
$= \big(x_{(1)}^2,\; \sqrt{2}\, x_{(1)} x_{(2)},\; x_{(2)}^2\big) \cdot \big(z_{(1)}^2,\; \sqrt{2}\, z_{(1)} z_{(2)},\; z_{(2)}^2\big) = \Phi(x) \cdot \Phi(z)$

Therefore, the explicit mapping is $\Phi(x) = \big(x_{(1)}^2,\; \sqrt{2}\, x_{(1)} x_{(2)},\; x_{(2)}^2\big)$.

[Figure: the kernel maps x1, x2 (one class) and x3, x4 (the other class) from the input space $(x_{(1)}, x_{(2)})$ into the feature space $(x_{(1)}^2, \sqrt{2}\, x_{(1)} x_{(2)}, x_{(2)}^2)$, where they become linearly separable.]
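A quick numeric check of this identity (the points x and z are arbitrary):

```python
# Verify K(x, z) = (x.z)^2 equals phi(x).phi(z) for the explicit mapping above.
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

print(np.dot(x, z) ** 2)       # kernel in the input space: 25.0
print(np.dot(phi(x), phi(z)))  # dot product in the feature space: 25.0
```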
Why not simply compute the explicit mapping? Because its size explodes with the polynomial degree and the number of input variables:

Polynomial degree | Number of input variables | Number of parameters | Required sample
1 | 10 | 10 | 50
3 | 10 | 286 | 1,430
5 | 10 | 3,003 | 15,015
3 | 100 | 176,851 | 884,255
5 | 100 | 96,560,646 | 482,803,230
[Figure: supervised learning. An algorithm (Algorithm 1) is fit to training data (predictor X, outcome of interest Y) and evaluated on independent test data.]
SVMs build the following classifiers: $f(x) = \mathrm{sign}(w \cdot x + b)$

Minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ for i = 1, ..., N.

Equivalently, minimize

$\sum_{i=1}^{N} [1 - y_i f(x_i)]_+ \; + \; \lambda \|w\|_2^2$

i.e., a loss term (the hinge loss) plus a penalty term. The hinge loss $[1 - y_i f(x_i)]_+$ is positive exactly when an instance lies inside the margin or is misclassified, e.g., when $f(x_i) < 1$ for $y_i = +1$.

[Figure: points 1-4 relative to the hyperplanes w · x + b = -1, 0, +1, showing which points incur hinge loss.]
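A direct NumPy sketch of this loss-plus-penalty objective for a given (w, b); the data and parameter values are illustrative:

```python
# Compute sum_i [1 - y_i f(x_i)]_+ + lambda ||w||_2^2 for a fixed (w, b).
import numpy as np

def svm_objective(w, b, X, y, lam):
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # [1 - y_i f(x_i)]_+
    return hinge + lam * np.dot(w, w)             # loss + penalty

X = np.array([[1.0, 1.0], [4.0, 4.0], [2.0, 3.0]])
y = np.array([-1, 1, 1])
print(svm_objective(np.array([0.5, 0.5]), -2.5, X, y, lam=0.1))  # -> 1.05
```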
Loss function | Penalty function | Resulting algorithm
Hinge loss: $\sum_{i=1}^{N} [1 - y_i f(x_i)]_+$ | $\|w\|_2^2$ | SVMs
Squared loss: $\sum_{i=1}^{N} (y_i - f(x_i))^2$ | $\|w\|_2^2$ | Ridge regression
Squared loss: $\sum_{i=1}^{N} (y_i - f(x_i))^2$ | $\|w\|_1$ | Lasso
Squared loss: $\sum_{i=1}^{N} (y_i - f(x_i))^2$ | $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$ | Elastic net
Hinge loss: $\sum_{i=1}^{N} [1 - y_i f(x_i)]_+$ | $\|w\|_1$ | 1-norm SVM
Part 2

Model selection for SVMs
Extensions to the basic SVM model:
1. SVMs for multicategory data
2. Support vector regression
3. Novelty detection with SVM-based methods
4. Support vector clustering
5. SVM-based variable selection
6. Computing posterior class probabilities (Platt's method)
Model selection for SVMs

[Figure: tumor vs. normal samples in Gene 1 / Gene 2 space under different kernels and parameter settings; the choice determines the decision surface.]

[Table: candidate (C, kernel parameter) combinations searched over a grid, with C in {0.1, 1, 10, 100, 1000} crossed with the kernel parameter in {1, 2, 3, 4, 5}.]
Nested cross-validation

Recall the main idea of cross-validation: the data are split into train and test parts, and each part serves as the test set once. For model selection, each training part is further split into train and validation parts.

Outer loop:
Training set | Testing set | C | Accuracy
P1, P2 | P3 | 1 | 89%
P1, P3 | P2 | 2 | 84%
P2, P3 | P1 | 1 | 76%
Average accuracy: 83%

Inner loop (shown for the outer split that trains on P1 and P2):
Training set | Validation set | C | Accuracy
P1 | P2 | 1 | 86%
P2 | P1 | 1 | 84% (average accuracy for C = 1: 85%)
P1 | P2 | 2 | 70%
P2 | P1 | 2 | 90% (average accuracy for C = 2: 80%)
=> choose C = 1
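A sketch of nested cross-validation in scikit-learn, where GridSearchCV plays the inner loop and cross_val_score the outer loop; the dataset and the C grid are illustrative assumptions:

```python
# Nested CV: the inner loop picks C, the outer loop estimates performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
inner = GridSearchCV(model, param_grid={"svc__C": [0.1, 1, 10, 100]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=3)  # each fold re-selects C

print("outer-loop accuracies:", outer_scores)
print("nested CV estimate:   ", outer_scores.mean())
```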
On use of cross-validation

Empirically we found that cross-validation works well for model selection for SVMs in many problem domains.

Many other approaches exist that can be used for model selection for SVMs, e.g.:
- Generalized cross-validation
- Bayesian information criterion (BIC)
- Minimum description length (MDL)
- Vapnik-Chervonenkis (VC) dimension
- Bootstrap
One-versus-rest multicategory SVM method

[Figure: three tumor classes (Tumor I, Tumor II, Tumor III) in Gene 1 / Gene 2 space; each class is separated from the remaining classes by its own binary SVM, and a new sample ("?") is classified using all of these classifiers.]
One-versus-one multicategory SVM method

[Figure: a binary SVM is built for every pair of classes (Tumor I vs. II, I vs. III, II vs. III); new samples ("?") are classified by combining the pairwise decisions.]
DAGSVM multicategory SVM method

[Figure: a decision directed acyclic graph over the classes AML, ALL B-cell, and ALL T-cell. The root node applies the binary classifier AML vs. ALL T-cell; each decision eliminates one class ("Not AML", "Not ALL B-cell", "Not ALL T-cell"), the next node applies a pairwise classifier to the remaining candidates (e.g., ALL B-cell vs. ALL T-cell), and each path terminates in a leaf with the predicted class.]
[Figure: classification of new samples ("?") among multiple classes in Gene 1 / Gene 2 space.]
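A sketch of the one-versus-rest and one-versus-one strategies using scikit-learn's meta-estimators around a binary SVM (DAGSVM has no built-in scikit-learn equivalent); the iris data stand in for the tumor classes:

```python
# OVR builds one SVM per class; OVO builds one SVM per pair of classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print("OVR classifiers:", len(ovr.estimators_))  # 3
print("OVO classifiers:", len(ovo.estimators_))  # 3*(3-1)/2 = 3
```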
Support vector regression (SVR)

Training data: $x_1, x_2, \ldots, x_N \in R^n$ with responses $y_1, y_2, \ldots, y_N \in R$.

Main idea: find a function $f(x) = w \cdot x + b$ that approximates $y_1, \ldots, y_N$:
- it has at most $\varepsilon$ deviation from the true values $y_i$;
- it is as flat as possible (to avoid overfitting).

[Figure: data points inside a tube of width $\pm\varepsilon$ around the regression function.]
[Figure: all data points fit inside the $\pm\varepsilon$ tube around $f(x) = w \cdot x + b$.]

Find $f(x) = w \cdot x + b$ by minimizing $\tfrac{1}{2}\|w\|^2$ subject to the constraints:
$y_i - (w \cdot x_i + b) \le \varepsilon$
$(w \cdot x_i + b) - y_i \le \varepsilon$
for i = 1, ..., N.
[Figure: points outside the $\pm\varepsilon$ tube incur slack $\xi_i$ (above the tube) or $\xi_i^*$ (below the tube).]

Find $f(x) = w \cdot x + b$ by minimizing $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$ subject to the constraints:
$y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i$
$(w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0$
for i = 1, ..., N.
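A sketch of ε-SVR with scikit-learn, where epsilon sets the tube width and C the penalty for points outside it; the noisy 1-D line is an assumed example:

```python
# epsilon-SVR on noisy linear data; support vectors lie on or outside the tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=80)  # noisy line

svr = SVR(kernel="linear", epsilon=0.3, C=10.0).fit(X, y)
print("slope ~2, intercept ~1:", svr.coef_[0], svr.intercept_)
print("support vectors (on/outside the tube):", len(svr.support_))
```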
Nonlinear ε-SVR

[Figure: data that follow a nonlinear trend in x are mapped by $\Phi(x)$ (via a kernel) into a feature space where a linear $\pm\varepsilon$ tube fits them; mapped back to the input space, the tube boundaries $+\varepsilon(\cdot)$ and $-\varepsilon(\cdot)$ become curves.]
Build a decision function of the form $f(x) = w \cdot x + b$ and minimize

$\sum_{i=1}^{N} \max(0, |y_i - f(x_i)| - \varepsilon) \; + \; \lambda \|w\|_2^2$

i.e., a loss term (the linear ε-insensitive loss, the "error in approximation") plus a penalty term.
Loss function (error in approximation) | Penalty function | Resulting algorithm
Linear ε-insensitive loss: $\sum_{i=1}^{N} \max(0, |y_i - f(x_i)| - \varepsilon)$ | $\lambda \|w\|_2^2$ | ε-SVR
Quadratic ε-insensitive loss: $\sum_{i=1}^{N} \big(\max(0, |y_i - f(x_i)| - \varepsilon)\big)^2$ | $\lambda \|w\|_2^2$ |
Squared error: $\sum_{i=1}^{N} (y_i - f(x_i))^2$ | $\lambda \|w\|_2^2$ | Ridge regression
Mean linear error: $\sum_{i=1}^{N} |y_i - f(x_i)|$ | $\lambda \|w\|_2^2$ |
Novelty detection with SVM-based methods

What is it about?

[Figure: the bulk of the samples in Predictor X / Predictor Y space falls where the decision function equals +1; atypical samples fall where the decision function equals -1.]

Key assumptions

Sample applications

[Figure: normal samples form a compact region; novel samples fall outside it. Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt]
Sample applications

Discover deviations in sample handling protocol when doing quality control of assays.

[Figure: in Protein X / Protein Y space, the bulk of the samples forms a dense cloud, while samples with low quality of processing from ICU patients fall outside it.]
Sample applications

Identify websites that discuss benefits of proven cancer treatments.

[Figure: in the space of weighted word frequencies (word X vs. word Y), websites that discuss cancer prevention methods form one cloud and blogs of cancer patients another.]
One-class SVM

Main idea: find the maximal-gap hyperplane that separates the data from the origin (i.e., the only member of the second class is the origin).

[Figure: hyperplane w · x + b = 0 separating the data cloud from the origin; slack variables $\xi_i$, $\xi_j$ are used, as in soft-margin SVMs, to penalize instances on the wrong side.]
Find $f(x) = \mathrm{sign}(w \cdot x + b)$ by minimizing

$\tfrac{1}{2}\|w\|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i + b$

subject to the constraints:
$w \cdot x_i + b \ge -\xi_i$
$\xi_i \ge 0$
for i = 1, ..., N. Here $\nu \in (0, 1]$ is an upper bound on the fraction of outliers (i.e., points outside the decision surface) allowed in the data.
Non-linear case (use kernel trick):

Linear: find $f(x) = \mathrm{sign}(w \cdot x + b)$ by minimizing $\tfrac{1}{2}\|w\|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i + b$ subject to $w \cdot x_i + b \ge -\xi_i$ and $\xi_i \ge 0$ for i = 1, ..., N.

Non-linear: find $f(x) = \mathrm{sign}(w \cdot \Phi(x) + b)$ by minimizing $\tfrac{1}{2}\|w\|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i + b$ subject to $w \cdot \Phi(x_i) + b \ge -\xi_i$ and $\xi_i \ge 0$ for i = 1, ..., N.
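A sketch of the one-class SVM in scikit-learn, whose nu parameter is exactly the ν bound above; the Gaussian blob and the handful of outliers are assumptions:

```python
# One-class SVM: learn the region of "typical" data, flag points outside it.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_typical = rng.randn(95, 2)                  # bulk of the data
X_outliers = rng.uniform(-6, 6, size=(5, 2))  # a few anomalies
X = np.vstack([X_typical, X_outliers])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(X)  # allow ~5% outliers
pred = ocsvm.predict(X)                            # +1 = typical, -1 = novel
print("flagged as outliers:", np.sum(pred == -1))
```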
Support vector clustering

Clustering process:

[Figure: the data are mapped by a kernel into feature space, where the smallest enclosing sphere (center a, radius R) is found; with noisy data a relaxed sphere (center a', radius R') is used.]

Minimize $R^2$ subject to $\|\Phi(x_i) - a\|^2 \le R^2$ for i = 1, ..., N (constraints).
[Figure: mapped back to the input space, the sphere boundary becomes a set of contours that define clusters; points A, B, C, D, E are assigned to clusters according to which contour encloses them.]
Typical data contain outliers, overlap, and noise.

[Figure: typical data with outliers and overlapping clusters, shown before and after the kernel mapping.]

Minimize $R^2$ subject to $\|\Phi(x_i) - a\|^2 \le R^2 + \xi_i$: the introduction of slack variables $\xi_i$ mitigates the influence of noise and overlap on the clustering process.
SVM-based variable selection

Recall that the SVM classifier is $f(x) = \mathrm{sign}(w \cdot x + b)$, with decision surface $w \cdot x + b = 0$.

The weight vector w indicates which variables matter:
w = (1, 1): $1x_1 + 1x_2 + b = 0$, so X1 and X2 are equally important.
w = (1, 0): $1x_1 + 0x_2 + b = 0$, so X1 is important and X2 is not.
w = (0, 1): $0x_1 + 1x_2 + b = 0$, so X2 is important and X1 is not.
w = (1, 1, 0): $1x_1 + 1x_2 + 0x_3 + b = 0$, so X1 and X2 are equally important and X3 is not.

[Figure: the corresponding decision surfaces in x1/x2 and x1/x2/x3 space.]
[Figure: melanoma vs. nevi in Gene X1 / Gene X2 space. The SVM decision surface has w = (1, 1) ($1x_1 + 1x_2 + b = 0$), while the decision surface of another classifier has w = (1, 0) ($1x_1 + 0x_2 + b = 0$); the true model depends on Gene X1.]
Simple SVM-based variable selection:
1. Train a linear SVM and estimate the vector w.
2. Rank each variable based on the magnitude of the corresponding element in vector w.
Subset of variables | Accuracy
X6, X5, X3, X2, X7, X1, X4 | 0.920
X6, X5, X3, X2, X7, X1 | 0.920
X6, X5, X3, X2, X7 | 0.919
X6, X5, X3, X2 | 0.852
X6, X5, X3 | 0.843
X6, X5 | 0.832
X6 | 0.821
SVM-RFE (recursive feature elimination): starting from the full variable set V, repeatedly train a linear SVM, rank the variables by the magnitude of the corresponding element in vector w, and remove the lowest-ranked variable(s).
6. Repeat until there are no variables left in V.
7. Select the smallest subset of variables with the best prediction accuracy.
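scikit-learn's RFECV implements exactly this loop (train a linear SVM, drop the lowest-|w| variables, repeat) and also performs step 7 by keeping the best-scoring subset along the path; a sketch on an assumed synthetic dataset:

```python
# SVM-RFE with cross-validated selection of the final subset size.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

# SVC(kernel="linear") exposes coef_, which RFECV uses to rank variables.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5).fit(X, y)
print("selected variables:", selector.n_features_)
print("mask of kept variables:", selector.support_)
```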
[Figure: recursive feature elimination on gene expression data. Train an SVM model on all 5,000 genes and compute prediction accuracy; discard the half of the genes that are not important for classification; train an SVM model on the remaining 2,500 genes important for classification and compute prediction accuracy; repeat.]

[Figure: tumor vs. normal samples mapped by a kernel from the input space to the feature space.]
Computing posterior class probabilities

[Figure: samples 1-100 of the validation set with their signed distances from the hyperplane w · x + b = 0 (e.g., sample 1: -1, ..., sample 98: -2, sample 99: 0.8, sample 100: 0.3); the data are split into training, validation, and testing sets.]

[Figure: histogram of the number of samples in the validation set versus distance from the hyperplane.]
Platt's method

Convert the distances output by the SVM to probabilities by passing them through the sigmoid filter:

$P(\text{positive class} \mid \text{sample}) = \frac{1}{1 + \exp(Ad + B)}$

where d is the distance from the hyperplane and A and B are parameters.

[Figure: the sigmoid mapping from distance to P(positive class | sample).]
Platt's method

1. Train the SVM classifier on the training set.
2. Apply it to the validation set and compute the distances from the hyperplane to each sample (e.g., sample 1: -1, ..., sample 98: -2, sample 99: 0.8, sample 100: 0.3).
3. Fit the sigmoid parameters A and B to the validation-set distances and class labels.
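scikit-learn exposes Platt's sigmoid fitting through CalibratedClassifierCV(method="sigmoid"), which makes the validation split explicit via cross-validation; a sketch on an assumed dataset:

```python
# Platt scaling: fit a sigmoid on held-out SVM decision values.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear")  # outputs distances, not probabilities
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=3).fit(X_train, y_train)

print(platt.predict_proba(X_test[:3]))  # P(class | sample) via the fitted sigmoid
```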
Part 3

Case studies (taken from our research)
1. Classification of cancer gene expression data
2. Text categorization in biomedicine
3. Prediction of clinical laboratory values
4. Modeling clinical judgment
5. Using SVMs for feature selection
6. Outlier detection in ovarian cancer proteomics data
Software
Conclusions
Bibliography
Case study 1: classification of cancer gene expression data

[Figure: families of classifiers: instance-based, neural networks, kernel-based, voting, decision trees.]
Ensemble classifiers

[Figure: a dataset is given to Classifier 1, Classifier 2, ..., Classifier N; their predictions (Prediction 1, ..., Prediction N) are combined by an ensemble classifier into a final prediction.]
Gene selection methods:
1. Signal-to-noise ratio in a one-versus-rest fashion (S2N-OVR);
2. Signal-to-noise ratio in a one-versus-one fashion (S2N-OVO);
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of between-groups to within-groups sums of squares (BW).
Microarray datasets

Dataset name | Samples | Variables (genes) | Categories | Reference
11_Tumors | 174 | 12,533 | 11 | Su, 2001
14_Tumors | 308 | 15,009 | 26 | Ramaswamy, 2001
9_Tumors | 60 | 5,726 | 9 | Staunton, 2001
Brain_Tumor1 | 90 | 5,920 | 5 | Pomeroy, 2002
Brain_Tumor2 | 50 | 10,367 | 4 | Nutt, 2003
Leukemia1 | 72 | 5,327 | 3 | Golub, 1999
Leukemia2 | 72 | 11,225 | 3 | Armstrong, 2002
Lung_Cancer | 203 | 12,600 | 5 | Bhattacharjee, 2001
SRBCT | 83 | 2,308 | 4 | Khan, 2001
Prostate_Tumor | 102 | 10,509 | 2 | Singh, 2002
DLBCL | 77 | 5,469 | 2 | Shipp, 2002

Total: ~1,300 samples; 74 diagnostic categories; 41 cancer types and 12 normal tissue types.
[Figure: experimental design.
Classifiers (11): MC-SVM methods (one-versus-rest, one-versus-one, DAGSVM, method by WW, method by CS), KNN, backpropagation NN, probabilistic NN, decision trees, and ensembles based on the outputs of all classifiers (majority voting and decision trees).
Gene selection methods (4): S2N one-versus-rest, S2N one-versus-one, non-parametric ANOVA, BW ratio.
Performance metrics (2): accuracy and RCI.
Evaluation: 10-fold CV and LOOCV, with statistical comparison by randomized permutation testing.
Tasks: binary and multicategory diagnosis on the datasets above (9_Tumors, 11_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumors, DLBCL).]
[Figure: classification accuracy (%) of the MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and the non-SVM methods (KNN, NN, PNN) on each of the 11 datasets, from 9_Tumors through DLBCL; the MC-SVM methods attain the highest accuracies throughout.]
Diagnostic performance before and after gene selection

[Figure: improvement in accuracy (%) from gene selection on 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2 for SVM methods (OVR, OVO, DAGSVM, WW, CS) and non-SVM methods (KNN, NN, PNN); the non-SVM methods improve the most.]

[Figure: accuracy (%) of MC-SVMs compared with the multiple specialized classification methods of the original primary studies, across the 11 datasets.]
Summary of results

- Multicategory SVMs are the best-performing family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
- Gene selection in some cases improves the classification performance of all classifiers, especially of the non-SVM algorithms.
- Ensemble classification does not improve the performance of SVMs or of the other classifiers.
- The results obtained by SVMs compare favorably with the literature.
[Figure: random forests. 1) Generate bootstrap samples from the training data; 2) perform random gene selection; 3) fit unpruned decision trees; evaluate on the testing data.]
Case study 2: text categorization in biomedicine

2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available, and occasionally MetaMap CUIs and author information) are used as a bag-of-words.
[Figure: models trained on labeled examples; estimated performance and 2005 performance for the four categories (Txmt = treatment, Diag = diagnosis, Prog = prognosis, Etio = etiology).]
Category | Average AUC | Range over n folds
Treatment | 0.97* | 0.96 - 0.98
Etiology | 0.94* | 0.89 - 0.95
Prognosis | 0.95* | 0.92 - 0.97
Diagnosis | 0.95* | 0.93 - 0.98
Method | Treatment AUC | Etiology AUC | Prognosis AUC | Diagnosis AUC
Google PageRank | 0.54 | 0.54 | 0.43 | 0.46
Yahoo WebRanks | 0.56 | 0.49 | 0.52 | 0.52
Impact Factor 2005 | 0.67 | 0.62 | 0.51 | 0.52
- | 0.63 | 0.63 | 0.58 | 0.57
Bibliometric Citation Count | 0.76 | 0.69 | 0.67 | 0.60
Machine Learning Models | 0.96 | 0.95 | 0.95 | 0.95

Comparison on the SSOAB task: SSOAB-specific filters achieve AUC 0.893, Citation Count 0.791, and two further methods 0.549 and 0.558.
[Figure: sensitivity and specificity of the learning models versus the query filters for each category; in each panel one operating characteristic is fixed to the query filter's value (fixed sensitivity or fixed specificity), and the learning models match or exceed the query filters on the other.]
Case study 3: prediction of clinical laboratory values

Data were split chronologically: training on 01/1998-05/2001 (with 25% of the training data used for validation) and testing on 06/2001-10/2002.

[Figure: the data model is selected on the training/validation database; the optimal data model is then applied to each testing sample from the testing data to produce a prediction.]
Classification results

[Table: classification performance for each laboratory test (BUN, Ca, CaIo, CO2, Creat, Mg, Osmol, PCV, Phos) under two definitions of abnormal values (">1" and "[2.5, 97.5]" percentile cutoffs), reported over all cases and excluding cases with K = 0 (i.e., samples with no prior lab measurements); performance values range from about 46% to 99%.]

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
Dataset description (test: BUN; range of normal values: < 99th percentile)

Set | N samples (total) | N abnormal samples
Training set | 3,749 | 78
Validation set | 1,251 | 27
Testing set | 836 | 16

Data modeling (SRT):
- Number of previous measurements: 5
- Use variables corresponding to hospitalization units? Yes
- Number of prior hospitalizations used: 2

[Figure: frequency distribution of BUN test values, showing the bulk of normal values and the tail of abnormal values.]

Method | Performance (validation) | Performance (testing) | N features
All | 95.29% | 94.72% | 3,442
RFE_Linear | 98.78% | 99.66% | 26
RFE_Poly | 98.76% | 99.63% | 3
HITON_PC | 99.12% | 99.16% | 11
HITON_MB | 98.90% | 99.05% | 17
[Table: features selected by each method, by rank. All methods rank LAB: PM_1(BUN) first. Other selected features include prior measurements such as LAB: PM_2(BUN), LAB: PM_2(Cl), LAB: PM_1(Cl), LAB: PM_5(Creat), LAB: PM_1(Mg), LAB: PM_3(Mg), LAB: PM_1(Phos), LAB: PM_2(Phos), LAB: PM_3(PCV), LAB: PM_1(PCV), LAB: PM_3(Gluc); indicators such as LAB: Indicator(PM_1(BUN)), LAB: Indicator(PM_1(Mg)), LAB: Indicator(PM_4(Creat)), LAB: Indicator(PM_5(Creat)); deltas such as LAB: DT(PM_3(K)), LAB: DT(PM_3(Creat)), LAB: DT(PM_4(Creat)), LAB: DT(PM_4(Cl)), LAB: DT(PM_3(Mg)), LAB: DT(PM_5(Mg)), LAB: DT(PM_1(CO2)), LAB: DT(PM_3(CO2)), LAB: DT(PM_2(Gluc)), LAB: DT(PM_4(Gluc)), LAB: DT(PM_2(Phos)), LAB: DT(PM_5(CaIo)); demographics (DEMO: Gender, DEMO: Age, DEMO: Hospitalization Unit TVOS); and unit variables (LAB: Test Unit 11NM (Test K, PM 5), LAB: Test Unit VHR (Test CaIo, PM 1)).]
Case study 4: modeling clinical judgment

[Figure: for each patient, the clinical diagnosis (cd) and each physician's diagnosis (hd) are recorded together with the patient features f1...fm; each of the six physicians is modeled against the clinical gold standard in order to identify predictors ignored by physicians.]

Data collection: patients seen prospectively from 1999 to 2002 at the Department of Dermatology, S. Chiara Hospital, Trento, Italy. Inclusion criteria: histological diagnosis and more than one digital image available. Diagnoses were made in 2004.
Features: lesion location, family history of melanoma, irregular border, streaks (radial streaming, pseudopods), max-diameter, Fitzpatrick's photo-type, number of colors, slate-blue veil, min-diameter, sunburn, atypical pigmented network, whitish veil, evolution, ephelis, globular elements, age, lentigos, regression-erythema, comedo-like openings and milia-like cysts, gender, asymmetry, hypo-pigmentation, telangiectasia.
[Figure: regular learning vs. meta-learning. Regular learning: feature selection (FS) followed by building a decision tree (DT) or an SVM. Meta-learning: the SVM is treated as a black box; the SVM is applied to produce predictions, and a decision tree is built on those predictions.]
Physician | All (features) | RFE (features) | HITON_PC (features) | HITON_MB (features)
Expert 1 | 0.94 (24) | 0.92 (4) | 0.92 (5) | 0.95 (14)
Expert 2 | 0.92 (24) | 0.89 (7) | 0.90 (7) | 0.90 (12)
Expert 3 | 0.98 (24) | 0.95 (4) | 0.95 (4) | 0.97 (19)
NonExpert 1 | 0.92 (24) | 0.89 (5) | 0.89 (6) | 0.90 (22)
NonExpert 2 | 1.00 (24) | 0.99 (6) | 0.99 (6) | 0.98 (11)
NonExpert 3 | 0.89 (24) | 0.89 (4) | 0.89 (6) | 0.87 (10)

(AUC, with the number of features used in parentheses.)
[Figure: decision trees obtained by meta-learning. Expert 1 (AUC = 0.92, R2 = 99%): splits on blue veil, irregular border, and streaks. Expert 3 (AUC = 0.95, R2 = 99%): splits on blue veil, irregular border, streaks, number of colors, and evolution.]

Physicians | Reported guidelines
Experts 1, 2, 3 and non-expert 1 | Pattern analysis
Non-expert 2 | ABCDE rule
Non-expert 3 | Non-standard; reports using 7 features
Case study 5: using SVMs for feature selection

Dataset | Domain | Number of variables | Number of samples | Target | Data type
Infant_Mortality | Clinical | 86 | 5,337 | | discrete
Ohsumed | Text | 14,373 | 5,000 | | continuous
ACPJ_Etiology | Text | 28,228 | 15,779 | Relevant to etiology | continuous
Lymphoma | Gene expression | 7,399 | 227 | | continuous
Gisette | Digit recognition | 5,000 | 7,000 | Separate 4 from 9 | continuous
Dexter | Text | 19,999 | 600 | | continuous
Sylva | Ecology | 216 | 14,394 | |
Ovarian_Cancer | Proteomics | 2,190 | 216 | | continuous
Thrombin | Drug discovery | 139,351 | 2,543 | Binding to thrombin | discrete (binary)
Breast_Cancer | Gene expression | 17,816 | 286 | | continuous
Hiva | Drug discovery | 1,617 | 4,229 | | discrete (binary)
Nova | Text | 16,969 | 1,929 | | discrete (binary)
Bankruptcy | Financial | 147 | 7,063 | Personal bankruptcy |
[Figure: classification performance (AUC) versus proportion of selected features for HITON-PC and four variants of SVM-RFE, on the original datasets.]

[Table: statistical comparison of HITON-PC with the SVM-RFE variants on classification performance and on feature-set reduction; P-values and nominal winners: 0.9754 (SVM-RFE), 0.0046 (HITON-PC), 0.8030 (SVM-RFE), 0.0042 (HITON-PC), 0.1312 (HITON-PC), 0.3634 (HITON-PC), 0.1008 (HITON-PC), 0.6816 (SVM-RFE).]
Summary results

[Figure: comparison of HITON-PC-based causal methods, HITON-PC-FDR methods, and SVM-RFE.]
Comparison of average HITON-PC-FDR with G2 test vs. average SVM-RFE:

Sample size | P-value | Nominal winner
200 | <0.0001 | HITON-PC-FDR
500 | 0.0028 | HITON-PC-FDR
5000 | <0.0001 | HITON-PC-FDR
Case study 6: outlier detection in ovarian cancer proteomics data

[Figure: mass spectrometry profiles of Cancer, Normal, and Other samples plotted against clock tick (4,000 to 12,000), for dataset 1 and dataset 2.]

The gross break at the benign-disease juncture in dataset 1, and the similarity of the profiles to those in dataset 2, suggest a change of protocol in the middle of the first experiment.
Software

Interactive demos and applets:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)

Animations
Support Vector Machines:
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
http://www.youtube.com/watch?v=3liCbRZPrZA
Support Vector Regression:
http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi

SVMLight: http://svmlight.joachims.org/
- General purpose (designed for text categorization)
- Implements binary SVM, multiclass SVM, SVR
- Command-line interface
- Code/interface for C/C++, Java, Matlab, Python, Perl
Conclusions

Bibliography