I. INTRODUCTION
Decision tree induction algorithms are widely used
in a variety of domains for knowledge discovery
and pattern recognition. The induced knowledge in
the form of hierarchical trees can be regarded as
a disjunction of conjunctions of constraints on the
attribute values [1]. Each path from the root to a leaf
is actually a conjunction of attribute tests, and the
tree itself allows the choice of different paths, i.e., a
disjunction of these conjunctions. Such a representation
is intuitive and easy for humans to assimilate, which
partially explains the large number of studies that
make use of these techniques. Another reason for their
popularity is their good predictive accuracy in several
application domains, such as medical diagnosis and
credit risk assessment [2].
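
To make the "disjunction of conjunctions" reading concrete, the following minimal sketch enumerates the root-to-leaf paths of a toy axis-parallel tree and prints the rule for one class. The tree, the attribute names, the 0.5 thresholds, and the helpers `paths` and `rule_for` are illustrative assumptions and do not come from the paper.

```python
# Toy tree (hypothetical): internal nodes are (attribute, threshold, left, right),
# leaves are class labels given as strings.
tree = ("x1", 0.5,
        ("x2", 0.5, "A", "B"),
        "A")


def paths(node, tests=()):
    """Yield (tests, label) for every root-to-leaf path; tests is a conjunction."""
    if isinstance(node, str):          # reached a leaf
        yield tests, node
        return
    attr, thr, left, right = node
    yield from paths(left, tests + (f"{attr} <= {thr}",))
    yield from paths(right, tests + (f"{attr} > {thr}",))


def rule_for(label, root=tree):
    """Join the conjunctions of all paths ending in `label` into one disjunction."""
    conjunctions = [" AND ".join(tests)
                    for tests, leaf in paths(root) if leaf == label]
    return " OR ".join(f"({c})" for c in conjunctions)


print(rule_for("A"))  # (x1 <= 0.5 AND x2 <= 0.5) OR (x1 > 0.5)
```

Class A is reached by two different paths, so its rule is a disjunction of two conjunctions, while class B is described by a single conjunction.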
A major issue in decision tree induction is which
attribute(s) to choose for splitting an internal node.
Tree ← merging(Leaves)
return Tree

procedure merging(L)
input : A set L of nodes
output: The oblique decision tree root node
begin
    if |L| = 1 then
        return unique node in L
    else
        s1 ← 1
        s2 ← 1
        smallestDistance ← ∞
        distance ← 0
        foreach i, j ∈ L with i ≠ j do
            if i.class ≠ j.class then
                distance ← distance between i and j
                if distance < smallestDistance then
                    smallestDistance ← distance
                    s1 ← i
                    s2 ← j
        create new internal node t
        t.leftChild ← s1
        t.rightChild ← s2
        t.instances ← s1.instances ∪ s2.instances
        t.class ← new meta-class
        t.centroid ← mean vector of t.instances
        t.hyperplane ← SVM hyperplane for t.instances
        L ← L ∪ {t}
        L ← L \ {s1}
        L ← L \ {s2}
        return merging(L)
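
The merging procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance between two nodes is taken to be the Euclidean distance between their centroids, the recursion is written as a loop, and the SVM hyperplane is replaced by a simpler stand-in (the perpendicular bisector of the two child centroids) so that the example needs only the standard library. The names `Node`, `bisector`, and `classify` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist, inf
from typing import Optional


@dataclass(eq=False)
class Node:
    instances: list                     # list of feature tuples
    label: object                       # class label, or a meta-class id
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    hyperplane: Optional[tuple] = None  # (w, b); go left when w.x + b >= 0

    @property
    def centroid(self):
        n = len(self.instances)
        return tuple(sum(col) / n for col in zip(*self.instances))


def bisector(a, b):
    """Stand-in for the SVM step (assumption): hyperplane halfway between a and b."""
    w = tuple(ai - bi for ai, bi in zip(a, b))
    mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
    return w, -sum(wi * mi for wi, mi in zip(w, mid))


def merging(nodes):
    """Repeatedly merge the two closest nodes of different classes into a parent."""
    nodes = list(nodes)
    meta = 0
    while len(nodes) > 1:
        smallest, pair = inf, None
        for i, j in combinations(nodes, 2):
            if i.label != j.label:
                d = dist(i.centroid, j.centroid)  # assumed distance measure
                if d < smallest:
                    smallest, pair = d, (i, j)
        if pair is None:
            raise ValueError("all remaining nodes share one class")
        s1, s2 = pair
        meta += 1                                 # fresh meta-class per merge
        t = Node(s1.instances + s2.instances, ("meta", meta), s1, s2,
                 bisector(s1.centroid, s2.centroid))
        nodes = [n for n in nodes if n is not s1 and n is not s2] + [t]
    return nodes[0]


def classify(node, x):
    """Follow the hyperplane tests from the root down to a leaf."""
    while node.left is not None:
        w, b = node.hyperplane
        side = sum(wi * xi for wi, xi in zip(w, x)) + b
        node = node.left if side >= 0 else node.right
    return node.label


leaves = [Node([(0, 0), (0, 1)], "A"),
          Node([(5, 5), (6, 5)], "B"),
          Node([(10, 0), (10, 1)], "C")]
root = merging(leaves)
print(classify(root, (0, 0)), classify(root, (6, 5)), classify(root, (10, 0)))
```

With these three toy leaves, B and C are merged first because their centroids are closest, and the resulting meta-class node is then merged with A to form the root.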
Figure 1. Diagram of BUTIA's execution steps.
Table I
SUMMARY OF THE GENE EXPRESSION DATASETS.

Id  Dataset        Type  # Instances  # Classes  # Genes
 1  alizadeh-v1    cDNA           42          2     1095
 2  alizadeh-v2    cDNA           62          3     2093
 3  alizadeh-v3    cDNA           62          4     2093
 4  bittner        cDNA           38          2     2201
 5  bredel         cDNA           50          3     1739
 6  chen           cDNA          180          2       85
 7  garber         cDNA           66          4     4553
 8  khan           cDNA           83          4     1069
 9  lapointe-v1    cDNA           69          3     1625
10  lapointe-v2    cDNA          110          4     2496
11  liang          cDNA           37          3     1411
12  risinger       cDNA           42          4     1771
13  tomlins-v1     cDNA          104          5     2315
14  tomlins-v2     cDNA           92          4     1288
15  armstrong-v1   Affy           72          2     1081
16  armstrong-v2   Affy           72          3     2194
17  bhattacharjee  Affy          203          5     1543
18  chowdary       Affy          104          2      182
19  dyrskjot       Affy           40          3     1203
20  golub-v1       Affy           72          2     1877
21  golub-v2       Affy           72          3     1877
22  gordon         Affy          181          2     1626
23  laiho          Affy           37          2     2202
24  nutt-v1        Affy           50          4     1377
25  nutt-v2        Affy           28          2     1070
26  nutt-v3        Affy           22          2     1152
27  pomeroy-v1     Affy           34          2      857
28  pomeroy-v2     Affy           42          5     1379
29  ramaswamy      Affy          190         14     1363
30  shipp          Affy           77          2      798
31  singh          Affy          102          2      339
32  su             Affy          174         10     1571
33  west           Affy           49          2     1198
34  yeoh-v1        Affy          248          2     2526
35  yeoh-v2        Affy          248          6     2526
Id  BUTIA        OC1          FT           CART         J48          SMO
 1  0.94 ± 0.14  0.72 ± 0.19  0.89 ± 0.15  0.69 ± 0.26  0.69 ± 0.20  0.94 ± 0.14
 2  1.00 ± 0.00  0.87 ± 0.12  0.99 ± 0.05  0.90 ± 0.13  0.89 ± 0.14  1.00 ± 0.00
 3  0.94 ± 0.10  0.66 ± 0.20  0.82 ± 0.14  0.71 ± 0.10  0.70 ± 0.16  0.94 ± 0.08
 4  0.92 ± 0.14  0.56 ± 0.23  0.81 ± 0.18  0.58 ± 0.14  0.56 ± 0.23  0.88 ± 0.13
 5  0.86 ± 0.13  0.76 ± 0.08  0.78 ± 0.11  0.76 ± 0.13  0.84 ± 0.13  0.86 ± 0.10
 6  0.84 ± 0.08  0.82 ± 0.07  0.94 ± 0.04  0.85 ± 0.06  0.84 ± 0.06  0.94 ± 0.07
 7  0.85 ± 0.12  0.74 ± 0.13  0.82 ± 0.13  0.76 ± 0.12  0.80 ± 0.10  0.80 ± 0.14
 8  0.98 ± 0.05  0.79 ± 0.14  1.00 ± 0.00  0.83 ± 0.11  0.87 ± 0.09  0.99 ± 0.04
 9  0.88 ± 0.12  0.80 ± 0.07  0.76 ± 0.12  0.77 ± 0.07  0.72 ± 0.16  0.85 ± 0.15
10  0.86 ± 0.09  0.66 ± 0.19  0.84 ± 0.09  0.65 ± 0.08  0.63 ± 0.18  0.85 ± 0.08
11  1.00 ± 0.00  0.68 ± 0.28  0.93 ± 0.12  0.76 ± 0.25  0.79 ± 0.20  0.98 ± 0.08
12  0.74 ± 0.15  0.57 ± 0.18  0.78 ± 0.18  0.53 ± 0.25  0.45 ± 0.24  0.81 ± 0.16
13  0.90 ± 0.11  0.61 ± 0.11  0.87 ± 0.12  0.54 ± 0.15  0.55 ± 0.10  0.93 ± 0.11
14  0.90 ± 0.08  0.53 ± 0.22  0.82 ± 0.14  0.59 ± 0.13  0.56 ± 0.17  0.91 ± 0.07
15  0.99 ± 0.05  0.71 ± 0.38  0.96 ± 0.07  0.91 ± 0.09  0.90 ± 0.07  0.99 ± 0.05
16  0.97 ± 0.06  0.74 ± 0.11  0.90 ± 0.10  0.75 ± 0.15  0.76 ± 0.09  0.96 ± 0.07
17  0.97 ± 0.03  0.92 ± 0.06  0.97 ± 0.03  0.89 ± 0.09  0.91 ± 0.08  0.96 ± 0.05
18  0.92 ± 0.08  0.39 ± 0.51  0.95 ± 0.05  0.97 ± 0.05  0.93 ± 0.07  0.96 ± 0.05
19  0.85 ± 0.24  0.68 ± 0.17  0.65 ± 0.24  0.75 ± 0.17  0.73 ± 0.25  0.90 ± 0.17
20  0.93 ± 0.07  0.83 ± 0.13  0.89 ± 0.11  0.85 ± 0.14  0.86 ± 0.13  0.97 ± 0.06
21  0.94 ± 0.07  0.81 ± 0.09  0.94 ± 0.07  0.92 ± 0.07  0.96 ± 0.07  0.94 ± 0.07
22  0.99 ± 0.02  0.96 ± 0.02  0.98 ± 0.03  0.95 ± 0.02  0.95 ± 0.04  0.99 ± 0.02
23  0.88 ± 0.16  0.83 ± 0.22  0.80 ± 0.21  0.76 ± 0.18  0.89 ± 0.14  0.97 ± 0.11
24  0.56 ± 0.18  0.24 ± 0.31  0.66 ± 0.21  0.58 ± 0.18  0.56 ± 0.21  0.72 ± 0.25
25  0.57 ± 0.21  0.80 ± 0.22  0.37 ± 0.07  0.85 ± 0.20  0.82 ± 0.20  0.93 ± 0.14
26  0.73 ± 0.34  0.73 ± 0.24  0.68 ± 0.23  0.63 ± 0.20  0.60 ± 0.33  1.00 ± 0.00
27  0.89 ± 0.14  0.57 ± 0.50  0.79 ± 0.15  0.73 ± 0.19  0.73 ± 0.29  0.92 ± 0.14
28  0.77 ± 0.15  0.63 ± 0.18  0.79 ± 0.18  0.53 ± 0.17  0.59 ± 0.17  0.79 ± 0.21
29  0.62 ± 0.08  0.36 ± 0.27  0.65 ± 0.11  0.57 ± 0.08  0.62 ± 0.12  0.72 ± 0.08
30  0.89 ± 0.10  0.71 ± 0.10  0.80 ± 0.14  0.78 ± 0.11  0.81 ± 0.13  0.93 ± 0.09
31  0.77 ± 0.07  0.76 ± 0.11  0.89 ± 0.10  0.74 ± 0.10  0.82 ± 0.10  0.92 ± 0.06
32  0.93 ± 0.05  0.65 ± 0.15  0.89 ± 0.09  0.76 ± 0.12  0.81 ± 0.10  0.90 ± 0.05
33  0.84 ± 0.18  0.68 ± 0.36  0.90 ± 0.14  0.78 ± 0.22  0.86 ± 0.10  0.86 ± 0.16
34  0.98 ± 0.04  0.99 ± 0.03  0.99 ± 0.03  0.99 ± 0.03  0.99 ± 0.02  0.98 ± 0.03
35  0.87 ± 0.04  0.75 ± 0.10  0.85 ± 0.06  0.76 ± 0.10  0.70 ± 0.10  0.84 ± 0.08
#1  13           0            5            1            2            19
#2  9            2            12           2            2            12
#3  6            4            11           6            5            3
McGraw-Hill, 1997.
San