Bayesian CART Model Search
by
Hugh Chipman, Edward I. George and Robert E. McCulloch 1
Abstract
In this paper we put forward a Bayesian approach for finding CART (classification and regression tree) models. The two basic components of this approach consist of prior specification and stochastic search. The basic idea is to have the prior induce a posterior distribution which will guide the stochastic search towards more promising CART models. As the search proceeds, such models can then be selected with a variety of criteria such as posterior probability, marginal likelihood, residual sum of squares or misclassification rates. Examples are used to illustrate the potential superiority of this approach over alternative methods.
Keywords: binary trees, Markov chain Monte Carlo, model selection, model uncertainty,
stochastic search, mixture models.
1 Introduction

CART models are a flexible method for specifying the conditional distribution of a variable y, given a vector of predictor values x. Such models use a binary tree to recursively partition the predictor space into subsets where the distribution of y is successively more homogeneous. The terminal nodes of the tree correspond to the distinct regions of the partition, and the partition is determined by splitting rules associated with each of the internal nodes. By moving from the root node through to the terminal node of the tree, each observation is then assigned to a unique terminal node where the conditional distribution of y is determined. CART models were popularized in the statistical community by the seminal book of Breiman, Friedman, Olshen and Stone (1984). A concise description of CART modeling and its S-PLUS implementation appears in Clark and Pregibon (1992).

Given a data set, a common strategy for finding a good tree is to use a greedy algorithm to grow a tree and then to prune it back to avoid overfitting. Such greedy algorithms typically grow a tree by sequentially choosing splitting rules for nodes on the basis of maximizing some fitting criterion. This generates a sequence of trees each of which is an extension of the previous tree. A single tree is then selected by pruning the largest tree according to a model choice criterion such as cost-complexity pruning, cross-validation, or even multiple tests of whether two adjoining nodes should be collapsed into a single node.

1 Hugh Chipman is Assistant Professor of Statistics, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON N2L 3G1, hachipman@uwaterloo.ca. Edward I. George is the Ed and Molly Smith Chair in Business Administration and Professor of Statistics, Department of MSIS, University of Texas, Austin, TX 78712-1175, egeorge@mail.utexas.edu. Robert E. McCulloch is Professor of Statistics, Graduate School of Business, University of Chicago, IL 60637, rem@gsbrem.uchicago.edu. For helpful discussions and suggestions, the authors would like to thank Wray Buntine, Jerry Friedman, Mark Glickman, Augustine Kong, Rob Tibshirani, Alan Zaslavsky, an associate editor, two anonymous referees and the participants of the 1995 International Workshop on Model Uncertainty and Model Robustness held in Bath, England. This work was supported by NSF grant DMS 94.04408, Texas ARP grant 003658130, and research funding from the Graduate Schools of Business at the University of Chicago and the University of Texas at Austin.

In this paper we put forward a Bayesian approach for finding CART models. The two basic components of this approach consist of prior specification and stochastic search. The basic idea is to have the prior induce a posterior distribution which will guide the stochastic search towards more promising CART models. As the search proceeds, such models can then be selected with a variety of criteria such as posterior probability, marginal likelihood, residual sum of squares

each tree. Combining this prior with the tree model likelihood yields a posterior distribution on the set of tree models. A feature of this approach is that the prior specification can be

simulated and real data. Section 8 concludes with a discussion.
use linear relationships such as E(y_ij | x_ij, θ_i) = x_ij′β_i at the ith node. This would allow for modeling the mean of Y by piecewise linear or quadratic functions rather than by constant functions as is implied by the iid assumption.

For regression trees, we consider two models where f(y_ij | θ_i) is normal, namely the mean shift model

  y_i1, ..., y_in_i | θ_i  iid ~ N(μ_i, σ²),  i = 1, ..., b,   (3)

where θ_i = (μ_i, σ), and the mean-variance shift model

  y_i1, ..., y_in_i | θ_i  iid ~ N(μ_i, σ_i²),  i = 1, ..., b,   (4)

where θ_i = (μ_i, σ_i). For classification trees where y_ij belongs to one of K categories, say C_1, ..., C_K, we consider f(y_ij | θ_i) to be generalized Bernoulli (i.e. simple multinomial), namely

  f(y_i1, ..., y_in_i | θ_i) = ∏_{j=1}^{n_i} ∏_{k=1}^{K} p_ik^{I(y_ij ∈ C_k)},  i = 1, ..., b,   (5)

where θ_i = p_i ≡ (p_i1, ..., p_iK), p_ik ≥ 0 and Σ_k p_ik = 1. Note that P(y_ij ∈ C_k | p_i) = p_ik.

Since a CART model is identified by (Θ, T), a Bayesian analysis of the problem proceeds by specifying a prior probability distribution p(Θ, T). Because Θ indexes the parametric model for each T, it will usually be convenient to use the relationship

  p(Θ, T) = p(Θ | T) p(T),   (6)

and specify p(T) and p(Θ | T) separately. This strategy, which is commonly used for Bayesian model selection (George 1998), offers the advantage that the choice of prior for T does not depend on the form of the parametric family indexed by Θ. Thus, the same approach for prior specification of T can be used for regression trees and classification trees. Another feature is that conditional specification of the prior on Θ more easily allows for the choice of convenient analytical forms which facilitate posterior computation. We proceed to discuss specification of p(T) and p(Θ | T) in Sections 3 and 4 below.

3 Specification of the Tree Prior p(T)

Instead of specifying a closed form expression for the tree prior p(T), we specify p(T) implicitly by a tree-generating stochastic process. Each realization of such a process can simply be considered as a random draw from this prior. Furthermore, many specifications allow for straightforward evaluation of p(T) for any T, and can be effectively coupled with efficient Metropolis-Hastings search algorithms, as is shown in Section 5.

To draw from our prior we start with the tree consisting of a single root node. The tree then grows by randomly splitting terminal nodes, which entails assigning them splitting rules, and left and right children nodes. The growing process is determined by the specification of two functions pSPLIT(η, T) and pRULE(ρ | η, T). For an intermediate tree T in the process, pSPLIT(η, T) is the probability that terminal node η is split, and pRULE(ρ | η, T) is the probability of assigning splitting rule ρ to η if it is split. The stochastic process for drawing a tree from this prior can be described in the following recursive manner:

1. Begin by setting T to be the trivial tree consisting of a single root (and terminal) node denoted η.

2. Split the terminal node η with probability pSPLIT(η, T).

3. If the node splits, assign it a splitting rule ρ according to the distribution pRULE(ρ | η, T), and create the left and right children nodes. Let T denote the newly created tree, and apply steps 2 and 3 with η equal to the new left and right children (if nontrivial splitting rules are available).

In what follows, we restrict pSPLIT(η, T) and pRULE(ρ | η, T) to be functions of the part of T above (i.e. ancestral to) η. This ensures that the process does not depend on the order in which terminal nodes are considered in steps 2 and 3. As a consequence, for any internal node, the subtrees stemming from the right and left children are independent. While in some cases this may be restrictive, we find it to be a useful simplifying assumption.

We now proceed to discuss a variety of specifications for pSPLIT and pRULE. As will be seen, these specifications can be based on simple defaults, or can be made elaborate to allow for special structure.

3.1 Determination of Tree Size and Shape by pSPLIT

As we have defined it, a CART tree T is composed of a binary tree and an assignment of splitting rules. To understand the role of pSPLIT, it is useful to ignore the splitting rule assignment and focus on the distribution of binary trees obtained by successively splitting terminal nodes with probability pSPLIT as in step 2 above. This would be the distribution of binary tree components of T obtained by steps 1-3 above if there were an infinite supply of splitting rules.

Let us first consider the simple specification, pSPLIT(η, T) ≡ α < 1. Under this specification, the probability of any particular binary tree with b terminal nodes is just α^(b−1) (1 − α)^b. This generalization of the geometric probability distribution is obtained by noting that a binary tree with b terminal nodes must have (b − 1) internal nodes. Note that setting α small will tend to yield smaller trees and is a simple convenient way to control the size of trees generated by the growing process.

The choice pSPLIT(η, T) ≡ α is somewhat limited because it assigns equal probability to all binary trees with b terminal nodes regardless of their shape. A more general form which
[Figure 3: Prior probability of the number of terminal nodes under pSPLIT, for (α, β) = (0.5, 0.5), (0.95, 0.5), (0.95, 1) and (0.95, 1.5).]

on the available set of split values or category subsets. By available, we mean those predictors, split values and category subsets which would not lead to empty terminal nodes. For

would no longer be available for splitting rules at nodes below it.

As a practical matter, we only consider prior specifications for which the overall set of possible split values is finite. Thus, each pRULE will always be a discrete distribution. This is hardly a restriction since every data set is necessarily finite, and so can only be partitioned in a finite number of ways. As a consequence, the support of p(T) will always be a finite set of trees. Note also that because the assignment of splitting rules
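For an ordinal predictor, the available split values at a node can be enumerated directly. The sketch below uses our own function name, and the midpoint convention is our assumption (the text only requires a finite candidate set); it returns the cutpoints that leave both children nonempty:

```python
def available_split_values(x_col, node_rows):
    """Candidate split values for an ordinal predictor at a node:
    midpoints between consecutive distinct observed values among
    the node's rows.  Every such cutpoint sends at least one
    observation left and one right, so neither child is empty."""
    vals = sorted({x_col[i] for i in node_rows})
    return [(lo + hi) / 2 for lo, hi in zip(vals, vals[1:])]
```

A node whose observations all share one value of the predictor yields an empty candidate list, so that predictor is simply unavailable there.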
on subsets thought to be more important.

4 Specification of the Parameter Prior: p(Θ | T)

In choosing a prior for Θ | T, it is important to be aware that using priors which allow for analytical simplification can substantially reduce the computational burden of posterior calculation and exploration. As will be seen in Section 5, this is especially true for prior forms p(Θ | T) for which it is possible to analytically margin out Θ to obtain

  p(Y | X, T) = ∫ p(Y | X, Θ, T) p(Θ | T) dΘ.   (8)

In the following two subsections, we describe a variety of prior forms for which this marginalization is possible. These priors are obtained by putting conjugate priors on terminal node parameters and then assuming (conditional) independence of parameters across terminal nodes. Such priors are simple to describe and implement. Although we do not describe them here, it is also possible to construct analytically tractable hierarchical priors which use the tree structure to model parameter dependence across terminal nodes. Such priors are described and investigated in Chipman, George & McCulloch (1997).

4.1 Parameter Priors for Regression Trees

For regression trees with the mean-shift normal model (3), perhaps the simplest prior specification for Θ | T is the standard conjugate form

  μ_1, ..., μ_b | σ, T  iid ~ N(μ̄, σ²/a)   (9)

and

  σ² | T ~ IG(ν/2, νλ/2)   (i.e., νλ/σ² ~ χ²_ν).   (10)

Under this prior, standard analytical simplification yields

  p(Y | X, T) = c · a^(b/2) / ∏_{i=1}^b (n_i + a)^(1/2) · ( Σ_{i=1}^b (s_i + t_i) + νλ )^(−(n+ν)/2)   (11)

where c is a constant which does not depend on T, s_i is (n_i − 1) times the sample variance of the Y_i values, t_i = (n_i a)/(n_i + a) · (ȳ_i − μ̄)², and ȳ_i is the average value in Y_i.

In practice, the observed Y can be used to guide the choice of the prior parameter values for (ν, λ, μ̄, a). To begin with, because the mean-shift model attempts to explain the variation of Y, it is reasonable to expect that σ will be smaller than the sample standard deviation of Y, say s_Y. Similarly, it is reasonable to expect that σ will be larger than a pooled standard deviation estimate obtained from a deliberate overfitting of the data by a greedy algorithm, say s_*. Using these values as guides, ν and λ would then be chosen so that the prior for σ assigns substantial probability to the interval (s_*, s_Y). Once ν and λ have been chosen, μ̄ and a would be selected so that the prior for μ is spread out over the range of Y values.

For the more flexible mean-variance shift model (4) where σ_i can also vary across the terminal nodes, the conjugate form is easily extended to

  μ_i | σ_i ~ N(μ̄, σ_i²/a)   (12)

and

  σ_i² ~ IG(ν/2, νλ/2),   (13)

with the pairs (μ_1, σ_1), ..., (μ_b, σ_b) independently distributed. Under this prior, analytical simplification is still straightforward, and yields

  p(Y | X, T) = ∏_{i=1}^b [ π^(−n_i/2) · √a / √(n_i + a) · (νλ)^(ν/2) · Γ((n_i + ν)/2) / Γ(ν/2) · (s_i + t_i + νλ)^(−(n_i+ν)/2) ]   (14)

where s_i and t_i are as above.

As before, the observed Y can be used to guide the choice of the prior parameter values for (ν, λ, μ̄, a). The same ideas may be used with an additional consideration. In some cases, the mean-variance shift model may explain variance shifts much more so than mean shifts. To handle this possibility, it may be desirable to choose ν and λ so that s_Y is more toward the center rather than the right tail of the prior for σ. We might also tighten up our prior for μ about the average y value. In any case, it may be appropriate to explore the consequences of several different prior choices.

Finally, note that it may be desirable to further extend the above priors to let the values of ν and λ depend on features of T. For example, because more complex trees (i.e. finer partitions of the predictor space) are likely to explain more variation, it might be reasonable to consider ν and λ as decreasing functions of tree complexity.

4.2 Parameter Priors for Classification Trees

For classification trees where y_ij belongs to one of K categories C_1, ..., C_K under the generalized Bernoulli model (5), perhaps the simplest conjugate prior specification for Θ = (p_1, ..., p_b) is the standard Dirichlet distribution of dimension K − 1 with parameter α = (α_1, ..., α_K), α_k > 0, namely

  p_1, ..., p_b | T  iid ~ Dirichlet(p_i | α) ∝ p_i1^(α_1 − 1) ··· p_iK^(α_K − 1).   (15)

When K = 2 this reduces to the familiar Beta prior. Under this prior, standard analytical simplification yields

  p(Y | X, T) = ( Γ(Σ_k α_k) / ∏_k Γ(α_k) )^b · ∏_{i=1}^b ( ∏_k Γ(n_ik + α_k) / Γ(n_i + Σ_k α_k) )   (16)
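Both closed-form marginal likelihoods lend themselves to direct log-scale evaluation. The sketch below (function and argument names are ours) implements (11) for the mean-shift model and (16) for the classification model, taking the responses grouped by terminal node:

```python
import math

def log_marglik_mean_shift(node_ys, mubar, a, nu, lam):
    """Log of (11) up to the additive constant log c: the marginal
    likelihood under the mean-shift model with prior (9)-(10).
    node_ys: list of lists of y values, one list per terminal node."""
    n = sum(len(y) for y in node_ys)
    out = 0.5 * len(node_ys) * math.log(a)   # (b/2) log a
    ssum = 0.0
    for y in node_ys:
        ni = len(y)
        ybar = sum(y) / ni
        s = sum((v - ybar) ** 2 for v in y)            # (ni - 1) * sample var
        t = ni * a / (ni + a) * (ybar - mubar) ** 2
        out -= 0.5 * math.log(ni + a)
        ssum += s + t
    return out - 0.5 * (n + nu) * math.log(ssum + nu * lam)

def log_marglik_classification(node_ys, alpha):
    """Log of the Dirichlet-multinomial marginal likelihood (16).
    node_ys: list of lists of category labels in 0..K-1;
    alpha: Dirichlet parameters (alpha_1, ..., alpha_K)."""
    K = len(alpha)
    a0 = sum(alpha)
    out = 0.0
    for y in node_ys:
        counts = [0] * K
        for c in y:
            counts[c] += 1
        out += math.lgamma(a0) - sum(map(math.lgamma, alpha))
        out += sum(math.lgamma(counts[k] + alpha[k]) for k in range(K))
        out -= math.lgamma(len(y) + a0)
    return out
```

Since c in (11) does not depend on T, the first function supports comparisons between trees fit to the same data; with alpha = (1, ..., 1) the second reproduces the uniform-Dirichlet default. In both cases, nodes with more homogeneous responses score higher.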
where n_ik = Σ_j I(y_ij ∈ C_k), n_i = Σ_k n_ik, and k = 1, ..., K in the sums and products above. For a given tree, p(Y | X, T) will be larger when nodes are assigned more homogeneous values of y. To see this, note that assignments for which the y_ij's at the same node are similar will lead to more disparate values of n_i1, ..., n_iK, which in turn leads to larger values of p(Y | X, T).

The natural default choice for α is the vector (1, ..., 1) for which the Dirichlet prior (15) is the uniform. However, by setting certain α_k to be larger for certain categories, p(Y | X, T) will become more sensitive to misclassification at those categories. This might be desirable when classification into those categories is most important.

5 Stochastic Search of the Posterior

For each of the parameter prior forms p(Θ | T) proposed in Section 4, it is possible to analytically obtain p(Y | X, T) = ∫ p(Y | X, Θ, T) p(Θ | T) dΘ in (8), resulting in one of the closed forms (11), (14) or (16). Combining these with one of the CART tree priors p(T) proposed in Section 3 allows us to quickly calculate the posterior of T,

  p(T | X, Y) ∝ p(Y | X, T) p(T),   (17)

up to a norming constant.

Exhaustive evaluation of (17) over all T will not be feasible, except in trivially small problems, because of the sheer number of possible trees. (Recall that the number of possible trees is finite because we have restricted attention to a finite number of possible splitting functions.) This not only prevents exact calculation of the norming constant, but also makes it nearly impossible to determine exactly which trees have largest posterior probability.

In spite of these limitations, Metropolis-Hastings algorithms can still be used to explore the posterior. Such algorithms simulate a Markov chain sequence of trees

  T^0, T^1, T^2, ...   (18)

which are converging in distribution to the posterior p(T | Y, X) in (17). Because such a simulated sequence will tend to gravitate towards regions of higher posterior probability, the simulation can be used to stochastically search for high posterior probability trees. We now proceed to describe the details of such algorithms and their effective implementation.

5.1 Specification of the Metropolis-Hastings Search Algorithm

The Metropolis-Hastings (MH) algorithm for simulating the Markov chain T^0, T^1, T^2, ... in (18) is defined as follows. Starting with an initial tree T^0, iteratively simulate the transitions from T^i to T^(i+1) by the two steps:

1. Generate a candidate value T* with probability distribution q(T^i, T*).

2. Set T^(i+1) = T* with probability

  α(T^i, T*) = min{ [q(T*, T^i) p(Y | X, T*) p(T*)] / [q(T^i, T*) p(Y | X, T^i) p(T^i)], 1 }.   (19)

Otherwise, set T^(i+1) = T^i.

Under weak conditions (see Tierney 1994), the sequence (18) obtained by this algorithm will be a Markov chain with limiting distribution p(T | Y, X). Note that the norming constant for p(T | Y, X) is not needed to compute (19).

To implement the algorithm, we need to specify the transition kernel q. We consider kernels q(T, T*) which generate T* from T by randomly choosing among four steps:

GROW: Randomly pick a terminal node. Split it into two new ones by randomly assigning it a splitting rule according to pRULE used in the prior.

PRUNE: Randomly pick a parent of two terminal nodes and turn it into a terminal node by collapsing the nodes below it.

CHANGE: Randomly pick an internal node, and randomly reassign it a splitting rule according to pRULE used in the prior.

SWAP: Randomly pick a parent-child pair which are both internal nodes. Swap their splitting rules unless the other child has the identical rule. In that case, swap the splitting rule of the parent with that of both children.

In executing the GROW, CHANGE and SWAP steps, we restrict attention to available splitting rule assignments. By available, we mean splitting rule assignments which do not force the tree to have an empty terminal node. We have also found it useful to further restrict attention to splitting rule assignments which yield trees with at least a small number (such as five) observations at every terminal node.

The proposal above has some appealing features. To begin with, it yields a reversible Markov chain because every step from T to T* must have a counterpart that can move from T* to T. Indeed, the GROW and PRUNE steps are counterparts of one another, and the CHANGE and SWAP steps are their own counterparts. Note that although other reversible moves can be considered, we have ruled them out because their counterparts are impractical to construct. For example, we have ruled out pruning off more than a pair of terminal nodes because the reverse step would be complicated and time consuming to generate.

Another appealing feature is that our MH algorithm is simple to compute. By using pRULE to assign splitting rules in the GROW step, there is substantial cancelation between p(T*) and q(T, T*) in the calculation of α(T^i, T*) in (19). In
the PRUNE step similar cancelation occurs between p(T) and q(T*, T). In the CHANGE and SWAP steps, calculation of the q values in (19) is avoided because their ratio is always 1.

5.2 Running the MH Algorithm for Stochastic Search

The MH algorithm described in the previous section can be used to search for desirable trees. To perform an effective search it is necessary to understand its behavior as it moves through the space of trees. By virtue of the fact that its limiting distribution is p(T | Y, X), it will spend more time visiting tree regions where p(T | Y, X) is large. However, our experience in assorted problems (see Sections 6 and 7) has been that the algorithm quickly gravitates towards such regions and then stabilizes, moving locally in that region for a long time. Evidently, this is a consequence of a proposal distribution which makes local moves over a sharply peaked multimodal posterior. Once a tree has reasonable fit, the chain is unlikely to move away from a sharp local mode by small steps. Because the algorithm is convergent, we know it will eventually move from mode to mode and traverse the entire space of trees. However, the long waiting times between such moves and the large size of the space of trees make it impractical to search effectively with long runs of the algorithm. Although different move types might be implemented, we believe that any MH algorithm for CART models will have difficulty moving between local modes.

To avoid wasting time waiting for mode to mode moves, our search strategy has been to repeatedly restart the algorithm. At each restart, the algorithm tends to move quickly in a direction of higher posterior probability and eventually stabilize around a local mode. At that point the algorithm ceases to provide new information, and so we intervene in order to find another local mode more quickly. Although the algorithm can be restarted from any particular tree, we have found it very productive to repeatedly restart at the trivial single node tree. Such restarts have led to a wide variety of different trees, apparently due to large initial variation of the algorithm. However, we have also found it productive to restart the algorithm at other trees such as previously visited intermediate trees or trees found by other heuristic methods. For example, in Section 7 we show that restarting our algorithm at trees found by bootstrap bumping (Tibshirani and Knight 1996) leads to further improvements over the start points.

5.3 Identifying the Good Trees

As many trees are visited by each run of the algorithm, a method is needed to identify those trees which are of most interest. Because a long enough run of the chain is not feasible, frequencies of visits will not be useful. A natural choice is to evaluate p(Y | X, T) p(T) for all visited trees and obtain their relative probabilities. One could then choose the trees with largest posterior probability. Alternatively, in the spirit of model averaging (see Breiman (1996) and Oliver and Hand (1995)), one could approximate the overall posterior mean by the average of the visited trees using weights proportional to p(Y | X, T) p(T).

However, there is a subtle problem with using posterior probabilities to pick single trees in this context. Consider the following simple example. Suppose we were considering all possible trees with two terminal nodes and a single rule. Suppose further that we had only two possible predictors, one qualitative with two categories, the other quantitative with 100 possible split points. If the marginal likelihood p(Y | X, T) was the same for all 101 rules, then the posterior would have a sharp mode on the qualitative rule. This is because the prior assigns small probability to each individual split value for the quantitative x, and much larger probability to the single rule on the qualitative x. The problem is that the relative sizes of the posterior modes do not capture the fact that the total posterior probability allocated to the quantitative predictor trees is the same as that allocated to the single qualitative tree.

It should be emphasized that the problem above is not a failure of the Bayesian prior. By using it, the posterior properly allocates high probability to tree neighborhoods which are collectively supported by the data. This serves to guide the algorithm towards such regions. The problem is that relative sizes of posterior modes do not capture the relative allocation of probability to such regions, and so can lead to misleading comparisons of single trees. Note that this would not be a limitation for model averaging.

A natural criterion for tree selection, which avoids the difficulties described above, is to use the marginal likelihood p(Y | X, T). As illustrated in Section 7, a useful tool in this regard is a plot of the largest observed values of p(Y | X, T) against the number of terminal nodes of T, an analogue of the Cp plot (Mallows 1973). This allows the user to directly gauge the value of adding additional nodes while removing the influence of p(T). In the same spirit, we have also found it useful to consider other commonly used tree selection criteria such as residual sums of squares for regression trees and misclassification rates for classification trees. (The misclassification rate of a tree is the total number of observations different from the majority at each terminal node. When all misclassification costs are equal, it is an appropriate measure of model quality.)

6 A Simulated Example

In this section, we illustrate the features of running our search algorithm by applying it to data simulated from the regression tree used for illustration in Figure 2. The model represented
Figure 4: Simulated example, with 800 observations and true model overlaid. [Two panels, Y against X1, for X2 ∈ {A, B} and X2 ∈ {C, D}.]

Figure 5: The response against the two predictors. The mean response for the best first split is given in the X1 plot, and mean levels for each X2 in the other plot. [Panels: Y against X1 (ordinal) and against X2 (unordered, levels A-D).]
by that tree is:

  Y = f(X1, X2) + 2ε,  ε ~ N(0, 1),   (20)

where

  f(X1, X2) = 8.0 if X1 ≤ 5.0 and X2 ∈ {A, B}
              2.0 if X1 > 5.0 and X2 ∈ {A, B}
              1.0 if X1 ≤ 3.0 and X2 ∈ {C, D}
              5.0 if 3.0 < X1 ≤ 7.0 and X2 ∈ {C, D}
              8.0 if X1 > 7.0 and X2 ∈ {C, D}.

Figure 4 displays 800 iid observations drawn from this model. One of the reasons we chose this particular model is that it tends to elude identification by the greedy algorithm which chooses splits to minimize residual sums of squares. To see why, consider Figure 5, which plots Y marginally against each X. The horizontal lines on the plots are the mean levels for the split chosen by the greedy algorithm. A first split on X1 at 5 gives greater separation of means than any split on X2. The greedy algorithm is unable to capture the correct model structure, instead making a first split on X1.

We begin by applying our approach to a sample of 200 observations generated from the model (20). We used the tree prior p(T) with α = 0.95 and β = 1 for pSPLIT in (7) (see Figure 3) and the uniform specification for pRULE. Using the mean-variance shift model (4) for the terminal node distributions, we used the conjugate independence priors (12) and (13) with hyperparameter values based on the sample. The unconditional mean and variance of Y are μ̂_Y = 4.85 and σ̂_Y² = 12 respectively. The residual variance from an overfit greedy tree is roughly σ̂²_min = 4. We consequently choose μ_i | σ_i ~ N(μ̂_Y, σ_i²(σ̂_Y²/σ̂²_min)) = N(4.85, 3σ_i²) and

informative but spread over the range of Y. The prior on σ_i² has the unconditional variance in the tail and the minimum variance near the mode.

To illustrate the behavior of the MH algorithm for this setup, we report characteristics of trees we found using ten runs of 1000 iterations, each time restarting from the single node tree. The top panel of Figure 6 displays the log posterior probabilities of the sequence of trees with a line drawn at the log posterior probability of the true tree. The figure clearly illustrates the typical behavior described in the previous section; the algorithm quickly gravitates towards regions of high posterior probability and then stabilizes, moving locally within that region. The middle panel of Figure 6 displays the (marginal) log likelihoods with a line drawn at the log likelihood of the true tree. Note the variety of tree likelihoods found by the different restarts. One run, the third, seems to be trapped in a region of trees having relatively low posterior probabilities. The last two runs have led to trees with the "optimal" likelihood. Note that a single long run would be much more likely to stay trapped near a suboptimal tree. Finally, the lower panel of Figure 6 gives the corresponding number of terminal nodes, with a line drawn at the size of the true tree. The likelihood makes little distinction between trees found in the last three runs, but the posterior downweights the eighth run as it is too large. Interestingly, the dead-end runs tend to find trees that are too large.

As Figure 6 shows, multiple restarts are a crucial part of tree searches based on our algorithm. To further illustrate the need for restarts and other features of our approach, we perform a Monte Carlo comparison of several strategies, based on
Figure 6: Log marginal likelihood and number of terminal nodes for trees visited by the Metropolis algorithm.
compare our recommended strategy of restarting versus the strategy of using one long run. We also show what happens to our strategy when the SWAP step is deleted, and when the tree prior is relaxed. More precisely, on each of the 200 samples, 40,000 iterations of the following four strategies are compared. Note that the last three strategies are identical to the "base", except as noted:

Base: All four move types (GROW, PRUNE, CHANGE, SWAP) with equal probabilities, 20 restarts, 2000 iterations after each restart, a tree prior pSPLIT with α = .95, β = .5 (giving 6.5 expected terminal nodes).

No Restart: One long chain of 40,000 iterations.

No SWAP: The SWAP move is not used. The remaining moves have equal probabilities.

Loose prior: The tree prior pSPLIT is changed to α = .95, β = .25, giving 11.8 expected terminal nodes.

Each strategy is based on identical priors for μ and σ described earlier in this section. Thus each strategy is using the same marginal likelihood p(Y | X, T) as input.

The four approaches are compared in terms of the number of iterations required to correctly identify the initial splitting rule of the tree, namely X2 ∈ {A, B} vs. X2 ∈ {C, D}. Finding this split is crucial in this problem because the true structure of the tree cannot be parsimoniously captured using a different initial split. Figure 7 displays boxplots of the number of iterations to correct identification on each of the 200 samples for the four strategies. In cases where the method failed to make a correct identification in 40,000 iterations, the value was recorded as 44,000.

Figure 7 shows the Base procedure performed best. The strategy of using one long run was clearly worst; over 75% of the chains failed to identify the correct split in 40,000 steps. Omission of the SWAP step inhibits mixing, and slows the chain considerably. The SWAP step is useful here because when the correct rule is first found at a lower level, the swap allows that rule to be moved up. Lastly, in this particular
10
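The expected terminal-node counts quoted for the Base and Loose priors can be checked by simulating the tree-generating prior directly. The sketch below assumes the split probability of a node at depth d is p_SPLIT(d) = α(1 + d)^(−β), the form of the prior defined earlier in the paper (outside this excerpt); the function names are our own.

```python
import random

def terminal_node_count(alpha, beta, rng):
    """Grow one tree from the prior: a node at depth d splits into two
    children with probability alpha * (1 + d) ** (-beta); otherwise it
    becomes a terminal node. Returns the number of terminal nodes."""
    pending = [0]          # depths of nodes not yet resolved
    terminals = 0
    while pending:
        d = pending.pop()
        if rng.random() < alpha * (1 + d) ** (-beta):
            pending += [d + 1, d + 1]   # split: two children one level down
        else:
            terminals += 1
    return terminals

def mean_terminal_nodes(alpha, beta, n_sim=5000, seed=1):
    """Monte Carlo estimate of the prior's expected number of terminal nodes."""
    rng = random.Random(seed)
    return sum(terminal_node_count(alpha, beta, rng) for _ in range(n_sim)) / n_sim
```

With α = .95, β = .5 the estimate should land near the 6.5 terminal nodes quoted for the Base prior, and with β = .25 near the 11.8 quoted for the Loose prior; lowering β keeps split probabilities high at greater depths and so yields larger trees.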
[Figure 7: Boxplots, for each of the four strategies, of the number of iterations to correct identification on each of the 200 samples.]

Variable                   Code
Clump Thickness            clump
Uniformity of Cell Size    size
Uniformity of Cell Shape   shape
Bland Chromatin            bland
Normal Nucleoli            normal
Mitoses                    mitoses
Class                      class (2=benign, 4=malignant)

Table 1: Variable names, breast cancer data
trees that beat greedy trees of the same size. As mentioned in Section 5.3, we used both marginal likelihood and misclassification rates to identify promising trees. The marginal log likelihood of the data for identified trees is given in Figure 8, broken down by number of terminal nodes on the horizontal axis. Figure 9 plots misclassification rates in a similar fashion. Although similar trees are identified by all three runs, the most likely trees are visited during the run with the medium prior. Misclassification rates of 11 observations out of 683 are attained with nine or more terminal nodes. This suggests that a nine node tree may be sufficient. The likelihood prefers a 10 node tree, and may provide more discrimination in this example (since misclassifications are of the order of 10 in 683). The trees found by our procedure had lower misclassification rates than greedy trees constructed using deviance pruning with S-PLUS. The greedy trees were substantially worse, yielding misclassification rates of 30, 29, and 21 for trees with 5, 6, and 7-10 terminal nodes.

For each of the three tree priors, we proceeded to consider the best tree with 9, 10, and 11 terminal nodes, where best means the tree with the largest likelihood and the lowest misclassification rate. (Such best trees may not always exist.) This group included a variety of tree structures, both in terms of topology and splitting rules. Several of these trees were "bushy" like the tree displayed in Figure 1, and similar to the trees selected by the greedy algorithm. In contrast, others, like the one given in Figure 10, had a very unbalanced appearance. It is interesting to note that this tree has a surprisingly simple interpretation, which can be summarized as:

if (clump > 8.5) or (normal > 8.5) or (size > 4.5) or (bare > 6.5) then malignant;
else benign, unless (2.5 < bare < 4.5) and (clump > 4.5).

This last condition seems at odds with the direction of all the other rules; its removal would further simplify the classification rule.

Although restarting from the single node tree produced very satisfactory results, we also considered restarting from trees found by bootstrap bumping. To construct these bumped trees, 500 bootstrap samples of the original data were generated. For each bootstrap sample, a greedy tree with 9 nodes (and at least 5 observations in each terminal node) was grown. The 10 best trees (in terms of log likelihood) were used as starting points for our algorithm. Starting at each of these 10 trees, 25 chains of length 1000 were run, amounting to one tenth of the computational effort (10 × 25 × 1000 = 250,000 steps) spent on starting from null trees. Shorter runs were used since the bumped trees already fit the data well.

Roughly 80% of the 250 chains visited trees with better log likelihoods than the bootstrap trees from which they were started. Trees with log likelihoods as high as -62.2 were identified by this procedure. This represents a substantial improvement over the best of the 10 bumped trees, which had a log likelihood of -67.3. Even shorter runs than those reported here produced improvements on the bumped trees. Restarting at bumped trees also yielded misclassification rate improvements. The best nine node trees identified by bumping had misclassification rates of 15. From the bumped start points, misclassification rates as low as 13 were attained with eight node trees. Although these improvements are not dramatic (probably because the rates are so low to begin with), it is interesting to note that even lower misclassification rates were found by restarting at the single node tree.

Finally, we also investigated the potential for model averaging in this example. Specifically, we construct estimates of malignancy probability for each individual by averaging across trees using posterior model probabilities as weights. All models with log posterior probability within 5.43 of the most probable tree were used. This set consisted of 1152 distinct models. Weights ranged from 10^-4 to 0.0225. Trees averaged over had from four to eight terminal nodes, with the majority having six or seven. Misclassification rates of these trees ranged from 13 to 33. The posterior averages agree well with the data, having a misclassification rate of 15. In this example, there is little room for improvement, since the data is fit well by many trees. Even so, it is likely that the averaged tree would provide better out of sample prediction, being a more stable average of many trees. This is a promising area for future work.

8 Discussion

In this paper we put forward a Bayesian approach for finding CART models. Identifying each CART model by the tree T and the terminal node parameters Θ, prior specification is facilitated by considering P(T) and P(Θ | T) separately. We have proposed specifying P(T) by a tree generating process which randomly grows trees and assigns splitting rules. This process can be easily specified using simple defaults, or can be made elaborate to allow for special structure. For the specification of P(Θ | T), we have proposed a variety of conjugate forms which allow for analytical marginalization of Θ from the posterior. This marginalization substantially reduces the computational burden of our overall approach.

For stochastic search of the posterior, we have proposed a Metropolis-Hastings algorithm which moves around the posterior by gravitating towards regions of high probability. The key to our algorithm is a proposal distribution which randomly selects one of four moves in tree space: GROW, PRUNE, CHANGE or SWAP. With these moves our algorithm is easy to compute and moves quickly to posterior modes. Although the vast size of the CART model space and the multimodal nature of the posterior prevent the simulated Markov chain from converging, our experience indicates that the algorithm does succeed in finding at least some of
[Figure 8: Log integrated likelihood versus number of bottom nodes for trees found under the small, medium and large tree priors.]
the high posterior regions. To take advantage of this behavior, we have found it fruitful to repeatedly restart the search after a local mode has been found. The examples in Sections 6 and 7 illustrate that our approach is capable of finding a set of (possibly quite different) trees that fit the data well, and can outperform those trees of similar size found by alternative methods.

Finally, it may be of interest to compare our approach with the independent development of Denison et al. (1997). Although the basic thrusts are similar, there are key differences. To begin with, we use a tree generating process prior which can depend on both tree size and shape, whereas they use a simple truncated Poisson prior which puts equal weight on equal sized trees. For splitting value assignment for quantitative predictors, we recommend a discrete uniform prior on the observed predictor values, an invariant prior, whereas they recommend a continuous uniform on the predictor range, which is not invariant. For regression tree parameters, we use proper conjugate priors and then margin out analytically, whereas they use improper priors, plug-in parameter estimates, and do not completely margin.

For stochastic search, we both use Metropolis-Hastings algorithms with the same GROW, PRUNE and CHANGE steps. However, we include an additional SWAP step. (In an early version of this paper, our CHANGE step was more limited. However, we enhanced it after seeing an early version of their paper.) To cope with their continuous splitting value priors, they recommend an elaborate reversible jump algorithm (Green 1995) which requires an additional rejection step. Finally, they recommend stochastic search using a single long run preceded by a burn-in period, "after which posterior probabilities have been settled for some time". In sharp contrast, we have strongly cautioned against such a strategy, because even with our additional SWAP step, the algorithm quickly gets trapped in a local posterior mode after which it only moves locally (see Section 6). Our recommended strategy of continual restarts is deliberately designed to avoid this problem.

References

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984), Classification and Regression Trees, Wadsworth.

Breiman, L. (1996), "Bagging Predictors", Machine Learning, 24, 123-140.

Buntine, W. (1992), "Learning Classification Trees", Statistics and Computing, 2, 63-73.

Clark, L. and Pregibon, D. (1992), "Tree-Based Models", in Statistical Models in S, J. Chambers and T. Hastie, Eds., Wadsworth.

Chipman, H., George, E.I. and McCulloch, R.E. (1997), "Hierarchical Bayesian CART", Technical Report, Department of MSIS, University of Texas at Austin.

Denison, D., Mallick, B. and Smith, A.F.M. (1997), "A Bayesian CART Algorithm", Technical Report, Department of Mathematics, Imperial College, London.

George, E.I. (1998), "Bayesian Model Selection", Encyclopedia of Statistical Sciences Update, Volume 3, Wiley, New York.

Green, P. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination", Biometrika, 82, 711-732.
[Figure 9: Misclassification rate versus number of bottom nodes for trees found under the small, medium and large tree priors.]
Jordan, M.I. and Jacobs, R.A. (1994), "Mixtures of Experts and the EM Algorithm", Neural Computation, 6, 181-214.

Mallows, C.L. (1973), "Some Comments on Cp", Technometrics, 15, 661-676.

Oliver, J.J. and Hand, D.J. (1995), "On Pruning and Averaging Decision Trees", Proceedings of the International Machine Learning Conference, 430-437.

Paass, G. and Kindermann, J. (1997), "Describing the Uncertainty of Bayesian Predictions by Using Ensembles of Models and its Application", 1997 Real World Computing Symposium, Real World Computing Partnership, Tsukuba Research Center, Tsukuba, Japan, 118-125.

Quinlan, J.R. and Rivest, R.L. (1989), "Inferring Decision Trees Using the Minimum Description Length Principle", Information and Computation, 80, 227-248.

Sutton, C. (1991), "Improving Classification Trees with Simulated Annealing", Proceedings of the 23rd Symposium on the Interface, E. Keramidas, Ed., Interface Foundation of North America.

Tibshirani, R. and Knight, K. (1995), "Model Search and Inference by Bootstrap 'Bumping'", University of Toronto technical report.

Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions", Annals of Statistics, 22, 1701-1762.

Wallace, C.C. and Patrick, J.D. (1993), "Coding Decision Trees", Machine Learning, 11, 7-22.

Wolberg, W.H. and Mangasarian, O.L. (1990), "Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology", Proceedings of the National Academy of Sciences, 87, 9193-9196.
B 239/683
  clump < 8.5: B 156/600
    normal < 8.5: B 105/549
      size < 4.5: B 34/475
        bare < 6.5: B 12/449
          bare < 2.5: B 3/411
            secs < 4.5: B 0/402
            secs > 4.5: B 3/9
          bare > 2.5: B 9/38
            bare < 4.5: B 7/26
              clump < 4.5: B 0/18
              clump > 4.5: M 1/8
            bare > 4.5: B 2/12
        bare > 6.5: M 4/26
      size > 4.5: M 3/74
    normal > 8.5: M 0/51
  clump > 8.5: M 0/83

Figure 10: A nine node tree found by stochastic search. Each node shows its class label (B = benign, M = malignant) and its misclassified/total counts. The overall misclassification rate is 13, and log marginal likelihood -63.4.
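The summarized rule for the tree in Figure 10 can be written as a short predicate. A minimal sketch (the function name and signature are ours; variable codes follow Table 1, thresholds follow the summary in Section 7, and the minor secs split in the tree is ignored, as in that summary):

```python
def classify(clump, normal, size, bare):
    """Classify a case as 'malignant' or 'benign' using the summarized
    rule read off the nine node tree of Figure 10."""
    # Any one of these four conditions is enough for a malignant call.
    if clump > 8.5 or normal > 8.5 or size > 4.5 or bare > 6.5:
        return "malignant"
    # The 'unless' clause: moderate bare values combined with clump > 4.5.
    if 2.5 < bare < 4.5 and clump > 4.5:
        return "malignant"
    return "benign"
```

For example, classify(9, 1, 1, 1) returns "malignant", while classify(1, 1, 1, 1) returns "benign". Dropping the second if, as suggested in Section 7, would simplify the rule further at a small cost in accuracy.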