User Guide
Copyright 2001, Salford Systems; all rights reserved worldwide. No part of this publication
may be reproduced, transmitted, transcribed, stored in a retrieval system, or
translated into any language or computer language, in any form or by any means,
electronic, mechanical, magnetic, optical, chemical, manual or otherwise without the
express written permission of Salford Systems.
Limited Warranty
Salford Systems warrants for a period of ninety (90) days from the date of delivery that,
under normal use, and without unauthorized modification, the program substantially
conforms to the accompanying specifications and any Salford Systems authorized
advertising material; that, under normal use, the magnetic media upon which this
program is recorded will not be defective; and that the user documentation is substantially
complete and contains the information Salford Systems deems necessary to use
the program.
If, during the ninety (90) day period, a demonstrable defect in the program's magnetic
media or documentation should appear, you may return the software to Salford Systems
for repair or replacement, at Salford Systems' option. If Salford Systems cannot
repair the defect or replace the software with functionally equivalent software within
sixty (60) days of Salford Systems' receipt of the defective software, then you shall be
entitled to a full refund of the license fee.
Salford Systems cannot and does not warrant that the functions contained in the
program will meet your requirements or that the operation of the program will be
uninterrupted or error free.
Salford Systems disclaims any and all liability for special, incidental, or consequential
damages, including loss of profit, arising out of or with respect to the use, operation, or
support of this program, even if Salford Systems has been apprised of the possibility of
such damages.
Trademarks
Table of Contents
CHAPTER 1: INTRODUCTION TO MARS
ABOUT THIS USER GUIDE
Welcome to MARS, the world's first truly successful automated regression modeling
tool. Multivariate Adaptive Regression Splines was developed in the early 1990s
by world-renowned Stanford physicist and statistician Jerome Friedman, but has become
widely known in the data mining and business intelligence worlds only recently through
our seminars and the enthusiastic endorsement of leading data mining specialists. We
expect that you will find MARS an essential component in your data analysis tool kit.
MARS is an innovative and flexible modeling tool that automates the building of accurate
predictive models for continuous and binary dependent variables. It excels at finding
optimal variable transformations and interactions, the complex data structure that often
hides in high-dimensional data. In doing so, this new approach to regression modeling
effectively uncovers important data patterns and relationships that are difficult, if not
impossible, for other methods to reveal.
Although regression is one of the most widely used tools in statistical analysis, it is
almost never used in data mining. Nevertheless, a well-crafted regression model can be
ideal for predictive modeling and data mining because of the following important
characteristics:
• A regression model predicts the outcome variable by forming a weighted sum of
  the predictor variables; thus, the predicted outcome changes in a smooth and
  regular fashion as the inputs change.
Developing a good regression model is usually an extremely time intensive activity requiring
considerable modeling expertise, even for small databases. For the large databases
common in data mining projects, model-building challenges have deterred data miners
from using this otherwise very effective tool. However, with the advent of MARS, regression
models can now be routinely and automatically developed for the most complex data
structures.
2 Chapter 1: Introduction to MARS
Given a target variable and a set of candidate predictor variables, MARS automates all
aspects of model development and model deployment, including:
MARS enables you to rapidly search through all possible models and to quickly identify
the optimal solution. Because the software can be exploited via intelligent default settings,
for the first time analysts at all technical levels can easily access MARS's innovations.
MARS essentially builds flexible models by fitting piecewise linear regressions; that is,
the nonlinearity of a model is approximated through the use of separate regression slopes
in distinct intervals of the predictor variable space. An example of a piecewise linear
regression is shown below.
The slope of the regression line is allowed to change from one interval to the other as the
two knot points are crossed. The variables to use and the end points of the intervals for
each variable are found via a fast but very intensive search procedure. In addition to
searching variables one by one, MARS also searches for interactions between variables,
allowing any degree of interaction to be considered.
The optimal MARS model is selected in a two-stage process. In the first stage, MARS
constructs an overly large model by adding basis functions, the formal mechanism by
which variable intervals are defined. Basis functions represent either single variable
transformations or multivariable interaction terms. As basis functions are added, the
model becomes more flexible and more complex, and the process continues until a
user-specified maximum number of basis functions is reached.
In the second stage, basis functions are deleted in order of least contribution to the model
until an optimal model is found. By allowing for any arbitrary shape for the function as
well as for interactions, and by using this two-stage model selection method, MARS is
capable of reliably tracking the very complex data structures that often hide in high-
dimensional data.
The MARS output contains an easy-to-deploy regression model that can be simply applied
to new data from within MARS or exported as C-, SAS- or XML/PMML-compatible code.
To facilitate interpretation of the model, the output includes interpretive summary reports
as well as exportable two- and three-dimensional curve and surface plots.
The best way to learn about MARS is to use it, so let's start now!
Because MARS is such a new tool, this User Guide assumes no prior knowledge of the
methodology underlying MARS or familiarity with the output. The main body of this User
Guide, Chapters 3, 4 and 5, contains an extensive discussion of how to use the technique
and how to interpret the results.
If you have already installed MARS and would like to begin immediately, see
Chapter 6 for hands-on advice. Later you can check for the details provided
in the rest of the manual.
MARS for Windows incorporates alternative control modes that extend the program's
features and capabilities. In addition to controlling MARS with the graphical user interface
(GUI), you can also issue commands at the command prompt or submit a command file.
Chapter 6 provides a hands-on tour to introduce you to the MARS for Windows GUI, menus
and dialogs. Chapter 7 describes the situations in which you may want to take advantage
of the two alternative modes of control and provides a guide to using the command-line
and batch file features.
The current release of MARS for UNIX platforms is entirely command-line driven but all the
graphical results can be displayed on a PC using the MARS model viewer. Because the
Windows version has command analogs that are automatically displayed in the Command
Log window, users running MARS on UNIX platforms may find it instructive to start with a
Windows version. This will enable you to use the GUI to set up your MARS models, view
and learn the commands in the MARS Command Log, and save the command file (which
you can subsequently submit on the UNIX platform). In this manner, the command log
can be used as a tutor: "What would the commands have been to accomplish what I just
did in the graphical user interface?" Windows MARS can also be used to display all UNIX
results. See the sections on UNIX MARS for details.
MARS also offers an integrated BASIC programming language that allows you to define
new variables, modify existing variables, access mathematical, statistical and probability
distribution functions and define flexible criteria to control case deletion. BASIC commands
are implemented through the command interface, either interactively or via batch command
files. Appendix I provides a detailed guide to using BASIC.
If you would like to read more about MARS and its history, Appendix II contains suggestions
for further readings. In particular, for a detailed technical discussion of the MARS
methodology, see Friedman, 1991a (also available on the MARS distribution CD as a .pdf
file).
This chapter provides system requirements and instructions for installing and starting
MARS for Windows 95/98/NT/2000. For guidance on installing MARS software on a
UNIX platform, see the documentation accompanying the software.
System Requirements
To install and run MARS for Windows, the minimum hardware you need includes:
80486 processor or higher,
64 MB of random-access memory (RAM),
hard disk with 15 MB of free space for program files,
additional hard disk space for scratch files (with the required space contingent
on the size of the input data set),
CD-ROM drive, and
Windows 95/98, Windows 2000, Windows ME, or Windows NT 4.+.
For optimal performance, we strongly recommend that MARS run on a Pentium machine
with 64 megabytes of memory or more. Because MARS is CPU intensive, the faster your
CPU, the faster MARS will run.
Installation Procedure
To install MARS:
1. Launch Windows.
2. Insert the CD into the CD drive.
3. From the Start menu, select Run.
4. In the Run dialog box, type the letter of your CD drive followed by :\SETUP and
press Enter.
5. Respond to the questions about where you want to install MARS.
6. Click Finish to exit the Installer.
READ ME if reading or writing .SYS files: By default, Windows hides file types with
.sys extensions. To let Windows show SYSTAT 1.0-7.0 files (which also have
a .sys extension), select Folder Options from the View menu in Windows Explorer. Click
the View tab and check Show All Files.
READ ME if running Windows NT 4.+: If you are running Windows NT, permissions
may require modification so that MARS can write temporary files to the hard drive.
6 Chapter 2: Installing and Starting MARS
Making sure that you are using the most up-to-date file access drivers
If you have previously installed DBMS/COPY you may have an older version that does not
support recently introduced file formats. If MARS detects an older version during the
install procedure it will place a newer version in a DBMSCOPY directory beneath your
MARS directory. We recommend that you use the DBMS/COPY live update over the
web to keep your file access software current (start up DBMS/COPY, Choose "Skip This"
on the introductory menu and select "DBMSCOPY on the Web" from the help menu).
Using ODBC
To access data stored in a SQL-based system, use the ODBC Query Builder in
DBMS/COPY to extract the data you need and save it as a flat file in your preferred format.
See the accompanying DBMS/COPY User's Guide for further instructions.
Variable Names
Variable names should not begin with an underscore, a numeric value or any other non-
alphabetical character (e.g., $, &, ^) and should not contain blank spaces. When you try
to compute a model using an illegal variable, MARS will issue the following error message:
For MARS 1.x and 2.0 variable names also should not exceed eight characters. MARS
will truncate anything that exceeds this limit. If this results in duplicate variable names,
the names will automatically be modified so that each variable has a unique name. The
last character in each duplicate name is replaced with a sequential number beginning
with one. Later versions of MARS will support long variable names (up to 36 characters).
MARS 2.0 and earlier versions of MARS will not permit text variables to be used as either
target or predictor variables. You will need to convert text variables to numeric equivalents
before they can appear in models. A later release of MARS will permit use of text
variables directly; please check with Salford Systems for availability.
This chapter describes what's under the hood, beginning with why the MARS engine is
both unique and innovative. Because MARS is such a new tool, we assume no prior
knowledge of the adaptive modeling methodology underlying MARS. To put this
methodology into context, the first section discusses the modeler's challenge and
addresses how MARS meets this challenge. The remaining sections provide detailed
explanations of how the MARS model is generated, how MARS handles categorical
variables and missing values, how the optimal model is selected and, finally, how testing
regimens are used to protect against overfitting.
Because estimating the model is typically the last step in the analysis process, time
often does not permit adequate attention to be paid to each of these steps. A model is
declared final because improvements in the model become negligible or, as is usually
the case with large databases, time has simply run out. The quality of the final results is
usually highly dependent on the skill of the modeler; however, even expert modelers can
overlook important effects and be fooled by data anomalies (e.g., multivariate outliers,
data coding errors, etc.).
10 Chapter 3: MARS Basics
The modern approach to modeling is different from this conventional approach in that it is
much more data rather than user driven. Modern analysis tools, which began to emerge
in the 1980s, are very computer intensive, usually combining intelligent algorithms and
brute force searches. Examples of such tools include neural nets, genetic algorithms,
rule induction and decision trees.
Some of the modern methods, given a target variable and a set of candidate predictor
variables, let the data dictate the functional form. For example, Generalized Additive
Models (GAM) determine the functional form for a variable. Other methods are even more
fully automated. CART and MARS, for example, automatically determine both variable
selection and functional form.
Of course, it is important not to be overly data driven. A priori knowledge is very valuable
and can help shape a model when several alternatives are all consistent with the data.
Domain expertise can help detect errors (e.g., price increases should reduce quantity
demanded) and user-imposed constraints can yield better models.
Ultimately, the intended use of the final model will influence how it should be developed.
For example, if predictive accuracy is the sole criterion by which a model is to be assessed,
its complexity and comprehensibility are irrelevant. Therefore, some modern methods,
such as boosted decision trees or neural nets, that do not yield easy to understand
models, can be utilized.
On the other hand, if understanding the data generation process is important, a model
that can be easily understood is desirable. In this situation, the modeler wants to be able
to tell a story and use the insights gained to make decisions; thus, a single decision tree
or regression model can be used to yield results that represent the data and to assist in
understanding the underlying data patterns and relationships.
For smaller data sets, of course, the modeler is forced to use parametric modeling
techniques. When data points are scarce, all points will influence almost every aspect of
the model. The best example is the simple linear regression where all data points are
used to help locate the regression line.
In summary, the key strengths of global parametric models are that they can be very
accurate when developed by expert modelers and they are efficient vis-à-vis data use.
Their weaknesses include vulnerability to outliers and subtleties missed by the modeler.
Nonparametric models, in contrast, are developed locally rather than globally. The extreme,
of course, is to simply reproduce the data exactly; however, this is not a useful model! A
simplification or summary of the data is needed; smoothing is one example of such
simplification. For example, if the objective is to summarize how a target variable y
behaves in a small region of data containing low values of X1, a single value for the entire
region (e.g., median or mean) can be used or a curve, surface or regression can be fitted
to the region. Then, developing a separate summary in the remaining regions of the X1
predictor space paints a picture illustrating how y behaves in the entire range of X1
values. Painting the complete picture may require some cross-region smoothing to join
the functions in the neighboring regions.
Smoothing Techniques
Common smoothing techniques, available in several statistical packages, include:
• running mean
• running median
• Distance Weighted Least Squares (DWLS, fitting a new regression at each value
  of X and then down-weighting points by their distance from the current value of X)
• LOWESS (locally-weighted regression smooths)
• LOESS (same as LOWESS but down-weights outliers from local regressions)
Almost all smooths require the choice of a tuning parameter, typically a window size
indicating how large a fraction of the data to use when evaluating a smooth for any value
of X. The larger the window size, the less local and the more smooth the result. For
example, a median smooth using a 100% window and a very flexible smooth using 5% of
the data are illustrated below. The 5% window appears to impose too little structure on
the data points while the 100% window appears to impose too much structure.
[Figure: two panels plotting MV against LSTAT.]
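The effect of the window size can be sketched with a running median. This is a minimal illustration, not the smoother used to produce the plots: the function name running_median, the nearest-neighbor definition of the window and the data are all assumptions made for the example.

```python
from statistics import median

def running_median(x, y, window_fraction):
    """Smooth y against x with a running median.

    window_fraction is the share of the data (0 < fraction <= 1) used
    around each point; the observations nearest in x are included.
    """
    n = len(x)
    k = max(1, round(window_fraction * n))  # observations per window
    smoothed = []
    for x0 in x:
        # indices of the k observations closest to x0
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        smoothed.append(median(y[i] for i in nearest))
    return smoothed

# Made-up data for illustration only.
x = list(range(20))
y = [0, 1, 0, 2, 1, 3, 2, 4, 3, 5, 9, 8, 10, 9, 11, 10, 12, 11, 13, 12]

wide = running_median(x, y, 1.0)     # 100% window: one value everywhere
narrow = running_median(x, y, 0.05)  # 5% window: one point per window here
```

With 20 observations, the 100% window returns the overall median at every point, while the 5% window contains a single observation and simply reproduces the data.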
The size of the neighborhood can be selected by the user or can be determined via
experimentation or cross validation. A large number of weighting functions, or kernels,
are available for down-weighting far-away observations. The specific kernel used, however,
is less important than the neighborhood or bandwidth size.
K( ) is the weighting function, also known as the kernel function, that integrates to 1. The
function is typically a bell-shaped curve like the normal, so the weight declines with the
distance from the center of the interval.
The bandwidth, or size of the window about X, is represented by b. For some kernels,
data outside the window have a weight equal to zero. Within the data window, K( ) could
be a constant. The smaller the window, the more local the estimator.
As is evident from the formula, kernels are used to weight all observations in the neighborhood.
The fitted value of y corresponding to X is a weighted average of y-values. Note that X
need not be an observed data value.
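A kernel-weighted average of this kind can be sketched as follows. The sketch assumes a Gaussian kernel and the common form in which the fitted value at X is the sum of K((x_i - X)/b) * y_i divided by the sum of K((x_i - X)/b); the function names and the data are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density; a bell-shaped kernel that integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_smooth(x_data, y_data, x0, b):
    """Weighted average of y-values; weights decline with |x - x0| / b."""
    weights = [gaussian_kernel((xi - x0) / b) for xi in x_data]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, y_data)) / total

# Made-up data for illustration only.
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [2.0, 4.0, 6.0, 8.0, 10.0]

# Small bandwidth: only the two nearest points matter (X = 2.5 is not observed).
local_fit = kernel_smooth(x_data, y_data, 2.5, 0.1)
# Very large bandwidth: all points get nearly equal weight.
global_fit = kernel_smooth(x_data, y_data, 1.0, 1000.0)
```

With the small bandwidth the estimate at X = 2.5 is essentially the average of the two neighboring y-values; with the very large bandwidth it approaches the overall mean of y.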
The more local a portion of a model, the higher the variance is likely to be because the
amount of relevant data is small; however, the localization is faithful to the data and thus
minimizes bias. Given that simple global models tend to be stable but biased, and more
complex local models tend to have the reverse properties, the challenge is to find the
optimal balance between bias and variance (i.e., to minimize mean squared error [MSE]).
A classic example from insurance risk assessment illustrates this tradeoff. The assessor
is estimating the risk that a restaurant located in a small town will burn down. Because
it is located in a small town and, thus, few observations are available, the assessor
borrows data from neighboring towns. These data are perhaps less relevant, but are
used in the absence of any other information.
Quantifying the Bias-Variance Tradeoff
A squared error loss function typical for any approximation to the non-linear function
f(x) can be defined as:
MSE = Variance + Bias^2
The variance portion measures how different the model predictions would be from training
sample to training sample; in other words, it answers the question, how stable are the
results? The bias portion measures the tendency of the model to systematically mistrack.
Note that MSE is sensitive to outliers, so alternative criteria may be more robust.
The bias-variance tradeoff is real; thus, the modeler will often want to permit some bias in
the model. The availability of repeated observations for every possible value of the predictor
vector x is the only way to completely avoid bias.
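The decomposition can be checked numerically. The sketch below assumes we could observe the predictions of a model at a single x across several training samples; the function name decompose_mse and all numbers are made up for illustration.

```python
from statistics import fmean

def decompose_mse(predictions, truth):
    """Split the mean squared error of repeated predictions at one point
    into its variance and squared-bias parts."""
    mean_prediction = fmean(predictions)
    variance = fmean((p - mean_prediction) ** 2 for p in predictions)
    bias = mean_prediction - truth
    mse = fmean((p - truth) ** 2 for p in predictions)
    return mse, variance, bias ** 2

# Hypothetical predictions of f(x) = 10 from five training samples.
stable_but_biased = [8.0, 8.1, 7.9, 8.0, 8.0]       # simple global model
faithful_but_noisy = [7.0, 13.0, 9.0, 12.0, 9.0]    # very local model

mse_g, var_g, bias2_g = decompose_mse(stable_but_biased, 10.0)
mse_l, var_l, bias2_l = decompose_mse(faithful_but_noisy, 10.0)
```

In this example the stable model has almost no variance but a large squared bias, the noisy model has the reverse, and in both cases the two parts sum to the MSE.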
For example, suppose we decide to look at only two regions for each variable in a database,
values below average and values above average. Given two predictors, four regions will
need to be investigated: low/low, low/high, high/low, and high/high. Similarly, with three
variables, eight regions will need to be investigated, with 4 variables, 16 regions, etc.
Now consider 35 predictor variables: even with only two intervals per variable, 2^35 (or
about 34 billion) regions, most of which will be empty, will need to be examined!
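The arithmetic is easy to reproduce. In this short sketch the function name region_count is hypothetical; it simply raises the number of intervals per variable to the power of the number of predictors.

```python
def region_count(n_predictors, intervals_per_predictor=2):
    """Number of distinct regions when every predictor is cut into the
    same number of intervals."""
    return intervals_per_predictor ** n_predictors

for k in (2, 3, 4, 35):
    print(k, "predictors:", region_count(k), "regions")

# Allowing three intervals per variable makes the count far larger still.
three_way = region_count(35, 3)
```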
Given the number of records in most data sets, it is infeasible to approximate the function
y=f(x) by summarizing y in each distinct region of x. For some variables, two regions may
not be enough to track the specifics of the function. If the relationship of y to some x's is
different in 3 or 4 regions, for example, the number of regions requiring examination is
even larger than 34 billion with only 35 variables. Given that the number of regions cannot
be specified a priori, specifying too few regions in advance can have serious implications
for the final model.
Given these two criteria, a successful method will essentially need to be ADAPTIVE to
the characteristics of the data. Such a solution will probably ignore quite a few variables
(affecting variable selection) and will take into account only a few variables at a time (also
reducing the number of regions). Even if the method selects 30 variables for the model, it
will not look at all 30 simultaneously. Such simplification is accomplished by a decision
tree: at a single node, only ancestor splits are being considered; thus, at a depth of six
levels in the tree, only six variables are being used to define the node.
1) interpolating splines: a spline passes through every data point (curve drawing),
and
2) smoothing splines: the curve needs to be close to the data points.
To estimate the most common form, the cubic spline, a uniform grid is placed on the
predictors and a reasonable number of knots are selected. A cubic regression is then fit
within each region. This approach, popular with physicists and engineers who want
continuous second derivatives, requires many coefficients (four per region) to be estimated.
Normally, three constraints, which dramatically reduce the number of free parameters,
can be placed on cubic splines:
Piece-wise linear regression splines, the simplest version of splines, have been well
known for some time. Instead of fitting a single straight line to the data, the regression is
allowed to bend. For example, a MARS spline with three knots is illustrated on the left
with the actual data shown on the right:
[Figure: MARS spline with three knots (ESTIMATE versus LSTAT, left) and the actual data (MV versus LSTAT, right).]
A key concept underlying the spline is the knot. A knot marks the end of one region of
data and the beginning of another. Thus, the knot is where the behavior of the function
changes. Between knots, the model could be global (e.g., linear regression).
In a classical spline, the knots are predetermined and evenly spaced, whereas in MARS,
the knots are determined by a search procedure. Only as many knots as needed are
included in a MARS model. If a straight line is a good fit, there will be no interior knots. In
MARS, however, there is always at least one pseudo knot that corresponds to the
smallest observed value of the predictor; this topic is revisited below.
With only one predictor and one knot to select, placement is straightforward: test every
possible knot location and select the model with the best fit (i.e., the smallest SSE). An
additional constraint requiring a minimum amount of data in each interval can also be
imposed to prevent one knot from being placed too close to another.
In determining the exact knot location, all possible values on the real line cannot be
considered without considerable computer resources, and, in reality, only actual data
values are examined. It is also advantageous to allow points between actual data values
(e.g., mid-point) to be examined. For example, a better fit might be obtained if a change
in slope is allowed at a mid-point rather than at an actual data value.
Finding the one best knot in a simple regression is a straightforward search problem:
simply examine a large number of potential knots and choose the one with the best
R-squared. However, finding the best pair of knots requires far more computation, and
finding the best set of knots when the actual number needed is unknown is an even more
challenging task.
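The single-knot search can be sketched in a few lines. This is an illustration of the idea only, not the MARS search code: the model is assumed to be an intercept, a linear term and one hockey stick term, the helper names least_squares and best_single_knot and the min_span parameter are hypothetical, and the data are synthetic with a true knot at X=12.

```python
def least_squares(columns, y):
    """Ordinary least squares via the normal equations.

    columns is a list of predictor columns; returns (coefficients, SSE).
    """
    p, n = len(columns), len(y)
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(columns[i][r] * columns[j][r] for r in range(n)) for j in range(p)]
         + [sum(columns[i][r] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(c * col[r] for c, col in zip(coef, columns)) for r in range(n)]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, sse

def best_single_knot(x, y, min_span=2):
    """Try each interior data value as a knot and keep the smallest SSE.

    min_span data values are skipped at each end so that every interval
    contains some data.
    """
    candidates = sorted(set(x))[min_span:-min_span]
    best = None
    for knot in candidates:
        columns = [[1.0] * len(x), list(x), [max(0.0, xi - knot) for xi in x]]
        _, sse = least_squares(columns, y)
        if best is None or sse < best[1]:
            best = (knot, sse)
    return best

# Synthetic, noise-free data: slope 2 up to X=12, slope -3 afterwards.
x = [float(v) for v in range(21)]
y = [2.0 * xi - 5.0 * max(0.0, xi - 12.0) for xi in x]
knot, sse = best_single_knot(x, y)
```

Each candidate requires its own regression, which is why searching over pairs or larger sets of knots quickly becomes expensive.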
MARS finds the location and number of needed knots in a forward/backward stepwise
fashion. A model which is clearly overfit with too many knots is generated first, then
those knots that contribute least to the overall fit are removed. Thus, the forward knot
selection will include many incorrect knot locations, but these erroneous knots will
eventually be deleted from the model in the backwards pruning step (although this is
not guaranteed).
Strictly speaking, there may be no true set of knot locations as the true function may in
fact be smooth. For example, the true flat-top function illustrated below on the left has
two knots at X=30 and X=60. The observed data, displayed on the right, contain random
error. The best single knot is at X=45 and this is the knot MARS finds first.
[Figure: the true flat-top function (YACT versus X) and the observed data with random error (Y versus X).]
As the number of knots allowed in a forward search is increased, MARS finds the following
approximations to the flat-top function:
[Figure: six panels showing the MARS approximations M0Y through M5Y plotted against X.]
Thinking in terms of knot selection works very well to illustrate splines in one dimension;
however, this context is unwieldy for working with a large number of variables simultaneously.
Both concise notation and easy-to-manipulate programming expressions are required. It
is also not clear how to construct or represent interactions using knot locations.
In MARS, basis functions are the machinery used for generalizing the search for knots.
Basis functions are a set of functions used to represent the information contained in one
or more variables. Much like principal components, basis functions essentially re-express
the relationship of the predictor variables with the target variable.
The hockey stick basis function is the core building block of the MARS model and is often
applied to a single variable multiple times. The hockey stick function maps variable X to
new variable X*:
X* = max (0, X - c)
where X* is set to 0 for all values of X up to some threshold value c and X* is equal to
the amount by which X exceeds c for all values of X greater than c. For example,
consider a predictor variable X, ranging from 0 to 100. Eight basis functions, all graphed
with the same dimensions, are displayed below for c=10, 20, 30, 40, 50, 60, 70 and 80.
[Figure: basis functions BF10 through BF80 plotted against X (vertical axis: Value).]
BF10 is offset from the original value by 10 whereas BF80 is zero for most of its range.
Such basis functions can be constructed for any value of c. MARS in fact considers
constructing one for every possible data value.
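The eight basis functions can be generated with a one-line function. In this minimal sketch the function name hockey_stick and the grid of X values are chosen for illustration only.

```python
def hockey_stick(x, c):
    """Standard basis function: 0 up to the threshold c, then X - c."""
    return max(0.0, x - c)

thresholds = [10, 20, 30, 40, 50, 60, 70, 80]
x_values = range(0, 101, 10)

# One row per X value, one column per basis function BF10 ... BF80.
table = {x: [hockey_stick(x, c) for c in thresholds] for x in x_values}

for x in x_values:
    print(x, table[x])
```

At X=100, BF10 equals 90 while BF80 equals 20; at X=50, BF50 through BF80 are all still zero.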
To illustrate how MARS uses hockey stick functions to represent splines, let's look at the
Boston Housing dataset analyzed in the pricing study by Harrison and Rubinfeld (1978).
(The data set, Boston.syd, is included with your MARS software and can be found in the
Sample Datasets folder in your MARS directory.)
Harrison and Rubinfeld studied the relationship between quality of life variables and 1970
property values in Boston. The variables examined for 506 census tracts included:
Summary statistics for the Boston Housing data and a frequency bar chart for the target
variable, MV, are provided below:
[Figure: frequency bar chart of MV (Count versus MV).]
Pairwise scatter plots with smooths of the core variables, displayed below, clearly suggest
some non-normal distributions as well as non-linear relationships.
[Figure: pairwise scatter plot matrix with smooths for MV, RM, LSTAT and AGE.]
Some additional pairwise scatter plots also clearly suggest the presence of nonlinear
relationships and indicate that interaction terms are probably warranted.
[Figure: scatter plots of MV against CRIM, ZN, INDUS, NOX, RAD, TAX, PT and DIS.]
To illustrate how hockey stick functions represent splines, let's first define a hypothetical
basis function BF1 on the variable INDUS:
BF1 = max (0, INDUS - 4)
used in the regression
MV = b0 + b1*BF1
The effect of INDUS on the dependent variable is 0 for all values below 4 and b1 per unit
of INDUS for values above 4. Now consider a second basis function,
BF2 = max (0, INDUS - 8). The regression function is:
MV = b0 + b1*BF1 + b2*BF2
Summary statistics for the original INDUS variable and the two basis functions, BF1 and
BF2, are displayed below:
                 BF1       BF2     INDUS
N of cases       506       506       506
Minimum        0.000     0.000     0.460
Maximum       23.740    19.740    27.740
Mean           7.374     4.620    11.137
Standard Dev   6.569     5.378     6.860
Number=0          93       217         0
Note that the maximum values of the BFs are simply the INDUS maximum shifted down by
the knot value; however, the mean is not simply shifted because the max( ) function is
a non-linear transform of INDUS.
An alternative notation for the basis function is (X - knot)+, which has the exact
same meaning as MAX(0, X - knot).
The spline for a MARS regression of MV on INDUS with one basis function and the spline
with two basis functions are displayed below. In the one basis function model, the slope
starts at zero and then becomes -0.659 after INDUS=4. In the two basis function model,
the slope starts at 0, becomes -2.439 after INDUS=4, and then -0.224 (-2.439 + 2.215)
after INDUS=8.
[Figure: Predicted MV versus INDUS for the one basis function model and for the two basis function model.]
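The slope in each interval is the running sum of the coefficients on the basis functions whose knots have been crossed. The sketch below assumes the two coefficients are -2.439 and 2.215, the values in the parenthetical sum in the text; the function name segment_slopes is hypothetical.

```python
def segment_slopes(knots, coefficients):
    """Slope in each interval of a sum of standard hockey stick terms.

    Below the first knot the slope is 0; each knot crossed adds its
    coefficient to the running slope.
    """
    slopes = [0.0]
    for _, coef in sorted(zip(knots, coefficients)):
        slopes.append(slopes[-1] + coef)
    return slopes

# Two basis function model: knots at INDUS=4 and INDUS=8.
slopes = segment_slopes([4.0, 8.0], [-2.439, 2.215])
print(slopes)  # slope below 4, between 4 and 8, and above 8
```

The second coefficient is positive, yet the slope above INDUS=8 remains slightly negative because it is added to the slope already in effect.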
A standard basis function, (X - knot)+, does not provide for a non-zero slope for values
below the knot. To handle this, MARS uses a mirror image basis function, as shown
below on the left. The standard basis function is displayed on the right.
[Figure: mirror image basis function BF20R (left) and standard basis function BF20 (right), plotted against X.]
The mirror-image hockey stick function looks at the interval of a variable X that lies
below the threshold c. Consider, for example, BF = max (0, 20 - X) displayed below. The
left panel is for the mirror image BF and the right panel displays the basis function
multiplied by -1:
[Figure: two panels plotting BF20R and BF20RR against X.]
The basis function is downward sloping at 45 degrees, taking on the value 20 when X=0.
It declines until hitting 0 at X=20 and remains 0 for all other X. The mirror-image function
is just a mathematical convenience: with a negative coefficient it yields any needed
slope for the X interval, 0 to 20.
As displayed below, all three line segments have a negative slope even though two of the
coefficients are greater than zero:
[Figure: ESTIMATE plotted against INDUS]
By its very nature, any hockey stick function defines a knot where a regression can
change slope. Running a regression on hockey stick functions is equivalent to specifying
a piecewise linear regression. Thus, the problem of locating knots is now translated into
the problem of defining basis functions.
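The equivalence with piecewise linear regression can be checked numerically. The sketch below is illustrative only: the coefficients -2.439 and 2.215 are the two basis function values quoted earlier in this chapter, while the constant of 30 and the helper names are assumptions made for the example.

```python
def hinge(x, knot):
    """Standard hockey stick basis function max(0, x - knot)."""
    return max(0.0, x - knot)

# (knot, coefficient) pairs; the constant 30.0 is a made-up value.
CONSTANT = 30.0
TERMS = [(4.0, -2.439), (8.0, 2.215)]

def predict(indus):
    """Regression on hinge functions: constant + sum of b_k * BF_k."""
    return CONSTANT + sum(b * hinge(indus, k) for k, b in TERMS)

def slope_at(indus):
    """Slope of the fitted line: sum of the coefficients whose knot lies below indus."""
    return sum(b for k, b in TERMS if indus > k)

for x in (2.0, 6.0, 10.0):
    print(x, round(slope_at(x), 3), round(predict(x), 3))
```

Each knot adds its coefficient to the running slope, so the slope on any segment is the sum of the coefficients of the basis functions that are active there.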
As noted above, basis functions are much more convenient to work with mathematically.
For example, you can interact a basis function transforming one variable with a basis
function representing another variable. In addition, the programming code to define a
basis function is straightforward.
The functions are not all linearly independent but do increase the flexibility of the model.
For a given set of knots, only one mirror image basis function will be linearly independent
of the standard basis functions. Further, it does not matter which mirror image basis
function is added as they will all yield the same model. However, using the mirror image
instead of the standard basis function at any knot will change the model.
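One way to see why is the identity max(0, X - k) - max(0, k - X) = X - k, which holds at every knot k. Given a constant and the standard basis functions, each mirror image therefore contributes only the plain linear term in X, and once one mirror image is present the others add nothing new. The following check is a sketch with arbitrary knots and grid values chosen for illustration.

```python
def hinge(x, knot):
    """Standard basis function max(0, x - knot)."""
    return max(0.0, x - knot)

def mirror(x, knot):
    """Mirror image basis function max(0, knot - x)."""
    return max(0.0, knot - x)

knots = [4.0, 8.0, 12.5]                  # arbitrary illustrative knots
grid = [i * 0.5 for i in range(0, 61)]    # X from 0 to 30

# Largest violation of hinge - mirror = X - knot over all knots and grid points.
max_gap = max(
    abs((hinge(x, k) - mirror(x, k)) - (x - k))
    for k in knots for x in grid
)
print(max_gap)

# Two mirror images differ only by standard basis functions and a constant:
# mirror(x, k1) - mirror(x, k2) = hinge(x, k1) - hinge(x, k2) + (k1 - k2)
k1, k2 = knots[0], knots[1]
diff_gap = max(
    abs((mirror(x, k1) - mirror(x, k2))
        - (hinge(x, k1) - hinge(x, k2) + (k1 - k2)))
    for x in grid
)
print(diff_gap)
```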
To illustrate that the addition of a particular mirror-image basis function does not affect the
final model, let's force MARS to keep extra basis functions when regressing MV on
INDUS; the set of basis functions and the resulting plot are now:
BF11 = max (0, INDUS - 5.190)
[Figure: ESTIMATE plotted against INDUS]
The following results summarize a regression adding BF1R, the mirror image of BF1, and
a second regression adding BF3R, the mirror image of BF3. The two models are identical
with the exception of a shift in the estimated beta coefficients on the added mirror image
functions.
Although the coefficients for BF1 and BF3 have been shifted, the two regressions give
exactly the same predictions.
MARS searches for a pair of hockey stick basis functions, the primary and mirror image,
even though only one might be linearly independent of the other terms. This search is
then repeated, with MARS searching for the best variable to add given the basis functions
already in the model. The brute-force search process theoretically continues until every possible
basis function has been added to the model.
The MARS technology is similar to the CART methodology in that a model is deliberately
overfit and then pruned back. The core notion is that a good model cannot be built from a
forward stepping plus stopping rule; rather, the model must be generously overfit and then
the unneeded basis functions removed. However, the model still needs to be limited due
to the intensity of the search that may be required with larger datasets. For example, with
400 variables and 10,000 records, there are potentially 400*10,000 or 4 million knots to
examine just for the main effects. Even if most variables have a limited number of distinct
values (e.g., dummies only allow one knot, age may only have 50 distinct values), the
total number of possible knots will be very large.
In practice, the user specifies an upper limit for the number of knots to be generated in the
forward stage. The limit should be large enough to ensure that the true model can be
captured. A good rule of thumb for determining the minimum number is three to four times
the number of basis functions in the optimal model. This limit may have to be set by trial
and error. (See also Chapter 4 for advice on this topic.)
REGION 1010
REGION 1001
REGION 0110
The first basis function represents levels 1 and 3 (North and South), the second represents
levels 1 and 4 (North and West), and the third represents levels 2 and 3 (East and
South). In each case, MARS has found some reason to group the four levels in these
patterns. MARS can easily create conventional dummies such as 1000 (North) or 0100
(East); whether it does so depends on which dummies improve the model the most.
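A categorical basis function of this kind is simply an indicator for membership in a subset of levels. The sketch below assumes the level order North, East, South, West implied by the 1000 and 0100 examples; the function names `level_indicator` and `complement` are hypothetical and this is not MARS output.

```python
LEVELS = ["North", "East", "South", "West"]   # assumed order of the four levels

def level_indicator(pattern):
    """Build a 0/1 basis function from a bit pattern such as '1010'."""
    members = {lvl for lvl, bit in zip(LEVELS, pattern) if bit == "1"}
    return lambda region: 1 if region in members else 0

def complement(pattern):
    """Pattern of the complementary basis function, e.g. '1010' -> '0101'."""
    return "".join("0" if bit == "1" else "1" for bit in pattern)

bf = level_indicator("1010")                 # North and South
bf_c = level_indicator(complement("1010"))   # East and West

for region in LEVELS:
    print(region, bf(region), bf_c(region))
```

For every level exactly one of the two indicators equals 1, which is why a dummy and its complement carry the same grouping information.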
Theoretically, one dummy could be created for every level of the categorical predictor. In
practice, however, levels are almost always grouped together, with MARS combining
levels that are similar in context. For each dummy created, a complementary basis
function is also implicitly created (e.g., 1010 is the complement to 0101). For example,
returning to the Boston Housing data, when RAD (accessibility to radial highways) is
declared categorical in the model, MV = constant + INDUS + RAD, MARS reports the
following in the text output:
The basis functions, two of which are categorical basis functions for RAD, are displayed
below. Note that the basis functions representing the categorical predictors are not
graphed.
The constant is always entered into the model first as BF0. The next two basis functions,
BF1 and BF2, are mirror image functions for INDUS with the knot located at 8.140. The
basis function corresponding to the upper portion of the variable is numbered first. Thus,
BF1 is (INDUS-8.140)+ and BF2 is (8.140-INDUS)+.
Next, two dummies for RAD (RAD=1,4,6,24 and RAD=4,6,8) and their complements are
entered as BF3, BF4, BF5 and BF6. MARS continues to add basis functions, adding
BF7-BF15, until the maximum allowed number of basis functions is reached (in this case,
15).
Using conventional logistic regression, the original LOGIT model estimating the response
rate (the model's dependent variable) included two price variables as well as dummy variables
for levels of all categorical predictors. The final log-likelihood for this model was -133864
(df=31).
The MARS model also contains main effects but with an optimal transformation on prices,
as illustrated below. The end result is the addition of one MARS basis function to the
original model to capture the price spline. The log-likelihood for this model is -133801
(df=32; chi-square=126 on 1 df).
[Figure: two plots of ESTIMATE against EQPRICE]
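The test statistic quoted for the MARS model follows from the two log-likelihoods as twice their difference, the usual likelihood-ratio statistic. A quick arithmetic check, using only the values reported above (the variable names are illustrative):

```python
# Log-likelihoods and degrees of freedom quoted in the text.
ll_logit, df_logit = -133864, 31   # conventional LOGIT model
ll_mars, df_mars = -133801, 32     # model with one added MARS basis function

lr_chi_square = 2 * (ll_mars - ll_logit)   # likelihood-ratio statistic
lr_df = df_mars - df_logit                 # extra parameters in the larger model

print(lr_chi_square, lr_df)   # 126 1
```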
As illustrated below, the simple relationship between the price predictor variable, EQPRICE,
and the dependent variable, looks reasonably linear but in fact is not linear when the other
variables are controlled for.
[Figure: ME1DPV2 plotted against EQPRICE]
The final set of basis functions for the MARS cell choice model, which we now review one
by one, is reproduced below:
The first pair of basis functions, BF1 and BF2, is a mirror image pair for AVGBILL, the land
line phone bill:
As no other AVGBILL basis functions appear in the model, MARS located a single knot at
4.0 for this variable.
Next is an upper basis function, BF3, for USE_FEE (or monthly cost):
The absence of a mirror image function for USE_FEE implies that there is a zero slope
(read from the reported coefficient) until the knot at 5.0 and then a downward slope.
The knot for the next basis function, BF4, is placed at the minimum observed data value
for the variable CURROWN; thus, MARS wants to keep the variable linear. The minimum
observed data value is technically a knot but, practically, it is not.
The next set of three basis functions, BF6, BF7, and BF9, is similar to BF4. The next two
functions, BF10 and BF11, are missing value indicator basis functions for INCOME, which
brings us to the next topic, how MARS handles missing values.
Missing value indicators are essential to MARS modeling for two reasons. First, if a
variable X has missing values, it can only be used in the regression model when X is not
missing. MARS imposes this restriction by using the variable X only when interacted with
the X_mis=0 basis function. The basis function
X * (X_mis=0)
generates the variable X if X is not missing and zero otherwise. Thus, MARS effectively
substitutes 0 for all missing values, a method commonly used in conventional modeling.
Second, the missing value indicators are used to develop surrogate sub-models that
apply only when some needed data are missing. For example, the basis function
Z * (X_mis=1)
generates Z if X is missing and zero otherwise, allowing MARS to use Z in place of X for
cases with missing X values. Note that the coefficient for Z could be quite different from
the coefficients generated for basis functions constructed from X.
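These two constructions can be sketched as follows. Python's `None` stands in for a missing value here, and the function names and sample records are assumptions for illustration rather than code generated by MARS.

```python
def x_present_term(x):
    """X * (X_mis=0): X itself when observed, 0 when X is missing."""
    return x if x is not None else 0.0

def surrogate_term(x, z):
    """Z * (X_mis=1): Z when X is missing, 0 when X is observed."""
    return z if x is None else 0.0

# Hypothetical (X, Z) records; the second record has X missing.
records = [(3.5, 10.0), (None, 12.0), (1.0, 7.0)]

for x, z in records:
    print(x, z, x_present_term(x), surrogate_term(x, z))
```

For any record only one of the two terms can be non-zero, so each term can carry its own coefficient without interfering with the other.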
In general, if you direct MARS to generate an additive model, no interactions are allowed
between basis functions created from primary variables. MARS does not, however, consider
interactions with missing value indicators to be genuine interactions. Thus, an
additive model might contain high-level interactions involving missing-value basis functions
such as:
BF = (AGE >.) * (INCOME >.) * max (0, EDUC -12) * (EDUC >.)
This basis function creates an effect for individuals with at least some college who also
have non-missing age and income data. There is no limit on the degree of interaction that
MARS will consider when examining missing value indicators. Also, as shown in the
basis function above, the missing value indicators in interaction basis functions could
indicate data present ( > . ) or data absent ( = . ). Neither is favored by MARS; rather, the
best is entered.
Returning to the cell phone example, the BF10 through BF14 basis functions in the choice
model presented above are reproduced below:
BF10 is the data present indicator coded 1 when non-missing and 0 when missing, whereas
BF11 is the data absent indicator coded 1 when missing and 0 when non-missing. When
income is available, BF12 is positive and BF13 and BF14 are zero. When income is
missing, BF12 is zero and one of the two basis functions for REGION (BF13 or its mirror
image, BF14) is positive while the other is zero. BF13 and BF14 are thus acting as
surrogates for INCOME when it is missing.
There is no guarantee that a surrogate for all variables with missing values will be found
and kept in the final MARS model; however, MARS will search all possible surrogates in
the basis function generation stage.
If the candidate pair of basis functions contribute more when interacted with ONE basis
function already in the model, then an interaction is added to the model instead of a main
effect. Let's return to the Boston Housing data set to examine interaction terms in more
detail.
First, let's look at the main effects model using the following set of variables as candidate
predictors: CRIM, INDUS, RM, AGE, DIS, TAX, PT, and LSTAT. The forward basis function
generation begins with:
From a total of 24 generated basis functions, seven make it into the final model. RM, DIS,
PT, TAX, and CRIM all have just one basis function. Thus, each of these variables has a
region with a slope of 0. Two basis functions are included in the final model for LSTAT, a
standard and a mirror image. The main effects model has a regression R² equal to 0.841.
Let's now rerun the same model but allow MARS to search for interactions. The forward
basis function generation begins with:
The first two pairs of basis functions for LSTAT and RM are identical to those in the main
effects progression. The third pair, however, differs: (PT - 18.6)+ and (18.6 - PT)+ are
interacted with (RM - 6.431)+.
The Variable column displays the variable name, the Parent column displays the
previously-entered variable participating in the interaction, and the BsF column
displays the number of the basis function involved in the interaction.
As we've seen, MARS builds up its interactions by combining a SINGLE previously entered
basis function with a PAIR of new basis functions. The new pair of basis functions (a
standard and a mirror image) could be a previously entered pair, a new pair for an existing
model variable, or a pair for a new variable. Interactions are thus built by accretion: one
of the members of the interaction must first appear as a main effect basis function, then
an interaction can be created involving this term. The second member of the interaction
does NOT need to appear as a main effect; in fact, an analyst might wish to require
otherwise via ex post modification of the model.
Only one basis function number is listed because 12.6 is the smallest value of PT in the
data. You will see this pattern for any variable you require MARS to enter linearly, a user
option to prevent MARS from transforming selected variables.
The variable TAX is entered without transformation and interacted with the upper half of the
initial LSTAT spline, BF1. TAX is then entered again as a pair of basis functions interacted
with the LOWER half of the initial LSTAT spline, BF2.
As noted above, by default, MARS fits an additive model. Variable transformations of any
complexity are allowed but interactions are not allowed (with the exception of missing
value indicators) unless you specify otherwise.
The user can specify an upper limit to the degree of interactions to be considered by
MARS. We recommend that the following series of models should be examined:
Based on performance and judgment, the best model should then be selected from this
series. We have also experimented with combining the best basis functions from several
MARS runs, for example, combining the best set from a no-interactions model with the
best set from a model allowing two-way interactions. In some situations, selecting the
best subset of regressors from the pooled set of candidates can yield better models. See
Chapter 4, Setting Control Parameters & Refining Models, for further guidance on model
building.
Once the maximum number of basis functions has been added to the overfit model,
MARS begins pruning using the following deletion procedure:
1) starting with the largest model, MARS determines the ONE basis function which,
using a residual sum of squares criterion, hurts the model the least if dropped;
2) after refitting the now-pruned model, MARS again identifies a basis function to
drop;
3) this process is repeated until all basis functions have been eliminated.
The end result of this deletion procedure is a unique sequence of candidate models. If
there are 25 basis functions, then there are at most 25 candidate models. An alternative
to this process would be to consider all possible subset deletions, but this would be
computationally burdensome and carries a high risk of overfitting.
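The deletion procedure above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the MARS implementation: it uses plain least squares via Gram-Schmidt projection, made-up data, and hypothetical function names, and it records one candidate model per model size.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rss(columns, y):
    """Residual sum of squares after least-squares projection of y on columns."""
    resid, basis = list(y), []
    for col in columns:
        v = list(col)
        for q in basis:                      # orthogonalize against earlier columns
            c = dot(v, q) / dot(q, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        if dot(v, v) > 1e-12:                # skip linearly dependent columns
            basis.append(v)
            c = dot(resid, v) / dot(v, v)
            resid = [ri - c * vi for ri, vi in zip(resid, v)]
    return dot(resid, resid)

def backward_sequence(bfs, y):
    """Return [(names, rss)] from the largest model down to one basis function."""
    ones = [1.0] * len(y)                    # intercept column, never dropped

    def fit(names):
        return rss([ones] + [bfs[n] for n in names], y)

    active, sequence = list(bfs), []
    while active:
        sequence.append((list(active), fit(active)))
        # Drop the ONE basis function whose removal hurts the fit least.
        drop = min(active, key=lambda n: fit([m for m in active if m != n]))
        active.remove(drop)
    return sequence

x = [float(i) for i in range(10)]
bfs = {
    "BF1": [max(0.0, v - 3.0) for v in x],
    "BF2": [max(0.0, v - 7.0) for v in x],
}
y = [2.0 + 1.5 * b for b in bfs["BF1"]]      # only BF1 matters in this toy data

for names, value in backward_sequence(bfs, y):
    print(names, round(value, 6))
```

Because each step removes exactly one basis function, M basis functions yield a sequence of M nested candidate models rather than the 2^M subsets an exhaustive search would have to consider.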
On a naive R² criterion the largest model will always be best. To protect against overfitting,
MARS uses a penalty to adjust the R². The penalty is similar in spirit to AIC (Akaike
Information Criterion), but is determined dynamically from the data.
MARS automatically determines the order in which the terms are dropped. In classical
modeling, of course, there are no restrictions on the order of deletion and the analyst uses
the t-test and F-test to make such judgments.
Based on the results of his experiments, Friedman suggests that the degrees of freedom
charged per knot should be between 2 and 5. Our research, however, suggests that this
factor needs to be much higher for data mining applications and moderately higher for
market research problems. A reasonable range for the degrees of freedom is between 10
and 20 for data sets of modest size (e.g., 1,000 records with 30 variables) and between 20
and 200 for data sets typically encountered in data mining (e.g., 20,000 records and 300
variables). See Chapter 4 for further discussion of this issue.
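$$
GCV(M) = \frac{\frac{1}{N}\sum_{i=1}^{N}\left[y_i - f_M(x_i)\right]^2}{\left[1 - \frac{C(M)}{N}\right]^2}
$$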
C(M) is the cost-complexity measure of a model containing M basis functions. Note that
C(M)=M is the usual measure used in linear regression. The MSE is calculated by
dividing the sum of squared errors by N-M instead of by N.
The GCV formula enables C(M) > M; in other words, enables charging each basis function
with more than one degree of freedom.
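To make the role of C(M) concrete, the sketch below evaluates the GCV criterion in its usual form, the mean squared residual divided by the square of (1 - C(M)/N). The residual sum of squares, the sample size, and the simple cost rule C(M) = df_per_bf * M are illustrative assumptions; the exact cost function MARS uses may differ.

```python
def gcv(rss, n, c_m):
    """Generalized cross-validation: (rss / n) / (1 - c_m / n) ** 2."""
    if c_m >= n:
        raise ValueError("C(M) must be smaller than the number of cases N")
    return (rss / n) / (1.0 - c_m / n) ** 2

# Hypothetical values: 506 cases, 10 basis functions, RSS of 8000.
n, m, rss = 506, 10, 8000.0

for df_per_bf in (1, 3, 10):          # degrees of freedom charged per basis function
    c_m = df_per_bf * m               # assumed cost rule, for illustration only
    print(df_per_bf, round(gcv(rss, n, c_m), 3))
```

Holding the fit fixed, a larger charge per basis function inflates the criterion, which is what pushes model selection toward smaller models.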
The impact of the DF setting is on the final model selected and in performance measures
such as GCV. The higher the DF setting, the smaller the final model and, conversely, the
smaller the DF setting, the larger the model.
When the DF is set to 1, some basis functions will be dropped. In this case, basis
functions numbered 4, 15, 17, 19, and 23 are deleted. BF4 and BF15 are dropped because
the slope is truly 0 for RM<=6.431. BF17 is dropped because a mirror image TAX basis
function, BF11, is included in the model. Finally, BF19 and BF23 are dropped because
they are redundant--the CRIM mirror image is already included.
The table below reports the size of the final model selected by MARS for the Boston
Housing data set when all variables are allowed into a main effects model:
The df charged per knot is clearly vital, so how should you decide on the optimal setting?
You can specify the setting manually or use one of the two automated testing methods:
This chapter provides practical advice on setting MARS control parameters and guidance
on how to refine your MARS models.
MARS models can be shaped and refined using the following techniques, each of which
can influence the final model:
• changing the number of basis functions generated in the forward stage
• forcing variables into the model
• forbidding transformation of selected variables
• placing a penalty on the number of distinct variables (in addition to the number of
  basis functions)
• specifying a minimum distance (or minimum span) between knots
• allowing select interactions only
• modifying MARS search intensity
• manually selecting a model other than the optimal model from the selector
A rule of thumb is that this maximum should be at least two to four times the size of the
true model; thus, if previous experience suggests that a robust model has approximately 35
predictors, the maximum number of basis functions should be set to at least 70 and more
likely 100.
The larger this maximum is set, the longer a MARS run will take. MARS will attempt to
create as many basis functions as are allowed even if there is no sensible way to create
that many. You should specify the maximum number judiciously, as MARS will take your
limit literally.
The limit should be reassessed when you increase the maximum number of allowable
interactions. A main effects model can only search one variable at a time so the number
of possible basis functions is limited by the number of distinct data values; however, a
two-way interaction model has many more possible functions. To ensure that both
interactions and main effects are properly searched, the maximum number might need to
be increased.
40 Chapter 4: Setting Control Parameters and Refining Models
y = constant + bZ
and save the residuals, e. Then use e as the target variable in your MARS model and
specify all other variables, including Z, as candidate predictors. Note that Z needs to be
included as a legal 2nd stage regressor to capture non-linearity. Subsequent releases of
MARS will allow direct forcing of user-specified variables.
The penalty was originally introduced to deal with multicollinearity. Suppose, for example,
that X1, X2, and X3 are all highly correlated. If X1 is entered into the model first and there
is a penalty on added variables, MARS will lean towards using X1 exclusively instead of
some combination of X1, X2, and X3. If the correlation between the variables is quite high,
there will be little lost in fit as a result. The penalty can also be used to encourage
parsimonious models containing few variables, though they might contain many basis
functions.
Consider, for example, a simple model with one predictor variable LSTAT. MARS might
generate a function like that shown below.
60
50
40
MV
30
20
10
0
0 10 20 30 40
LSTAT
The function traces out a very jagged relationship between LSTAT and MV with many sign
changes. If MARS selects such a model, it will have strong support in the data.
Nevertheless, this degree of flexibility may be undesirable in many applications and a
smoother albeit less locally-accurate model might be preferred.
An effective way to restrain knot placement, i.e., to make MARS less locally adaptive, is
to specify a moderately large minimum span or, equivalently, a minimum number of
observations between knots. If, for example, the minimum number is set to 100, there
must be at least 100 observations (as opposed to data values) between knots.
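As a rough sketch of what a minimum-span constraint does (the selection rule below is an assumption made for illustration, not the documented MARS algorithm, and `candidate_knots` is a hypothetical name), candidate knots can be thinned so that a required number of observations separates neighboring knots:

```python
def candidate_knots(values, min_span):
    """Keep a sorted observation as a knot only if at least min_span
    observations lie strictly between it and the previously kept knot."""
    ordered = sorted(values)
    knots, last_index = [], None
    for i, v in enumerate(ordered):
        far_enough = last_index is None or i - last_index - 1 >= min_span
        if far_enough and (not knots or v != knots[-1]):
            knots.append(v)
            last_index = i
    return knots

# Ten hypothetical observations, requiring 3 observations between knots.
print(candidate_knots(range(10), 3))        # [0, 4, 8]
# A larger minimum span leaves fewer places where the slope may change.
print(candidate_knots(range(10), 8))        # [0, 9]
```

The count is taken over observations rather than distinct data values, so heavily repeated values use up the span just as distinct ones do.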
For data mining applications, settings as high as several hundred or more may be appropriate
to restrain the adaptiveness of MARS. Even if true wobbles exist in the data, a high
setting may also be useful as a simplifying constraint.
If you simply permit MARS to introduce interactions into a model, MARS will consider all
interactions between basis functions constructed from any variables. Depending on your
type of problem, you may wish to forbid some specific interactions. MARS allows you to
either exclude a variable from any interaction or to specify in detail precisely which
interactions are allowed and which are forbidden.
The Interactions dialog (in the Model Setup command center) provides a matrix with all
candidate predictor variables appearing in both row and column headers. Any cell in this
matrix can be set to disallow a local interaction; for example, an interaction between
INDUS and RAD may be disallowed in any context (2-way, 3-way, etc.).
Specific variables can be easily excluded from all interactions by checking the variable
row in the Non-interacting column of the Variables dialog. Thus, for example, INDUS
could be prohibited from interacting with any other variable while interactions involving other
variables in the model remain allowed.
Developing a robust MARS model is a sequential process not unlike developing a parametric
regression model. The difference is that MARS does much of the work for you. We
recommend you start by not allowing any interactions; this default will develop a main
effects model that will be the most easily understood. For such a model, MARS will
search for an optimal transform for each variable via its decomposition into basis functions,
and the final model will be a sum of coefficients multiplying single basis functions.
Examine the main-effects model: what story does it tell? Is it plausible? How does the
model fit the data? As there is no guarantee that a main-effects model will be adequate for
predictive accuracy or a faithful representation of the true data generation process, it is
necessary to then experiment with higher-order interactions. Allow two- and three-way
interactions to see if the model fit can be substantially improved.
1) Allow MARS to grow a model with the interaction level set to the depth of a
satisfactory CART tree; that is, if the CART tree (after reasonable pruning)
reaches a depth of six for a substantial number of cases, then let MARS
search for up to 6-way interactions
2) Develop a hybrid CART-MARS model in which the CART terminal nodes (which
capture the complex interactions) are entered as eligible predictor variables in a
main-effects MARS model.
The CART-MARS hybrid model is discussed in detail in our white paper on the topic,
available on our web site as a set of PowerPoint slides.
pN²M⁴
where p is equal to the number of variables, N is equal to the sample size, and M is equal
to the maximum number of allowed basis functions. Intelligent programming reduces the
M⁴ to M³, but this is still a very heavy computational burden.
To reduce compute times further, MARS allows intelligent search strategies that reduce
the running time to a multiple of M². Once the model has grown to a reasonable size,
speed is gained by not testing every possible knot in every variable. Potential knots that
yielded very low improvements on the last iteration are not reevaluated for several cycles.
(Model performance is not likely to change quickly, especially when the model is already
large.)
The results CAN DIFFER if the speed setting is decreased but should be relatively similar.
Given a choice between using a smaller data set and lower speed setting (higher search
intensity) or larger data set and higher speed setting (lower search intensity), we recommend
the latter. The gain from using more training data typically outweighs the potential loss
from a less-thorough search.
Our experience to date also suggests caution when using the highest speed setting. It is
definitely worthwhile to check near final models with lower speed settings to ensure that
nothing of importance has been overlooked.
A speed setting of 1 forces MARS to test every possible knot at every forward step. While
this ensures that MARS will miss nothing in its model building phase, it is also extremely
slow on large databases. For real world problems we recommend the speed setting of 4;
it offers judicious search, high speed, and additional protection from overfitting. However,
for small, and especially for artificial, data sets, if you wish to ensure that MARS does find
the literally true model, a speed setting of 1 is called for.
and can be found in this form in the Basis Function tab on the model summary. The
programming code for creating the basis functions and generating the fitted values can be
cut and pasted into commonly-used statistical packages and database management
tools.
Note that the final model equation, illustrated below for the Boston Housing model reviewed
earlier, specifies which basis functions are used directly in the model. Some basis functions
are used only to create other functions but do not enter the model directly. For example,
BF2 enters only indirectly in the construction of BF7 and BF8.
The challenging part of MARS work is determining what needs to go into the model
components. Once defined, the formula can simply be applied to new data. The final
model equation, however, may not lend itself to easy interpretation. An examination of
the ANOVA and variable importance tables as well as the curve and surface plots will
assist you in understanding the final MARS model.
46 Chapter 5: Interpreting Model Results
After the MARS model is estimated, a Text Report appears in the MARS Report window
and the Results dialog opens. The Results dialog displays a Model Summary Table,
Anova Decomposition Table, Variable Importance Table, Final Model Table, Basis Functions,
Gains and Lift charts, and the 2-D and 3-D curve and surface plots.
While the Text Report will primarily be of use to experts, the history of the forward stepping
may be of use to anyone curious about the model generation process. To access the Text
Report, close or minimize the Results dialog box. A hyper-linked outline of the report
contents is displayed in the left panel. Click on Learn Sample Stats, Forward Knot
Placement, Final Model, ANOVA Decomposition, Variable Importance, Basis Functions
or OLS Predicted Results to view that section of the output in the right panel. Alternatively,
use the vertical scroll bars to browse the output.
You may also want to save a copy of the output as a permanent record of
your analysis. To save the report, select Save Mars Report from the File
menu. In the Save Report to File dialog specify a file name and directory.
By default, the text file will be saved with a .dat extension. You can also
copy and paste sections of the report into another application or to the
clipboard.
Model Summary
The GUI output contains the core summary information you will want to examine. As
shown below for the Boston Housing example, the Model Summary Table gives an overview
of the variables, terms and parameters in the final MARS model and reports both R-
square and mean-square measures.
Target variable        Name, min, max, mean and variance of the target variable
Direct variables       Number of variables used to construct basis functions,
                       not counting missing value indicators
Total variables used   Total number of variables used to construct basis functions
Terms in the model     Number of coefficients in the MARS model excluding intercept
Effective parameters   Number of effective parameters (based on number of terms in the
                       model, number of knots, and degrees of freedom charged per knot)
Naive                  R² value for regression equation using final MARS model
Naive-Adjusted         Adjusted R² for naive regression model
GCV R-Square           1 - Final-GCV / Initial-GCV
Naive MSE              Mean-squared error from regression equation above
MARS GCV               Penalized mean-squared error
The Text Report also includes learning sample summary statistics: mean, standard
deviation, number of observations with valid values, and sum for continuous variables and
min, max, 25th, 50th and 75th percentiles for all variables. For a detailed explanation of the
Forward Stepwise Knot Placement Report, see Chapter 3.
ANOVA Table
Because MARS models are made up of pieces of variables or basis functions, the
conventional regression output can be challenging to interpret. Each subinterval of the
piecewise model has its own coefficient, and the larger picture may not be evident from
such a report. To provide another perspective on the model, MARS produces the ANOVA
Table in both the Text Report and in one of the Model Results dialogs.
The ANOVA Table summarizes the model and streamlines the output by aggregating the
basis functions into groups involving the same raw variables. For example, if three basis
functions were generated for AGE, they would be combined together into a single main
AGE effect and the contribution of the group would be reported. Similarly, if two basis
functions involving the interaction of AGE and EDUC were also present in the model,
another entry in the ANOVA table would report the combined contribution of these two
basis functions.
MARS constructs the ANOVA table by first rearranging the final MARS model so that all
basis functions involving any ONE variable are grouped together. In some cases, only one
basis function is used. This means that the variable enters with a possible upper or lower
threshold.
When more than one basis function is involved, a nonlinear transformation involving a
change in slope has been found. This group of functions represents a non-parametric
approximation to the best transformation MARS can find for that variable. Although there
is no theoretical upper-limit on the number of basis functions that might be needed to
properly represent the transformation of a variable, most applications (with the exception
of those involving time series data) will require relatively few basis functions.
The ANOVA table then continues with entries for all basis functions involving the same
PAIR of variables. These represent the optimal transform corresponding to the interaction
of those two variables. A larger number of basis functions is usually required for such
aggregates. Again, there will be one such collection for every distinct pair of variables
appearing in two-way interactions. Thus, if AGE and EDUC interact and AGE and INCOME
interact, each pair will appear in the ANOVA table as a separate entry (or row).
As illustrated below for the Boston Housing example, the first two columns in the ANOVA
table list the basis function collection number and the standard deviation of the collection.
The larger the standard deviation, the greater the contribution to the overall explanatory
power of the model.
The third column (labeled GCV in the text report and Cost of Omission in the GUI table)
displays the contribution of the collection of basis functions as measured by the resulting
loss of fit if that entire collection were to be deleted from the model. The next two columns
list the number of basis functions aggregated into the collection and the number of effective
parameters (or, similarly, the total degrees of freedom charged for the collection). The
final column lists the variable names involved in that entry, one name for main effects, two
for bivariate interactions, three for three-way interactions, etc.
Basis Functions
These are the transformed predictors used by MARS. Most basis functions represent a
region of one of the raw variables and there will be as many basis functions for a variable
as there are distinct regions needed in the model. Some basis functions will be missing
value indicators and others will involve interactions between two or more variable regions.
You can easily export the basis functions to a text file by selecting Export Rules... from
the File menu, or by clicking your right mouse button and selecting Export.
The exported basis function code is directly compatible with many statistical package
programming languages, as well as with database management tools for generating fitted
values and for forecasting. Contact Salford Systems for products to export to
XML/PMML.
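To give a feel for what scoring with basis functions involves, here is a minimal Python sketch. It is not the syntax MARS exports; the variable names (AGE, INCOME), the knot at 30, and all coefficients are hypothetical and chosen only to show a region-based basis function, its mirror image, and a missing-value indicator feeding a model equation.

```python
def score_record(record):
    """Score one record with a hypothetical MARS-style model."""
    # Basis functions: each one covers a region of a raw variable.
    bf1 = max(0.0, record["AGE"] - 30.0)   # AGE above a (hypothetical) knot at 30
    bf2 = max(0.0, 30.0 - record["AGE"])   # AGE below the same knot
    # Missing-value indicator for INCOME (1 when the value is missing).
    bf3 = 1.0 if record["INCOME"] is None else 0.0
    # Model equation: a constant plus weighted basis functions
    # (all coefficients are made up for illustration).
    return 10.0 + 0.5 * bf1 - 0.2 * bf2 + 1.5 * bf3


records = [
    {"AGE": 40.0, "INCOME": 52000.0},
    {"AGE": 20.0, "INCOME": None},
]
scores = [score_record(r) for r in records]
print(scores)
```

The same pattern (compute each basis function, then apply the regression equation) carries over to whatever language you run the exported code in.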
Gains Chart
Once you have generated or selected a MARS model, the summary reports include Gains,
Lift, and Cumulative Lift graphs and tables. The training data are scored using the current
MARS model, which provides a predicted value for every record. The data are then sorted in
descending order by this score and divided into ten equal-sized groups (deciles). The Gains
table reports the results, displaying the average predicted and actual target values within
each decile.
If you use the selector to choose a different model from the backwards deletion sequence,
the summary reports will all update automatically. You can display summary reports for
any number of sub-models at the same time for easy comparisons.
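The construction of the Gains table can be sketched in a few lines of Python. This is an illustration of the steps described above (score, sort descending, cut into ten groups, average within each group), not the program's own code; the function name, the way ties and uneven group sizes are handled, and the sample data are all assumptions.

```python
def gains_table(predicted, actual, n_bins=10):
    """Sort records by predicted score (descending), cut them into n_bins
    groups of near-equal size, and average predicted and actual per group."""
    pairs = sorted(zip(predicted, actual), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        group = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue  # fewer records than bins
        rows.append({
            "bin": b + 1,
            "n": len(group),
            "mean_predicted": sum(p for p, _ in group) / len(group),
            "mean_actual": sum(a for _, a in group) / len(group),
        })
    return rows


# Hypothetical scores and 0/1 target values for 20 records.
predicted = [i / 20 for i in range(20)]
actual = [1 if i >= 12 else 0 for i in range(20)]
rows = gains_table(predicted, actual)
for row in rows:
    print(row)
```

A model that ranks records well shows a high average actual value in the first groups and a low one in the last groups.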
52 Chapter 5: Interpreting Model Results
The three buttons, [Gains], [Lift], and [Cum. Lift], switch the graph among the three
views.
The two buttons, [Continuous] and [Binary], switch the graph between the two
views.
Remember that the binary option is a reporting feature; it does not change
the way in which MARS generates its models. Theoretically, it should be
possible to get more accurate models by running a logistic regression
version of MARS. Such a version is planned for future release.
ACTUAL CLASS      Class level. Using a default threshold of .5, MARS will consider
                  all target values (actual or predicted) less than .5 to
                  correspond to non-response (Class 0) and values greater than
                  .5 to correspond to the response (Class 1).
PREDICTED 0       Number of cases classified as non-response (Class 0) by
                  Actual Class.
PREDICTED 1       Number of cases classified as response (Class 1) by Actual
                  Class.
TOTAL CASES       Total number of cases in the class.
PERCENT CORRECT   Percent of cases for the class that were classified correctly.
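The table entries defined above can be reproduced with a short Python sketch. It assumes a 0/1 target and a list of model scores; the function name, the dictionary layout, and the sample values are hypothetical. A value exactly equal to the threshold is treated as non-response here, which is one reasonable reading of the default rule.

```python
def prediction_success(actual, predicted, threshold=0.5):
    """Build a 2x2 prediction success table for a 0/1 target.

    Values greater than the threshold count as Class 1; everything else,
    including a value exactly at the threshold, counts as Class 0.
    """
    table = {0: {"predicted_0": 0, "predicted_1": 0},
             1: {"predicted_0": 0, "predicted_1": 0}}
    for a, p in zip(actual, predicted):
        actual_class = 1 if a > threshold else 0
        predicted_class = 1 if p > threshold else 0
        table[actual_class][f"predicted_{predicted_class}"] += 1
    for cls, row in table.items():
        row["total_cases"] = row["predicted_0"] + row["predicted_1"]
        correct = row[f"predicted_{cls}"]
        row["percent_correct"] = (
            100.0 * correct / row["total_cases"] if row["total_cases"] else None
        )
    return table


# Hypothetical data; note that scores may fall below 0 or above 1.
actual = [0, 0, 0, 1, 1, 1]
predicted = [0.1, 0.6, -0.05, 0.7, 0.4, 1.2]
table = prediction_success(actual, predicted)
for cls in (0, 1):
    print(cls, table[cls])
```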
Like any regression model MARS predicts a continuous score. For the binary target
variable the score will be like a probability, but some scores (for the least likely outcomes)
may be negative, and others (for the most likely outcomes) may be greater than 1. To
convert a MARS model score to a class assignment we have to decide how high the
MARS score needs to be to count as a 1 (i.e., yes or response). We accomplish this
conversion by selecting a threshold. For example, we might specify that a score greater
than 0.5 counts as a response and anything else as a non-response; this is the default in
MARS. Frequently, however, a different threshold would be beneficial. For example, a low
threshold such as .3 would end up classifying many more records as a response than
would the 0.5 threshold. For some low-cost marketing campaigns, approaching
prospects with a modest chance of response might still be quite profitable. A low threshold
might also be used if the costs of missing a potential response far outweigh the costs of
misclassifying a non-responder as a responder. While changing the threshold from .5
might increase the absolute number of mistakes made, great benefits might be obtained
from using a different threshold.
Even if the costs of misclassification are not very different between responders and non-
responders, you may still want to tune the balance between the errors in each class. The
threshold table is designed to help you decide on a threshold. For each of the 100
thresholds between 0.00 and 1.00 the table displays the breakdown of the training data
among true negatives, false negatives, false positives, and true positives. Studying the table
can help you identify the best cutoff for overall classification accuracy and help you trade off
one type of error for the other.
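A threshold table of this kind can be sketched as follows. The grid of thresholds (.01 to .99 in steps of .01), the function name, and the sample data are assumptions made for illustration; the logic simply repeats the class assignment at each threshold and counts the four outcomes.

```python
def threshold_table(actual, predicted, thresholds):
    """For each threshold, count true/false negatives and positives."""
    rows = []
    for t in thresholds:
        tn = fn = fp = tp = 0
        for a, p in zip(actual, predicted):
            if p > t:          # classified as response
                if a == 1:
                    tp += 1
                else:
                    fp += 1
            else:              # classified as non-response
                if a == 1:
                    fn += 1
                else:
                    tn += 1
        rows.append({
            "threshold": t, "tn": tn, "fn": fn, "fp": fp, "tp": tp,
            "pct_correct": 100.0 * (tn + tp) / len(actual),
        })
    return rows


# Hypothetical 0/1 targets and model scores.
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0.05, 0.2, 0.35, 0.55, 0.3, 0.45, 0.8, 1.1]
thresholds = [i / 100 for i in range(1, 100)]
rows = threshold_table(actual, predicted, thresholds)
best = max(rows, key=lambda r: r["pct_correct"])
print(best)
```

Picking the row with the highest overall percent correct is only one possible rule; when the two kinds of error have different costs, you would weight the false positive and false negative counts accordingly.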
The portion of the Threshold Table shown above displays the threshold values of .40 to
.50. You can see that as the threshold is decreased from .50 down to .40, the % Correct
Overall increases. If you look more closely, the % Correct 1 is increasing faster than the
% Correct 0 is decreasing.
Three-dimensional, rotatable surface (or contour) plots depict the relationship between a
pair of predictor variables and the target variable. In addition, you can enable Mesh,
Shaded, Contour and/or Zones by clicking [Mesh], [Shaded], [Contours] and [Zones].
To rotate the 3-D plots, click the rotation buttons in the lower-right corner.
Example 2-D and 3-D plots from the Boston Housing example discussed earlier are
displayed below.
If you instruct MARS not to use any interactions in the model, the graphs will be 2-D
graphs. If two-way but not higher-order interactions are permitted, the graphical output is
likely to be a mixture of 2-D and 3-D plots. In those rare cases in which MARS cannot find
any valid predictive model, no graphical output is produced. Similarly, if the final model
contains three-way and higher-order interactions exclusively, no graphs are produced.
(Future releases will allow you to display 2-D and 3-D slices from the higher-dimensional
interactions.)
In most business applications, such as predicting response rates to a direct mail offer,
transforms are typically depicted as one of the following six types:
More complex functions are of course possible, but the simpler transforms are both more
likely and more plausible in business applications. In contrast, in physical science
applications, the true relationship between the target variable and its predictors can be
extraordinarily complex; MARS is capable of tracking such functions regardless of the
number of twists and turns involved.
To export a 2- or 3-D graph, click on the graph (which will activate a black box around the
graph indicating that it has been selected) and select Export Graph... from the File
menu. As illustrated below, in the Export Graph dialog, select one of the five graphical file
format options from the drop-down menu, enter a file name and directory location, and
click [Save].
A key parameter governing this process is the Degrees of Freedom penalty applied to knots.
Typically, the higher the penalty the smaller the final model will be. MARS allows you to
manually set the DF penalty, or to estimate it via a testing procedure (cross validation or
random setaside of records for test). Regardless of how the DF penalty is set, there will be
room for uncertainty regarding the best MARS model, as we cannot know the optimal DF
penalty with certainty.
In MARS 1.0, if you leaned towards a smaller model than the one automatically selected
by MARS for any reason, your only option was to increase the DF penalty and rerun the
analysis. This was a trial and error process because there was no way of knowing how
much smaller a model would become once the DF penalty was increased. MARS 2.0
makes selection of a model from the backwards deletion sequence a simple matter of
clicking on a row in the model selector.
To illustrate this, open the BOSTON.SYD data and set up a model with MV as the target
variable and all other variables as predictors. Note that in the bottom right hand corner of
the model dialog you see an [All Models] and a [Best Model] button. Click on [All
Models] and you will see the selector displayed below after MARS completes its
analysis.
The selector lists every model MARS identified in the backwards-stepping procedure (15
models in this example). The first column in the selector lists the number of basis
functions in the model, which declines from 15 to 1 by one basis function per row. The second
column lists the number of variables (not basis functions) being used in the model. A
model with many basis functions could easily be built up out of just a few variables if each
variable contained many knots. By checking this column you can see where the model
gets smaller by dropping a variable rather than one of several knots in a variable. The
GCV, the mean squared error as adjusted by the DF penalty, is usually large for the most
overfit model, declines to a minimum at the MARS optimal model, and then rises again as
the model is trimmed back too far. As noted above, if we knew the DF penalty with
certainty the MARS optimal model would be best, but because we don't, there is room for
analyst judgment in model selection.
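For reference, the GCV criterion in Friedman's original formulation of MARS can be written as follows. Whether this release computes GCV in exactly this form is an assumption on our part; the symbols N, M, d, and C(M) are notation introduced here for illustration.

```latex
\mathrm{GCV}(M) \;=\;
  \frac{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{f}_M(x_i)\bigr)^2}
       {\bigl(1 - \tilde{C}(M)/N\bigr)^2},
\qquad
\tilde{C}(M) \;=\; C(M) + d \cdot M
```

Here N is the number of records, \hat{f}_M is the model with M non-constant basis functions, C(M) is the number of linearly independent basis functions including the constant, and d is the DF penalty. The numerator is the ordinary mean squared error; the denominator shrinks as the effective number of parameters grows, which is why GCV rises again for large models even though the raw error keeps falling.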
To select a model with seven basis functions, double-click the 7-basis-function row in
the selector, or highlight it and then click the [Select] button in the lower right-hand corner.
You will now obtain a complete set of MARS reports for the 7-basis-function model.
Suppose you place great importance on small models with few predictors. The previous
table and graph show that you could prune the model back to four basis functions involving
only four distinct variables to achieve a GCV R-squared of 0.811. If you think the reduction
in model size compensates for the loss of accuracy, or if you simply believe that the
smaller model is likely to be better in a changing environment you can override MARS and
select this model. By double clicking on the row with four basis functions you make this
the new model; all reports will reflect this. Visit any summary tab and you will see that
the results are for the smaller model.
To score new data with a selected model other than the MARS optimal model you will need
to save the basis function code and run it through a data-processing step. See the readme
files that came with your software for further details and breaking news on improvements to
the MARS scoring capabilities.
To open a Selector file you have previously saved, select Open|Selector... from the File
menu. In the Open MARS Selector dialog box, specify the name and directory location of
the .SLC file and click on [Open]. Windows-compatible selector files are produced by all
versions of MARS, including UNIX versions. To view models generated by a UNIX MARS
just download the selector file to a PC, and load using your PC MARS.
Opening a Selector file in subsequent sessions allows you to continue your exploration of
detailed and summary reports for each of the models in the sequence; however, reopening
the file does not reload the model setup specifications in the GUI dialogs. To save your
model setup specifications, save the settings in a command file prior to exiting MARS.
The commands, by default stored in MARS's command log, can be accessed by selecting
Open Command Log from the View menu (or by clicking the Open Command
Log toolbar icon). To save the command log, select Save from the File menu. To then
reload your settings in the Model dialog, simply submit the command log. The last set of
model setup commands in the command file appears in the tabbed Model Setup dialogs.
For more on using the Command Log, see the section titled Command Log Control in
Chapter 6.
This chapter provides a hands-on tour of the graphical user interface: menus, commands,
and dialogs. You will learn how to set up a MARS analysis, set optional control
parameters, save your MARS models, and apply them to new data.
Getting Started
Most predictive modeling techniques give more satisfactory results if the input data are
properly cleaned. MARS is no exception to this general rule. Data cleaning involves
correcting data entry errors, resolving inconsistencies, and, in particular, capping, removing,
or replacing outliers. Univariate outliers (i.e., outliers relative to a single variable) can be
detected via straightforward descriptive statistics produced optionally in every MARS run.
Multivariate outliers are of less concern and can be detected in CART runs (as outliers are
typically isolated in small terminal nodes). Once the data are clean and have passed
elementary tests of soundness, MARS modeling can begin. (Note: CART is more resistant
to data problems than MARS. If you have serious concerns regarding your data quality
and reliability, be sure to conduct some CART analyses as well.)
1. Select Open from the File menu (or click on the Open File icon in the toolbar).
(To change or set default input and output directories, select Options... from the
Edit menu and select the Directories tab in the Options dialog.)
2. In the Open Data File dialog, select the data set's file format and then browse for
the file. MARS for Windows reads over 80 different file formats.
After you open a file, the Model Setup dialog automatically opens. The MARS Report
window appears in the background with Report Contents in the left panel and text output
in the right panel. The initial text output contains the variable names, the size of the file,
and the number of records read.
Variables Dialog
The Variables dialog, shown below, allows you to specify target and predictor variables,
specify the weight variable, indicate categorical variables, and identify predictor variables
that should be considered non-transforming and/or globally non-interacting.
62 Chapter 6: Hands-On Tour of Graphical User Interface
• To specify the target variable, highlight the name from the variables list and
  click on [Select] in the Target Variable box.
  See the section below titled Binary Target Variables for additional information
  on this subject. The resulting tables, dialogs, and graphs generated by these
  two options are discussed in further detail in Chapter 5.
• To specify a weighting variable, highlight the variable name and click on [Select]
  in the Weighting Variable box.
• To select predictor variables, highlight the variables from the variables list and
  click on [Select Predictors]. (Standard Windows conventions can be used to
  select more than one predictor at a time.)
• To restrict the list of candidate predictor variables, once specified, double click on
  the check box in the Exclude column for the variables you wish to exclude (or
  alternatively highlight the variable in the predictor variable list and click on [Remove
  Predictor]).
• To specify variables which, if they are part of the model, should be entered into
  the model as linear functions only (i.e., no transformation), double click the check
  box in the Non-transforming column for each variable. This option can be applied
  to both continuous and categorical variables.
• To specify variables which, if they are part of the model, should not be allowed to
  interact with any other variables, double click the check box in the Non-interaction
  column for each variable.
• Note that the Variables tab (and also the other four model setup dialogs) contains
  an [Auto-Save Model] button in the lower-left corner. At any point during the
  model setup, you can create a special model file in which the MARS model will
  be saved for later application to new data. This file will always have a .MDL
  extension.
When you first open a database the Variables tab is colored red, meaning that
you are not ready to start an analysis until information on this tab is provided.
Once you have selected your target variable and the candidate predictors you are
ready to begin analysis and the Variables tab turns black; you do not need to
touch any other MARS setting.
We recommend that at the very least you consider how many forward steps
MARS should be allowed to take. Otherwise, you do not need to fuss with any
other settings to get useful results. But please keep reading!
Binary Target Variables
MARS now contains some new features to assist binary response modeling. To invoke
them, be sure to check the binary box in the model setup dialog (see Variable setup
dialog screen shot above).
If you select the binary target variable option, MARS will generate reports on the assumption
that the target is either 0 or 1 or a probability between 0 and 1. Using a default threshold
of .5, MARS will consider all target values (actual or predicted) less than .5 to correspond
to non-response and values greater than .5 to correspond to response. After classifying
all predicted and actual values as either 0 or 1, a prediction success (confusion matrix)
table is produced cross classifying actual vs. predicted class assignments.
It will often be necessary to adjust the threshold dividing 0 from 1 to obtain sensible
results. Use the slider found in the post-processing dialogs to change the threshold as
needed. A table showing all four cells of the prediction success matrix (true positive, false
positive, true negative, false negative) for each threshold from .01 to .99 can help you
decide quickly on the best threshold setting. Details of the binary results are discussed
in Chapter 5.
Remember that the binary option is a reporting feature; it does not change the
way in which MARS generates its models because the underlying algorithm is
not specifically adapted to the problem. Theoretically, it would be possible to
get more accurate models by running a logistic regression version of MARS.
Such a version is planned for future release.
Interactions Dialog
By default, MARS will consider all possible interactions between selected predictor variables.
The Interactions dialog allows you to limit or restrict pair-wise interactions between any
variable and any other variable in the model. This dialog, accessible only after predictor
variables are selected in the Variables dialog, displays the selected predictor variables in
a matrix, with each predictor appearing as both a row and a column header. Note that if
you checked the Non-interaction box in the Variables dialog for a particular predictor
variable, that particular variable will not appear in the Interactions dialog.
To disallow a pairwise interaction, double-click (or highlight and click on [Uncheck]) the
cell that corresponds to the variable pair you would like to restrict. In the Interactions
dialog displayed above, for example, unchecking the cell corresponding to the variable ZN
in the 2nd row and INDUS in the 4th column would disallow any interaction between these
two variables.
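Conceptually, the Interactions dialog maintains a symmetric table of allowed variable pairs. The Python sketch below shows one way to represent such a table; it is only an illustration, and the predictor list, function names, and data structure are hypothetical rather than anything MARS exposes.

```python
def allow_all(predictors):
    """Start with every distinct pair of predictors allowed to interact."""
    return {
        frozenset((a, b)): True
        for i, a in enumerate(predictors)
        for b in predictors[i + 1:]
    }


def disallow(allowed, a, b):
    """Uncheck one cell; the order of the two variables does not matter."""
    allowed[frozenset((a, b))] = False


def is_allowed(allowed, a, b):
    return allowed[frozenset((a, b))]


# Hypothetical predictor list with ZN second and INDUS fourth.
predictors = ["CRIM", "ZN", "NOX", "INDUS"]
allowed = allow_all(predictors)
disallow(allowed, "ZN", "INDUS")
print(is_allowed(allowed, "INDUS", "ZN"), is_allowed(allowed, "CRIM", "ZN"))
```

Using an unordered pair as the key reflects the fact that disallowing ZN with INDUS is the same restriction as disallowing INDUS with ZN.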
Select Dialog
The Select dialog allows you to build a model on a subset of your data. Selection criteria,
which can be specified using any variable in the database, are constructed as follows:
1. Select a variable from the variable list and double-click to add that variable to the
Select text box.
2. Select one of the logical relations by clicking on its radio button.
3. Enter a numerical constant in the Value text box.
4. Click on [Add to List] to accept the constructed criteria (similarly, use [Delete
from List] to remove them).
5. Repeat the process to create compound selection criteria involving several
variables.
For an example, see the Select dialog above. To analyze records with AGE less than or
equal to 50, double click on AGE, click on =<, enter 50 in the Value box and click on [Add
to List]. For complex subset selection it is easier to use the built-in BASIC programming
language to delete unwanted records. (See Appendix I for details.)
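The effect of a list of selection criteria can be pictured with the following Python sketch. It is not the built-in BASIC language and not MARS syntax; the relation labels other than =<, the function name, and the sample records are hypothetical, and combining multiple criteria with a logical AND is an assumption.

```python
import operator

# Hypothetical relation labels; only "=<" is taken from the example above.
RELATIONS = {
    "=": operator.eq, "<>": operator.ne,
    "<": operator.lt, "=<": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def select_records(records, criteria):
    """Keep records that satisfy every (variable, relation, value) criterion."""
    kept = []
    for record in records:
        if all(RELATIONS[rel](record[var], value) for var, rel, value in criteria):
            kept.append(record)
    return kept


records = [
    {"AGE": 30, "EDUC": 12},
    {"AGE": 50, "EDUC": 16},
    {"AGE": 65, "EDUC": 10},
]
subset = select_records(records, [("AGE", "=<", 50)])
print(subset)
```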
Speed Setting. As discussed in Chapter 4, the speed acceleration setting controls the
trade-off between speed and accuracy in MARS's search for the best model. At each step
in model construction, MARS assesses a potentially very large number of basis functions
for entry into the model. With a high-speed setting (e.g., set to 4 or 5), MARS only
searches the best candidates (the top-ranked candidates from previous steps). The
exact number of candidates that MARS considers is progressively reduced with each
higher-speed setting. Every so often, MARS will search all candidate basis functions and
update its list of top contenders, giving every basis function a chance to be selected.
What should you do if you get different models using different speed settings? The final
choice of the model is up to you. Because the speed setting simply influences how
thorough MARS is in its search for the best basis functions, it is possible that a less-
thorough search would yield a better model.
The count of variables in the model is entirely distinct from the count of basis functions. A
variable could be represented by just one basis function or by dozens. The added variable
penalty induces MARS to make do with fewer variables but the final model could have
quite a few basis functions.
The default setting is no penalty. Other choices, which can be set by clicking on the
appropriate radio button, are moderate (0.05), heavy (0.10), and a user-specified penalty,
which cannot exceed 1.0. The best value depends on the specific situation. You will
probably have to experiment to find the best setting for your problem.
Maximum Basis Functions. The maximum basis functions setting specifies how many
forward steps MARS takes to generate its maximal model. Usually each step adds two
basis functions, so a limit of 40 basis functions would be reached with 20 forward steps.
The default maximum basis functions is 15 and can be modified with the up/down arrows
or by entering a new value. See Chapter 4 for guidance on setting this parameter.
Maximum Interactions. The maximum interactions control limits the highest degree of
interaction MARS can consider. The default setting is 1, which disallows true interactions;
a setting of 2 would allow 2-way interactions, 3 would permit triple products of basis
functions, and so on. The interactions setting can be increased or decreased with the
up/down arrows or by entering a new value. See Chapter 4 for guidance on setting this
parameter.
Number of Records to Process. The number of records to process setting enables you
to analyze a subset of records rather than the entire data set initially read in by MARS
when the file was opened. The default setting of 0 indicates no limit. To impose a limit,
use the up/down arrows or simply enter the record limit in the box.
Minimum Observations Between Knots. The last control in the options and limits tab is
the number of observations required between knots, or window/bandwidth size. By default,
MARS allows a knot to be generated at every observed data value; this default allows the
MARS regression to change slope or direction anywhere and as often as the data dictate.
You can make MARS less locally adaptive by increasing the number of data points required
between knots. Setting this parameter to a value like 20, 50, or 100 can be very useful in
data mining applications. Use the up/down arrow keys to change the setting, or enter a
new value directly.
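One simplified way to picture this control is as a thinning of the candidate knot locations. The sketch below is an assumption-laden illustration, not the algorithm MARS uses: it works on distinct observed values rather than raw observations, and the function name and the spacing argument are hypothetical.

```python
def candidate_knots(values, spacing=1):
    """Return candidate knot locations from the distinct observed values,
    keeping every `spacing`-th one (spacing=1 keeps them all)."""
    if spacing < 1:
        raise ValueError("spacing must be at least 1")
    ordered = sorted(set(values))
    return ordered[::spacing]


observed = [7, 3, 1, 9, 5, 2, 10, 4, 8, 6]
print(candidate_knots(observed))             # a knot is possible at every value
print(candidate_knots(observed, spacing=3))  # far fewer places to change slope
```

With fewer candidate knots the fitted curve has fewer places where it can change slope, which is what makes the model less locally adaptive.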
Testing Dialog
The Testing dialog, displayed below, allows you to control the number of degrees of freedom
charged for knot optimization. The penalty is used to reflect the fact that MARS conducts
an extraordinarily intensive search to find the knots, creating a risk of overfitting. Penalizing
each knot helps MARS select an honest rather than overfit model. The degrees of freedom
can be fixed by you, or automatically computed using either cross validation or an
independent test sample.
The default degrees of freedom penalty setting is fixed at 3. This setting can be changed
to any non-negative number by entering a new value. Setting the penalty higher will tend
to favor a smaller optimal model.
You can ask MARS to use a test method to estimate a degrees of freedom penalty for
your data. If you select cross validation, the default v-fold factor is 10, which means that
MARS will conduct ten extra modeling runs just to select a defensible degrees of freedom
penalty. Cross validation can be quite time consuming and run times increase steadily
with the number of folds. 20-fold cross validation will take at least twice as long as 10-fold,
and 10-fold cross validation will take ten times as long as fixing the degrees of freedom
penalty manually.
If you opt to estimate the degrees of freedom using a test sample, MARS will randomly
set aside approximately every Nth observation. By default, MARS will use about half the
sample for testing, but you can choose 1/3, 1/4, 1/5, etc. The resulting test set is used in
a single test run to estimate the degrees of freedom.
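A random set-aside of this kind can be sketched in Python as follows. The function name, the default seed value, and the per-record coin flip are assumptions for illustration; the point is that each record lands in the test sample with probability 1/N, so the test sample size is only approximately the requested fraction, and that the split is repeatable only when the same seed is used.

```python
import random


def split_train_test(n_records, test_fraction=0.5, seed=12345):
    """Randomly assign record indices to a learn sample or a test sample."""
    rng = random.Random(seed)  # the seed value is arbitrary
    train, test = [], []
    for i in range(n_records):
        (test if rng.random() < test_fraction else train).append(i)
    return train, test


train, test = split_train_test(1000, test_fraction=1 / 3)
print(len(train), len(test))

# Same seed, same split; a different seed generally gives a different split.
same_train, same_test = split_train_test(1000, test_fraction=1 / 3)
other_train, other_test = split_train_test(1000, test_fraction=1 / 3, seed=999)
```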
As discussed earlier, the larger the degrees of freedom charged for each knot (or basis
function), the smaller the resulting MARS model. Thus, you can influence the size and
complexity of your model by setting this parameter higher or lower. If MARS is estimating
null models or models containing only a handful of basis functions, you can encourage
larger models by fixing the degrees of freedom parameter at a low number such as 1, 2 or
3. Conversely, if MARS is including too many parameters or knots for your liking, you can
force a smaller number by setting this parameter quite high. Using values as high as 10,
30 or even 200 may be necessary to estimate a model of suitable size.
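The way the penalty steers model size can be illustrated with a small calculation. The sketch assumes GCV in Friedman's published form, with each non-constant basis function charged one parameter plus the DF penalty; the function names, the record count of 100, and the sum-of-squared-errors figures for each model size are all made up for illustration.

```python
def gcv(sse, n_records, n_basis, df_penalty):
    """GCV with effective parameters = (basis functions + constant)
    + df_penalty * basis functions (an assumed, Friedman-style formula)."""
    effective = (n_basis + 1) + df_penalty * n_basis
    if effective >= n_records:
        return float("inf")
    return (sse / n_records) / (1.0 - effective / n_records) ** 2


def best_size(sse_by_size, n_records, df_penalty):
    """Return the model size (number of basis functions) with the lowest GCV."""
    return min(
        sse_by_size,
        key=lambda m: gcv(sse_by_size[m], n_records, m, df_penalty),
    )


# Hypothetical training SSE for models with 0 to 6 basis functions.
sse_by_size = {0: 1000.0, 1: 700.0, 2: 560.0, 3: 500.0, 4: 470.0, 5: 455.0, 6: 448.0}
for penalty in (1, 3, 10):
    print(penalty, best_size(sse_by_size, n_records=100, df_penalty=penalty))
```

With these made-up numbers the selected size shrinks from four basis functions at a penalty of 1, to three at the default of 3, to one at a penalty of 10.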
If you allow MARS to determine the true degrees of freedom parameter, you will
arrive at a very defensible model. Lowering the DF penalty could yield a model
that will not stand up to new data. Increasing the DF penalty will yield a deliberately
cut-back model that omits statistically defensible terms.
You always have the option of estimating All Models, which will allow you to
manually select different sized models using the backwards deletion sequence.
Regardless of which model you select manually, however, the official MARS optimal
model will have been determined by the DF penalty. See the discussion of the
MARS Model Selector for more details on manual model selection.
Edit Options
Before computing your model, you have the option of resetting default Report Preferences
settings, the random-number seed (used for cross validation and random test runs), and/
or the default directories. If you're in the Model Setup dialog and need to access this
dialog before computing a MARS model, click on [Continue] before selecting Options
from the Edit menu.
Reporting Dialog
The Reporting dialog allows you to control some classic output details. You can include
summary statistics for all model variables, summary plots, and the use of exponential
notation for values near zero. The default settings can be changed by clicking on the
radio button next to each item. You can change the settings for each MARS run as well
as save new default settings by clicking [Save As Defaults].
two identical analyses involving the random number generator could produce somewhat
different results. There is nothing wrong with such an outcome; you just need to be
aware that it could happen.
Directories Dialog
The Directories dialog allows you to set default locations for input (data, models, commands),
output (models, predicted data files, reports) and temporary files. All input and output files
are initially set to the directory housing MARS and the temporary directory is set to your
machine's temporary Windows directory. To change any of the default directories, click
on the [...] button next to the appropriate directory and specify a new directory in the
Select Default Directory dialog.
To save the optimal model after the model is computed, select Save MARS Model
from the File menu to access the Save Model to File dialog and follow the same steps as
above. At present ONLY the optimal model can be saved to an MDL file for scoring. Basis
function code needs to be used to score other manually selected models.
Both a data file and a model file must be active and all variables needed to construct the
basis functions must appear in the data file. The predicted values can be saved in one of
over 80 different file formats. To specify the file name and format, click on [Set Output
File] and specify the file name, directory and format in the Save Results to File dialog.
After the data, model and output files are specified, click on [OK] to compute the predicted
responses. The predicted response is saved in a column named ESTIMATE on the output
file. A summary dialog like the following appears:
The summary results dialog displays the data, model and output file names; the number
of records read, the number of records with a missing target variable, and the number of
records used; and the minimum, maximum, mean and standard deviation for the predicted
outcome variable ESTIMATE.
Two additional tabs are found in this results dialog, the Gains tab and the
Prediction Success tab. The results information contained in these tabs is
identical to that described earlier in Chapter 5. See the sections titled
Gains Chart and Prediction Success Table for further details.
Interactive Example
Let's now walk through a step-by-step example using data extracted from the March
Current Population Survey of 1985. (These data are available in the Sample Data folder
installed with MARS.) The target variable is the log of wage earned at the respondent's
current job. The independent variables are years of education (ED), dummy variables for
living in the South (SOUTH), race non-Caucasian and non-Hispanic (NONWH), race Hispanic
(HISP), female (FE), married (MARR), represented by a union (UNION), industry dummies
for manufacturing and construction (MANUF, CONSTR), occupational dummies for
managerial, sales, clerical, service and professional (MANAG, SALES, CLER, SERV,
PROF), years of work experience (EX) and age in years (AGE).
Follow the five steps below to estimate a MARS model with main effects only:
1) From the File menu, select Open. In the Open dialog, select files of type Systat
(.syd) in the Files of Type drop down. Browse for CPS85B.syd and click OPEN.
2) In the Variables tab of the Setup dialog, select LNWAGE as the target variable.
3) Next select ED, SOUTH, NONWH, HISP, FE, MARR, EX, UNION, AGE, MANUF,
CONSTR, MANAG, SALES, CLER, SERV, PROF as candidate predictors (total
of 16 predictors).
4) Specify SOUTH, NONWH, HISP, FE, MARR, UNION, MANUF, CONSTR,
MANAG, SALES, CLER, SERV, and PROF as categorical variables in the
Categorical column.
5) Click Compute Model.
By default, MARS searches for a maximum of 15 basis functions in the first stage of the
model building process and uses a degrees of freedom penalty of 3 in the second stage to
prune the potentially overfit model. The other key default settings include minspan (minimum
number of observations between knots) tuned automatically to the size of the data set
(the literal parameter setting is 0) and maximum interactions equal to 1 (i.e., no
interactions).
Let's start by examining the Forward Stepwise Knot Placement report shown below.
The constant is always entered into the model first as BF0. The next two basis functions,
BF1 and BF2, are mirror image basis functions for ED with the knot located at education
equal to 11 years. Similarly, the third and fourth basis functions are mirror image basis
functions for EX with the knot at 12 years of work experience. Next, dummies for FE,
UNION, SERV, MANAG and PROF and their complements are entered as BF5-BF14.
The final basis function entered, BF15, is the dummy for SOUTH. Note that the complement
(South=0) is not entered because MARS reached the maximum allowed number of basis
functions.
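The mirror-image pairs described above are simple to compute by hand. The Python sketch below builds the pair for ED (knot at 11) and the pair for EX (knot at 12); the function name and the sample values are hypothetical, while the knots and the BF numbering follow the report discussed in this example.

```python
def mirror_pair(x, knot):
    """Return the two mirror-image basis functions for one knot:
    (amount by which x exceeds the knot, amount by which x falls short)."""
    return max(0.0, x - knot), max(0.0, knot - x)


# A hypothetical respondent with 14 years of education and 8 years of experience.
bf1, bf2 = mirror_pair(14, 11.0)   # ED:  BF1 = max(0, ED - 11), BF2 = max(0, 11 - ED)
bf3, bf4 = mirror_pair(8, 12.0)    # EX:  BF3 = max(0, EX - 12), BF4 = max(0, 12 - EX)
print(bf1, bf2, bf3, bf4)
```

For any value of the variable, at most one member of a pair is non-zero, so each member can carry its own slope on its own side of the knot.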
The final model selected, the one with the lowest GCV score, keeps eight of the 15 basis
functions added in the forward-stepping stage and has a GCV R-squared of 0.316.
The set of final basis functions and the regression equation for this model are shown
below:
Only the upper portion of the set of mirror image basis functions for ED (BF1) is retained
in the final model while the reverse is true for EX (BF4). The absence of mirror images
indicates that there is a zero slope for the lower portion of ED and the upper portion of EX,
as shown in the graphs below. Thus, education does not have a significant positive effect
on wages until after 11 years, whereas the effect of additional years of work experience
ends at 12 years.
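The mirror image pairs described above can be sketched as two hinge functions that share a knot. The knots (11 for ED, 12 for EX) come from the text; the function names are hypothetical and no model coefficients are assumed.

```python
def hinge_upper(x, knot):
    """Upper mirror-image basis function: max(0, x - knot)."""
    return max(0.0, x - knot)


def hinge_lower(x, knot):
    """Lower mirror-image basis function: max(0, knot - x)."""
    return max(0.0, knot - x)


def bf1(ed):
    """Upper portion of ED (knot at 11), the half retained in the final model."""
    return hinge_upper(ed, 11.0)


def bf4(ex):
    """Lower portion of EX (knot at 12), the half retained in the final model."""
    return hinge_lower(ex, 12.0)


for ed in (8, 11, 14):
    print("ED =", ed, "BF1 =", bf1(ed))
for ex in (5, 12, 20):
    print("EX =", ex, "BF4 =", bf4(ex))
```

BF1 is zero up to 11 years of education and rises linearly afterward; BF4 is nonzero only below 12 years of experience, so it contributes nothing beyond the knot.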
[Graphs: Basis Function 1: ED; Basis Function 4: EX]
The remaining six basis functions in the model are the complementary basis functions
created for dummy variables FE, UNION, SERV, MANAG, PROF and SOUTH. The
coefficients in the MARS model equation indicate whether each of these has a positive or
negative effect on wages. For example, BF5, an indicator for not female, is positively
related to wages with a coefficient of +0.197. Graphs are not created for categorical basis
functions.
Let's now see if we can improve the model by allowing MARS to include two-way interactions:
1) Open the Setup dialog again by selecting Set Up Model from the Model menu (or
clicking the Set Up Model toolbar icon).
2) In the Options and Limits dialog, increase Maximum Interactions to 2.
3) Click Compute Model.
As shown in the Forward Stepwise Knot Placement report below, the first eight basis functions
added to the model are identical to those added in the main effects model.
Interaction terms do not enter the model until BF9 and BF10; here, MARS has determined
that adding an interaction between SERV and UNION=1 (a main effect already in the
model) results in the greatest reduction in GCV. Note that UNION is interacted with
SERV=1 (BF9) and its complement, SERV=0 (BF10).
The remaining two pairs of basis functions are also interaction terms. Interactions between
a new pair of mirror image basis functions for ED with a knot at 12 and UNION=0 are
entered as BF11 and BF12. Next, interactions between the (ED-11.000)+ spline and the
pair of MANUF dummies enter the model followed by an interaction between PROF and
the complementary dummy for FE.
In the final model, the GCV R-squared is slightly higher and the number of basis functions
remains the same as in the main effects model but now includes five interaction terms.
Look at the graphs for the ED interaction terms between UNION and MANUF. Because
the interaction terms involve a categorical basis function, the graphs are only 2-D and
display the effect on the dependent variable ONLY when the dummy condition is met. For
example, the interaction graph for ED and MANUF shows the effect of ED on LNWAGE
when MANUF=0. The small negative slope, -0.059, indicates that the positive effect of
ED on LNWAGE is not quite as strong when MANUF is equal to zero.
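An interaction between a spline and a dummy can be sketched as a simple product. The example below assumes that the slope quoted in the text, -0.059, is the coefficient on the (ED-11.000)+ spline multiplied by the MANUF=0 indicator; the function names are hypothetical.

```python
COEF = -0.059  # slope quoted in the text, assumed to be this term's coefficient


def ed_manuf_interaction(ed, manuf, knot=11.0):
    """(ED - knot)+ multiplied by the indicator for MANUF == 0."""
    spline = max(0.0, ed - knot)
    indicator = 1.0 if manuf == 0 else 0.0
    return spline * indicator


def contribution(ed, manuf):
    """Contribution of the interaction term to predicted LNWAGE."""
    return COEF * ed_manuf_interaction(ed, manuf)
```

The term is zero whenever MANUF=1 or ED is at or below the knot, which is why the graph shows the effect only when the dummy condition is met.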
[Graphs: BFs 11 & 12, ED * UNION; BF 13, ED * MANUF]
How do you use the Report Writer? It's easy! One way is to copy certain reports and
diagrams to the Report window as you view the model in the results dialog or Selector
windows.
Once a model has been built, a Model results dialog appears allowing you to explore the
model and its performance with a variety of graphic reports and diagrams. Virtually any
graph, table, grid display, or set of basis functions can be copied to the Report Writer.
Simply right-click the item you wish to add to the Report Writer and it will appear at the
bottom of the Report window.
MARS also produces classic output for those users more comfortable with a text-based
summary of the MARS model and its performance. To add any (or all) of MARS classic
output to the Report Writer window, highlight text in the classic output window, copy it to
the Windows clipboard (Ctrl+C), switch to the Report Writer window and paste (Ctrl+V)
at the point you want text inserted. This way you can combine those MARS elements
you find most useful, either graphic in nature and originating in the Model results dialog,
or textual in nature from the classic output, into a single custom Report.
78 Chapter 6: Hands-On Tour of Graphical User Interface
Default Options
In the Set Report Options dialog, the currently-selected reporting items and the Automatic
Report checkbox can be saved as a default group of settings for future MARS sessions
by clicking the [Set Default] button. These default options will then persist from session
to session since they are saved in the MARS.INI file. You may recall these settings any
time with the [Use Default] button.
Pre-configured Reports
Additionally, MARS can produce a stock report with the click of a button. Select
components of MARS output that are the most useful to you on the Report|Set Report
Options... dialog. The stock report will be the same for all models in the session until you
visit the Set Report Options dialog again. (In addition, the currently-open selectors are
listed and individual ones can be excluded or added to the list that will appear in the report
when Report|Report All is selected.)
You can then generate a stock report for the currently active (i.e., foreground) Model
results window or Selector by choosing Report|Report Current. If the active window is
not a Model results window or Selector window, Report Current will be disabled.
Furthermore, if you have several Selectors and their associated Model result windows
open, you can generate a report for all the models (in the order in which they were built) by
choosing Report|Report All.
The Report|Report All and Report|Summary only work for Selectors (and
their child Model results windows). If you have grown a series of Best
Models, then you must add each one's results to the report separately by
bringing it to the foreground and selecting Report|Report Current.
If you want a report to be produced for every model that is built without having to explicitly
request it each time, check the Automatic Report box on the Set Report Options dialog.
From that point on, each model will have a stock report created for it as soon as it is built.
To save a report to a file, use the File|Save As... option. The contents of the Report
Window can be saved in four formats: Salford Systems Report (.ssr), rich text format (.rtf),
and text or text with line breaks (.txt). The .ssr format is the most compact but can only
be read by Salford Systems modules such as MARS. Rich text (.rtf) can be read by most
other word processors and maintains the integrity of any graphics imbedded in the report.
The text formats do not retain graph and diagram images or table formatting.
It is possible to cut and paste to/from the Report Window and other Windows documents,
such as Microsoft Word, Notepad, Wordpad, etc. To select the entire report quickly and
drop it into another Windows application, use Ctrl+A (shortcut for Edit|Select All),
Ctrl+C (copy to clipboard), move to the other application and paste.
The Data Viewer window is opened by selecting the View|View Data menu item or
clicking on the View Data toolbar icon (it looks like a little spreadsheet).
Only one data file can be displayed at a time.
On-line Help
The Help menu provides comprehensive on-line information concerning MARS menus,
commands, BASIC programming, and frequently-asked questions.
For an outline of the topics in the help file, select Index from the Help menu. Place the
mouse pointer over the topic of interest and press enter. A discussion of the topic is
displayed on the screen. To print the topic, click Print.
Alternatively, select Help Topics to see a detailed list of index entries. Type the first few
letters of the word you're looking for or use the scroll bar to review the list. For further
instructions on using on-line help, select Using Help from the Help menu.
The About MARS for Windows selection displays information about the version number,
preprocessor and tree-building work space, available disk space and free memory.
This chapter describes the situations in which a Windows user may want to take
advantage of the two alternative modes of control in MARS, command-line and batch,
and provides a guide to using these two control modes. For users running MARS on a
UNIX platform, this chapter contains a detailed guide to command syntax and options and
describes how the Windows version may assist you in learning the command-line language.
Avoiding Repetition
You may need to interact with several dialogs to define your model
and set model estimation options. This is particularly true when a model has a large
number of variables or many categorical variables, or requires that more than just a few
options be set to build the desired model. Suppose that a series of runs is to be
performed, with little variation between them. A batch command file, containing the
commands that define the basic model and options, provides an easy way to perform
many MARS command functions in one user step. For each run in the series, the core
batch command file can be submitted to MARS, followed by the few graphical user interface
selections necessary for the particular run in question.
Creating an Audit Trail
The Command Log window can help you create an audit trail when
one is needed. Imagine not being able to reproduce a particular analysis track, perhaps
because the specific set of options used to create a model (e.g., the name of the data set
itself) was never recorded. The updated command log provides you with the entire
command set necessary to exactly reproduce your analysis, provided the input data do
not change.
Small BASIC programs are defined near the beginning of your analysis session, after you
have opened your dataset but before you estimate (or apply) the model and usually before
defining the list of predictor variables. BASIC is powerful enough that in many cases
users do not need to resort to a stand-alone data manipulation program. See Appendix I
for more on BASIC.
84 Chapter 7: Command-Line Control and Batch Mode
Command-Line Mode
Choosing Command Prompt from the File menu allows you to enter commands directly
from the keyboard. Switching to the command-line mode also enables you to access the
integrated BASIC programming language. See Appendix I for a detailed description of the
BASIC programming language.
Command Log
Most GUI dialog and menu selections have command analogs that are automatically sent
to the Command Log and can be viewed, edited, resubmitted and saved via the Command
Log window. When the command log is first opened (by selecting Open Command
Log from the View menu), all the commands for the current MARS session are displayed.
Subsequently, by selecting Update Command Log from the View menu, the most
recent commands are added to the Command Log window.
After computing a MARS model, the entire set of commands can be archived by updating
the command log, highlighting and copying the commands to the clipboard (or saving
directly to a text file), then pasting into your text application. Alternatively, you can edit
the text commands, deleting or adding new commands, and then resubmit the analysis
by selecting either Submit Window or Submit Current Line to End from the File
menu.
To submit an existing batch file, choose Submit Command File from the File menu. In
the Submit a File dialog that appears, specify the ASCII text file from which command
input is to be read and then click on [Submit]. To facilitate multiple MARS runs, the
MARS results are directed only to the MARS report window in text form (i.e., the GUI
Results dialog does not appear).
The remainder of this chapter provides example command files for the Boston Housing
examples discussed in prior chapters and precise statements of each command's syntax
and options.
The following command file generates the main effects model discussed in the Construction
of Interactions section of Chapter 3:
INTERACTIONS = 1
to:
INTERACTIONS = 2.
Command Reference
ADDITIVE
Purpose
The ADDITIVE command specifies variables that can enter the model only as main effects
and not as interactions. A variable specified to enter the MARS model additively is not
allowed to enter into an interaction with any other variable. This applies to both ordinal
and categorical variables.
NOTE: Additive variables are still allowed to interact with missing value indicators. The
latter may be necessary for a variable to enter the model in any form.
Syntax
ADDITIVE
This command is mutually exclusive with LINEAR. In other words, a variable should be
listed on one command or the other, not both.
APPLY
Purpose
The APPLY command applies the MARS model to new data. Both a USE file and MODEL
(.mdl) file must be active and all variables in the MARS model must appear in the USE file.
Syntax
BOPTIONS
Purpose
Syntax
CATEGORY
Purpose
The CATEGORY command identifies which predictors are categorical. MARS will
determine the number of unique levels for you - each unique value found in the data
generates a valid level. Just list the names of the categorical variables; for example:
Syntax
CDF
Purpose
The CDF command evaluates one or more distribution, density, or inverse distribution
functions at specified values.
Syntax
To generate density values, use the syntax above with the DENSITY option:
DESCRIPTIVE
Purpose
The DESCRIPTIVE command specifies what statistics are computed and printed during the
initial pass through the input data. The statistics will not appear in the output unless the
LOPTIONS MEANS=YES command is issued. By default, the mean, N, SD
and sum of each variable will appear when LOPTIONS MEANS=YES is used. To indicate
that only the N, MIN and MAX should appear in descriptive statistics tables, use the
commands:
Syntax
ALL will turn on all statistics and MISSING will produce the fraction of observations with
missing data.
Remarks
Also BOPTIONS MISSING will produce a special report summarizing which variables are
missing most often.
ECHO
Purpose
Syntax
ECHO
ESTIMATE
Purpose
Reads the data, chooses the training and test samples (if any) and computes the MARS
model.
Syntax
ESTIMATE [ / ALL]
In most circumstances, MARS will report only the best model. If the ALL option is
specified, however, MARS will generate a full model at every backstep and produce a
model selector from which you can choose any model in the sequence. The process of
estimating a model at each backstep can increase compute time considerably.
If the SEQUENCE command is issued, the ALL option is implicitly in effect even if not
specified.
EXCLUDE
Purpose
Syntax
EXCLUDE <varlist>
<varlist> is a list of variables prohibited from entering the model. All other numeric variables
are candidates for entry into the model.
Examples
The following example excludes ID, SSN, and ATTITUDE from the candidate list of predictor
variables.
MODEL CHOICE
EXCLUDE ID, SSN, ATTITUDE
FORMAT
Purpose
The FORMAT command sets the precision to which most numerical output is printed.
Syntax
FORMAT=<number> [/UNDERFLOW]
Number is a whole number between 0 and 9, inclusive, representing the desired number of
digits to the right of the decimal point. The UNDERFLOW option prints very small numbers
in exponential notation, rather than rounding them off to zero. FORMAT with no arguments
sets the format to its default.
Remarks
Examples
Set the precision to 6, with numbers smaller than .000001 printed in exponential notation:
FORMAT=6/UNDERFLOW
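The effect of the precision setting and the UNDERFLOW option can be sketched outside MARS as follows. This is an illustration of the behavior described above, not MARS code: the function name is hypothetical, and the exact layout MARS uses for exponential notation is not specified in this guide.

```python
def format_number(value, digits, underflow=False):
    """Fixed-point with `digits` decimals; optional exponential for tiny values."""
    if not 0 <= digits <= 9:
        raise ValueError("digits must be a whole number between 0 and 9")
    smallest = 10.0 ** (-digits)  # smallest magnitude that survives rounding
    if underflow and value != 0 and abs(value) < smallest:
        # With UNDERFLOW, very small numbers are shown in exponential notation.
        return f"{value:.{digits}e}"
    # Otherwise the value is rounded to the requested number of decimals.
    return f"{value:.{digits}f}"


print(format_number(3.14159265, 6))
print(format_number(0.0000001234, 6))
print(format_number(0.0000001234, 6, underflow=True))
```

Without the option, a value such as 0.0000001234 prints as zero at six decimals; with it, the nonzero digits are preserved.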
HELP
Purpose
For the command-line (non-GUI) version of MARS, the HELP command provides brief, on-
line command descriptions and examples.
Syntax
HELP [<command_name>]
in which:
Remarks
MARS will use the descriptions contained in the files MARS.HLP when the HELP command
is used. These ASCII text files may be edited with a text editor.
Examples
HELP BOPTIONS
HISTOGRAM
Purpose
Syntax
The plot is normally a half screen high; the FULL and BIG options will increase it
to a full screen (24 lines) or a full page (60 lines).
TICKS and GRID add two kinds of horizontal and vertical gridding.
WEIGHTED requests plots weighted by the WEIGHT command variable.
NORMALIZED scales the vertical axis to 0 to 1 (or -1 to 1).
IDVAR
Purpose
The IDVAR command lists up to 50 variables that are to be included in the next dataset to
be SAVED. This can be any numerical variable from the file being USEd, and facilitates
merging of the SAVEd data with other files.
Syntax
If every record in your dataset has a unique identifier, say SSN, you could specify:
IDVAR SSN
SAVE WATER
The file WATER.SYS will include the variable SSN in addition to its normal contents.
INTERACT
Purpose
The INTERACT command specifies which variables are or are not allowed to interact in
the model.
Syntax
KEEP
Purpose
Syntax
KEEP <indep_list>
See the MODEL and EXCLUDE commands for other ways to restrict the list of candidate
predictor variables.
KNOT
Purpose
The KNOT command controls the number of degrees of freedom for (unrestricted) knot
optimization. This can be fixed, or automatically computed via cross validation or a test
set.
Syntax
LIMIT
Purpose
Syntax
LIMIT DATASET=<n>
DATASET limits the size of the sample used to build the MARS model. MARS extracts
the test sample from the first <n> records, thereby effectively reducing the size of the
learn sample. So, if
LIMIT DATASET=10000
then records 1 to 10000 from the USE dataset will be read in and processed.
LINEAR
Purpose
The LINEAR command specifies variables that, if they are part of the model, will enter the
model only linearly. This applies to ordinal variables only.
Syntax
LINEAR
This command is mutually exclusive with ADDITIVE. In other words, a variable should be
listed on one command or the other, not both. The LINEAR command only applies to ordinal
variables. Any CATEGORICAL variable included in the LINEAR command will be ignored.
The LINEAR command is used to identify those variables which will not be transformed by
MARS via knot selection. MARS implements this by selecting a single knot value of
0.000 for such variables. A variable entering the model linearly (untransformed) can
participate in interactions with other variables.
LOPTIONS
Purpose
Syntax
LOPTIONS MEANS, TIMING (turn MEANS printing and CPU timing on)
LOPTIONS MEANS=NO (turn MEANS printing off)
MODEL
Purpose
Syntax
in which <depvar> is the dependent variable and <indep_list> is an optional list of potential
predictor variables. If no <indep_list> is specified, all numeric variables are considered
(unless KEEP or EXCLUDE commands are used).
For example:
See the KEEP and EXCLUDE commands for another way to restrict the list of candidate
predictor variables.
The BINARY option indicates that MARS should consider the target variable binary in
reporting dialogs. Typically this option is used if the target variable takes on values 0/1 or
1/2.
In addition to what the BINARY option does, the BINARY/TABLE option instructs MARS
to generate a 100-point table summarizing SENSITIVITY and SPECIFICITY as functions
of a moving threshold for the target.
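A table of sensitivity and specificity over a moving threshold can be sketched as follows. This is a minimal illustration of the idea, not the MARS implementation: the evenly spaced thresholds, the "predicted at or above the threshold counts as positive" rule, and the function name are all assumptions.

```python
def threshold_table(actual, predicted, n_points=100):
    """Return (threshold, sensitivity, specificity) rows for a 0/1 target."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    lo, hi = min(predicted), max(predicted)
    positives = sum(1 for a in actual if a == 1)
    negatives = len(actual) - positives
    rows = []
    for i in range(n_points):
        # Evenly spaced thresholds across the range of predictions (assumed).
        t = lo + (hi - lo) * i / (n_points - 1)
        true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p >= t)
        true_neg = sum(1 for a, p in zip(actual, predicted) if a == 0 and p < t)
        sens = true_pos / positives if positives else float("nan")
        spec = true_neg / negatives if negatives else float("nan")
        rows.append((t, sens, spec))
    return rows
```

As the threshold rises, sensitivity can only fall and specificity can only rise, so the table traces the trade-off between the two.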
NAMES
Purpose
Syntax
NAMES
NEW
Purpose
The NEW command resets all options, as if MARS had been terminated and restarted. It
also clears out any data transformation (BASIC) statements that are in effect.
Syntax
NEW
OPTIONS
Purpose
The OPTIONS command displays current environment settings and options, input and
output devices, etc.
Syntax
OPTIONS
Remarks
The OPTIONS command will not list the settings of the LOPTIONS or BOPTIONS
commands.
OUTPUT
Purpose
The OUTPUT command directs output from MARS to the printer, the screen, or a file.
Syntax
OUTPUT [*|<filename>]
Where <filename> is given the default extension .DAT in the OUTPUT command. This
extension default may be overridden by placing quotes around the filename and explicitly
giving a file extension.
The default is to send the output to the console, OUTPUT *. Output will still appear on the
screen as it is being sent to a file.
Examples
OUTPUT *
OUTPUT RESULTS
OUTPUT "RESULTS.PRN"
PAGE
Purpose
The PAGE command lets you choose wide or narrow output format.
Syntax
WIDE forces output to fit in 132 columns. NARROW forces output to fit in 80 columns.
PRINT
Purpose
Syntax
PRINT=LONG|SHORT
LONG causes a small amount of additional output to be produced if cross validation
is used. Also, when OLS coefficients are estimated (with BOPTIONS OLS=YES or
BOPTIONS OLS=ONLY) then PRINT=LONG will cause the covariance matrix of the
coefficients to be printed. Often this is not needed; therefore, the default is PRINT=SHORT.
Remarks
QUIT
Purpose
The QUIT command, when typed from MARS, terminates MARS and returns to the operating
system. When typed from ESTIMATE or APPLY, it ends the ESTIMATE or APPLY phase and
returns to MARS.
Syntax
QUIT
REGRESSION
Purpose
Syntax
REGRESSION = OLS
MARS carries out OLS regression, using Salford Systems' 2SLS algorithms after a MARS
regression. OLS results are presented for regressing the final basis functions against the
original dependent variable along with the standard R-squared statistics.
A future release of MARS will also carry out logistic regression using code from Salford
Systems' LOGIT program.
REM
Purpose
The REM command allows comments to be inserted in the command stream. It causes
no action by the program.
Syntax
REM <text>
RETRIEVE
Purpose
When applying data to a MARS model, the RETRIEVE command specifies which file
stores the model information.
Syntax
RETRIEVE <filename>
USE LEARN1
MODEL TARGET
STORE MYMODEL
ESTIMATE
USE VALIDAT3
SAVE PREDICT
RETRIEVE MYMODEL
APPLY
SEED
Purpose
The SEED command allows you to set the random number seed and to specify whether
the seed is to remain in effect after the MARS model is computed. Normally the seed is
reset to 987654321 on start up and after each MARS model is computed (ESTIMATE) or
applied to new data (APPLY).
Syntax
Legal values include integers between 1 and 2147483647. If RETAIN is not specified, the
seed will be reset to 987654321 after the current model is completed.
If RETAIN is specified, the seed will keep its latest value after the model is computed.
SELECT
Purpose
The SELECT command specifies selection criteria for computing a model based on a
subgroup of cases.
Syntax
SELECT <var1> <rel> <# ($)> [, <var2> <rel> <# ($)> <...>]
in which
<rel> is a logical relation: =, <>, <, >, <=, =<, >=, =>,
Remarks
SELECT may be based on any variable appearing in the data set, whether or not that
variable is involved in the model.
SELECT may not be based on any variables defined on the fly by internal Data
Transformation statements.
Examples
SEQUENCE
Purpose
The SEQUENCE command specifies a selector file into which the model and selector
information is to be stored. Later, perhaps in another MARS session, the selector file can
be recalled and the selector information viewed in the GUI.
Syntax
SEQUENCE <filename>
Remarks
Using the SEQUENCE command implies that the model at each backstep is to be estimated,
as if the ESTIMATE / ALL command had been issued. This can significantly
increase the time required for the model to run.
STORE
Purpose
The STORE command creates files in which MARS model information is saved for later
viewing, printing, or application to new data.
Syntax
STORE <filename>
SUBMIT
Purpose
The SUBMIT command specifies a file from which command input is to be read.
Syntax
SUBMIT <filename>
Remarks
Filename is given the extension .CMD in the SUBMIT command. This extension default
may be overridden by placing quotes around the filename and explicitly giving a file
extension.
USE
Purpose
The USE command opens a SYSTAT format file for analysis, and lists the variable names
in the file.
Syntax
USE <filename>
Remarks
Filename is given the extension .SYS in the USE command. This extension default may
be overridden by placing quotes around the filename and explicitly giving a file extension.
You will need to enclose lowercase file names in quotes on UNIX platforms. Within
MARS, all non-quoted names are read as uppercase.
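The naming rules above (default extension, quoting, uppercase conversion) can be summarized in a short sketch. The function is hypothetical and only mirrors the rules as stated; how MARS treats an unquoted name that already carries an extension is not described here, so that case is not handled.

```python
def resolve_filename(token, default_ext=".SYS"):
    """Resolve a file name token following the quoting rules described above."""
    is_quoted = len(token) >= 2 and token.startswith('"') and token.endswith('"')
    if is_quoted:
        # Quoted: case and extension are kept exactly as typed.
        return token[1:-1]
    # Unquoted: read as uppercase and given the command's default extension.
    return token.upper() + default_ext


print(resolve_filename("water"))
print(resolve_filename('"results.prn"'))
print(resolve_filename("RESULTS", default_ext=".DAT"))
```

On a case-sensitive file system such as UNIX, the first call shows why a lowercase file must be quoted: the unquoted name is looked up in uppercase.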
WEIGHT
Purpose
Syntax
WEIGHT=<variable>
in which variable is a variable present in the USE dataset. The WEIGHT variable can
contain any non-negative real values.
XYPLOT
Purpose
The XYPLOT command produces 2-D scatter plots, plotting one or more y variables
against an x variable in separate graphs.
Syntax
The plot is normally a half screen high: the FULL and BIG options will increase it
to a full screen (24 lines) or a full page (60 lines).
TICKS and GRID add two kinds of horizontal and vertical gridding.
WEIGHTED requests plots weighted by the WEIGHT command variable.
NORMALIZED scales the vertical axis to 0 to 1 (or -1 to 1).
The BASIC transformation language allows you to modify your input files on the fly while
you are in an analysis module and to save permanent copies of your changed data in
ASCII. We expect users will find that they can accomplish almost any required data
manipulation involving a single data file.
Although this integrated version of BASIC is much more powerful than the simple variable
transformation functions sometimes found in other statistical procedures, it is not meant
to be a replacement for more comprehensive data steps found in general use statistics
packages. At present, integrated BASIC does not permit the merging or appending of
multiple files, nor does it allow processing across observations. In Salford Systems'
statistical analysis packages, the programming work space for BASIC is limited and is
intended for on-the-fly data modifications of 20 to 40 lines of code (though custom large
work space versions will accommodate larger BASIC programs). For more complex or
extensive data manipulation, we recommend you use the large work space for BASIC in
ASCII or your preferred database management software.
The remainder of this appendix describes what you can do with BASIC and provides
simple examples to get you started.
126 Appendix I: BASIC Programming Language
Getting Started
Your BASIC program will consist of a series of statements which all begin with a % sign.
These statements could comprise simple assignment statements that define new variables,
conditional statements that delete selected cases, iterative loops that repeatedly execute
a block of statements, and complex programs with the flow control provided by GOTO
statements and line numbers. Thus, somewhere before a HOT! Command such as
ESTIMATE or RUN in a Salford module, you might type:
The % symbol appears only once at the beginning of each line of BASIC code; it should
not be repeated anywhere else on the line. You can leave a space after the % symbol or
you can start typing immediately; BASIC will accept your code either way.
Our programming language uses standard statements found in many dialects of BASIC.
These include:
LET
Assigns a value to a variable. The form of the statement is:
IF...THEN
Evaluates a condition, and if it is true, executes the statement following the THEN. The
form is:
FOR...NEXT
Allows for the execution of the statements between the FOR statement and a subsequent
NEXT statement as a block. The form of the simple FOR statement is:
% FOR
% statements
% NEXT
For example, you might execute a block of statements only if a condition is true, as in
When an index variable is specified on the FOR statement, the statements between the
FOR and NEXT statements are looped through repeatedly while the index variable remains
between its lower and upper bounds:
DIM
Creates an array of subscripted variables. For example, a set of 5 scores could be set up
with:
% DIM SCORE(5)
The size of the array must be specified with a literal integer up to a maximum size of 99;
variable names may not be used. You can use more than one DIM statement, but be
careful not to create so many large arrays that you exceed the maximum number of
variables allowed (currently 8019).
DELETE
Deletes the current case from the data set.
OPERATORS
The table below lists the operators that can be used in BASIC statement expressions.
Operators are evaluated in the order they are listed in each row with one exception: a
minus sign before a number (making it a negative number) is evaluated after exponentiation
and before multiplication or division. The "<>" is the "not equal" operator.
Numeric operators () ^ * / + -
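The rule for a leading minus sign is easiest to see with a small case. Python happens to follow the same convention for these operators (exponentiation is applied before a leading minus, which is applied before multiplication), so the sketch below uses Python as an analogy for how an expression such as -2^2 would be grouped under the rule stated above. It is not output from BASIC.

```python
# Exponentiation is applied before the leading minus sign: -(2 ** 2)
a = -2 ** 2

# The minus sign is applied before multiplication: (-2) * 3
b = -2 * 3

# Parentheses force the negative number itself to be squared.
c = (-2) ** 2

print(a, b, c)
```

When in doubt, parentheses make the intended grouping explicit.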
Built-in
Variables Definition Values
FUNCTION(variable, variable, ...)
Function   Definition                 Examples
AVG        arithmetic mean            % LET XMEAN = AVG(X1,X2,X3)
MAX        maximum                    % LET BEST=MAX(Y1,Y2,Y3,Y4,Y5)
MIN        minimum
MIS        number of missing values
STD        standard deviation
SUM        summation
Note: These statistical functions will automatically adjust for the presence of missing
values. Thus, if X1 is missing for a case, AVG(X1,X2,X3) is equal to (X2+X3)/2.
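The missing-value adjustment can be sketched as follows. Here `None` stands in for the system missing value, and the result returned when every argument is missing is an assumption; the function names simply echo AVG and MIS.

```python
MISSING = None  # stand-in for the system missing value


def avg(*values):
    """Mean of the non-missing arguments, as AVG is described above."""
    present = [v for v in values if v is not MISSING]
    if not present:
        return MISSING  # assumed behavior when every argument is missing
    return sum(present) / len(present)


def mis(*values):
    """Number of missing arguments, as MIS is described above."""
    return sum(1 for v in values if v is MISSING)


print(avg(MISSING, 4.0, 6.0))
print(mis(MISSING, 4.0, 6.0))
```

With X1 missing, the mean is taken over the two remaining values, matching (X2+X3)/2.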
Integrated BASIC also includes a collection of probability functions that can be used to
determine probabilities and confidence level critical values, and to generate random numbers.
The following table shows the distributions and any parameters that are needed to obtain
values for either the random draw, the cumulative distribution, the density function, or the
inverse density function.
These functions are invoked with either 0, 1, or 2 arguments as indicated in the table
above, and return a single number, which is either a random draw, a cumulative probability,
a probability density, or a critical value for the distribution.
We will illustrate the use of these functions with the chi-square distribution. To generate
10 random draws from a chi-square distribution with 35 degrees of freedom for each case
in your data set:
% DIM CHISQ(10)
% FOR I= 1 TO 10
% LET CHISQ(I)=XRN(35)
% NEXT
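For readers who want to check such draws outside MARS, the same idea can be sketched in Python. The sketch relies on the fact that a chi-square distribution with df degrees of freedom is a gamma distribution with shape df/2 and scale 2; the function name and the seed are hypothetical, and the values produced will not match those generated by XRN.

```python
import random


def chi_square_draw(df, rng):
    """One chi-square draw, using chi-square(df) = Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2.0, 2.0)


rng = random.Random(12345)  # fixed seed so the sketch is repeatable
chisq = [chi_square_draw(35, rng) for _ in range(10)]
print(chisq)
```

Over many draws the average should settle near the degrees of freedom, here 35.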
To evaluate the probability that a chi-square variable with 20 degrees of freedom exceeds
27.5:
The chi-square density for the same chi-square value is obtained with:
Finally, the 5% point of the chi-squared distribution with 20 degrees of freedom is calculated
with:
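As an independent cross-check outside MARS, the upper-tail probability and the density from the examples above can be computed directly. The sketch uses the closed form for the upper tail that holds when the degrees of freedom are even (as with 20 here) and the standard chi-square density formula; the function names are hypothetical and this is not the BASIC syntax.

```python
import math


def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # exp(-x/2) * sum over j = 0 .. df/2 - 1 of (x/2)**j / j!
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi_square_density(x, df):
    """Chi-square probability density at x > 0."""
    k = df / 2.0
    log_pdf = (k - 1.0) * math.log(x) - x / 2.0 - k * math.log(2.0) - math.lgamma(k)
    return math.exp(log_pdf)


p_exceed = chi_square_upper_tail(27.5, 20)
density = chi_square_density(27.5, 20)
print(p_exceed, density)
```

For odd degrees of freedom the closed form does not apply, and a numerical routine for the incomplete gamma function would be needed instead.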
Missing Values
The system missing value is stored internally as the largest negative number allowed.
Missing values in BASIC programs and printed output are represented with a period or dot
("."), and missing values can be generated and their values tested using standard expressions.
Missing values are propagated so that most expressions involving variables that have
missing values will themselves yield missing values.
One important fact to note: because the missing value is technically a very large negative
number, the expression X < 0 will evaluate as true if X is missing.
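The consequence of storing missing values as a very large negative number can be sketched as follows. The sentinel below is a hypothetical stand-in (the actual internal value is not given in this guide), and the helper names are illustrative.

```python
import sys

# Hypothetical stand-in for the internal missing-value code.
MISSING = -sys.float_info.max


def is_missing(x):
    return x == MISSING


def is_negative(x):
    """Test X < 0 only for cases where X is actually present."""
    return (not is_missing(x)) and x < 0


values = [5.0, -2.0, MISSING]
naive = [x < 0 for x in values]              # the missing case counts as negative
guarded = [is_negative(x) for x in values]   # the missing case is excluded
print(naive, guarded)
```

Any comparison against zero should therefore be paired with an explicit test for the missing value.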
BASIC statements included in your command stream are executed when a HOT! Command
such as ESTIMATE, APPLY, or RUN is encountered; thus, they are processed before any
estimation or model building is attempted. This means that any new variables created in
BASIC are available for use in MODEL and KEEP statements, and any cases that are
DELETEd via BASIC will not be used in the analysis.
More Examples
It is easy to create new variables or change old variables using BASIC. The simplest
statements create a new variable from other variables already in the data set. For example:
BASIC allows for easy construction of Boolean variables, which take a value of one if true
and zero if false. In the following statement, the variable XYZ would have a value of 1 if any
condition on the right-hand side is true, and 0 otherwise.
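The logic of such an indicator can be sketched in Python as follows. This is an illustration of the one-if-any-condition-holds rule only; the three conditions and the input values are hypothetical and are not taken from the guide.

```python
def indicator(*conditions):
    """Return 1 if any condition is true, and 0 otherwise."""
    return 1 if any(conditions) else 0


# Hypothetical inputs for one case.
x, y, z = 12, 5, 3

# XYZ is 1 because the first condition holds, even though the others do not.
xyz = indicator(x > 10, y == 0, z < 0)
```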
Suppose your data set contains variables for gender and age, and you want to create a
categorical variable with levels for male-senior, female-senior, male-non-senior, female-
non-senior. You might type:
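The recoding can be sketched in Python as below. The level numbers 1 to 4, the age cutoff of 65, and the function name are all assumptions chosen for illustration; the guide does not define them.

```python
SENIOR_AGE = 65  # assumed cutoff; the guide does not define "senior"


def gender_age_level(is_male, age):
    """Map gender and age to one of four levels.

    1 = male-senior, 2 = female-senior,
    3 = male-non-senior, 4 = female-non-senior.
    """
    senior = age >= SENIOR_AGE
    if senior:
        return 1 if is_male else 2
    return 3 if is_male else 4


levels = [gender_age_level(True, 70), gender_age_level(False, 30)]
```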
If the measurement of several variables changed in the middle of the data period, conversions
can be easily made with the following:
If you would like to create powers of a variable (square, cube, etc.) as independent
variables in a polynomial regression, you could type something like:
% DIM AGEPWR(5)
% FOR I = 1 TO 5
% LET AGEPWR(I) = AGE^I
% NEXT
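The same computation, written as a Python sketch for comparison. The value of age is a hypothetical input, and the list is 0-based whereas the BASIC array is referenced from AGEPWR(1).

```python
age = 3  # hypothetical value of AGE for one case

# age_pwr[0] holds AGE^1, age_pwr[1] holds AGE^2, and so on up to AGE^5.
age_pwr = [age ** i for i in range(1, 6)]
```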
Because you can construct complex Boolean expressions with BASIC, using programming
logic combined with the DELETE statement gives you far more control than is
available with the simple SELECT statement. For example:
It is often useful to draw a random sample from a data set to fit a problem into memory or
to speed up a preliminary analysis. By using the uniform random number generator in
BASIC, this is easily accomplished with a one-line statement:
The data set can be divided into an analysis portion and a separate test portion distinguished
by the variable TEST:
This sets TEST equal to 1 in approximately 40% of all cases and 0 in all other cases. The
following draws a stratified random sample taking 10% of the first stratum and 50% of all
other strata:
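Both sampling ideas rest on comparing a uniform random draw with a target rate. The Python sketch below illustrates that logic only; the function names, the seed, and the convention that the first stratum is coded 1 are assumptions, not MARS BASIC.

```python
import random


def assign_test(rng, share=0.40):
    """Return 1 for roughly `share` of all cases and 0 otherwise."""
    return 1 if rng.random() < share else 0


def keep_in_sample(stratum, rng):
    """Keep about 10% of stratum 1 and about 50% of every other stratum."""
    rate = 0.10 if stratum == 1 else 0.50
    return rng.random() < rate


rng = random.Random(2001)  # arbitrary seed so the sketch is repeatable

test_flags = [assign_test(rng) for _ in range(10000)]
kept_first = sum(keep_in_sample(1, rng) for _ in range(10000))
kept_other = sum(keep_in_sample(2, rng) for _ in range(10000))
```

Because each case is drawn independently, the realized shares are close to, but not exactly, the target rates.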
IF...THEN Statement
Purpose
Evaluates a condition and if it is true executes the statement following the THEN.
The form is:
An IF...THEN may be combined with an ELSE statement in two ways. First, the ELSE
may be simply used to provide an alternative statement when the condition is not true:
Examples
LET Statement
Purpose
Syntax
Examples
ELSE Statement
Purpose
Syntax
The statement2 can be another IF...THEN condition, thus allowing IF...THEN statements
to be linked into more complicated structures. For more information, see the section for
IF...THEN.
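The linked structure described here corresponds to an if / else-if / else chain in other languages. A minimal Python sketch follows; the variable, the cutoffs, and the band numbers are hypothetical.

```python
def income_band(income):
    """Classify income into three bands with a chained condition."""
    if income < 20000:
        return 1
    elif income < 50000:   # reached only when the first test fails
        return 2
    else:                  # reached only when both tests fail
        return 3


bands = [income_band(v) for v in (15000, 20000, 75000)]
```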
Examples
DIM Statement
Purpose
Syntax
% DIM var(n)
where n is a literal integer. Variables of the array are then referenced by variable name and
subscript, such as var(1), var(2), etc.
In an expression, the subscript can be another variable, allowing these array variables to
be used in FOR...NEXT loop processing. See the section on the FOR...NEXT statement
for more information.
Examples
% DIM QUARTER(4)
% DIM MONTH(12)
% DIM REGION(9)
FOR...NEXT Statement
Purpose
Allows for the processing of steps between the FOR statement and an associated NEXT
statement as a block. When an optional index variable is specified, the statements are
looped through repetitively while the value of the index variable is in a specified range.
Syntax
The index variable and limits are optional but, if used, they are of the form
x = y TO z [STEP=s]
Remarks
Nested FOR...NEXT loops are not allowed, and a GOTO that is external to the loop may
not refer to a line within the FOR...NEXT loop. However, GOTOs may be used to leave a
FOR...NEXT loop or to jump from one line in the loop to another within the same loop.
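To make the index range concrete, the Python sketch below lists the values an index would take for a given start, limit, and step. It assumes that the upper limit is inclusive, as in classic BASIC, and that the step is positive; the function name is hypothetical.

```python
def index_values(start, stop, step=1):
    """Values visited by an index running from start TO stop by step.

    Assumes an inclusive upper limit and a positive step.
    """
    if step <= 0:
        raise ValueError("this sketch handles positive steps only")
    values = []
    x = start
    while x <= stop:
        values.append(x)
        x += step
    return values


default_step = index_values(1, 10)      # 1, 2, ..., 10
every_third = index_values(1, 10, 3)    # 1, 4, 7, 10
```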
Examples
DELETE Statement
Purpose
Syntax
% DELETE
% IF condition THEN DELETE
Examples
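Conceptually, a conditional DELETE acts as a filter: every case for which the condition is true is dropped, and the rest are kept. The Python sketch below illustrates this with a compound condition; the variable names, values, and cutoffs are hypothetical.

```python
def apply_delete(cases, delete_if):
    """Return the cases that survive, dropping those where delete_if is true."""
    return [case for case in cases if not delete_if(case)]


cases = [
    {"AGE": 34, "INCOME": 52000},
    {"AGE": 17, "INCOME": 0},
    {"AGE": 70, "INCOME": 18000},
]

# Drop minors, and drop seniors with low income.
kept = apply_delete(
    cases,
    lambda c: c["AGE"] < 18 or (c["AGE"] > 65 and c["INCOME"] < 20000),
)
```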
GOTO Statement
Purpose
Syntax
% GOTO ##
Remarks
This is often used with an IF...THEN statement to allow certain statements to be
executed only if a condition is met.
If line numbers are used in a BASIC program, all lines of the program should have a line
number. Line numbers must be positive integers less than 32000.
Examples
% 10 GOTO 20
% 20 STOP

% 10 IF X=. THEN GOTO 40
% 20 LET Z=X*2
% 30 GOTO 50
% 40 LET Z=0
% 50 STOP
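The second program above sets Z to 0 when X is missing and to twice X otherwise. For comparison, the same branching is sketched below in Python without jumps; None is used as an assumed stand-in for the missing value, and the function name is hypothetical.

```python
MISSING = None  # assumed stand-in for the BASIC missing value "."


def compute_z(x):
    """Return 0 when x is missing, and 2 * x otherwise."""
    if x is MISSING:
        return 0       # corresponds to line 40
    return x * 2       # corresponds to line 20


z_values = [compute_z(v) for v in (MISSING, 4, -1.5)]
```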
STOP Statement
Purpose
Stops the processing of the BASIC program on the current observation. The observation is
kept but any BASIC statements following the STOP are not executed.
Syntax
% STOP
Examples
MARS is referenced in over 120 scientific publications, dating back to 1994. A list can be
downloaded from our website at http://www.salford-systems.com/MARSCITE.PDF.
Friedman's articles are challenging classics but definitely worth the effort.
Friedman, J.H. (1988), Fitting Functions to Noisy Data in High Dimensions. Proc.,
Twentieth Symposium on the Interface, Wegman, Gantz and Miller, eds., American
Statistical Association, Alexandria, VA, 3-43.
Friedman, J.H. (1991a), Multivariate Adaptive Regression Splines (with discussion), Annals
of Statistics, 19, 1-141 (March).
Friedman, J.H. and Silverman, B.W. (1989), Flexible Parsimonious Smoothing and
Additive Modeling (with discussion), Technometrics, 31, 3-39 (February).
DeVeaux et al. provide examples in which MARS outperforms neural networks:
DeVeaux R.D., Psichogios D.C., and Ungar L.H. (1993), A Comparison of Two
Nonparametric Estimation Schemes: MARS and Neural Networks, Computers &
Chemical Engineering, 17, 8.
146 Appendix II: Further Reading and References
Additional References
Belsley, D. A., E. Kuh, and R. Welsch, Regression Diagnostics, New York: Wiley,
1980.
Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone, Classification and
Regression Trees, Pacific Grove: Wadsworth, 1984.
Harrison, D. and D. Rubinfeld, Hedonic Housing Prices and Demand for Clean Air,
Journal of Environmental Economics and Management, 5, 81-102, 1978.
Scott, David W., Multivariate Density Estimation, New York: Wiley, 1992.
Steinberg, Dan and Phillip Colla, CART: Tree-Structured Non-parametric Data
Analysis, San Diego, CA: Salford Systems, 1995.
Please visit our web site for updates on new publications about MARS.
http://www.salford-systems.com
Index
Symbols
.SLC 60
2-D, 3-D
  Curves
  Plots 55
A
About CART for Windows 81
ADDITIVE 87
Advanced Programming Features 136
Allowing Specific Interactions 42
ANOVA 46, 47
APPLY 88
Applying Data to a MARS Model
  scoring 71
Audit Trail 83
Automatic Report 78, 79
B
BASIC programming language 83, 125
Basis Function 45
Basis Function and Model Code 45
Basis Functions
  BF 50, 57
batch command file 83
binary
  gains chart 52
binary target variables
  dependent variable
  target variable 63
BOPTIONS 89
C
CART Report window 61
CATEGORY 90
CDF 91
Chapter 1: Introduction 14
classic output 77
Command Log 60, 84
Command Log window 83
Command-Line Syntax and Options 85
Command-Line Mode 84
Cost of Omission 48
Creating Batch Files
  Submitting Batch Files 84
custom reports 77
cut 80
D
data anomalies 80
Data Viewer 80
DBMS/COPY
  linking DBMS/COPY 6
DBMS/Copy 80
DBMS/Copy translators 80
default options 78
DF
  dof 67
DELETE 128, 142
DESCRIPTIVE 92
DIM 128, 141
direct variables 47
Directories Dialog 70
Disallowing Specific Interactions 42
E
ECHO 92
Effective parameters 47
ELSE 126, 139
ESTIMATE 94
examples 73
exporting
  rules 50
Exporting and Printing 2-D and 3-D Plots 57
Exporting Graphs 57
F
File menu 61
Filtering the Data Set 135
FOR...NEXT 127, 140
Forbidding Transformations of Selected Variables 40
Forcing Variables into the Model 40
frequently-asked questions 81, 97
G
gains chart 51
GCV 47, 58
GCV R-Square 47
Getting Started 61
  Open 61
GOTO Statement 143
Graphical User Interface
  GUI
  Tour 61
H
HELP 97
help 81
Help menu 97
HISTOGRAM 98
hybrid CART-MARS model 43
I
IDVAR 99
IF...THEN 126, 137
Installation Procedure 5
Installing and Starting CART 5
INTERACT 100
interactions 56
Interactions Dialog 64
K
KEEP 101
KNOT 102
L
LET 126, 138
LIMIT 103
LINEAR 104
LOPTIONS 105
M
MARS Search Intensity 43
Maximum Basis Functions 40, 67
Maximum Interactions 67
Microsoft Word 80
Minimum Number of Observations Between Knots 41
Minimum Span 41
Missing Values 133
MODEL 106
Model Code 45. See also Basis Function
N
Naïve MSE 47
Naïve-Adjusted 47
NAMES 107
NEW 108
Notepad 80
Number of Records to Process 67
O
observations between knots 67
Open 61
Open File icon 61
Open Tree Navigator dialog box 60
OPERATORS 128
OPTIONS 109
options 69
Options and Limits Dialog 65
OUTPUT 110