For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact

SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Graphs powered by SPSS Inc.'s nViZn(TM) advanced visualization technology (http://www.spss.com/sm/nvizn), Patent No. 7,023,453.

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

Project phases are based on the CRISP-DM process model. Copyright 1997-2003 by CRISP-DM Consortium (http://www.crisp-dm.org). Microsoft and Windows are registered trademarks of Microsoft Corporation. IBM, DB2, and Intelligent Miner are trademarks of IBM Corporation in the U.S.A. and/or other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. DataDirect and SequeLink are registered trademarks of DataDirect Technologies. Copyright 2001-2005 by JGoodies. Founder: Karsten Lentzsch. All rights reserved.

Predictive Modeling with Clementine
Copyright 2007 by SPSS Inc. All rights reserved.
Printed in the United States of America.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
8.8 ARIMA ............................................................................................................................ 8-13
8.9 DATA REQUIREMENTS ...................................................................................................... 8-15
8.10 AUTOMATIC FORECASTING IN A PRODUCTION SETTING................................................ 8-16
8.11 FORECASTING BROADBAND USAGE IN SEVERAL MARKETS.......................................... 8-17
8.12 APPLYING MODELS TO SEVERAL SERIES ....................................................................... 8-39
SUMMARY EXERCISES ............................................................................................................ 8-45
CHAPTER 10: FINDING THE BEST MODEL FOR BINARY OUTCOMES ............................................................................................ 10-1
10.1 INTRODUCTION ............................................................................................................... 10-1
SUMMARY EXERCISES .......................................................................................................... 10-13
Objectives
In this chapter, after a brief discussion regarding the cleaning of data, we will introduce several techniques that may be useful in preparing data for modeling.
Data
In this chapter we use a data set from a leading telecommunications company, churn.txt. The file contains records for 1477 of the company's customers who have at one time purchased a mobile phone. It includes such information as length of time spent on local, long distance, and international calls, the type of billing scheme, and a variety of basic demographics, such as age and gender. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. We want to use data mining to understand what factors influence whether an individual remains a customer or leaves for an alternative company. The data are typical of what is often referred to as a churn example (hence the file name).
Figure 1.1 Server Login Dialog in Clementine
1.1 Introduction
Preparing data for modeling can be a lengthy but essential and extremely worthwhile task. If data are not cleaned and modified/transformed as necessary, it is doubtful that the models you build will be successful. In this chapter we will introduce a number of techniques that enable such data preparation. We will begin with a brief discussion concerning the handling of blanks and cleaning of data, although this is covered in greater detail in the Introduction to Clementine and Data Mining and Preparing Data for Data Mining courses. Following this, we will introduce the concept of data balancing and how it is achieved within Clementine. A number of data transformations will also be introduced as possible solutions to skewed data. We will discuss how to create training and validation samples of the data automatically with the use of data partitioning.
Once the condition of the data has been assessed, the next step is to attempt to improve the overall quality. This can be achieved in a variety of ways:

- Using the Generate menu from the Data Audit node's report, a Select node that removes records with blank fields can be created automatically (particularly relevant for a model's output field).
- Fields with a high proportion of blank records can be filtered out using the Generate menu from the Data Audit node's report to create a Filter node.
- Blanks can be replaced with appropriate values using the Filler node. Possible appropriate values within a numeric field can range from the average, mode, or median, to a value predicted using one of the available modeling techniques. In addition, missing values can be imputed by using the Data Audit node.
- The Type node and the Types tab in source nodes provide an automatic checking process that examines values within a field to determine whether they comply with the current type and bounds settings. If they do not, fields with out-of-bound values can either be modified, or those records removed so they do not pass downstream.
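The Filler-node idea of replacing blanks with a typical value can be sketched outside Clementine. This Python fragment is a simplified analogue; the records and the LOCAL values are hypothetical, not taken from churn.txt:

```python
def fill_blanks(records, field, strategy="median"):
    """Replace missing values (None) in one field with the median or mean
    of the non-missing values -- a simplified analogue of Clementine's
    Filler node."""
    present = sorted(r[field] for r in records if r[field] is not None)
    if strategy == "median":
        mid = len(present) // 2
        value = (present[mid] if len(present) % 2
                 else (present[mid - 1] + present[mid]) / 2)
    else:  # mean
        value = sum(present) / len(present)
    for r in records:
        if r[field] is None:
            r[field] = value
    return records

# Hypothetical records, one with a blank LOCAL value
data = [{"LOCAL": 10.0}, {"LOCAL": None}, {"LOCAL": 30.0}]
data = fill_blanks(data, "LOCAL")
```

A model-based prediction, as the text mentions, would replace the median with the output of a fitted model for that field.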
After these actions are completed, the data will have been cleaned of blanks and out-of-bounds values. It may also be necessary to use the Distinct node to remove any duplicate records. Once the data file has been cleaned, you can then begin to modify it further so that it is suitable for the modeling technique(s) you plan to use.
Figure 1.2 Distribution of the CHURNED Field
The proportions of the three groups are rather unequal and data balancing may be useful when trying to predict this field using a neural network. This output can be used directly to create a Balance node, but first we must decide whether we wish to reduce or boost the current data. Reducing the data will drop over 73% of the records, but boosting the data will involve duplicating the involuntary leavers from 132 records to over 830. Neither of these methods is ideal but in this case we choose to reduce the data to eliminate the magnification of errors.
Click Generate…Balance Node (reduce)
Close the Distribution plot window
Figure 1.3 Distribution of the CHURNED Field after Balancing the Data
When balancing data it is advisable to enable a data cache on the Balance node, to fix the selected sample. Because the Balance node randomly reduces or boosts the data, a different sample will be selected each time the data are passed through the node. At this point the data are balanced and can be passed into a modeling node, such as the Neural Net node. Once the model has been built, it is important that the testing and assessment of the model be done on the unbalanced data.
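Reduce-style balancing amounts to randomly downsampling the larger categories until they match the smallest one. The Python sketch below illustrates the idea; the Invol count of 132 comes from the text, while the split of the other two groups is assumed for illustration, and Clementine's actual sampling mechanics may differ:

```python
import random

def balance_by_reducing(records, field, seed=42):
    """Downsample every category to the size of the rarest one --
    a rough analogue of a generated Balance node in 'reduce' mode."""
    rng = random.Random(seed)  # a fixed seed plays the role of caching the sample
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))
    return balanced

# 1477 records total; only the Invol count of 132 is given in the text,
# the Current/Vol split is assumed
data = ([{"CHURNED": "Current"}] * 1000
        + [{"CHURNED": "Vol"}] * 345
        + [{"CHURNED": "Invol"}] * 132)
balanced = balance_by_reducing(data, "CHURNED")
```

Reducing keeps 3 × 132 = 396 of 1477 records, i.e. it drops about 73% of the data, matching the figure quoted in the text.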
Close the Distribution plot window
This distribution has a strong positive skew. This condition may lead to poor performance of a neural network predicting LOCAL, since there is less information (fewer records) on those individuals with higher local usage. What we need is a transformation that inverts the original skew, that is, skews it to the left. If we get the transformation correct, the data will become relatively balanced. When you transform data you normally try to create a normal distribution or a uniform (flat) distribution. For our problem, the distribution of LOCAL closely follows that of a negative exponential, e^(-x), so the inverse is a logarithmic function. We will therefore try a transformation of the form ln(x + a), where a is a constant and x is the field to be transformed. We need to add a small constant because some of the records have values of 0 for LOCAL, and the log of 0 is undefined. Typically the value of a would be the smallest actual positive value in the data.
Close the Histogram plot window
Add a Derive node and connect the Type node to it
Edit the Derive node and set the Derive Field name to LOGLOCAL
Select Formula in the Derive As list
Enter log(LOCAL + 3) in the Formula text box (or use the Expression Builder)
Click OK
Connect a Histogram node to the Derive node
Edit the Histogram node and set the Field to LOGLOCAL
Click the Execute button
Figure 1.6 Histogram of the Transformed LOCAL Field Using a Logarithmic Function
Although this distribution is not perfectly normal it is a great improvement on the distribution of the original field.
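The effect of the ln(x + a) transformation can be demonstrated on synthetic, exponentially distributed "usage minutes". This Python sketch stands in for the Derive node, with a defaulting to the smallest positive value as suggested above:

```python
import math
import random

def skewness(xs):
    """Sample skewness: positive for a long right tail."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

def log_transform(values, a=None):
    """ln(x + a); a defaults to the smallest positive value so that
    records with x == 0 do not hit log(0)."""
    if a is None:
        a = min(v for v in values if v > 0)
    return [math.log(v + a) for v in values]

random.seed(1)
# Synthetic local-usage minutes with a strong positive skew
local = [round(random.expovariate(1 / 40), 2) for _ in range(1000)]
loglocal = log_transform(local)
```

For exponential-like data the transformed field is far less right-skewed than the original, which is exactly the improvement the histogram of LOGLOCAL shows.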
Close the Histogram plot window
The above is a simple example of a transformation that can be used. Table 1.1 gives a number of other possible transformations you may wish to try when transforming data, together with their CLEM expressions.

Table 1.1 Possible Numerical Transformations

Transformation            CLEM Expression
e^x                       exp(x), where x is the name of the field to be transformed
ln(x + a)                 log(x + a), where a is a numerical constant
ln((x - a) / (b - x))     log((x - a) / (b - x)), where a and b are numerical constants
log10(x + a)              log10(x + a)
sqrt(x)                   sqrt(x)
1 / e^(mean(x) - x)       1 / exp(@GLOBAL_AVE(x) - x), where @GLOBAL_AVE(x) is the average of the field x, set using the Set Globals node in the Output palette
By default, a new field will be created from the original field name with the suffix _TILEN, where N stands for the number of bins to be created (here five). Percentiles can be based on the record count (in ascending order of the value of the bin field, which is the standard definition of percentiles), or on the sum of the field.
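Record-count tile binning of this kind can be sketched as: sort the values, then assign bins by rank so each bin holds a (nearly) equal number of records. The Python below is an illustrative analogue of the Binning node's quintile option, not its exact algorithm (tie handling, for instance, is ignored):

```python
def tile_bin(values, n_tiles=5):
    """Assign each value a bin 1..n_tiles so that bins hold (nearly)
    equal record counts -- an analogue of a LOCAL_TILE5 field."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_tiles // len(values) + 1
    return bins

values = list(range(100))   # 100 hypothetical LOCAL readings
bins = tile_bin(values)     # each quintile gets 20 records
```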
Figure 1.7 Completed Binning Node to Group LOCAL by Quintiles
The Generate tab allows you to view the bins that have been created and their upper and lower limits. Understandably, this information is not available until the node has been executed, since the bin thresholds must first be determined from the data.
Click OK
To study the relationship between binned LOCAL (LOCAL_TILE5) and CHURNED, we could use a Matrix node, since both fields are categorical, but we can also use a Distribution node, which will be our choice here.
Add a Distribution node to the stream and attach it to the Binning node
Edit the Distribution node and select LOCAL_TILE5 as the Field
Select CHURNED as the Overlay field
Click the Normalize by color checkbox (not shown)
Click Execute
Figure 1.8 Distribution of CHURNED by Binned LOCAL
An interesting pattern is apparent: essentially all the involuntary churners are in the first quintile of LOCAL_TILE5 (notice how the number of cases in each category is almost exactly the same). Perhaps we got lucky when specifying quintiles as the binning technique, but we have found a clear pattern that might not have been evident if LOCAL had not been binned. We would next want to know the bounds of the first quintile, and to see them we need to edit the Binning node.
Close the Distribution plot window
Edit the Binning node for LOCAL
Click the Bin Values tab
Select 5 from the Tile: menu
Figure 1.9 Bin Thresholds for LOCAL
We observe that the upper bound for Bin 1 is 10.38 minutes. That means that essentially all the involuntary churners spent less than 10.38 minutes on local calls, since they all fall into this bin (quintile). Given this finding, we might decide to use the binned version of LOCAL in modeling, or try two models, one with the original field and one with the binned version.
the training records. Because of this capability, the use of the Partition node makes model assessment more efficient. To illustrate the use of data partitioning, we will create a partition field for the churn data with two values, for training and testing. Although the Partition node assists in selecting records for training and testing, its output is a new field, and so it can be found in the Field Ops palette.
Add a Partition node to the stream and connect the Type node to it
Edit the Partition node
The name of the partition field is specified in the Partition field text box. The Partitions choice allows you to create a new field with either 2 or 3 values, depending on whether you wish to create 2 or 3 data samples. The size of each partition is specified in the partition size text boxes. Sizes are relative and given in percentages (which do not have to sum to 100%). If the sum of the partition sizes is less than 100%, the records not (randomly) included in a partition will be discarded. The Generate menu allows you to create Select nodes that will select records in the training, testing, and validation samples. We'll change the size of the training and testing partitions, and input a random seed so our results are comparable.

Figure 1.10 Partition Node Settings
The new field Partition has close to a 70/30 distribution. It can now be used directly in modeling as described above, or separate files can be created with the Select node. We will use the partition field in a later chapter, so we'll save the stream for later use.
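The Partition node's behavior can be approximated as: for each record, draw a random number with a fixed seed and label the record for training or testing. The partition value labels below mimic typical Clementine output but are assumptions:

```python
import random

def add_partition(records, train_pct=70, seed=123):
    """Append a Partition field with values '1_Training'/'2_Testing',
    assigned at random in roughly a train_pct/(100-train_pct) split.
    A fixed seed keeps the split reproducible, as in the text."""
    rng = random.Random(seed)
    for r in records:
        r["Partition"] = ("1_Training" if rng.random() * 100 < train_pct
                          else "2_Testing")
    return records

data = [{"id": i} for i in range(1477)]   # 1477 records, as in churn.txt
data = add_partition(data)
train = sum(r["Partition"] == "1_Training" for r in data)
```

Because the assignment is random, the observed split is only close to 70/30, which is why the text describes the Partition field's distribution as "close to" that ratio.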
Click on File…Save Stream As
Save the stream with the name Chapter1_Partition
The procedure helps you quickly detect unusual cases during data exploration before you begin modeling. It is important to note that the definition of an anomalous case is statistical and not particular to any specific industry or application, such as fraud in the finance or insurance industry (although it is possible that the technique might find such cases). Clustering is done using the TwoStep cluster routine (also available in the TwoStep node). In addition to clustering, the Anomaly node scores each case to identify its cluster group, creates an anomaly index to measure how unusual it is, and identifies which variables contribute most to the anomalous nature of the case.

We'll use a new data file to demonstrate the Anomaly node's operation. The file, customer_dbase.sav, is a richer data file that is also from a telecommunications company. It has an outcome field churn which measures whether a customer switched providers in the last month. There is no target field for anomaly detection, but in most instances you will want to use the same set of variables in the Anomaly node that you plan to use for modeling. There is an existing stream file we can use for this example. The Anomaly node is found in the Modeling palette since it uses the TwoStep clustering routine.
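The anomaly index idea, scoring each record by how far it sits from the norm of its peer group, can be illustrated with a much-simplified Python analogue (nearest-centroid groups instead of TwoStep clustering, and a plain deviation ratio instead of Clementine's exact index):

```python
def anomaly_scores(records, centroids):
    """Assign each record to its nearest centroid (peer group), then score
    it by its mean squared deviation from that centroid relative to the
    group's average deviation -- a toy stand-in for the anomaly index."""
    assignments, deviations = [], []
    for r in records:
        dists = [sum((a - b) ** 2 for a, b in zip(r, c)) / len(r)
                 for c in centroids]
        group = dists.index(min(dists))
        assignments.append(group)
        deviations.append(dists[group])
    scores = []
    for group, dev in zip(assignments, deviations):
        group_devs = [d for g, d in zip(assignments, deviations) if g == group]
        avg = sum(group_devs) / len(group_devs)
        scores.append(dev / avg if avg else 0.0)
    return assignments, scores

# Two hypothetical peer-group centers and four records; the last record
# sits between the groups and should score as the anomaly
centroids = [(0.0, 0.0), (10.0, 10.0)]
records = [(0.1, 0.0), (0.0, 0.2), (10.1, 9.9), (4.0, 4.0)]
groups, scores = anomaly_scores(records, centroids)
```

Records close to their group's norm score near zero; the odd record scores far above its peers, which is the pattern the $O-AnomalyIndex field exposes.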
Click File…Open Stream
Double-click on Anomaly_FeatureSelect.str in the c:\Train\ClemPredModel directory
Execute the Table node and view the data
Close the Table window
Place an Anomaly node in the stream and connect it to the Type node
Edit the Anomaly node, and then click the Fields tab
You will typically specify exactly which fields should be used to search for anomalous cases. In these data, there are several fields that measure various aspects of the customer's account, and we want to use all of them here (there are also demographic fields, but in the interests of keeping this example relatively simple, we will restrict somewhat the number and type of fields used).
By default, the procedure will use a cutoff value that flags 1% of the records in the data. The cutoff is included as a parameter in the model being built, so this option determines how the cutoff value is set for modeling, but not the actual percentage of records to be flagged during scoring; actual scoring results may vary depending on the data. The Number of anomaly fields to report option specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned. We'll use the defaults for this example.
Click Execute
Right-click on the Anomaly model in the Models Manager, and select Browse
Click the Expand All button
Figure 1.14 Browsing Anomaly Generated Model Results
We see that three clusters (labeled Peer Groups) were created automatically (although we didn't view the Expert options, the default number of clusters to be created is set between 1 and 15). In the first cluster there are 1267 records, and 18 have been flagged as anomalies (about 1.4%, close to the 1% cutoff value). The Model browser window doesn't tell us which cases are anomalous in this cluster, but it does provide a list of fields that contributed to defining one or more cases as anomalous. Of the 18 records identified by the procedure, 16 are anomalous on the field lnwireten (the log of wireless usage over tenure in months [time as a customer]). This was a derived field created earlier in the data exploration process. The average contribution to the anomaly index from lnwireten is 0.275. This value should be used in a relative sense, in comparison to the other fields. To see information for specific records we need to add the generated Anomaly model to the stream. We will sort the records by the $O-AnomalyIndex field, which contains the index values.
Add the Anomaly generated model node to the stream near the Type node and connect the two
Add a Sort node from the Record Ops palette to the stream and connect the Anomaly generated model node to the Sort node
Edit the Sort node and select the field $O-AnomalyIndex as the sort field
Change the Sort Order to Descending
Figure 1.15 Sorting Records by Anomaly Index
Click OK
Connect a Table node to the Sort node
Execute the Table node
Figure 1.16 Records Sorted by Anomaly Index with Fields Generated by Anomaly Model
For each record, the model creates 9 new fields. The field #O-PeerGroup contains the cluster membership. The next six fields contain the top three fields that contributed to this record being an anomaly and the contribution of that field to the anomaly index (we can request fewer or more fields on which to report in the Anomaly node Model tab). Thus we see that the three most anomalous cases, with an anomaly index of 5.0, all are in cluster 2. The first two of these are most deviant on longmon and longten.
Knowing which variables made the greatest contribution to the anomaly index allows you to more easily review the data values for these cases. You don't need to look at all the fields, but instead can concentrate on the specific fields detected by the model for that case. In the interests of time, we won't take this next step here, but you might want to try this in the exercises. What we can briefly show are the options available when an Anomaly generated model is added to the stream.
Close the Table window
Edit the Anomaly generated model node in the stream
Click on the Settings tab
Note in particular that in large files, there is an option available to discard non-anomalous records, which will make investigating the anomalous records much easier. Also, you can change the number of fields on which to report here.
Close the Anomaly model Browser window
To shortcut this process and narrow the list of candidate predictors, the Feature Selection node can identify the fields that are most important (most highly related) to a particular target/outcome field. Reducing the number of fields required for modeling will allow you to develop models more quickly, but also permit you to explore the data more efficiently. Feature selection has three steps:

1) Screening: In this first step, fields are removed that have too much missing data, too little variation, or too many categories, among other criteria. Also, records with excessive missing data are removed.

2) Ranking: In the second step, each predictor is paired with the target and an appropriate test of the bivariate relationship between the two is performed. This can be a crosstabulation for categorical variables or a Pearson correlation coefficient if both variables are continuous. The probability values from these bivariate analyses are turned into an importance measure by subtracting the p value of the test from 1 (thus a low p value leads to an importance near 1). The predictors are then ranked on importance.

3) Selecting: In the final step, a subset of predictors is identified to use in modeling. The number of predictors can be identified automatically by the model, or you can request a specific number.

Feature selection is also located in the Modeling palette and creates a generated model node. This node, though, does not add predictions or other derived fields to the stream. Instead, it acts as a filter node, removing unnecessary fields downstream (with parameters under user control). We'll try feature selection on the customer database file. Note that although we are using feature selection after demonstrating anomaly detection, you may want to use these two in combination. For example, you can first use feature selection to identify important fields. Then you can use anomaly detection to find unusual cases on only those fields.
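The ranking step's importance = 1 - p rule can be illustrated for one binary predictor and a binary target: crosstabulate, compute a Pearson chi-square, and subtract the tail probability from 1. This Python sketch uses the exact chi-square tail for 1 degree of freedom; Clementine's implementation details (continuity corrections, other importance measures) are not reproduced, and the data are synthetic:

```python
import math

def importance_2x2(pairs):
    """pairs: list of (predictor_value, target_value), both coded 0/1.
    Returns 1 - p for the Pearson chi-square test of independence."""
    n = len(pairs)
    counts = {}
    for x, y in pairs:
        counts[(x, y)] = counts.get((x, y), 0) + 1
    row = {x: sum(c for (a, _), c in counts.items() if a == x) for x in (0, 1)}
    col = {y: sum(c for (_, b), c in counts.items() if b == y) for y in (0, 1)}
    stat = 0.0
    for x in (0, 1):
        for y in (0, 1):
            expected = row[x] * col[y] / n
            observed = counts.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    p = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return 1 - p

# A strongly related predictor versus a completely unrelated one
related = [(0, 0)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5 + [(1, 1)] * 45
unrelated = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25
```

The related predictor's importance lands essentially at 1, the unrelated one's at 0, which is the ranking behavior the node relies on.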
Add a Feature Selection node to the stream and connect it to the Type node
Edit the Feature Selection node and click the Fields tab
Click the Use custom settings button
Select churn as the Target field (not shown)
Select all the fields from region to news as Inputs (be careful not to select churn again)
Click the Model tab
Figure 1.18 Model Tab for Feature Selection to Predict Churn
By default fields will initially be screened based on the various criteria listed in the Model tab. A field can have no more than 70% missing data (which is rather generous, and you may wish to modify this value). There can be no more than 90% of the records with the same value, and the minimum coefficient of variation (standard deviation/mean) is 0.1. All of these are fairly liberal standards.
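The three default screens translate directly into code. This Python sketch applies them to a numeric field; the thresholds are the defaults quoted above, and the field values are made up for illustration:

```python
def passes_screening(values, max_missing=0.70, max_single=0.90, min_cv=0.1):
    """Return True if a numeric field survives the default screens:
    <= 70% missing, <= 90% of records sharing one value, and a
    coefficient of variation (sd/mean) of at least 0.1."""
    n = len(values)
    present = [v for v in values if v is not None]
    if n - len(present) > max_missing * n:
        return False            # too much missing data
    top = max(present.count(v) for v in set(present))
    if top > max_single * len(present):
        return False            # too little variation in categories
    mean = sum(present) / len(present)
    sd = (sum((v - mean) ** 2 for v in present) / len(present)) ** 0.5
    if mean and sd / abs(mean) < min_cv:
        return False            # coefficient of variation too small
    return True

constant_ish = [5.0] * 95 + [6.0] * 5      # 95% one value: fails the screen
varied = [float(v) for v in range(100)]    # plenty of variation: passes
```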
Click the Options tab
After being ranked, fields will be selected based on importance, and only those deemed Important will be selected in the model. This can be changed to select the top N fields by ranking of importance, or to select all fields that meet a minimum level of importance. Four options are available for determining the importance of categorical predictors, with the default being the Pearson chi-square value. We will use all default settings for these data.
Click Execute
Right-click on the churn Feature Selection generated model and select Browse
We selected 127 potential predictors. Seven were rejected in the screening stage because of too much missing data or too little variation. Then of the remaining 120 fields, the model selected 63 as being important, so it has reduced our tasks of data review and model building considerably.
The model ranked the fields by importance (importance is rounded off to a maximum value of 1.000). If you scroll down the list of fields in the upper pane, you will eventually see fields with low values of importance that are unrelated to churn. All variables with their box checked will be passed downstream if this node is added to a data stream. The set of important variables includes a mix, with some demographic (age, employ), account-related (tenure, ebill), and financial status (cardtenure) types. From here, we can add the generated Feature Selection model to the stream, and it will filter out the unimportant variables.

Note: When using the Feature Selection node, it is important to understand its limitations. First, importance of a relationship is not the same thing as the strength of a relationship. In data mining, the large data files used allow very weak relationships to be statistically significant. So just because a variable has an importance value near 1 does not guarantee that it will be a good predictor of some target variable. Second, nonlinear relationships will not necessarily be detected by the tests used in the Feature Selection node, so a field could be rejected yet have the potential of being a good predictor (this is especially true for continuous predictors).
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
Field                  Description
ID                     Customer reference number
LONGDIST               Time spent on long distance calls per month
International          Time spent on international calls per month
LOCAL                  Time spent on local calls per month
DROPPED                Number of dropped calls
PAY_MTHD               Payment method of the monthly telephone bill
LocalBillType          Tariff for locally based calls
LongDistanceBillType   Tariff for long distance calls
AGE                    Age
SEX                    Gender
STATUS                 Marital status
CHILDREN               Number of children
Est_Income             Estimated income
Car_Owner              Car owner (3 categories)
CHURNED                Current (still with company), Vol (leavers the company wants to keep), Invol (leavers the company doesn't want)
In this session we will perform some exploratory analysis on the Churn.txt data file and prepare these data so that they are ready for modeling.

1. Read the file c:\Train\ClemPredModel\Churn.txt (this file is blank delimited and includes field names) using a Var. File node. Browse the data and familiarize yourself with the data structure within each field.
2. Check to see if there are blanks (missing values) within the data; if you find any problems, decide how you wish to deal with these and take appropriate steps.
3. Look at the distribution of the CHURNED field. This field probably requires balancing. Do you think it better to balance using boosting or reducing of the data?
4. If you think that both of these methods are too harsh (either in terms of duplicating data too much or reducing data so there are too few cases), edit the Balance node and see if you can find a way of reducing the impact of balancing.
5. If you are going to use this data for modeling, do you wish to cache this node?
6. Use the Data Audit node to look at the distribution of some of the fields that will be used as inputs. Does the distribution of these fields appear appropriate? If not, try to find a transformation that may help the modeling process. (Note: The instructor may have already spoken about the field LOCAL; you may want to transform this field, as discussed in Chapter 1.)
7. Look at the field International. Do you think this field will need transforming or binning? Can you find a transformation that helps with this field? If not, why do you think this is?
8. Think about whether there are potentially any other fields that could be derived from existing data that may help out with the modeling process. If so, create those fields.
9. Try using the Anomaly node on these data to detect unusual records. Don't use the field CHURNED. Do you find any commonalities among most of the anomalous records?
10. If you have made any data transformations, balanced the data, or derived any fields, you may want to create a Supernode that reduces the size of your current stream.
11. Save your stream as Exer1.str.

For those with extra time: Use the Anomaly node to detect anomalous cases in the customer_dbase.sav file, as we did in the chapter. Then add the generated Anomaly node to the stream and investigate these unusual cases in more detail. Would you retain them for modeling, or not? Why?
Objectives
In this chapter we show how to build a neural network with Clementine. The resulting model will be browsed and the output explained. In addition, we will introduce the different training methods and discuss the types of algorithms available within Clementine. We will then illustrate how and when to use the expert options within the Neural Net node. Finally, we will discuss the uses of sensitivity analysis and how to prevent over-training.
Data
In this chapter we will use the data set introduced in the previous chapter, churn.txt. The data contain information on 1477 of the company's customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this chapter we want to use data mining to understand what factors influence whether an individual remains a customer or leaves for a competitor. The file contains information including length of time spent on local, long distance, and international calls, the type of billing scheme, and a variety of basic demographics, such as age and gender of the customer.

Following recommended practice, we will use a Partition node to divide the cases into two partitions (subsamples), one to build or train the model and the other to test the model (often called a holdout sample). With a holdout sample, you are able to check the resulting model's performance on data not used to fit the model. The holdout sample also has known values for the outcome field and therefore can be used to check model performance.
Neural Networks 2 - 1
A typical neural network consists of several neurons arranged in layers to create a network. Each neuron can be thought of as a processing element that is given a simple part of a task. The connections between the neurons provide the network with the ability to learn patterns and interrelationships in data. The figure below gives a simple representation of a common neural network (a Multi-Layer Perceptron).

Figure 2.1 Simple Representation of a Common Neural Network
When using neural networks to perform predictive modeling, the input layer contains all of the fields used to predict the outcome. The output layer contains an output field: the target of the prediction. The input and output fields can be numeric or symbolic (in Clementine, symbolic fields are transformed into a numeric form, using dummy or binary set encoding, before processing by the network). The hidden layer contains a number of neurons at which outputs from the previous layer combine. A network can have any number of hidden layers, although these are usually kept to a minimum. All neurons in one layer of the network are connected to all neurons within the next layer.

While the neural network is learning the relationships between the data and results, it is said to be training. Once fully trained, the network can be given new, unseen data and can make a decision or prediction based upon its experience.

When trying to understand how a neural network learns, think of how a parent teaches a child how to read. Patterns of letters are presented to the child and the child makes an attempt at the word. If the child is correct she is rewarded, and the next time she sees the same combination of letters she is likely to remember the correct response. However, if she is incorrect, then she is told the correct response and tries to adjust her response based on this feedback. Neural networks work in the same way.

Clementine provides two different classes of supervised neural networks, the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In this course we will concentrate on the MLP type of network; the reader is referred to the Clementine 11.1 Node Reference for more details on the RBFN approach to neural networks.

Within a Multi-Layer Perceptron (MLP), each hidden layer neuron receives an input based on a weighted combination of the outputs of the neurons in the previous layer. The neurons within the
Neural Networks 2 - 2
final hidden layer are, in turn, combined to produce an output. This predicted value is then compared to the correct output and the difference between the two values (the error) is fed back into the network, which in turn is updated. This feeding of the error back through the network is referred to as back-propagation. To illustrate this process we will take the simple example of a child learning the difference between an apple and a pear. The child may decide that the most useful factors in making a decision are the shape, the color and the size of the fruit; these are the inputs. When shown the first example of a fruit she may look at the fruit and decide that it is round, red in color and of a particular size. Not knowing what an apple or a pear actually looks like, the child may decide to place equal importance on each of these factors; the importance is what a network refers to as weights. At this stage the child is most likely to randomly choose either an apple or a pear for her prediction. On being told the correct response, the child will increase or decrease the relative importance of each of the factors to improve her decision (reduce the error). In a similar fashion a MLP network begins with random weights placed on each of the inputs. On being told the correct response, the network adjusts these internal weights. In time, the child and the network will hopefully make correct predictions.
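The forward pass and error feedback described above can be sketched in a few lines of Python. This is a toy illustration of back-propagation for a tiny single-output network, not Clementine's implementation; the network shape, starting weights, inputs, and learning rate are all invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """Forward pass: each neuron combines the previous layer's outputs
    with its weights and applies an activation function."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
    return hidden, output

def train_step(inputs, target, w_hidden, w_output, eta=0.5):
    """One back-propagation step: the error (target minus prediction)
    is fed back through the network to adjust every weight."""
    hidden, output = forward(inputs, w_hidden, w_output)
    error = target - output
    delta_out = error * output * (1 - output)            # output-layer gradient
    new_w_output = [w + eta * delta_out * h for w, h in zip(w_output, hidden)]
    new_w_hidden = []
    for ws, h, w_o in zip(w_hidden, hidden, w_output):   # propagate error back
        delta_h = delta_out * w_o * h * (1 - h)
        new_w_hidden.append([w + eta * delta_h * x for w, x in zip(ws, inputs)])
    return new_w_hidden, new_w_output, error

# Fruit toy example: inputs score shape, colour and size; target 1 = "apple"
w_hidden = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
w_output = [0.3, -0.2]
for _ in range(500):
    w_hidden, w_output, error = train_step([1.0, 0.8, 0.5], 1.0, w_hidden, w_output)
print(round(abs(error), 3))  # the error shrinks as the weights adjust
```

Like the child in the analogy, the network starts from essentially arbitrary weights and reduces its error a little on each pass through the data.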
Next we will add a Table node to the stream. This will not only force Clementine to instantiate the data but will also act as a check to ensure that the data file is being read correctly.
Place a Table node from the Output palette above the Type node in the Stream Canvas
Connect the Type node to the Table node
Right-click the Table node
Execute the Table node
Figure 2.2 Type Node Ready for Modeling
Notice that ID will be excluded from any modeling as the direction is automatically set to None for a Typeless field. The CHURNED field will be the output field for any predictive model and all fields but ID and Partition will be used as predictors.
Click OK
Place a Neural Net node from the Modeling palette to the right of the Type node
Connect the Type node to the Neural Net node
Notice that once the Neural Net node is added to the data stream, its name becomes CHURNED, the field we wish to predict. The name can be changed (among other things) by editing the Neural Net node.
Double-click the Neural Net node
The name of the Network, which, by default, will also be used as the name for the Neural Net and the generated model node, can be entered in the Model name Custom text box. The Use partitioned data option is checked so that the Neural Net node will only use the Training cases on the Partition field to build the model and hold out the Testing cases for testing purposes. If this option is left unchecked, this node will ignore the Partition field and use all of the cases to build the model. There are six different algorithms available within the Neural Net node. The Quick method uses a feed-forward back-propagation network whose topology (number and configuration of nodes in the hidden layer) is based on the number and types of the input and output fields. For details on the other neural network methods, the reader is referred to the Clementine 11.1 Node Reference. Over-training is one of the problems that can occur within neural networks. As the data pass repeatedly through the network, it is possible for the network to learn patterns that exist in the sample only and thus over-train. That is, it will become too specific to the training sample data and lose its ability to generalize. By selecting the Prevent overtraining option (checked by
default), only a randomly selected proportion of the training data is used to train the network (this is separate from a holdout sample discussed above). By default, 50% of the data is selected for training the model, and 50% for testing it. Once the training proportion of data has made a complete pass through the network, the rest is reserved as a test set to evaluate the performance of the current network. By default, this information determines when to stop training and provides feedback information. We advise you to leave this option turned on. Note that by checking both the Use partitioned data and the Prevent overtraining options, the Neural Net model will be trained on 50 percent of the training sample selected by the Partition node, and not on half of the entire data set. You can control how Clementine decides to stop training a network. By default, Clementine stops when it appears to have reached its optimally trained state; that is, when accuracy in the test data set seems to no longer improve. Alternatively, you can set a required accuracy value, a limit to the number of cycles through the data, or a time limit in minutes. In this chapter we use the default option. Since the neural network initiates itself with random weights, the behavior of the network can be reproduced using the Set random seed option and entering the same seed value. Although we do it here to reproduce the results in the guide, setting the random seed is not a normal practice and it is advisable to run several trials on a neural network to ensure that you obtain similar results using different random starting points. The Optimize option allows you to make a tradeoff between speed and memory usage. Select Speed to never have Neural Net use disk space for memory in order to improve performance. Alternatively, select Memory to use available disk space when appropriate, at some sacrifice to speed.
By default, optimize for memory is selected, and we recommend leaving it at this setting unless your computer is low in installed memory. The Options tab allows you to customize some settings:
Click the Options tab
Figure 2.5 Neural Net Options Tab
The Use binary set encoding option uses an alternative method of coding fields of type Set when they are used in the Neural Net node. It is more efficient and thus can have benefits when Set fields included in the model have a large number of values. By default, a feedback graph appears while the network is training and gives information on the current accuracy of the network. We will describe the feedback graph in more detail later. By default, a model will be generated from the best network found (based on the test data), but there is an option to generate a model from the final network trained. This can be used if you wish to stop the network at different points to examine intermediate results, and then pick up network training from where it left off (you will need to check the Continue training existing model as well). Sensitivity analysis provides a measure of relative importance for each of the fields used as inputs to the network and is helpful in evaluating the predictors. We will retain this option as well. The Expert tab allows you to refine the properties (for example, the network topology and training parameters) of the training method. Expert options are detailed in Clementine 11.1 Node Reference. Initially, we will keep the default settings on the above options. The Fields tab can be used to override the Type node direction settings and directly select the predictors and outcome field for the neural net.
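To see why binary set encoding is more efficient than the default dummy coding for Set fields with many values, compare the two schemes in a short sketch. The exact bit layout Clementine uses is not documented here, so the `binary_encode` function below is one plausible scheme (the category's index written in base 2), intended only to illustrate the idea.

```python
import math

def dummy_encode(value, categories):
    """One-of-N (dummy) coding: one input neuron per category value."""
    return [1 if value == c else 0 for c in categories]

def binary_encode(value, categories):
    """Binary set encoding: the category's index written in base 2,
    needing only ceil(log2(N)) neurons instead of N."""
    bits = max(1, math.ceil(math.log2(len(categories))))
    index = categories.index(value)
    return [(index >> b) & 1 for b in range(bits)]

regions = ["North", "South", "East", "West", "Central"]
print(dummy_encode("East", regions))   # [0, 0, 1, 0, 0] -- 5 neurons
print(binary_encode("East", regions))  # [0, 1, 0] -- 3 neurons
```

For a five-value set the saving is modest, but for a set with, say, 100 values, dummy coding needs 100 input neurons while binary coding needs only 7.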
To reproduce the results in this training guide:
Click the Model tab
Click the Set random seed check box
Type 233 into the Seed: text box
Click Execute
Note that if different models are built from the same stream using different inputs, it may be advisable to change the Neural Net node names for clarity. Figure 2.6 Feedback Graph During Network Training
Clementine passes the data stream to the Neural Net node and begins to train the network. A feedback graph similar to the one shown above appears on the screen (although it may not appear for this small data file if your computer is reasonably fast). The graph contains two lines. The red, more irregular line, labeled Current Predicted Accuracy, presents the accuracy of the current network in predicting the test data set. The blue, smoother line, labeled Best Predicted Accuracy, displays the best accuracy so far on the test data. Training can be paused by clicking the Stop execution button in the Toolbar (this button can be found next to the Execute buttons).
Once trained, the network performs the sensitivity analysis (if requested), and a diamond-shaped node with the neural net symbol appears in the Models palette. This represents the trained network and is labeled with the output field name.
Figure 2.7 Generated Neural Net Node Appearing in Models Palette
This menu allows you to open a model in the palette, save the Models palette and its contents, open a previously saved Models palette, clear the contents of the palette, or add the generated models to the Modeling section of the CRISP-DM project window. If you use SPSS Predictive Enterprise Services to manage and run your data mining projects, you can store the palette in, or retrieve a palette or model from, the Predictive Enterprise Repository. The second menu is specific to the generated model nodes.
Right-click the generated Neural Net node named CHURNED in the Models palette
This menu allows you to rename, annotate, and browse the generated model node. A generated model node can be deleted, exported in PMML (Predictive Model Markup Language) format, stored in the Predictive Enterprise Repository, or saved in a file for future use. We will first browse the model.
Click Browse
For more information on a section, simply expand it by double-clicking the section (or click the Expand All button to expand all sections at once). To start with we will take a closer look at the Analysis section.
Expand the Relative Importance of Inputs folder
The output is shown in Figure 2.10. The Analysis section displays information about the neural network. The predicted accuracy for this neural network is 75.2%, indicating the proportion of the test set (used to prevent overtraining) correctly predicted. The input layer is made up of one neuron per numeric or flag type field. Set type fields will have one neuron per value within the set (unless binary encoding is used). In this example, there are twelve numeric or flag fields and one set field with three values, totaling fifteen neurons. In this network there is one hidden layer, containing three neurons, and the output layer contains three neurons corresponding to the three values of the output field, CHURNED. If the output field had been defined as numeric then the output layer would only contain one neuron.
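The neuron counting just described can be written out as a small helper. This is an illustration of the counting rule in the text (one neuron per numeric or flag field, one per value of a Set field), not a Clementine API; the field list is made up to match the example's twelve numeric/flag fields plus one three-valued set field.

```python
import math

def input_layer_size(fields, binary_encoding=False):
    """Count input neurons: one per numeric or flag field, and one per
    value of a Set field (or ceil(log2(N)) with binary set encoding)."""
    neurons = 0
    for kind, n_values in fields:
        if kind in ("numeric", "flag"):
            neurons += 1
        elif kind == "set":
            neurons += (math.ceil(math.log2(n_values))
                        if binary_encoding else n_values)
    return neurons

# Twelve numeric/flag fields plus one three-valued set field -> 15 neurons
fields = [("numeric", None)] * 8 + [("flag", None)] * 4 + [("set", 3)]
print(input_layer_size(fields))  # 15
```

With binary set encoding turned on, the same field list would need fewer input neurons.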
The input fields are listed in descending order of relative importance. Importance values can range from 0.0 to 1.0, where 0.0 indicates unimportant and 1.0 indicates extremely important. In practice this figure rarely goes above 0.40 or so. Here we see that International is the most important field within this current network, followed by LONGDIST and SEX.
Figure 2.10 Browsing the Generated Net Node
The Fields, Build Settings and Training Summary sections contain technical details, and we will skip them.
Click File > Close to close the CHURNED Neural Net output window
We will rerun the model just once and compare it with the one we just ran. Normally, you would need to rerun it several times.
Double-click the Neural Net node
Change the random seed in the Seed: text box from 233 to 444
Click Execute
Right-click on the generated Neural Net node named CHURNED in the Models palette
Click Browse
Expand the Relative Importance of Inputs folder
Figure 2.11 Browsing the Generated Net Node after Changing the Seed
While the same three fields as in the previous model were chosen as the most important again, notice the second model ranked SEX as more important than LONGDIST. Also note that the accuracy has jumped from 75.2 to 80.7 percent. Normally we would rerun the model again to further convince ourselves that these are indeed the top three predictors of CHURNED, but we will stop here and attempt to further understand the model.
Click File > Close to close the CHURNED Neural Net output window
that you are likely to remain a customer, or leave. In the following sections we will use some techniques available in Clementine to help you evaluate the network and discover its simple structure.
Move the Neural Net node named CHURNED higher in the Stream Canvas
Place the generated Neural Net model named CHURNED from the Models palette to the right of the Type node on the Stream Canvas
Connect the Type node to the generated Neural Net model named CHURNED
Place a Table node below the generated Neural Net model named CHURNED
Connect the generated Neural Net model named CHURNED to the Table node
Figure 2.13 Table Showing the Two Fields Created by the Generated Net Node
The generated Neural Net node calculates two new fields, $N-CHURNED and $NC-CHURNED, for every record in the data file. The first represents the predicted CHURNED value and the second a confidence value for the prediction. The latter is only appropriate for symbolic outputs and will be in the range of 0.0 to 1.0, with the more confident predictions having values closer to 1.0.
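Conceptually, both generated fields come straight from the output layer: the predicted category is the output neuron with the highest activation, and the confidence reflects how strongly that neuron won. The sketch below mirrors that idea; the normalization used for the confidence here is an assumption for illustration, not Clementine's exact formula.

```python
def score(output_activations, categories):
    """Derive the two generated fields from the output layer:
    $N-CHURNED is the category with the highest activation, and
    $NC-CHURNED is a confidence in the 0.0-1.0 range (here, the winning
    activation as a share of the total -- an illustrative choice)."""
    best = max(range(len(categories)), key=lambda i: output_activations[i])
    prediction = categories[best]
    confidence = output_activations[best] / sum(output_activations)
    return prediction, confidence

pred, conf = score([0.15, 0.70, 0.15], ["Current", "Vol", "Invol"])
print(pred, round(conf, 2))  # Vol 0.7
```

The closer the winning activation dominates the others, the closer the confidence is to 1.0.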
Close the Table window
First we will edit the Select node on the left that we will use to select the Training sample cases:
Double-click on the Select node on the left to edit it
Click OK
Now we will edit the Select node on the right to select the Testing sample cases:
Double-click on the Select node on the right to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button and insert the value 2_Testing
Click the Select from existing field values button
Click OK and OK
Now, attach a Matrix node to each of the Select nodes. For each of the Select nodes:
Place a Matrix node from the Output palette below each Select node
Connect the Matrix nodes to the Select nodes
Double-click on each Matrix node to edit it
Put CHURNED in the Rows:
Put $N-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
For each actual churn category, the Percentage of row choice will display the percentage of records predicted into each of the outcome categories.
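The row-percentage calculation the Matrix node performs can be reproduced in a few lines. The data below are invented for illustration; only the calculation (each cell as a percentage of its row total) matches what the node does.

```python
from collections import Counter

def row_percentages(actual, predicted):
    """Cross-tabulate actual vs predicted values and express each cell
    as a percentage of its row, like Percentage of row in the Matrix node."""
    counts = Counter(zip(actual, predicted))
    rows = sorted(set(actual))
    cols = sorted(set(predicted))
    table = {}
    for r in rows:
        total = sum(counts[(r, c)] for c in cols)
        table[r] = {c: 100.0 * counts[(r, c)] / total for c in cols}
    return table

# Toy sample: six records with their actual and predicted churn categories
actual    = ["Current", "Current", "Vol", "Vol", "Vol", "Invol"]
predicted = ["Current", "Vol",     "Vol", "Vol", "Current", "Invol"]
print(row_percentages(actual, predicted)["Vol"])
```

Reading along a row then shows, for each actual category, what percentage of its records the model routed into each predicted category.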
Figure 2.15 Updated Stream with new Select and Matrix Nodes
Figure 2.16 Matrix of Actual (Rows) and Predicted (Columns) Churned for Training and Testing Samples
For the training data, the model correctly predicts 78.5% of the current customers, 82.0% of the voluntary leavers and 100% of the involuntary leavers. It is of course up to the researcher to decide whether these are acceptable levels of accuracy. The results for the Testing sample are very similar, which suggests that the model will work well with unseen data. When you decide whether to accept a model, and you report on its accuracy, you should use the results from the Testing (or Validation) sample. The model's performance on the Training data may be too optimized for that particular sample, so its performance on the Testing sample will be the best indication of its performance in the future.
Close the Matrix windows
different evaluation criterion. Here we discuss Gains and Lift charts. For information about the others, which include Profit and ROI charts, see the Clementine 11.1 Node Reference. Gains are defined as the proportion of total hits that occurs in each quantile. We will examine the gains when the data are ordered from those most likely to those least likely to be in the current category (based on the confidence of the model prediction).
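The gains definition above can be made concrete with a short calculation: sort the records by model confidence, cut them into quantiles, and report the cumulative share of hits captured. The scored records below are invented for the example.

```python
def cumulative_gains(scored, n_quantiles=10):
    """Order records by model confidence (highest first) and report the
    cumulative percentage of all hits captured by each quantile."""
    ordered = sorted(scored, key=lambda rec: rec[0], reverse=True)
    hits = [rec[1] for rec in ordered]
    total_hits = sum(hits)
    size = len(hits) // n_quantiles
    gains, running = [], 0
    for q in range(n_quantiles):
        running += sum(hits[q * size:(q + 1) * size])
        gains.append(100.0 * running / total_hits)
    return gains

# (confidence, is_hit) pairs -- a toy scored sample of ten records
scored = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
print(cumulative_gains(scored, n_quantiles=5))  # [40.0, 60.0, 80.0, 100.0, 100.0]
```

A model predicting at the chance level would capture 20% of the hits per fifth of the sample; here the top fifth already captures 40%, which is exactly the lift over the baseline the Gains chart draws.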
Place an Evaluation node from the Graphs palette near the generated Neural Net model named CHURNED
Connect the generated Neural Net model named CHURNED to the Evaluation node
Figure 2.17 Stream with Evaluation Node Connected to a Generated Model Node
Figure 2.18 Evaluation Node Dialog Box
The Chart type option supports five chart types with Gains chart being the default. If Profit or ROI chart type is selected, then the appropriate options (cost, revenue and record weight values) become active so information can be entered. The charts are cumulative by default (see Cumulative plot check box), which is helpful in evaluating such business questions as how will we do if we make the offer to the top X% of the prospects? The granularity of the chart (number of points plotted) is controlled by the Plot drop-down list and the Percentiles choice will calculate 100 values (one for each percentile from 1 to 100). For small data files or business situations in which you can only contact customers in large blocks (say some number of groups, each representing 5% of customers, will be contacted through direct mail), the plot granularity might be decreased (to deciles (10 equal-sized groups) or vingtiles (20 equal-sized groups)). A baseline is quite useful since it indicates what the business outcome value (here gains) would be if the model predicted at the chance level. The Include best line option will add a line corresponding to a perfect prediction model, representing the theoretically best possible result applied to the data where hits = 100% of the cases.
Click the Include best line checkbox
The Split by partition option provides an opportunity to test the model against the unseen data that was held out by the Partition node. If checked, an evaluation chart will be displayed for both the Training and Testing samples. We will accept the default option to split by partition.
Click the Options tab
To change the definition of a hit, check the User defined hit check box and then enter the condition that defines a hit in the Condition box. For example, if we wanted the evaluation chart to be based on the Vol (voluntary leavers) category, the condition would be @TARGET = "Vol", where @TARGET represents the target field from any models in the stream. The Expression Builder can be used to build the expression defining a hit. This tab also allows users to define how scores are calculated (User defined score), which determines how the records are ordered in Evaluation charts. Typically scores are based on functions involving the predicted value and confidence.
Figure 2.19 Evaluation Node Options Tab
The Include business rule option allows the Evaluation chart to be based only on records that conform to the business rule condition. So if you wanted to see how a model(s) performs for single males, the business rule could be STATUS = "S" and SEX ="M". The model evaluation results used to produce the evaluation chart can also be exported to a file (Export results to file option).
Click Execute
Figure 2.20 Gains Chart (Cumulative) with Current as Target
The vertical axis of the gains chart is the cumulative percentage of the hits, while the horizontal axis represents the ordered (by model prediction and confidence) percentile groups. The diagonal line presents the base rate, that is, what we expect if the model is predicting the outcome at the chance level. The upper line (labeled Best) represents the results if a perfect model were applied to the data, and the middle line (labeled $N-CHURNED) displays the model results. The three lines connect at the extreme [(0, 0) and (100, 100)] points. This is because if either no records or all records are considered, the percentage of hits for the base rate, best model, and actual model are identical. The advantage of the model is reflected in the degree to which the model-based line exceeds the base-rate line for intermediate values in the plot, and the area for model improvement is the discrepancy between the model line and the Best (perfect model) line. If the model line is steep for early percentiles, relative to the base rate, then the hits tend to concentrate in those percentile groups of data. At the practical level, this would mean for our data that many of the current customers could be found within a small portion of the ordered sample. You can create bands on an Evaluation chart and generate a Select or Derive node for a band of business interest.

Click Edit > Enable Interaction

The fact that the evaluation charts for both the Training and Testing data are so strikingly similar strongly suggests that the model will work well with unseen data. An examination of each of the evaluation charts reveals that across percentiles 1 through 40, the distance between the model and baseline lines grows (indicating a concentration of current customers). If we look at the 40th percentile value (horizontal axis) for the Training data, we see that under the base rate we expect to find 40% of the hits (current customers) in the first 40 percentiles (40%) of the sample, but the
model produces over 70% of the hits in the first 40 percentiles of the model-ordered sample. This percentage can be displayed by placing your cursor on the line at the 40th percentile. The steeper the early part of the plot, the more successful the model is in predicting the target outcome. Notice that the line representing a perfect model (Best) continues with a steep increase between the 10th and 40th percentiles, while the results from the actual model flatten.
Figure 2.21 Hit Rate for Bad Losses at the 40th Percentile
For the remainder of the percentiles (50 through 100), the distance between the model and base rate narrows, indicating that these last model-based percentile groups contain a relatively small (lower than the base rate) proportion of current customers. The Gains chart provides a way of visually evaluating how the model will do in predicting a specified outcome. The lift chart is another way of representing this information graphically. It plots a ratio of the percentage of records in each quantile that are hits divided by the overall percentage of hits in the training data. Thus the relative advantage of the model is expressed as a ratio to the base rate.
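The lift ratio described above can be computed directly from the same kind of scored records used for gains. Again the sample data are invented; the calculation shows cumulative lift, the hit rate among the top-ranked records so far divided by the overall hit rate.

```python
def cumulative_lift(scored, n_quantiles=10):
    """Cumulative lift per quantile: the hit rate among the top records
    so far divided by the overall hit rate in the whole sample."""
    ordered = sorted(scored, key=lambda rec: rec[0], reverse=True)
    hits = [rec[1] for rec in ordered]
    base_rate = sum(hits) / len(hits)
    size = len(hits) // n_quantiles
    lifts = []
    for q in range(1, n_quantiles + 1):
        top = hits[:q * size]
        lifts.append((sum(top) / len(top)) / base_rate)
    return lifts

# (confidence, is_hit) pairs -- the same toy scored sample as before
scored = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
print([round(l, 2) for l in cumulative_lift(scored, n_quantiles=5)])
```

A lift of 2.0 in the first quantile means the model finds hits there at twice the base rate; by the final quantile the cumulative lift necessarily falls back to 1.0.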
Close the Evaluation chart window
Double-click the Evaluation node named $N-CHURNED
Click the Plot tab
Click the Lift Chart Type option
Click Execute
The lift chart is shown in Figure 2.22. The lift value at the 20th percentile is 1.848 (recall this is cumulative lift), providing another measure of the relative advantage of the model over the base rate. Gains charts and lift charts are very helpful in marketing and direct mail applications, since they provide evaluations of how well the campaign would do if it were directed to the top X% of prospects, as scored by the model. Although the performance on the Testing data is slightly lower (1.647), it is still highly acceptable, adding further confidence to the model.
Figure 2.22 Lift Chart (Cumulative) with Current Customers as Target
We have established where the model is making correct and incorrect predictions and evaluated the model graphically. But how is the model making its predictions? In the next section we will examine a couple of methods that may help us to begin to understand the reasoning behind the predictions.
Close the Evaluation chart window
The Normalize by color option creates a bar chart with each bar the same length. This helps to compare the proportions in each overlay category. Figure 2.23 Distribution Plot Relating Sex and Predicted Churned ($N-CHURNED)
The chart illustrates that the model is predicting that the majority of females are voluntary leavers, while the bulk of males were predicted to remain current customers.
Close the Distribution plot window
We need to normalize by color because there are so few people in the file who did a substantial amount of international calling. Figure 2.24 Histogram with Overlay of Predicted Churned ($N-CHURNED)
Here we see that the neural network is predicting that customers who spend a great deal of time on international calling are far more likely to voluntarily leave the company than persons who rarely do international calling. Now, let's look at how Long Distance calling affected the predictions.
Close the Histogram plot window
Double-click the Histogram node
Click the Plots tab
Click LONGDIST in the Field: list
Click Execute
Figure 2.25 Histogram of LONGDIST with $N-CHURNED as the Overlay
Here the only clear pattern we see is that Involuntary Leavers tend to be people who do little or no long distance calling. In contrast, it appears that amount of long distance calling was not as much an issue when it came to predicting whether or not a person would remain a current customer or voluntarily choose to leave.
Close the Histogram plot window
Note: Use of Data Audit Node

We explored the relationship between just three input fields (International, LONGDIST, and SEX) and the prediction from the neural net ($N-CHURNED), and used Distribution and Histogram nodes to create the plots. If more inputs were to be viewed in this way, a better approach would be to use the Data Audit node, because overlay plots could easily be produced for multiple input fields, and a more detailed plot could be created by double-clicking on it in the Data Audit output window.
The figure below shows such a graph. With any complex problem there may be a large number of feasible solutions, thus the graph contains a number of sub-optimal solutions or local minima (the valleys in the plot). The trick to training a successful network is to locate the overall minimum or global solution (the lowest point), and not to get stuck in one of the local minima or sub-optimal solutions.
Figure 2.26 Representation of the Error Domain Showing Local and Global Minima
There are many different types of supervised neural networks (that is, neural networks that require both inputs and an output field). However, within the world of data mining, two are most frequently used. These are the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In the following paragraphs we will describe the main differences between these types of networks and describe their advantages and disadvantages.
Figure 2.27 Decision Surface Created Using the Multi-Layer Perceptron
The advantages of using a MLP are:
It is effective on a wide range of problems
It is capable of generalizing well
If the data are not clustered in terms of their input fields, it will classify examples in the extreme regions
It is currently the most commonly used type of network and there is much literature discussing its applications

The disadvantages of using a MLP are:
It can take a great deal of time to train
It does not guarantee finding the best global solution

Within Clementine, there are four available MLP learning algorithms: Quick, Dynamic, Multiple and Prune. In addition, an Exhaustive Prune option (Prune with a large preset topology) is available. Choosing an appropriate algorithm can involve a trade-off between computing time and accuracy (increased computing time supports a more extensive search for the global solution). We will discuss the four different algorithms and their settings in later paragraphs.
The advantages of using a RBF network are:
It is quicker to train than a MLP
It can model data that are clustered within the input space

The disadvantages of using a RBF network are:
It is difficult to determine the optimal position of the function centers
The resulting network often has a poor ability to represent the global properties of the data

Within Clementine there is only one available RBF algorithm, which uses the k-means clustering algorithm to determine the number and location of the centers in the input space. We discuss the settings of this algorithm in greater detail in the following sections.
Alpha
Alpha refers to the momentum used in updating the weights when trying to locate the global solution. It tends to keep the weight changes moving in a consistent direction and can reduce the training time. Each update includes a factor of alpha times the previous update. Alpha ranges between 0 and 1, and its default value is 0.9.
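The momentum rule above ("each update includes a factor of alpha times the previous update") is easy to sketch. The values for eta, alpha and the gradients are made up for illustration; the point is that a consistent gradient direction makes successive steps grow, which is what carries the network through small local minima.

```python
def momentum_update(weight, gradient, previous_delta, eta=0.3, alpha=0.9):
    """Weight update with momentum: the new change is the plain
    gradient step plus alpha times the previous change."""
    delta = -eta * gradient + alpha * previous_delta
    return weight + delta, delta

w, prev = 1.0, 0.0
for g in [0.4, 0.4, 0.4]:   # a consistently signed gradient...
    w, prev = momentum_update(w, g, prev)
print(round(prev, 4))       # ...makes successive steps grow in size
```

With alpha = 0, each step would stay at -0.12; with alpha = 0.9, the third step has already grown to nearly three times that size.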
Eta
Eta refers to the learning rate and can be thought of as how much of an adjustment can be made at each update. Within the expert options the Initial Eta is the starting value of eta. This is then exponentially decayed to Low Eta, at which point it is set back to High Eta and decayed back to Low Eta. The decay takes Eta Decay cycles to go from High Eta to Low Eta. Figure 2.29 illustrates the exponential decay of eta. Figure 2.29 Exponential Decay of Eta
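The eta schedule just described (exponential decay from High Eta to Low Eta over Eta Decay cycles, then a reset) can be sketched as follows. The decay formula is one reasonable reading of "exponentially decayed", and the default-style values are taken from the recommendations later in this chapter, not from Clementine's internals.

```python
import math

def eta_schedule(high_eta=0.3, low_eta=0.01, decay_cycles=30, n_cycles=90):
    """Exponential decay of eta from high_eta toward low_eta over
    decay_cycles training cycles, then reset to high_eta and repeat."""
    rate = math.log(high_eta / low_eta) / decay_cycles
    return [high_eta * math.exp(-rate * (cycle % decay_cycles))
            for cycle in range(n_cycles)]

etas = eta_schedule()
print(round(etas[0], 3), round(etas[29], 3), round(etas[30], 3))
```

The sawtooth shape this produces matches Figure 2.29: large adjustments at the start of each decay cycle, shrinking toward Low Eta, then jumping back up.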
When deciding upon settings of both alpha and eta it is often useful to think of a simple analogy. Imagine a ball bearing (the current network) rolling down a hill, trying to find the lowest point (global solution). Alpha is analogous to the size of the ball bearing and eta is the gradient of the hill. When choosing a network, we want to reach the overall minimum, not a sub-optimal solution; hence, we require a ball bearing that has enough momentum to move in and out of local minima (alpha) and a space that contains local hills with small gradients. Figures 2.30 and 2.31 illustrate this analogy: Figure 2.30 Network Becomes Stuck in a Local Minimum and Provides a Sub-Optimal Solution
Figure 2.31 Network Can Move In and Out of Local Minima and Locate the Globally Optimal Solution
Persistence
Persistence refers to the number of cycles for which the network will train without improvement in default stopping mode. Neural networks must pass the data through their neural topology many, many times before finding a stable and viable solution. When deciding at what values to set Alpha, Eta and Persistence, the feedback graph can be used to help make your decisions. Figure 2.32 illustrates this. The first feedback graph shows an ideal feedback graph. Recommended values are typically:
alpha = 0.9
eta = 0.01 to 0.3
persistence = 200
eta decay cycles = 30
If a solution is reached extremely quickly and the graph then reaches a plateau, as shown in the second graph in Figure 2.32, a reduction in the Initial Eta can help. If the accuracy increases and then suddenly drops without reaching a plateau, as illustrated in the third graph in Figure 2.32, the Eta Decay cycles should be increased. Finally, if the training terminates too quickly, as shown in the last graph in Figure 2.32, the Persistence should be increased.
Figure 2.32 Various Feedback Graphs and the Possible Solution
We will now go on to discuss each of the algorithms available within Clementine and their expert options.
Quick
The Quick method (the default) creates a network that contains one hidden layer. The number of neurons within the hidden layer is determined according to several factors relating to the number and type of fields used in the analysis. If it is not already open, open the NeuralNet.str stream.
Click File→Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on NeuralNet.str
Edit the Neural Net node (named CHURNED) in the upper stream
Click Quick on the Method drop-down list (if necessary)
Click the Expert tab
Click the Expert Mode option button
Figure 2.34 Expert Options for the Neural Net Node's Quick Training Method
You can select whether to use up to three Hidden Layers and set the number of neurons within each of these layers. The default Persistence is set to 200 and can be altered if required. The learning rates of Alpha and Eta can also be changed. Run the Neural Net model with the Quick method.
Click Execute
Dynamic
Dynamic training uses a dynamically growing network that starts with two hidden layers of two neurons each and begins to grow by adding one neuron to each layer. The network monitors the training and looks for over-training; lack of improvement triggers growth of the network. The network will continue to grow until adding a neuron gives no benefit for a number of growing attempts. This option is slow but often yields good results. The Dynamic method has no Expert options.
Multiple
The Multiple method creates a number of networks, each with a different topology. Some contain one hidden layer; others contain two, all with varying numbers of hidden neurons. The networks are trained in pseudo-parallel, which makes this method extremely slow to train, but it can yield good results.
Figure 2.35 Expert Options for the Neural Net Node's Multiple Training Method
The Topologies box is a text field in which you may specify a set of topologies for the networks to be trained. The topologies refer only to the hidden layers (because the input and output layers are fixed by the fields in the model). Semicolons separate the network definitions:

Network1; Network2; Network3

Commas separate the hidden layer structure:

Layer1, Layer2, Layer3

Spaces separate up to three numbers that represent the number of neurons in the layers:

n m inc

where:
n alone represents the number of neurons in the hidden layer
n m represents alternative sizes for the hidden layer, one for each integer between n and m inclusive
n m inc represents alternative sizes for the hidden layer, one for each integer between n and m in jumps of inc.
For example, the topologies setting (2 20 3; 2 27 5, 2 22 4) represents a set of networks with one hidden layer containing 2, 5, 8, 11, 14, 17, and 20 neurons, and then a set of networks with two hidden layers using all combinations of 2, 7, 12, 17, 22, or 27 neurons in the first layer and 2, 6, 10, 14, 18, or 22 neurons in the second hidden layer. The Discard Non-Pyramids option, when checked, ensures that no non-pyramid networks are produced. A network is a pyramid if each layer contains a smaller number of neurons than the preceding layer. Non-pyramids have been found not to train as well as pyramids. The default Persistence is set to 200 and can be altered if required. The momentum (Alpha) and learning rates (Initial Eta, High Eta, and Low Eta) can also be changed.
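The topology syntax described above is easy to mimic with a small parser. The following is our own sketch, not Clementine code; it reproduces the expansion in the example:

```python
# Parser for the Multiple method's topology syntax (our own illustration):
# semicolons separate networks, commas separate hidden layers, and
# "n m inc" expands to every size from n to m in steps of inc.

def parse_topologies(spec):
    networks = []
    for network in spec.split(";"):
        layers = []
        for layer in network.split(","):
            nums = [int(tok) for tok in layer.split()]
            if len(nums) == 1:                           # n: a single size
                sizes = nums
            elif len(nums) == 2:                         # n m: every size n..m
                sizes = list(range(nums[0], nums[1] + 1))
            else:                                        # n m inc: n..m in steps of inc
                sizes = list(range(nums[0], nums[1] + 1, nums[2]))
            layers.append(sizes)
        networks.append(layers)
    return networks

nets = parse_topologies("2 20 3; 2 27 5, 2 22 4")
print(nets[0][0])   # [2, 5, 8, 11, 14, 17, 20] — one-hidden-layer networks
```

The second entry, `nets[1]`, holds the two layers of the two-hidden-layer set: all combinations of 2, 7, 12, 17, 22, 27 in the first layer with 2, 6, 10, 14, 18, 22 in the second.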
Prune
The Prune method begins with a large one- or two-hidden-layer network (large meaning all predictors and many hidden neurons). Training is initially done in the same manner as Quick. A sensitivity analysis is then performed on the hidden neurons, and the weakest hidden neurons (the proportion based on the Hidden rate factor; default .15, or 15%) are removed from the network. This process of training and removing is repeated until there has been no improvement in the network (for a Hidden persistence number of loops). Once this has been done, training is performed, a sensitivity analysis is performed on the input neurons, and the least important input neurons (the proportion based on the Input rate factor; default .15, or 15%) are removed. This loop continues until the network no longer improves (for an Input persistence number of loops). The larger loop of pruning the hidden layer, then pruning the input layer, continues until there has been no improvement (for an Overall persistence number of loops). The final network may actually use fewer predictor fields than originally supplied. The Prune method is generally the best of the four methods, but is very time consuming.
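One pruning pass can be sketched at a high level. This is our own simplification, not Clementine's algorithm: the sensitivity scores are made-up stand-ins, and the training step is omitted entirely.

```python
# High-level sketch of one pruning pass (our own illustration): rank
# hidden neurons by a sensitivity score and drop the weakest fraction,
# given by the Hidden rate (default .15).

def prune_weakest(sensitivities, rate=0.15):
    """Return the indices of the neurons that survive one pruning pass."""
    n_remove = round(len(sensitivities) * rate)
    ranked = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    removed = set(ranked[:n_remove])                 # weakest `rate` fraction goes
    return [i for i in range(len(sensitivities)) if i not in removed]

# 20 hidden neurons with made-up sensitivity scores
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2,
          0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.02]
survivors = prune_weakest(scores)
print(len(survivors))   # 17 — the 3 weakest of 20 neurons are removed
```

In the real method this pass alternates with retraining, first over the hidden layer, then over the inputs, each governed by its own persistence counter.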
Click the Model tab and click Prune in the Method drop-down list
Click the Expert tab
Figure 2.36 Expert Options for the Neural Net Node's Prune Training Method
You may specify the size and number of Hidden Layers within the starting network (usually set slightly larger than expected). The Hidden rate is the factor by which the number of neurons within the hidden layers is reduced by each pruning. The Input rate is the factor by which the number of neurons within the input layer is reduced by each pruning. The Hidden persistence and Input persistence are the number of prunes to perform to the hidden and input layer, respectively, without improvement. The Persistence is the number of cycles for which the network will train without improvement, before it attempts to prune one of the layers. The Overall persistence is the number of times the network will pass through the prune hidden/prune input loop without improvement in Default stopping mode. As with the other algorithms, you can alter the persistence, momentum (alpha) and the learning rates (eta).
Exhaustive Prune
The Exhaustive Prune Training Method invokes Prune settings that are designed to produce a more exhaustive examination of networks than is done by the default Prune method. When Exhaustive Prune is chosen, the Hidden rate and Input rate are reduced (from .15) to .02 and .01, respectively. This means that fewer neurons and fields are removed at each stage, so a larger number of topologies can be examined. The Persistence values are also increased (Hidden persistence and Input persistence are 40 and Persistence is 1000), so each topology can receive more training. In addition, the initial topology is a large two-hidden-layer network. The learning rates are not changed.
Thus, the Exhaustive Prune method provides an easy way to request that a more complete examination of networks is done within the Prune method. Since the Exhaustive Prune option invokes the Prune method with specific Expert settings, no Expert options are available when Exhaustive Prune is selected as the method. If you wish to run an Exhaustive Prune model and change the learning rates, you can use the Expert options under the Prune method to manually reduce the Hidden rate and Input rate, increase the Persistence values, and then change the learning rates. The Exhaustive Prune option will be slower than the Prune method, but since it examines more networks, it may produce a better model.
RBFN
The RBFN method creates a Radial Basis Function Network and works by initially creating a K-means clustering model that provides the centers for the hidden layer. The output layer is then trained as a Single Layer Perceptron using the Least Mean Squares (LMS) method. When the Expert mode is selected, the options shown below become available.
Click the Model tab and click RBFN in the Method drop-down list
Click the Expert tab
Figure 2.37 Expert Options for the Neural Net Node's RBFN Training Method
You may specify the size of the hidden layer by changing the number of RBF clusters. The Persistence, Alpha and Eta can be altered if required, although the eta parameter can be calculated automatically on the basis of learning performance in the first two iterations.
During RBF training, a data record (or pattern) is most strongly represented in the cluster(s) nearest it. In other words, its activation value is highest for nearby clusters. When the distance between a record and a cluster center is evaluated, it is divided by a width or spread parameter (named sigma), which is equivalent to the role of the standard deviation in a normal distribution (the normal distribution is close in form to a commonly-used activation function). In Clementine, the sigma (or spread) value for a cluster is equal to the average distance from that cluster to the two nearest clusters. These sigma values are determined by Clementine, but you can effectively increase or decrease them by modifying the RBF overlapping factor. By increasing this multiplicative factor above 1, records further from a cluster center will be reflected in that cluster during training and so clusters will tend to overlap (a record can contribute to multiple clusters). As the RBF overlapping factor is decreased (from 1 toward 0), a cluster tends to be represented in training by only those records tightly grouped around it. Thus low RBF overlapping factors lead to tight, distinct clusters, while large RBF overlapping factors lead to large, dispersed, overlapping clusters being used in training the network.
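A one-dimensional numeric sketch (our own illustration, not Clementine code) of the sigma rule and the overlapping factor:

```python
# Each cluster's sigma = average distance to its two nearest cluster
# centers, scaled by the RBF overlapping factor (our own 1-D sketch).
import math

def sigmas(centers, overlap=1.0):
    result = []
    for i, c in enumerate(centers):
        dists = sorted(abs(c - other) for j, other in enumerate(centers) if j != i)
        result.append(overlap * (dists[0] + dists[1]) / 2.0)  # mean of two nearest
    return result

def activation(x, center, sigma):
    # Gaussian-style activation: highest for records near the center
    return math.exp(-((x - center) / sigma) ** 2)

centers = [0.0, 2.0, 5.0]
print(sigmas(centers))               # [3.5, 2.5, 4.0]
print(sigmas(centers, overlap=2.0))  # [7.0, 5.0, 8.0] — wider, overlapping clusters
```

Doubling the overlapping factor doubles every sigma, so a record a fixed distance from a center produces a higher activation and contributes to more clusters, matching the behavior described above.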
Sets
The Neural Net node handles these value conversions automatically. It is therefore important to check the data in advance of running a Neural Net using the Quality and Data Audit nodes and to
decide which records and fields should be passed to the neural network for modeling. Otherwise you run the risk of a model being built using data values supplied by these substitution rules. Also, you can take control of the substitution by using the facilities of Clementine to change missing values to valid values that you prefer, before using the Neural Net node.

Note: Accuracy for a Continuous Output Neuron

For a nominal output variable, the accuracy is simply the percentage correct. It is worth noting that if the output field is continuous, then accuracy within Clementine is defined as the average across all records of the following expression:
100 * [1 - (|Target - Network Prediction| / (Range of Target values))]
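In code, this accuracy measure averages per-record scores of 100 * (1 - |target - prediction| / range of target). A minimal sketch, our own and not Clementine's implementation:

```python
# Continuous-output accuracy: mean over records of
# 100 * (1 - |target - prediction| / target_range)   (our own sketch)

def continuous_accuracy(targets, predictions):
    rng = max(targets) - min(targets)
    per_record = [100.0 * (1 - abs(t - p) / rng)
                  for t, p in zip(targets, predictions)]
    return sum(per_record) / len(per_record)

# Targets span a range of 10; two records miss by 1, one is exact.
print(round(continuous_accuracy([0.0, 5.0, 10.0], [1.0, 5.0, 9.0]), 2))   # 93.33
```

A perfect model scores 100; errors are penalized relative to the spread of the target, so the same absolute error matters less for a wide-ranging target.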
The rapid rise in the line seems to suggest that the model reached a solution too quickly (see Figure 2.32, second graph). Unfortunately, because the Dynamic method has no expert options, it won't be possible to reduce the Initial Eta, which is recommended in Figure 2.32. However, this alone doesn't mean that the Dynamic method won't be the best model with this data. As we will see, there are other factors to consider, such as overall accuracy, or the ability of the model to correctly predict a particular group of interest.
Figure 2.38 Feedback graph using the Dynamic method
Now, let's try running the Prune method. It should be noted that this method usually takes considerably longer than the other methods but is often more accurate.
Right-click on the newly generated model in the Models manager and select Rename and Annotate from the context menu
Click Custom and type Dynamic
Edit the Neural Net node (named CHURNED)
Click the Model tab and select Prune from the Method drop-down list
Click Execute
It appears that the Prune feedback graph also suggests a sub-optimal solution. While normally you would let the model run to completion, in the interest of time we will stop the execution and adjust the Initial Eta value. (In case you are interested, on my computer the model took 8 minutes and 9 seconds to complete, and the Estimated Accuracy was 85.795%. Of course, these timings will differ from computer to computer.)

Figure 2.39 Feedback graph using the Prune method
Click the Stop Execution button on the toolbar (it turns red during stream execution)
Click Yes when you are asked whether you want to stop the execution
Click No when you are asked whether you want to try generating a model
Close the Message screen
Now let's rerun the model with a lower Initial Eta value.
Edit the Neural Net node (named CHURNED)
Click the Expert tab
Click Expert Mode
Change the Initial Eta value to 0.2
Figure 2.40 Expert Prune Options After Reducing the Initial Eta value
Click Execute
At a quick glance, it appears that the pattern in the feedback graph is no different than when the Initial Eta was 0.3. We could continue decreasing the value, but we will leave that to you as an exercise. Instead, let's examine the results of the model.
Right-click on the newly generated model in the Models manager and select Rename and Annotate from the context menu
Click Custom and type Prune
Right-click again on the model and click Browse
Expand the Relative Importance of Inputs folder
Figure 2.41 Relative Importance of the fields using the Prune method
In this instance, reducing the Initial Eta resulted in a slightly less accurate model than with the defaults (85.511% vs. 85.795%). One other thing to notice is that the Prune method used fewer predictors than the Quick method. However, there is agreement between both models that Longdist, International, and Sex were the best predictors. By the way, although we did not display the sensitivity table for the Dynamic method, it also ranked these three fields as the top predictors. Now that we have three separate models, we can compare them to see which one is the best for our purposes. It should be noted that overall accuracy isn't the only criterion with which to compare them. For example, if we were primarily interested in identifying Voluntary Leavers, we would want to use the model which did the best job predicting them. We will use an Analysis node to make our comparisons.
Move all the generated models to the Stream Canvas
Connect them in sequence (Quick, Dynamic, Prune) to the Type node (as shown in Figure 2.42)
Attach an Analysis node from the Output palette to the last model
Figure 2.42 Stream for Comparing the Three Models
Edit the Analysis node
Check Coincidence matrices (for symbolic targets)
Click Execute
The output is shown in Figure 2.43. The results for each model are listed according to their order in the stream. Thus, N-CHURNED corresponds with the Quick method, N1-CHURNED with the Dynamic method, and N2-CHURNED with the Prune method. We will focus on the Testing data, or the data that was not used to build the models. This will give us a good idea how well each model will work with new data. The model with the best overall accuracy was created by the Prune method (81.4%), while the Quick (78.4%) and Dynamic (74.4%) methods did worse. However, as was mentioned earlier, it isn't always the overall accuracy which is of the most interest. For example, if the primary focus was on identifying Current Customers or Voluntary Leavers, the best choice may not necessarily be the most accurate model overall, although based on the coincidence matrices, in this case it was.
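A coincidence matrix of this kind is straightforward to compute; the following is a minimal sketch of the idea (our own illustration, with made-up records, not Analysis node output):

```python
# Coincidence (confusion) matrix for a 3-category target (our own sketch):
# counts of (actual, predicted) pairs; diagonal cells are correct.
from collections import Counter

def coincidence_matrix(actual, predicted):
    return Counter(zip(actual, predicted))

actual    = ["Current", "Vol", "Current", "InVol", "Vol", "Current"]
predicted = ["Current", "Current", "Current", "InVol", "Vol", "Vol"]
matrix = coincidence_matrix(actual, predicted)

correct = sum(n for (a, p), n in matrix.items() if a == p)
print(correct / len(actual))   # overall accuracy from the diagonal cells
```

Reading down a column of such a matrix shows what a predicted category really contains, which is exactly how you would judge a model aimed at, say, Voluntary Leavers rather than at overall accuracy.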
Figure 2.43 Results of the Model Comparisons Using the Analysis Node
Summary Exercises
A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.
response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category
In this session we will attempt to predict the field Response to campaign using a neural network.

1. Begin with a clear Stream canvas. Place an SPSS source node on the Stream canvas and connect it to the file charity.sav. Tell Clementine to use variable and value labels.
2. Attach a Type and Table node in a stream to the source node. Execute the stream and allow Clementine to automatically define the types of the fields.
3. Edit the Type node. Set all of the fields to direction NONE.
4. We will attempt to predict response to campaign using the fields listed below. Set the direction of all five of these fields to IN and the Response to campaign field to OUT.
Pre-campaign expenditure
Pre-campaign visits
Gender
Age
Mosaic Bands (which should be changed to type Set)

5. Attach a Neural Net node to the Type node. Execute the Neural Net node with the default settings.
6. Once the model has finished training, browse the generated Net node within the Generated Models palette in the Manager. What is the predicted accuracy of the neural network? What were the most important fields within the network?
7. Place the generated Net node on the Stream canvas and connect the Type node to it. Connect the generated Net node to a Matrix node and create a data matrix of actual response against predicted response. Which group is the model predicting well?
8. Use some of the methods introduced in the chapter, such as web plots and histograms (or use the Data Audit node with an overlay field), to try to understand the reasoning behind the network's predictions.
9. Try some of the other neural net algorithms to see if you can improve on the accuracy. Also try modifying some of the parameters on the Expert tab to see if you can get a better result for the current model. Note, in the interests of time, you may have to stop execution on some of these methods other than Quick. However, you can still generate a sensitivity table even after you stop the execution. Save a copy of the stream as Network.str.
Objectives
We will introduce the differences between the four types of decision tree/rule induction algorithms and detail the expert options available within their respective modeling nodes and when to use them. We will demonstrate the Interactive Trees feature with CHAID. We briefly explain how CHAID and C&R Tree can also be used to model a numeric output.
Data
We will use the data set churn.txt, which we used in the Neural Nets chapters. This data file contains information on 1477 of the company's customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this chapter, we will use decision tree models to understand which factors influence into which of the three groups an individual falls. The file contains information including length of time spent on local, long distance, and international calls, the type of billing scheme, and a variety of basic demographics, such as age and gender of the customer. A second data set, Insclaim.dat, used with the C&R Tree node, contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of errors made in the claims. Such analyses can be performed for error or fraud detection in instances where audited data (data for which the outcome, error/no error or fraud/no fraud, is known) are not available.
3.1 Introduction
Clementine contains four different algorithms for constructing a decision tree (more generally referred to as rule induction): C5.0, CHAID, QUEST, and C&R Tree (classification and regression trees). They are similar in that they can all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the outcome. However, they differ in several important ways. We begin by reviewing a table that highlights some distinguishing features of the algorithms. Next, we will examine the various options for the algorithms in the context of predicting a symbolic output. Within each section we discuss when it is viable to use the expert options within these nodes.
Model Criterion                      C5.0                   CHAID                  QUEST              C&R Tree
Split Type for Symbolic Predictors   Multiple               Multiple               Binary             Binary
Continuous Target                    No                     Yes (1)                No                 Yes
Continuous Predictors                Yes                    No (2)                 Yes                Yes
Criterion for Predictor Selection    Information measure    Chi-square (F test     Statistical        Impurity (dispersion)
                                                            for continuous)                           measure
Can Cases with Missing Predictor     Yes, uses              Yes, missing becomes   Yes, uses          Yes, uses
Values be Used?                      fractionalization      a category             surrogates         surrogates
Priors                               No                     No                     Yes                Yes
Pruning Criterion                    Upper limit on         Stops rather than      Cost-complexity    Cost-complexity
                                     predicted error        overfit                pruning            pruning
Build Trees Interactively            No                     Yes                    Yes                Yes
Supports Boosting                    Yes                    No                     No                 No

(1) SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
(2) Continuous predictors are binned into ordinal variables containing, by default, approximately equal sized categories.

Note: C&R Tree and QUEST produce binary splits (two branch splits) when growing the tree, while C5.0 and CHAID can produce more than two subgroups when splitting occurs. However, if we had a predictor of type set with four categories, each of which was distinct in relation to the outcome field, C&R Tree and QUEST could perform successive binary splits on this field. This would produce a result equivalent to a multiple split at a single node, but requires additional tree levels. All methods can handle predictors and outcomes that are of type flag and set. CHAID and C&R Tree can use a continuous target or outcome field (of type range), while all but CHAID can use a continuous predictor (although see footnote 2). The trees that each method grows will not necessarily be identical because the methods use very different criteria for selecting a predictor. CHAID and QUEST use more standard statistical methods, while C5.0 and C&R Tree use non-statistical measures, as explained below.
Missing (blank) values are handled in three different ways. C&R Tree and QUEST use the substitute (surrogate) predictor field whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor field and passes a weighted portion of the case down each tree branch. CHAID uses all the missing values as an additional category in model building. Three of the four methods prune trees after growing them quite large, while CHAID instead stops before a tree gets too large. For all these reasons, you should not expect the four algorithms to produce identical trees for the same data. You should expect that important predictors would be included in trees built by any algorithm. Those interested in more detail concerning the algorithms should see the Clementine 11.1 Algorithms Guide. Also, you might consider C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993) by Ross Quinlan, which details the predecessor to C5.0; Classification and Regression Trees (Wadsworth, 1984) by Breiman, Friedman, Olshen, and Stone, who developed CART (Classification and Regression Tree) analysis; the article by Loh and Shih (1997, Split Selection Methods for Classification Trees, Statistica Sinica, 7: 815-840), which details the QUEST method; and, for a description of CHAID, The CHAID Approach to Segmentation Modeling: CHI-squared Automatic Interaction Detection, Chapter 4 in Richard Bagozzi, Advanced Methods of Marketing Research (Blackwell, 1994).
The name of the C5.0 node should immediately change to CHURNED.

Figure 3.1 C5.0 Modeling Node Added to Stream
The Model name option allows you to set the name for both the C5.0 and resulting C5.0 rule nodes. The form (decision tree or rule set, both will be discussed) of the resulting model is selected using the Output type: option.
The Use partitioned data option is checked so that the C5.0 node will make use of the Partition field created by the Partition node earlier in the stream. Whenever this option is checked, only the cases the Partition node assigned to the Training sample will be used to build the model; the rest of the cases will be held out for Testing and/or Validation purposes. If unchecked, the field will be ignored and the model will be trained on all the data. The Cross-validate option provides a way of validating the accuracy of C5.0 models when there are too few records in the data to permit a separate holdout sample. It does this by partitioning the data into N equal-sized subgroups and fitting N models. Each model uses (N-1) of the subgroups for training, then applies the resulting model to the remaining subgroup and records the accuracy. Accuracy figures are pooled over the N holdout subgroups and this summary statistic estimates model accuracy applied to new data. Since N models are fit, N-fold validation is more resource intensive; it reports the accuracy statistic, but does not present the N decision trees or rule sets. By default N, the number of folds, is set to 10. For a predictor field that has been defined as type set, C5.0 will normally form one branch per value in the set. However, by checking the Group symbolic values check box, the algorithm can be set so that it finds sensible groupings of the values within the field, thus reducing the number of rules. This is often desirable. For example, instead of having one rule per region of the country, grouping symbolic values may produce a rule such as:

Region in [South, Midwest]
Region in [Northeast, West]

Once trained, C5.0 builds one decision tree or rule set that can be used for predictions. However, it can also be instructed to build a number of alternative models for the same data by selecting the Boosting option.
Under this option, when it makes a prediction it consults each of the alternative models before making a decision. This can often provide more accurate prediction, but takes longer to train. Also the resulting model is a set of decision tree predictions and the outcome is determined by voting, which is not simple to interpret. The algorithm can be set to favor either Accuracy on the training data (the default) or Generality to other data. In our example, we favor a model that is expected to better generalize to other data and so we select Generality.
Click the Generality option button
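The vote-among-alternative-models idea behind boosted predictions can be sketched as follows. This is our own illustration, not C5.0's boosting algorithm; each "model" here is just a function returning a predicted category.

```python
# Majority-vote prediction over a set of alternative models (our own
# sketch of the voting step described in the text).
from collections import Counter

def boosted_predict(models, record):
    votes = [model(record) for model in models]    # consult every alternative model
    return Counter(votes).most_common(1)[0][0]     # majority category wins

# Three stand-in "models" that disagree on one record
models = [lambda r: "Current", lambda r: "Vol", lambda r: "Current"]
print(boosted_predict(models, record={"LOCAL": 6.2}))   # Current, by 2 votes to 1
```

This is why a boosted model is harder to interpret than a single tree: the final prediction emerges from a tally rather than from one readable rule path.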
C5.0 will automatically handle errors (noise) within the data and, if known, you can inform Clementine of the expected proportion of noisy or erroneous data. This option is rarely used. As with all of the modeling nodes, after selecting the Expert option or tab, more advanced settings are available. In this course, we will discuss the Expert options briefly. The reader is referred to the Clementine 11.1 Node Reference for more information on these settings.
Click the Expert option button
Figure 3.3 C5.0 Node Model Tab Expert Options
By default, C5.0 will produce splits if at least two of the resulting branches have at least two data records each. For large data sets you may want to increase this value to reduce the likelihood of rules that apply to very few records. To do so, increase the value in the Minimum records per child branch box.
Click the Simple Mode option button, and then click Execute
A C5.0 Rule model, labeled with the predicted field (CHURNED), will appear in the Models palette of the Manager.
Figure 3.4 Browsing the C5.0 Rule Node
The results are in the form of a decision tree and not all branches are visible. Only the beginning of the tree is shown. According to what we see of the tree so far, LOCAL is the first split in the tree. Further we see that if LOCAL <= 4.976 the Mode value for CHURNED is InVol and if LOCAL > 4.976 the Mode value is Current. The Mode lists the modal (most frequent) output value for the branch. The mode will be the predicted value, unless there are other fields that need to be taken into account within that branch to make a prediction. In this instance, no predictions of CHURNED are visible. To view the predictions we need to further unfold the tree. To unfold the branch LOCAL > 4.976, just click the expand button.
Click the expand button to unfold the branch LOCAL > 4.976
SEX is the next split field. Now we see that SEX is the best predictor for persons who spend more than 4.976 minutes on local calls. The Mode value for Males is Current and for Females is Vol. However, at this point we still cannot make any predictions for SEX because there is a symbol to the left of each value, which means that other fields need to be taken into account before we can make a prediction. Once again we can unfold each separate branch to see the rest of the tree, but we will take a shortcut:
Click the All button in the Toolbar
We can see several nodes, usually referred to as terminal nodes, that cannot be refined any further. In these instances, the mode is the prediction. For example, if we are interested in the Current Customer group, one group we would predict to remain customers consists of persons where Local > 4.976, Sex = M, and International <= 0.905. To get an idea about the number and percentage of records within such branches, we ask for more details.
Click Show or hide instance and confidence figures in the toolbar
Figure 3.7 Instance and Confidence Figures Displayed
The instances figure tells us that there are 256 persons who met those criteria. The confidence figure for this set of individuals is .956, which represents the proportion of records within this set correctly classified (predicted to be Current and actually being Current). If we were to score another dataset with this model, how would persons with the same characteristics be classified? Because Clementine assigns the group the modal category of the branch, everyone in the new data set who met the criteria defined by these rules would be predicted to remain Current Customers. If you would like to present the results to others, an alternative format is available that helps visualize the decision tree. The Viewer tab provides this alternative format.
Click the Viewer tab
Click the Decrease Zoom tool (to view more of the tree). (You may also need to expand the size of the window.)
Figure 3.8 Decision Tree in the Viewer Tab
The root of the tree shows the overall percentages and counts for the three categories of CHURNED. Furthermore, the modal category is shaded. The first split is on Local, as we have seen already in the text display of the tree. Similar to the text display, we can decide to expand or collapse branches. In the lower right corner of some nodes a minus (-) or plus (+) sign is displayed, referring to an expanded or collapsed branch, respectively. For example, to collapse the tree at node 2:
Click the minus (-) sign in the lower right corner of node 2 (shown in Figure 3.9)
In the Viewer tab, toolbar buttons are available for zooming in or out; showing frequency information as graphs and/or as tables; changing the orientation of the tree; and displaying an overall map of the tree in a smaller window (tree map window) that aids navigation in the Viewer tab. When it is not possible to view the whole tree at once, such as now, one of the more useful buttons in the toolbar is the Tree map button because it shows you the size of the tree. A red rectangle indicates the portion of the tree that is being displayed. You can then navigate to any portion of the tree you want by clicking on any node you desire in the Tree map window.
Click the plus (+) sign in the lower right corner of node 2
Click on the Treemap button in the toolbar
Enlarge the Treemap until you see the node numbers (shown in Figure 3.10)
Figure 3.10 Decision Tree in the Viewer Tab with a Tree Map
Note that the default Rule set name appends the letters RS to the output field name. You may specify whether you want the C5.0 Ruleset node to appear in the Stream Canvas (Canvas), the generated Models palette (GM palette), or both. You may also change the name of the rule set and lower limits on support (percentage of records having the particular values on the input fields) and confidence (accuracy) of the produced rules (percentage of records having the particular value for the output field, given values for the input fields).
Set Create node on: to GM Palette Click OK
Click File→Close to close the C5.0 Rule browser window Right-click the C5.0 Rule Set node named CHURNEDRS in the generated Models palette in the Manager, then click Browse
Figure 3.13 Browsing the C5.0 Generated Rule Set
Apart from some details, this window contains the same menus as the browser window for the C5.0 Rule node.
Click the All button to unfold the rules Click the Show or hide instance and confidence figures button
Figure 3.14 Fully Expanded C5.0 Generated Rule Set
For example, Rule #1 (Current) has this logic: if a person makes more than 4.976 minutes of local calls a month, is Male, and makes less than or equal to .905 minutes of international calls, then we would predict Current. This form of the rules allows you to focus on a particular conclusion rather than having to view the entire tree. If the Rule Set is added to the stream, a Settings tab becomes available that allows you to export the rule set in SQL format, which permits the rules to be applied directly to a database.
Click File→Close to close the Rule set browser window
Figure 3.15 Two New Fields Generated by the C5.0 Rule Node
Two new columns appear in the data table, $C-CHURNED and $CC-CHURNED. The first represents the predicted value for each record and the second the confidence value for the prediction.
Click File→Close to close the Table output window
First we will edit the Select node on the left that we will use to select the Training sample cases:
Double-click on the Select node on the left to edit it Click the Expression Builder button Move Partition from the Fields list box to the Expression Builder text box Click the = (equal sign) button
Click the Select from existing field values button and insert the value 1_Training Click OK, and then click OK again to close the dialog
Now we will edit the Select node on the right to select the Testing sample cases:
Double-click on the Select node on the right to edit it Click the Expression Builder button Move Partition from the Fields list box to the Expression Builder text box Click the = (equal sign) button
Click the Select from existing field values button and insert the value 2_Testing Click OK, and then click OK again to close the dialog
Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:
Place a Matrix node from the Output palette below the Select node Connect the Matrix node to the Select node Double-click the Matrix node to edit it Put CHURNED in the Rows: Put $C-CHURNED in the Columns:
For each actual churned category, the Percentage of row choice will display the percentage of records predicted into each of the outcome categories.
Execute each Matrix node
Figure 3.17 Matrix Output for the Training and Testing Samples
Looking at the Training sample results, the model correctly predicts about 79.8% of the Current category, 100% of the Involuntary Leavers, and 93.6% of the Voluntary Leavers. These results are far better than those found with a neural network for the Voluntary Leaver category (93.6% versus 82.0%), slightly better for Current Customers (79.8% versus 78.5%), and exactly the same for Involuntary Leavers (each model correctly predicted all of them). The results with the Testing sample compare favorably, which suggests that the model will perform well with new data.
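The Matrix node's "Percentage of row" display is simply each cell count divided by its row total. As a minimal Python sketch (the function name and the counts are invented here; the counts were chosen only so the results match the 79.8% and 93.6% figures quoted above, and are not the actual course data):

```python
def row_percentages(matrix):
    """Convert {actual: {predicted: count}} into row percentages,
    mirroring the Matrix node's 'Percentage of row' display."""
    result = {}
    for actual, predictions in matrix.items():
        total = sum(predictions.values())
        result[actual] = {pred: round(100.0 * n / total, 1)
                          for pred, n in predictions.items()}
    return result

# Hypothetical counts for illustration only.
counts = {
    "Current": {"Current": 399, "Vol": 101},  # 500 actual Current customers
    "Vol":     {"Current": 16,  "Vol": 234},  # 250 actual Voluntary Leavers
}
pcts = row_percentages(counts)  # pcts["Current"]["Current"] -> 79.8
```

Reading across a row then shows what fraction of each actual outcome category the model classified into each predicted category.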
Click File→Close to close the Matrix windows
Figure 3.18 Gains Chart of the Current Customer Group
Figure 3.19 Gains Chart for the Current Customer Group (Interaction Enabled)
The gains line ($C-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that the hits for the Current outcome are concentrated in the percentiles predicted most likely to contain current customers according to the model. Just under 75% of the hits are contained within the first 40 percentiles. The gains line in the chart using Testing data is very similar, which suggests that this model can reliably predict current customers with new data.
Click File→Close to close the Evaluation chart window
Figure 3.20 Specifying the Hit Condition within the Expression Builder
The condition (Vol as the target value) defining a hit was created using the Expression Builder.
Click OK
In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.
Click Execute Click Edit→Enable Interaction in the resulting gains chart (not shown)
Figure 3.22 Gains Chart for the Voluntary Leaver Category (Interaction Enabled)
The gains chart for the Voluntary Leaver category is better (steeper in the early percentiles) than that for the Current category. For example, the top 40 model-ordered percentiles in the Training data chart contain over 87% of the Voluntary Leavers, compared with 75.3% when we looked at Current Customers.
Click File→Close to close the Evaluation chart window
The simple options within the C5.0 node allow you to use Boosting, specify the Expected noise (%), and choose whether the resulting tree favors Accuracy or Generality. Noisy (contradictory) data contain records in which the same, or very similar, predictor values lead to different outcome values. While C5.0 handles noise automatically, if you have an estimate of it, the method can take this into account (see the section on Minimum Records and Pruning for more information on the effect of specifying a noise value). Expert mode allows you to fine-tune the rule induction process.
Double-click on the C5.0 node Click the Model tab Click the Expert Mode option button
Figure 3.23 Expert Options Available within the C5.0 Dialog (Model Tab)
When constructing a decision tree, the aim is to refine the data into subsets that are, or seem to be heading toward, single-class collections of records on the outcome field. That is, ideally the terminal nodes contain only one category of the output field. At each point of the tree, the algorithm could potentially partition the data based on any one of the input fields. To decide which is the best way to partition the data (that is, to find a compact decision tree that is consistent with the data), the algorithms construct some form of test that usually works on the basis of maximizing a local measure of progress.
GAIN(X) = INFO(DATA) − INFO_X(DATA)

Where INFO(DATA) represents the average information needed to identify the class (outcome category) of a record within the total data.
And INFO_X(DATA) represents the expected information requirement once the data has been partitioned according to the outcomes of the field currently being tested. The information theory that underpins the gain criterion can be summarized by the statement: the information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to the base 2 of that probability. So, if for example there are 8 equally probable messages, the information conveyed by any one of them is −log2(1/8), or 3 bits. For details on how to calculate these values, the reader is referred to Chapter 2 of C4.5: Programs for Machine Learning. Although the gain criterion gives good results, it has a flaw in that it favors partitions that have a large number of outcomes. Thus a symbolic predictor with many categories has an advantage over one with few categories. The gain ratio criterion, used in C5.0, rectifies this problem: the bias in the gain criterion is corrected by a kind of normalization in which the gain attributable to tests with many outcomes is adjusted. The gain ratio represents the proportion of the information generated by dividing the data in the parent node into the outcomes of field X that is useful, i.e., that appears helpful for classification.
GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X(DATA)
Where SPLIT INFO_X(DATA) represents the potential information generated by partitioning the data into n outcomes, whereas the information gain measures the information relevant to classification. The C5.0 algorithm will choose to partition the data based on the outcomes of the field that maximizes the information gain ratio. This maximization is subject to the constraint that the information gain must be large, or at least as great as the average gain over all tests examined. This constraint avoids the instability of the gain criterion when the split is near trivial and the split information is thus small. Two other parameters that the expert options allow you to control are the severity of pruning and the minimum number of records per child branch. In the following sections we will introduce each of these in turn and give advice on their settings.
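Since the manual states these formulas only in words, a short Python sketch may help make INFO, GAIN, and GAIN RATIO concrete. This is an illustrative re-implementation of the C4.5-style formulas, not Clementine's actual code, and the function names are invented here:

```python
import math

def info(labels):
    """INFO: average information (entropy, in bits) needed to identify
    the outcome category of a record in `labels`."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain_ratio(values, labels):
    """GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X for a symbolic predictor X."""
    n = len(values)
    info_x = 0.0      # expected information after partitioning on X
    split_info = 0.0  # potential information generated by the partition itself
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        w = len(subset) / n
        info_x += w * info(subset)
        split_info -= w * math.log2(w)
    gain = info(labels) - info_x
    return gain / split_info if split_info > 0 else 0.0

# Eight equally probable classes carry -log2(1/8) = 3 bits, as in the text.
assert abs(info(list("abcdefgh")) - 3.0) < 1e-9
```

Dividing the gain by the split information is what removes the advantage a many-valued predictor would otherwise enjoy under the plain gain criterion.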
A second phase of pruning (global pruning) is then applied by default. It prunes further based on the performance of the tree as a whole, rather than at the sub-tree level considered in the first stage of pruning. This option (Use global pruning) can be turned off, which generally results in a larger tree. After initially analyzing the data, the Winnow attributes option will discard some of the inputs to the model before building the decision tree. This can produce a model that uses fewer input fields yet maintains near the same accuracy, which can be an advantage in model deployment. This option can be especially effective when there are many inputs and where inputs are statistically related.
Once a tree has been built using the simple options, the expert options may be used to refine the tree in two common ways. If the resulting tree is large and has too many branches, increase the Pruning severity. If there is an estimate for the expected proportion of noise (relatively rare in practice), set the Minimum records per child branch to half of this value.
Boosting
C5.0 has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then, a second model is built in such a way that it focuses especially on the records that were misclassified by the first model. Then a third model is built to focus on the second model's errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model. While boosting might appear to offer something for nothing, there is a price. When model building is complete, more than one tree is used to make predictions. Therefore, there is no
simple description of the resulting model, nor of how a single predictor affects the outcome field. This can be a serious deficiency, so boosting is normally used when the chief goal of an analysis is predictive accuracy, not understanding.
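The sequence "build a model, focus the next model on its errors, combine by weighted vote" can be sketched in a few lines of Python. This is a schematic, AdaBoost-style illustration, not C5.0's actual boosting procedure, and all names (`boost`, `build_model`) are invented for the example:

```python
import math

def boost(X, y, build_model, trials=10):
    """Schematic boosting loop: each model is fit to weighted data,
    misclassified records are up-weighted for the next model, and the
    final classification is a weighted vote across all models."""
    n = len(y)
    weights = [1.0 / n] * n
    models = []
    for _ in range(trials):
        model = build_model(X, y, weights)            # fit to weighted data
        errs = [model(x) != t for x, t in zip(X, y)]
        err = sum(w for w, e in zip(weights, errs) if e)
        if err >= 0.5:                                # no better than chance: stop
            break
        if err == 0.0:                                # perfect model: keep it, stop
            models.append((1.0, model))
            break
        alpha = 0.5 * math.log((1 - err) / err)       # this model's voting weight
        models.append((alpha, model))
        # Up-weight the records this model got wrong so the next model
        # concentrates on them, then renormalize the weights.
        weights = [w * math.exp(alpha if e else -alpha)
                   for w, e in zip(weights, errs)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):                                   # weighted voting
        votes = {}
        for alpha, model in models:
            label = model(x)
            votes[label] = votes.get(label, 0.0) + alpha
        return max(votes, key=votes.get)
    return predict
```

The Number of trials option corresponds to `trials` here; the loss of a single interpretable tree is visible in the fact that `predict` consults every stored model.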
Misclassification Costs
The Costs tab allows you to set misclassification costs. When using a tree to predict a symbolic output, you may wish to assign costs to misclassifications (where the tree predicts incorrectly) to bias the model away from expensive mistakes. The Misclassifying controls allow you to specify the cost attached to each possible misclassification. The default costs are set at 1.0 to represent that each misclassification is equally costly. When unequal misclassification costs are specified, the resulting trees tend to make fewer expensive misclassifications, usually at the cost of an increased number of the relatively inexpensive misclassifications.
There are two available methods, CHAID and Exhaustive CHAID. The latter is a modification of CHAID designed to address some of its weaknesses. Exhaustive CHAID examines more possible splits for a predictor, thus improving the chances of finding the best predictor (at the cost of additional processing time).
Figure 3.24 CHAID Node Dialog (Model Tab)
CHAID allows only one other change in the Model tab, the maximum tree depth (the number of levels the tree can grow). Since CHAID doesn't prune, a bushy tree can result, so the user can specify the depth with the Levels below root setting, which is 5 initially. This setting should depend on the size of the data file, the number of predictors, and the complexity of the desired tree. You can set one of two modes: Generate model builds the model, while Launch interactive session launches the Interactive Tree feature, which we will discuss in a later section.
Click the Expert tab, and then click the Expert Mode option button
The Expert mode options are shown in Figure 3.25. To select the predictor for a split, CHAID uses a chi-square test in the table defined at each node by a predictor and the outcome field. CHAID chooses the predictor that is the most significant (smallest p value). If that predictor has more than 2 categories, CHAID compares them and collapses together those categories that show no differences in the outcome. This category merging process stops when all remaining categories differ at the specified testing level (Alpha for Merging:). It is possible for CHAID to split merged categories, controlled by the Allow splitting of merged categories check box. (Note that a categorical predictor with more than 127 discrete categories will be ignored by CHAID.) For continuous predictors, the values are binned into a maximum of 10 groups, and then the same tabular procedure is followed as for flag and set types.
Because many chi-square tests are performed, CHAID automatically adjusts its significance values when testing the predictors. These are called Bonferroni adjustments and are based on the number of tests. You should normally leave this option turned on; in small samples, or with only a few predictors, you could turn it off to increase the power of your analysis.
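The effect of a Bonferroni adjustment on predictor selection can be illustrated with a simplified sketch. Here the raw p-values and the per-predictor test counts are invented inputs (in CHAID the multiplier actually depends on the number of ways a predictor's categories can be merged, which this sketch only approximates):

```python
def choose_predictor(raw_p_values, n_component_tests):
    """Bonferroni-adjust each predictor's chi-square p-value by the
    number of component tests performed for it, then pick the predictor
    with the smallest adjusted p-value."""
    adjusted = {name: min(1.0, p * n_component_tests[name])
                for name, p in raw_p_values.items()}
    best = min(adjusted, key=adjusted.get)
    return best, adjusted

# A many-category predictor pays a larger penalty than a binary one,
# offsetting the head start it gets from having more ways to split.
best, adjusted = choose_predictor({"LOCAL": 0.002, "SEX": 0.004},
                                  {"LOCAL": 9, "SEX": 1})
```

In this invented example LOCAL has the smaller raw p-value, but after adjustment the binary predictor SEX wins, which is exactly the bias-correction the adjustment is meant to provide.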
Click the Stopping button
The Stopping options control stopping rules for growth based on node size. These can be specified either as an absolute number of records or as a percentage of the total number of records. By default, a parent branch to be split must contain at least 2% of the records; a child branch must contain at least 1%. It is often more convenient to work with the absolute number of records rather than a percent, but in either case, you will very likely modify these values to get a smaller, or larger, tree.

Figure 3.26 CHAID Stopping Criteria
Unlike other models, CHAID uses missing, or blank, values when growing a tree. All blank values are placed in a missing category that is treated like any other category for nominal predictors. For ordinal and continuous predictors, the process of handling blanks is a bit different, but the effect is the same (see the Clementine 11.1 Algorithms Guide for detailed information). If you don't want to include blank data in a model, it should be removed beforehand.
The only Simple mode model option available concerns the Maximum tree depth, as with CHAID. By default this value is 7, so C&R Tree will grow a deeper tree than CHAID, all things being equal (which they really aren't, given the different methods for predictor selection). It is also possible, as with CHAID, to grow a tree interactively. Since pruning is performed with this method, and other stopping rules may be triggered, the actual tree depth may be less than the maximum specified.

Figure 3.27 Classification and Regression Trees (C&R Tree) Dialog
Click the Expert tab Click the Expert Mode option button
Figure 3.28 Expert Options for Classification and Regression Trees
Impurity Criterion
The criterion that guides tree growth in C&R Tree with a symbolic output field is called impurity. It captures the degree to which responses within a node are concentrated into a single output category. A pure node is one in which all cases fall into a single output category, while a node with the maximum impurity value would have the same number of cases in each output category. Impurity can be defined in a number of ways and two alternatives are available within the C&R Tree procedure. The default, and more popular measure, is the Gini measure of dispersion. If P(t)i is the proportion of cases in node t that are in output category i, then the Gini measure is:
Gini(t) = 1 − Σ_i P(t)_i²

or, equivalently,

Gini(t) = Σ_{i≠j} P(t)_i P(t)_j
If two nodes have different distributions across three response categories (for example, (1,0,0) and (1/3, 1/3, 1/3)), the one with the greater concentration of responses in a single category (the first one) will have the lower impurity value: for (1,0,0) the impurity is 1 − (1² + 0² + 0²), or 0; for (1/3, 1/3, 1/3) the impurity is 1 − ((1/3)² + (1/3)² + (1/3)²), or .667. The Gini measure ranges between 0 and 1, although the maximum value is a function of the number of output categories.

Thus far we have defined impurity for a single node. It can be defined for a tree as the weighted average of the impurity values from the terminal nodes. When a node is split into two child nodes, the impurity for that branch is simply the weighted average of their impurities. Thus if two child nodes resulting from a split have the same number of cases and their individual impurities are .4 and .6, their combined impurity is .5*.4 + .5*.6, or .5.

When growing the tree, C&R Tree splits a node on the predictor that produces the greatest reduction in impurity (comparing the impurity of the parent node to the combined impurity of the child nodes). This change in impurity from a parent node to its child nodes is called the improvement, and under Expert options you can specify the minimum change in impurity for tree growth to continue. The default value is .0001, and if you are considering modifying this value, you might calculate the impurity at the root node (from the overall output proportions) to establish a point of reference.

The problems with using impurity as a criterion for tree growth are that you can almost always reduce impurity by enlarging the tree, and any tree will have 0 impurity if it is grown large enough (if every node has a single case, impurity is 0).
To address these difficulties, the developers of the classification and regression tree methodology (see Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984) developed a pruning method based on a cross-validated cost complexity measure (as discussed above). By default, the Gini measure of dispersion is used. Breiman and colleagues proposed Twoing as an alternative impurity measure. If the target has more than two output categories, Twoing will create binary splits of the response categories in order to calculate impurity. Each possible combination of output categories split into two groups will be separately evaluated for impurity with each predictor variable, and the best split across predictors and target category combinations is chosen. Ordered Twoing (inactive because the output field is type set, not ordered set) applies Twoing as described above, except that the output category combinations are limited to those consistent with the rank order of the categories. For example, if there are five output categories numbered 1, 2, 3, 4, and 5, Ordered Twoing would examine the (1,2) (3,4,5) split, but the (1,4) (2,3,5) split would not be considered, since only contiguous categories can be grouped together. Of the methods, the Gini measure is most commonly used.
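The impurity and improvement arithmetic described above is small enough to check directly. A minimal Python sketch (function names invented here; this mirrors the Gini formula and the weighted-average rule from the text, not Clementine's implementation):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2), computed from a node's
    per-output-category record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def improvement(parent, children):
    """Reduction in impurity from splitting `parent` into `children`;
    each argument is a list of output-category counts, and the children's
    impurities are weighted by their share of the parent's records."""
    n = sum(parent)
    combined = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - combined

# The text's examples: (1,0,0) has impurity 0; (1/3,1/3,1/3) has .667.
assert gini([1, 0, 0]) == 0.0
assert abs(gini([1, 1, 1]) - 2.0 / 3.0) < 1e-9
```

Running `gini` on the root node's category counts gives the reference point suggested above for judging whether the default minimum improvement of .0001 is sensible for your data.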
Surrogates
Surrogates are used to deal with missing values on the predictors. For each split in the tree, C&R Tree identifies the input fields (the surrogates) that are most similar statistically to the selected split field. When a record to be classified has a missing value for a split field, its value on a surrogate field can be used to make the split.
The Maximum surrogates option controls how many surrogate predictor fields will be stored at each node. Retaining more surrogates slows processing, and the default (5) is usually adequate.
Priors in C&RT
Historically, priors have been used to incorporate knowledge about the base population rates (here of the output field categories) into the analysis. Breiman et al. (1984) point out that if one target category has twice the prior probability of occurring than another, it effectively doubles the cost of misclassifying a case from the first category, since it is counted twice. Thus by specifying a larger prior probability for a response category, you can effectively increase the cost of its misclassification. Since priors are given only at the level of the base rates for the output field categories (with J categories there are J prior probabilities), their use implies that the misclassification of a record actually in output category j has the same cost regardless of the category into which it is misclassified (that is, C(k|j) = C(j) for all k not equal to j).
Click the Priors button
By default, the prior probabilities are set to match the probabilities found in the training data. The Equal for all classes option allows you to set all priors equal (which might be used if you know your sample does not represent the population and you don't know the population distribution on the outcome), and you can enter prior probabilities yourself (Custom option). The prior probabilities should sum to 1, and if you enter custom priors that reflect the desired proportions but do not sum to 1, the Normalize button will adjust them. Finally, priors can be adjusted based on misclassification costs (see Breiman's comment above) entered in the Costs tab.

Figure 3.29 C&RT Expert Options: Priors
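The two adjustments just described are simple arithmetic. A sketch under stated assumptions (function names invented; `cost_adjusted` illustrates the Breiman-style idea of scaling priors by per-category misclassification cost, not Clementine's exact formula):

```python
def normalize(priors):
    """Rescale custom priors so they sum to 1, as the Normalize button does."""
    total = sum(priors.values())
    return {category: p / total for category, p in priors.items()}

def cost_adjusted(priors, cost):
    """Scale each category's prior by the cost of misclassifying that
    category, then renormalize (the altered-priors idea of Breiman et al.)."""
    return normalize({category: priors[category] * cost[category]
                      for category in priors})
```

For example, entering custom priors of 2 and 2 normalizes to .5 and .5, and doubling the misclassification cost of one of two equally likely categories shifts its effective prior to 2/3.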
As we briefly discussed in the context of C5.0, unequal misclassification costs can be specified for the outcome categories and will be taken into account during tree creation. By default, all misclassification costs are set equal (to 1).

Figure 3.30 Misclassification Costs
QUEST (Quick Unbiased Efficient Statistical Tree) is a binary classification method that was developed, in part, to reduce the processing time required for large C&R Tree analyses with many fields and/or records. It also tries to reduce the tendency in decision tree methods to favor predictors that allow more splits (see Loh and Shih, 1997).
Figure 3.31 QUEST Model Tab Options
Like CHAID and C&R Tree, QUEST allows only the specification of maximum tree depth in Simple mode. A model can be built or an Interactive Tree session launched.
Click the Expert tab Click the Expert mode option button
QUEST separates the tasks of predictor selection and splitting at a node. Like CHAID, it uses statistical tests to pick a predictor at a node. For each continuous or ordinal predictor variable, QUEST performs an analysis of variance, and then uses the significance of the F test as a criterion. For nominal predictors (of type flag and set), chi-square tests are performed. The predictor with the smallest significance value from either the F or chi-square test is selected. Although not evident from the dialog box options, Bonferroni adjustments are made, as with CHAID (not under user control). QUEST is more efficient than C&R Tree because not all splits are examined, and category combinations are not tested when evaluating a predictor for selection.
After selecting a predictor, QUEST determines how the field should be split (into two groups) by doing a quadratic discriminant analysis, using the selected predictor on groups formed by the target categories. The details are rather complex and can be found in Loh and Shih (1997). The type (nominal, continuous) of the predictor will determine how it is treated in this method. While quadratic discriminant analysis allows for unequal variances in the groups and makes one fewer assumption than does linear discriminant analysis, it does assume that the distribution of the data is multivariate normal, which is unlikely for predictors that are flags and sets. QUEST uses an alpha (significance) value of .05 for splitting in the discriminant analysis, and you can modify this setting. For large files you may wish to reduce alpha to .01, for example.
All of the tree-building nodes except C5.0 support interactive trees. One caution is that all ordered sets used in the model must have numeric storage to be used in the Tree Builder. We'll examine tree building by using CHAID to predict the CHURNED field in the current stream.
Click Cancel to close the QUEST node Double-click the CHAID node to edit it Click the Model tab Click the Launch interactive session option button
When Launch interactive session is selected, the Use tree directives check box becomes active. Tree directives are used to save the tree in the Model Builder. The next time the model-building node is executed, the current tree, including any custom splits you define, will automatically be regrown if you specify the saved directives file under the Directives button.

Figure 3.33 CHAID Model Tab with Interactive Tree Method
The tree opens in the Viewer tab (see Figure 3.34), displaying the root node, with the distribution of the outcome field (here CHURNED). You can grow the tree level-by-level, or all at once, or variations on this. You can also request information on the tree.
Figure 3.34 Interactive Tree Builder to Predict CHURNED Field with CHAID
Because there has been no tree growth yet, some of the options are inactive. From here, we can grow the full tree, grow the tree only one level, or grow just the selected branch (the latter two are equivalent here). We can also specify a custom split if there is interest in a particular predictor.
Let's grow the tree one level.
Click on Grow Tree One Level
CHAID selects the field LONGDIST (long distance minutes) as the best predictor. This field is of type range, so it was binned before being used in the model. CHAID finds the statistically best split is into three groups, as shown in Figure 3.36. Notice that all the involuntary leavers are in the first node, with a value of 0 on LONGDIST.

Figure 3.36 Interactive Tree Grown One Level with LONGDIST as Best Predictor
Although we are hardly finished with building the tree, we can learn how accurate the current tree is at any time.
Click the Risks tab
The Risks tab displays the error of the current tree in predicting CHURNED. The Risk estimate for the Training data (0.370) is the proportion of error, so the model is (1 − .370)*100 = 63.0% accurate in its predictions. This isn't too bad for a model with only one predictor. The error rate for the Testing data (0.381) is nearly the same, which is a good sign that the model will work well with unseen data. One concern is that the model incorrectly predicts that most Voluntary Leavers (217 out of 267) will remain Current Customers. This suggests that the model will have to be refined later on, perhaps by balancing the three groups, so that it does a better job of finding the people who are likely to leave the company.

Figure 3.37 Risk Estimates for Current CHAID Model
Click the Viewer tab Right-click on Node 2 and select Grow Branch with Custom Split from the Context menu
We can control how the tree will split at this node. In Figure 3.38, we can see that if we grow this node automatically, SEX will be used to grow the branch.
Figure 3.38 Define Split Dialog for Node 2
Figure 3.39 Potential Fields For Splitting Node 2
Four fields meet the standard statistical criterion of .05 probability (adjusted by the Bonferroni correction). Since International is also highly significant, let's select that field instead.
Click International Click OK Click Grow
Figure 3.40 shows the resulting split. The split is at 0 minutes of international calls, into 2 child nodes. Customers who don't make international calls are more likely to be current customers. Those who do make international calls are more likely to be voluntary churners, so we can finally make a prediction for this group. Note, of course, that this branch applies only to customers who make between 0.00 and 26.055 minutes of long distance calls. The red symbol under Node 2 indicates that the tree was not grown automatically at that node but that the user instead defined the split.
Figure 3.40 Tree Grown One Level From Node 2 With User-Selected Predictor
Instead of examining these results any further, let's remove the International branch and allow the model to use the best predictor, which was SEX, as we saw in Figure 3.38.
Right-click on Node 2 and select Remove One Level from the context menu Right-click on Node 2 and select Grow Tree One Level from the context menu
Figure 3.41 Tree Grown One Level From Node 2 With the Most Significant Predictor
There are other options available. The tree display can be modified by displaying graphs instead of tabular statistics, or both, or by hiding certain branches. You can zoom in or out, or display a Tree map window to help navigate large trees. On the Gains tab, various statistics and charts are available that can help you readily assess the effectiveness of the current tree (the graphs are those produced by the Evaluation node). One other very useful feature of the Gains table is that it helps you identify which segments contain a particular class of customers, for example Voluntary Leavers, even when the model predicts them into the wrong category.
Click on the Gains tab Select Vol from the menu to the right of Target Category
Each row in the Gains table represents a terminal node in the tree. The Node: n column is the total number of customers in each node; the Node (%) column is the node size as a percent of the entire sample; the Gain column is a count of Voluntary Leavers within each node; the Gain (%) column is the percent of all the Voluntary Leavers in the entire sample that fall within each node; the Response (%) column is the percent of people in the node who are Voluntary Leavers; and the Index (%) column indicates the relative probability of finding a Voluntary Leaver within the node versus randomly choosing cases from the entire sample.

The Index (%) value for Node 6 is derived by dividing 63.86 (Response %) by 37.135 (the percent of Voluntary Leavers in the entire sample) and multiplying by 100. If the index percentage is greater than 100%, you have a better chance of finding cases with the desired target category in that node. Within the Gains table we can now see that 68.16% of all Voluntary Leavers come from Node 6. From looking at the tree, these are the people who are female and make between 0 and 26.055 minutes of long distance calling per month.

Interactive trees are not a model, but instead are a form of output, like a table or graph. When you are satisfied with the tree you have built, you can generate a model to be used in the stream to make predictions.
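The Gains-table columns are all ratios of four counts, so one row can be recomputed by hand. A minimal sketch (the function name and the example counts are invented; only the column definitions come from the text):

```python
def gains_row(node_hits, node_size, total_hits, total_records):
    """Recompute one Gains-table row for a chosen target category.
    node_hits: target-category records in the node; node_size: records in
    the node; total_hits/total_records: the same counts for the sample."""
    response_pct = 100.0 * node_hits / node_size        # Response (%)
    sample_pct = 100.0 * total_hits / total_records     # target base rate
    return {
        "Node %": 100.0 * node_size / total_records,
        "Gain %": 100.0 * node_hits / total_hits,
        "Response %": response_pct,
        "Index %": 100.0 * response_pct / sample_pct,   # >100 means enriched
    }

# Hypothetical node: 50 of its 100 records are hits, in a sample where
# 200 of 1000 records are hits; its Index % is therefore 250.
row = gains_row(50, 100, 200, 1000)
```

The Node 6 figure quoted above follows the same arithmetic: Response % divided by the sample-wide target percentage, times 100.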
Click Generate→Generate Model Click OK in the resulting dialog box (not shown) Close the Tree Builder window
A generated CHAID model appears in the upper left corner of the Stream Canvas. It can be edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.
Figure 3.43 Clementine Stream with C&R Tree Model Node (Numeric Output Field)
The data file consists of patient admissions to a hospital. The goal is to build a model predicting insurance claim amount based on hospital length of stay, severity of illness group and patient age.
Double-click on the C&R Tree node (labeled CLAIM)
The Model tab for the C&R Tree dialog with a numeric output is the same as for a symbolic output. However, if we explore the Expert options we find that some expert settings (priors and misclassification costs) that are not relevant for a continuous output field are inactive when Expert mode is chosen. Otherwise, setting up the model-building parameters and executing the tree is identical to the process for a categorical outcome. The generated model will display the predicted mean of the insurance claim amount in each terminal node.
Figure 3.44 C&R Tree Dialog for a Numeric Output Field
Click the Launch interactive session button
Click the Expert tab
Click the Expert Mode option button
Click Prune Tree
By selecting Prune Tree, pruning will be performed to produce a more compact tree.
Figure 3.46 Interactive Tree Builder to Predict Claim with C&R Tree
Because the target field is numeric, different statistics appear in the nodes. Now the nodes display the mean, number of cases, and percentage of the sample. Thus, the mean insurance claim for persons in this data file is slightly over $4631, which we would predict for each person if we didn't know how long they stayed in the hospital, their age, or how severely ill they were. Once the tree is grown, we should get some insight into the characteristics of patients that separate high insurance claims from low ones.
Right-click on Node 0 and select Grow Tree and Prune
Figure 3.47 Interactive Tree Builder Fully Grown and Pruned Tree
The results indicate that Length of Stay is the best predictor. The average insurance claim for persons who stay more than 2.5 days in the hospital is $5742.50, while the mean claim for persons who spent less time in the hospital is only $4276.62. The predictions are further refined as we work our way down the tree. For example, the average claim for individuals who were in the hospital more than 3.5 days (Node 6) was $8032.73, and for persons who spent less than or equal to 2.5 days in the hospital and were older than 28.5, the average prediction was $4108.60. While this tree is perfectly valid, some of the results don't make intuitive sense. For example, it may not be possible to be charged for 2.5 days; most hospitals charge for an entire day even if you stay only a fraction of one. Similarly, the Severity of Illness field (ASG) has only 3 values, yet we are treating it as if it were numeric. An examination of the Type node will show how each field is currently typed.
Double-click on the Type Node in the Stream Canvas
Figure 3.48 Type Node Original Field Types
As we can see, all the fields have been typed as numeric. In order to have the model treat each day in the hospital as a whole day, and to ensure that the Severity of Illness groups remain discrete, we need to reinstantiate the LOS and ASG fields so they are typed as SET.
Hold down the Control key and click on the ASG and LOS fields to select them
Right-click the Type box for either of these two variables
Click Set Type→Discrete on the context menu
Click the Read Values button
Click OK
Figure 3.49 Type Node after ASG and LOS have been Reinstantiated to SET
While Length of Stay is still the best predictor, number of days is now displayed as a whole number. The average claim for persons who stayed 1 or 2 days is $4276.62 and for those who stayed 3, 4 or 6 days, the average claim is $5742.50. Also, the values for ASG are now treated as discrete.
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
churn.txt contains information from a telecommunications company. The data consist of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization and which will leave for another company. The file contains the following fields:
ID                    Customer reference number
LONGDIST              Time spent on long distance calls per month
International         Time spent on international calls per month
LOCAL                 Time spent on local calls per month
DROPPED               Number of dropped calls
PAY_MTHD              Payment method of the monthly telephone bill
LocalBillType         Tariff for locally based calls
LongDistanceBillType  Tariff for long distance calls
AGE                   Age
SEX                   Gender
STATUS                Marital status
CHILDREN              Number of children
Est_Income            Estimated income
Car_Owner             Car owner
CHURNED               (3 categories) Current: still with company; Vol: leavers the company wants to keep; Invol: leavers the company doesn't want
In this session we will explore the various training methods and options for the rule induction techniques within Clementine.
1. Begin a new stream with a Var. File node connected to the file Churn.txt.
2. Use C5.0 and at least one other decision tree method to predict CHURNED and compare the accuracy of both. What do you learn from this? Which rule method performs best?
3. Now browse the rules that have been generated by the methods. Which model appears to be the most manageable and/or practical? Do you think there is a trade-off between accuracy and manageability?
4. Does the Balance node have much of an effect with the rule induction techniques?
5. Try switching from Accuracy to Generality in C5.0. Does this have much effect on the size and accuracy of the tree?
6. Experiment with the expert options within the methods you selected to see how they affect tree growth. Can you increase the accuracy without making the model overly complicated? Experiment with the minimum values for parent and child nodes and see how this influences the size of the tree.
7. In C5.0, use the Winnow attributes expert option and see if it reduces the number of inputs used in the model (hint: for an easier comparison, generate a Filter node from the generated C5.0 node with Winnow attributes checked and with it unchecked).
8. Of all the models you have run, which do you think is the best? Why?
9. For those with extra time: Use C5.0 or other decision tree methods to predict Response to campaign from the charity.sav data used in the exercises for Chapter 2. How do the rule induction models compare with the neural network models built in the last chapter? Which are the most accurate? Which are the easiest to understand?
You may wish to save the stream (use the name Exer3.str) that you have just created.
4.1 Introduction
Linear regression is a method familiar to just about everyone these days. It is the classic general linear model (GLM) technique, used to predict an outcome variable that is interval or ratio in scale from predictors that are also interval or ratio. In addition, categorical input fields can be included by creating dummy variables (fields). A Regression model node that performs linear regression is available in Clementine. Linear regression assumes that the data can be modeled with a linear relationship. To illustrate, the figure below contains a scatterplot depicting the relationship between the length of stay for hospital patients and the dollar amount claimed for insurance. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual in that there are only a few values for length of stay, which is recorded in whole days, and few patients stayed more than three days.
Figure 4.1 Scatterplot of Hospital Length of Stay and Insurance Claim Amount
Although there is a lot of spread around the regression line and a few outliers, it is clear that there is a positive trend in the data such that longer stays are associated with greater insurance claims. Of course, linear regression is normally used with several predictors; this makes it impossible to
Linear Regression 4 - 1
display the complete solution with all predictors in convenient graphical form. Thus, most users of linear regression use the numeric output.
in the prediction equation. While we are limited to the number of dimensions we can view in a single plot, the regression equation allows for many input fields. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent fields in predicting the output field.
Assumptions
Regression is usually performed on data for which the input and output fields are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). Clementine can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the independent variable(s). A field coded as a dichotomy (say 0 and 1) can technically be considered as an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a field's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous or flag fields (for example gender) can be used as predictor variables in regression. This also permits the use of categorical predictor fields if they are converted into a series of flag fields, each coded 0 or 1; this technique is called dummy coding. Note that the Regression node in Clementine will only accept numeric inputs or ordered sets that contain numeric values (they will then be treated as numeric). Thus if you have symbolic inputs, you must convert them to numeric dummy fields (using the SetToFlag node to create 0,1 dummy-coded fields, followed by a Type node to set the type of these fields to range), before they can be used by the Regression node.
hospital stay (LOS) and a severity of illness category (ASG). This last field is based on several health measures and higher scores indicate greater severity of the illness. The plan is to build a regression model that predicts the total claim amount for a patient on the basis of length of stay, severity of illness and patient age. Assuming the model fits, we are then interested in those patients that the model predicts poorly. Such cases can simply be instances of poor model fit, or the result of predictors not included in the model, but they also might be due to errors on the claims form or fraudulent entries. Thus we are approaching the problem of fraud detection by identifying exceptions to the prediction model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the model, they may be more likely to be fraudulent or contain errors. Some organizations perform random audits on claims applications and then classify them as fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to correctly classify new claims; logistic, discriminant, rule induction and neural networks have been used for this purpose. However, when such an outcome field is not available, fraud detection then involves searching for and identifying exceptional instances. Here, an exceptional instance is one that the model predicts poorly. We are using regression to build the model; if there were reason to believe the model were more complex (for example, contained nonlinear relations or complex interactions), then a neural network or rule induction model could be applied. We begin by opening the stream.
Click File→Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on LinearRegress.str
Double-click on the Type node
We will develop a regression equation predicting claims amount based on hospital length of stay, severity of illness group and age. Note that the severity of illness field (ASG) is of type range,
although it has only the three integer values 0, 1, and 2. We will leave it as range since these values fall on an ordered scale (higher values indicate greater severity). If we wished to treat it as a symbolic field, we would create dummy (0,1) fields using the SetToFlag node and use all but one (for g groups, there are g-1 non-redundant dummy fields) of the dummy fields (declared as type range) as inputs to the Regression node.
Close the Type node dialog
Double-click on the Regression node (named CLAIM)
Simple options include whether a constant (intercept) will be used in the equation and the Method of input field selection. By default (Enter), all inputs will be included in the linear regression equation. With such a small number of predictor fields, we will simply add them all into the model together. However, in the common situation of many input fields (most insurance claim forms would contain far more information) a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here perhaps a health administrator). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (Stepwise method). In the stepwise method, the best input field (according to a statistical criterion) is entered into the prediction equation. Then the next best input field is entered, and so on, until a point is reached when no further input fields meet the criterion. The stepwise method includes a check to ensure that the fields entered into the equation before the current step still meet the statistical criterion when the additional inputs are added. Variations on the stepwise method (Forward: inputs are added one by one, as described above, but are never removed; Backward: all inputs are entered, then the least significant input is removed, and this process is repeated until only statistically significant inputs remain) are available as well.
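The forward variant of this logic can be sketched in code. This is a simplified illustration, not Clementine's implementation: a minimum r-square gain stands in for the F-test entry criterion, and the removal check is omitted:

```python
# Forward selection sketch: repeatedly add the candidate input that
# improves r-square the most, stopping when no candidate helps enough.
import numpy as np

def r_squared(X, y):
    """R-squared of an intercept-plus-inputs least-squares fit."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.01):
    chosen, best = [], 0.0
    while True:
        gains = {j: r_squared(X[:, chosen + [j]], y) - best
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            return chosen
        j, g = max(gains.items(), key=lambda kv: kv[1])
        if g < min_gain:
            return chosen
        chosen.append(j)
        best += g

# Hypothetical data: claim depends on the first two columns only.
los = np.arange(40.0)
X = np.column_stack([los, los % 3, np.cos(los)])
print(forward_select(X, 10.0 * los + 50.0 * (los % 3)))  # noise column is skipped
```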
Click the Fields tab
The weighted least squares option (Use weight field check box) supports a form of regression in which the variability of the output field is different for different values of the input fields; an adjustment can be made for this if an input field is related to this degree of variation. In practice, this option is rarely used. We also see here the option to specify a partition field, used when such a field exists but does not have the default name Partition.
Click the Expert tab
Click the Expert Mode option button
Figure 4.5 Expert Options (Missing Values and Tolerance)
By default, the Regression node will only use records with valid values on the input and output fields (this is often called listwise deletion). This option can be turned off, in which case Clementine will attempt to use as much information as possible to estimate the Regression model, including records where some of the fields have missing values. It does this through a method called pairwise deletion of missing values. However, we recommend against using this option unless you are a very experienced user of regression; using incomplete records in this manner can lead to computational problems in estimating the regression equation. Instead, if there is a large amount of missing data, you may wish to substitute valid values for the missing data before using the Regression node. The Singularity tolerance will not allow an input field in the model unless at least .0001 (.01%) of its variation is independent of the other predictors. This prevents the linear regression model estimation from failing due to multicollinearity (linear redundancy in the predictors). Most analysts would recommend increasing the default tolerance value to at least .05, though.
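The tolerance statistic itself is easy to compute directly. A sketch under the usual definition (an input's tolerance is 1 minus the r-square from regressing it on the other inputs), with hypothetical data:

```python
# Tolerance check: values near 0 signal that an input is nearly a
# linear combination of the other inputs (multicollinearity).
import numpy as np

def tolerance(X, j):
    """1 - R^2 from regressing column j of X on the remaining columns."""
    others = np.delete(X, j, axis=1)
    X1 = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(X1, X[:, j], rcond=None)
    resid = X[:, j] - X1 @ beta
    tss = ((X[:, j] - X[:, j].mean()) ** 2).sum()
    return (resid @ resid) / tss

x1 = np.arange(10.0)
X = np.column_stack([x1, 2 * x1 + 0.001,
                     np.random.default_rng(1).normal(size=10)])
print(tolerance(X, 1))  # essentially 0: column 1 is redundant with column 0
```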
Click the Model tab, and then click Stepwise on the Method drop-down list
Click the Expert tab, and then click the Stepping button
You control the criteria used for input field entry into and removal from the model. By default, an input field must be statistically significant at the .05 level for entry and will be dropped from the model if its significance value rises above .1.
Click Cancel
Click the Output button
These options control how much supplementary information concerning the regression analysis is displayed. The results will appear in the Advanced tab of the generated model node in HTML format. Confidence bands (95%) for the estimated regression coefficients can be requested (Confidence interval). Summaries concerning relationships among the inputs can be obtained by requesting their Covariance matrix or Collinearity diagnostics. The latter are especially useful when you need to identify the source of, and assess the level of, redundancy in the predictors. Part and partial correlations measure the relationship between an input and the output field, controlling for the other inputs. Descriptive statistics (Descriptives) include means, standard deviations, and correlations; these summaries can also be obtained from the Statistics or Data Audit node. The Durbin-Watson statistic can be used when running regression on time series data and evaluates the degree to which adjacent residuals are correlated (regression assumes residuals are uncorrelated).
Click Cancel
Click the Simple option button
Click the Model tab, then click Enter on the Method drop-down list
Click the Execute button
Add the Regression generated model node (named CLAIM) to the Stream canvas
Connect the Type node to the Regression generated model node
Edit the Regression generated model node
Click the Summary tab, and then expand the Analysis summary
Figure 4.8 Linear Regression Browser Window (Analysis Summary)
This Analysis summary contains only the equation relating the predictor fields to the output. We could interpret the coefficients here, but since we don't know whether they are statistically significant or not, we will postpone this until we examine additional information in the Advanced tab. To reach the more detailed results:
Click the Advanced tab
Increase the size of the window to see more of the output
The advanced output is formatted in HTML. After listing the dependent (output) and independent (input) fields, Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several input fields (our situation) then the multiple R represents the unsigned (positive) correlation between the output and the optimal linear combination of the input fields. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the r-square measure can be interpreted as the proportion of variance of the output that can be predicted from the input field(s). Here it is about 32% (.318), which is far from perfect prediction, but still substantial. The adjusted r-square represents a technical improvement over the r-square in that it explicitly adjusts for the number of input fields and sample size, and as such is preferred by many analysts. Generally the two r-square values are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many inputs relative to your sample size, and the adjusted r-square value should be more trusted. In our results, they are very close.
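The adjusted r-square correction can be computed from the r-square, sample size, and number of inputs. A sketch using the chapter's r-square of .318 with assumed sample sizes, since n is not quoted in the text:

```python
# Adjusted R^2 penalizes R^2 for the number of inputs relative to the
# sample size; the two diverge when n is small or k is large.
def adjusted_r2(r2, n, k):
    """n = number of records, k = number of input fields."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.318, 1000, 3), 3))  # 0.316: close to R^2 for large n
print(round(adjusted_r2(0.318, 20, 3), 3))    # 0.19: shrinks noticeably for small n
```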
Figure 4.9 Model Summary and Overall Significance Tests
While the fit measures indicate how well we can expect to predict the output or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the output and input fields. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the output and the input field(s) in the population. Since our analysis contains three input fields, we test whether any linear relation differs from zero in the population from which the sample was taken. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained, if there were no linear relations in the population. The result is highly significant (significance probability less than .0005, the table value being rounded to .000, or 5 chances in 10,000). Now that we have established that there is a significant relationship between the claim amount and one or more input fields and obtained fit measures, we turn to interpreting the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert in hospital claims. Since interpretation of regression models can be made directly from the regression coefficients, we turn to those next.
The second column contains a list of the input fields plus the intercept (Constant). The estimated coefficients in the B column are those we saw when we originally browsed the Linear Regression generated model node; they are now accompanied by supporting statistical summaries. Although the B coefficient estimates are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each row to determine which input fields are significantly related to the output field. Since three inputs are in the equation, we are testing if there is a linear relationship between each input field and the output field after adjusting for the effects of the two other inputs. Looking at the significance values (Sig.) we see that all three predictors are highly significant (significance values are .004 or less). If any of the fields were found not to be significant, you would typically rerun the regression after removing those input field(s). The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that on average, each additional day spent in the hospital was associated with a claim increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claim increase of $417. Finally, the age coefficient of about -$33 suggests that claims decrease, on average, by $33 as patient age increases one year. This is counterintuitive and should be examined by a domain expert (here a physician). Perhaps the youngest patients are at greater risk, or perhaps the type of insurance policy, which is linked somehow to age, influences the claim amount. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results).
Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice. The constant or intercept of $3,027 represents the predicted claim amount for someone with 0 days in the hospital, in the least severe illness category (0), and with age 0. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but is still needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed, since the assumption is that the same pattern continues. Here it clearly cannot!
The Std. Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as an Expert Output option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000. Betas are standardized regression coefficients and are used to judge the relative importance of each of several input fields. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the input fields and their scale, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claim amount, followed by severity group and age. Betas typically range from -1 to 1, and the further from 0, the more influential the predictor variable. Thus if we wish to predict claims based on length of stay, severity code and age, the formula would use the estimated B coefficients:
Predicted Claims = $3,027 + $1,106 * (length of stay) + $417 * (severity code) - $33 * (age)
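This equation can be deployed in a few lines. A sketch scoring one hypothetical patient with the rounded coefficients shown above:

```python
# Score a patient with the estimated (rounded) regression coefficients.
def predict_claim(los_days, severity_code, age):
    return 3027 + 1106 * los_days + 417 * severity_code - 33 * age

# Hypothetical patient: 3-day stay, severity code 1, age 40.
print(predict_claim(3, 1, 40))  # 3027 + 3318 + 417 - 1320 = 5442
```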
Figure 4.11 Computing an Error (Residual) Field
The DIFF field measures the difference between the actual claim value (CLAIM) and the claim value predicted by the model ($E-CLAIM). Since we are most interested in the large positive errors, we will sort the data by DIFF before displaying it in a table.
Click OK to complete the Derive node
Place a Sort node to the right of the Derive node
Connect the Derive node to the Sort node
Edit the Sort node
Select DIFF as the Sort by field
Select Descending in the Order column
Click OK to process the Sort request
Place a Table node to the right of the Sort node
Connect the Sort node to the Table node
Execute the Table node
Figure 4.12 Errors Sorted in Descending Order
There are two records for which the claim values are much higher than the regression prediction. Both are about $6,000 more than expected from the model. These would be the first claims to examine more carefully. We could also examine the last few records for large over-predictions, which might be errors as well.
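The Derive/Sort/Table sequence amounts to the following in code; the records here are hypothetical:

```python
# Compute the residual (DIFF) per record and list the largest
# under-predictions first, as the Sort node does with Descending order.
records = [                       # (actual CLAIM, model prediction $E-CLAIM)
    {"id": 1, "claim": 9500,  "pred": 3600},
    {"id": 2, "claim": 4100,  "pred": 4300},
    {"id": 3, "claim": 11050, "pred": 5000},
]
for r in records:
    r["DIFF"] = r["claim"] - r["pred"]      # actual minus predicted

records.sort(key=lambda r: r["DIFF"], reverse=True)
print([r["id"] for r in records])  # [3, 1, 2]: largest positive errors first
```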
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
ASG    Severity of illness code (higher values mean more seriously ill)
AGE    Age
LOS    Length of hospital stay (in days)
CLAIM  Total charges in US dollars (total amount claimed on form)
1. Using the insurance claims data, use the Stepwise method and compare the equation to the one obtained using the Enter method. Are you surprised by the result? Why or why not? Try the Forward and Backward methods. Do you find any differences?
2. Instead of examining errors in the original scale, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience. Add a Derive node that calculates a percent error. Name this field PERERROR and use the following formula: 100 * (CLAIM - '$E-CLAIM') / '$E-CLAIM'. Compare this measure of error to the original DIFF. Do the same records stand out? What conditions is this percent error most sensitive to? Use the Histogram node to produce histograms for either of the error fields, generate a Select node to select records with large errors, and then display them in a table.
3. Use the Neural Net modeling node to predict CLAIM using a neural network. How does its performance compare to linear regression? What does this suggest about the model? Fit a C&R Tree model and make the same comparison. Examine the errors from the better of these latter models (as you judge them). Do the same records consistently display large errors?
Data
A risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but were profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics, including age, income, number of children, number of credit cards, number of store credit cards, having a mortgage, and marital status, are available for about 2,500 records.
Logistic regression technically predicts the probability of an event (of a record being classified into a specific category of the outcome field). The logistic function is shown in Figure 5.1. Suppose that we wish to predict whether someone buys a product. The function displays the predicted probability of purchase based on an incentive.
Logistic Regression 5 - 1
Figure 5.1 Logistic Model for Probability of Purchase
We see the probability of making the purchase increases as the incentive increases. Note that the function is not linear but rather S-shaped. The implication of this is that a slight change in the incentive could be effective or not depending on the location of the starting point. A linear model would imply that a fixed change in incentive would always have the same effect on probability of purchase. The transition from low to high probability of purchase is quite gradual. However, with a logistic model the transition can occur much more rapidly (steeper slope) near the .5 probability value. To understand how the model functions, we need to review some equations. The logistic model makes predictions based on the probability of an outcome. Binary (two outcome category) logistic regression can be formulated as:
\[
\text{Odds(event)} = \frac{\text{Prob(event)}}{1 - \text{Prob(event)}} = \frac{\text{Prob(event)}}{\text{Prob(no event)}} = e^{\beta + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}
\]
where the outcome is one of two categories (event, no event). If we take the natural log of the odds, we have a linear model, akin to a standard regression equation:

\[
\ln\!\left(\frac{\text{Prob(event)}}{\text{Prob(no event)}}\right) = \beta + B_1 X_1 + B_2 X_2 + \dots + B_k X_k
\]
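Numerically, the binary model behaves as follows; the intercept, coefficient, and incentive values below are hypothetical:

```python
# The linear predictor feeds through the logistic function, producing
# the S-shaped probability curve shown in Figure 5.1.
import math

def prob_event(b0, b1, x):
    odds = math.exp(b0 + b1 * x)      # Odds(event) = e^(b0 + b1*x)
    return odds / (1 + odds)          # Prob(event) = odds / (1 + odds)

for incentive in (0, 5, 10):
    print(incentive, round(prob_event(-4.0, 0.8, incentive), 3))
# prints 0.018, 0.5, 0.982: the transition is steepest near probability .5
```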
the outcome categories for identification: (1) Bad Risk-Profit, (2) Bad Risk-Loss, (3) Good Risk. For the three categories we can create two probability ratios:
\[
g(1) = \frac{\pi(1)}{\pi(3)}
\qquad\text{and}\qquad
g(2) = \frac{\pi(2)}{\pi(3)}
\]

where \(\pi(j)\) is the probability of being in outcome category j. Each ratio is based on the probability of an output category divided by the probability of the reference or baseline outcome category. The remaining probability ratio (Bad Risk-Profit / Bad Risk-Loss) can be obtained by taking the ratio of the two ratios shown above. Thus the information in J outcome categories can be summarized in (J-1) probability ratios. In addition, these outcome-category probability ratios can be related to input fields in a fashion similar to what we saw in the binary logistic model. Again using the Good Risk output category as the reference or baseline category, we have the following model:
\[
\ln\left(\frac{\pi(1)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Good Risk)}}\right) = \beta_1 + B_{11} X_1 + B_{12} X_2 + \cdots + B_{1k} X_k
\]
and
\[
\ln\left(\frac{\pi(2)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Bad Risk-Loss)}}{\text{Prob(Good Risk)}}\right) = \beta_2 + B_{21} X_1 + B_{22} X_2 + \cdots + B_{2k} X_k
\]
Notice that there are two sets of coefficients for the three-category output case, each describing the ratio of an output category to the reference or baseline category. If we complete this logic and create a ratio containing the baseline category in the numerator, we would have:
\[
\ln\left(\frac{\pi(3)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Good Risk)}}{\text{Prob(Good Risk)}}\right) = \ln(1) = 0 = \beta_3 + B_{31} X_1 + B_{32} X_2 + \cdots + B_{3k} X_k
\]

so the intercept and coefficients for the baseline category are all fixed at 0.
Also, the ratio relating any two output categories, excluding the baseline, can be easily obtained by subtracting their respective natural log expressions. Thus:
\[
\ln\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Bad Risk-Loss)}}\right) = \ln\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Good Risk)}}\right) - \ln\left(\frac{\text{Prob(Bad Risk-Loss)}}{\text{Prob(Good Risk)}}\right)
\]
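This subtraction identity is easy to verify numerically. The three probabilities below are made-up values that sum to 1; they stand in for the three risk-category probabilities.

```python
import math

# Hypothetical outcome probabilities for the three risk categories
p_profit, p_loss, p_good = 0.25, 0.15, 0.60

direct = math.log(p_profit / p_loss)
by_subtraction = math.log(p_profit / p_good) - math.log(p_loss / p_good)
print(abs(direct - by_subtraction) < 1e-12)  # True: the two agree
```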
We are interested in predicting the probability of each output category for specific values of the predictor variables. This can be derived from the expressions above. The probability of being in outcome category j is:
\[
\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}
\]
In our example with the three risk output categories, for outcome category (1):
\[
\pi(1) = \frac{g(1)}{g(1) + g(2) + g(3)} = \frac{\pi(1)/\pi(3)}{\pi(1)/\pi(3) + \pi(2)/\pi(3) + \pi(3)/\pi(3)}
\]
And substituting for the g(j)'s, we have an equation relating the predictor variables to the output category probabilities:
\[
\pi(1) = \frac{e^{\beta_1 + B_{11} X_1 + \cdots + B_{1k} X_k}}{e^{\beta_1 + B_{11} X_1 + \cdots + B_{1k} X_k} + e^{\beta_2 + B_{21} X_1 + \cdots + B_{2k} X_k} + e^{\beta_3 + B_{31} X_1 + \cdots + B_{3k} X_k}} = \frac{e^{\beta_1 + B_{11} X_1 + \cdots + B_{1k} X_k}}{e^{\beta_1 + B_{11} X_1 + \cdots + B_{1k} X_k} + e^{\beta_2 + B_{21} X_1 + \cdots + B_{2k} X_k} + 1}
\]

since the baseline category's parameters are fixed at 0, its exponential term equals 1.
In this way, the logic of binary logistic regression can be naturally extended to permit analysis of symbolic output fields with more than two categories.
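A minimal Python sketch of this computation follows. The two coefficient sets and the input value are hypothetical, standing in for the (J-1) fitted equations.

```python
import math

def multinomial_probs(coef_sets, xs):
    # One (intercept, coefficients) pair per non-baseline category;
    # the baseline's linear predictor is fixed at 0, so its term is e^0 = 1.
    gs = [math.exp(b0 + sum(b * x for b, x in zip(bs, xs)))
          for b0, bs in coef_sets]
    gs.append(1.0)                       # baseline category
    total = sum(gs)
    return [g / total for g in gs]       # probabilities sum to 1

# Hypothetical model: two non-baseline categories, one input field
probs = multinomial_probs([(0.5, [-0.1]), (1.0, [-0.2])], [10.0])
print([round(p, 3) for p in probs])
```

The predicted category is simply the one with the largest probability.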
Field name   Field description
AGE          age in years
INCOME       income (in British pounds)
GENDER       f=female, m=male
MARITAL      marital status: single, married, divsepwid (divorced, separated or widowed)
NUMKIDS      number of dependent children
NUMCARDS     number of credit cards
HOWPAID      how often paid: weekly, monthly
MORTGAGE     have a mortgage: y=yes, n=no
STORECAR     number of store credit cards
LOANS        number of other loans
INCOME1K     income (in British pounds) divided by 1,000

The output field is:

Field name   Field description
RISK         credit risk: 1=bad risk-loss, 2=bad risk-profit, 3=good risk

To access the data:
Click File…Open Stream and move to the c:\Train\ClemAdvMod directory
Double-click on Logistic.str
Execute the Table node, examine the data, and then close the Table window
Double-click on the Type node
The output field is credit risk (RISK). Notice that only four input fields are used. This is done to simplify the results for this presentation. As an exercise, the other fields will be used as predictors.
Figure 5.2 Type Node for Logistic Analysis
Close the Type node dialog
Double-click on the Logistic Regression model node named RISK
In the Model tab, you can choose whether a constant (intercept) is included in the equation. The Procedure option specifies whether a binomial or multinomial model is created; the options available in the dialog box differ according to which modeling procedure you select.

Binomial is used when the target field is a flag or set with two discrete values, such as good risk/bad risk or churn/not churn. Whenever you use this option, you will also be asked to declare which of your flag or set fields should be treated as categorical, the type of contrast you want performed, and the reference category for each predictor. The default contrast is Indicator, which indicates the presence or absence of category membership. However, for fields with some implicit order, you may want to use another contrast such as Repeated, which compares each category with the one that precedes it. The default reference or base category is the First category; if you prefer, you can change this to the Last category.

Multinomial should be used when the target field is a set field with more than two values. This is the correct choice in our example because the RISK field has three values: bad loss, bad profit, and good risk. Whenever you use this option, the Model type option becomes available so you can specify a main effects model, a full factorial model, or a custom model. By default, a model including the main effects (no interactions) of factors (symbolic inputs) and covariates (numeric inputs) will be run. This is similar to what the Regression model node will do (unless interaction terms are formally added). The Full factorial option would fit a model
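The Indicator contrast mentioned above amounts to ordinary dummy coding. The sketch below uses category labels from this chapter's MARITAL field; the function itself is illustrative, not Clementine's internal code.

```python
def indicator_coding(levels, reference):
    # One 0/1 column per non-reference level; the reference level
    # is coded as all zeros.
    others = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == o else 0 for o in others] for lv in levels}

codes = indicator_coding(["single", "married", "divsepwid"], reference="single")
print(codes["single"])     # [0, 0]  reference category
print(codes["married"])    # [1, 0]
print(codes["divsepwid"])  # [0, 1]
```

A Repeated contrast would instead build columns comparing each category with the one preceding it; the choice of contrast changes what each coefficient means, not the model's overall fit.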
including all factor interactions (in our example, with two symbolic predictors, the two-way interaction of MARITAL and MORTGAGE would be added).

Notice that there are Method options (as there were for linear regression), so stepwise methods can be used when the Main Effects model type is selected. When a number of input fields are available, the stepwise methods provide a means of input field selection based on statistical criteria.

The Base Category for target option is used to specify the reference category. The default is the First category in the list, which in this case is bad loss. Note: this field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial.
Select the Multinomial Procedure option (if necessary)
Click on the Specify button to the right of Base category for target. This will open the Insert Value dialog box
Click on good risk
This will change the base target category. The result is shown in Figure 5.5.
Figure 5.5 Logistic Regression Dialog with Good Risk as the Base Target Category
Click on the Expert tab
Click the Expert Mode option button
Figure 5.6 Logistic Expert Mode Options
The Scale option allows adjustment to the estimated parameter variance-covariance matrix based on over-dispersion (variation in the outcome greater than expected by theory, which might be due to clustering in the data). The details of such adjustment are beyond the scope of this course, but you can find some discussion in McCullagh and Nelder (1989). If the Append all probabilities checkbox is selected, predicted probabilities for every category of the output field are added to each record passed through the generated model node. If not selected, a predicted probability field is added only for the predicted category.
Click the Output button
Click the Likelihood ratio tests check box
Click the Classification table check box
By default, summary statistics and (partial) likelihood ratio tests for each effect in the model appear in the output. Also, 95% confidence bands will be calculated for the parameter estimates. We have requested a classification table so we can assess how well the model predicts the three risk categories.
Figure 5.7 Logistic Regression Advanced Output Options
In addition, a table of observed and expected cell probabilities can be requested (Goodness of fit chi-square statistics). Note that, by default, cells are defined by each unique combination of a covariate (range input) and factor (symbolic input) pattern, and a response category. Since a continuous predictor (INCOME1K) is used in our analysis, the number of cell patterns is very large and each might have but a single observation. These small counts could yield unstable results, and so we will forgo goodness of fit statistics.

The asymptotic correlation of parameter estimates can provide a warning of multicollinearity problems (when high correlations are found among parameter estimates). Iteration history information is requested to help debug problems if the algorithm fails to converge, and the number of iteration steps to display can be specified. Monotonicity measures can be used to find the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and the Concordance Index C are also displayed in this table. Information criteria shows Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).
Click OK
Click the Convergence button
The Logistic Regression Convergence Criteria options control technical convergence criteria. Analysts familiar with logistic regression algorithms might use these if the initial analysis fails to converge to a solution.
Click Cancel
Click Execute
Browse the Logistic Regression generated model node named RISK in the Models Manager window
Click the Advanced tab, and then expand the browsing window
The advanced output is displayed in HTML format.

Figure 5.9 Record Processing Summary
The marginal frequencies of the symbolic inputs and the output are reported, along with a summary of the number of valid and missing records. A record must have valid values on all inputs and the output in order to be included in the analysis. We have nearly 2,500 records for the analysis.
Figure 5.10 Model Fit and Pseudo R-Square Summaries
The Final model chi-square statistic tests the null hypothesis that all model coefficients are zero in the population, equivalent to the overall F test in regression. It has ten degrees of freedom, corresponding to the parameters in the model (seen below), is based on the change in -2LL (-2 log likelihood) from the initial model (with just the intercept) to the final model, and is highly significant. Thus at least some effect in the model is significant.

The AIC and BIC fit measures are also displayed; lower values indicate better fit. Because each of them decreased, we can conclude that the model fit improved with the addition of the predictors.

Pseudo r-square measures attempt to quantify the amount of variation (as functions of the chi-square lack of fit) accounted for by the model. The model explains only a modest amount of the variation (the maximum is 1, and some measures cannot reach this value).

Figure 5.11 Likelihood Ratio Tests
The Model Fitting Criteria table provided an omnibus test of the effects in the model. Here we have a test of significance for each effect (in this case the main effect of an input field) after adjusting for the other effects in the model. The caption explains how it is calculated. All effects are highly significant. Notice that the intercepts are not tested in this way, but tests of the individual intercepts can be found in the Parameter Estimates table.

In addition, we can use this table to rank order the importance of the predictors. For instance, if we focus on the -2LL value: if INCOME1K were removed as a predictor, the -2LL value would increase by 302.422. Clearly, the removal of this predictor would have far more impact on the overall fit than eliminating any of the other predictors (the further -2LL gets from zero, the worse the fit). Thus, we can conclude that INCOME1K is the most important predictor, followed by MARITAL, NUMKIDS, and MORTGAGE.

For those familiar with binary (two output category) logistic regression, note that the values in the df (degrees of freedom) column are double what you would expect for a binary logistic regression model. For example, the covariate income (INCOME1K), which is continuous, has two degrees of freedom. This is because with three outcome categories, there are two probability ratios to be fit, doubling the number of parameters. Income has by far the largest chi-square value compared to the other predictors with two (or even four) degrees of freedom.
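The arithmetic behind these comparisons is just differences of -2LL values. A sketch with invented -2LL numbers, except for the 302.422 change quoted above; the McFadden pseudo r-square shown is one common variant, not necessarily the one Clementine reports:

```python
def lr_chi_square(neg2ll_without, neg2ll_with):
    # Likelihood ratio statistic for one effect: the increase in -2LL
    # when the effect is dropped from the model.
    return neg2ll_without - neg2ll_with

def mcfadden_r2(neg2ll_null, neg2ll_model):
    # One common pseudo r-square: 1 - LL_model / LL_null
    return 1.0 - neg2ll_model / neg2ll_null

# Hypothetical -2LL for the fitted model; the text reports that dropping
# INCOME1K raises -2LL by 302.422
neg2ll_model = 4000.0
print(round(lr_chi_square(neg2ll_model + 302.422, neg2ll_model), 3))  # 302.422
print(round(mcfadden_r2(4600.0, neg2ll_model), 3))
```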
For each of the two outcome probability ratios, each predictor is listed, plus an intercept, with the B coefficients and their standard errors, a test of significance based on the Wald statistic, and the Exp(B) column, which is the exponentiated value of the B coefficient, along with its 95% confidence interval. As with ordinary linear regression, these coefficients are interpreted as estimates of the effect of a particular input, controlling for the other inputs in the equation.

Recall that the original (linear) model is in terms of the natural log of a probability ratio. The intercept represents the log of the expected probability ratio of two outcome categories when all numeric inputs are zero and all symbolic fields are set to their reference category (last group) values. For covariates, the B coefficient is the effect of a one-unit change in the input on the log of the probability ratio.

Examining income (INCOME1K) in the bad loss section, an increase of 1 unit (equivalent to 1,000 British pounds) decreases the log of the probability ratio between bad loss and good risk by .056. But what does this mean in terms of probabilities? Moving to the Exp(B) column, we see the value is .945 for INCOME1K (in the bad loss section of the table). Thus increasing income by 1 unit (1,000 British pounds) multiplies the expected ratio of the probability of being a bad loss to the probability of being a good risk by a factor of .945. In other words, increasing income reduces the expected probability of being a bad loss relative to being a good risk by a factor of .945 per 1,000 British pounds. This finding makes common sense.

If we examine the income coefficient in the bad profit section of the table, we see that in a similar way (Exp(B) = .878) the expected probability of being a bad profit relative to being a good risk decreases as income increases. Thus increasing income, after controlling for the other variables in the equation, is associated with a decreasing probability of having a bad loss or bad profit outcome relative to being a good risk.
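The relationship between B and Exp(B) is simple exponentiation, which is easy to check; the small discrepancy below comes from the table's rounding of B to .056.

```python
import math

b_income = -0.056      # B for INCOME1K in the bad loss equation
print(round(math.exp(b_income), 3))        # close to the table's .945

# Effects multiply: a 10-unit (10,000 pound) increase scales the
# bad loss vs. good risk probability ratio by Exp(B) ** 10
print(round(math.exp(b_income) ** 10, 3))
```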
This relationship is quantified by the values in the Exp(B) column, and the Sig column indicates that both coefficients are statistically significant.

Turning to the number of children (NUMKIDS), we see that its coefficient is significant for the bad loss ratio, but not the bad profit ratio. Examining the Exp(B) column for NUMKIDS in the bad loss section, the estimate is 2.267: controlling for the other predictors, each additional child (a one-unit increase in NUMKIDS) more than doubles the expected ratio of the probability of being a bad loss to that of being a good risk. However, controlling for the other predictors, the number of children has no significant effect on the probability ratio of being a bad profit relative to a good risk.

The Logistic node uses a General Linear Model coding scheme. Thus for each symbolic input (here MARITAL and MORTGAGE), the last category value is made the reference category, and the other coefficients for that input are interpreted as offsets from the reference category. In examining the table we see that the last categories for MARITAL (single) and MORTGAGE (y) have B coefficients fixed at 0. Because of this, the coefficient of any other category can be interpreted as the change associated with shifting from the reference category to the category of interest, controlling for the other input fields. Since the reference category coefficients are fixed at 0, they have no associated statistical tests or confidence bands.

Looking at the MARITAL input field, its two coefficients (for the divsepwid and married categories) are significant for both the bad loss and bad profit summaries. In the bad loss section, we see the estimated Exp(B) coefficient for the MARITAL=divsepwid category is .284, while that for MARITAL=married is 2.891.
Thus we could say that, after controlling for other inputs, compared to those who are single, those who are divorced, separated, or widowed have a large reduction (a factor of .284) in the expected ratio of the probability of being a bad loss relative to a good risk. Put another way, the divorced, separated, or widowed group is expected to have fewer bad losses relative to good risks than is the single group. On the other hand, the married group is
expected to have a much higher (by a factor of almost 3) proportion of bad losses relative to good risks than the single group. The explanation of why being married versus single should be associated with an increase in bad losses relative to good risks should be worked out by the analyst, perhaps in conjunction with someone familiar with the credit industry (a domain expert). If we examine the MARITAL Exp(B) coefficients for the bad profit ratios, we find a very similar result.

Finally, MORTGAGE is significant for both the bad loss and bad profit ratios. Since having a mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that compared to the group with a mortgage, those without a mortgage have a greater expected probability of being bad losses (1.828) or bad profits (2.526) relative to good risks. In short, those without mortgages are less likely to be good risks, controlling for the other predictors.

In this way, the statistical significance of inputs can be determined and the coefficients interpreted. Note that if a predictor were not significant in the Likelihood Ratio Tests table, then the model should be rerun after dropping that variable. Although NUMKIDS is not significant for both sets of category ratios (its coefficient is significant only for the bad loss ratio), the joint test (Likelihood Ratio Test) indicates it is significant, and so we would retain it.
Classification Table
The classification table, sometimes called a misclassification table or confusion matrix, provides a measure of how well the model performs. With three output categories we are interested in the overall accuracy of model classification, the accuracy for each of the individual output categories, and patterns in the errors.

Figure 5.13 Classification Table
The rows of the table represent the actual output categories while the columns are the predicted output categories. We see that overall, the predictive accuracy of the model is 62.4%. Although marginal counts do not appear in the table, by adding the counts within each row we find that the most common output category is bad profit (1,475), which constitutes 60.1% of all cases (2,455). Thus the overall predictive accuracy of our model is not much of an improvement over the simple rule of always predicting bad risk-profit. However, we should recall that this simple rule would never make a prediction of bad risk-loss or good risk. In examining the individual output categories, the bad risk-profit group is predicted most accurately (87.3%), while the other categories, bad risk-loss (15.9%) and good risk (36.8%), are predicted with much less accuracy. Not surprisingly, most errors in prediction for these latter two output categories are predictions of bad risk-profit.
The classification table allows us to evaluate a model from the perspective of predictive accuracy. Whether this model would be adequate depends in part on the value of correct predictions and the cost of errors. Given the modest improvement of this model over simply classifying all cases as bad risk-profit, in practice an analyst would see whether the model could be improved by adding additional predictors and perhaps some interaction terms. Finally, it is important to note that the predictions were evaluated on the same data used to fit the model and for this reason may be optimistic. A better procedure is to keep a separate validation sample on which to evaluate the predictive accuracy of the model.
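The accuracy figures above come straight from the table's counts. The confusion matrix below is hypothetical (the chapter's actual cell counts are in Figure 5.13), but the baseline calculation uses the totals quoted in the text.

```python
def overall_accuracy(table):
    # table[i][j]: count of cases actually in category i, predicted as j;
    # correct predictions lie on the diagonal.
    total = sum(sum(row) for row in table)
    correct = sum(row[i] for i, row in enumerate(table))
    return correct / total

# Baseline rule from the text: always predict bad risk-profit
print(round(1475 / 2455, 3))               # 0.601

# Hypothetical 3x3 confusion matrix (rows actual, columns predicted)
cm = [[50, 40, 10],
      [20, 70, 10],
      [10, 30, 60]]
print(round(overall_accuracy(cm), 3))      # 0.6
```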
Making Predictions
We now have the estimated model coefficients. How does the Logistic generated model node make predictions from the model? First, let's see the actual predictions by adding the generated model to the stream.
Close the Model browsing window
Add the Logistic generated model to the stream
Connect the generated model to the Type node
Add a Table node to the stream and connect the Logistic generated model to the Table node
Execute the Table node
The field $L-RISK contains the most likely prediction from the model (here good risk). The probabilities for all three outcomes must sum to 1; the model prediction is the outcome category with the highest probability. That probability is contained in the field $LP-RISK. So for the first case, the prediction is good risk and the predicted probability of this occurring is .692 for this combination of input values. We prefer that this probability be as close to 1 as possible (the lowest possible value for the predicted category is .333; why?).

To illustrate how the actual calculation is done, let's take an individual who is single, has a mortgage, no children, and an income of 35,000 British pounds (INCOME1K = 35.00). What is the predicted probability of her (although gender was not included in the model) being in each of the three risk categories? Into which risk category would the model place her? Earlier in this chapter we showed the following (where π(j) is the probability of being in outcome category j):
\[
\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}
\]
If we substitute the parameter estimates in order to obtain the estimated probability ratios, we have, for the baseline category:

\[
g(3) = 1
\]
Because of the coding scheme for the symbolic inputs (Factors):

Marital1 = 1 if Marital=divsepwid; 0 otherwise
Marital2 = 1 if Marital=married; 0 otherwise
Mortgage1 = 1 if Mortgage=n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:
\[
\pi(1) = \frac{g(1)}{g(1) + g(2) + 1} = .110, \qquad
\pi(2) = \frac{g(2)}{g(1) + g(2) + 1} = .386, \qquad
\pi(3) = \frac{1}{g(1) + g(2) + 1} = .504
\]
Since the third group (good risk) has the greatest expected probability (.504), the model predicts that the individual belongs to that group. The next most likely group is group 2 (bad risk-profit), because its expected probability is the next largest (.386).
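The normalization step can be sketched numerically. The g(1) and g(2) values below are not from the chapter's output; they are back-solved from the reported probabilities (.386 and .504) purely to show the arithmetic, with g(3) = 1 by definition.

```python
g1, g2, g3 = 0.218, 0.766, 1.0     # g(3) = 1 for the baseline category
denom = g1 + g2 + g3
probs = [g1 / denom, g2 / denom, g3 / denom]
print([round(p, 3) for p in probs])   # roughly .110, .386, .504
predicted = probs.index(max(probs)) + 1
print(predicted)                      # 3: good risk
```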
Additional Readings
Those interested in learning more about logistic regression might consider David W. Hosmer and Stanley Lemeshow's Applied Logistic Regression, 2nd Edition (New York: Wiley, 2000).
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
ID         ID number
AGE        Age
INCOME     Income in British pounds
GENDER     Gender
MARITAL    Marital status
NUMKIDS    Number of dependent children
NUMCARDS   Number of credit cards
HOWPAID    How often is customer paid by employer (weekly, monthly)
MORTGAGE   Does customer have a mortgage?
STORECAR   Number of store credit cards
LOANS      Number of outstanding loans
RISK       Credit risk category
INCOME1K   Income in thousands of British pounds (field derived within Clementine)
1. Continuing with the stream from the chapter, add the other available inputs, excluding income (which is linearly related to INCOME1K) and ID, to a logistic regression model and evaluate the results. Do the additional variables substantially improve the predictive accuracy of the model? Examine the estimated coefficients for the significant inputs. Do these relationships make sense?

2. Rerun the Logistic node, dropping those inputs that were not significant in the last analysis. Does the accuracy of the model change much? Does the interpretation of any of the coefficients change substantially?

3. Rerun the Logistic node, this time using the Stepwise method. Do the input fields selected match those retained in Exercise 2?
4. Run a rule induction model (using C5.0 or CHAID) on this data, using all fields but ID and income as inputs. How does the accuracy of this model compare to that found by logistic regression? What does this suggest about the relations in the data? Do the inputs used by the model correspond to the inputs that were found to be significant in the logistic regression analysis?

5. Run a neural net model on this data, again excluding ID and income as inputs. Include a sensitivity analysis. Does the neural network outperform the other models? Are the important predictors in the neural network model (sensitivity results) the same as the significant input fields in the logistic regression?
Data
To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). There is interest in identifying those segments most likely to adopt the service. Several demographic variables are available, including: education, gender, age, income (in categories), number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The outcome measure was whether they would accept the offering.
6.1 Introduction
Discriminant analysis is a technique designed to characterize the relationship between a set of variables, often called predictor or discriminating variables, and a grouping variable with a relatively small number of categories. By modeling the relationship, discriminant analysis can make predictions for categories of the grouping variable. To do so, it creates a linear combination of the predictors that best characterizes the differences among the groups. The technique is related to both regression and multivariate analysis of variance, and as such it is another general linear model technique. Another way to think of discriminant analysis is as a method to study differences between two or more groups on several variables simultaneously. Common uses of discriminant analysis include:

1. Deciding whether a bank should offer a loan to a new customer.
2. Determining which customers are likely to buy a company's products.
3. Classifying prospective students into groups based on their likelihood of success at a school.
4. Identifying patients who may be at high risk for problems after surgery.
Discriminant Analysis 6 - 1
Discriminant analysis assumes that the groups being studied are samples from populations measured on a set of variables (the predictors) that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of the predictors that best separate the populations. If we assume two predictor variables, X and Y, and two groups for simplicity, this situation can be represented as in Figure 6.1.

Figure 6.1 Two Normal Populations and Two Predictor Variables, with Discriminant Axis
The two populations or groups clearly differ in their mean values on both the X and Y axes. However, the linear function (in this instance, a straight line) that best separates the two groups is a combination of the X and Y values, as represented by the line running from lower left to upper right in the scatterplot. This line is a graphic depiction of the discriminant function, or linear combination of X and Y, that is the best predictor of group membership. In this case, with two groups and one function, discriminant will find the midpoint between the two groups that is the optimum cutoff for separating them (represented here by the short line segment). The discriminant function and cutoff can then be used to classify new observations.

If there are more than two predictors, then the groups will (hopefully) be well separated in a multidimensional space, but the principle is exactly the same. If there are more than two groups, more than one classification function can be calculated, although not all the functions may be needed to classify the cases. Since the number of predictors is almost always more than two, scatterplots such as Figure 6.1 are not always that helpful. Instead, plots are often created using the new discriminant functions, since it is on these that the groups should be well separated. The effect of each predictor on each discriminant function can be determined, and the predictors that are more important or more central to each function can be identified.

Nevertheless, unlike in regression, the exact effects of the predictors are not typically seen as of ultimate importance in discriminant analysis. Given the primary goal of correct prediction, the specifics of how this is accomplished are not as critical as the prediction itself (such as offering loans to customers who will pay them back). Second, as will be demonstrated below, the predictors do not directly predict the grouping variable, but instead a value on the discriminant function, which, in turn, is used to generate a group classification.
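For the two-group case, classification with the midpoint cutoff can be sketched as follows; the weights, group means, and test points are invented for illustration.

```python
def project(weights, point):
    # Project a case onto the discriminant axis
    return sum(w * x for w, x in zip(weights, point))

# Hypothetical function weights and group means of the discriminant scores
weights = (0.7, 0.7)
mean_group1, mean_group2 = 3.0, -3.0
cutoff = (mean_group1 + mean_group2) / 2.0   # midpoint between the groups

def classify(point):
    return 1 if project(weights, point) > cutoff else 2

print(classify((2.5, 2.0)))    # 1
print(classify((-2.0, -1.5)))  # 2
```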
Each discriminant function is a linear combination of the predictors:

\[
F_K = D_{K0} + D_{K1} X_1 + D_{K2} X_2 + \cdots + D_{Kp} X_p
\]

where F_K is the score on function K, the D_i are the discriminant coefficients, and the X_i are the predictor or response variables (there are p predictors). The maximum number of functions K that can be derived is equal to the minimum of the number of predictors (p) and the quantity (number of groups - 1). In most applications, there will be more predictors than categories of the grouping variable, so the latter will limit the number of functions. For example, if we are trying to predict which customers will choose one of three offers, (3 - 1) = 2 classification functions can be derived. When more than one function is derived, each subsequent function is chosen to be uncorrelated, or orthogonal, to the previous functions (just as in principal components analysis, where each component is uncorrelated with all others; see Chapter 7). This allows for straightforward partitioning of variance. Discriminant creates a linear combination of the predictor variables to calculate a discriminant score for each function. This score is used, in turn, to classify cases into one of the categories of the grouping variable.
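The function count and the scoring step can be sketched directly; the coefficient and input values below are illustrative only.

```python
def max_functions(n_predictors, n_groups):
    # Number of discriminant functions: min(p, number of groups - 1)
    return min(n_predictors, n_groups - 1)

def discriminant_score(d0, coefs, xs):
    # F = D0 + D1*X1 + ... + Dp*Xp for one function
    return d0 + sum(d * x for d, x in zip(coefs, xs))

print(max_functions(7, 3))   # 2, as in the three-offer example
print(discriminant_score(1.0, [0.5, -0.25], [2.0, 4.0]))
```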
First, the prior probability is the likelihood of group membership before the data are examined; by default, each group is assumed equally likely (so that if there are three groups, the prior probability of each group would be 1/3). We have more to say about prior probabilities below.

Second, the conditional probability is the probability of obtaining a specific discriminant score (or one further from the group mean) given that a case belongs to a specific group. By assuming that the discriminant scores are normally distributed, it is possible to calculate this probability. With this information, and by applying Bayes' rule, the posterior probability is calculated, which is defined as the likelihood or probability of group membership given a specific discriminant score. It is this probability value that is used to classify a case into a group; that is, a case is assigned to the group with the highest posterior probability.

Although Clementine uses a probability method of classification, you will most likely use a method based on a linear function to classify new data. This is mainly for ease of calculation, because calculating probabilities for new data is computationally intensive compared to using a classification function. This will be illustrated below.
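Bayes' rule with normal distributions can be sketched directly. This sketch uses the normal density at the score rather than the tail probability described above, and the score, group means, and priors are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    # Density of a normal distribution at x
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def posterior_probs(score, group_means, priors, sd=1.0):
    # posterior(group) is proportional to prior(group) * density(score | group)
    nums = [p * normal_pdf(score, m, sd) for m, p in zip(group_means, priors)]
    total = sum(nums)
    return [n / total for n in nums]

# A case with discriminant score 0.8, two groups with equal priors
post = posterior_probs(0.8, group_means=[1.0, -1.0], priors=[0.5, 0.5])
print([round(p, 3) for p in post])   # assign to group 1 (higher posterior)
```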
discriminant function performs quite well, especially when the sample sizes are small. This assumption can be tested with the Explore procedure or with Box's M statistic, displayed by Discriminant. For a more detailed discussion of problems with assumption violation, see P. A. Lachenbruch, Discriminant Analysis (New York: Hafner, 1975) or Carl Huberty, Applied Discriminant Analysis (New York: Wiley, 1994).
Logistic regression, as we have seen in Chapter 5, is derived from a view of the world in which individuals fall more along a continuum. This difference in formulation led discriminant to be employed in credit analysis (there are those who repay loans and those who don't), while logistic regression was used to make risk adjustments in medicine (depending on demographics, health characteristics and treatment, you are more or less likely to survive a disease). Despite these different origins, discriminant and logistic give very similar results in practice. Monte Carlo simulation work has not found one to be superior to the other under very general circumstances. There is, of course, the obvious point that if the data are based on samples from multivariate normal populations, then discriminant outperforms logistic regression. One consideration when choosing between these two methods involves how many dichotomous (or dummy-coded) predictor variables are used in the analysis. Because of the stronger assumptions made about the predictor variables by discriminant, the more categorical variables you have, the more you would lean toward logistic regression. Within the domain of response-based segmentation, more discriminant analysis is seen on the business side, while if the problem is formulated from a marketing perspective as a choice model, logistic models are more common. Note that neither discriminant nor logistic will produce a list of segments. Rather, they will indicate which predictor variables (some may represent demographic characteristics) are relevant to the outcome. From the prediction equation or other summary measures you can determine the combinations of characteristics that most likely lead to the desired outcome.
Recommendations
Logistic regression and discriminant analysis give very similar results in practice. Since discriminant does make stronger assumptions about the nature of your predictor variables (formally, multivariate normality and equal covariance matrices are assumed), as more of your predictor variables are categorical (and thus need to be dummy coded) or dichotomous, you would move in the direction of logistic regression. Certain research areas have a tradition of using only one of the methods, which may also influence your choice.
As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.
Click File…Open Stream and then move to the c:\Train\ClemPredModel folder
Double-click on Discriminant.str
Right-click on the Table node and select Execute to view the data
Place a Discriminant node from the Modeling palette to the right of the Type node
Connect the Type node to the Discriminant node
The name of the Discriminant node will immediately change to NEWSCHAN, the outcome field.
The Use partitioned data option can be used to split the data into separate samples for training and testing. This may provide an indication of how well the model will work with new data. We will not use this option in this example, but instead will take advantage of a different option for validating the model (Leave-one-out classification) that is built into the Discriminant procedure. The Method option allows you to specify how you want the predictors entered into the model. By default, all of the terms are entered into the equation. If you do not have a particular model in mind, you can invoke the Stepwise option that will enter predictor variables into the equation based on a statistical criterion. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. Some analysts prefer to enter all the predictor variables into the equation and then evaluate which are important. However, if there are many correlated predictor variables, you run
the risk of multicollinearity, in which case a Stepwise method may be preferred. A drawback is that the Stepwise method has a strong tendency to overfit the training data. When using this method, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data (which is common practice in data mining).
Click on the Method button and select Stepwise
You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the outcome in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each group. If you know that the sample proportions reflect the distribution of the outcome in the population, then you can use the Compute from group sizes option to instruct Discriminant to make use of this information. For example, if an outcome category is very rare, Discriminant can make use of this fact in its prediction equation. We don't know what the proportions would be, so we retain the default. The Use covariance matrix option is often useful whenever the homogeneity of variance assumption is not met. In general, if the groups are well separated in the discriminant space, heterogeneity of variance will not be terribly important. However, in situations when you do violate the equal variance assumption, it may be useful to use the Separate-groups covariance matrices option to see if your predictions change by very much. If they do, that would suggest that the violation of the equal variance assumption was serious. It should be noted that using separate-groups covariance matrices does not affect the results prior to classification. This is because Clementine does not use the original scores to do the classification. Thus, the use of the Fisher classification functions is not equivalent to classification by Clementine with separate covariance matrices.
Click the Output button
Figure 6.7 Discriminant Advanced Output Dialog
Checking Univariate ANOVAs will have Clementine display significance tests of between-group (outcome categories) differences on each of the predictor variables. The point of this is to provide some hint as to which variables will prove useful in the discriminant function, although this is precisely what discriminant will resolve. The Box's M statistic is a direct test of the equality of covariance matrices. The covariance matrices are ancillary output and very rarely viewed in practice. However, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors. Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to make predictions for future observations (customers). Both sets of coefficients produce the same predictions when equal covariance matrices are assumed. If there are only two outcome categories (as is our situation), either set of coefficients is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients, which involve a single equation in the two-outcome case, are more convenient. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules. Casewise results can be used to display the codes for the actual group, predicted group, posterior probabilities, and discriminant scores for each case. The Summary table, also known by several other names, including Classification table, Misclassification table, and Confusion table, displays the number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. The Leave-one-out classification classifies each case based on discriminant coefficients calculated while the case is excluded from the analysis. This is a jackknife method and provides a classification table that should generalize at least slightly better to other samples.
You can also produce a Territorial map, which is a plot of the boundaries used to classify cases into groups, but the map will not be displayed if there is only one discriminant function (the maximum number of functions is equal to the number of categories minus 1 in the outcome field). The Stepwise options allow you to display a Summary of statistics for all variables after each step.
Click the Means, Univariate ANOVAs, and Box's M check boxes in the Descriptives area
Click the Fisher's and Unstandardized check boxes in the Function Coefficients area
Click the Summary table and Leave-one-out classification check boxes in the Classification area
Wilks' lambda is the default and probably the most common method. The differences between the methods are somewhat technical and beyond the scope of this course. You can change the statistical criterion for variable entry. For example, you might want to make the criterion more stringent when working with a large sample.
Click Cancel
Click the Execute button
Browse the Discriminant generated model in the Models Manager window
Click the Advanced tab, and then expand the browsing window
Scroll to the Classification Results
Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. The actual (known) groups constitute the rows and the predicted groups make up the columns. Of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them; thus accuracy for this group is 69.2%. For the 214 respondents who said
they would accept the offering, 66.4% were correctly predicted. Overall, the discriminant model was accurate in 67.8% of the cases. Is this good? Will this model work well with new data? The answer to the first question will largely depend on what level of predictive accuracy you required before you began the project. One way we can assess the success of the model is to compare these results with the predictions we would have made if we simply guessed the larger group. If we did that, we would be correct in 227 of 441 (227 + 214) instances, or about 51.5% of the time. The 67.8% correct figure, while certainly far from perfect accuracy, does far better than guessing. The Cross-validated portion of the table gives us an idea about how accurate this model will be with new data. The percent of correctly classified cases has decreased slightly from 67.8% to 67.3% for the cross-validation. Because these results are virtually identical, it appears the model is valid. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
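The baseline comparison above is easy to verify from the counts quoted in the text:

```python
# Verifying the accuracy figures quoted above from the classification table.
no_accept, accept = 227, 214            # actual group sizes
total = no_accept + accept              # 441 respondents

baseline = no_accept / total            # always guess the larger group
print(round(baseline * 100, 1))         # 51.5

correct = 157 + round(0.664 * accept)   # 157 correct "no" plus 66.4% of "yes"
print(round(correct / total * 100, 1))  # 67.8
```

This kind of naive-baseline check is a useful habit with any classifier: overall accuracy only means something relative to what guessing the modal group would achieve.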
Scroll back to the Group Statistics pivot table
Viewing the means by themselves is of limited use, but notice the group that would accept the service is about 7 years older than the group that would not accept, whereas the daily hours of TV viewing are almost identical for the two groups. The standard deviations are very similar across groups, which is promising for the equal covariance matrices assumption.
Scroll to the Tests of Equality of Group Means pivot table
The significance tests of between-group differences on each of the predictor variables provide hints as to which will be useful in the discriminant function (recall we are using Wilks' criterion as a stepwise method). Notice Age in Years has the largest F (is most significant) and will be selected first in the stepwise solution. This table looks at each variable ignoring the others, while discriminant adjusts for the presence of the other variables in the equation (as would regression).
Scroll to the Box's M test results
Figure 6.13 Box's M Test Results
Because the significance value is well above 0.05, we can accept the null hypothesis that the covariance matrices are equal. However, the Box's M test is quite powerful and leads to rejection of equal covariances when the ratio N/p is large, where N is the number of cases and p is the number of variables. The test is also sensitive to lack of multivariate normality, which applies to these data. If the matrices were unequal, the effect on the analysis would be to create errors in the assignment of cases to groups.
Scroll to the Eigenvalues and Wilks' Lambda portion of the output
Figure 6.14 Summaries of Discriminant Function (Eigenvalues and Wilks' Lambda)
These two tables are overall summaries of the discriminant function. The canonical correlation measures the correlation between a variable (or variables, when there are more than two groups) contrasting the groups and an optimal (in terms of maximizing the correlation) linear combination of the predictor variables. In short, it measures the strength of the relationship between the predictor variables and the groups. Here, there is a modest (.363) canonical correlation. Wilks' lambda provides a multivariate test of group differences on the predictor variables. If this test were not significant (it is highly significant), we would have no basis on which to proceed with discriminant analysis. Now we view the individual coefficients.
Scroll down until you see the Standardized Coefficients and Structure Matrix
Figure 6.15 Standardized Coefficients and Structure Matrix
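For a single discriminant function, the eigenvalue, canonical correlation r, and Wilks' lambda are tied together by standard identities (r² = λ/(1+λ), and Λ = 1/(1+λ) = 1 − r²), so each can be recovered from the others. A quick sketch using the canonical correlation of .363 reported in the summary tables; the eigenvalue and lambda below are derived from these identities, not read from the output:

```python
# Derive the eigenvalue and Wilks' lambda implied by the canonical
# correlation, using standard single-function identities.
r = 0.363
eigenvalue = r ** 2 / (1 - r ** 2)
wilks_lambda = 1 / (1 + eigenvalue)                   # equivalently 1 - r**2
print(round(eigenvalue, 3), round(wilks_lambda, 3))   # 0.152 0.868
```

A Wilks' lambda near 1 means the groups barely differ on the function; the closer it is to 0, the better the separation.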
Standardized discriminant coefficients can be used as you would use standardized regression coefficients in that they attempt to quantify the relative importance of each predictor in the discriminant function. The only three predictors that were selected by the stepwise analysis were
Education, Gender and Age. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function. Notice the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) results in a one-unit change, which, when multiplied by the negative coefficient, will lower the discriminant score and move the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors. The Structure Matrix displays the correlations between each variable considered in the analysis and the discriminant function(s). Note that income category correlates more highly with the function than gender or education, but it was not selected in the stepwise analysis; this is probably because income correlated with predictors that entered earlier. The standardized coefficients and the structure matrix provide ways of evaluating the discriminant variables and the function(s) separating the groups.
Scroll down until the Canonical Discriminant Function Coefficients and Functions at Group Centroids are visible
In Figure 6.1 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficients for prediction purposes, you would simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut-point (by default the midpoint) between the two group means (centroids) along the discriminant
function (the cut-point appears in Figure 6.1). If the prospective customer's value is greater than the cut-point, you predict the customer will accept; if the score is below the cut-point, you predict the customer will not accept. This prediction rule is also easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios; for example, if we have a male with 16 years of education, at what age would such an individual be a good prospect? To answer this, we determine the age value that moves the discriminant score above the cut-point.
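The cut-point rule can be sketched as follows. Every value below (the constant, the three coefficients, and the two centroids) is hypothetical and chosen purely to illustrate the mechanics; none of them come from the model output in this chapter.

```python
# Hypothetical unstandardized coefficients and group centroids.
b_const, b_educ, b_gender, b_age = -4.0, 0.10, -0.35, 0.07
centroid_no, centroid_yes = -0.42, 0.44
cut_point = (centroid_no + centroid_yes) / 2   # midpoint between the centroids

def predict(educ, gender, age):
    score = b_const + b_educ * educ + b_gender * gender + b_age * age
    return "accept" if score > cut_point else "not accept"

# The "what if" question above: a male (code 0) with 16 years of education
for age in (30, 40, 50):
    print(age, predict(16, 0, age))
```

With these illustrative coefficients the prediction flips from "not accept" to "accept" somewhere between ages 30 and 40, which is exactly the kind of threshold the "what if" exercise in the text is asking you to find.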
Scroll down until you see the Classification Function Coefficients
Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 - 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes. We did not test for the normality assumptions of discriminant analysis in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.
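A sketch of this rule in code: the No-group coefficients below are the ones quoted in the text, while the Yes-group set is invented for illustration, since those values are not given here.

```python
# "no" coefficients come from the text above; the "yes" set is hypothetical.
fisher = {
    "no":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "const": -20.85},
    "yes": {"educ": 2.10, "gender": 1.80, "age": 0.41, "const": -24.50},
}

def classify(educ, gender, age):
    scores = {g: c["educ"] * educ + c["gender"] * gender + c["age"] * age + c["const"]
              for g, c in fisher.items()}
    return max(scores, key=scores.get), scores

group, scores = classify(16, 1, 30)   # the 30-year-old woman from the text
print(round(scores["no"], 2))         # 23.85, matching the hand calculation
```

Because each group gets its own linear function, this rule extends directly to three or more outcome categories, which is why the text recommends Fisher's coefficients in that situation.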
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
ID: ID number
AGE: Age
INCOME: Income
GENDER: Gender
MARITAL: Marital status
NUMKIDS: # of dependent children
NUMCARDS: # of credit cards
HOWPAID: Paid M/Wkly
MORTGAGE: Mortgage
STORECAR: # of store cards held
LOANS: # other loans
RISK: Credit risk category
INCOME1K: Income in thousands of British pounds (field derived within Clementine)
1. Begin with a clear Stream canvas. Place an SPSS File source node on the canvas and connect it to Credit.sav.
2. Attach a Type node to the Source node, and a Table node to the Type node. Execute the stream and allow Clementine to automatically type the fields.
3. Attach a SetToFlag node to the Type node and create separate dummy fields for each category of the marital field. Make sure that you code the True value as 1 and the False value as 0. This is important because Discriminant expects numeric data for the inputs.
4. Attach a Type node to the SetToFlag node.
5. Edit the second Type node and change the direction for risk to Out, and to None for ID, marital, income1k, and marital_3, or a reference field of your choice. Leave the direction as In for all the rest of the fields.
6. Attach a Discriminant node to the second Type node and run the analysis. How many classification functions are significant? What variables are important predictors?
Data
We use a file containing information about the amount of solid waste in thousands of tons (WASTE) in various locations along with information about land use, including number of acres used for industrial work (INDUST), fabricated metals (METALS), trucking and wholesale trade (TRUCKS), retail trade (RETAIL), and restaurants and hotels (RESTRNTS). The data set appears in Chatterjee and Hadi (1988, Sensitivity Analysis in Linear Regression. New York: Wiley).
7.1 Introduction
Although it is used as an analysis technique in its own right, in this chapter we discuss principal components primarily as a data reduction technique in support of statistical predictive modeling (for example, regression or logistic regression) and clustering. We first review the role of principal components and factor analysis in segmentation and prediction studies, and then discuss what to look for when running these techniques. Some background principles will be covered along with comments about popular factor methods. We provide some overall recommendations. We will perform a principal components analysis on a set of fields recording different types of land usage, all of which are to be used to predict the amount of waste produced from that land.
7.2 Use of Principal Components for Prediction Modeling and Cluster Analyses
In the areas of segmentation and prediction, principal components and factor analysis typically serve in the ancillary role of reducing the many fields available to a core set of composite fields (components or factors) that are used by cluster, regression or logistic regression. Statistical prediction models such as regression, logistic regression, and discriminant analysis, when run with highly correlated input fields, can produce unstable coefficient estimates (the problem of near multicollinearity). In these models, if any input field can be almost or perfectly predicted from a linear combination of the other inputs (near or pure multicollinearity), the estimation will either fail or be badly in error. Prior data reduction using factor or principal components analysis is one approach to reducing this risk. Although we have described this problem in the context of statistical prediction models, neural network coefficients can become unstable under these circumstances. However, since the
interpretation of neural network coefficients is relatively rarely done, this issue is less prominent. Rule induction methods will run when predictors are highly related. However, if two numeric predictors are highly correlated and have about the same relationship to the output, then the predictor with the slightly stronger relationship to the output will enter into the model. The other predictor is unlikely to enter into the model, since it contributes little in addition to the first predictor. While this may be adequate from the perspective of accurate prediction, the fact that the first field entered the model, while the second didn't, could be taken to mean that the first was important and the second was not. However, if the first were removed, the second predictor would have performed nearly as well. Such relationships among inputs should be revealed as part of the data understanding and data preparation step of a data mining project. If this were not done, or if it were done inadequately, then the data reduction performed by principal components or factor analysis might be necessary (for statistical methods) and helpful (for both statistical and machine learning methods). In some surveys done for segmentation purposes, dozens of customer attitude measures or product attribute ratings may be collected. Although cluster analysis can be run using a large number of cluster fields, two complications can develop. First, if several fields measure the same or very similar characteristics and are included in a cluster analysis, then what they measure is weighted more heavily in the analysis. For example, suppose a set of rating questions about technical support for a product is used in a cluster analysis with other unrelated questions.
Since distance calculations used in the Clementine clustering algorithms are based on the differences between observations on each field, then other things being equal, the set of related items would carry more weight in the analysis. To exaggerate to make a point, if two fields were identical copies of each other and both were used in a cluster analysis, the effect would be to double the influence of what they measure. In practice you rarely ask the same number of rating questions about each attribute (or psychographic) area. So principal components and factor analysis are used to either explicitly combine the original input fields into independent composite fields, to guide the analyst in constructing subscales, or to aid in selection of representative sets of fields (some analysts select three fields strongly related to each factor or component to be used in cluster analysis). Clustering is then performed on these fields. A second reason factor or principal components might be run prior to clustering is for conceptual clarity and simplification. If a cluster analysis were based on forty fields, it would be difficult to look at so large a table of means or a line chart and make much sense of them. As an alternative, you can perform rule induction to identify the more influential fields and summarize those. If factor or principal components analysis is run first, then the clustering is based on the themes or concepts measured by the factors or components. Or, as mentioned above, clustering can be done on equal-sized sets of fields, where each set is based on a factor. If the factors (components) have a ready interpretation, it can be much easier to understand a solution based on five or six factors, compared to one based on forty fields. As you might expect, factor and principal components analyses are more often performed on soft measures (attitudes, beliefs, and attribute ratings) and less often on behavioral measures like usage and purchasing patterns.
Keep in mind that factor and principal components analysis are considered exploratory data techniques (although there are confirmatory factor methods; for example, Amos or LISREL can be used to test specific factor models). So as with cluster analysis, do not expect a definitive, unassailable answer. When deciding on the number and interpretation of factors or components, domain knowledge of the data, common sense, and a dose of hard thinking are very valuable.
7.3 What to Look for When Running Principal Components or Factor Analysis
There are two main questions that arise when running principal components and factor analysis: how many (if any) components are there, and what do they represent? Most of our effort will be directed toward answering them. These questions are related because, in practice, you rarely retain factors or components that you cannot identify and name. Although the naming of components has rarely stumped a creative researcher for long, which has led to some very odd-sounding components, it is accurate enough to say that interpretability is one of the criteria when deciding to keep or drop a component. When choosing the number of components, there are some technical aids (eigenvalues, percentage of variance accounted for) we will discuss, but they are guides and not absolute criteria. To interpret the components, a set of coefficients, called loadings or lambda coefficients, relating the components (or factors) to the original fields, is very important. They provide information as to which components are highly related to which fields and thus give insight into what the components represent.
7.4 Principles
Factor analysis operates (and principal components usually operates) on the correlation matrix relating the numeric fields to be analyzed. The basic argument is that the fields are correlated because they share one or more common components; if they didn't correlate, there would be no need to perform factor or component analysis. Mathematically, a one-factor (or component) model for three fields can be represented as follows (the Vs are fields (or variables), F is a factor (or component), and the Es represent error variation that is unique to each field, uncorrelated with the F component and with the E components of the other variables):
V1 = L1*F1 + E1
V2 = L2*F1 + E2
V3 = L3*F1 + E3
Each field is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2, L3, the lambdas) plus a unique or random component. If the factor were measurable directly (which it isn't), this would be a simple regression equation. Since these equations can't be solved as given (the Ls, Fs and Es are unknown), factor and principal components analysis take an indirect approach. If the equations above hold, then consider why fields V1 and V2 correlate. Each contains a random or unique component that cannot contribute to their correlation (the Es are assumed to have 0 correlation). However, they share the factor F1, and so if they correlate, the correlation should be related to L1 and L2 (the factor loadings). When this logic is applied to all the pairwise correlations, the loading coefficients can be estimated from the correlation data. One factor may account for the correlations between the fields, and if not, the equations can be easily generalized to accommodate additional factors. There are a number of approaches to fitting factors to a correlation matrix (least squares, generalized least squares, maximum likelihood), which has given rise to a number of factor methods. What is a factor? In market research, factors are usually taken to be underlying traits, attitudes or beliefs that are reflected in specific rating questions. You need not believe that factors or components actually exist in order to perform a factor analysis, but in practice the factors are usually interpreted, given names, and generally spoken of as real things.
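This logic can be checked by simulation: generate data from the one-factor model with known loadings and confirm that the correlation between two fields is determined by their loadings alone. The loadings below are hypothetical, and the implied-correlation formula assumes F1 and the Es are independent standard normal variables (so each field has variance Li² + 1).

```python
import math
import random

random.seed(42)
L1, L2, L3 = 0.8, 0.7, 0.6          # hypothetical loadings
n = 50_000

rows = []
for _ in range(n):
    f = random.gauss(0, 1)                       # common factor F1
    rows.append([L * f + random.gauss(0, 1)      # Vi = Li*F1 + Ei
                 for L in (L1, L2, L3)])

def corr(i, j):
    xi = [r[i] for r in rows]; xj = [r[j] for r in rows]
    mi, mj = sum(xi) / n, sum(xj) / n
    cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj)) / n
    return cov / math.sqrt(sum((a - mi) ** 2 for a in xi) / n
                           * sum((b - mj) ** 2 for b in xj) / n)

# The model implies corr(V1, V2) = L1*L2 / sqrt((L1^2 + 1) * (L2^2 + 1))
implied = L1 * L2 / math.sqrt((L1 ** 2 + 1) * (L2 ** 2 + 1))
print(round(corr(0, 1), 2), round(implied, 2))
```

The simulated and implied correlations agree closely, which is exactly the property factor analysis exploits when it estimates the loadings from an observed correlation matrix.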
Principal components analysis attempts to account for the maximum amount of variation in the set of fields. Since the diagonal of a correlation matrix (the ones) represents standardized variances, each principal component can be thought of as accounting for as much as possible of the variation remaining in the diagonal. Factor analysis, on the other hand, attempts to account for correlations between the fields, and therefore its focus is more on the off-diagonal elements (the correlations). So while both methods attempt to fit a correlation matrix with fewer components or factors than fields, they differ in what they focus on when fitting. Of course, if a principal component accounts for most of the variance in fields V1 and V2 , it must also account for much of the correlation between them. And if a factor accounts for the correlation between V1 and V2 , it must account for at least some of their (common) variance. Thus, there is definitely overlap in the methods and they usually yield similar results. Often factor is used when there is interest in studying relations among the fields, while principal components is used when there is a greater emphasis on data reduction and less on interpretation. However, principal components is very popular because it can run even when the data are multicollinear (one field can be perfectly predicted from the others), while most factor methods cannot. In data mining, since data files often contain many fields likely to be multicollinear or near multicollinear, principal components is used more often. This is especially the case if statistical modeling methods, which will not run with multicollinear predictors, are used. Both methods are available in the PCA/Factor node; by default, the principal components method is used.
In the correlation matrix in Figure 8.1, there are five fields and therefore 5 units of standardized variance to be accounted for. Each eigenvalue measures the amount of this variance accounted for by a factor. This leads to a rule of thumb and a useful measure to evaluate a given number of factors. The rule of thumb is to select as many factors as there are eigenvalues greater than 1. Why? If the eigenvalue represents the amount of standardized variance in the fields accounted for by the factor, then if it is above 1, it must represent variance contained in more than one field. This is because the maximum amount of standardized variance contained in a single field is 1. Thus, if in our five-field analysis the first eigenvalue were 3, it must account for variation in several fields. Now, an eigenvalue can be less than 1 and still account for variation shared among several fields (for example, 30% of the variation of each of three fields for an eigenvalue of .9), so the eigenvalue-of-1 rule is only applied as a rule of thumb. Another aspect of eigenvalues (for principal components and some factor methods) is that their sum is the same as the number of fields, which is equal to the total standardized variance in the fields. Thus you can convert the eigenvalue into a measure of percentage of explained variance, which is helpful when evaluating a solution. Finally, it is important to mention that in applications in which you need to be able to interpret the results, the components must make sense. For this reason, factors with eigenvalues over 1 that cannot be interpreted may be dropped, and those with eigenvalues less than 1 may be retained.
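The eigenvalue arithmetic described above is simple to sketch. The eigenvalues below are hypothetical values for a five-field analysis (they must sum to the number of fields):

```python
num_fields = 5                              # total standardized variance = 5
eigenvalues = [3.0, 0.9, 0.6, 0.3, 0.2]     # hypothetical; sums to num_fields

# Convert each eigenvalue to the percentage of total variance it explains
pct_variance = [round(e / num_fields * 100, 1) for e in eigenvalues]
print(pct_variance)                         # [60.0, 18.0, 12.0, 6.0, 4.0]

# The eigenvalue-greater-than-1 rule of thumb keeps only the first factor here
retained = [e for e in eigenvalues if e > 1]
print(retained)                             # [3.0]
```

Note that the second eigenvalue (.9) falls below the rule-of-thumb cutoff even though, as the text explains, it could still represent variation shared among several fields.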
7.7 Rotations
When factor analysis succeeds, you obtain a relatively small number of interpretable factors that account for much of the variation in the original set of fields. Suppose you have eight fields and factor analysis returns a two-factor solution. Formally, the factor solution represents a two-dimensional space. Such a space can be represented with a pair of axes, as shown below. While each pair of axes defines the same two-dimensional space, the coordinates of a point vary depending on which pair of axes is applied. This creates a problem for factor methods, since the values of the loadings (or lambda coefficients) vary with the orientation of the axes, and there is no unique orientation defined by the factor analysis itself. Principal components does not suffer from this problem, since its method produces a unique orientation.

This difficulty for factor analysis is a fundamental mathematical problem. The solutions to it are designed to simplify the task of interpretation for the analyst. Most involve, in some fashion, finding a rotation of the axes that maximizes the variance of the loading coefficients, so that some are large and some small. This makes it easier for the analyst to interpret the factors. This is the best that can currently be done, but the fact that factor loadings are not uniquely determined by the method is a valid criticism leveled against it by some statisticians. We will discuss the various rotational schemes in the Methods section below.
7.10 Methods
There are several popular methods within the domain of factor and principal components analysis. The common factor methods differ in how they go about fitting the correlation matrix. A traditional method that has been around for many years (for some, it is synonymous with factor analysis) is the principal axis factor method (often abbreviated as PAF). A more modern method that carries some technical advantages is maximum likelihood factor analysis. If the data are ill behaved (say, near multicollinear), maximum likelihood, being the more refined method, is more prone to give wild solutions. In most cases results from the two methods will be very close, so either is fine under general circumstances. If you suspect there are problems with your data, then principal axis may be a safer bet.

The other factor methods are considerably less popular. One, called Q factor analysis, involves transposing the data matrix and then performing a factor analysis on the records instead of the fields. Essentially, correlations are calculated for each pair of records based on the values of the input fields. This technique is related to cluster analysis, but is used infrequently today. Besides the factor methods, principal components can be run and, as mentioned earlier, must be run when the inputs are multicollinear.

Similarly, there are several choices of rotation. The most popular by far is the varimax rotation, which attempts to simplify the interpretation of the factors by maximizing the variance of the input fields' loadings on each factor. In other words, it attempts to find a rotation in which some fields have high and some low loadings on each factor, which makes it easier to understand and name the factors. The quartimax rotation attempts to simplify the interpretation of each field in terms of the factors by finding a rotation yielding high and low loadings across factors for each field. The equimax rotation is a compromise between the varimax and quartimax rotation methods.
These three rotations are orthogonal, which means the axes are perpendicular to each other and the factors will be uncorrelated. This is considered a desirable feature, since statements can then be made about independent factors or aspects of the data. There are nonorthogonal rotations available (axes are not perpendicular); popular ones are oblimin and promax (which runs faster than oblimin). Such rotations are rarely used in data mining, since the point of data reduction is to obtain relatively independent composite measures, and it is easier to speak of independent effects when the factors are uncorrelated. Finally, principal components does not require a rotation, since there is a unique solution associated with it. However, in practice, a varimax rotation is sometimes done to facilitate the interpretation of the components.
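As a sketch of what a varimax rotation does, the function below implements the widely used SVD-based varimax algorithm in NumPy. The loadings matrix is made up for illustration; a real analysis would rely on the rotation built into the PCA/Factor node.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loadings matrix to maximize the varimax
    criterion: the variance of the squared loadings within each factor."""
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion; the SVD projects it back
        # onto the set of orthogonal rotation matrices.
        grad = loadings.T @ (
            rotated ** 3
            - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))
        )
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        if s.sum() < total * (1 + tol):
            break                              # criterion stopped improving
        total = s.sum()
    return loadings @ rotation

# Hypothetical unrotated loadings for 5 fields on 2 factors.
L = np.array([[0.7, 0.5], [0.8, 0.4], [0.6, 0.5],
              [0.5, -0.6], [0.4, -0.7]])
L_rot = varimax(L)
```

Because the rotation is orthogonal, each field's communality (row sum of squared loadings) is unchanged; only the distribution of high and low loadings across the two factors changes.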
The INDUST, METALS, TRUCKS, RETAIL, and RESTRNTS fields (which measure the number of acres of a specific type of land usage) will be used as inputs to predict the amount of solid waste (WASTE).
Close the Type node window
Double-click on the Regression node named WASTE at the top of the Stream canvas
Click the Expert tab, and then click the Expert option button
Click the Output button, and then click the Descriptives check box (so it is checked)
Predictive Modeling With Clementine Figure 7.4 Requesting Descriptive Statistics in a Linear Regression Node
In anticipation of checking for correlations among the inputs (something we recommend in any case), we request descriptive statistics (Descriptives). This will display correlations for all the fields in the analysis. (Note that we could have obtained these correlations from the Statistics node.) We can obtain more technical information about correlated predictors by checking the Collinearity Diagnostics check box.
Click OK, and then click the Execute button
Right-click the Regression generated model node named Waste in the Models Manager window, then click Browse
Expand the Analysis topic in the Summary tab
The estimated regression equation appears in the Summary tab; notice that two of the inputs have negative coefficients.
All correlations are positive and there are high correlations between the METALS and TRUCKS fields (.893) and between the RESTRNTS and RETAIL fields (.920). Since some of the inputs are highly correlated, this might create stability problems (large standard errors) for the estimated regression coefficients due to near multicollinearity.
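One standard way to quantify this concern is the variance inflation factor (VIF), which can be read directly off the diagonal of the inverted correlation matrix of the predictors. The NumPy sketch below uses simulated data, not the chapter's land-usage file; the near-copy relationship between the first two predictors mimics the high METALS/TRUCKS and RESTRNTS/RETAIL correlations.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Simulated predictors: x2 is nearly a copy of x1 (near multicollinearity),
# while x3 is independent of both.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.15, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# The diagonal of the inverse correlation matrix gives each predictor's
# variance inflation factor: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes
# from regressing predictor j on the other predictors.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
```

The near-duplicated predictors produce large VIFs (a common rule of thumb flags values above 10), while the independent predictor's VIF stays near 1.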
Scroll to the Model Summary table
The regression model with five predictors accounted for about 83% (adjusted R Square) of the variation in the output field (waste).

Scroll to the Coefficients table
Two of the significant coefficients (INDUST and RETAIL) have negative regression coefficients, although they correlate positively (see Figure 7.6) with the output field. Although there might be a valid reason for this to occur, this, coupled with the fact that RETAIL is highly correlated with another predictor, is suspicious. Also, those familiar with regression should note that the estimated beta coefficient for RESTRNTS is above 1, another sign of near multicollinearity. It is possible that this situation could have been avoided if a stepwise method had been used (this is left as an exercise). However, we will take the position that the current set of inputs is exhibiting signs of near multicollinearity, and we will run principal components in an attempt to improve the situation. Before proceeding, let's examine how well this model fits the data.
Close the Regression browser window
Double-click the PCA/Factor model node (named Factor) in the stream canvas
In Simple mode (see Expert tab), the only options involve selection of the factor extraction method (some of these were discussed in the Methods section). Notice that Principal Components is the default method.
Click the Execute button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse
Five principal components were found. Since there were originally five input fields, reducing them to five principal components does not constitute data reduction (but it does solve the problem of multicollinearity). If the solution were successful, we would expect that the variation within the five input fields would be concentrated in the first few components and we could check this by examining the Advanced tab of the browser window. However, instead we will use the Expert options to have the PCA/Factor node select an optimal number of principal components.
Close the PCA/Factor browser window
Double-click on the PCA/Factor model node named Factor
Click the Expert tab, and then click the Expert Mode option button
The Extract factors option indicates that while in Expert mode, PCA/Factor will select as many factors as there are eigenvalues over 1 (we discussed this rule of thumb earlier in the chapter). You can change this rule or specify a number of factors; this might be done if you prefer more or fewer factors than the eigenvalue rule provides. By default, the analysis will be performed on the correlation matrix; principal components can also be applied to covariance matrices, in which case fields with greater variation will have more weight in the analysis. This is really all we need to proceed, but let's examine the other Expert options.

Notice that the Only use complete records check box becomes active when Expert Mode is selected. By default, PCA/Factor will only use records with complete information on the input fields. If this option is not checked, then a pairwise technique is used: for a record with missing values on one or more fields used in the analysis, the fields with valid values will still be used. However, the created factor score fields will be set to $null$ for these records. Also, substantial amounts of missing data, when Only use complete records is not selected, can lead to numeric instabilities in the algorithm.

The Sort values check box in the Component/Factor format section will have PCA/Factor list the fields in descending order by their loading coefficients on the factor/component on which they load highest. This makes it very easy to see which fields relate to which factors and is especially useful when many input fields are involved. To further aid this effort, by suppressing loading coefficients less than .3 in absolute value (the Hide values below option), you will only see the larger loadings (small values are replaced with blanks) and not be distracted by small loadings. Although not required, these options make the interpretive task much easier when many fields are involved.
By default, no rotation is performed, which is often the case when principal components is run. The Delta and Kappa text boxes control aspects of the Oblimin and Promax rotation methods, respectively.
Click Cancel
Click the Execute button
Right-click the PCA/Factor generated model node, named Factor, in the Models Manager window, then click Browse
Click the Model tab
Figure 7.13 PCA/Factor Browser Window (Two-Component Solution)
The PCA/Factor browser window contains the equations to create component (in this case) or factor score fields from the inputs. Two components were selected based on the eigenvalue greater than 1 rule (recall five were selected in the original analysis under the Simple mode). The coefficients are so small because the components are normalized to have means of 0 and standard deviations of 1, while most inputs have values that extend into the thousands. To interpret the components, we turn to the advanced output.
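This normalization can be illustrated directly. In the NumPy sketch below (simulated inputs on very different scales, standing in for acreage fields with values in the thousands; not the chapter's data), the inputs are standardized and weighted by eigenvector over square root of eigenvalue, so the resulting component scores come out with mean 0 and standard deviation 1, which is why the raw-scale coefficients in the browser look so small.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical raw inputs on very different scales, driven by two
# underlying dimensions.
base = rng.normal(size=(n, 2))
X = np.column_stack([1000 * base[:, 0] + rng.normal(size=n),
                     800 * base[:, 0] + rng.normal(size=n),
                     50 * base[:, 1] + rng.normal(size=n)])

# Standardize, then weight by eigenvector / sqrt(eigenvalue): the
# resulting component scores are normalized to mean 0, std 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]              # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :2] / np.sqrt(eigvals[:2])      # score coefficients, 2 components
scores = Z @ W
```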
Click the Advanced tab
Scroll to the Communalities table in the Expert Output browser window
The communalities represent the proportion of variance in an input field explained by the factors (here, principal components). Since initially as many components are fit as there are inputs, the communalities in the first column (Initial) are trivially 1. They are of interest when a solution is reached (Extraction column). Here the communalities are below 1 and measure the proportion of variance in each input field that is accounted for by the selected number of components (two). Any fields having very small communalities (say, .2 or below) have little in common with the other inputs, and are neither explained by the components (or factors), nor contribute to their definition. Of the five inputs, all but INDUST have a large proportion of their variance accounted for by the two components, and INDUST itself has a communality of .44 (44%).
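The communality arithmetic is simple to sketch: loadings are eigenvectors scaled by the square root of their eigenvalues, and a field's communality is the row sum of its squared loadings. The correlation matrix below is made up for illustration (the same hypothetical values used earlier in the chapter's eigenvalue sketch, not the land-usage data).

```python
import numpy as np

# Illustrative 5-field correlation matrix (hypothetical values).
R = np.array([
    [1.0, 0.8, 0.7, 0.2, 0.1],
    [0.8, 1.0, 0.6, 0.3, 0.2],
    [0.7, 0.6, 1.0, 0.2, 0.3],
    [0.2, 0.3, 0.2, 1.0, 0.7],
    [0.1, 0.2, 0.3, 0.7, 1.0],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component loadings: eigenvectors scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

# Keeping all five components, each field's communality (row sum of
# squared loadings) is trivially 1; keeping only two, it drops below 1.
communality_all = (loadings ** 2).sum(axis=1)
communality_2 = (loadings[:, :2] ** 2).sum(axis=1)
```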
Scroll to the Total Variance Explained table in the Advanced tab of the browser window
The Initial eigenvalues area contains all (5) eigenvalues, along with the percentage of variance (of the fields) explained by each and a cumulative percentage of variance. We see in the Extraction Sums of Squared Loadings section that there are two eigenvalues over 1, the first being about twice the size of the second. Two components were selected, and they collectively account for about 82 percent of the variance of the 5 inputs. The third eigenvalue is .73, which might be explored as a third component if more input fields were involved (reducing from five fields to three components is not much of a reduction). The remaining two components (fourth and fifth) are quite small. While not pursued here, in practice we might try out a solution with a different number of components.
PCA/Factor next presents the Component (or Factor) Matrix, which contains the unrotated loadings. If a rotation were requested, this table would appear in addition to a table containing the rotated loadings. The input fields form the rows and the components (or factors, if a factor method were run) form the columns. The values in the table are the loadings. If any loading were below .30 (in absolute value), blanks would appear in its position due to our option choice. While it makes no difference here, the option helps focus attention on the larger loadings (absolute value closer to 1).

The first component seems to be a general component, having positive loadings on all the input fields (recall that they all correlated positively; see Figure 7.6). In some sense, it could represent the total (weighted) amount of land used in these activities. The second component has both positive and negative coefficients, and seems to represent the difference between land usage for trucking and wholesale trade, fabricated metals, and industrial work, versus retail trade, restaurants and hotels. This might be considered a contrast between manufacturing/industrial and service-oriented use of land. This pattern, all fields with positive loadings on the first component (factor) and contrasting signs on coefficients of the second and later components (factors), is fairly common in unrotated solutions. If we requested a rotation, the fields would group into the two rotated components according to their signs on the second component.

We should note that when interpreting components or factors, the loading magnitude is important; that is, fields with greater loadings (in absolute value) are more closely associated with the components and are more influential when interpreting them. We know that the two components account for 82 percent of the variation of the original input fields (a substantial amount), and that we can interpret the components.
Now we will rerun the linear regression with the components as inputs.
Close the PCA/Factor browser window
Double-click on the Type node located to the right of the PCA/Factor generated model node named Factor
Figure 7.17 Type Node Set Up for Principal Components Regression
The two component score fields ($F-Factor-1, $F-Factor-2) are the only fields that will be used as inputs; the original land usage fields have their direction set to None. If both the land usage fields and the component score fields were inputs to the linear regression, we would have only exacerbated the near multicollinearity problem (as an exercise, explain why).
Close the Type node window
Execute the Regression model node, named Waste, located in the lower right section of the Stream canvas
Right-click the Regression generated model node named Waste in the Models Manager, then click Browse
Click the Summary tab
Expand the Analysis topic
Figure 7.18 Linear Regression (Using Components as Inputs) Browser Window
The prediction equation for waste is now in terms of the two principal component fields. Notice that the coefficient for the second component has a negative sign, which we will consider when examining the expert output.
Click the Advanced tab
Scroll to the Model Summary table
The regression model with two principal component fields as inputs accounts for about 73% of the variance (adjusted R square) in the Waste field. This compares with the 83% in the original analysis (Figure 8.8). Essentially, we are giving up about 10 percentage points of explained variance to gain more stable coefficients and possibly a simpler interpretation. The requirements of the analysis determine whether this tradeoff is acceptable.
Scroll to the Coefficients table
Figure 7.20 Coefficients Table (Principal Components Regression)
Both components are statistically significant. The positive coefficient for $F-Factor-1 indicates, not surprisingly, that as overall land usage increases, so does the amount of waste. The negative coefficient for the second component (which represented a contrast of land use for manufacturing/industrial versus service-oriented) indicates that, controlling for total land usage, as the amount of manufacturing/industrial land use increases relative to service-oriented usage, waste production goes down. Or, to put it another way, as service-oriented land use increases relative to manufacturing/industrial, waste production increases. As mentioned before, the interpretation of the component, and thus the regression, results might be made easier by rotating the components (say, using a varimax rotation).

Notice that the components, unlike the original fields (see Figure 7.7), have no beta coefficients above 1, indicating that the potential problem with near multicollinearity has been resolved. It is important to note that while we have shifted from a regression with five inputs to a regression with two components, the five inputs are still required to produce predictions, because they are needed to create the component score fields.
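The complete principal components regression workflow can be sketched end to end. The NumPy example below uses simulated data (not the chapter's land-usage file): five near-multicollinear inputs driven by two underlying dimensions are reduced to two component score fields, which are then used as predictors in an ordinary least squares regression.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100

# Simulated inputs: five fields driven by two underlying dimensions,
# so they are near-multicollinear (illustrative data only).
f = rng.normal(size=(n, 2))
X = f @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(n, 5))
y = f[:, 0] - 0.5 * f[:, 1] + 0.2 * rng.normal(size=n)

# Step 1: build two standardized component score fields from the inputs.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Z @ (eigvecs[:, order[:2]] / np.sqrt(eigvals[order[:2]]))

# Step 2: ordinary least squares on the two component scores.
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r_squared = 1 - resid.var() / y.var()
```

As in the chapter, the original five inputs are still needed at prediction time, since the component scores are computed from them.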
Additional Readings
Those interested in learning more about factor and principal components analysis might consider the book by Kline (1994), Jae-On Kim's introductory text (1978) and his book with Charles W. Mueller (1979), and Harry Harman's revised text (1979) (see the References section).
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
INDUST      Acreage (US) used for industrial work
METALS      Acreage used for fabricated metal
TRUCKS      Acreage used for trucking and wholesale trade
RETAIL      Acreage used for retail trade
RESTRNTS    Acreage used for restaurants and hotels
WASTE       Amount of solid waste produced
1. Working with the current stream from the chapter, request a Varimax rotation of the principal components analysis. Interpret the component coefficients. Use the component score fields from this generated model node as inputs to the Regression node predicting waste. Does the R square change? Explain this. Do the regression coefficients change? How would you interpret them?

2. With the same data, use the Extraction Method drop-down list in the PCA/Factor node to run a factor analysis instead (using principal axis factoring or maximum likelihood) with no rotation. Compare the results to those obtained by the principal components analysis in the chapter. Are they similar? In what way do they differ? Now rerun the factor analysis, requesting a varimax rotation. How do these results compare to those obtained in the first exercise? Do you find anything that leads you to prefer one to the other?
8.1 Introduction
It is often essential for organizations to plan ahead, and planning requires forecasting future events to ensure a smooth transition into the future. To minimize errors when planning, it is necessary to collect information, on a regular basis over time, on any factors that may influence those plans. Once a catalogue of past and current information has been collected, patterns can be identified, and these patterns help make forecasts into the future. Even though many organizations collect historic information relevant to the planning process, forecasts are often made on an ad hoc basis. This often leads to large forecasting errors and costly mistakes in the planning process. Statistical techniques provide a more scientific basis for forecasts. By using these techniques, a more structured approach can be taken to ensure careful planning, which will reduce the chance of making costly errors. Statisticians have developed a whole area of statistical techniques, known as time series analysis, which is devoted to forecasting.
Examples
In order to understand how time series analysis works it is useful to give an example. Suppose that a company wishes to forecast the growth of its sales into the future. The benefit of making the forecast is that if the company has an idea of future sales it can plan the production process for its product. In doing so, it can minimize the chances of under producing and having product shortages or, alternatively, overproducing and having excess stock which will need to be stored at additional cost. Prior to being able to make the forecast, the company will need to collect information on its sales over time in order to gain a full picture of how sales have changed in the past. Once this information has been collected it is possible to plot how sales change over time. An example of this is shown in Figure 8.1. Here information on the sales of a product has been collected each month from January 1982 until December 1995.
Time Series Analysis and Forecasting with SPSS Trends Figure 8.1 Plot of Sales Over Time
This is a simple example that demonstrates the idea of a time series. Time series analysis looks at changes over time. Any information collected over time is known as a time series; a time series is usually numerical information collected over time on a regular basis. One of the most common uses of time series analysis is to forecast future values of a series. There are a number of statistical time series techniques which can be used to make forecasts into the future. In the above example, the forecast would be the future values of sales. Some time series methods can also be used to identify which factors have been important in affecting the series you wish to forecast; for example, to determine whether an advertising campaign has had a significant and beneficial effect on sales. It is also possible to use time series analysis to quantify the likely impact of a change in advertising expenditure on future sales. Other examples of time series analysis and forecasting include:

Governments using time series analysis to predict the effects of government policies on inflation, unemployment and economic growth.

Traffic authorities analyzing the effect on traffic flows following the introduction of parking restrictions in city centers.

Analyses of how stock market prices change over time. By being able to predict when stock market prices will rise or fall, decisions can be made about the right times to buy and sell shares.
Companies predicting the effects of pricing policies or increased advertising expenditure on the sales of their product.

A company wishing to predict the number of telephone calls at different times during the day, so it can arrange the appropriate level of staffing.
Time series analysis is used in many areas of business, commerce, government and academia, and its value cannot be overstated. A number of time series techniques can be found within the Time Series node in Clementine. This node provides analysts with both a flexible and powerful way to analyze time series data.
While regression can serve as a point of departure for both time series and econometric models, it is incumbent on you (the researcher) to generate the plots and statistics that give some indication of whether the assumptions are being met in a particular context.

Assumption 1 is concerned with the form of the specification of the model. Violations of this assumption include omission of important regressors (predictors), inclusion of irrelevant regressors, models nonlinear in the parameters, and varying-coefficient models.

When Assumption 2 is violated, the intercept is biased.

Assumption 3 assumes constant variance (homoscedasticity) and no autocorrelation. (Autocorrelation is the correlation of a variable with itself at a fixed time lag.) Violations of the assumption are the reverse: non-constant variance (heteroscedasticity) and autocorrelation.

Assumption 4 is often called the assumption of fixed or nonstochastic independent variables. Violations of this assumption include errors of measurement in the variables, use of lagged values of the dependent variable as regressors (common in time series analysis), and simultaneous equation models.

Assumption 5 has two parts. If the number of observations does not exceed the number of independent variables, then the problem has a necessary singularity and the coefficients are not estimable. If there are exact linear relationships between independent variables, software might protect you from the consequences. If there are near-exact linear relationships between your independent variables, you face the problem of multicollinearity.

In regression, parameters can be estimated by least squares. Least squares methods do not make any assumptions about the distribution of the disturbances. When you make the assumptions of the classical linear regression model and add to them the assumption that the disturbances are normally distributed, the regression estimators are maximum likelihood (ML) estimators.
It also can be shown that the least-squares methods produce Best Linear Unbiased estimates (BLU). The BLU and ML properties allow estimation of the standard errors of the regression coefficients and the standard error of the estimate, and therefore enable the researcher to do hypothesis testing and calculate confidence intervals.
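Assumption 5's two failure modes can be checked numerically. The NumPy sketch below (simulated data) shows an exact linear dependence collapsing the rank of the design matrix, so the coefficients are not estimable, while a near-exact dependence instead shows up as a very large condition number.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact linear dependence: the fourth column is 2*x1 - x2, so the
# design matrix has rank 3 rather than 4 and X'X is singular.
X_singular = np.column_stack([np.ones(n), x1, x2, 2 * x1 - x2])
rank = np.linalg.matrix_rank(X_singular)

# Near-exact dependence: a tiny perturbation restores full rank, but
# the huge condition number signals near multicollinearity.
X_near = np.column_stack(
    [np.ones(n), x1, x2, 2 * x1 - x2 + 1e-6 * rng.normal(size=n)])
cond = np.linalg.cond(X_near)
```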
Each column in the data editor corresponds to a given variable. The important point to note concerning the organization of time series data is that each row in the Table window corresponds to a particular period of time. Each row must therefore represent a sequential time period. The above example shows a data file containing monthly data for sales starting in January 1982. In order to use standard time series methods it is important to collect, or at least be able to summarize, the information over equal time periods. Within a time series data file it is essential that the rows represent equally spaced time periods. Even time periods for which no data was collected must be included as rows in the data file (with missing values for the variables).
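The requirement that every time period appear as a row, even when no data were collected, can be sketched in plain Python. The monthly sales values below are hypothetical, with one month deliberately missing.

```python
from datetime import date

# Observed (month, sales) pairs; March 1982 was never recorded.
observed = {date(1982, 1, 1): 100,
            date(1982, 2, 1): 110,
            date(1982, 4, 1): 125}

def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        m += 1
        if m == 13:
            y, m = y + 1, 1

# Every month becomes a row; months with no data get a missing value
# (None), keeping the series equally spaced.
series = [(d, observed.get(d))
          for d in month_range(date(1982, 1, 1), date(1982, 4, 1))]
```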
The simplest way of identifying patterns in your data is to plot the information over the relevant time period; such plots are essential for time series analysis. Several series can be displayed on the same chart. Points are joined up to display a line graph, which shows any patterns in your data. In our sales example, the interest might be to see how sales have changed over the fourteen-year period of interest.
Double-click on the Time Plot node to the right of the Time Intervals node to open it
Use the variable selector tool to select Sales
Click Execute
There is an option to Display series in separate panels which can be used to generate a separate chart for each series if you want to plot several of them at once. If you do not check this option, all variables are plotted on one chart. Figure 8.4 shows how sales have changed over the fourteen years.
Figure 8.4 Sequence Plot of Sales
The sequence chart is the most powerful exploratory tool in time series analysis, and it can be used to identify trend, seasonal and cyclical patterns in a time series. There is a clear regularity to the time series, and the volume of sales generally increases over time.
Trend Patterns
Trend refers to the smooth upward or downward movement characterizing a time series over a long period of time. This type of movement is particularly reflective of the underlying continuity of fundamental demographic and economic phenomena. Trend is sometimes referred to as secular trend, where the word secular is derived from the Latin word saeculum, meaning a generation or age. Hence, trend movements are thought of as long-term movements, usually requiring 15 or 20 years to describe (or the equivalent for series with more frequent time intervals). Trend movements might be attributable to factors such as population change, technological progress, and large-scale shifts in consumer tastes. For example, if we could examine a time series on the number of pairs of shoes produced in the United States extending annually, say, from the 1700s until the present, we would find an underlying trend of growth throughout the entire period, despite fluctuations around this general upward movement. If we compared the figures of the recent period against those near the
beginning of the series, we would find the recent numbers are much larger. This is because of the increase in population, because of the technical advances in shoe-producing equipment enabling vastly increased levels of production, and because of shifts in consumer tastes and levels of affluence, which have meant a larger per capita requirement of shoes than in the earlier time. In Figure 8.4 there is a clear upward trend in the data, as sales have continued to increase from 1982 until 1995, albeit less pronounced from the beginning of 1991.
Cyclical Patterns
Cyclical patterns (or fluctuations), or business cycle movements, are recurrent up and down movements around the trend levels which have a duration of anywhere from about 2 to 15 years. The duration of these cycles can be measured in terms of their turning points, in other words, from trough to trough or peak to peak. These cycles are recurrent rather than strictly periodic. The height and length (amplitude and duration) of cyclical fluctuations in industrial series differ from those of agricultural series, and there are differences within these categories and within individual series. Hence, cycles in durable goods activity generally display greater relative fluctuations than consumer goods activity, and a particular time series of, say, consumer goods activity may possess business cycles which have considerable variations in both duration and amplitude.

Economists have produced a large number of explanations of business cycle fluctuations, including external theories which seek the causes outside the economic system, and internal theories in terms of factors within the economic system that lead to self-generating cycles. Since it is clear from the foregoing discussion that there is no single simple explanation of business cycle activity, and that there are different types of cycles of varying length and size, it is not surprising that no highly accurate method of forecasting this type of activity has been devised. Indeed, no generally satisfactory mathematical model has been constructed for either describing or forecasting these cycles, and perhaps never will be.

Therefore, it is not surprising to find that classical time series analysis adopts a relatively rough approach to the statistical measurement of the business cycle. The approach is a residual one; that is, after trend and seasonal variations have been eliminated from a time series, by definition the remainder (or residual) is treated as being attributable to cyclical and irregular factors.
Since the irregular movements are by their very nature erratic and not particularly tractable to statistical analysis, no explicit attempt is usually made to separate them from cyclical movements, or vice versa. However, the cyclical fluctuations are generally large relative to these irregular movements and ordinarily no particular difficulty in description or analysis arises from this source. Therefore, unless you have data available over a long period of time, cyclic patterns are not usually fit by forecasting models.
Seasonal Patterns
Seasonal variations are periodic patterns of movement in a time series. Such variations are considered a type of cycle that completes itself within a calendar year and then repeats this basic pattern. The major factors behind seasonal patterns are weather and customs, where the latter term is broadly interpreted to include patterns in social behavior as well as the observance of holidays such as Christmas and Easter. Series of monthly or quarterly data are ordinarily used to examine these seasonal patterns. For example, regardless of trend or cyclical levels, one can observe in the United States that each year more ice cream is sold during the summer months than during the winter, whereas more fuel oil for home heating is consumed in the winter than during the summer months. Both of these cases illustrate the effect of climatic factors in determining seasonal patterns. Also, department store sales generally reveal a minor peak during the month in which Easter occurs and a larger peak in
December, when Christmas occurs, reflecting the shopping customs of consumers associated with these dates. Seasonal patterns need not be linked to a calendar year. For example, if we studied the daily volume of packages delivered by a private delivery service, the periodic pattern might well repeat weekly (heavier deliveries mid-week, lighter deliveries on the weekend). Here the period for the seasonal pattern could be seven days. Of course, if daily data were collected over several years, there may well be a yearly pattern as well, and just which time period constitutes a season is no longer clear. The number of time periods that occur during the completion of a seasonal pattern is referred to as the series periodicity. How often the time series data are collected usually depends on the type of seasonality that the analyst expects to find:

For hourly data, where data are collected once an hour, there is usually one seasonal pattern every twenty-four hours, so the periodicity is most likely 24.
For monthly data, where each month a new time period of data is collected, there is usually one seasonal pattern every twelve months, so the periodicity is likely 12.
For daily data, where data are collected once every day, there is usually one seasonal pattern per week, so the periodicity is 7 if the data refer to a seven-day week or 5 if no data are collected on Saturdays and Sundays.
For quarterly data, where data are collected once every three months, there is usually one seasonal pattern per year, so the periodicity is 4.
For annual data, where data are collected once a year, there is no seasonal pattern, so the periodicity is none (undefined).
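The periodicity rules above can be summarized as a simple lookup table; here is an illustrative Python sketch (the mapping mirrors the text, but the variable and key names are invented and are not part of SPSS Trends):

```python
# Typical periodicity by data collection interval, as described above.
# Illustrative summary only, not part of SPSS Trends.
periodicity = {
    "hourly": 24,       # one seasonal pattern every 24 hours
    "monthly": 12,      # one seasonal pattern every 12 months
    "daily_7day": 7,    # data collected every day of the week
    "daily_5day": 5,    # no data on Saturdays and Sundays
    "quarterly": 4,     # one seasonal pattern per year
    "annual": None,     # no seasonal pattern (periodicity undefined)
}
```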
Of course, changes can occur in seasonal patterns because of changing institutional and other factors. Hence, a change in the date of the annual automobile show can change the seasonal pattern of automobile sales. Similarly, the advent of refrigeration techniques, with the corresponding widespread use of home refrigerators, has brought about a change in the seasonal pattern of ice cream sales. The techniques for measuring seasonal variation that we will discuss are particularly well suited to the measurement of relatively stable patterns of seasonal variation, but can be adapted to cases of changing seasonal movements as well. In Figure 8.4, there appears to be a rise in sales during the early part of the year, while sales tend to fall to a low around November. Finally, there is some recovery in sales leading up to the Christmas period of each year.
Irregular Movements
Irregular movements are fluctuations in time series that are erratic in nature, and follow no regularly recurrent or other discernible pattern. These movements are sometimes referred to as residual variations, since, by definition, they represent what is left over in an economic time series after trend, cyclical, and seasonal elements have been accounted for. These irregular fluctuations result from sporadic, unsystematic occurrences such as wars, earthquakes, accidents, strikes, and the like. In the classical time series model, the elements of trend, cyclical, and seasonal variations are viewed as resulting from systematic influences leading to gradual growth, decline, or recurrent movements. Irregular movements, however, are considered to be so erratic that it would be fruitless to attempt to describe them in terms of a formal model. Irregular movements can result from a large number of causes of widely differing impact.
expenditure will have on future sales, or alternatively a $150,000 increase in advertising expenditure. The main drawbacks of causal time series models are that they require information on several variables in addition to the variable that is being forecast and usually take longer to develop. Furthermore, the model may require estimation of the future values of the independent factors before the dependent variable can be forecast.
8.6 Interventions
Time series may experience sudden shifts in level, upward or downward, as a result of external events. For example, sales volume may briefly increase as the result of a direct marketing campaign or a discount offering. If sales were limited by a company's capacity to manufacture a product, then bringing a new plant online would shift the sales level upward from that date onward. Similarly, changes in tax laws or pricing may shift the level of a series. The idea here is that some outside intervention resulted in a shift in the level of the series. In this context, a distinction is made between a pulse, a sudden, temporary shift in the series level, and a step, a sudden, permanent shift in the series level. A bad storm, or a one-time, 30-day rebate offer, might result in a pulse, while a change in legislation or a large competitor's entry into a market could result in a step change to the series. Time series models are designed to account for gradual, not sudden, change. As a result, they do not natively fit pulse and step effects very well. However, if you can identify events (by date) that you believe are associated with pulse or step effects, they can be incorporated into time series models (as intervention effects) and forecasts. Below we see an example of a pulse intervention. In April 1975 a one-time tax rebate occurred in an attempt to stimulate the US economy, then in recession. Note that the savings rate reached its maximum (9.7%) during this quarter. The intervention can be modeled and used in scenarios to assess the effect of a tax rebate on savings rates in the future.
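To make the pulse/step distinction concrete, here is a minimal Python sketch that builds the two kinds of intervention indicator variables for a toy series. The function names and values are invented for illustration and are not part of SPSS Trends:

```python
# Hypothetical helpers for building intervention indicator variables.
def pulse(n, event_index):
    """1 at the event period only (a sudden, temporary shift)."""
    return [1 if t == event_index else 0 for t in range(n)]

def step(n, event_index):
    """1 from the event period onward (a sudden, permanent shift)."""
    return [1 if t >= event_index else 0 for t in range(n)]

# A 6-period toy series with an event at period 3:
pulse_var = pulse(6, 3)   # [0, 0, 0, 1, 0, 0]
step_var = step(6, 3)     # [0, 0, 0, 1, 1, 1]
```

Such indicator series are what an intervention effect ultimately contributes to the model: a one-period bump for a pulse, a lasting level change for a step.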
Figure 8.5 U.S. Savings Rate (Seasonally Adjusted), Tax Rebate in April 1975
developed will depend upon the seasonal and trend patterns inherent in the series you wish to forecast. An analyst building a model might simply observe the patterns in a sequence chart to decide which type of exponential smoothing model is the most promising one to generate forecasts. In SPSS Trends, when the Expert Modeler examines the series, it considers all appropriate exponential smoothing models when searching for the most promising time series model. Simple exponential smoothing (no trend, no seasonality) can be described in two algebraically equivalent ways. One common formula, known as the recurrence form, is as follows:
S(t) = α * y(t) + (1 - α) * S(t - 1)
Also, the forecast is:

y(m) = S(t)
where y(t) is the observed value of the time series in period t, S(t-1) is the smoothed level of the series at time t-1, α (alpha) is the smoothing parameter for the level of the series, S(t) is the smoothed level of the series at time t, computed after y(t) is observed, and y(m) is the model's estimated m-step-ahead forecast made at time t. Intuitively, the formula states that the current smoothed value is obtained by combining information from two sources: the current point and the history embodied in the series. Alpha (α) is a weight ranging between 0 and 1. The closer alpha is to 1, the more exponential smoothing weights the most recent observation and the less it weights the historical pattern of the series. The smoothed value for the current case becomes the forecast value. This is the simplest form of an exponential smoothing model. As mentioned above, and as will be detailed in a later chapter, extensions of the exponential smoothing model can accommodate several types of trend and seasonality, yielding a general model capable of fitting single-series data.
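As an illustration, the recurrence form above can be sketched in a few lines of Python. The initialization S(0) = y(0) and the toy numbers are assumptions for this sketch, not prescribed by the text:

```python
# Minimal sketch of simple exponential smoothing (recurrence form):
# S(t) = alpha * y(t) + (1 - alpha) * S(t-1)
def simple_exp_smoothing(y, alpha):
    """Return the smoothed series; the last value is the forecast."""
    s = [y[0]]  # common choice: initialize with the first observation
    for t in range(1, len(y)):
        s.append(alpha * y[t] + (1 - alpha) * s[t - 1])
    return s

series = [10.0, 12.0, 11.0, 13.0]
smoothed = simple_exp_smoothing(series, alpha=0.5)
forecast = smoothed[-1]  # every m-step-ahead forecast equals S(t)
```

With alpha = 0.5, each smoothed value averages the new observation with the previous smoothed level; a larger alpha would track recent observations more closely.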
8.8 ARIMA
Many of the ideas that have been incorporated into ARIMA models were developed in the 1970s by George Box and Gwilym Jenkins, and for this reason ARIMA modeling is sometimes called Box-Jenkins modeling. ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models is that the variation accounted for in the series variable can be divided into three components:

Autoregressive (AR)
Integrated (I) or Difference
Moving Average (MA)
An ARIMA model can have any component, or combination of components, at both the nonseasonal and seasonal levels. There are many different types of ARIMA models, and the general form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:

p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA model (and P the order of the seasonal autoregressive process)
d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal integration or differencing)
q refers to the order of the nonseasonal moving average process incorporated in the model (and Q the order of the seasonal moving average process).
So, for example, an ARIMA(2,1,1) would be a nonseasonal ARIMA model where the order of the autoregressive component is 2, the order of integration or differencing is 1, and the order of the moving average component is also 1. ARIMA models need not have all three components. For example, an ARIMA(1,0,0) has an autoregressive component of order 1 but no difference or moving average component. Similarly, an ARIMA(0,0,2) has only a moving average component, of order 2.
Autoregressive
In a similar way to regression, ARIMA models use independent variables to predict a dependent variable (the series variable). The name autoregressive implies that series values from the past are used to predict the current series value. In other words, the autoregressive component of an ARIMA model uses the lagged values of the series variable, that is, values from previous time points, as predictors of the current value of the series variable. For example, it might be the case that a good predictor of current monthly sales is the sales value from the previous month. The order of autoregression refers to the time difference between the series variable and the lagged series variable used as a predictor. If the series variable is influenced by the series variable two time periods back, then this is an autoregressive model of order two, sometimes called an AR(2) process. An AR(1) component of the ARIMA model says that the value of the series variable in the previous period (t-1) is a good indicator and predictor of what the series will be now (at time period t). This pattern continues for higher-order processes. The equation representation of a simple autoregressive model (AR(1)) is:
y(t) = φ1 * y(t - 1) + e(t) + a
Thus the series value at the current time point (y(t)) is equal to the sum of: (1) the previous series value (y(t-1)) multiplied by a weight coefficient (φ1); (2) a constant a (representing the series mean); and (3) an error component at the current time point (e(t)).
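A one-step-ahead prediction from this AR(1) equation, with the error term set to its expected value of zero, can be sketched as follows; the coefficient values are invented for illustration:

```python
# Sketch of the AR(1) equation above: y(t) = phi1 * y(t-1) + a + e(t).
def ar1_predict(y_prev, phi1, a):
    """One-step-ahead AR(1) prediction (error term has expectation zero)."""
    return phi1 * y_prev + a

# If last period's value was 100, phi1 = 0.8 and a = 5:
pred = ar1_predict(y_prev=100.0, phi1=0.8, a=5.0)  # 0.8 * 100 + 5 = 85.0
```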
Moving Average
The autoregressive component of an ARIMA model uses lagged values of the series values as predictors. In contrast to this, the moving average component of the model uses lagged values of the model error as predictors. Some analysts interpret moving average components as outside events or shocks to the system. That is, an unpredicted change in the environment occurs, which influences the current value in the series as well as future values. Thus the error component for the current time period relates to the series values in the future. The order of the moving average component refers to the lag length between the error and the series variable. For example, if the series variable is influenced by the models error lagged one period, then this is a moving average process of order one and is sometimes called an MA(1) process. An MA(1) model would be expressed as:
y(t) = θ1 * e(t - 1) + e(t) + a
Thus the series value at the current time point (y(t)) is equal to the sum of several components: (1) the previous time point's model error (e(t-1)) multiplied by a weight coefficient (here θ1); (2) a constant (representing the series mean); and (3) an error component at the current time point (e(t)).
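Analogously, a one-step-ahead prediction from this MA(1) equation, with the current error set to its expected value of zero, might look like the following sketch; the values are invented:

```python
# Sketch of the MA(1) equation above: y(t) = theta1 * e(t-1) + e(t) + a.
def ma1_predict(e_prev, theta1, a):
    """One-step-ahead MA(1) prediction (current error has expectation zero)."""
    return theta1 * e_prev + a

# If last period's shock was +2.0, theta1 = 0.5 and the mean a = 50:
pred = ma1_predict(e_prev=2.0, theta1=0.5, a=50.0)  # 0.5 * 2 + 50 = 51.0
```

The contrast with the AR(1) case is that the lagged quantity here is the model error (the "shock"), not the lagged series value itself.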
Integration
The Integration (or Differencing) component of an ARIMA model provides a means of accounting for trend within a time series model. Creating a differenced series involves subtracting adjacent series values in order to evaluate the remaining components of the model. The trend removed by differencing is later built back into the forecasts by integration (reversing the differencing operation). Differencing can be applied at the nonseasonal or seasonal level, and successive differencing, although relatively rare, can also be applied. The form of a (nonseasonal) differenced series would be:
x(t) = y(t) - y(t - 1)
Thus the differenced series value (x(t)) is equal to the current series value (y(t)) minus the previous series value (y(t-1)).

Multivariate ARIMA

ARIMA also permits a series to be predicted from values in other data series. The relations may be at the same time point (for example, a company's spending on advertising this month influences the company's sales this month) or in a leading or lagging fashion (for example, the company's spending on advertising two months ago influences the company's sales this month). Multiple predictor series can be included at different time lags. A very simple example of a multivariate ARIMA model appears below:
y(t) = b1 * x(t - 1) + e(t) + a
Here the series value at the current time point (y(t)) is equal to the sum of several components: (1) the value of the predictor series at the previous time point (x(t-1)) multiplied by a weight coefficient (b1); (2) a constant; and (3) an error component at the current time point (e(t)). In a practical context, this model could represent monthly sales of a new product as a function of direct marketing spending the previous month. Complex ARIMA models that include other predictor series, autoregressive, moving average, and integration components can be built in the Time Series node.
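The differencing operation described in this section, and the integration step that reverses it, can be sketched in Python; the toy series is invented:

```python
# Sketch of nonseasonal differencing, x(t) = y(t) - y(t-1), and its
# reversal (integration), which rebuilds the series from x and y(0).
def difference(y):
    return [y[t] - y[t - 1] for t in range(1, len(y))]

def integrate(x, y0):
    y = [y0]
    for v in x:
        y.append(y[-1] + v)
    return y

y = [100.0, 103.0, 107.0, 112.0]   # trending series
x = difference(y)                   # [3.0, 4.0, 5.0] -- trend removed
restored = integrate(x, y[0])       # recovers the original series
```

This is the sense in which the "I" in ARIMA works: modeling is done on the differenced series, and the forecasts are integrated back to the original scale.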
of time series information is that the more data points you have, the greater your understanding of the past will be, and the more information you have to use to predict future values in the series. The first important question to be answered is how many data points are required before it is possible to develop time series forecasts. Unfortunately, there is no clear-cut answer to this, but the following factors influence the minimum amount of data required:

Periodicity (how often the data are collected)
Complexity of the time series model
It is important to note that some time series techniques incorporating seasonal effects require several seasonal spans of time series data before it is possible to use them. Usually four or more seasons of data is a good rule of thumb when attempting to explore seasonal modeling. For example, four years' (seasonal spans') worth of quarterly or monthly data would be sufficient, as there are four replications of the time period. At the same time, four years' worth of annual data is not enough historic data, as the sample size is only four. The four-year rule is not, however, a rigid rule, as time series models can be developed and used for forecasting with less historic data. Two final thoughts: first, the more complex the time series model, the larger the time series sample size should be. Second, time series models assume that the same patterns appear throughout the series. If you are fitting a long series in which a dramatic change occurred that might influence the fundamental relations that exist over time (for example, deregulation in the airline and telecom industries), you may obtain more accurate predictions using only the recent (after the change) data to develop the forecasts.
The file contains information on 85 markets. Rather than looking at all of them, we will focus only on Markets 1 through 5. The Filter node to the right of the source node will filter out the markets we don't want.
Double-click the Filter node
Figure 8.7 Filter Node Dialog
The next step is to examine sequence charts of each series, but before doing so we will need to define the periodicity of each series. This is done in the Time Intervals node, which is found in the Field Ops palette.
Place a Type node to the right of the Filter node
Connect the Filter node to the Type node
Place a Time Intervals node to the right of the Type node
Connect the Type node to the Time Intervals node
Double-click on the Time Intervals node
Figure 8.8 Time Intervals Dialog
The Time Interval dropdown is used to define the periodicity of the series. By default it is set to None. While you are not required to specify a periodicity, unless you do so the Expert Modeler will not consider models that adjust for seasonal patterns. In this case, because we have collected data on a monthly basis, we can reasonably expect that the same pattern will repeat itself every twelve months, which constitutes one season. Therefore we will define our time interval as months.
Click on the Time Interval dropdown and select Months
Figure 8.9 Time Intervals Dialog with Periodicity Defined
The next step is to label the intervals. You can either start labeling from the first record, which in the case of this data file is January 1999, or build the labels from a field that identifies the time or date of each measurement. In order to use the Start labeling from first record method, you must specify the starting date and/or time to label the records. This method assumes that the records are already equally spaced, with a uniform interval between each measurement. Any missing measurements would be indicated by empty rows in the data. You can use the Build from data method for series that are not equally spaced. This requires that you have a field that contains the time or date of each measurement. Clementine will automatically impute values for the missing time points so that the series will have equally spaced intervals. In addition, this method requires a date, time, or timestamp field in the appropriate format to use as input. For example, if you have a string field with values like Jan 2000, Feb 2000, etc., you can convert it to a date field using a Filler node. This is the method that we are going to use. However, before we can do this, we must convert the date field from a string to a date.
Click OK
Insert a Filler node between the Filter node and the Type node
Figure 8.10 Stream After Adding the Filler Node
Double-click on the Filler node
Select DATE_ in the Fill in fields box
Select Always from the Replace: dropdown
Type or use the expression builder to insert to_date(DATE_) in the Replace with: box
Click OK
Next, let's set up the Type node so that the direction for all the outcome series we want to forecast is set to Out and the direction for the newly converted DATE field is set to None. We will also need to instantiate the data.
Double-click on the Type node
Set the direction on all the fields from Market1 to Total to Out
Set the direction on the DATE field to None
Click the Read Values button to instantiate the data
Click OK
Figure 8.13 Time Intervals Dialog with Date field added
The New field name extension is used to apply either a Prefix or Suffix to the new fields generated by the node. By default it is $TI_.
Click on the Build tab
Figure 8.14 Build Tab Dialog
The Build tab allows you to specify options for aggregating or padding fields to match the specified interval. These settings apply only when the Build from data option is selected on the Intervals tab. For example, if you have a mix of weekly and monthly data, you could aggregate or roll up the weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to weekly and pad the series by inserting blank values for any weeks that are missing, or by extrapolating missing values using a specified padding function. When you pad or aggregate data, any existing date or timestamp fields are effectively superseded by the generated TimeLabel and TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that measure time as a duration are preserved, such as a field that measures the length of a service call rather than the time the call started, as long as they are stored internally as time fields rather than timestamps.
Click on the Estimation tab
Figure 8.15 Estimation Tab Dialog
The Estimation tab of the Time Intervals node allows you to specify the range of records used in model estimation, as well as any holdouts. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually. The Begin Estimation option is used to specify when you want the estimation period to begin. You can either begin the estimation period at the beginning of the data or exclude older values that may be of limited use in forecasting. Depending on the data, you may find that shortening the estimation period can speed up performance (and reduce the amount of time spent on data preparation) with no significant loss in forecasting accuracy. The End Estimation option allows you to either estimate the model using all records up to the end of the data or hold out the most recent records in order to evaluate the model. For example, if you hold out the last three records and then specify 3 for the number of records to forecast, you are effectively forecasting values that are already known, allowing you to compare observed and predicted values to gauge the model's effectiveness at forecasting into the future. We will use the default settings.
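The holdout idea described above can be sketched in Python. The naive flat forecast here is only a stand-in for a fitted model, and all values are invented:

```python
# Sketch: hold out the last 3 observations, "forecast" them, and
# compare against what was actually observed.
def split_holdout(y, n_holdout):
    return y[:-n_holdout], y[-n_holdout:]

history = [20.0, 21.0, 23.0, 22.0, 24.0, 25.0, 26.0]
estimation, holdout = split_holdout(history, 3)

last = estimation[-1]
forecasts = [last] * len(holdout)              # naive flat forecast
errors = [obs - f for obs, f in zip(holdout, forecasts)]
```

Because the holdout values are already known, the errors give a direct check of how well the "model" would have forecast into the future.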
Click the Forecast tab
Figure 8.16 Forecast Tab Dialog
The Forecast tab of the Time Intervals node allows you to specify the number of records you want to forecast and to specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually. The Extend records into the future option lets you specify the number of time points you wish to forecast beyond the estimation period. Note that these time points may or may not be in the future, depending on whether or not you held out some historic data for validation purposes. For example, if you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and only 1 future value. The Future indicator field is used to label the generated field that indicates whether a record contains forecast data. The default value for the label is $TI_Future. The Future Values to Use in Forecasting section allows you to specify future values for any predictor fields you use. Future values for any predictor fields are required for each record that you want to forecast, excluding holdouts. For example, if you are forecasting next month's revenues for a hotel based on the number of reservations, you need to specify the number of reservations you actually expect. Note that fields selected here may or may not be used in modeling; to actually use a field as a predictor, it must be selected in a downstream modeling node. This dialog box simply gives you a convenient place to specify future values so they can be shared by multiple downstream modeling nodes without specifying them separately in each node. Also note that the list of available fields may be constrained by selections on the Build tab. For example, if Specify fields and functions is selected on the Build tab, any fields not aggregated or padded are dropped from the stream and cannot be used in modeling.
The Future value functions option lets you choose from a list of functions or specify a value of your own. For example, you could set the value to the most recent value. The available functions depend on the type of field.
Click on the Extend records into the future check box
Specify that you would like to forecast 3 records beyond the estimation period
Figure 8.17 Completed Forecast Tab Dialog
Click OK
The next step is to examine each series with a Sequence chart. We will display all the fields on the same chart.
Place a Time Plot node from the Graphs palette below the Time Intervals node
Attach the Time Intervals node to the Time Plot node
Double-click on the Time Plot node
Select all the series from Model1 to Total
Uncheck the Display Series in separate panels box
Figure 8.18 Completed Time Plot Dialog
Click Execute
Figure 8.19 Sequence Chart Output for Each Series
From this graph, it is clear that broadband usage has been increasing rapidly in the US (even more so in other countries), so we see a steady, very smooth increase for all fields. The numbers for Market_1 do begin to dip in the last couple of months, but perhaps this is temporary. There is clearly no seasonality in these data, which makes sense: the number of broadband subscriptions does not rise and fall throughout the year. We can use this fact to reduce the time the Expert Modeler needs to fit models to these series, since requesting that seasonality be considered would increase processing time. Additionally, because the series we've viewed here are so smooth, with no obvious outliers, we'll not request outlier detection. This will also save processing time. Note, though, that if you are in doubt about this, it is safer to use outlier detection during modeling.
Place a Time Series node from the Modeling palette near the Time Intervals node
Connect the Time Intervals node to the Time Series node
Figure 8.20 Stream with Time Series node attached
The default method is the Expert Modeler, which automatically selects the best exponential smoothing or ARIMA model for a series or a group of series. As an alternative, you can use the menu to specify a custom Exponential Smoothing or ARIMA model. In addition, there is a Reuse Stored Settings option, which allows you to apply an existing model to new data without re-estimating the model from the beginning. In this way you can save time by re-estimating and producing a new forecast based on the same model settings as before
but using more recent data. Thus, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting for that data; the system does not reattempt to find the best model type for the new data. We will use the Expert Modeler in this example. In addition, you can specify the confidence intervals you want for the model predictions and residual autocorrelations. By default, a 95% confidence interval is used. You can set the maximum number of lags shown in tables and in plots of autocorrelations and partial autocorrelations. You must include a Time Intervals node upstream from the Time Series node; otherwise, the dialog will indicate that no time interval has been defined and the stream will not run. In this example, the settings indicate that the model will be estimated from all the records and that forecasts will be made for 3 time periods beyond the estimation period.
Click the Criteria button
The All models option should be checked if you want the Expert Modeler to consider both ARIMA and exponential smoothing models. The other two modeling options can be used if you want the Expert Modeler to consider only exponential smoothing or only ARIMA models. The Expert Modeler will consider seasonal models only if periodicity has been defined for the active dataset. When the Expert Modeler considers seasonal models option is selected, the Expert Modeler considers both seasonal and nonseasonal models; if it is not selected, the Expert Modeler considers only nonseasonal models. We will uncheck this option because the sequence charts clearly show that there were no seasonal patterns in broadband subscriptions.
The Events and Interventions option enables you to designate certain fields as event or intervention fields. Doing so identifies a field as containing time series data affected by events (predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a power outage or employee strike). To be included in this list, these fields must be of type Flag, Set, or Ordered Set, and must be numeric (e.g., 1/0, not T/F, for a Flag field).
Uncheck the Expert Modeler considers seasonal models option (not shown)
Click the Outliers tab
The Detect Outliers automatically option is used to locate and adjust for outliers. Outliers can lead to forecasting bias either up or down, erroneous predictions if the outlier is near the end of the series, and increased standard errors. Because there were no obvious outliers in the sequence chart, we will leave this option unchecked.
Click Cancel
Click Execute
Right-click on the generated model named 6 fields in the Models palette
Click Browse
Figure 8.24 Time Series Model Output (View = Simple)
The Time Series model displays details of the models the Expert Modeler selected for each series. In this case, it chose Holt's exponential smoothing model for the first four series and the last one, and Winters' additive exponential smoothing model for the fifth series. Given the likely similar patterns in the series, it is not surprising that the same model was chosen for most of the series. The default output shows, for each series, the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). This measure is usually preferable to an ordinary R-squared when there is a trend or seasonal pattern. If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes a Ljung-Box Q statistic, which tests for autocorrelation of the errors. Here we see that the result was significant for the Model_2, Model_4, and Total series. Later on, we will examine some residual plots to see why the results were significant. The default view (Simple) displays the basic set of output columns. For additional goodness-of-fit measures, you can use the View menu to select the Advanced option. The check boxes to the left of each model can be used to choose which models you want to use in scoring. All the boxes are checked by default. The Check all and Un-check all buttons in the upper left act on all the boxes in a single operation. The Sort by option can be used to sort the rows in ascending or descending order of a specified column. As an alternative, you can also click on a column heading itself to change the order.
The Root Mean Square Error (RMSE) is the square root of the mean squared error. The Mean Absolute Percentage Error (MAPE) is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The Mean Absolute Error (MAE) takes the average of the absolute values of the errors. The Maximum Absolute Percentage Error (MaxAPE) is the largest absolute forecast error expressed as a percentage. The Maximum Absolute Error (MaxAE) is the largest forecast error, positive or negative. And finally, the Normalized Bayesian Information Criterion (Norm BIC) is a general measure of the overall fit of a model that attempts to account for model complexity. From this table, you can easily scan the statistics to look for better, or poorer, fitting models. We can see here that Model_5 has the highest Stationary R-squared value (0.544) and Total has a very low one (0.049). However, the Total series has a lower MAPE than any of the other series. The summary statistics at the bottom of the output provide the mean, minimum, maximum, and percentile values for the standard fit measures. Here we see that the value for Stationary R-squared at the highest percentile (Percentile 95) is 0.544. This means that Model_5 should be
ranked in the highest percentile based on this statistic, and the Total series should be ranked in the lowest. Now let's look at the residual plots.
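To make the definitions above concrete, here is a small Python sketch (not Clementine code) that computes these fit measures for one series from its actual and fitted values. The Normalized BIC formula shown is the commonly used ln(MSE) + k ln(n)/n form, with k the number of model parameters; treat that formula, and the sample values, as assumptions for illustration.

```python
# Sketch: the fit measures described above, computed from actual and
# fitted values of a single series. Illustrative only -- Clementine's
# exact formulas (e.g. the penalty term in Normalized BIC) may differ.
import math

def fit_measures(actual, fitted, k=1):
    errors = [a - f for a, f in zip(actual, fitted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                                  # Root Mean Square Error
    apes = [abs(e) / abs(a) * 100 for e, a in zip(errors, actual)]
    mape = sum(apes) / n                                   # Mean Absolute Percentage Error
    mae = sum(abs(e) for e in errors) / n                  # Mean Absolute Error
    max_ape = max(apes)                                    # Maximum Absolute Percentage Error
    max_ae = max(abs(e) for e in errors)                   # Maximum Absolute Error
    # Normalized BIC: ln(MSE) plus a complexity penalty (assumed form)
    norm_bic = math.log(mse) + k * math.log(n) / n
    return {"RMSE": rmse, "MAPE": mape, "MAE": mae,
            "MaxAPE": max_ape, "MaxAE": max_ae, "NormBIC": norm_bic}

actual = [10.0, 12.0, 13.0, 15.0, 16.0]   # hypothetical series values
fitted = [9.5, 12.5, 12.0, 15.5, 17.0]    # hypothetical model fits
print(fit_measures(actual, fitted))
```

Scanning these measures across series is exactly what the Advanced view of the model output supports.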
Click on the Residuals tab
The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each target field. The ACF values are the correlations between the current value and those at previous time points. By default, 24 autocorrelations are displayed. The PACF values look at the correlations after controlling for the series values at the intervening time points. If all of the bars fall within the confidence intervals (the highlighted area), then there are no significant autocorrelations in the series. That seems to be the case with the Market_1 series. However, as we saw in Figure 8.24, the Market_2 series seemed to have significant autocorrelation based on the Ljung-Box Q statistic. Let's take a look at the residuals plot for the Market_2 series to see if we can see why that statistic was significant for that series.
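The ACF calculation and the confidence band behind the highlighted area can be sketched in a few lines of Python. This is an illustration, not Clementine's internal code; the 1.96/sqrt(n) band is the usual large-sample approximation.

```python
# Sketch: residual autocorrelations and the approximate 95% confidence
# band used to judge them. Bars outside the band indicate
# autocorrelation left in the errors.
import math
import random

def acf(series, max_lag=24):
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series)
    out = []
    for lag in range(1, max_lag + 1):
        ck = sum((series[t] - mean) * (series[t - lag] - mean)
                 for t in range(lag, n))
        out.append(ck / c0)
    return out

def significant_lags(residuals, max_lag=24):
    # Approximate 95% band: +/- 1.96 / sqrt(n)
    bound = 1.96 / math.sqrt(len(residuals))
    return [lag for lag, r in enumerate(acf(residuals, max_lag), start=1)
            if abs(r) > bound]

# White-noise-like residuals should show few or no significant lags
random.seed(0)
noise = [random.gauss(0, 1) for _ in range(200)]
print(significant_lags(noise, max_lag=12))
```

A PACF would be computed from these same autocorrelations (e.g. by the Durbin-Levinson recursion), which the Residuals tab also plots.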
Use the Display plot for model: option to select the Market_2 series
Figure 8.27 Residuals Output for the Market_2 Series
Here we see that there is significant autocorrelation at lag 6 in both the ACF and PACF plots. Thus, the results of the Ljung-Box Q statistic and these two plots are consistent: there is a nonrandom pattern in the errors. This does not imply that the current model can't be used for forecasting, as it may perform adequately for the broadband company. But it does suggest that the model can be improved. The Expert Modeler is an automatic modeling technique, and it normally finds a fairly acceptable model, but that doesn't mean that some tweaking on your part isn't appropriate.
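For reference, the Ljung-Box Q statistic that flagged this series combines the first K residual autocorrelations into one test value, compared against a chi-square distribution. A minimal sketch, using hypothetical autocorrelations with a spike at lag 6 like the one seen here:

```python
# Sketch: the Ljung-Box Q statistic on residual autocorrelations
# r_1..r_K for a series of n observations. A large Q relative to a
# chi-square critical value signals a nonrandom pattern in the errors.

def ljung_box_q(residual_acf, n):
    # Q = n(n+2) * sum_k r_k^2 / (n - k)
    return n * (n + 2) * sum(r * r / (n - k)
                             for k, r in enumerate(residual_acf, start=1))

# Hypothetical autocorrelations for 60 observations, with a lag-6 spike
r = [0.02, -0.05, 0.01, 0.03, -0.02, 0.45]
print(round(ljung_box_q(r, n=60), 2))
```

Almost all of the statistic's value comes from the single large autocorrelation, which is why one bar outside the band in the plot matches a significant Q in the model table.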
Click OK Place the generated Time Series model from the Models palette onto the stream canvas Connect the Time Intervals node to the generated model Place a Table node near the generated model Connect the generated model to the Table node Execute the Table node
Figure 8.28 Table Output Showing Fields Created by Time Series Model
The table now contains a forecast value for each time point, along with upper and lower confidence limits. In addition, there is a field called $TI_Future that indicates which records contain forecast data. For records that extend into the future, the value of this field will be 1.
Scroll to the bottom of the table and then slightly to the right
Figure 8.29 Table Output with Future Values Displayed
Notice that the original series all have null values on these last three records because they extend into the future. On the right-hand side in Figure 8.29, we can see the forecast values for the future months (January 2004 to March 2004) for the Market_1 series. Finally, let's create a chart showing the forecast for one of the series.
Close the Table window Place a Time Plot node near the generated model on the stream canvas Connect the Time Plot node to the generated model Select the following fields to be plotted: Market_5, $TS-Market_5, $TSLCI-Market_5, $TSUCI-Market_5 Uncheck the Display Series in separate panels option Click OK
Figure 8.30 Sequence Chart for Market_5 along with Forecasts and Upper & Lower Confidence Limits
From this chart, it appears that the model fits this series very well.
Click File…Save Stream As…, and save the stream as Broadband.str
Figure 8.31 Broadband2.str with the Generated Model from Broadband1.str
This node contains the settings from the time series models we just created. Normally, at this point with any other Clementine-generated model, we would make predictions on new data by attaching this node to the Type node and executing the generated model. This would make predictions for new cases. Time series data, though, are different. Unlike other types of data files, where there is no special order to the cases (in terms of modeling), order makes a difference in a time series. To reuse our settings, but also use the new data (from January to March) to make estimates, we must create a new Time Series node directly from the generated Time Series model.
Right-click on the generated model and select Edit Click Generate…Generate Modeling Node
Figure 8.32 Broadband2.str with the Time Series Node Generated from the Previous Model
We don't have to specify any outcome fields because the models, with all specifications, are already stored in the generated time series modeling node. We simply insert the model node and decide whether the model should be re-estimated or not. Assuming that you have recently estimated models, you might be willing to act as if the estimated parameters for the models still hold. You can avoid estimation and apply the models to the new data by using the Reuse Stored Settings method option. This choice means that Clementine will use the stored model settings for both the model form (type of exponential smoothing or ARIMA model) and the exact parameter estimates (e.g., the value of an AR(1) nonseasonal term). If instead you wish to re-estimate the model parameters, the Expert Modeler choice means that Clementine will use the model form found in the model file, but will re-estimate the parameters. Although it will clearly take more computing time to re-estimate model parameters, unless you have many, many time series that are very long, re-estimating the parameters is usually the better choice. However, if you are, let's say, making forecasts every month (week, etc.) based on just one additional month (week, etc.) of data, it may not be worth the effort to re-estimate every month. In that case, you may wish to re-estimate every few months.
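The difference between the two choices can be sketched outside Clementine with a toy Holt's linear-trend smoother in plain Python. Reusing stored settings means applying previously estimated smoothing weights (here the hypothetical values alpha = 0.8, beta = 0.2) to the extended series without re-optimizing; re-estimation would instead search for new weights on the extended data.

```python
# Sketch (not Clementine code): Holt's linear-trend exponential
# smoothing with fixed smoothing weights alpha (level) and beta
# (trend). "Reuse stored settings" = apply stored weights to the
# extended series; re-estimation would search for the weights again.

def holt_forecast(series, alpha, beta, horizon=3):
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

# Stored settings from the earlier model run (hypothetical values)
stored = {"alpha": 0.8, "beta": 0.2}

history = [100.0, 104.0, 109.0, 113.0, 118.0]   # hypothetical series
new_months = [122.0, 127.0, 131.0]              # e.g. January-March

# Reuse stored settings on the extended series -- no re-estimation
forecasts = holt_forecast(history + new_months, **stored)
print([round(f, 1) for f in forecasts])
```

Re-estimation would correspond to optimizing alpha and beta over the combined series before forecasting, which is the more expensive step the text describes.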
Double-Click on the Time Series node Click the Method dropdown and select Reuse Stored Settings
Figure 8.33 Time Series Model Node with Reuse Stored Settings Selected
Click Execute to place a new model in the Models Manager Browse the new model
Figure 8.34 Time Series Model Output
As we can see, the models used for each series are the same as before (see Figure 8.24). Now let's take a look at the new forecasts for April, May, and June.
Attach the new model to the Time Intervals node Attach a Table node to the new Time Series model Execute the Table node
Figure 8.35 Table Node Output with New Forecasts
In summary, in this chapter we demonstrated how to make forecasts for several series at once, and how to apply the estimated models to new data, re-estimate them if desired, and make new forecasts at a future date for those same series. The process of applying the models to new data can be repeated as often as necessary.
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
Objectives
We introduce the Decision List model, and then describe differences between Decision List and the decision tree algorithms. We then detail the expert options available within the Decision List modeling node. We will also demonstrate the Interactive Decision List feature.
Data
In this chapter we use the data file churn.txt, which contains information on 1477 customers of a telecommunications firm who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. Unlike the models developed in Chapter 3, here we want to understand which factors influence the voluntary leaving of a customer, rather than trying to predict all three categories.
9.1 Introduction
Clementine contains five different algorithms for performing rule induction: C5.0, CHAID, QUEST, C&R Tree (classification and regression trees) and Decision List. The first four are similar in that they all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the outcome. However, they differ in several ways that are important to the user (see Chapter 3). Decision List predicts a symbolic output, but it does not construct a decision tree; instead, it repeatedly applies a decision rules approach. To give you some sense of a Decision List model we begin by browsing such a model and viewing its characteristics. After that we continue by reviewing a table that highlights some distinguishing features of the rule induction algorithms. Finally, we will outline the difference between decision trees and decision rules and the various options for the Decision List algorithm in the context of predicting symbolic outputs.
Decision List 9 - 1
Once the Decision List generated model is in the Models palette, the model can be browsed.
Right-click the Decision List node named CHURNED[Vol] in the Models palette Click Browse
The results are presented as a list of decision rules, hence the name Decision List. If you are familiar with C5.0 model output you will see a distinct likeness to the Rule Set presentation of a C5.0 model.

Figure 9.2 Browsing the Decision List Model
The first row gives information about the training sample. The sample has 719 records (Cover (n)), of which 267 meet the target value Vol (Frequency). Consequently, the percentage of records meeting the target value is 37.13% (Probability). A numbered row represents a model rule and consists of an id, a Segment, a target value or Score (Vol), and a number of measures (here: Cover (n), Frequency, and Probability). As you can see, a segment is described by one or more conditions, and each condition in a segment is based on a predictive field, e.g. SEX = F, INTERNATIONAL > 0 in the second segment. All predictions are for the Vol category, as this is what is defined in the Decision List modeling node. The accuracy of predicting this category is listed for each segment in the Probability column, and accuracy is reasonably high for most segments. As a whole, our model has 5 segments and a Remainder. The maximum number of predictive fields in a segment is 2. No segment is too small (see the Cover (n) measure); the smallest has 52 records. This is no accident: the maximum number of segments in the model, the maximum number of predictive fields in a segment, and the minimum number of records in a segment are all set in the Decision List node, as we will see later. We now review the Decision List model in some detail.
The Target
A characteristic of Decision List is that it models a particular value of a symbolic target. In the Decision List model at hand we have modeled the voluntary leaving of a customer as represented by target value CHURNED = Vol.
Overlapping Segments
In our model the 5 segments and the Remainder form a non-overlapping segmentation of the training sample, meaning that a customer (or a record) belongs to exactly one segment or to the Remainder. So the total of the Cover (n) for all segments, including the Remainder, should match the Cover (n) of the training sample. This basic requirement affects the way a particular segment should be interpreted when reading the model. The Nth segment should be interpreted as:

The record is in segment N and not(segment N-1) and not(segment N-2) and … and not(segment 1)

Example
Predictive Modeling With Clementine

Given our model, a female customer with INTERNATIONAL > 0 and AGE from 43 to 58 satisfies both segment 1 and segment 2. However, she will be regarded as a member of segment 1: the rules are applied in the order you see them listed, so this customer is assigned to the first matching segment. A customer belongs to segment 2 if:

not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and INTERNATIONAL > 0

And a customer belongs to segment 3 if:

not (SEX = F and INTERNATIONAL > 0)
and not (SEX = F and 42 < AGE <= 58)
and SEX = F and 73 < AGE <= 89
This mechanism prevents multiple counting of customers in overlapping segments. Be aware that the order of the segments in the model affects the segment a customer belongs to, and so also the measures Cover (n), Frequency, and Probability for each model segment. This is a consequence of the iterative method by which Decision List generates rules. In a later section we will cover in detail how this rule induction mechanism works. For now it is sufficient to realize that the Decision List algorithm constructs lists of decision rules using a very different mechanism than the splitting used in the decision tree algorithms. This is why Decision List is called a rule algorithm rather than a tree algorithm.
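The first-match scoring logic described above can be sketched as follows; the segment conditions mirror the example segments, and the record layout is a stand-in for the actual data.

```python
# Sketch: Decision List scoring assigns each record to the FIRST
# segment whose conditions it satisfies; later segments implicitly
# exclude all earlier ones. Conditions follow the example in the text.

segments = [
    ("1", lambda r: r["SEX"] == "F" and 42 < r["AGE"] <= 58),
    ("2", lambda r: r["SEX"] == "F" and r["INTERNATIONAL"] > 0),
    ("3", lambda r: r["SEX"] == "F" and 73 < r["AGE"] <= 89),
]

def assign_segment(record):
    for seg_id, condition in segments:
        if condition(record):
            return seg_id
    return "Remainder"

# A female customer aged 50 with INTERNATIONAL > 0 satisfies both
# segment 1 and segment 2, but is assigned to segment 1.
customer = {"SEX": "F", "AGE": 50, "INTERNATIONAL": 2}
print(assign_segment(customer))   # -> 1
```

Reordering the `segments` list would change the assignments, and with them the Cover (n), Frequency, and Probability of every segment, which is exactly the order-dependence noted above.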
Model Criterion

Split Type for Symbolic Predictors: C5.0 Multiple; CHAID Multiple; QUEST Binary; C&R Tree Binary; Decision List Multiple

Continuous Target: C5.0 No; CHAID Yes (1); QUEST No; C&R Tree Yes; Decision List No

Continuous Predictors: C5.0 Yes; CHAID No (2); QUEST Yes; C&R Tree Yes; Decision List No (2)

Criterion for Predictor Selection: C5.0 Information (entropy) measure; CHAID Chi-square (F test for continuous); QUEST Statistical; C&R Tree Impurity (dispersion) measure; Decision List Statistical

Can Cases Missing Predictor Values be Used?: C5.0 Yes; CHAID Yes, missing becomes a category; QUEST Yes, uses surrogates; C&R Tree Yes, uses surrogates; Decision List Yes, set with Expert options

Priors: C5.0 No; CHAID No; QUEST Yes; C&R Tree Yes; Decision List No

Pruning: C5.0 Yes; CHAID No (stops rather than prunes); QUEST Yes; C&R Tree Cost-complexity; Decision List No (stops)

Build Models Interactively: C5.0 No; CHAID Yes; QUEST Yes; C&R Tree Yes; Decision List Yes

Supports Boosting: C5.0 Yes; CHAID No; QUEST No; C&R Tree No; Decision List No
1 SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
2 Continuous predictors are binned into ordinal variables containing, by default, approximately equal sized categories.

Unlike the decision tree algorithms, Decision List does not create subgroups by splitting but by either adding a new predictor or by narrowing the domain of the existing predictor(s) in the group (the decision rule approach); in consequence, tree-splitting issues are not applicable here. Decision List can handle targets that are of type flag and set. Decision List is designed to model a specific category of a symbolic target, so effectively it predicts a binary outcome (target or not target). The algorithm treats continuous predictors by binning them into ordinal fields with an approximately equal number of records in each category. In generating rules, just like CHAID and QUEST, Decision List uses standard statistical methods, as explained below. The way missing values are handled is set with Expert options: either the missing values in a predictor are ignored when it comes to using that predictor in forming a subgroup, or, like CHAID, all missing values are used as an additional category in model building. The process of rule generation halts based on settings such as the maximum number of predictors in a rule, explicit group-size related settings, and the statistical confidence required.
In this example we will attempt to predict which customers voluntarily cancel their mobile phone contract. Rather than rebuild the source and Type nodes, we use the existing stream opened previously. We'll delete the Decision List node so we can review the default settings.
Close the Decision List Browser window Delete the CHURNED[Vol] node Place a Decision List node from the Modeling palette to the upper right of the Type node in the Stream Canvas Connect the Type node to the Decision List node (see Figure 9.3)
The name of the Decision List node should immediately change to No Target Value. Figure 9.3 Decision List Modeling Node Added to Stream
The node is named No Target Value because target field CHURNED has three values, but Decision List predicts only one specific target value, and that value has not yet been set.
Double-click the Decision List node to edit it
Figure 9.4 Decision List Dialog - Initial
The Model name option allows you to set the name for both the Decision List modeling node and the resulting generated model node. The Use partitioned data option is checked so that the Decision List node will make use of the Partition field created by the Partition node earlier in the stream. By default the model is built automatically, as the Mode is set to Generate model. By selecting Launch interactive session it is possible to create the model interactively. The Target value has to be set explicitly to Vol.
Click the button to the right of the Target value field Click Vol, then click Insert
With Decision List you are able to generate rules better than the average or worse than the average, depending on your goal (where the average is the overall probability of the target value). This is set by the Search Direction value of Up or Down. An upward search looks for segments with a high frequency. A downward search will create segments with a low frequency. A decision rule model contains a number of segments. The maximum is set in Maximum number of segments. Each segment is described by one or more predictors, also known as attributes in the Decision List node. The maximum number of predictive fields to be used in a segment is set in Maximum number of attributes. You may compare this setting with the Levels below root setting in CHAID and QUEST, prescribing the maximum tree depth. The Maximum number of attributes setting implies a stopping criterion for the algorithm. Just like the stopping criteria of CHAID, Decision List also has settings related to segment size: As percentage of previous segment (%) and As absolute value (N). The percentage setting states that a segment can only be created if it contains at least a certain percentage of the records of its parent. Compare this with a branch point in a tree algorithm. The absolute value setting is straightforward: a segment only qualifies for the model if it is not too small, thus serving the generality requirement of a predictive model. The larger of these two settings takes precedence. Note that whereas in CHAID's stopping criteria you must choose either a percentage or an absolute value approach, Decision List combines the two by using the percentage requirement for the parent and the absolute value requirement for the child. The model's accuracy is controlled by Confidence interval for new conditions (%). This is a statistical setting and the most commonly used value is 95, the default. Of course, depending on the business case and how costly an erroneous prediction is, you may increase or decrease this confidence value.
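The way the two size settings combine, with the larger requirement taking precedence, can be sketched as a simple check. The percentage and absolute values below are illustrative assumptions, while the 719-record parent and 52-record segment come from the model discussed earlier.

```python
# Sketch: a candidate segment qualifies only if it meets BOTH size
# settings -- at least a given percentage of its parent segment AND at
# least an absolute number of records; the larger requirement wins.

def segment_qualifies(segment_n, parent_n, pct_of_parent=5.0, min_n=50):
    required = max(parent_n * pct_of_parent / 100.0, min_n)
    return segment_n >= required

print(segment_qualifies(52, 719))    # 52 >= max(35.95, 50) -> True
print(segment_qualifies(40, 719))    # 40 < 50 -> False
```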
Figure 9.5 Three New Fields Generated by the Decision List Node
Three new columns appear in the data table: $D-CHURNED, $DP-CHURNED, and $DI-CHURNED. The first represents the predicted target value for each record, the second the probability, and the third shows the ID of the model segment a record belongs to. The sixth segment is the Remainder. Note that the predicted value is either Vol or $null$, demonstrating that the Decision List algorithm predicts a particular value of the target field to the exclusion of the others.
Click File…Close to close the Table output window
First we will edit the Select node on the upper right that we will use to select the Training sample cases:
Double-click on the Select node on the upper right to edit it Click the Expression Builder button Move Partition from the Fields list box to the Expression Builder text box Click the equal sign button Click the Select from existing field values button (not shown) and insert the value 1_Training
Now do the same for the Select node on the right to select the Testing sample cases. Insert Partition value 2_Testing and annotate the node as Testing. Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:
Place a Matrix node from the Output palette near the Select node Connect the Matrix node to the Select node Double-click the Matrix node to edit it Put CHURNED in the Rows: Put $D-CHURNED in the Columns: Click the Appearance tab Click the Percentage of row option Click on the Output tab and custom name the Matrix node for the Training sample as Training and the Testing sample as Testing (this will make it easier to keep track of which output we are looking at) Click OK
For each actual churn category, the Percentage of row choice will display the percentage of records predicted in each of the outcome categories.
Execute each Matrix node
Figure 9.7 Matrix Output for the Training and Testing Samples
Looking at the Training sample results, the model predicts about 82.0% of the Vol (Voluntary Leavers) category correctly. The results with the Testing sample compare favorably (80.5% accurate), which suggests that the model will perform well with new data. Note that technically no prediction for the other two categories is correct, since the model doesn't predict Current or InVol but just $null$. But we can combine these results by hand to obtain the accuracy. The percentage of correct not-Vol predictions is: (313 + 48)/((313 + 68) + (48 + 23))*100 = 79.9%. We could have made this calculation easier by creating a two-valued target field based on CHURNED, thus creating a 2 by 2 matrix. Decision List would create the same rules for such a field.
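The hand calculation can be written out as a short Python sketch using the counts read from the Training matrix:

```python
# Sketch: combining the "not Vol" cells of the training matrix by
# hand. Rows are actual categories, columns are predicted values
# ($null$ vs Vol); counts taken from the matrix output above.
matrix = {
    "Current": {"$null$": 313, "Vol": 68},
    "InVol":   {"$null$": 48,  "Vol": 23},
}

correct = sum(row["$null$"] for row in matrix.values())          # 313 + 48
total = sum(sum(row.values()) for row in matrix.values())        # 381 + 71
accuracy = correct / total * 100
print(round(accuracy, 1))   # -> 79.9
```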
Close both Matrix windows
By default, an Evaluation chart will use the first target outcome category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node.
Click the Options tab Click the User defined hit checkbox in the User defined hit group Click the Expression Builder button Click @Functions on the functions category drop-down list Select @TARGET on the functions list, and click the Insert button Click the = button Right-click CHURNED in the Fields list box, then select Field Values
Figure 9.8 Specifying the Hit Condition within the Expression Builder
Click OK
In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.
Click Execute
Figure 9.10 Gains Chart for the Voluntary Leaving Group
The gains line ($D-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that hits for the Voluntary Leaving category are concentrated in the percentiles predicted most likely to contain this type of customer, according to the model.
Hold the cursor over the model line in the Training partition at the 40th percentile
Approximately 77% of the hits were contained within the first 40 percentiles.
Figure 9.11 Gains Chart for the Voluntary Leaving Group (Interaction Enabled)
The gains line in the chart using Testing data is very similar which suggests that this model can be reliably used to predict voluntary leavers with new data.
Close the Evaluation chart window
are SEX and AGE. Because the sample used for training the model gradually decreases during the stepwise rule discovery process, other predictive fields come to the surface as most important. This intuitively makes sense. So in step 2, when finding the best second segment using the whole training sample except the first segment, the most important fields turn out to be SEX and International. Similarly, when finding segment 3 and using the whole training sample except for the first two segments, SEX and AGE are again the most important predictors. The process continues until the algorithm is not able to construct segments satisfying the requirements, or the stopping criteria are reached.
Expert mode options allow you to fine-tune the rule induction process.
Click the Expert tab Click the Expert Mode option button
Binning
Binning is a method of transforming a numeric field (of type Range) into a number of categories (intervals). The Number of bins input sets the maximum number of bins to be constructed. Whether this maximum will actually be the number of bins depends on other settings as well. There are two main binning methods, Equal Count and Equal Width. Equal Width transforms numeric fields into a number of fixed-width intervals. Equal Count is a more balanced binning method, creating intervals based on an equal number of records per interval. The three settings below this control details of the modeling process, described below. If Allow missing values in conditions is checked, the Decision List algorithm will regard being empty or undefined as a particular category that can be used as a condition in a segment. That may result in a segment such as SEX = F and AGE IS MISSING.
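The two binning methods can be sketched in Python. This is an illustration of the idea, not Clementine's exact binning rules (tie handling and bin boundaries may differ); the ages are made-up values.

```python
# Sketch: Equal Width fixes the interval width; Equal Count aims at
# roughly the same number of records per bin.

def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Bin index per value; the maximum value lands in the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_count_bins(sorted_values, n_bins):
    # Assign bins by rank so each bin gets ~n/n_bins records
    n = len(sorted_values)
    return [min(i * n_bins // n, n_bins - 1) for i in range(n)]

ages = sorted([18, 22, 25, 31, 38, 44, 52, 61, 70, 89])
print(equal_width_bins(ages, 3))
print(equal_count_bins(ages, 3))
```

With skewed data like these ages, Equal Width piles most records into the first interval, while Equal Count keeps the bins balanced, which is why the text calls it the more balanced method.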
List Generation

To simplify the argument, we will describe the process given the setting Model search width = 1, meaning we will not create multiple lists simultaneously to choose from at the end. So we will assume one list cycle here.

Rule Generation

Given the above, the rule generation process starts with the full Training sample to search for segments. The solution area is generated as follows: on the first rule level, segments are constructed based on 1 predictive field. The best 5 (Rule search width) will be selected as starting points for a second rule level, resulting in a set of segments each described by 2 predictive fields. Again the best 5 are selected for the third rule level. This goes on until the last rule level, which is 5 (Maximum number of attributes), so in principle the fifth rule level segments are described by five predictive fields. It is not always possible to refine a certain segment in a next step by adding a new predictive field. One of the reasons is the group size as set in As absolute value (N). The algorithm may come up with segments that are described by fewer than five predictive fields. On the other hand, refining a given segment in a next step can also be done not by adding a new predictive field, but by reconsidering an existing predictive field. This is set by Allow attribute re-use (e.g., Age between (20, 60) in step 1 could be refined to Age between (25, 55) in level 2). So this is why in rule level N there may be segments having fewer than N predictive fields. A segment that is not refined anymore is called a final result, which is comparable to a terminal node in a decision tree. If Model search width = 1, out of all these final results the algorithm will return the best 5 (Maximum number of segments) based on the target value's probability. Our previous model did create all five.
The decision rule process may not be able to use all the freedom as set in the Rule search width (5) and in Maximum number of attributes (5). The main reasons are typically group size requirements and/or the statistical confidence requested.
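The search process just described can be sketched in miniature: at each rule level the best `rule_search_width` candidate segments are kept and refined by one more condition, subject to a group-size requirement, up to `max_attributes` conditions. The data, candidate conditions, and parameter values below are toy stand-ins, not Clementine internals.

```python
# Sketch: beam search over rule refinements, the mechanism behind
# Rule search width / Maximum number of attributes described above.

def probability(records, conditions, target):
    covered = [r for r in records if all(c(r) for c in conditions)]
    if not covered:
        return 0.0, 0
    hits = sum(1 for r in covered if r["CHURNED"] == target)
    return hits / len(covered), len(covered)

def find_segment(records, candidate_conditions, target,
                 rule_search_width=5, max_attributes=2, min_n=2):
    beam = [()]                                  # start with the empty rule
    best = (0.0, ())
    for _ in range(max_attributes):
        refinements = []
        for conds in beam:
            for c in candidate_conditions:
                if c in conds:
                    continue
                new = conds + (c,)
                p, n = probability(records, new, target)
                if n >= min_n:                   # group-size requirement
                    refinements.append((p, new))
        if not refinements:
            break
        refinements.sort(key=lambda t: t[0], reverse=True)
        beam = [conds for _, conds in refinements[:rule_search_width]]
        best = max(best, refinements[0], key=lambda t: t[0])
    return best

data = [
    {"SEX": "F", "AGE": 50, "CHURNED": "Vol"},
    {"SEX": "F", "AGE": 55, "CHURNED": "Vol"},
    {"SEX": "F", "AGE": 30, "CHURNED": "Current"},
    {"SEX": "M", "AGE": 50, "CHURNED": "Current"},
    {"SEX": "M", "AGE": 60, "CHURNED": "Current"},
]
is_female = lambda r: r["SEX"] == "F"
is_older = lambda r: r["AGE"] > 42

p, conds = find_segment(data, [is_female, is_older], "Vol")
print(p, len(conds))
```

In the full algorithm this search is repeated: once a segment is accepted, its records are set aside and the search runs again on the remainder, which is the stepwise shrinking of the training sample described earlier.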
When working in Interactive mode, the Maximum number of alternatives setting is active. When the model is automatically generated, its value is set to 1. Be aware that the Model search width and the Rule search width have a direct impact on the data mining processing time.
Figure 9.13 Decision List Model Tab with Interactive session enabled
Note that we have modified some of the default settings, such as the maximum number of attributes, the maximum number of segments, and the absolute value of the minimum segment size. Click on the Expert tab to review those settings as well. When the model executes, a generated Decision List model is not added to the Models Manager area. Instead, the Decision List Viewer opens, as shown in Figure 9.14.
Click Execute to open the Decision List Viewer In the Decision List Viewer, click on the Preview button to show the Preview pane in the bottom right corner
The Decision List Viewer workspace provides options for configuring, evaluating, and deploying models. The workspace consists of three panes. The Working model pane displays the current model representation. The Preview pane displays an alternative model or model snapshot to compare to the working model. The Manager pane contains the Session Results tab and the Snapshots tab. The Session Results tab displays mining task results as well as alternative models. The Snapshots tab displays current model snapshots (a snapshot is a model representation at a specific point in time). Note: the generated, read-only model displays only the working model pane and cannot be modified. In the working model pane you can see two rules. The first gives information about the training sample. Here the sample has 719 records (Cover (n)), of which 267 meet the target value (Frequency). Consequently, the percentage of records meeting the target value is 37.13% (Probability). The second, called Remainder, is now the first segment in our model and contains the whole training sample. This will be the starting point for building our Decision List model.
Right-click the Remainder segment From the dropdown list select Organize Mining Tasks
The first and only task is the default task as defined in the Decision List node. You have the choice of executing, deleting, or modifying the task, or creating a new data mining task.
Click Execute to run the default task
Figure 9.16 The Session Results Pane
In the Session Results pane a new entry appears after the task has finished. This entry has one main line and two sub-lines. The main line states that the mining task was performed on the first segment (#1) and is completed. The two sub-lines are the two alternative lists that were generated by this data mining task. Recall that for this task the Model search width is set to 2. The first alternative list (1.1 Alternative 1) contains 7 segments (7#) and the model represented by this list has an average probability of 59.36%. The second alternative list has 8 segments and the corresponding model has an average probability of 56.13%. Let's view each of the two alternative lists.
In the Session Results tab click on 1.1 Alternative 1
The result will be displayed in the Preview pane. Notice the Model Summary line at the bottom of the Preview pane.
You will see that these two alternatives differ in their 7th segment. The first has a 7th segment based on SEX and the second on AGE. Another interesting segment is the Remainder. The first alternative has a Remainder of 281 and misses 7 voluntarily leaving customers, whereas the second alternative list has a Remainder of 254 and misses 6 of these customers. Assume that we prefer the first alternative but we want to capture some more of the voluntary leavers in the model. First we must promote the first alternative list to our working model, then from there we will continue the model building process.
In the Session Results tab, right-click on 1.1 Alternative 1
From the dropdown list select Promote to Working Model
Figure 9.18 Promoting an Alternative to the Working Model
Figure 9.19 Gains Chart of Working Model
The results look encouraging on both the training data and the testing data. The segments included in the model are represented by the solid line; the excluded portion (Remainder) is represented by the dashed line. Let's put both the Working model and the Preview model on display in the Gains chart.
Click Chart Options
Click the Preview Model check box
Click OK
Figure 9.20 Gains Chart of Working Model and Preview Model
Although model performance is similar, the Preview model (alternative 2) performs a bit more poorly than the Working model.
Click the Viewer tab
In the Working model pane, right-click on segment SEX = F
Figure 9.21 Options to Modify a Segment in the Model
Choices in the context menu allow you to modify the segments created by the data mining task. For example, you may decide to delete a segment or to exclude it from scoring. You can even edit the segment: for example, you could add an extra condition to the segment SEX = F, or you could modify the lower and upper boundary values of EST_INCOME in segment 6 (Edit Segment Rule).
Model Assessment
We used the Gains chart above to get an overall view of the model. You can also assess the model at the segment level by using the model measures. There are five types of measures available.
From the menu, click Tools→Organize Model Measures
Figure 9.22 Organize Model Measures Dialog
When building a Decision List model, you have five types of measures at your disposal (Display): a Pie Chart and four numerical measures. Each measure has a Type, the Data Selection it will operate on (here Training Data), and a setting that controls whether it will be displayed in the model (Show). The Pie Chart displays the part of the Training sample described by a segment. The other Coverage measure is Cover (n), which shows the number of records from the Training sample in that segment. The Frequency measure displays the number of records in the segment with the target value, Probability calculates the ratio of Frequency over Cover (n), and Error returns the statistical error. You can add new measures to your model by clicking the Add new model measure button. We'll create a measure (call it %Test) showing the probability of each segment on the Testing partition. Furthermore, we will rename Probability to %Train.
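The built-in measures are simple counts and ratios, and can be mirrored outside Clementine. Here is a minimal Python sketch (the record data, field names, and predicates are invented for illustration; this is not Clementine API):

```python
# Hedged sketch of the three segment measures described above.

def segment_measures(records, in_segment, is_hit):
    """records: list of dicts; in_segment/is_hit: predicates on a record."""
    seg = [r for r in records if in_segment(r)]
    cover = len(seg)                            # Cover (n)
    freq = sum(1 for r in seg if is_hit(r))     # Frequency
    prob = freq / cover if cover else 0.0       # Probability
    return cover, freq, prob

# Example: segment SEX = F, target value CHURNED = Vol (invented data)
data = [{"SEX": "F", "CHURNED": "Vol"},
        {"SEX": "F", "CHURNED": "Current"},
        {"SEX": "M", "CHURNED": "Vol"}]
cover, freq, prob = segment_measures(
    data,
    lambda r: r["SEX"] == "F",
    lambda r: r["CHURNED"] == "Vol")
# cover == 2, freq == 1, prob == 0.5
```

The Error measure (the statistical error of the probability estimate) is omitted here, since its exact formula is not given in the text.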
Click the Add new model measure button
Click the dropdown list for Data Selection and change it to Testing Data
Click the Show checkbox for %Test
Double-click in the Name cell for Probability, change its name to %Train, then press Enter
Click OK
Figure 9.25 New Measures Added to the Working Model
Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts, calculate custom measures such as complex profit and ROI measures, and view them in Decision List Viewer while building the model. The following steps are valid only when MS Excel is installed; if Excel is not installed, the options for synchronizing models with Excel are not displayed. Suppose that we have created a template in Excel in which, based on the Probability and the Coverage of a segment, we calculate the amount of loss we will suffer should the customers in a segment actually leave voluntarily.
Click Tools and select Organize Model Measures
Click Yes for Calculate custom measures in Excel (TM)
Click the Connect to Excel (TM) button
Browse to C:\Train\ClemPredModel\ and select Template_churn_loss.xlt
Click Open
Figure 9.26 The Excel Workbook for the Churn Case
The Choose Inputs for Custom Measures window reveals that Excel expects two fields for input: Probability and Cover. In turn, four fields are available to add to your model:

Loss = Probability * Cover * Loss – Cover * Variable Cost
%Loss = 100 * Loss / Sum (Loss), the fraction of the total loss for which a segment accounts
Cumulative = Cumulative Loss
%Cumulative = % Cumulative Loss

By default all are selected. Clicking on an empty Model Measure cell in the dialog opens a dropdown list with all the measures available in your model.
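To make the workbook's role concrete, the four custom measures can be recomputed in a short Python sketch. The exact loss formula lives in the Excel template, so it is an assumption here; `loss_per_churner` and `variable_cost` are invented parameters:

```python
# Hedged sketch of the custom measures: Loss per segment, %Loss,
# Cumulative loss, and %Cumulative, computed over segments in list order.

def loss_measures(segments, loss_per_churner=100.0, variable_cost=0.0):
    """segments: list of (probability, cover) pairs, in list order."""
    # Assumed Loss formula: Probability * Cover * loss - Cover * cost
    losses = [p * n * loss_per_churner - n * variable_cost
              for p, n in segments]
    total = sum(losses)
    pct = [100.0 * l / total for l in losses]            # %Loss
    cum, cumulative = 0.0, []
    for l in losses:                                     # Cumulative
        cum += l
        cumulative.append(cum)
    pct_cum = [100.0 * c / total for c in cumulative]    # %Cumulative
    return losses, pct, cumulative, pct_cum

# Two invented segments: (probability, cover)
losses, pct, cum, pct_cum = loss_measures([(0.8, 50), (0.4, 100)])
# each segment contributes 4000.0 of loss; pct_cum ends at 100.0
```

A list manager could then pick segments until %Cumulative passes a threshold such as 50%, as discussed below.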
Click in the Model Measure cell for Probability and select %Train
Click in the Model Measure cell for Cover and select Cover (n)
Figure 9.28 Mapping Excel Input File to the Decision List Model Measure
Click OK
In the Organize Model Measures window you will see which measures are available for input to your model. By default all are selected.
Deselect the measure %Test (not shown)
Click OK
Figure 9.29 The Decision List Model with External Measures
As you can see, segment 4 is responsible for more than 20% of the total expected loss (reflected by its %Loss measure), and the first four segments for more than 50% (reflected by the %Cumulative measure for the fourth segment). So if the business objective were to select a set of customers for a retention campaign to reduce the expected loss by at least 50%, the list manager would probably choose the first four segments to be scored. If you wish to exclude a segment from the model, you can do so from the context menu.
Right-click on Segment 5
Figure 9.30 Manually Excluding Segments from Scoring Based on External Measures
An interactive Decision List is not a model, but a form of output, like a table or graph. When you are satisfied with the list you have built, you can generate a model to be used in the stream to make predictions.
Click Generate→Generate Model
Click OK in the resulting dialog box (not shown)
Close the Interactive Decision List Viewer window
A generated Decision List model appears in the upper left corner of the Stream Canvas. It can be edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
10.1 Introduction
When you are creating a model, it isn't possible to know in advance which modeling technique will produce the most accurate result. Often several different models may be appropriate for a given data file, and normally it is best to try more than one of them. For example, suppose you are trying to predict a binary outcome (buy/not buy). Potentially, you could model the data with a Neural Net, any of the Decision Tree algorithms, Logistic Regression, or Decision List. In certain situations, you may also be able to use Discriminant Analysis. Unfortunately, this process can be quite time-consuming. The Binary Classifier node allows you to create models for binary outcomes using a number of methods all at the same time and compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. For instance, rather than choose between the quick, dynamic, or prune method for a Neural Net, you can try them all. The node generates a set of models based on the specified options and ranks the candidates based on the criteria you specify. The supported algorithms include Neural Net, Decision Trees (C5.0, C&RT, QUEST, and CHAID), Logistic Regression, and Decision List. To use this node, a single target field of type Flag and at least one predictor field are required. We will continue to use the Churn.txt file which we used in earlier chapters; however, we will have to combine the Voluntary and Involuntary Leavers into a single category in order to use this node. The predictor fields can be numeric ranges or categorical, although any categorical predictors must have numeric storage (not string). If necessary, you can use the Reclassify node to convert them.
Click File→Open Stream, and then move to the c:\Train\ClemPredModel folder
Double-click on FindBestModel.str
Place a Binary Classifier node from the Modeling palette to the right of the Type node
Connect the Type node to the Binary Classifier node
Edit the Derive node named LOYAL
Figure 10.1 Creation of Flag Field Identifying Loyal Customers
In the Derive node we use the field CHURNED to create a new target with the name LOYAL. This target will be a flag, with a value of Leave when CHURNED is not equal to Current; this means that customers who are voluntary or involuntary leavers will have values of Leave. Current customers who stay will have a value of Stay.
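The Derive node's logic amounts to a single conditional. A minimal Python stand-in for the CLEM expression (the category codes "Vol" and "InVol" are invented for illustration; only "Current" is taken from the text):

```python
# Sketch of the LOYAL Derive node: any CHURNED value other than
# "Current" marks a leaver.

def derive_loyal(churned):
    return "Leave" if churned != "Current" else "Stay"

# Voluntary and involuntary leavers both map to "Leave":
labels = [derive_loyal(c) for c in ("Vol", "InVol", "Current")]
# labels == ["Leave", "Leave", "Stay"]
```

Collapsing both kinds of leavers into one flag value is what lets the Binary Classifier node treat this as a two-category problem.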
Close the Derive node
Edit the Binary Classifier node
The maximum number of models listed in the Binary Classifier summary report is 20 by default, but you can increase or decrease this value. The Rank models by option allows you to specify the criteria used to rank the models. Note that the True value defined for the target field is assumed to represent a hit when calculating profits, lift, and other statistics. We have defined Leave as the True category in the Derive node because we are more interested in locating persons who will leave the company than those who will stay. Models can be ranked on either the Training or Testing data, if a Partition node is used.
Click on the Rank models by menu to see the different ranking options
Figure 10.3 Ranking Options Within the Binary Classifier
Overall accuracy refers to the percentage of records that are correctly predicted by the model relative to the total number of records.

Area under the curve (ROC curve) provides an index of model performance. The further the curve lies above the reference line, the more accurate the model.

Profit (Cumulative) is the sum of profits across cumulative percentiles (sorted by confidence for the prediction), based on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top percentile, increases steadily, and then decreases. For a good model, profits show a well-defined peak, which is reported along with the percentile where it occurs. For a model that provides no information, the profit curve is relatively straight and may be increasing, decreasing, or level, depending on the cost/revenue structure that applies.

Lift (Cumulative) refers to the ratio of hits in cumulative quantiles relative to the overall sample (where quantiles are sorted by confidence for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no information, the lift hovers around 1.0.

Number of variables ranks models based on the number of variables used.

The Profit Criteria section is used to define the cost, revenue, and weight values for each record. Profit equals the revenue minus the cost for each record; profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records. Use the Costs option to specify the cost associated with each record; you can specify either a Fixed or a Variable cost. Use the fixed cost option if the costs are the same for each record. If the costs are variable, select the field that holds the cost associated with each record. The Revenue option is used to specify the amount of revenue associated with each record; again, this value can be either Fixed or Variable. The Weight option should be used if your data represent more than one unit; this option allows you to use frequency weights to adjust the results. For fixed weights, you will need to specify the weight value (the number of units per record); for variable weights, use the Field Selector button to select a field as the weight field. Note that model profit will have nothing to do with monetary profit unless you specify actual cost and revenue values. Nevertheless, the defaults will still give you some sense of how good a model is compared to other models. For example, if it costs you 5 dollars to send out a promotion, and you get 10 dollars in revenue for each positive response, the model with the highest cumulative profit would be the one with the most hits.

Lift Criteria is used to specify the percentile used for lift calculations. The default is 30.
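The cumulative lift calculation described above can be sketched in a few lines of Python. This is a simplified illustration of the idea (sort by confidence, compare the top-percentile hit rate to the overall hit rate), not the node's actual implementation; tie-breaking and exact percentile handling may differ:

```python
# Illustrative cumulative lift: hit rate in the top q% of records
# (ordered by confidence) divided by the overall hit rate.

def cumulative_lift(scored, percentile=30):
    """scored: list of (confidence, is_hit) pairs; is_hit is 0 or 1."""
    ordered = sorted(scored, key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ordered) * percentile / 100))
    top_rate = sum(h for _, h in ordered[:k]) / k
    overall = sum(h for _, h in ordered) / len(ordered)
    return top_rate / overall

# Invented scores: top 2 of 5 records are both hits, overall rate is 2/5
lift = cumulative_lift([(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 0), (0.2, 0)],
                       percentile=40)
# lift == (2/2) / (2/5) == 2.5
```

A lift of 2.5 means the top 40% of records, ranked by confidence, contain hits at 2.5 times the overall rate, matching the interpretation given in the text.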
Select Lift from the Rank models by: menu
Figure 10.5 Binary Classifier Expert Tab
The Expert tab allows you to select from the available model types and to specify stopping rules. By default, each of the seven model types is checked and will be used. However, it is important to note that the more models you select, the longer the processing time will be. You can uncheck a box if you don't want to consider a particular algorithm. The Model parameters option can be used to change the default settings for each algorithm, or to request different versions of the same model. For example, you could request that all six of the Neural Net training methods be run in a single pass of the data. In this example, we will request one additional Neural Net model (the Dynamic method) and take the default values for all the other models.
Click on the Model Parameters cell for Neural Net and select Specify
Figure 10.6 Algorithm Settings for Neural Net Models
Click in the Method row in the Options cell and select Specify
Figure 10.7 Neural Net Parameter Editor
At this point, we could check additional Neural Net algorithms. However, in the interest of time, we will stick with just the Quick method.
Click Cancel
Before we move on, note that the Set random seed parameter is set to false. This means that the random seed for the neural net model(s) will be generated anew each time the Binary Classifier node is executed, and this will result in a somewhat different model each time (for each type of neural net requested). If you wish, you can set the Set random seed parameter to true and then specify a seed with the Seed parameter. In the class, we will not do so, to make the example more realistic, so expect your results to differ from those shown below or from the instructor's.
Click OK again to return to the main dialog
Click on the Stopping rules button
Stopping rules can be set to restrict the overall execution time to a specific number of hours. All models generated to that point will be included in the results, but no additional models will be produced. In addition, you can request that execution be stopped once a model has been built that meets all the criteria specified in the Discard tab (see Figure 10.9).
Click Cancel
Click the Discard tab
The Discard tab allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report. You can specify a minimum threshold for overall accuracy, lift, profit, and area under the curve, and a maximum threshold for the number of variables used in the model. Optionally, you can use this dialog in conjunction with Stopping rules to stop execution the first time a model is generated that meets all the specified criteria. In this example, we will not set any discard criteria.
Here we see that the CHAID model is the best based on the Lift statistic. The number to the right of the model type indicates the number of variations you requested with that algorithm; note that there is a 1 to the right of each model. Use the Sort by: option or click on a column header to change the column used to sort the table. In addition, you can use the Show/hide columns menu tool to show or hide specific columns, and the corresponding toolbar button to change the cumulative lift percentile (the default value is 30). If a partition is in use, you can choose to view results for the training or testing partition as applicable. Because we did not use a Partition node, we are displaying the results from the Training set. The next step is to generate one or more of the models listed in the Binary Classifier Report browser. Each generated model can be used as is without having to re-execute the stream. Alternatively, you can generate a modeling node which you can add to your stream.
Check the Generate: box for CHAID
Click Generate→Model(s) to Palette
Figure 10.11 Binary Classifier Generate Options
Close the Binary Classifier browser
Move the generated model to the Stream Canvas
Connect the CHAID model to the Type node
Place a Matrix node to the right of the CHAID model
Connect the CHAID model to the Matrix node
Figure 10.12 Revised Stream with the Addition of the CHAID Model and a Matrix Node
Double-click on the Matrix node
Put LOYAL in the Rows:
Put $R-LOYAL in the Columns:
Click the Appearance tab
Click the Percentage of row option
Execute the Matrix node
Here we see that the CHAID model correctly identified just over 90% of the Leavers. And while it didn't predict current customers with the same degree of accuracy, the 79.7% figure would in all likelihood be very acceptable. In this way, we were readily able to run several models at one time, compare their results, and choose a model to examine further, or use to make predictions on future data.
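What the Matrix node reports here is a cross-tabulation of actual against predicted values, shown as percentages of each row. A small Python sketch of the same computation (the labels and data are invented; this is not Clementine API):

```python
# Row-percentage cross-tabulation of (actual, predicted) label pairs,
# mirroring the Matrix node's "Percentage of row" display.
from collections import Counter

def row_percent_matrix(pairs):
    """pairs: list of (actual, predicted) labels."""
    counts = Counter(pairs)
    rows = {}
    for actual in {a for a, _ in pairs}:
        row = {p: c for (a, p), c in counts.items() if a == actual}
        total = sum(row.values())
        rows[actual] = {p: 100.0 * c / total for p, c in row.items()}
    return rows

# Invented example: 2 of 3 actual leavers predicted correctly
m = row_percent_matrix([("Leave", "Leave"), ("Leave", "Leave"),
                        ("Leave", "Stay"), ("Stay", "Stay")])
# m["Leave"]["Leave"] is about 66.7; m["Stay"]["Stay"] == 100.0
```

The diagonal cells of such a matrix are the per-category hit rates quoted above (90% for Leavers, 79.7% for current customers).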
Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.
response: Response to campaign
orispend: Pre-campaign expenditure
orivisit: Pre-campaign visits
spendb: Pre-campaign spend category
visitb: Pre-campaign visits category
promspd: Post-campaign expenditure
promvis: Post-campaign visits
promspdb: Post-campaign spend category
promvisb: Post-campaign visit category
totvisit: Total number of visits
totspend: Total spend
forpcode: Post Code
mos: 52 Mosaic Groups
mosgroup: Mosaic Bands
title: Title
sex: Gender
yob: Year of Birth
age: Age
ageband: Age Category
1. Begin with a clear Stream canvas. Place an SPSS File source node on the canvas and connect it to charity.sav.
2. Try to predict Response to campaign using all the available model choices. Use the defaults first, and use the same inputs as you did in Chapter 2. Which model is best, and which is worst? You choose the criterion for ranking models, or look at more than one. Which one uses the fewest inputs?
3. Now change some of the model settings on one or more models and rerun the Binary Classifier. Does the order of the models change?
4. Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node to further compare their predictions.
11.1 Introduction
Throughout this course we have looked at several different modeling techniques, including neural networks, decision trees and rule induction, regression and logistic regression, and discriminant analysis. After building a model we have usually performed some form of diagnostic analysis that helps with the interpretation of the model, and we have also done additional analyses to help determine where the model is more and less accurate. In this chapter we develop and extend the model building skills learned so far. The key concept in these examples is that models built with an algorithm in Clementine should usually (unless accuracy is very high and satisfactory) be viewed not as the endpoint of an analysis, but as a way station on the path to a robust solution. There are various methods to improve models, only some of which we discuss here, and you are likely to come up with your own as you become experienced using Clementine. We provide methods for how to improve a model, but there is no one simple answer as to how this should be done, because the appropriate method is highly dependent upon characteristics of the existing model that has been built. Potential things to consider when improving the performance of a model are:

The modeling technique used
The data type of the output field (symbolic or numeric)
Which parts of the model are under-performing, i.e., are less accurate
The distribution of confidence values for the existing model
But sometimes it would be helpful to modify the confidence so that, for the category of interest, a high confidence value means a prediction of leave, and a low confidence value indicates stay. Such a field is a type of score that can be used in choosing cases for future actions: intervention, marketing efforts, and so forth. For the examples in this chapter, we will use the churn data from a previous example. A Derive node has been added to the beginning of the stream to create a modified version of the CHURNED field. We convert CHURNED into the field LOYAL, which measures whether or not a customer remained with the company. LOYAL groups together both voluntary and involuntary leavers into one group, so comparisons can be made with customers who remain loyal. We begin by opening the corresponding stream.
Click File→Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Confidence.str
Both a neural net and a C5.0 model were trained to predict the field LOYAL, using the ChurnTrain.txt data. Their generated models were then added to the stream, connected to the Type node. Two Derive nodes were then added, one for each model, and given names corresponding to the score values they calculate based on the model predictions and confidence values (C5_SCORE and NN_SCORE). Then a Histogram node displays each score with LOYAL as an overlay.

Figure 11.1 Stream Calculating Scores from Predictions and Confidence Values
Let's look at the Derive node that calculates the C5_SCORE field.
Edit the Derive node C5_SCORE
The new field C5_SCORE is created using two formulas. When the C5.0 model predicts that a customer will stay, the confidence score is calculated as:

0.5 − (model confidence / 2)

When the prediction is that a customer will leave (the category of most interest), the score is:

0.5 + (model confidence / 2)

Since the model confidence varies from .50 to 1 for a C5.0 model (it can't be below .50 because the mode is used to make a prediction, and there are only two categories), the first equation creates scores ranging from 0 to .25, and the greater the confidence that a customer will stay, the closer the score will be to zero (0.5 − 1.0/2). Conversely, the second equation creates scores ranging from .75 to 1.00, and the greater the confidence that the customer will leave, the closer the score to 1 (0.5 + 1.0/2).

Figure 11.2 Derive Node Transforming Predictions and Confidence Values into Scores
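The two formulas can be written directly. A minimal Python sketch of the score transformation (in the stream this is a CLEM Derive expression; the function name and prediction labels are illustrative):

```python
# Score transformation for a two-category C5.0 model, whose confidence
# ranges from 0.5 to 1.0: "Stay" maps onto [0, .25], "Leave" onto [.75, 1].

def c5_score(prediction, confidence):
    if prediction == "Stay":
        return 0.5 - confidence / 2
    return 0.5 + confidence / 2  # "Leave", the category of interest

# The endpoints match the ranges given in the text:
# c5_score("Stay", 1.0) == 0.0,  c5_score("Stay", 0.5) == 0.25
# c5_score("Leave", 0.5) == 0.75, c5_score("Leave", 1.0) == 1.0
```

Because the two ranges do not overlap, a single threshold on the score separates predicted stayers from predicted leavers.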
The Derive node for NN_SCORE is similar. However, since confidence varies from 0 to 1 for a neural network model, the resulting modified confidence will vary from 0 to .50 for customers predicted to stay, and from .50 to 1.0 for those predicted to leave, i.e., across the full range from 0 to 1. Before examining the distribution of these new fields, let's review the values of $CC-LOYAL, the actual confidence values for the C5.0 model, overlaid by $C-LOYAL, the predicted value.
Close the Derive node
Add a Histogram node to the Stream canvas, and connect the C5_SCORE Derive node to it
We can see in the figure below that the confidence scores range from .50 to 1.0, but a high confidence doesn't necessarily indicate that we expect a customer to leave or stay, since there are customers in both categories at high confidence values (we would find the same pattern if we used the values of LOYAL, the actual status of customers). For this model, very high confidence is associated with the Stay category, but this sort of pattern would not be found in general.

Figure 11.3 Distribution of Original Confidence Value by Predicted Loyalty
Now we can create the histogram with the modified confidence scores. Well look at the score field for each model in turn.
Close the Histogram window
Execute the Histogram node named C5_SCORE
The distribution of the new C5_SCORE field is bimodal, with scores either near 0 or near 1.0. Those predicted to leave all have scores above .75 (actually above about .90), and those predicted to stay have scores below .25 (actually close to 0).
Figure 11.4 C5_SCORE Distribution Overlaid by Predicted Loyalty
The distribution of NN_SCORE is continuous throughout the range from 0 to 1 because the confidence values for a neural network model also have that range. But the main point, again, is that those customers predicted to leave have scores ranging from .50 to 1, and those predicted to stay have scores below .50.
Figure 11.5 NN_SCORE Distribution Overlaid with Predicted Value of LOYAL
The new score fields can now be used to score a database, as is commonly done in many data mining applications, so that customers can, for example, be selected for a marketing campaign based on their propensity to leave, that is, the value of C5_SCORE or NN_SCORE. This is perhaps the chief advantage of creating the new field, but there is at least one other: a score field can be used in a new model to improve the prediction of LOYAL. The score fields do not perfectly predict the value of LOYAL (remember, we have been using the predicted value of LOYAL, not the actual values, in our histograms; try running the histograms with LOYAL to see the difference), but they apparently have a high degree of potential predictive power. Clearly, this is based purely upon the way that C5.0 or the neural network has differentiated between customers who will leave or stay, but if the model has a high degree of accuracy (which it does in this case), then the score field(s) may act as a very good predictor for another modeling technique. If a more complex model, such as a meta-model, were to be built, information on the score values from a C5.0 model could be used as an input to another modeling technique, such as a neural network. We shall look at this form of meta-modeling in the next section.
We can use C5_SCORE as one of the inputs to a modified neural network model. We know that the C5.0 algorithm can predict loyalty with high accuracy; thus it is hoped that by inputting the C5.0 scores into a neural network analysis, the neural network may be able to correctly predict some of the remaining 11% of cases that the C5.0 model incorrectly classified.
Close the Histogram plot window
Click File→Close Stream and click No if asked to save changes
Click File→Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Metamodel.str
The figure below shows the completed stream loaded into Clementine. It is fairly complex, partly because it retains most of the original stream from the previous example.

Figure 11.6 Meta-Model Stream
A Type node has been inserted after the node that creates C5_SCORE. If we are to build a model based upon results obtained from previous models, each of the newly created fields will need to be instantiated and have its direction set. We will be using both the new score field and the predicted value from the C5.0 model. A decision must be made as to which fields should be inputs to the new model. You can use all the original fields, or reduce their number since the C5_SCORE and $C-LOYAL fields will effectively contain much of the predictive power of the original fields. If the number of inputs wasn't large, then including them along with the two new fields in the new neural network will not appreciably slow training time, and that is the approach we take here. But you may wish to drop at least some of the fields that had little influence on the model, since including all fields can lead to over-fitting.
Execute the Table node attached to the Type node downstream of the C5_SCORE Derive node (to instantiate the Type node), and then close the Table window
Edit the Type node attached to the C5_SCORE Derive node
In this example, we will use all the original input variables as predictors, plus the predicted value of LOYAL from the C5.0 model and the calculated score. The output field remains LOYAL. A Neural Net node has been attached to the Type node (and renamed MetaModel_LOYAL). We've set the random seed to 1000 so that everyone will obtain the same solution, and we use the Quick training method. Let's run the model.
Close the Type node
Execute the neural network MetaModel_LOYAL
Browse the generated model, and click Expand All in the Summary tab
Figure 11.8 Output from Meta-Model for LOYAL
We can see that, not surprisingly, the field $C-LOYAL is by far the dominant input within the model. The predicted accuracy of the model has increased from 89.17% to 90.05%, an improvement of approximately 0.9% on the accuracy of the original C5.0 model. This is admittedly a small improvement, but the original C5.0 model was already very accurate, and every improvement can be important. The Analysis and Matrix nodes connected to the generated meta-model can be used to further analyze the new model. It is worth mentioning that while meta-modeling is a widely accepted way of increasing the accuracy of a model, analysts can often find themselves over-fitting a model if care is not taken. The best way of protecting against this is to ensure that a validation sample of the data is set aside before any modeling takes place. The initial models, along with the more sophisticated meta-models, can then be built on part of the original data and finally tested on the holdout data. The true accuracy of the meta-model should be determined with a validation data sample. As with all holdout samples, one is looking for consistency in results between the training data and the holdout data.
Figure 11.9 displays the error-model stream in the Clementine Stream canvas. The upper stream in the canvas includes the generated model from the neural network and attaches a Derive node to it. The Derive node compares the original target field (LOYAL) with the network prediction of the output ($N-LOYAL), calculating a flag field (CORRECT) with a value of True if the prediction of the neural network is correct, and False if it is not. The first goal of the error model is to use a rule induction technique, which can isolate where the neural network model is under-performing. This will be done by using the C5.0 algorithm to predict the field CORRECT. We chose a C5.0 model because its transparent output will provide the best understanding of where the neural network is under-performing. In order to ensure that the C5.0 model returns a relatively simple model, the expert options have been set so that the minimum number of records per branch is 15. Setting this value is a judgment call based on the number of records in the training data and the number of rules with which you wish to work (another approach would be to winnow attributes, an expert option). A Type node is used to set the new field CORRECT to direction OUT, and the original inputs to the neural network to IN. It would need to be fully instantiated before training the C5.0 model.
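The CORRECT flag amounts to an equality test between the actual and predicted values. A minimal Python sketch (the field names follow the text; the function name and example data are invented):

```python
# Sketch of the Derive node feeding the error model: CORRECT is "True"
# when the network prediction ($N-LOYAL) matches the actual LOYAL value.

def derive_correct(loyal, predicted_loyal):
    return "True" if loyal == predicted_loyal else "False"

# Invented (actual, predicted) pairs:
records = [("Leave", "Leave"), ("Stay", "Leave"), ("Stay", "Stay")]
flags = [derive_correct(a, p) for a, p in records]
# flags == ["True", "False", "True"]
```

The C5.0 error model then treats this flag as its target, learning which kinds of records the network tends to get wrong.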
Predictive Modeling With Clementine

Figure 11.9 Error Modeling from a Neural Network Model
In this example, the model has already been trained and added to the stream, labeled C5 Error Model. Let's browse this model.
Edit the C5.0 generated model node labeled C5 Error Model
We generated a ruleset from the C5.0 model because it makes it easier to view the rules for the False values of CORRECT. Again, we are trying to predict the values of CORRECT; that is, whether or not the neural network was accurate. There are four rules for a False value.
Click the Show all levels button
Click the Show or Hide Instances and Confidence button (so instances and confidence values are visible)
These rules all have reasonable values of confidence, ranging from .667 to .818 (although you might prefer them to be a bit higher). Rule 1 tells us that for male customers who make less than about a minute of international calls per month and almost no long distance calls, and who are single (STATUS=S), we predict the value of CORRECT to be False, i.e., the wrong prediction. Some customers with these characteristics were correctly predicted to leave or stay, but the majority were not (81.8% were a false prediction).
Figure 11.10 Decision Tree Ruleset Flagging Where Errors Occur Within the Neural Network
The next step is to split the training data into two groups based on the ruleset, one for predictions of True and the other for False. We can do this by generating a Rule Tracing Supernode from the Rule browser window and applying a Reclassify or Derive node to truncate the values of the new field to just True and False. We will use the Reclassify node to modify the Rule field so that it only has two categories, which we will rename as Correct and Incorrect. Let's check the distribution of this field.
Close the C5.0 Model browser window Execute the Distribution node named Split
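The Reclassify step described above amounts to collapsing the rule labels produced by the Rule Tracing Supernode into just two categories. A minimal Python sketch of that logic (the rule-label strings here are hypothetical):

```python
# Hypothetical Rule-Tracing output: each record is tagged with the
# rule that fired; rules "for False" predict a wrong network prediction.
rule_fired = ["Rule 1 for False", "Rule 2 for True",
              "Rule 3 for False", "Rule 5 for True"]

# Reclassify-node logic: collapse the rule labels to two categories.
split = ["Incorrect" if "False" in rule else "Correct"
         for rule in rule_fired]
print(split)   # ['Incorrect', 'Correct', 'Incorrect', 'Correct']
```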
The neural network accuracy was 78.70%. The distribution of Split doesn't match this because we limited the records per branch to no fewer than 15, and because the C5.0 model can't perfectly predict when the neural network was accurate. There are clearly enough cases with a value of Correct (992) to predict with a new model, but there are only 116 cases with a value of Incorrect, which is a bit low for accurate modeling. The best solution is to create a larger initial sample so that the 10% or so of cases predicted to be incorrect by the C5.0 model would be represented by a larger number of cases. If that isn't possible, you can use a Balance node to boost the number of cases in the Incorrect category (although this is not an ideal solution). Since this is an example of the general method, we won't do either; instead we will see how much we can improve our model with no special sampling.

Looking back at the stream, we next added a Type node to set the direction of FALSE_TRUE and Split to NONE so that they are not used in the modeling process; we wish to use only the original predictors. The stream then branches into two after the Type node. The upper branch uses a Select node to select only those records with predictions expected to be correct, while the lower branch selects those records with predictions expected to be incorrect. We reemphasize that the split of the training data is not based on the output field. Instead, only demographic and customer status fields were used to create the field Split used for record selection. It is for this reason that this model can, if successful, be used in a production environment to make predictions on new data where the outcome is unknown. After the data are split, the customers for whom we generally made correct predictions are modeled again with a neural network; these cases were modeled well before with a neural network, so the same should be true now.
And, with the problematic cases removed, we expect the network to perform better. For the customer group for which predictions were generally wrong, we use a C5.0 model to try a new technique, since the neural network tended to mispredict for this group. We could certainly try another neural network, however, or any other modeling technique. After the models are created, they are added to the stream, and Analysis nodes are then attached to assess how well each performed. Let's see how well we did.
The neural network model for the group of customers whose original predictions were generally correct is accurate 83.87% of the time, a substantial improvement over the base figure of 78.7%. The C5.0 model is even more accurate, correctly predicting who will leave or stay for 88.79% of the cases that were originally difficult to predict. Clearly, using the errors in the original neural network to create new models has led to a substantial improvement with little additional effort. If you take this approach, you would, as usual, explore each model to see which fields are the better predictors and how this differs in each model.

Figure 11.12 Model Accuracy for Two Groups
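The route-and-remodel pattern just described can be sketched in Python. The sketch below uses a trivial majority-class stand-in for both models (in the stream they are a neural network and a C5.0 tree) and hypothetical field values.

```python
# Hypothetical records carrying the derived Split field and the target.
records = [
    {"Split": "Correct",   "LOYAL": "stay"},
    {"Split": "Correct",   "LOYAL": "stay"},
    {"Split": "Correct",   "LOYAL": "leave"},
    {"Split": "Incorrect", "LOYAL": "leave"},
    {"Split": "Incorrect", "LOYAL": "leave"},
]

def majority(rows):
    # Trivial stand-in model: predict the most common target value.
    labels = [r["LOYAL"] for r in rows]
    return max(set(labels), key=labels.count)

# Select-node logic: route each record to its branch by Split.
groups = {"Correct": [], "Incorrect": []}
for r in records:
    groups[r["Split"]].append(r)

# Fit a separate model on each group, as the two stream branches do.
models = {name: majority(rows) for name, rows in groups.items()}
print(models)   # {'Correct': 'stay', 'Incorrect': 'leave'}
```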
So far so good, but we'd still like to automate the solution so that the data all flow in one stream rather than in two, and we can therefore make a combined prediction for LOYAL on new data. This is easy to do. To demonstrate, we open a stream with a modified version of the current one.
Close the current stream (click File > Close Stream and click No when asked to save changes)
Click File > Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Combined_predictions.str
Switch to small icons (right-click Stream canvas, click Icon Size > Small)
We have combined the two generated models in sequence in this modified stream. You might think that we could simply combine the output from each model, since each was trained on a different group of cases and thus would make predictions only for those cases, but this isn't so. Although each model was trained on only a portion of the data, a generated model node scores every record that passes through it, so each makes predictions for all the cases. (To verify this, execute the Table node.)
But the solution is simple. We know that the value of the field Split tells us which model's output to use, and we do so in the Derive node named Prediction.
Edit the Derive node named Prediction
This node creates a new field called Prediction. When Split is equal to Correct, the value of Prediction is set to the output of the neural network. Otherwise, it is set to the output of the C5.0 model. Thus, we have a new field that contains the combined prediction from the best model for each group of customers.
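The Derive node's logic reduces to a simple conditional. In Python (field and category names as in the text; the prediction values are hypothetical):

```python
def combined_prediction(split, nn_pred, c5_pred):
    # Use the neural network's output where the C5.0 error model
    # expects the network to be correct; otherwise use the C5.0 output.
    return nn_pred if split == "Correct" else c5_pred

print(combined_prediction("Correct",   "stay", "leave"))   # stay
print(combined_prediction("Incorrect", "stay", "leave"))   # leave
```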
Figure 11.14 Derive Node to Create Prediction Field
We know that the baseline neural network had an accuracy of 78.7%, and made 236 errors. We will do much better with these two models. To see how much, we can execute the Matrix node that crosstabulates Prediction and LOYAL.
Close the Derive node Execute the Matrix node named Prediction x LOYAL
The combined models have made only 173 errors, quite an improvement. This translates to an accuracy of 84.38%, an increase of about 5.7 percentage points over the original neural network model.
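A quick arithmetic check ties these figures together; the 1108-record total comes from the 992 Correct plus 116 Incorrect cases seen in the Split distribution earlier.

```python
# Consistency check on the figures quoted in the text.
n_records = 992 + 116       # cases in the Correct and Incorrect groups
baseline_errors = 236       # errors made by the original neural network
combined_errors = 173       # errors made by the combined models

baseline_accuracy = 1 - baseline_errors / n_records
improvement = (baseline_errors - combined_errors) / n_records

print(f"baseline:    {baseline_accuracy:.2%}")   # 78.70%
print(f"improvement: {improvement:.1%}")         # 5.7%
```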
Figure 11.15 Comparison of Prediction and LOYAL
The process of modeling errors need not stop here. Although there will clearly be diminishing returns as the number of errors decreases, it is certainly possible to attempt to separately model the remaining errors from the combined model. At the very least, you would still want to investigate those customers whose behavior remains difficult to model. Eventually you would validate the models with the ChurnValidate.txt dataset. We won't do that here because the stream with the C5.0 model predicting errors in the original neural network has only 33 records, not enough for a reasonable validation. Obviously, the validation dataset should be of sufficient size, just as with the training file. We should also note that this same technique can be used for output fields that are numeric, either integers or real numbers. In that case, the errors are relative, not absolute, but numeric bounds can be specified to differentiate cases deemed to be in error from those with sufficiently accurate predictions. The former group of cases can then be handled in the same manner as above.
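For a numeric target, the CORRECT flag could be derived by thresholding the relative error. A minimal sketch follows; the 10% tolerance is an arbitrary illustration, and the formula assumes a non-zero actual value.

```python
def is_error(actual, predicted, tolerance=0.10):
    # Flag a prediction as an "error" when its relative error
    # exceeds the chosen bound (assumes actual != 0).
    return abs(predicted - actual) / abs(actual) > tolerance

print(is_error(100.0, 108.0))   # False: within 10%
print(is_error(100.0, 125.0))   # True:  off by 25%
```

Records flagged True would then play the role of the False values of CORRECT in the flag-target version of the technique.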
Summary Exercises
A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory. In these exercises we will use the streams created in the chapter.

1. Use the stream Metamodel.str. Rerun the MetaModel_LOYAL neural network model, removing all the original inputs from the model and thus using only the modified confidence score and the predicted value from the C5.0 model. How does this affect model performance? Add this generated model to the stream and validate it with the ChurnValidate.txt data file. Was the model validated, in your judgment?

2. Use the stream Errors.str. Instead of using a C5.0 model to predict cases with proportionally more errors, try another neural network. How well does this perform compared to the C5.0 model? How does it compare to the accuracy of the original neural network? Do you recommend that we use a neural network for these cases?