The purpose of this documentation is to guide people who are new to Inductis modeling approaches through all the processes in a modeling solution, from data collection/preparation, through CART, linear and logistic modeling, validation, and result presentation. While every problem we encounter will have a different solution and different ways of obtaining that solution, this documentation walks the user through a typical total solution, from start to end, of a representative project. We will address the problem of predicting the outcome of a binary event (such as bankruptcy, response to an offer, etc.) as well as that of predicting a continuous variable (such as spend amount).

The level of documentation around each stage is appropriate for someone with a day or two's experience with that stage of the process. By no means, therefore, is this documentation meant to present a comprehensive method for any specific Inductis modeling approach. What it does do is present the basic underlying methods that would be followed, and expanded upon, in an actual live solution. It does this by allowing the user to start at the beginning, seeing how each stage of the process contributes to the overall problem, and how the stages interact and flow together while progressing towards a final solution and its presentation.

The focus will be on the execution of each step of the process and the methods used to verify its integrity, not necessarily on obtaining the best model. Advanced problem-solving strategies and modeling techniques will not be addressed here; additional documentation is forthcoming that will address more advanced techniques in each area of the process independently.
Data Collection
This is often a painstaking and lengthy part of a client engagement. Since we often rely on the client to obtain the data for us, this can be a time consuming process. It is therefore more important that we understand the problem at hand and address any related data issues as soon as possible in order to avoid lengthy data request repetitions. For the purpose of this exercise, the data will be given to you at the beginning of the process. For some best practices in the area of data collection, please see the DATA REQUEST documentation on the MRL in the Analytical Methodology/Methods/Internal/Other subfolder.
data acct_data;
   * acct_number is read as character to preserve leading zeroes ;
   input @1 cust_number 7. @9 acct_number $9. @19 acct_balance 7.2;
   datalines;
1234567 002816383 2245.65
2345678 442164518 35.89
2164518 209273502 401.23
1234567 312345678 421.34
2164518 097386283 0.00
2164518 476429482 45.98
;
run;
proc sort data = cust_demo_data nodupkey;
   by cust_number;
run;

proc sort data = acct_data nodupkey;
   by cust_number acct_number;
run;

data cust_acct_merged;
   merge cust_demo_data (in=in1)
         acct_data      (in=in2);
   by cust_number;
   merge_check = compress(in1 !! in2);
run;

proc freq data = cust_acct_merged;
   table merge_check / list missing;
   title "Check how the customer and account level files merged";
run;
The IN= data set option names a new variable whose value indicates whether that input data set contributed data to the current observation: within the DATA step, the value of the variable is 1 if the data set contributed to the current observation, and 0 otherwise. In the code above, the values of the IN= variables are concatenated into a single variable, merge_check. The IN= variables in1 and in2 are dropped at the end of the data step, so if their values need to be saved they must be explicitly stored in another variable, as is done with merge_check above. A PROC FREQ is then run on the merge_check variable so the user can analyze how the BY variables matched across the merged data sets. A simple spot check of a handful of records pre- and post-merge, to determine whether the merge had the desired results, is also a very useful quality control check.

If required, additional work can be done to transpose this file so that there is a single record per customer, with account information recorded side by side. Alternatively, a roll-up approach can be taken. An example where all account balances are rolled up to the customer level is demonstrated below with both a PROC SUMMARY approach and a data step approach. These are just two of the approaches which can be used to perform such a task.
proc summary nway data = cust_acct_merged missing;
   class cust_number;
   var acct_balance;
   output out = cust_rolled_up sum=;
run;

proc print data = cust_rolled_up;
   title "Data rolled up with proc summary";
run;

data manual_roll_up;
   set cust_acct_merged;
   by cust_number acct_number;
   retain acct_balance_sum;
   if (first.cust_number) then acct_balance_sum = acct_balance;
   else acct_balance_sum + acct_balance;
   if (last.cust_number);
   drop acct_number acct_balance merge_check;
run;

proc print data = manual_roll_up;
   title "Data rolled up with manual data step";
run;
***Output*******;
Data rolled up with proc summary

          cust_                        acct_
   Obs    number    _TYPE_   _FREQ_    balance
    1     1234567      1        2      2666.99
    2     2164518      1        3       447.21
    3     2345678      1        1        35.89
Data rolled up with manual data step

                       acct_
          cust_        balance_
   Obs    number       sum
    1     1234567      2666.99
    2     2164518       447.21
    3     2345678        35.89
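The same customer-level roll-up can be sketched outside of SAS. The following Python snippet (illustrative only, using the sample account data from above) accumulates account balances per customer, mirroring what the PROC SUMMARY and the retain-based data step both produce:

```python
from collections import defaultdict

# Roll account balances up to the customer level -- the same logic as the
# PROC SUMMARY and retain-based data step above, sketched in Python.
accounts = [
    (1234567, "002816383", 2245.65),
    (2345678, "442164518", 35.89),
    (2164518, "209273502", 401.23),
    (1234567, "312345678", 421.34),
    (2164518, "097386283", 0.00),
    (2164518, "476429482", 45.98),
]

balance_by_cust = defaultdict(float)
for cust_number, acct_number, acct_balance in accounts:
    balance_by_cust[cust_number] += acct_balance

print({c: round(b, 2) for c, b in sorted(balance_by_cust.items())})
# {1234567: 2666.99, 2164518: 447.21, 2345678: 35.89}
```

Note that the totals match the two SAS outputs above, which is exactly the kind of cross-check worth doing after any roll-up.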
increases in data size. (But remember that an EDD on a sample will not necessarily give you the true maximum and minimum values of the original population.)
libname macro '<Project Folder>\10.Toolkit\05.TL_EDD_Code';
filename mcrsrc catalog 'macro.std_code_lib_db2';
%include mcrsrc(edd_gen2);
%include mcrsrc(edd_num);
%include mcrsrc(edd_char);

%edd (libname=<libname>, dsname=<dataset>,
      edd_out_loc=<EDD Path>\<EDD Name>, NUM_UNIQ=Y);
The result is an Excel spreadsheet which gives you information on the variables in your dataset, including:
- Variable name
- Format
- Number of unique values
- Number of missing values
- Distribution (proc means output for numeric variables; highest and lowest frequency categories for a categorical variable)
  o Numeric variables: standard numerical distribution including the mean, min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
  o Categorical variables: top 1 gives you the level which occurs the most, top 2 the second most, etc.; bot 1 is the least frequently occurring level, bot 2 the second least frequently occurring, etc. The number following the double colon gives you the exact number of times that level occurs. If nothing precedes the double colon, then that is the count of the number of missings.
An example is included below:
        var_                     mean_or       p99_or
Obs     length   n_pos   nmiss   _top1         _bot2
 1        8        0       0     0.12648       0.769
 2        8        8       0     0.06473       0.285
 3        8       16       0     714.42876     756
 4        8       24       0     656.30054     755
 5        8       32       0     100.50368     1710.922
 6        8       40       0     305.97356     2552.431
 7        8       48       0     0.11786       1.073
 8        1       56       0     2::524952     7::5468

(Intermediate distribution columns have been deleted from this excerpt.)
Basic things that should be looked for when first assessing an EDD include:
- Are data formats correct? Are numerical variables stored as text? Do date variables need to be converted in order to be useful?
- Which variables have missing values? Outliers?
- Do any variables exhibit invalid values (many 9999999, 101010101 values, etc.)?
  o If you have a data dictionary provided by the client, there may be information on invalid values, so this would be the first thing to check.
  o Invalid values will often be obvious in the Inductis EDD because of the way it is set up, and the way invalid values are defined. They may occur often enough that they take up a large swath of the distribution columns. They are often large or small enough that they easily present themselves at either the min or max end of the distribution. Invalid value flags generally overwrite other data values and are usually defined to stand out, not blend in. For example, they are often made to take much larger values, such as 99999999999, than any normal value of that variable.
- Are any distributions clearly out of line with expectations?
Formatting Issues
If there are numeric variables which have been stored as text strings, you can convert them to numeric variables using code similar to below. One can define a separate variable and keep both the character and numeric representations, or replace the character representation with a numeric. Take care that you do not truncate any numbers in this conversion by paying attention to the variable length, which is given to you in the EDD. In general, it is wise to only restrict the length of a numeric variable if saving space in your dataset is very important, as the risks of truncating some significant digits of some value of the variable are usually not worth saving a relatively small amount of space. In the below example the length of the HHNPERSN numeric conversion variable is restricted to three (the minimum length of a numeric variable) to illustrate how to define the length of a new variable if needed. In this case the analyst would have checked that none of the values exceed the maximum size allowed for an integer of length three.
data char_to_numeric (rename = (HHNPERSN_Numeric = HHNPERSN));
   length HHNPERSN_Numeric 3;
   format HHNPERSN_Numeric 2.;
   HHNPERSN = "3";
   HHNPERSN_Numeric = input(HHNPERSN, 2.);
   drop HHNPERSN;
run;
A table showing the maximum size of integers allowed with a specific length is also shown below. Hence, if for some reason you needed to store a 9-digit account number as a numeric, it would have to have a length of at least 6 to ensure no truncation takes place.
Length in Bytes    Largest Integer Represented Exactly    Exponential Notation
      3                            8,192                         2^13
      4                        2,097,152                         2^21
      5                      536,870,912                         2^29
      6                  137,438,953,472                         2^37
      7               35,184,372,088,832                         2^45
      8            9,007,199,254,740,992                         2^53
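The pattern in the table is regular: each extra byte of length adds 8 bits of mantissa, so the largest integer represented exactly at length L works out to 2^(8L - 11). A short Python sketch (the formula is inferred from the table values, not from SAS documentation):

```python
# Largest integer a SAS numeric of length L bytes can represent exactly:
# each extra byte adds 8 bits of mantissa, giving 2 ** (8*L - 11).
# (Formula inferred from the table above; e.g. L=3 -> 8,192 = 2^13.)
def largest_exact_int(length_bytes):
    return 2 ** (8 * length_bytes - 11)

for L in range(3, 9):
    print(L, largest_exact_int(L))
```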
As a final note on the length of numeric variables, if you try to merge two datasets by a common numeric variable, but whose lengths were defined differently in each dataset, you may see a warning in the log file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may cause unexpected results.
It is generally not wise to overlook log file warnings unless you have a very good reason to. A short data step redefining the length of the shorter variable in one of the datasets before merging will suffice to get rid of the warning, and could reveal important data problems, such as information being truncated from some values of the BY variables in the data set with the shorter length. A case where the analyst may wish to change a numeric variable to character is discussed below. In this example, the analyst has decided he/she would prefer to treat the variable Reward_Level as a categorical variable. To avoid any later confusion, it is changed to a character variable up front, so it cannot be used as a numeric later, and no modeling insight can be mistakenly derived from the implicit scale of the numeric. The code is shown below, which closely resembles the character-to-numeric version illustrated above.
data numeric_to_char (rename = (Reward_Level_Char = Reward_Level));
   length Reward_Level_Char $1;
   format Reward_Level_Char $1.;
   Reward_Level = 3;
   Reward_Level_Char = put(Reward_Level, 1.);
   drop Reward_Level;
run;
Formatting issues may require other action by the analyst. One example is variable creation in order to take advantage of variables in date format. For example, we may want to take an application date/time stamp variable and define a variable application_month, based only on the month, for grouping or other purposes. We may also want to change a date format into a numeric variable in order to take advantage of the information and be able to use it in the modeling process. For example, the code below illustrates how a numeric year variable can be created out of a character date value using the substr and input functions.
data date_dat;
   format char_format_start_date $10.;
   char_format_start_date = "FEB2003";
   start_date_year = input(substr(char_format_start_date, 4, 4), 4.);
run;
There are many varieties of date functions and formats, and ways to manipulate them. A primer on such methods can be found in the Dates, Times, and Intervals chapter of the SAS online documentation - http://v8doc.sas.com/sashtml/ .

Zip Codes and Other Problems with Leading Zeroes

Other formatting problems can occur when switching a variable from numeric to character or vice versa. For example, account numbers and especially zip codes can contain leading zeroes when stored in character format. Problems can occur when one dataset has the data in character format and the other in numeric format: if the variable is translated from numeric to character in one dataset, some values will not match correctly if the leading-zero issue is not taken into account. An example is listed below where numeric zip code and account variables are translated into character variables with the leading zeroes using the z format. There is also an example where a character variable without leading zeroes is translated into a character variable with the leading zeroes:
data leading_zero_data;
   Zip_Numeric   = 7078;
   Zip_Character = put(Zip_Numeric, z5.);

   Acct_Numeric   = 12345678;
   Acct_Character = put(Acct_Numeric, z9.);

   * Convert a 7-digit character variable to a 9-digit character
     variable with leading zeroes ******;
   Seven_Digit_Acct_Character = "2345678";
   Nine_Digit_Acct_Character  = put(input(Seven_Digit_Acct_Character, 9.), z9.);
run;

proc contents data = leading_zero_data; run;
proc print    data = leading_zero_data; run;
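For analysts checking such keys outside of SAS, the same zero-padding can be illustrated in Python, where zfill plays the role of the z format (illustrative sketch only):

```python
# Python analogue of the SAS z format: zero-pad on the left so that
# character keys match across datasets.
zip_numeric = 7078
zip_character = str(zip_numeric).zfill(5)       # "07078"

seven_digit_acct = "2345678"
nine_digit_acct = seven_digit_acct.zfill(9)     # "002345678"

print(zip_character, nine_digit_acct)
```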
Capping / Flooring
Variable capping and/or flooring can be useful with certain modeling techniques, particularly regressions, where coefficient estimation can be affected by data outliers. What is an outlier? The definition of an outlier really depends on the specific dataset and the objective of the modeling. Generally, an outlier is a data value that is extreme when compared with other values in a data set or compared to the distribution of values. For example, if the maximum value is much further from the 99th percentile value than the distance between the 98th and 99th percentiles, then that maximum might be considered an outlier. One
concern in including outliers in the modeling dataset is the high leverage or influence they would have over model coefficients. The standard deviation of the variable about its mean can also be unduly affected. At Inductis, we generally only alter a variable's distribution above the 99th percentile and/or below the 1st percentile, and only if it appears the values in these ranges vary substantially from the rest of the distribution. Some judgment is required here. Note that many of our variables do not require flooring, as they have natural floors, often 0, where the data below the 1st percentile are clearly not outliers. An exception to the 99th/1st percentile capping/flooring rule of thumb is when there are a large number of invalid values which take up much more than a single population percentile. These need to be treated in some fashion (deleted or replaced) before assessing the outlier question. Another exception is the presence of large numbers of zero values (e.g. for account spending). These have the potential to add extreme skew or bimodality to a variable that could bias model estimation. When capping/flooring is required, the analyst treats the outlying data either by truncating the distribution (excluding these records from further analyses and modeling) or by reining in the data, reassigning the values so that their distance from the percentile in question is significantly reduced, often with a root (fractional power) transformation. There is an .xls sheet with some code that facilitates the writing of flooring and capping code in SAS. The formulas and actual view of these cells in the Excel sheet are shown below:
[Excel screenshots: Capping and Flooring formula cells]
Some example code, which was written using the .xls sheet, is shown below, where the data below the 1st percentile is reined in by a cube-root factor and the data above the 99th percentile by a square-root factor.
** cap/flooring **********;

** flooring *******;
if (Var4 <= -0.00743) then Var4 = .038077*Var4**(1/3); else Var4 = Var4;
if (Var5 <= -0.01133) then Var5 = .050445*Var5**(1/3); else Var5 = Var5;
if (Var6 <= -0.67573) then Var6 = .770044*Var6**(1/3); else Var6 = Var6;

** capping ******;
if (Var1 >= 1875)    then Var1 = 43.301270*Var1**(1/2); else Var1 = Var1;
if (Var2 >= 2343.75) then Var2 = 48.412292*Var2**(1/2); else Var2 = Var2;
if (Var3 >= 0.13152) then Var3 = .362657*Var3**(1/2); else Var3 = Var3;
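The constants in the generated code are chosen so the transform is continuous at the threshold: for a square-root cap at the 99th percentile p99, the multiplier is sqrt(p99) (e.g. 43.301270 = sqrt(1875)). A minimal Python sketch of the idea (function and variable names are mine, for illustration):

```python
import math

def cap_sqrt(x, p99):
    """Rein in values above the 99th percentile with a square-root
    transform; c = sqrt(p99) keeps the mapping continuous at p99,
    matching the constants in the generated SAS code above."""
    if x >= p99:
        return math.sqrt(p99) * math.sqrt(x)
    return x

print(round(cap_sqrt(1875.0, 1875.0), 6))   # 1875.0 -- the threshold itself is unchanged
print(round(cap_sqrt(7500.0, 1875.0), 6))   # 3750.0 -- 4x the cap is pulled in to only 2x
```

The appeal of this over hard truncation is that the ordering of values above the percentile is preserved while their leverage is sharply reduced.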
Modeling
Profiling and Variable Creation
This part of the modeling process is often the key to obtaining the best model. Mining the data for trends and multivariate relationships, and deriving intelligent variables from this information, can add a lot to the final modeling solution. Sometimes tools such as CART and/or MARS do a very good job of identifying such relationships and give us the blueprints for deriving additional variables for our analysis. Other times the analyst may need to mine the data using more hands-on approaches with tools available in Excel, such as pivot tables. Bivariate plots and proc univariate are also useful techniques in profiling and variable creation (please see the document IND032_20050531_EDA_KM2 on the MRL in the Analytical Methodology/Methods/Internal/Other subfolder). These topics fall into the category of obtaining the best model, and as that is not the focus of this document, the techniques will not be addressed here. You will see, however, how CART can discover some variable relationships and contribute useful information to the solution.
The analyst should check that the modeling and validation sets have the same general characteristics, as they should if there is plenty of data and the split was in fact random. This can be assessed by carefully examining and comparing separate EDD outputs on the two datasets. The analyst should specifically check that the distribution of the dependent variable is similar across the sets. For a binary dependent variable, there is a more advanced macro available, gbsplit, which will split your data into modeling and validation data sets while ensuring the event rate stays almost identical in each and the goal of random splitting is achieved. It is also important to check for similar distributions of the most important independent variables. Note that the splitting and the profiling/variable creation steps can be done in either order: when data is abundant, the analyst may often prefer to perform profiling and variable creation on the modeling set only, allowing the validation set to be completely independent from the modeling procedure.
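The event-rate-preserving split that gbsplit performs can be sketched as a stratified random split: shuffle events and non-events separately, then take the same fraction of each. A minimal Python illustration (not the actual macro; all names are mine):

```python
import random

def stratified_split(rows, is_event, frac=0.7, seed=42):
    """Split rows into modeling/validation sets so the event rate is
    (almost) identical in each -- the idea behind the gbsplit macro."""
    rng = random.Random(seed)
    model, valid = [], []
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    for group in (events, non_events):
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = int(round(frac * len(shuffled)))
        model.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return model, valid

# 10% event rate overall; both halves preserve it exactly here
rows = [{"id": i, "bad": i % 10 == 0} for i in range(1000)]
model, valid = stratified_split(rows, lambda r: r["bad"])
print(len(model), len(valid))   # 700 300
print(sum(r["bad"] for r in model) / len(model),
      sum(r["bad"] for r in valid) / len(valid))   # 0.1 0.1
```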
CART Modeling
When the problem at hand involves classifying observations into distinct levels of the dependent variable, classification techniques and tools, such as CART can be used effectively. The simplest situation is a modeling problem where the dependent variable is binary. Other modeling approaches such as logistic and probit methods, also tackle the binary dependent variable problem, but CART is designed to search for meaningful
relationships between independent variables with respect to the dependent variable, and is better suited to find more complex interactions in the variables with little user input. Discriminant Analysis is a technique which competes with CART. (See http://www.statsoft.com/textbook/stathome.html for an introduction to Discriminant Analysis)
So what is CART?
- Produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively
- Trees are formed by a collection of rules based on values of certain variables in the modeling data set
  o Rules are selected based on how well splits on variable values can differentiate observations with respect to the dependent variable
  o Once a rule is selected and splits a node into two, the same logic is applied to each child node (i.e. it is a recursive procedure)
  o Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met
- Each branch of the tree ends in a terminal node
  o Each observation falls into one and exactly one terminal node
  o Each terminal node is uniquely defined by a set of rules
  o For classification trees, terminal nodes are classified along one level of the dependent variable, by comparing the concentration of dependent variable levels in the terminal node to that of the entire population (this assumes CART default settings; alteration of the PRIORS option can affect node classification. Please see the tutorial mentioned below for more information)
[Tree diagram excerpt: Node 4 (N = 2774) splits on BUREAU_VAR4 <= 707.500; Terminal Node 1 (N = 746) is reached via ACCOUNT_USAGE <= 21.875]
There is a tutorial for CART which can be accessed through the CART section of the Windows program menu on the servers. It will show you how to get going with basics such as opening your modeling dataset, selecting the target (dependent variable) and the predictors, and running the model. It will also show you how to select specific options for running the CART tree. Some of those options are described briefly below:
- Select the splitting rule from the Method tab
- Select how the results are tested from the Testing tab
- Specify the minimum number of cases at the parent or terminal nodes from the Advanced tab
- Specify a penalty on each variable from the Penalty tab
- Specify the cost of misclassification from the Cost tab
- Specify the desired reference population frequencies (Priors)
For now we will be content with selecting the dependent and independent variables, what type of tree (classification or regression), running the model accepting default options, and interpreting the results. Once a model is run, a Tree Navigator will pop-up in the window. A Tree Navigator displays a high-level picture of the model built.
We can use the curve under the tree to prune the tree by reducing the number of nodes and making the tree simpler:
The relative cost is the misclassification due to the tree in question. This tends to decrease as the number of nodes in a tree increases, but can actually begin to go up for a tree with too many nodes. Depending on the goal of the analysis, a simpler tree may or may not be more suitable than a more complex one; as a rule of thumb, simpler is generally better. One option that should be set if you have a large number of observations is the minimum size of parent and terminal nodes. These put a lower bound on the size of nodes that can split and the size of the nodes they can split into; without them, a first-run tree can turn up too many nodes to be useful. While the user can always prune the tree back by clicking on the curve under the navigator tree, even this can be difficult if there are many points on the curve (example below):
By right-clicking a node in navigator view, and selecting rules, CART will give you C code which can be applied to drill down to that node. If that node is a terminal node, it can be used to determine if an observation belongs to that terminal node. An example for a hypothetical terminal node 7 follows:
/* Terminal Node 7 */
if (
   var1  > 2050    &&
   var6  <= 712.5  &&
   var13 > 0.3735  &&
   var5  <= 2.16
) {
   terminalNode = 7;
   class = 0;
}
Other things that can be discovered in the navigator file include:
- The tree diagram graphically portrays the splitting power of each predictor variable, the complexity of the model, its classification accuracy, and a quick definition of the interesting segments.
- The curve under the tree diagram reports on the overall accuracy of the tree and tells us how we might trade off accuracy for a smaller, simpler tree.
- By drilling down to each node, we can get the complete tabulation of the target variable, and find the true proportions of events/non-events in each node.
- By clicking on the rules, we get the rules, written in C code, that define the node. This code can be used to create new variables for use in subsequent regression modeling. With a little editing, it can be converted into SAS code.
- One of the most important outputs is the Summary Reports tab, which includes tabs on the list of the independent variables ranked by their relative importance, the gains chart, and terminal node information.
  o Click the Summary Reports tab, and then Variable Importance, to see the variable importance list. This information provides a basis for us to exclude irrelevant or less important variables and therefore reduce the dimensionality and speed up subsequent processes. It is also useful for identifying the most promising independent variables for use in another modeling tool such as logistic regression.
  o The Gains Chart: the performance of the tree is assessed by the gains chart and table, the prediction success table, and the misclassification tables.
- The Score function can be used to score another dataset
  o If there is a separate validation file, it can be scored to see if the composition of the nodes is similar to that in the modeling file
One way to counter this is to start submitting command files to CART in batch, similar to how you would do it for a SAS job, in lieu of using all the functionality of the GUI. This way you can save versions of your code, add comments, and you will have a history of the tweaks and the results at each stage to refer back to. It takes a little more effort up front, but could possibly save lots of time in the end. An example of a CART command file is listed below. Once a command file has been submitted and run, the corresponding navigator file can be opened and analyzed. If you are not sure of how to add the code for a certain option, you can use the GUI to select certain options, run the model again, and then select Open Command Log under the View menu and you can see what code CART generates to run the model with the tweaks you have just selected. You can subsequently add the necessary code to your command file, save as a different version, and resubmit. For more on the topic of command files seek the help of someone more experienced in CART who has used them.
USE "<Modeling Set Location>\<modeling set>"
GROVE "<Navigator Output Location>\<Grove Name>.GRV"
LIMIT ATOM = 400
LIMIT MINCHILD = 200
MODEL dep_var
KEEP dep_var, var1, var2, var3, var5, var6, var7, var9, var10,
     var11, var12, var13, var14, char_var1$, char_var2$, char_var3$
CATEGORY acct_bad, char_var1$, char_var2$, char_var3$
AUXILIARY
PRIORS EQUAL
ERROR CROSS = 10
METHOD GINI POWER = 0.0000
PENALTY
BUILD

REM 01. First run through with exploratory method
REM 02. Trying misclassify cost and vfold validation
REM 03. Removing Inductis Score
REM 04. Removing misclassification costs
REM 05. Remove var4 and var8
Multivariate Regression
Regression analysis is the analysis of the relationship between one variable and a set of other variables. The relationship is expressed as an equation that predicts a response variable (also called the dependent variable) from a function of predictor variables (also called independent variables) and parameters. The parameters are adjusted so that a measure of fit is optimized. A regression equation with two predictors might look like this:
y = β0 + β1X1 + β2X2 + ε
where β0 is the intercept and ε is the residual error (the portion of variability in the response variable that is not explained by the model). Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions. There are many possible variations on the basic regression equation, driven primarily by the nature of the predictors and response variable and the general relationship between them. For example, the following equations show interaction terms and a polynomial relationship:
y = β0 + β1X1 + β2X2 + β3X1X2 + ε
y = β0 + β1X1 + β2X1² + β3X1³ + ε
Multivariate regression is generally used when the response variable is continuous with an unbounded range. Often we use it for modeling variables that are strictly positive (hence bounded at zero). Sometimes, we need to make special adjustments to the model if the distribution is significantly censored at zero (i.e. lots of zero spend amounts). While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, many of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques better suited for this type of analysis. If the dependent variable is binary, one of those superior methods is logistic regression which is discussed later in this document. Before running a regression model in SAS you first need to check that your dataset is ready for regression modeling. SAS will discard any observations with missing values for any of the variables the user specifies to be considered for the model, so it is necessary that general modeling data preparation and missing imputation has occurred. Also, PROC REG will only accept numeric variables as predictors. If you have a categorical variable, then dummies need to be created for all but one level of the variable. The level that is left out is assumed to be the baseline for that variable, and the coefficients of all other
levels' dummies are measured relative to that baseline. For example, if the categorical variable cat_var1 takes values A, B, and C, we would define two dummy variables for the multivariate regression as follows, and use them in lieu of the original variable:
cat_var1_B = (cat_var1 = "B"); cat_var1_C = (cat_var1 = "C");
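The same baseline-omitted dummy coding can be sketched in Python (illustrative only; function and key names are mine):

```python
def dummy_code(value, levels, baseline):
    """One 0/1 dummy per non-baseline level; the baseline level is
    represented by all dummies being 0 -- the same scheme as the SAS
    boolean expressions above."""
    return {f"cat_var1_{lvl}": int(value == lvl)
            for lvl in levels if lvl != baseline}

print(dummy_code("B", ["A", "B", "C"], baseline="A"))
# {'cat_var1_B': 1, 'cat_var1_C': 0}
print(dummy_code("A", ["A", "B", "C"], baseline="A"))
# {'cat_var1_B': 0, 'cat_var1_C': 0}
```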
Note that in more recent versions of SAS it is not necessary to create your own dummies, but we introduce the technique here as it is still necessary for the version of SAS currently running on most Inductis servers. To run a multivariate regression in SAS, and estimate the coefficients of the model, one would run code similar to the below:
proc reg data = <libname>.<modeling dataset>;
   model dependent_variable = var1 var2 var3 var4 var5
                              var6 var7 var8 var9;
run;
Among the output in the .lst file you will see a table like the following:
The REG Procedure
Model: MODEL1
Dependent Variable: Inductis_Score

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              9                                     5.28    <.0001
Error             71
Corrected Total   80

(Sum of Squares and Mean Square values omitted from this excerpt.)

Root MSE          1.55169    R-Square    0.6205
Dependent Mean    8.93724    Adj R-Sq    0.5029
Coeff Var        17.36207
Parameter Estimates

                   Parameter      Standard
Variable     DF    Estimate       Error         t Value   Pr > |t|
Intercept     1     0.22626       0.01531        14.78     <.0001
var1          1     0.19584       0.01856        10.55     <.0001
var2          1    -0.00014678    0.00002044     -7.18     <.0001
var3          1     0.00002783    0.00000660      4.22     <.0001
var4          1    -0.00115       0.00051301     -2.24     0.0250
var5          1    -0.00001987    0.00000549     -3.62     0.0003
var6          1     0.00012041    0.00008330      1.45     0.1484
var7          1     0.00186       0.00073916      2.51     0.0120
var8          1     0.00605       0.00137         4.41     <.0001
var9          1    -0.00915       0.00234        -3.92     <.0001
The Parameter Estimate column gives the associated parameter estimate for each variable, next to its Standard Error. The t value is used to calculate the coefficient's p-value, a measure of whether it is significantly different from 0. With no other options selected, SAS will estimate the full model, meaning all variables will be included, regardless of whether their coefficients are significantly different from 0. For example, var6 above would not normally be considered significant given the typical significance level of 0.05 used in the variable selection techniques, which are discussed below.
- Forward: starts with no variables and, at each step, adds the one which, when estimated with the variables already in the current model, has the lowest p-value, provided it is below a certain threshold
  o The threshold can be changed via the option sle (significance level for entry); the default for forward selection is 0.50
  o Options to use: /selection=forward sle=0.15;
- Backward elimination: starts with all possible predictors and removes the one with the highest p-value at each step, as long as it is above a certain threshold
  o The threshold can be altered via the option sls (significance level for staying); the default for backward elimination is 0.10
  o Options to use: /selection=backward sls=0.2;
- Stepwise: a combination of forward and backward elimination; adds variables one at a time like forward, but at each step reassesses every variable in the model to ensure it still meets the sls criterion
  o Options to use: /selection=stepwise sle=0.15 sls=0.2;
After using one of these methods, the user will have a feel for which variables are important in the model. Judging by coefficient size is inappropriate, as size is affected by variable scale and is not directly correlated with variable importance. The best option for the analyst to assess variable importance within the model is to include the option STB in the model statement. The output will then include a Standardized Estimate column, as shown below. Variable importance within that particular model can then be assessed by comparing the absolute magnitudes of the standardized coefficients. For example, in the output below var4 is the most important based on the Standardized Estimates, although that would not have been clear from the raw Parameter Estimates.
The REG Procedure
Model: MODEL1
Dependent Variable: Inductis_Score

                      Analysis of Variance

                              Sum of       Mean
Source             DF        Squares     Square    F Value    Pr > F
Model               9                                  5.28    <.0001
Error              71
Corrected Total

Root MSE         1.55169    R-Square    0.6205
Dependent Mean   8.93724    Adj R-Sq    0.5029
Coeff Var       17.36207

                      Parameter Estimates

                  Parameter     Standard                       Standardized
Variable    DF     Estimate        Error   t Value   Pr > |t|      Estimate
Intercept    1      0.22626      0.01531     14.78     <.0001             0
var1         1      0.19584      0.01856     10.55     <.0001       0.05410
var2         1  -0.00014678   0.00002044     -7.18     <.0001      -0.04466
var3         1   0.00002783   0.00000660      4.22     <.0001       0.02529
var4         1     -0.00115   0.00051301     -2.24     0.0250      -0.11584
var5         1  -0.00001987   0.00000549     -3.62     0.0003      -0.01924
var6         1   0.00012041   0.00008330      1.45     0.1484       0.07471
var7         1      0.00186   0.00073916      2.51     0.0120       0.01367
var8         1      0.00605      0.00137      4.41     <.0001       0.02281
var9         1     -0.00915      0.00234     -3.92     <.0001      -0.02569
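As an illustration of what the STB option computes, a standardized coefficient rescales each raw estimate by the ratio of the predictor's standard deviation to the dependent variable's standard deviation. The sketch below (Python, simulated data, hypothetical variable names) shows how a predictor with a tiny raw coefficient can still carry real importance once standardized:

```python
import numpy as np

# Illustrative data: two predictors on very different scales.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 1, n)       # small-scale predictor
x2 = rng.normal(0, 1000, n)    # large-scale predictor
y = 2.0 * x1 + 0.0005 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # raw OLS estimates

# Standardized estimate: beta_j * std(x_j) / std(y),
# which is what the STB option reports for each predictor.
stb = beta[1:] * X[:, 1:].std(axis=0, ddof=1) / y.std(ddof=1)

print("raw betas:    ", beta[1:])  # not comparable across scales
print("standardized: ", stb)       # comparable measures of importance
```

Here x2's raw coefficient is tiny only because of its scale; the standardized estimates put both predictors on a common footing.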
Multivariate Regression and R2

The R2 statistic is a measure of how well the model predicts the dependent variable based on the values of the independent variables. It measures the predictive power of the model: the proportion of the total variation in the dependent variable that is explained (accounted for) by variation in the independent variables. The statistic is most meaningful for ordinary regression problems, but logistic regression has an analogous statistic called the generalized R2.
R2 can be defined in various ways. The most common formula for R2 is:

   R2 = 1 - SSE/SST = (SST - SSE) / SST

where SSE is the sum of squared residuals, SSE = Σ (y_i - ŷ_i)², ŷ_i being the model's prediction for observation i, and SST is the total sum of squares of the dependent variable, SST = Σ (y_i - ȳ)², ȳ being the mean of the dependent variable.
[Figure: thirty-three points with vertical lines connecting each point to the regression line; the lengths of those lines are the residuals used to calculate SSE.]

[Figure: the same points with vertical lines to the horizontal mean line; those distances are used to calculate the total sum of squares, SST.]
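As a minimal numerical illustration of the R2 formula above (Python, made-up actuals and predictions):

```python
import numpy as np

# Small made-up example: actuals and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])        # actual values
y_hat = np.array([2.8, 5.1, 7.3, 8.8])    # predicted values (hypothetical)

sse = np.sum((y - y_hat) ** 2)            # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
r2 = 1 - sse / sst
print(round(r2, 4))                       # -> 0.991
```

Here SSE = 0.18 and SST = 20, so R2 = 1 - 0.009 = 0.991: the model explains 99.1% of the variation in y.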
R2 for a model fit with least squares regression will always be between 0 and 1. However, when calculating R2 on a validation sample, R2 can be negative if SSE is larger than SST. You will notice that R2 is automatically reported in the .lst file with the output for PROC REG. A value called adjusted R2 is also reported, which penalizes the model for the number of variables it uses, since adding more variables always increases the model's R2 on the modeling data but can lead to poor models that overfit the data.

Scoring Datasets using a Multivariate Regression Model

Using PROC SCORE with the output from a PROC REG is a fast and convenient way to score datasets using the output coefficients from a multivariate regression. Below is an example where the model estimates from a PROC REG are output to a dataset, which is then used as the set of input coefficients to PROC SCORE on two separate occasions: once to score the original modeling dataset, and once to score a validation dataset. The VAR statement must be followed by the response variable and all predictors. The option TYPE=PARMS is used when the input to PROC SCORE is a set of regression coefficients from a PROC REG, as it is in this case. The option PREDICT is used so that PROC SCORE outputs actual predictions of the response variable; by default it outputs the negative residual for each observation, which is the model's predicted value minus the actual value.
proc reg data=<mod_set> outest=<mod_estimates>;
   model Inductis_Score = &predictor_list
         /selection=stepwise sle=0.25 sls=0.15 stb;
run;

proc score data=<mod_set> score=<mod_estimates> out=<mod_scores>
           type=parms predict;
   var Inductis_Score &predictor_list;
run;

proc score data=<val_set> score=<mod_estimates> out=<val_scores>
           type=parms predict;
   var Inductis_Score &predictor_list;
run;
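For intuition, the fit-then-score flow above can be mimicked in a few lines of Python (simulated data; this is a conceptual analogue of PROC REG with OUTEST= followed by PROC SCORE, not a replacement for it):

```python
import numpy as np

# Simulated modeling and validation sets (hypothetical data).
rng = np.random.default_rng(1)

def make_set(n):
    X = rng.normal(size=(n, 3))
    y = 1.0 + X @ np.array([0.5, -0.2, 0.8]) + rng.normal(0, 0.5, n)
    return X, y

X_mod, y_mod = make_set(400)    # modeling set
X_val, y_val = make_set(200)    # validation set

# Analogue of outest=: fit on the modeling set and keep the coefficients.
A_mod = np.column_stack([np.ones(len(y_mod)), X_mod])
coef = np.linalg.lstsq(A_mod, y_mod, rcond=None)[0]

# Analogue of proc score ... predict: apply the stored coefficients
# unchanged to both datasets.
pred_mod = A_mod @ coef
pred_val = np.column_stack([np.ones(len(y_val)), X_val]) @ coef
```

The key point mirrored here is that the validation set is scored with the modeling-set coefficients; the model is never re-fit on the validation data.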
To calculate R2 on the validation set, you can run a PROC CORR on the output dataset from the PROC SCORE above (run with the PREDICT option). You need only keep the predicted values (MODEL1) and the actual scores, to avoid unnecessary correlation calculations involving other variables. The Pearson correlation coefficient can be found in the output, and its square is the R2 of the model determined by the original PROC REG. The short piece of required code is shown below.
proc corr data=<val_scores> (keep=MODEL1 Inductis_Score);
run;
The output will look like the following. The Pearson coefficient of correlation between the two variables has a value of 0.52433 in this output; the square of that value is the R2 for this model.
The CORR Procedure

2 Variables: Inductis_Score MODEL1

                          Simple Statistics

Variable            N      Mean   Std Dev    Sum     Minimum    Maximum
Inductis_Score  30759   0.12239   0.20447   3765  3.05567E-6    1.00000
MODEL1          30759   0.12239   0.10721   3765    -0.13820    0.86518

          Pearson Correlation Coefficients, N = 30759
                 Prob > |r| under H0: Rho=0

                  Inductis_Score      MODEL1
Inductis_Score           1.00000     0.52433
                                      <.0001
MODEL1                   0.52433     1.00000
                          <.0001
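The squared-correlation calculation can be reproduced directly (Python, made-up actuals and predictions):

```python
import numpy as np

# Made-up binary actuals and model predictions for eight observations.
actual = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
pred   = np.array([0.8, 0.2, 0.6, 0.4, 0.9, 0.1, 0.5, 0.7])

r = np.corrcoef(actual, pred)[0, 1]   # Pearson correlation coefficient
r2 = r ** 2                           # validation R2 via squared correlation
```

Note that on a validation sample this squared correlation need not equal 1 - SSE/SST exactly; it is the convention used in the PROC CORR approach above.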
Logistic Regression
In a classification setting, outcome probabilities can be assigned to observations through the use of a logistic model, which transforms information about the binary dependent variable into an unbounded continuous variable (the log-odds) and estimates a regular multivariate model on that scale (see Allison's Logistic Regression Using the SAS System for more information on the theory of logistic regression). The functional form is:
   ln( p / (1 - p) ) = α + β1X1 + β2X2 + β3X3 + ... + βkXk

   p = e^(α + β1X1 + β2X2 + β3X3 + ... + βkXk) / (1 + e^(α + β1X1 + β2X2 + β3X3 + ... + βkXk))
Here p is the probability that an observation with variable values (X1, X2, ..., Xk) will take the value of the binary outcome designated as 1. As with the multivariate regression constraints, before running a logistic model in SAS you first need to check that your dataset is ready for logistic modeling. SAS will discard any observations with missing values for any of the variables the user specifies for consideration in the model, so general modeling data preparation and missing-value imputation must already have occurred. Also, PROC LOGISTIC will only accept numeric variables as predictors, unless categorical character variables are specified in the CLASS statement. You must also be careful with categoricals that are coded with numerals: the program will treat these as if they were continuous numerics unless they are specified in the CLASS statement. To run a logistic regression in SAS and estimate the coefficients of the model, one would run code similar to the below:
proc logistic data=<libname>.<modeling dataset> DESCENDING;
   model dependent_variable = var1 var2 var3 var4 var5
                              var6 var7 var8 var9;
run;
The DESCENDING option lets SAS know that the value of the dependent variable we wish to predict is 1, and not 0. Among the output in the .lst file you will see a table like the following:
The LOGISTIC Procedure

            Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter   DF   Estimate      Error  Chi-Square   Pr > ChiSq
Intercept    1     1.7033     0.0796    458.0953       <.0001
Var1         1   -0.00737   0.000757     94.7829       <.0001
Var2         1    -0.0139   0.000999    192.4487       <.0001
Var3         1   8.944E-6   9.388E-7     90.7652       <.0001
Var4         1   0.000150   0.000025     34.4986       <.0001
Var5         1   0.000122   0.000062      3.8263       0.0505
Var6         1   0.000033   0.000063      0.2797       0.5969
Var7         1   8.074E-6    2.98E-7    733.9139       <.0001
Var8         1   0.000320   0.000044     52.6921       <.0001
Var9         1   -0.00580   0.000599     93.7599       <.0001
The Estimate column gives the associated parameter estimate for each variable, next to its Standard Error. These estimates are the betas in the equation shown above. The Wald Chi-Square value is used to calculate the coefficient's p-value, a measure of whether it is significantly different from 0. With no other options selected, SAS will estimate the full model, meaning all variables will be included regardless of whether their coefficients are significantly different from 0. You will notice the high p-value associated with the coefficient for the variable labeled Var6.
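Once the betas are estimated, scoring an observation is just the functional form above applied by hand. A minimal sketch (Python, with hypothetical coefficient values, not the ones from this output):

```python
import math

# Hypothetical fitted coefficients and one observation's predictor values.
alpha = -1.5
betas = [0.8, -0.3]
x = [2.0, 1.0]

# Linear predictor (log-odds), then the logistic transform to a probability.
logit = alpha + sum(b * xi for b, xi in zip(betas, x))
p = math.exp(logit) / (1 + math.exp(logit))   # P(outcome = 1), always in (0, 1)

# Sanity check: the logit of p recovers the linear predictor.
assert abs(math.log(p / (1 - p)) - logit) < 1e-9
```

Whatever the linear predictor's value, p stays strictly between 0 and 1, which is what makes the logistic form suitable for probabilities.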
[Output table from the model re-estimated with Var6 removed: parameters Intercept, Var1-Var5, Var7-Var9; the estimate values were not recovered here.]
A higher percentage of concordant pairs is better, as it measures how consistently the model ranks randomly chosen events higher than randomly chosen non-events. Please see the Validation documentation on the MRL for additional information and other statistics involving these pairs.
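The concordance idea can be illustrated with a small sketch (Python, made-up scores): enumerate all (event, non-event) pairs and count those where the event received the higher score.

```python
import numpy as np

# Made-up outcomes and model scores for seven observations.
y     = np.array([1, 1, 1, 0, 0, 0, 0])
score = np.array([0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1])

ev, ne = score[y == 1], score[y == 0]          # event and non-event scores
pairs = [(e, n) for e in ev for n in ne]       # all (event, non-event) pairs
concordant = sum(e > n for e, n in pairs)      # event scored higher
pct_concordant = 100 * concordant / len(pairs)
print(pct_concordant)                          # 11 of 12 pairs concordant
```

Here one pair (the event scored 0.4 against the non-event scored 0.5) is discordant, giving roughly 91.7% concordant pairs.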
Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    9.1720     8        0.3280
A low chi-square with a high p-value indicates a good fit, since the expected frequencies are not far from the actual frequencies.
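For intuition about what this test computes, observations can be binned by predicted probability and observed event counts compared with expected counts in each bin. A sketch with simulated data follows (Python; the data, and the choice of 10 bins, are illustrative):

```python
import numpy as np

# Simulated, well-calibrated predictions: outcomes drawn with the
# same probabilities the "model" predicts.
rng = np.random.default_rng(2)
p = np.sort(rng.uniform(0.05, 0.95, 1000))     # predicted probabilities
y = (rng.uniform(size=1000) < p).astype(int)   # outcomes consistent with p

g = 10                                         # number of score bins
chi2 = 0.0
for grp in np.array_split(np.arange(1000), g):
    n_g = len(grp)
    obs = y[grp].sum()     # observed events in this bin
    exp = p[grp].sum()     # expected events in this bin
    chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
print(round(chi2, 2))      # compare against a chi-square with g-2 df
```

Because the outcomes here were generated from the predicted probabilities, the statistic should be modest, corresponding to a non-significant p-value, i.e. no evidence of poor fit.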
Validation/Results Presentation
What can I do to tell if my model is working?
Model validation is an important part of the modeling process. There are many statistics that can be analyzed, and many methods that can be used, to test whether the model you produced is rigorous and reliable. You can find a longer discussion in the Validation and Presentation documentation on the MRL. Here we will introduce a couple of fairly quick validation methods, one of which is used frequently here at Inductis and is often the way results are presented to the client.

One possible approach is to use the validation data (if such data exists) to re-estimate the final model that was originally determined on the modeling dataset. Usually we would do this using only the predictor variables that were included in the original model. Comparing coefficients from the modeling estimation and the validation estimation is a way of assessing the model's consistency. Are the parameter estimates close? Does each coefficient fall within the standard error interval described by the other model? This procedure is one step of a more advanced process called n-fold cross validation, which will not be covered in detail here.

Another validation method involves scoring both the modeling and validation datasets (using the coefficients from the original model) and producing a gains chart from each set of scores. A gains chart is obtained by rank ordering all observations by their model score and determining what percentage of events (1s on the dependent variable) is captured in any given percentage of the rank ordered list. A sample gains table is shown below, where the list was split into centiles. From this table, you can take all the ordered pairs (Centile, Cumulative Percent of Events) to graph the gains chart.
                                    Percent of                 Percent of
           Number of   Cumulative      Events    Number of     Non-Events
Centile       Events       Events    Captured   Non-Events       Captured   Difference
   1               4            4          2%            0             0%       0.0185
   2               4            8          4%            0             0%       0.0370
   3               5           13          6%            0             0%       0.0602
   4               3           16          7%            1             0%       0.0697
   5               4           20          9%            1             1%       0.0839
                            <Rows Deleted Here>
  91               0          214         99%            4            83%       0.1560
  92               0          214         99%            5            86%       0.1342
  93               1          215        100%            3            87%       0.1258
  94               0          215        100%            5            89%       0.1041
  95               0          215        100%            4            91%       0.0867
  96               0          215        100%            5            93%       0.0649
  97               0          215        100%            4            95%       0.0475
  98               1          216        100%            4            97%       0.0348
  99               0          216        100%            4            98%       0.0174
 100               0          216        100%            4           100%       0.0000
The gains chart from the modeling dataset alone can be used to assess the performance of the model. This assessment focuses on the fact that we want the curve to rise as quickly as possible above the diagonal line that would be obtained with a random ranking. Sometimes we will focus on the specific vertical distance between the two lines at particular points of interest along the curve.

The gains curves from the modeling and validation sets can also be compared with one another to determine how consistently the model performs on a validation sample. If the model's performance is much poorer on the validation sample, then there is cause for concern, and perhaps the model was overfitted.

Below is some code that will score both a modeling and a validation dataset, and then prepare those scores, using the bin macro, for the gains tables used to build gains charts. There are alternative ways to score a validation dataset, such as using PROC LOGISTIC with initial estimates set to the coefficient estimates from a modeling step and the maximum number of iterations set to 0, but the lgtscore2 macro is commonly used.
proc logistic data=<mod_data> DESCENDING outest=<coeff_estimates>;
   model dep_var = var1 var2 .... ;
   output out=<mod_scores> p=pred_mod;
run;

%bin(infile=<mod_scores>, outfile=<mod_scores_bin>, q=100,
     var=pred_mod, var_r=pred_mod_r, var_q=pred_mod_q);

proc summary nway missing data=<mod_scores_bin>;
   class pred_mod_q pred_mod_r;
   var remiss;
   output out=<mod_scores_LZ> sum=;
run;

%lgtscore2(indata=<val_data>, outdata=<val_scores>,
           regest=<coeff_estimates>, depvar=dep_var,
           depvarh=pred_val);

%bin(infile=<val_scores>, outfile=<val_scores_bin>, q=100,
     var=pred_val, var_r=pred_val_r, var_q=pred_val_q);

proc summary nway missing data=<val_scores_bin>;
   class pred_val_q pred_val_r;
   var remiss;
   output out=<val_scores_LZ> sum=;
run;
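The gains-table construction itself (rank by score, split into bins, cumulate captured events) is simple enough to sketch outside SAS. Below is an illustrative Python version, using simulated scores and 10 bins instead of 100 centiles to keep the output small:

```python
import numpy as np

# Simulated scores and outcomes: a higher score means a higher
# probability of being an event (hypothetical data).
rng = np.random.default_rng(3)
score = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < score).astype(int)

order = np.argsort(-score)            # rank observations, best scores first
y_ranked = y[order]
bins = np.array_split(y_ranked, 10)   # 10 equal-size bins of the ranked list

cum_events = np.cumsum([b.sum() for b in bins])
pct_captured = 100 * cum_events / y.sum()   # cumulative % of events captured
print(np.round(pct_captured, 1))
```

A useful model's curve rises faster than the 10%, 20%, ... diagonal of a random ranking; plotting (bin, pct_captured) gives the gains chart described above.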
Bibliography
General references:
- SAS OnlineDoc 9.1.3: http://support.sas.com/onlinedoc/913/docMainpage.jsp
- StatSoft Electronic Statistics Textbook: www.statsoftinc.com/textbook/stathome.html

Logistic regression:
- Allison, Paul D. (2001). Logistic Regression Using the SAS System.
- http://userwww.sfsu.edu/~efc/classes/biol710/logistic/logisticreg.htm
- http://www2.chass.ncsu.edu/garson/pa765/logistic.htm