The purpose of this documentation is to guide people who are new to Inductis modeling approaches through all the processes in a modeling solution, from data collection/preparation, through CART, linear and logistic modeling, validation, and result presentation. While every problem we encounter will have a different solution and different ways of obtaining that solution, this documentation walks the user through a typical total solution, from start to end, of a representative project. We will address the problem of predicting the outcome of a binary event (such as bankruptcy, response to an offer, etc.) as well as that of predicting a continuous variable (such as spend amount).

The level of documentation around each stage is appropriate for someone with a day or two's experience with that stage of the process. By no means, therefore, is this documentation meant to present a comprehensive method for any specific Inductis modeling approach. What it does do is present the basic underlying methods that would be followed, and expanded upon, in an actual live solution. It does this by allowing the user to start at the beginning, seeing how each stage of the process contributes to the overall problem, and how the stages interact and flow together while progressing towards a final solution and its presentation.

The focus will be on the execution of each step of the process and the methods used to verify its integrity, not necessarily on obtaining the best model. Advanced problem-solving strategies and modeling techniques will not be addressed here; additional documentation is forthcoming that will address more advanced techniques in each area of the process independently.
Data Collection
This is often a painstaking and lengthy part of a client engagement. Since we often rely on the client to obtain the data for us, this can be a time consuming process. It is therefore more important that we understand the problem at hand and address any related data issues as soon as possible in order to avoid lengthy data request repetitions. For the purpose of this exercise, the data will be given to you at the beginning of the process. For some best practices in the area of data collection, please see the DATA REQUEST documentation on the MRL in the Analytical Methodology/Methods/Internal/Other subfolder.
data acct_data;
   * acct_number is read as character to preserve leading zeroes ;
   input @1 cust_number 7. @9 acct_number $9. @19 acct_balance 7.2;
   datalines;
1234567 002816383 2245.65
2345678 442164518 35.89
2164518 209273502 401.23
1234567 312345678 421.34
2164518 097386283 0.00
2164518 476429482 45.98
;
run;
proc sort data = cust_demo_data nodupkey;
   by cust_number;
run;

proc sort data = acct_data nodupkey;
   by cust_number acct_number;
run;

data cust_acct_merged;
   merge cust_demo_data (in=in1)
         acct_data      (in=in2);
   by cust_number;
   merge_check = compress(in1 !! in2);
run;

proc freq data = cust_acct_merged;
   table merge_check / list missing;
   title "Check how the customer and account level files merged";
run;
The IN= data set option names a new variable whose value indicates whether that input data set contributed data to the current observation: within the DATA step, the value of the variable is 1 if the data set contributed to the current observation, and 0 otherwise. In the code above, the values of the IN= variables are concatenated into a single variable, merge_check. The IN= variables in1 and in2 are dropped at the end of the data step, so if their values need to be saved they must be explicitly stored in another variable, as is done with merge_check above. A PROC FREQ is then run on the merge_check variable so the user can analyze how the BY variables matched across the merged data sets. A simple spot check of a handful of records pre- and post-merge, to determine whether the merge had the desired results, is also a very useful quality control check.

If required, additional work can be done to transpose this file so that there is a single record per customer, with account information recorded side by side. Alternatively, a roll-up approach can be taken. An example where all account balances are rolled up to the customer level is demonstrated below with both a PROC SUMMARY approach and a data step approach. These are just two of the approaches which can be used to perform such a task.
proc summary nway data = cust_acct_merged missing;
   class cust_number;
   var acct_balance;
   output out = cust_rolled_up sum=;
run;

proc print data = cust_rolled_up;
   title "Data rolled up with proc summary";
run;

data manual_roll_up;
   set cust_acct_merged;
   by cust_number acct_number;
   retain acct_balance_sum;
   if (first.cust_number) then acct_balance_sum = acct_balance;
   else acct_balance_sum + acct_balance;
   if (last.cust_number);
   drop acct_number acct_balance merge_check;
run;

proc print data = manual_roll_up;
   title "Data rolled up with manual data step";
run;
***Output*******;
Data rolled up with proc summary

          cust_                        acct_
   Obs    number    _TYPE_   _FREQ_    balance
    1     1234567      1        2      2666.99
    2     2164518      1        3       447.21
    3     2345678      1        1        35.89
Data rolled up with manual data step

                       acct_
          cust_        balance_
   Obs    number       sum
    1     1234567      2666.99
    2     2164518       447.21
    3     2345678        35.89
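The same customer-level roll-up can be sketched outside of SAS. The following Python snippet (illustrative only, using the sample account data from above) accumulates account balances per customer, mirroring what the PROC SUMMARY and the retain-based data step both produce:

```python
from collections import defaultdict

# Roll account balances up to the customer level -- the same logic as the
# PROC SUMMARY and retain-based data step above, sketched in Python.
accounts = [
    (1234567, "002816383", 2245.65),
    (2345678, "442164518", 35.89),
    (2164518, "209273502", 401.23),
    (1234567, "312345678", 421.34),
    (2164518, "097386283", 0.00),
    (2164518, "476429482", 45.98),
]

balance_by_cust = defaultdict(float)
for cust_number, acct_number, acct_balance in accounts:
    balance_by_cust[cust_number] += acct_balance

print({c: round(b, 2) for c, b in sorted(balance_by_cust.items())})
# {1234567: 2666.99, 2164518: 447.21, 2345678: 35.89}
```

Note that the totals match the two SAS outputs above, which is exactly the kind of cross-check worth doing after any roll-up.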
increases in data size. (But remember that an EDD on a sample will not necessarily give you the true maximum and minimum values of the original population.)
libname macro '<Project Folder>\10.Toolkit\05.TL_EDD_Code';
filename mcrsrc catalog 'macro.std_code_lib_db2';
%include mcrsrc(edd_gen2);
%include mcrsrc(edd_num);
%include mcrsrc(edd_char);

%edd (libname=<libname>, dsname=<dataset>,
      edd_out_loc=<EDD Path>\<EDD Name>, NUM_UNIQ=Y);
The result is an Excel spreadsheet which gives you information on the variables in your dataset, including:
- Variable name
- Format
- Number of unique values
- Number of missing values
- Distribution (proc means output for numeric variables; highest and lowest frequency categories for a categorical variable)
  o Numeric variables: standard numerical distribution including the mean, min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
  o Categorical variables: top 1 gives you the level which occurs the most, top 2 the second most, etc.; bot 1 is the least frequently occurring level, bot 2 the second least frequently occurring, etc. The number following the double colon gives you the exact number of times that level occurs. If nothing precedes the double colon, then that is the count of the number of missings.
An example is included below:
        var_                     mean_or       p99_or
Obs     length   n_pos   nmiss   _top1         _bot2
 1        8        0       0     0.12648       0.769
 2        8        8       0     0.06473       0.285
 3        8       16       0     714.42876     756
 4        8       24       0     656.30054     755
 5        8       32       0     100.50368     1710.922
 6        8       40       0     305.97356     2552.431
 7        8       48       0     0.11786       1.073
 8        1       56       0     2::524952     7::5468

(Intermediate distribution columns have been deleted from this excerpt.)
Basic things that should be looked for when first assessing an EDD include:
- Are data formats correct? Are numerical variables stored as text? Do date variables need to be converted in order to be useful?
- Which variables have missing values? Outliers?
- Do any variables exhibit invalid values (many 9999999, 101010101 values, etc.)?
  o If you have a data dictionary provided by the client, there may be information on invalid values, so this would be the first thing to check.
  o Invalid values will often be obvious in the Inductis EDD because of the way it is set up, and the way invalid values are defined. They may occur often enough that they take up a large swath of the distribution columns. They are often large or small enough that they easily present themselves at either the min or max end of the distribution. Invalid value flags generally overwrite other data values and are usually defined to stand out, not blend in. For example, they are often made to take much larger values, such as 99999999999, than any normal value of that variable.
- Are any distributions clearly out of line with expectations?
Formatting Issues
If there are numeric variables which have been stored as text strings, you can convert them to numeric variables using code similar to below. One can define a separate variable and keep both the character and numeric representations, or replace the character representation with a numeric. Take care that you do not truncate any numbers in this conversion by paying attention to the variable length, which is given to you in the EDD. In general, it is wise to only restrict the length of a numeric variable if saving space in your dataset is very important, as the risks of truncating some significant digits of some value of the variable are usually not worth saving a relatively small amount of space. In the below example the length of the HHNPERSN numeric conversion variable is restricted to three (the minimum length of a numeric variable) to illustrate how to define the length of a new variable if needed. In this case the analyst would have checked that none of the values exceed the maximum size allowed for an integer of length three.
data char_to_numeric (rename = (HHNPERSN_Numeric = HHNPERSN));
   length HHNPERSN_Numeric 3;
   format HHNPERSN_Numeric 2.;
   HHNPERSN = "3";
   HHNPERSN_Numeric = input(HHNPERSN, 2.);
   drop HHNPERSN;
run;
A table showing the maximum size of integers allowed with a specific length is also shown below. Hence, if for some reason you needed to store a 9-digit account number as a numeric, it would have to have a length of at least 6 to ensure no truncation takes place.
Length in Bytes    Largest Integer Represented Exactly    Exponential Notation
      3                            8,192                         2^13
      4                        2,097,152                         2^21
      5                      536,870,912                         2^29
      6                  137,438,953,472                         2^37
      7               35,184,372,088,832                         2^45
      8            9,007,199,254,740,992                         2^53
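The pattern in the table is regular: each extra byte of length adds 8 bits of mantissa, so the largest integer represented exactly at length L works out to 2^(8L - 11). A short Python sketch (the formula is inferred from the table values, not from SAS documentation):

```python
# Largest integer a SAS numeric of length L bytes can represent exactly:
# each extra byte adds 8 bits of mantissa, giving 2 ** (8*L - 11).
# (Formula inferred from the table above; e.g. L=3 -> 8,192 = 2^13.)
def largest_exact_int(length_bytes):
    return 2 ** (8 * length_bytes - 11)

for L in range(3, 9):
    print(L, largest_exact_int(L))
```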
As a final note on the length of numeric variables, if you try to merge two datasets by a common numeric variable, but whose lengths were defined differently in each dataset, you may see a warning in the log file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may cause unexpected results.
It is generally not wise to overlook log file warnings unless you have a very good reason to. A short data step redefining the length of the shorter variable in one of the datasets before merging will suffice to get rid of the warning, and could reveal important data problems, such as information being truncated from some values of the BY variables in the data set with the shorter length. A case where the analyst may wish to change a numeric variable to character is discussed below. In this example, the analyst has decided he/she would prefer to treat the variable Reward_Level as a categorical variable. To avoid any later confusion, it is changed to a character variable up front, so it cannot be used as a numeric later, and no modeling insight can be mistakenly derived from the implicit scale of the numeric. The code is shown below, which closely resembles the character-to-numeric version illustrated above.
data numeric_to_char (rename = (Reward_Level_Char = Reward_Level));
   length Reward_Level_Char $1;
   format Reward_Level_Char $1.;
   Reward_Level = 3;
   Reward_Level_Char = put(Reward_Level, 1.);
   drop Reward_Level;
run;
Formatting issues may require other action by the analyst. One example is variable creation in order to take advantage of variables in date format. For example, we may want to take an application date/time stamp variable and define a variable application_month, based only on the month, for grouping or other purposes. We may also want to change a date format into a numeric variable in order to take advantage of the information and be able to use it in the modeling process. For example, the code below illustrates how a numeric year variable can be created out of a character date value using the substr and input functions.
data date_dat;
   format char_format_start_date $10.;
   char_format_start_date = "FEB2003";
   start_date_year = input(substr(char_format_start_date, 4, 4), 4.);
run;
There are many varieties of date functions and formats, and ways to manipulate them. A primer on such methods can be found in the Dates, Times, and Intervals chapter of the SAS online documentation - http://v8doc.sas.com/sashtml/ .

Zip Codes and Other Problems with Leading Zeroes

Other formatting problems can occur when switching a variable from numeric to character or vice versa. For example, account numbers and especially zip codes can contain leading zeroes when stored in character format. Problems can occur when one dataset has the data in character format and the other in numeric format: if the variable is translated from numeric to character in one dataset, some values will not match correctly if the leading-zero issue is not taken into account. An example is listed below where numeric zip code and account variables are translated into character variables with the leading zeroes using the z format. There is also an example where a character variable without leading zeroes is translated into a character variable with the leading zeroes:
data leading_zero_data;
   Zip_Numeric   = 7078;
   Zip_Character = put(Zip_Numeric, z5.);

   Acct_Numeric   = 12345678;
   Acct_Character = put(Acct_Numeric, z9.);

   * Convert a 7-digit character variable to a 9-digit character
     variable with leading zeroes ******;
   Seven_Digit_Acct_Character = "2345678";
   Nine_Digit_Acct_Character  = put(input(Seven_Digit_Acct_Character, 9.), z9.);
run;

proc contents data = leading_zero_data; run;
proc print    data = leading_zero_data; run;
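For analysts checking such keys outside of SAS, the same zero-padding can be illustrated in Python, where zfill plays the role of the z format (illustrative sketch only):

```python
# Python analogue of the SAS z format: zero-pad on the left so that
# character keys match across datasets.
zip_numeric = 7078
zip_character = str(zip_numeric).zfill(5)       # "07078"

seven_digit_acct = "2345678"
nine_digit_acct = seven_digit_acct.zfill(9)     # "002345678"

print(zip_character, nine_digit_acct)
```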
Capping / Flooring
Variable capping and/or flooring can be useful with certain modeling techniques, particularly regressions, where coefficient estimation can be affected by data outliers. What is an outlier? The definition of an outlier really depends on the specific dataset and the objective of the modeling. Generally, an outlier is a data value that is extreme when compared with other values in a data set or compared to the distribution of values. For example, if the maximum value is much further from the 99th percentile value than the distance between the 98th and 99th percentiles, then that maximum might be considered an outlier. One
concern in including outliers in the modeling dataset is the high leverage or influence they would have over model coefficients. The standard deviation of the variable about its mean can also be unduly affected. At Inductis, we generally only alter a variable's distribution above the 99th percentile and/or below the 1st percentile, and only if it appears the values in these ranges vary substantially from the rest of the distribution. Some judgment is required here. Note that many of our variables do not require flooring, as they have natural floors, often 0, where the data below the 1st percentile are clearly not outliers. An exception to the 99th/1st percentile capping/flooring rule of thumb is when there are a large number of invalid values which take up much more than a single population percentile. These need to be treated in some fashion (deleted or replaced) before assessing the outlier question. Another exception is the presence of large numbers of zero values (e.g. for account spending). These have the potential to add extreme skew or bimodality to a variable that could bias model estimation. When capping/flooring is required, the analyst treats the outlying data either by truncating the distribution (excluding these records from further analyses and modeling) or by reining in the data, reassigning the values so that their distance from the percentile in question is significantly reduced, often with a root (fractional power) transformation. There is an .xls sheet with some code that facilitates the writing of flooring and capping code in SAS. The formulas and actual view of these cells in the Excel sheet are shown below:
[Excel screenshots: Capping and Flooring formula cells]
Some example code, which was written using the .xls sheet, is shown below, where the data below the 1st percentile is reined in by a cube-root factor and the data above the 99th percentile by a square-root factor.
** cap/flooring **********;

** flooring *******;
if (Var4 <= -0.00743) then Var4 = .038077*Var4**(1/3); else Var4 = Var4;
if (Var5 <= -0.01133) then Var5 = .050445*Var5**(1/3); else Var5 = Var5;
if (Var6 <= -0.67573) then Var6 = .770044*Var6**(1/3); else Var6 = Var6;

** capping ******;
if (Var1 >= 1875)    then Var1 = 43.301270*Var1**(1/2); else Var1 = Var1;
if (Var2 >= 2343.75) then Var2 = 48.412292*Var2**(1/2); else Var2 = Var2;
if (Var3 >= 0.13152) then Var3 = .362657*Var3**(1/2); else Var3 = Var3;
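The constants in the generated code are chosen so the transform is continuous at the threshold: for a square-root cap at the 99th percentile p99, the multiplier is sqrt(p99) (e.g. 43.301270 = sqrt(1875)). A minimal Python sketch of the idea (function and variable names are mine, for illustration):

```python
import math

def cap_sqrt(x, p99):
    """Rein in values above the 99th percentile with a square-root
    transform; c = sqrt(p99) keeps the mapping continuous at p99,
    matching the constants in the generated SAS code above."""
    if x >= p99:
        return math.sqrt(p99) * math.sqrt(x)
    return x

print(round(cap_sqrt(1875.0, 1875.0), 6))   # 1875.0 -- the threshold itself is unchanged
print(round(cap_sqrt(7500.0, 1875.0), 6))   # 3750.0 -- 4x the cap is pulled in to only 2x
```

The appeal of this over hard truncation is that the ordering of values above the percentile is preserved while their leverage is sharply reduced.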
Modeling
Profiling and Variable Creation
This part of the modeling process is often the key to obtaining the best model. Mining the data for trends and multivariate relationships, and deriving intelligent variables from this information, can add a lot to the final modeling solution. Sometimes tools such as CART and/or MARS do a very good job of identifying such relationships and give us the blueprints for deriving additional variables for our analysis. Other times the analyst may need to mine the data using more hands-on approaches with tools available in Excel, such as pivot tables. Bivariate plots and proc univariate are also useful techniques in profiling and variable creation (please see the document IND032_20050531_EDA_KM2 on the MRL in the Analytical Methodology/Methods/Internal/Other subfolder). These topics fall into the category of obtaining the best model, and as that is not the focus of this document, the techniques will not be addressed here. You will see, however, how CART can discover some variable relationships and contribute useful information to the solution.
The analyst should check that the modeling and validation sets have the same general characteristics, as they should if there is plenty of data and the split was in fact random. This can be assessed by carefully examining and comparing separate EDD outputs on the two datasets. The analyst should specifically check that the distribution of the dependent variable is similar across the sets. For a binary dependent variable, there is a more advanced macro available, gbsplit, which will split your data into modeling and validation data sets while ensuring the event rate stays almost identical in each and the goal of random splitting is achieved. It is also important to check for similar distributions of the most important independent variables. Note that the splitting and the profiling/variable creation steps can be done in either order: when data is abundant, the analyst may often prefer to perform profiling and variable creation on the modeling set only, allowing the validation set to be completely independent from the modeling procedure.
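The event-rate-preserving split that gbsplit performs can be sketched as a stratified random split: shuffle events and non-events separately, then take the same fraction of each. A minimal Python illustration (not the actual macro; all names are mine):

```python
import random

def stratified_split(rows, is_event, frac=0.7, seed=42):
    """Split rows into modeling/validation sets so the event rate is
    (almost) identical in each -- the idea behind the gbsplit macro."""
    rng = random.Random(seed)
    model, valid = [], []
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    for group in (events, non_events):
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = int(round(frac * len(shuffled)))
        model.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return model, valid

# 10% event rate overall; both halves preserve it exactly here
rows = [{"id": i, "bad": i % 10 == 0} for i in range(1000)]
model, valid = stratified_split(rows, lambda r: r["bad"])
print(len(model), len(valid))   # 700 300
print(sum(r["bad"] for r in model) / len(model),
      sum(r["bad"] for r in valid) / len(valid))   # 0.1 0.1
```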
CART Modeling
When the problem at hand involves classifying observations into distinct levels of the dependent variable, classification techniques and tools, such as CART can be used effectively. The simplest situation is a modeling problem where the dependent variable is binary. Other modeling approaches such as logistic and probit methods, also tackle the binary dependent variable problem, but CART is designed to search for meaningful
relationships between independent variables with respect to the dependent variable, and is better suited to find more complex interactions in the variables with little user input. Discriminant Analysis is a technique which competes with CART. (See http://www.statsoft.com/textbook/stathome.html for an introduction to Discriminant Analysis)
So what is CART?
- Produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively
- Trees are formed by a collection of rules based on values of certain variables in the modeling data set
  o Rules are selected based on how well splits on variable values can differentiate observations with respect to the dependent variable
  o Once a rule is selected and splits a node into two, the same logic is applied to each child node (i.e. it is a recursive procedure)
  o Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met
- Each branch of the tree ends in a terminal node
  o Each observation falls into one and exactly one terminal node
  o Each terminal node is uniquely defined by a set of rules
  o For classification trees, terminal nodes are classified along one level of the dependent variable, by comparing the concentration of dependent variable levels in the terminal node to that of the entire population (this assumes CART default settings; alteration of the PRIORS option can affect node classification. Please see the tutorial mentioned below for more information)
[Tree diagram excerpt: Node 4 (N = 2774) splits on BUREAU_VAR4 <= 707.500; Terminal Node 1 (N = 746) is reached via ACCOUNT_USAGE <= 21.875]
There is a tutorial for CART which can be accessed through the CART section of the Windows program menu on the servers. It will show you how to get going with basics such as opening your modeling dataset, selecting the target (dependent variable) and the predictors, and running the model. It will also show you how to select specific options for running the CART tree. Some of those options are described briefly below:
- Select the splitting rule from the Method tab
- Select how the results are tested from the Testing tab
- Specify the minimum number of cases at the parent or terminal nodes from the Advanced tab
- Specify a penalty on each variable from the Penalty tab
- Specify the cost of misclassification from the Cost tab
- Specify the desired reference population frequencies (Priors)
For now we will be content with selecting the dependent and independent variables, what type of tree (classification or regression), running the model accepting default options, and interpreting the results. Once a model is run, a Tree Navigator will pop-up in the window. A Tree Navigator displays a high-level picture of the model built.
We can use the curve under the tree to prune the tree by reducing the number of nodes and making the tree simpler:
The relative cost is the misclassification due to the tree in question. This tends to decrease as the number of nodes in a tree increases, but can actually begin to go up for a tree with too many nodes. Depending on the goal of the analysis, a simpler tree may or may not be more suitable than a more complex one; as a rule of thumb, simpler is generally better. One option that should be set if you have a large number of observations is the minimum size of parent and terminal nodes. These put a lower bound on the size of nodes that can split and the size of the nodes they can split into; without them, a first-run tree can turn up too many nodes to be useful. While the user can always prune the tree back by clicking on the curve under the navigator tree, even this can be difficult if there are many points on the curve (example below):
By right-clicking a node in navigator view, and selecting rules, CART will give you C code which can be applied to drill down to that node. If that node is a terminal node, it can be used to determine if an observation belongs to that terminal node. An example for a hypothetical terminal node 7 follows:
/* Terminal Node 7 */
if (
   var1  > 2050    &&
   var6  <= 712.5  &&
   var13 > 0.3735  &&
   var5  <= 2.16
) {
   terminalNode = 7;
   class = 0;
}
Other things that can be discovered in the navigator file include:
- The tree diagram graphically portrays the splitting power of each predictor variable, the complexity of the model, its classification accuracy, and a quick definition of the interesting segments.
- The curve under the tree diagram reports on the overall accuracy of the tree and tells us how we might trade off accuracy for a smaller, simpler tree.
- By drilling down to each node, we can get the complete tabulation of the target variable, and find the true proportions of events/non-events in each node.
- By clicking on the rules, we get the rules, written in C code, that define the node. This code can be used to create new variables for use in subsequent regression modeling. With a little editing, it can be converted into SAS code.
- One of the most important outputs is the Summary Reports tab, which includes tabs on the list of the independent variables ranked by their relative importance, the gains chart, and terminal node information.
  o Click the Summary Reports tab, and then Variable Importance, to see the variable importance list. This information provides a basis for us to exclude irrelevant or less important variables and therefore reduce the dimensionality and speed up subsequent processes. It is also useful for identifying the most promising independent variables for use in another modeling tool such as logistic regression.
  o The Gains Chart: the performance of the tree is assessed by the gains chart and table, the prediction success table, and the misclassification tables.
- The Score function can be used to score another dataset
  o If there is a separate validation file, it can be scored to see if the composition of the nodes is similar to that in the modeling file
One way to counter this is to start submitting command files to CART in batch, similar to how you would do it for a SAS job, in lieu of using all the functionality of the GUI. This way you can save versions of your code, add comments, and you will have a history of the tweaks and the results at each stage to refer back to. It takes a little more effort up front, but could possibly save lots of time in the end. An example of a CART command file is listed below. Once a command file has been submitted and run, the corresponding navigator file can be opened and analyzed. If you are not sure of how to add the code for a certain option, you can use the GUI to select certain options, run the model again, and then select Open Command Log under the View menu and you can see what code CART generates to run the model with the tweaks you have just selected. You can subsequently add the necessary code to your command file, save as a different version, and resubmit. For more on the topic of command files seek the help of someone more experienced in CART who has used them.
USE "<Modeling Set Location>\<modeling set>"
GROVE "<Navigator Output Location>\<Grove Name>.GRV"
LIMIT ATOM = 400
LIMIT MINCHILD = 200
MODEL dep_var
KEEP dep_var, var1, var2, var3, var5, var6, var7, var9, var10,
     var11, var12, var13, var14, char_var1$, char_var2$, char_var3$
CATEGORY acct_bad, char_var1$, char_var2$, char_var3$
AUXILIARY
PRIORS EQUAL
ERROR CROSS = 10
METHOD GINI POWER = 0.0000
PENALTY
BUILD

REM 01. First run through with exploratory method
REM 02. Trying misclassify cost and vfold validation
REM 03. Removing Inductis Score
REM 04. Removing misclassification costs
REM 05. Remove var4 and var8
Multivariate Regression
Regression analysis is the analysis of the relationship between one variable and a set of other variables. The relationship is expressed as an equation that predicts a response variable (also called the dependent variable) from a function of predictor variables (also called independent variables) and parameters. The parameters are adjusted so that a measure of fit is optimized. A regression equation with two predictors might look like this:
y = β0 + β1X1 + β2X2 + ε
where β0 is the intercept and ε is the residual error (the portion of variability in the response variable that is not explained by the model). Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions. There are many possible variations on the basic regression equation, driven primarily by the nature of the predictors and response variable and the general relationship between them. For example, the following equations show interaction terms and a polynomial relationship:
y = β0 + β1X1 + β2X2 + β3X1X2 + ε
y = β0 + β1X1 + β2X1² + β3X1³ + ε
Multivariate regression is generally used when the response variable is continuous with an unbounded range. Often we use it for modeling variables that are strictly positive (hence bounded at zero). Sometimes, we need to make special adjustments to the model if the distribution is significantly censored at zero (i.e. lots of zero spend amounts). While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, many of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques better suited for this type of analysis. If the dependent variable is binary, one of those superior methods is logistic regression which is discussed later in this document. Before running a regression model in SAS you first need to check that your dataset is ready for regression modeling. SAS will discard any observations with missing values for any of the variables the user specifies to be considered for the model, so it is necessary that general modeling data preparation and missing imputation has occurred. Also, PROC REG will only accept numeric variables as predictors. If you have a categorical variable, then dummies need to be created for all but one level of the variable. The level that is left out is assumed to be the baseline for that variable, and the coefficients of all other
levels' dummies are measured relative to that baseline. For example, if the categorical variable cat_var1 takes values A, B, and C, we would define two dummy variables for the multivariate regression as follows, and use them in lieu of the original variable:
cat_var1_B = (cat_var1 = "B"); cat_var1_C = (cat_var1 = "C");
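The same baseline-omitted dummy coding can be sketched in Python (illustrative only; function and key names are mine):

```python
def dummy_code(value, levels, baseline):
    """One 0/1 dummy per non-baseline level; the baseline level is
    represented by all dummies being 0 -- the same scheme as the SAS
    boolean expressions above."""
    return {f"cat_var1_{lvl}": int(value == lvl)
            for lvl in levels if lvl != baseline}

print(dummy_code("B", ["A", "B", "C"], baseline="A"))
# {'cat_var1_B': 1, 'cat_var1_C': 0}
print(dummy_code("A", ["A", "B", "C"], baseline="A"))
# {'cat_var1_B': 0, 'cat_var1_C': 0}
```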
Note that in more recent versions of SAS it is not necessary to create your own dummies, but we introduce the technique here as it is still necessary for the version of SAS currently running on most Inductis servers. To run a multivariate regression in SAS, and estimate the coefficients of the model, one would run code similar to the below:
proc reg data = <libname>.<modeling dataset>;
   model dependent_variable = var1 var2 var3 var4 var5
                              var6 var7 var8 var9;
run;
Among the output in the .lst file you will see a table like the following:
The REG Procedure
Model: MODEL1
Dependent Variable: Inductis_Score

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              9                                     5.28    <.0001
Error             71
Corrected Total   80

(Sum of Squares and Mean Square values omitted from this excerpt.)

Root MSE          1.55169    R-Square    0.6205
Dependent Mean    8.93724    Adj R-Sq    0.5029
Coeff Var        17.36207
Parameter Estimates

                   Parameter      Standard
Variable     DF    Estimate       Error         t Value   Pr > |t|
Intercept     1     0.22626       0.01531        14.78     <.0001
var1          1     0.19584       0.01856        10.55     <.0001
var2          1    -0.00014678    0.00002044     -7.18     <.0001
var3          1     0.00002783    0.00000660      4.22     <.0001
var4          1    -0.00115       0.00051301     -2.24     0.0250
var5          1    -0.00001987    0.00000549     -3.62     0.0003
var6          1     0.00012041    0.00008330      1.45     0.1484
var7          1     0.00186       0.00073916      2.51     0.0120
var8          1     0.00605       0.00137         4.41     <.0001
var9          1    -0.00915       0.00234        -3.92     <.0001
The Parameter Estimate column gives the associated parameter estimate for each variable, next to its Standard Error. The t value is used to calculate the coefficient's p-value, a measure of whether it is significantly different from 0. With no other options selected, SAS will estimate the full model, meaning all variables will be included, regardless of whether their coefficients are significantly different from 0. For example, var6 above would not normally be considered significant given the typical significance level of 0.05 used in the variable selection techniques, which are discussed below.
- Forward: starts with no variables and, at each step, adds the one which, when estimated with the variables already in the current model, has the lowest p-value, provided it is below a certain threshold
  o The threshold can be changed via the option sle (significance level for entry); the default for forward selection is 0.50
  o Options to use: /selection=forward sle=0.15;
- Backward elimination: starts with all possible predictors and removes the one with the highest p-value at each step, as long as it is above a certain threshold
  o The threshold can be altered via the option sls (significance level for staying); the default for backward elimination is 0.10
  o Options to use: /selection=backward sls=0.2;
- Stepwise: a combination of forward and backward elimination; adds variables one at a time like forward, but at each step reassesses every variable in the model to ensure it still meets the sls criterion
  o Options to use: /selection=stepwise sle=0.15 sls=0.2;
After using one of these methods, the user will have a feel for which variables are important in the model. Judging by coefficient size is inappropriate, as size is affected by variable scale and is not directly correlated with variable importance. The best option for the analyst to assess variable importance within the model is to include the option STB in the model statement. The output will then include a Standardized Estimate column, as shown below. Variable importance within that particular model can then be assessed by comparing the absolute magnitudes of the standardized coefficients. For example, in the output below var4 is the most important based on the Standardized Estimates, although that would not have been clear from the raw Parameter Estimates.
The REG Procedure
Model: MODEL1
Dependent Variable: Inductis_Score

                      Analysis of Variance

                              Sum of       Mean
Source             DF        Squares     Square    F Value    Pr > F
Model               9                                  5.28    <.0001
Error              71
Corrected Total

Root MSE         1.55169    R-Square    0.6205
Dependent Mean   8.93724    Adj R-Sq    0.5029
Coeff Var       17.36207

                      Parameter Estimates

                  Parameter     Standard                       Standardized
Variable    DF     Estimate        Error   t Value   Pr > |t|      Estimate
Intercept    1      0.22626      0.01531     14.78     <.0001             0
var1         1      0.19584      0.01856     10.55     <.0001       0.05410
var2         1  -0.00014678   0.00002044     -7.18     <.0001      -0.04466
var3         1   0.00002783   0.00000660      4.22     <.0001       0.02529
var4         1     -0.00115   0.00051301     -2.24     0.0250      -0.11584
var5         1  -0.00001987   0.00000549     -3.62     0.0003      -0.01924
var6         1   0.00012041   0.00008330      1.45     0.1484       0.07471
var7         1      0.00186   0.00073916      2.51     0.0120       0.01367
var8         1      0.00605      0.00137      4.41     <.0001       0.02281
var9         1     -0.00915      0.00234     -3.92     <.0001      -0.02569
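As an illustration of what the STB option computes, a standardized coefficient rescales each raw estimate by the ratio of the predictor's standard deviation to the dependent variable's standard deviation. The sketch below (Python, simulated data, hypothetical variable names) shows how a predictor with a tiny raw coefficient can still carry real importance once standardized:

```python
import numpy as np

# Illustrative data: two predictors on very different scales.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 1, n)       # small-scale predictor
x2 = rng.normal(0, 1000, n)    # large-scale predictor
y = 2.0 * x1 + 0.0005 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # raw OLS estimates

# Standardized estimate: beta_j * std(x_j) / std(y),
# which is what the STB option reports for each predictor.
stb = beta[1:] * X[:, 1:].std(axis=0, ddof=1) / y.std(ddof=1)

print("raw betas:    ", beta[1:])  # not comparable across scales
print("standardized: ", stb)       # comparable measures of importance
```

Here x2's raw coefficient is tiny only because of its scale; the standardized estimates put both predictors on a common footing.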
Multivariate Regression and R2

The R2 statistic is a measure of how well the model predicts the dependent variable based on the values of the independent variables. It measures the predictive power of the model: the proportion of the total variation in the dependent variable that is explained (accounted for) by variation in the independent variables. The statistic is most meaningful for ordinary regression problems, but logistic regression has an analogous statistic called the generalized R2.
R2 can be defined in various ways. The most common formula for R2 is:

   R2 = 1 - SSE/SST = (SST - SSE) / SST

where SSE is the sum of squared residuals, SSE = Σ (y_i - ŷ_i)², ŷ_i being the model's prediction for observation i, and SST is the total sum of squares of the dependent variable, SST = Σ (y_i - ȳ)², ȳ being the mean of the dependent variable.
[Figure: thirty-three points with vertical lines connecting each point to the regression line; the lengths of those lines are the residuals used to calculate SSE.]

[Figure: the same points with vertical lines to the horizontal mean line; those distances are used to calculate the total sum of squares, SST.]
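As a minimal numerical illustration of the R2 formula above (Python, made-up actuals and predictions):

```python
import numpy as np

# Small made-up example: actuals and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])        # actual values
y_hat = np.array([2.8, 5.1, 7.3, 8.8])    # predicted values (hypothetical)

sse = np.sum((y - y_hat) ** 2)            # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
r2 = 1 - sse / sst
print(round(r2, 4))                       # -> 0.991
```

Here SSE = 0.18 and SST = 20, so R2 = 1 - 0.009 = 0.991: the model explains 99.1% of the variation in y.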
R2 for a model fit with least squares regression will always be between 0 and 1. However, when calculating R2 on a validation sample, R2 can be negative if SSE is larger than SST. You will notice that R2 is automatically reported in the .lst file with the output for PROC REG. A value called adjusted R2 is also reported, which penalizes the model for the number of variables it uses, since adding more variables always increases the model's R2 on the modeling data but can lead to poor models that overfit the data.

Scoring Datasets using a Multivariate Regression Model

Using PROC SCORE with the output from a PROC REG is a fast and convenient way to score datasets using the output coefficients from a multivariate regression. Below is an example where the model estimates from a PROC REG are output to a dataset, which is then used as the set of input coefficients to PROC SCORE on two separate occasions: once to score the original modeling dataset, and once to score a validation dataset. The VAR statement must be followed by the response variable and all predictors. The option TYPE=PARMS is used when the input to PROC SCORE is a set of regression coefficients from a PROC REG, as it is in this case. The option PREDICT is used so that PROC SCORE outputs actual predictions of the response variable; by default it outputs the negative residual for each observation, which is the model's predicted value minus the actual value.
proc reg data=<mod_set> outest=<mod_estimates>;
   model Inductis_Score = &predictor_list
         /selection=stepwise sle=0.25 sls=0.15 stb;
run;

proc score data=<mod_set> score=<mod_estimates> out=<mod_scores>
           type=parms predict;
   var Inductis_Score &predictor_list;
run;

proc score data=<val_set> score=<mod_estimates> out=<val_scores>
           type=parms predict;
   var Inductis_Score &predictor_list;
run;
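For intuition, the fit-then-score flow above can be mimicked in a few lines of Python (simulated data; this is a conceptual analogue of PROC REG with OUTEST= followed by PROC SCORE, not a replacement for it):

```python
import numpy as np

# Simulated modeling and validation sets (hypothetical data).
rng = np.random.default_rng(1)

def make_set(n):
    X = rng.normal(size=(n, 3))
    y = 1.0 + X @ np.array([0.5, -0.2, 0.8]) + rng.normal(0, 0.5, n)
    return X, y

X_mod, y_mod = make_set(400)    # modeling set
X_val, y_val = make_set(200)    # validation set

# Analogue of outest=: fit on the modeling set and keep the coefficients.
A_mod = np.column_stack([np.ones(len(y_mod)), X_mod])
coef = np.linalg.lstsq(A_mod, y_mod, rcond=None)[0]

# Analogue of proc score ... predict: apply the stored coefficients
# unchanged to both datasets.
pred_mod = A_mod @ coef
pred_val = np.column_stack([np.ones(len(y_val)), X_val]) @ coef
```

The key point mirrored here is that the validation set is scored with the modeling-set coefficients; the model is never re-fit on the validation data.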
To calculate R2 on the validation set, you can run a PROC CORR on the output dataset from the PROC SCORE above (run with the PREDICT option). You need only keep the predicted values (MODEL1) and the actual scores, to avoid unnecessary correlation calculations involving other variables. The Pearson correlation coefficient can be found in the output, and its square is the R2 of the model determined by the original PROC REG. The short piece of required code is shown below.
proc corr data=<val_scores> (keep=MODEL1 Inductis_Score);
run;
The output will look like the following. The Pearson coefficient of correlation between the two variables has a value of 0.52433 in this output; the square of that value is the R2 for this model.
The CORR Procedure

2 Variables: Inductis_Score MODEL1

                          Simple Statistics

Variable            N      Mean   Std Dev    Sum     Minimum    Maximum
Inductis_Score  30759   0.12239   0.20447   3765  3.05567E-6    1.00000
MODEL1          30759   0.12239   0.10721   3765    -0.13820    0.86518

          Pearson Correlation Coefficients, N = 30759
                 Prob > |r| under H0: Rho=0

                  Inductis_Score      MODEL1
Inductis_Score           1.00000     0.52433
                                      <.0001
MODEL1                   0.52433     1.00000
                          <.0001
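The squared-correlation calculation can be reproduced directly (Python, made-up actuals and predictions):

```python
import numpy as np

# Made-up binary actuals and model predictions for eight observations.
actual = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
pred   = np.array([0.8, 0.2, 0.6, 0.4, 0.9, 0.1, 0.5, 0.7])

r = np.corrcoef(actual, pred)[0, 1]   # Pearson correlation coefficient
r2 = r ** 2                           # validation R2 via squared correlation
```

Note that on a validation sample this squared correlation need not equal 1 - SSE/SST exactly; it is the convention used in the PROC CORR approach above.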
Logistic Regression
In a classification setting, outcome probabilities can be assigned to observations through the use of a logistic model, which transforms information about the binary dependent variable into an unbounded continuous variable (the log-odds) and estimates a regular multivariate model on that scale (see Allison's Logistic Regression Using the SAS System for more information on the theory of logistic regression). The functional form is:
   ln( p / (1 - p) ) = α + β1X1 + β2X2 + β3X3 + ... + βkXk

   p = e^(α + β1X1 + β2X2 + β3X3 + ... + βkXk) / (1 + e^(α + β1X1 + β2X2 + β3X3 + ... + βkXk))
Here p is the probability that an observation with variable values (X1, X2, ..., Xk) will take the value of the binary outcome designated as 1. As with the multivariate regression constraints, before running a logistic model in SAS you first need to check that your dataset is ready for logistic modeling. SAS will discard any observations with missing values for any of the variables the user specifies for consideration in the model, so general modeling data preparation and missing-value imputation must already have occurred. Also, PROC LOGISTIC will only accept numeric variables as predictors, unless categorical character variables are specified in the CLASS statement. You must also be careful with categoricals that are coded with numerals: the program will treat these as if they were continuous numerics unless they are specified in the CLASS statement. To run a logistic regression in SAS and estimate the coefficients of the model, one would run code similar to the below:
proc logistic data=<libname>.<modeling dataset> DESCENDING;
   model dependent_variable = var1 var2 var3 var4 var5
                              var6 var7 var8 var9;
run;
The DESCENDING option lets SAS know that the value of the dependent variable we wish to predict is 1, and not 0. Among the output in the .lst file you will see a table like the following:
The LOGISTIC Procedure

            Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter   DF   Estimate      Error  Chi-Square   Pr > ChiSq
Intercept    1     1.7033     0.0796    458.0953       <.0001
Var1         1   -0.00737   0.000757     94.7829       <.0001
Var2         1    -0.0139   0.000999    192.4487       <.0001
Var3         1   8.944E-6   9.388E-7     90.7652       <.0001
Var4         1   0.000150   0.000025     34.4986       <.0001
Var5         1   0.000122   0.000062      3.8263       0.0505
Var6         1   0.000033   0.000063      0.2797       0.5969
Var7         1   8.074E-6    2.98E-7    733.9139       <.0001
Var8         1   0.000320   0.000044     52.6921       <.0001
Var9         1   -0.00580   0.000599     93.7599       <.0001
The Estimate column gives the associated parameter estimate for each variable, next to its Standard Error. These estimates are the betas in the equation shown above. The Wald Chi-Square value is used to calculate the coefficient's p-value, a measure of whether it is significantly different from 0. With no other options selected, SAS will estimate the full model, meaning all variables will be included regardless of whether their coefficients are significantly different from 0. You will notice the high p-value associated with the coefficient for the variable labeled Var6.
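Once the betas are estimated, scoring an observation is just the functional form above applied by hand. A minimal sketch (Python, with hypothetical coefficient values, not the ones from this output):

```python
import math

# Hypothetical fitted coefficients and one observation's predictor values.
alpha = -1.5
betas = [0.8, -0.3]
x = [2.0, 1.0]

# Linear predictor (log-odds), then the logistic transform to a probability.
logit = alpha + sum(b * xi for b, xi in zip(betas, x))
p = math.exp(logit) / (1 + math.exp(logit))   # P(outcome = 1), always in (0, 1)

# Sanity check: the logit of p recovers the linear predictor.
assert abs(math.log(p / (1 - p)) - logit) < 1e-9
```

Whatever the linear predictor's value, p stays strictly between 0 and 1, which is what makes the logistic form suitable for probabilities.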
[Output table from the model re-estimated with Var6 removed: parameters Intercept, Var1-Var5, Var7-Var9; the estimate values were not recovered here.]
A higher percentage of concordant pairs is better, as it measures how consistently the model ranks randomly chosen events higher than randomly chosen non-events. Please see the Validation documentation on the MRL for additional information and other statistics involving these pairs.
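The concordance idea can be illustrated with a small sketch (Python, made-up scores): enumerate all (event, non-event) pairs and count those where the event received the higher score.

```python
import numpy as np

# Made-up outcomes and model scores for seven observations.
y     = np.array([1, 1, 1, 0, 0, 0, 0])
score = np.array([0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1])

ev, ne = score[y == 1], score[y == 0]          # event and non-event scores
pairs = [(e, n) for e in ev for n in ne]       # all (event, non-event) pairs
concordant = sum(e > n for e, n in pairs)      # event scored higher
pct_concordant = 100 * concordant / len(pairs)
print(pct_concordant)                          # 11 of 12 pairs concordant
```

Here one pair (the event scored 0.4 against the non-event scored 0.5) is discordant, giving roughly 91.7% concordant pairs.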
Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    9.1720     8        0.3280
A low chi-square with a high p-value indicates a good fit, since the expected frequencies are not far from the actual frequencies.
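For intuition about what this test computes, observations can be binned by predicted probability and observed event counts compared with expected counts in each bin. A sketch with simulated data follows (Python; the data, and the choice of 10 bins, are illustrative):

```python
import numpy as np

# Simulated, well-calibrated predictions: outcomes drawn with the
# same probabilities the "model" predicts.
rng = np.random.default_rng(2)
p = np.sort(rng.uniform(0.05, 0.95, 1000))     # predicted probabilities
y = (rng.uniform(size=1000) < p).astype(int)   # outcomes consistent with p

g = 10                                         # number of score bins
chi2 = 0.0
for grp in np.array_split(np.arange(1000), g):
    n_g = len(grp)
    obs = y[grp].sum()     # observed events in this bin
    exp = p[grp].sum()     # expected events in this bin
    chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
print(round(chi2, 2))      # compare against a chi-square with g-2 df
```

Because the outcomes here were generated from the predicted probabilities, the statistic should be modest, corresponding to a non-significant p-value, i.e. no evidence of poor fit.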
Validation/Results Presentation
What can I do to tell if my model is working?
Model validation is an important part of the modeling process. There are many statistics that can be analyzed, and many methods that can be used, to test whether the model you produced is rigorous and reliable. You can find a longer discussion in the Validation and Presentation documentation on the MRL. Here we will introduce a couple of fairly quick validation methods, one of which is used frequently here at Inductis and is often the way results are presented to the client.

One possible approach is to use the validation data (if such data exists) to re-estimate the final model that was originally determined on the modeling dataset. Usually we would do this using only the predictor variables that were included in the original model. Comparing coefficients from the modeling estimation and the validation estimation is a way of assessing the model's consistency. Are the parameter estimates close? Does each coefficient fall within the standard error interval described by the other model? This procedure is one step of a more advanced process called n-fold cross validation, which will not be covered in detail here.

Another validation method involves scoring both the modeling and validation datasets (using the coefficients from the original model) and producing a gains chart from each set of scores. A gains chart is obtained by rank ordering all observations by their model score and determining what percentage of events (1s on the dependent variable) is captured in any given percentage of the rank ordered list. A sample gains table is shown below, where the list was split into centiles. From this table, you can take all the ordered pairs (Centile, Cumulative Percent of Events) to graph the gains chart.
                                    Percent of                 Percent of
           Number of   Cumulative      Events    Number of     Non-Events
Centile       Events       Events    Captured   Non-Events       Captured   Difference
   1               4            4          2%            0             0%       0.0185
   2               4            8          4%            0             0%       0.0370
   3               5           13          6%            0             0%       0.0602
   4               3           16          7%            1             0%       0.0697
   5               4           20          9%            1             1%       0.0839
                            <Rows Deleted Here>
  91               0          214         99%            4            83%       0.1560
  92               0          214         99%            5            86%       0.1342
  93               1          215        100%            3            87%       0.1258
  94               0          215        100%            5            89%       0.1041
  95               0          215        100%            4            91%       0.0867
  96               0          215        100%            5            93%       0.0649
  97               0          215        100%            4            95%       0.0475
  98               1          216        100%            4            97%       0.0348
  99               0          216        100%            4            98%       0.0174
 100               0          216        100%            4           100%       0.0000
The gains chart from the modeling dataset alone can be used to assess the performance of the model. This assessment focuses on the fact that we want the curve to rise as quickly as possible above the diagonal line that would be obtained with a random ranking. Sometimes we will focus on the specific vertical distance between the two lines at particular points of interest along the curve.

The gains curves from the modeling and validation sets can also be compared with one another to determine how consistently the model performs on a validation sample. If the model's performance is much poorer on the validation sample, then there is cause for concern, and perhaps the model was overfitted.

Below is some code that will score both a modeling and a validation dataset, and then prepare those scores, using the bin macro, for the gains tables used to build gains charts. There are alternative ways to score a validation dataset, such as using PROC LOGISTIC with initial estimates set to the coefficient estimates from a modeling step and the maximum number of iterations set to 0, but the lgtscore2 macro is commonly used.
proc logistic data=<mod_data> DESCENDING outest=<coeff_estimates>;
   model dep_var = var1 var2 .... ;
   output out=<mod_scores> p=pred_mod;
run;

%bin(infile=<mod_scores>, outfile=<mod_scores_bin>, q=100,
     var=pred_mod, var_r=pred_mod_r, var_q=pred_mod_q);

proc summary nway missing data=<mod_scores_bin>;
   class pred_mod_q pred_mod_r;
   var remiss;
   output out=<mod_scores_LZ> sum=;
run;

%lgtscore2(indata=<val_data>, outdata=<val_scores>,
           regest=<coeff_estimates>, depvar=dep_var,
           depvarh=pred_val);

%bin(infile=<val_scores>, outfile=<val_scores_bin>, q=100,
     var=pred_val, var_r=pred_val_r, var_q=pred_val_q);

proc summary nway missing data=<val_scores_bin>;
   class pred_val_q pred_val_r;
   var remiss;
   output out=<val_scores_LZ> sum=;
run;
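The gains-table construction itself (rank by score, split into bins, cumulate captured events) is simple enough to sketch outside SAS. Below is an illustrative Python version, using simulated scores and 10 bins instead of 100 centiles to keep the output small:

```python
import numpy as np

# Simulated scores and outcomes: a higher score means a higher
# probability of being an event (hypothetical data).
rng = np.random.default_rng(3)
score = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < score).astype(int)

order = np.argsort(-score)            # rank observations, best scores first
y_ranked = y[order]
bins = np.array_split(y_ranked, 10)   # 10 equal-size bins of the ranked list

cum_events = np.cumsum([b.sum() for b in bins])
pct_captured = 100 * cum_events / y.sum()   # cumulative % of events captured
print(np.round(pct_captured, 1))
```

A useful model's curve rises faster than the 10%, 20%, ... diagonal of a random ranking; plotting (bin, pct_captured) gives the gains chart described above.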
Bibliography
General references:
- SAS OnlineDoc 9.1.3: http://support.sas.com/onlinedoc/913/docMainpage.jsp
- StatSoft Electronic Statistics Textbook: www.statsoftinc.com/textbook/stathome.html

Logistic regression:
- Allison, Paul D. (2001). Logistic Regression Using the SAS System.
- http://userwww.sfsu.edu/~efc/classes/biol710/logistic/logisticreg.htm
- http://www2.chass.ncsu.edu/garson/pa765/logistic.htm