
2.1 Exploratory Data Analysis

Classical methods of statistical inference depend heavily on at least the following assumptions:
- The data are outlier-free.
- The data are nearly normal.

In addition, certain statistical techniques (e.g., multiple regression) require that:
- The relationships of the independent variable(s) with the dependent variable are linear.
- The independent variables themselves are not highly correlated.

The following graphical techniques, available in the IDAMS module GRAPHID, can provide useful insights into the nature of the data:
- Histogram
- Box and whisker plot
- Scatter plot

2.2 Construction of Indices

An important question in social survey research is how to measure latent variables. Latent variables are variables that cannot be observed or measured directly; examples are job satisfaction, leadership quality, and the work environment in a company. Such variables are measured through certain indicators (called indicator items, or simply items). Indicator items are selected on the basis of theoretical grounds, prior research evidence, and, if these are not available or not well founded, through factor analysis. The indicator items measuring a latent variable should satisfy at least the following criteria:
- The items measuring a latent variable should be highly correlated among themselves.
- The items measuring a latent variable should not be correlated with the items of another latent variable.

These conditions are necessary, since all the items of a latent variable are expected to measure that latent variable uniquely and no other latent variable. After the indicator items have been selected, the next step in data analysis is to construct a single score or index from the selected items.

3.1 Descriptive Statistics

Descriptive statistics provide basic information about the distribution of a variable in a data set, such as:
(i) Its central tendency: how the data bunch up
(ii) Its dispersion: how the data spread out
(iii) The shape of the distribution: what the data look like

Choice of Statistical Techniques

As mentioned earlier, the choice of a statistical technique is a complex issue, which should not be reduced to a cookbook approach. With this caveat, we have

suggested appropriate techniques for different data analysis situations in Tables 1 and 2. They are meant as a framework and are intended as general guidelines; they should not be applied rigidly in all situations. It would be desirable to try more than one way of analyzing the data, whenever possible.

Table 1: Appropriate techniques for problems without distinction between independent and dependent variables

No. of Variables | Measurement Level | Analysis Method
NON-METRIC
One | Nominal | Frequencies, Proportions
One | Ordinal | Median, Mode
One | Preferences | Rank (consensus among evaluators)
METRIC
One | Interval or ratio scale | Mean, Median, Mode, Variance, Skewness, Kurtosis
NON-METRIC
Two | Dichotomous | Cross-tabulation, Chi-square
Two | Nominal | Cross-tabulation, Chi-square, Correspondence Analysis
Two | Ordinal | Kendall's Tau, Spearman's Rho, Gamma
METRIC
Two | Interval-scale | Scatter plot, Pearson's Correlation Coefficient
More than two | Interval-scale | Principal Components Analysis, Factor Analysis, Cluster Analysis, Multidimensional Scaling
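For a single interval- or ratio-scaled variable, the descriptive measures listed in Table 1 can be computed directly. The following is a minimal Python sketch (illustrative only, not part of IDAMS; the data values are hypothetical):

    import numpy as np
    from scipy import stats

    # Hypothetical interval-scaled variable: scores for 12 respondents
    y = np.array([23.0, 25.0, 21.0, 30.0, 28.0, 25.0, 27.0, 35.0, 22.0, 26.0, 24.0, 29.0])

    vals, counts = np.unique(y, return_counts=True)

    print("mean    :", y.mean())               # central tendency
    print("median  :", np.median(y))           # central tendency
    print("mode    :", vals[counts.argmax()])  # most frequent value
    print("variance:", y.var(ddof=1))          # dispersion (sample variance)
    print("skewness:", stats.skew(y))          # shape: asymmetry
    print("kurtosis:", stats.kurtosis(y))      # shape: excess kurtosis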

Table 2: Appropriate techniques for problems with distinction between independent and dependent variables

No. of Dependent Variables | No. of Independent Variables | Measurement Level (Dependent) | Measurement Level (Independent) | Analysis Method
One | One | Nominal | Nominal | Non-parametric tests, Chi-square
One | One | Nominal (dichotomous) | Nominal | Multiple Classification Analysis
One | One | Nominal | Nominal (dichotomous) | Wilcoxon's two-sample test, Chi-square, Kolmogorov-Smirnov test
One | One | Interval-scale | Nominal (dichotomous) | t-test, Analysis of Variance
One | One | Interval-scale | Interval-scale | Regression Analysis
One | One | Interval-scale | Nominal | Analysis of Variance
One | One | Nominal | Interval-scale | Discriminant Analysis
One | More than one | Interval-scale | Nominal | Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One | More than one | Interval-scale | Dummy | Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One | More than one | Interval-scale | Interval-scale | Multiple Regression Analysis

1. Cluster Analysis
A procedure for partitioning a set of objects into groups or clusters in such a way that the profiles of objects in the same cluster are very similar, whereas the profiles of objects in different clusters are quite distinct. The number and characteristics of the clusters are not known a priori and are derived from the data.

2. Correlation Analysis
Correlation is a measure of the relationship between two or more variables. The most commonly used type of correlation coefficient is Pearson's r, also called linear or product-moment correlation. It is essential that the variables are measured on at least interval scales.

3. Discriminant Analysis
A technique for classifying objects into one of two or more alternative groups (or populations) on the basis of a set of measurements (i.e., variables). The populations are known to be distinct, and an object can belong to only one of them. The technique can also be used to identify which variables contribute to making the classification. Thus, the technique can be used for description as well as for prediction.

4. Principal Components Analysis
Principal components analysis (PCA) is performed to simplify the description of a set of interrelated variables in a data matrix. PCA transforms the original variables into new uncorrelated variables, called principal components. Each principal component is a linear combination of the original variables. The amount of information conveyed by a principal component is its variance. The principal components are derived in decreasing order of variance: the most informative principal component is the first, and the least informative is the last.

5. Factor Analysis
Factor analysis is similar to principal components analysis in that it is a technique for examining the interrelationships among a set of variables, but its objective is somewhat different. Classical factor analysis is viewed as a technique for clarifying the underlying dimensions or factors that explain the pattern of correlations among a much larger set of variables. The technique is applied to (i) reduce the number of variables, and (ii) detect the structure in the relationships among variables. Modern factor analysis, implemented in IDAMS, aims to represent geometrically the information in a data matrix in a low-dimensional Euclidean space and to provide related statistics. The fundamental goal is to highlight relations among elements (variables/individuals), which are represented by points in graphical displays (called factorial maps), and to reveal the structural features of the data matrix. In these maps, both variables and individuals can be displayed. Since the number of individuals is often very large, they are represented by the centers of gravity of their categories.

6. Correspondence Analysis
Correspondence analysis (CA) is a multivariate technique for exploring cross-tabular data by converting them into graphical displays, called factorial maps, and related numerical statistics. CA is primarily intended to reveal features in the data rather than to test hypotheses about the underlying processes which generate the data. Correspondence analysis and principal components analysis are, however, used under different circumstances. PCA uses covariances or correlations (Euclidean metrics) for data reduction and is therefore applicable to continuous measurements. CA, on the other hand, uses chi-square metrics and is therefore applicable to contingency tables (cross-tabulations). By extension, correspondence analysis can also be applied to tables with binary coding. The module can handle active as well as passive variables. Active variables are those which participate in the determination of the factorial axes; passive variables are those which do not participate in the determination of the factorial axes, but are projected onto them.

7. Multidimensional Scaling (MDS)
Multidimensional scaling is an exploratory data analysis technique that transforms the proximities (or distances) between each pair of objects (or variables) in a given data set into comparable Euclidean distances. MDS produces a spatial representation of the objects (usually a two-dimensional map) in such a way as to maximize the fit between the proximities for each pair of objects and the Euclidean distance between them in the spatial representation. The greater the proximity between two objects, the closer they are situated in the map. Like factor analysis, the main concern of MDS is to reveal the structure of relationships among the objects.

8. Multiple Classification Analysis (MCA)
MCA is a technique for examining the inter-relationships between several predictor variables and a dependent variable. The technique can handle predictors with no better than nominal measurement and interrelationships of any kind among the predictors or between a predictor and the dependent variable. The dependent variable may be interval-scaled or dichotomous.

9. Analysis of Variance
This statistical technique assesses the effect of an independent or 'control' categorical variable (factor) upon a continuous dependent variable.

10. POSCOR (ranking program based on partially ordered sets)
POSCOR is a procedure for ranking objects when more than one variable is considered simultaneously in the rank-ordering. The procedure offers the possibility of giving each object belonging to a given set its relative position, in probabilistic terms, vis-a-vis the other objects in the same set. The position of each object is measured by a score, called the POSCOR score.

11. Rank
The procedure allows the aggregation of individual opinions expressing the choice of priorities, the ranking of alternatives or the selection of preferences. It determines a reasonable rank order of alternatives, using preference data as input and three different ranking procedures: two based on fuzzy logic and one based on classical logic.

12. Regression Analysis
A technique for exploring the relationship between a dependent variable and one or more independent variables. Linear regression explores relationships that can be described by straight lines or their generalization to many dimensions.

13. Search
Search is a binary segmentation procedure for developing a predictive model for one or more dependent variables. It divides the sample, through a series of binary splits, into a mutually exclusive series of subgroups, such that at each binary split the two new subgroups reduce the predictive error more than a split into any other pair of subgroups.

14. Typology
A clustering procedure for large data sets which can handle nominal, ordinal and interval-scaled variables simultaneously. The procedure can handle active and passive variables. Active variables are those which take part in the construction of the typology, whereas passive variables are those which do not take part in the construction of the typology, but whose average statistics are computed for each typology group.

PARAMETRIC STATISTICS

4.3 Analysis of Variance (One-Way ANOVA)

One-way analysis of variance can be viewed as a special case of bivariate analysis, where one variable (X) is a categorical (nominal) variable and the other variable (Y) is an interval-scaled variable. Suppose X is classified into g categories. The interrelationship between X and Y involves a comparison of the distribution of Y among the g categories of X. This comparison may involve distribution parameters such as means or variances. The statistical procedure that compares the means of different groups is called Analysis of Variance (ANOVA). It may seem odd that a procedure that compares means is called analysis of variance, but the name derives from the fact that, in order to test for statistical significance between means, variances are actually compared (i.e., analyzed). The categorical variable is usually referred to as a factor and its categories are called levels; one-way analysis of variance is also called single-factor analysis.

The data can be laid out as g groups of Y observations (Group 1, Group 2, ..., Group g), with the sum and mean computed for each group. The overall or grand mean is

ȳ = (Σy1 + Σy2 + ... + Σyg) / n

Essentially, the method depends upon partitioning both the degrees of freedom and the sums of squared deviations into one component, called error, and another, called effect. The sum of squared deviations for the effect is also influenced by the error, an all-pervading uncertainty or noise, distributed such that under the null hypothesis of no difference between the means of the categories (i.e., absence of effect) the expected values of the two sums of squared deviations are proportional to their respective degrees of freedom. Hence the mean squared deviations (the sums of squared deviations divided by their degrees of freedom) have the same expectation. If, however, the effect does exist, it inflates its own mean squared deviation, but not that of the error. If the effect is large enough, it leads to significance in the F test, where F equals the ratio of the mean squared deviation for the effect to that for the error.

The behavior of the Y observations can be modeled as follows:

yij = μj + εij,  i = 1, 2, ..., nj;  j = 1, 2, ..., g

where εij represents a random variable with mean 0 and variance σj². It is assumed that the εij are mutually independent. Under the assumption of homogeneity of variance, σj² = σ² for all j, and the model has a common error variance σ².

Here the group mean ȳj is an estimate of μj, and (yij - ȳj) is an estimate of εij. The predicted value of yij from this model is ŷij = ȳj.

Total sum of squares
The total variation in Y over the sample, or the total sum of squares (TSS), is

TSS = Σj Σi (yij - ȳ)²

where ȳ is the overall or grand mean, ȳ = (Σj Σi yij)/n, and n = Σj nj. This sum of squares measures the variation of the values of Y around the grand mean (i.e., the sum of squared deviations).

Sum of squares between groups
The sum of squares between groups (BSS) measures the variation between the group means and is computed as

BSS = Σj nj (ȳj - ȳ)²

Mean square between groups: MSB = BSS/(g - 1)

Sum of squares within groups
The sum of squares within groups is given by

WSS = Σj Σi (yij - ȳj)²

Mean square within groups: MSW = WSS/(n - g)

It can easily be seen that TSS = BSS + WSS; thus the total variation is partitioned into two components, the between-groups and within-groups sums of squares. Under the normality assumption, the ratio MSB/MSW has an F distribution with (g - 1) and (n - g) degrees of freedom. This test statistic can be used to test the null hypothesis of no difference between the group means. The computation of the F statistic is summarized in the following ANOVA table.

Source of variance | Degrees of freedom | Sum of squares | Mean square | F ratio
Between | g - 1 | BSS | MSB = BSS/(g - 1) | MSB/MSW
Within | n - g | WSS | MSW = WSS/(n - g) |
Total | n - 1 | TSS | |
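The quantities in the ANOVA table can be computed in a few lines. The following Python sketch (illustrative only, not the IDAMS procedure; the group data are hypothetical) obtains BSS, WSS and the F ratio exactly as defined above, and checks the result against SciPy's one-way ANOVA:

    import numpy as np
    from scipy import stats

    # Hypothetical interval-scaled Y observed in g = 3 categories of X
    groups = [np.array([12.0, 15.0, 14.0, 10.0]),
              np.array([18.0, 17.0, 16.0, 19.0, 15.0]),
              np.array([11.0, 9.0, 12.0, 13.0])]

    y_all = np.concatenate(groups)
    n, g = len(y_all), len(groups)
    grand_mean = y_all.mean()

    bss = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
    wss = sum(((y - y.mean()) ** 2).sum() for y in groups)
    tss = ((y_all - grand_mean) ** 2).sum()          # TSS = BSS + WSS

    msb, msw = bss / (g - 1), wss / (n - g)
    F = msb / msw
    p = stats.f.sf(F, g - 1, n - g)                  # upper-tail probability

    print(F, p)
    print(stats.f_oneway(*groups))                   # same F and p-value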

4.4 Regression Analysis

Regression analysis is one of the most commonly used statistical techniques in the social and behavioral sciences as well as in the physical sciences. Its main objective is to explore the relationship between a dependent variable and one or more independent variables (also called predictor or explanatory variables). Linear regression explores relationships that can be readily described by straight lines or their generalization to many dimensions. A surprisingly large number of problems can be solved by linear regression, and even more by transformations of the original variables that result in linear relationships among the transformed variables. Mathematically, the regression model is represented by the following equation:

Yi = α + Σj βj Xij + εi,  j = 1, 2, ..., p

where p is the number of predictors, the subscript i refers to the ith observation, the subscript j refers to the jth predictor, and εi is the difference between the ith observation and the model; εi is also called the error term. In this chapter, however, we examine the case of simple linear regression, involving only two variables, one dependent (Y) and one independent (X):

Yi = α + β Xi + εi

The first step in determining whether there is a relationship between two variables is to examine the graph of the observed data (Y against X), called a scatter plot. The IDAMS modules SCATTER or GRAPHID can be used to draw the scatter plot. If there is a relationship between the variables X and Y, the dots of the scatter plot will be more or less concentrated around a curve, which may be called the curve of regression. In the particular case where the curve is a straight line, it is called the line of regression and the regression is said to be linear. In addition to checking linearity, the scatter plot is also useful for observing whether there are any outliers in the data and whether there are two or more clusters of points.

For the population, the bivariate regression model is Yi = α + β Xi + εi, where the subscript i refers to the ith observation, α is the intercept and β is the regression coefficient. The intercept, α, is so called because it is where the line intercepts the Y-axis; it estimates the average value of Y when X = 0.

Assumptions
The regression model is based on the following assumptions:
- The relationship between X and Y is linear.
- The expected value of the error term is zero.
- The variance of the error term is constant for all values of the independent variable X (the assumption of homoscedasticity).
- There is no autocorrelation: E(εi εj) = 0 for i ≠ j.
- The independent variable is uncorrelated with the error term.
- The error term is normally distributed.

Estimation of Parameters
The random sample of observations can be used to estimate the parameters of the regression equation. The method of least squares is used to fit a continuous dependent variable (Y) as a linear function of a single predictor variable (X). The least squares method finds the line which minimizes the sum of squared deviations

from each point in the sample to the point on the line corresponding to that X-value. Given a set of n observations Yi of the dependent variable corresponding to a set of values Xi of the predictor, and the assumed regression model, the ith residual is defined as the difference between the ith observation Yi and the fitted value Ŷi:

di = (Ŷi - Yi)

The least-squares line is

Ŷ = A + BX

where

B = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²  and  A = Ȳ - B X̄

Here X̄ and Ȳ denote the sample means of X and Y, and Ŷ denotes the predicted value of Y for a given X. The estimate of σ² is called the residual mean square and is computed as

RMS = Σ(Yi - Ŷi)² / (n - 2)

The number n - 2, called the residual degrees of freedom, is the sample size minus the number of estimated parameters (in this case, A and B). The square root of the residual mean square (RMS) is called the standard error of the estimate and is denoted by S. In effect, it indicates the reliability of the estimating equation. The standard errors of A and B are

SE(B) = S / sqrt[Σ(Xi - X̄)²]
SE(A) = S sqrt[1/n + X̄² / Σ(Xi - X̄)²]

Standardized regression coefficient
The standardized regression coefficient is the slope of the regression equation when X and Y are standardized. After standardization, the intercept A is equal to zero, and the standardized slope is equal to the correlation coefficient r.

Significance of regression
For testing the null hypothesis H0: β = 0, it is convenient to present the results of the regression analysis in the form of an analysis of variance (ANOVA) table. If X were useless in predicting Y, the best estimate of Y would be Ȳ, regardless of the value of X. To measure how different the fitted line is from Ȳ, we calculate the sum of squares for regression as Σ(Ŷi - Ȳ)², summed over all data points. The residual mean square is a measure of how poorly or how well the regression line fits the actual data points: a large residual mean square indicates poor fit. If the residual mean square is large, the value of F will be low and the F ratio may be non-significant. If the F ratio is statistically significant, the null hypothesis H0: β = 0 is rejected.

ANOVA Table for Simple Linear Regression

Source of Variation | Sums of Squares | df | Mean Square | F
Regression | SSreg | 1 | MSreg = SSreg / 1 | MSreg / MSres
Residual | SSres | N - 2 | MSres = SSres / (N - 2) |
Total | SStot | N - 1 | |
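As a small numerical illustration of the least-squares computations above (hypothetical paired data, not taken from any IDAMS example), the following Python sketch computes A, B, the sums of squares and the F ratio directly from the formulas:

    import numpy as np

    # Hypothetical paired observations (X, Y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
    n = len(x)

    B = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    A = y.mean() - B * x.mean()                 # least-squares line: y_hat = A + B*x
    y_hat = A + B * x

    ss_reg = ((y_hat - y.mean()) ** 2).sum()    # sum of squares for regression
    ss_res = ((y - y_hat) ** 2).sum()           # residual sum of squares
    rms = ss_res / (n - 2)                      # residual mean square
    F = (ss_reg / 1) / rms                      # tests H0: beta = 0

    se_B = np.sqrt(rms / ((x - x.mean()) ** 2).sum())
    print(A, B, F, se_B)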

Multiple Regression

Multiple regression: In multiple regression analysis, we study the relationship between one dependent variable and several independent variables (called predictors). The regression equation takes the form

Y = b0 + b1x1 + b2x2 + ... + bpxp + e

where Y is the dependent variable, the b's are the regression coefficients for the corresponding x (independent) terms, b0 is a constant or intercept, and e is the error term reflected in the residuals. The parameters of the regression equation are estimated using the ordinary least squares (OLS) method.

Ordinary least squares: This method derives its name from the criterion used to draw the best-fit regression line: a line such that the sum of the squared deviations of the distances of all the points to the line is minimized.

Intercept: The intercept, b0, is where the regression plane intersects the Y-axis. It is equal to the estimated Y value when all the independents have a value of 0.

Regression coefficients: The regression coefficients bi are the slopes of the regression plane in the direction of xi. Each regression coefficient represents the net effect the ith variable has on the dependent variable, holding the remaining x's in the equation constant.

Beta weights are the regression coefficients for standardized data. Beta is the average amount by which the dependent variable increases when the independent variable increases by one standard deviation and the other independent variables are held constant. The ratio of the beta weights is the ratio of the predictive importance of the independent variables. Standardized means that for each datum the mean is subtracted and the result divided by the standard deviation; as a result, all variables have a mean of 0 and a standard deviation of 1.

Residuals are the differences between the observed values and those predicted by the regression equation.

Dummy variables: Regression assumes interval data, but dichotomies may be considered a special case of intervalness. Nominal and ordinal categories can be transformed into sets of dichotomies, called dummy variables. To prevent perfect multicollinearity, one category must be left out. Interpretation of b for dummy variables: for the b coefficients of dummy variables coded in the usual binary way (1 = present, 0 = not present), b is relative to the reference category (the category left out).

Multiple R: The correlation coefficient between the observed and predicted values. It ranges in value from 0 to 1. A small value indicates that there is little or no linear relationship between the dependent variable and the independent variables.

Multiple R²: The proportion of the variance in the dependent variable explained by the independent variables; it is also called the coefficient of multiple determination. Mathematically, R² = 1 - (SSE/SST), where SSE = error sum of squares = Σ(Yi - Est Yi)², with Yi the actual value of Y for the ith case and Est Yi the regression prediction for the ith case, and SST = total sum of squares = Σ(Yi - MeanY)².

Adjusted R-square: When there are a large number of independent variables, R² may become artificially large, simply because some independent variables' chance variations "explain" small parts of the variance of the dependent variable. It is therefore essential to adjust the value of R² as the number of independent variables increases. With few independent variables, R² and adjusted R² will be close; with a large number of independent variables, adjusted R² may be noticeably lower.

Multicollinearity is the intercorrelation of the independent variables. Values of r² near 1 violate the assumption of no perfect collinearity, while high r² values increase the standard errors of the regression coefficients and make assessment of the unique role of each independent variable difficult or impossible. While simple correlations tell something about multicollinearity, the preferred method of assessing it is to compute the determinant of the correlation matrix. Determinants near zero indicate that some or all independent variables are highly correlated.

Partial correlation is the correlation of two variables while controlling for a third or more other variables. For example, r12.34 is the correlation of variables 1 and 2, controlling for variables 3 and 4. If the partial correlation r12.34 is equal to the uncontrolled correlation r12, the control variables have no effect; if the partial correlation is near 0, the original correlation is spurious.

Stepwise regression: Stepwise regression is a sequential process for fitting the least squares model, where at each step a single predictor variable is either added to or removed from the model in the next fit.

Multiple Classification Analysis

Multiple classification analysis: Multiple Classification Analysis (MCA) is a technique for examining the interrelationship between several predictor variables and one dependent variable in the context of an additive model. Independent variables may be measured on nominal or ordinal scales, and the dependent variable may be interval-scale or a dichotomy.

Additive model: Such a model assumes that the dependent variable can be predicted from an additive combination of the independent (or predictor) variables. In other words, it assumes that the average score on the dependent variable for a given set of individuals (objects or cases) is predictable by adding the effects of the several predictors.

Eta: Eta indicates the ability of a predictor, using the given categories, to explain variation in the dependent variable.

Eta square: Eta² is the correlation ratio and indicates the proportion of the total sum of squares explained by the predictor.

MCA Beta: This is directly analogous to the Eta statistic, but is based on the adjusted means rather than the raw means. Beta is a measure of the ability of a predictor to explain variation in the dependent variable, after adjusting for the effects of all other predictors. Note that this is not in terms of percentage of variance explained.

Multiple correlation coefficient squared: This coefficient indicates the proportion of variance explained in this run of the program.

Adjustment for degrees of freedom: This is the factor used to correct for capitalizing on chance in fitting the model in the particular sample being analyzed.

Multiple correlation coefficient squared (adjusted): This coefficient estimates the proportion of variance in the dependent variable explained by the predictor variables.

5.2 Multiple Regression Model

Consider a random sample of n observations (xi1, xi2, ..., xip, yi), i = 1, 2, ..., n. The p + 1 random variables are assumed to satisfy the linear model

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ui,  i = 1, 2, ..., n

where the ui are values of an unobserved error term u, and the unknown parameters are constants.

Assumptions
- The error terms ui are mutually independent and identically distributed, with mean 0 and constant variance: E[ui] = 0 and V[ui] = σ². This is so because the observations y1, y2, ..., yn are a random sample; they are mutually independent, and hence the error terms are also mutually independent.
- The distribution of the error term is independent of the joint distribution of x1, x2, ..., xp.
- The unknown parameters β0, β1, β2, ..., βp are constants.

The equations relating the n observations can be written as:

y1 = β0 + β1x11 + β2x12 + ... + βpx1p + u1
y2 = β0 + β1x21 + β2x22 + ... + βpx2p + u2
...
yn = β0 + β1xn1 + β2xn2 + ... + βpxnp + un

The parameters β0, β1, ..., βp can be estimated using the least squares procedure, which minimizes the sum of squares of errors

SSE = Σi (yi - β0 - β1xi1 - ... - βpxip)²

Minimizing this sum of squares leads to a set of simultaneous linear equations (the normal equations), from which the values of the estimates b0, b1, ..., bp can be computed.

Geometrical Representation
The problem of multiple regression can be represented geometrically as follows. The n observations (xi1, xi2, ..., xip, yi), i = 1, 2, ..., n, can be visualized as points in a (p + 1)-dimensional space. The regression problem is to determine the hyper-plane in this space which gives the best fit. Using the least squares criterion, we locate the hyper-plane that minimizes the sum of squared errors, i.e., the distances between the points (observations) and the corresponding points on the plane (the estimates ŷ):

ŷ = a + b1x1 + b2x2 + ... + bpxp

Standard error of the estimate

Se = sqrt[ Σ(yi - ŷi)² / (n - p - 1) ]

where
yi = the sample value of the dependent variable
ŷi = the corresponding value estimated from the regression equation
n = number of observations
p = number of predictors or independent variables

The denominator of the equation indicates that in multiple regression with p independent variables the standard error has n - p - 1 degrees of freedom. This happens because the degrees of freedom are reduced from n by the p + 1 numerical constants a, b1, b2, ..., bp that have been estimated from the sample.

Fit of the regression model
The fit of the multiple regression model can be assessed by the coefficient of multiple determination, a fraction that represents the proportion of the total variation of y that is explained by the regression plane.

Sum of squares due to error: SSE = Σ(yi - ŷi)²
Sum of squares due to regression: SSR = Σ(ŷi - ȳ)²
Total sum of squares: SST = Σ(yi - ȳ)²

Obviously, SST = SSR + SSE. The ratio SSR/SST represents the proportion of the total variation in y explained by the regression model. This ratio, denoted by R², is called the coefficient of multiple determination. R² is sensitive to the magnitudes of n and p in small samples. If p is large relative to n, the model tends to fit the data very well; in the extreme case, if n = p + 1, the model fits the data exactly. A better goodness-of-fit measure is the adjusted R², which is computed as follows:

Adjusted R² = 1 - [(n - 1)/(n - p - 1)] (1 - R²)
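The following short Python sketch (illustrative only; the data are simulated, not from any survey) fits a regression plane by least squares and evaluates SSE, SSR, SST, R², the adjusted R² and the standard error of the estimate exactly as defined above:

    import numpy as np

    # Hypothetical sample: n = 30 observations, p = 2 predictors
    rng = np.random.default_rng(2)
    n, p = 30, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

    Xd = np.column_stack([np.ones(n), X])              # design matrix with intercept
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]          # least-squares estimates
    y_hat = Xd @ b

    sse = ((y - y_hat) ** 2).sum()                     # sum of squares due to error
    ssr = ((y_hat - y.mean()) ** 2).sum()              # sum of squares due to regression
    sst = ((y - y.mean()) ** 2).sum()                  # SST = SSR + SSE

    r2 = ssr / sst
    adj_r2 = 1 - (n - 1) / (n - p - 1) * (1 - r2)
    se_est = np.sqrt(sse / (n - p - 1))                # standard error of the estimate
    print(r2, adj_r2, se_est)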

Statistical inferences for the model
The overall goodness of fit of the regression model (i.e., whether the regression model is at all helpful in predicting the values of y) can be evaluated using an F-test in the format of an analysis of variance. Under the null hypothesis H0: β1 = β2 = ... = βp = 0, the statistic

F = [SSR/p] / [SSE/(n - p - 1)] = MSR/MSE

has an F-distribution with p and n - p - 1 degrees of freedom.

ANOVA Table for Multiple Regression
Source of Variation | Sum of Squares | Degrees of freedom | Mean Squares | F ratio
Regression | SSR | p | MSR | MSR/MSE
Error | SSE | n - p - 1 | MSE |
Total | SST | n - 1 | |

Whether a particular variable contributes significantly to the regression equation can be tested as follows. For any specific variable xi, we can test the null hypothesis H0: βi = 0 by computing the statistic

t = bi / SE(bi)

and performing a one- or two-tailed t-test with n - p - 1 degrees of freedom.

Standardized regression coefficients
The magnitude of a regression coefficient depends upon the scales of measurement used for the dependent variable y and the explanatory variables included in the regression equation. Unstandardized regression coefficients cannot be compared directly because of differing units of measurement and different variances of the x variables. It is therefore necessary to standardize the variables for meaningful comparisons. The estimated model

ŷi = b0 + b1xi1 + b2xi2 + ... + bpxip

can be written as

(ŷi - ȳ)/sy = (b1s1/sy)((xi1 - x̄1)/s1) + (b2s2/sy)((xi2 - x̄2)/s2) + ... + (bpsp/sy)((xip - x̄p)/sp)

The expressions in the parentheses are standardized variables; the bj are unstandardized regression coefficients, s1, s2, ..., sp are the standard deviations of the variables x1, x2, ..., xp, and sy is the standard deviation of the variable y. The coefficients (bjsj)/sy, j = 1, 2, ..., p, are called standardized regression coefficients. The standardized regression coefficient measures the impact of a unit change in the standardized value of xj on the standardized value of y. The larger the magnitude of the standardized bj, the more xj contributes to the prediction of y. However, the regression equation itself should be reported in terms of the unstandardized regression coefficients, so that predictions of y can be made directly from the x variables.

Multiple Correlation
The multiple correlation coefficient, R, is a measure of the strength of the linear relationship between y and the set of variables x1, x2, ..., xp. It is the highest possible simple correlation between y and any linear combination of x1, x2, ..., xp. This property explains why the computed value of R is never negative. In this sense, the least squares regression plane maximizes the correlation between the x variables

and the dependent variable y. Hence R represents a measure of how well the regression equation fits the data. When the value of the multiple correlation R is close to zero, the regression equation barely predicts y better than sheer chance; a value of R close to 1 indicates a very good fit.

Partial Correlation
A useful approach to studying the relationship between two variables x and y in the presence of a third variable z is to determine the correlation between x and y after controlling for the effect of z. This correlation is called partial correlation. Partial correlation is the correlation of two variables while controlling for a third or more other variables. For example, r12.34 is the correlation of variables 1 and 2, controlling for variables 3 and 4. If the partial correlation r12.34 is equal to the uncontrolled correlation r12, the control variables have no effect on the relationship between variables 1 and 2. If the partial correlation is nearly equal to zero, the correlation between the original variables is spurious. The partial correlation coefficient is a measure of the linear association between two variables after adjusting for the linear effect of a group of other variables. If the number of other variables is equal to 1, the partial correlation coefficient is called a first-order coefficient; if the number of other variables is equal to 2, it is called a second-order coefficient, and so on.

First-order partial correlation
The first-order partial correlation between xi and xj, holding constant xl, is computed by the following formula:

rij.l = (rij - ril rjl) / sqrt[(1 - ril²)(1 - rjl²)]

where rij, ril and rjl are zero-order (Pearson's r) correlation coefficients.

Second-order partial correlation
The correlation between xi and xj, holding constant xl and xm, is computed by the following formula:

rij.lm = (rij.l - rim.l rjm.l) / sqrt[(1 - rim.l²)(1 - rjm.l²)]

where rij.l, rim.l and rjm.l are first-order partial correlation coefficients.

The statistical significance of partial correlation coefficients can be tested using a test statistic similar to the one for a simple correlation coefficient:

t = r sqrt(n - q - 2) / sqrt(1 - r²)

where q is the number of variables held constant. The value of t is compared with the tabulated t for n - q - 2 degrees of freedom.
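The first-order formula above is easy to verify numerically. The following Python sketch (hypothetical simulated data; illustrative only) constructs two variables that are correlated mainly because both depend on a third variable z, and shows that the partial correlation controlling for z is close to zero:

    import numpy as np

    def partial_corr_first_order(r_ij, r_il, r_jl):
        # First-order partial correlation r_ij.l from zero-order (Pearson) correlations
        return (r_ij - r_il * r_jl) / np.sqrt((1 - r_il ** 2) * (1 - r_jl ** 2))

    # Hypothetical data: z drives both x and y, so their raw correlation is largely spurious
    rng = np.random.default_rng(3)
    z = rng.normal(size=500)
    x = 0.8 * z + rng.normal(scale=0.6, size=500)
    y = 0.7 * z + rng.normal(scale=0.6, size=500)

    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]

    print(r_xy)                                        # sizeable zero-order correlation
    print(partial_corr_first_order(r_xy, r_xz, r_yz))  # near zero after controlling for z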

Multicollinearity
In practice, the problem of multicollinearity occurs when some of the x variables are highly correlated. Multicollinearity can have a significant impact on the quality and stability of the fitted regression model. A common approach to the multicollinearity problem is to omit explanatory variables: for example, if x1 and x2 are highly correlated (say the correlation is greater than 0.9), the simplest approach is to use only one of them, since one variable conveys essentially all the information in the other.

The simplest method for detecting multicollinearity is the correlation matrix, which can be used to detect large correlations between pairs of explanatory variables. When more subtle patterns of correlation exist, the determinant of the correlation matrix computed by IDAMS can be used to detect multicollinearity. The determinant of the correlation matrix represents, as a single number, the generalized variance in the set of predictor variables, and varies from 0 to 1. A value of the determinant near zero indicates that some or all explanatory variables are highly correlated. A determinant equal to zero indicates a singular matrix, i.e., that at least one of the predictors is a linear function of one or more other predictors. Another approach is to compute the tolerance associated with a predictor. The tolerance of xi is defined as 1 minus the squared multiple correlation between xi and the remaining x variables. When the tolerance is small, say less than 0.01, it is expedient to discard the variable with the smallest tolerance. The inverse of the tolerance is called the variance inflation factor (VIF).

Stepwise Regression
Stepwise regression is a sequential process for fitting the least squares model, where at each step a single explanatory variable is either added to or removed from the model in the next fit. The most commonly used criterion for the addition or deletion of variables in stepwise regression is based on the partial F-statistic

F = [(SSE_Reduced - SSE_Full)/q] / [SSE_Full/(n - p - 1)]

where the suffix Full refers to the larger model with p explanatory variables, the suffix Reduced refers to the reduced model with (p - q) explanatory variables, and q is the number of variables under test.

Forward selection
The forward selection procedure begins with no explanatory variable in the model and sequentially adds variables according to the partial F-statistic criterion. At each step, the variable whose partial F-statistic yields the smallest p-value is added. Variables are entered as long as the partial F-statistic p-value remains below a specified maximum value (PIN). The procedure stops when the addition of any of the remaining variables would yield a partial p-value greater than PIN. This procedure has two limitations. Some of the variables never get into the model, and hence their importance is never determined. In addition, a variable once included in the model remains there throughout the process, even if it loses its stated significance after the inclusion of other variables.

Backward elimination
The backward elimination procedure begins with all the variables in the model and proceeds by eliminating the least useful variable, one at a time. A variable whose partial F p-value is greater than a prescribed value, POUT, is the least useful variable and is therefore removed from the regression model. The process continues until no variable can be removed according to the elimination criterion.

Stepwise procedure
The stepwise procedure is a modified forward selection method which later in the process permits the elimination of variables that have become statistically non-significant. At each step of the process, the p-values are computed for all variables in the model. If the largest of these p-values is greater than POUT, that variable is eliminated. After the included variables have been examined for exclusion, the

excluded variables are re-examined for inclusion. At each step of the process there can be at most one exclusion, followed by one inclusion. It is necessary that PIN < POUT to avoid infinite cycling of the process.

Regression with Qualitative Explanatory Variables
Sometimes explanatory variables considered for inclusion in a regression model are not interval-scale; they may be nominal or ordinal variables. Such variables can be used in the regression model by creating dummy (or indicator) variables.

Dichotomous Variables
Dichotomous variables do not cause the regression to lose any of its properties: since they have only two categories, they can enter the regression equation as if they were interval-scale variables with just two values. Consider, for example, the relationship between income and gender,

y = a + bx

where y is the income of an individual and x is a dichotomous variable coded 0 if female and 1 otherwise. The estimated value of y is

ŷ = a       if x = 0
ŷ = a + b   if x = 1

Since our best estimate for a given sample is the sample mean, a is estimated as the average income of females and a + b as the average income of males. The regression coefficient b is therefore the difference between the male and female average incomes. In effect, females are treated as the reference group, and males' income is measured by how much it differs from females' income.

Polytomous Variables
Consider, for example, the relationship between the time an academic scientist spends on teaching and his or her rank,

y = a + bx

where y is the percentage of work time spent on teaching and x is a polytomous variable, rank, with three modalities: 1 = Professor, 2 = Reader, 3 = Lecturer. We create two dummy variables:

X1 = 1 if rank = Professor, 0 otherwise
X2 = 1 if rank = Reader, 0 otherwise

Note that we have created two dummy variables to represent a trichotomous variable. If we created a third dummy variable X3 (scored 1 if rank = Lecturer, and 0 otherwise), the parameters of the regression equation could not be estimated uniquely. This is because, if a respondent's scores on X1 and X2 are known, it is always possible to predict the score on X3: for example, a respondent who scores 0 on X1 (not Professor) and 0 on X2 (not Reader) is certainly a Lecturer (i.e., scores 1 on X3). This represents a situation of perfect multicollinearity. Hence the general rule for creating dummy variables is: number of dummy variables = number of modalities minus 1. The statistical significance of the regression coefficients and of Multiple R² is determined in the same way as for interval-scale explanatory variables.
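The dummy coding just described can be sketched as follows in Python (the ranks and teaching-time figures are hypothetical, invented purely for illustration). With only an intercept and the two dummies, the least squares estimates reduce to the group means and their differences, as stated above:

    import numpy as np

    # Hypothetical data: percentage of time spent on teaching, by academic rank
    rank = np.array(["Professor", "Reader", "Lecturer", "Professor", "Lecturer",
                     "Reader", "Lecturer", "Professor", "Reader", "Lecturer"])
    y = np.array([20.0, 35.0, 55.0, 25.0, 60.0, 30.0, 50.0, 15.0, 40.0, 58.0])

    # Two dummy variables for a three-category predictor; Lecturer is the reference group
    x1 = (rank == "Professor").astype(float)
    x2 = (rank == "Reader").astype(float)

    X = np.column_stack([np.ones(len(y)), x1, x2])
    a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

    # a  = mean teaching time of the reference group (Lecturers)
    # b1 = Professor mean minus Lecturer mean; b2 = Reader mean minus Lecturer mean
    print(a, b1, b2)
    print(y[rank == "Lecturer"].mean(),
          y[rank == "Professor"].mean() - y[rank == "Lecturer"].mean())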

5.3 Multiple Classification Analysis
Multiple Classification Analysis (MCA) is a technique for examining the interrelationship between several predictor variables and one dependent variable in the context of an additive model. Unlike simpler forms of other multivariate methods, MCA can handle predictors with no better than nominal measurement and interrelationships of any form among the predictor variables or between a predictor and the dependent variable. It is, however, essential that the dependent variable be either an interval-scale variable without extreme skewness or a dichotomous variable with frequencies that are not extremely unequal.

In statistical terms, the MCA model specifies that a coefficient be assigned to each category of each predictor, and that each individual's score on the dependent variable be treated as the sum of the coefficients assigned to the categories characterizing that individual, plus the average for all cases, plus an error term:

Yij...n = Ȳ + ai + bj + ... + eij...n

where
Yij...n = the score on the dependent variable of individual n, who falls in category i of predictor A, category j of predictor B, etc.
Ȳ = grand mean of the dependent variable
ai = the effect of membership in the ith category of predictor A
bj = the effect of membership in the jth category of predictor B
eij...n = error term for this individual.

The coefficients are estimated in such a way that they provide the best possible fit to the observed data, i.e., they minimize the sum of squared errors. The coefficients can be estimated by solving a set of equations known as the normal equations (or least squares equations). The normal equations used by the MCA program (shown here for three predictors A, B and C, the W's being the weighted numbers of cases in the corresponding categories and cross-classifications) are:

ai = (Ai - Ȳ) - (1/Wi) Σj Wij bj - (1/Wi) Σk Wik ck
bj = (Bj - Ȳ) - (1/Wj) Σi Wij ai - (1/Wj) Σk Wjk ck
ck = (Ck - Ȳ) - (1/Wk) Σi Wik ai - (1/Wk) Σj Wjk bj

where
Ai = mean value of Y for cases falling in the ith category of predictor A
Bj = mean value of Y for cases falling in the jth category of predictor B
Ck = mean value of Y for cases falling in the kth category of predictor C.

The MCA program uses an iterative procedure to solve the normal equations. An important feature of the MCA program is its ability to determine the coefficients or adjusted deviations associated with the categories of each predictor. The adjusted deviations represent the program's attempt to fit an additive model by solving a set of linear equations. The program actually arrives at the coefficients by a series of successive approximations, altering one coefficient at a time on the basis of the latest estimates of the other coefficients.

Formulae for statistics printed by the program

Notation
Yk = individual k's score on the dependent variable
wk = individual k's weight
N = number of individuals

C = total number of categories across all predictors
ci = total number of categories in predictor i
P = number of predictors
aij = adjusted deviation of the jth category of predictor i on the final iteration

The statistics printed by the program are:
1. Sum of Y
2. Sum of Y²
3. Grand mean of Y
4. Sum of Y for category j of predictor i
5. Sum of Y² for category j of predictor i
6. Standard deviation of Y
7. Mean of Y for category j of predictor i
8. Sum of squares based on unadjusted deviations for predictor i
9. Sum of squares based on adjusted deviations for predictor i
10. Explained sum of squares
11. Total sum of squares
12. Residual sum of squares
13. Eta for predictor i
14. Beta for predictor i
15. Multiple correlation coefficient (squared)
16. Adjustment for degrees of freedom
17. Multiple correlation coefficient (squared and adjusted for degrees of freedom)
18. Eta (squared and adjusted for degrees of freedom)

A variety of F tests can be computed from the statistics printed by the program. The first test answers the question: do all predictors together explain a significant proportion of the variance of the dependent variable? It is based on the explained and residual sums of squares, each divided by its degrees of freedom:

F = [explained sum of squares / (C - P)] / [residual sum of squares / (N - C + P - 1)]

The second test answers the question: does this particular predictor, all by itself, explain a significant proportion of the variance of the dependent variable? This is the classical question answered by one-way analysis of variance. The F test for predictor i is computed as

Fi = [unadjusted sum of squares for predictor i / (ci - 1)] / [(total sum of squares - unadjusted sum of squares for predictor i) / (N - ci)]

Measures of importance of predictors
The following criteria can be used for assessing the importance of a predictor, i.e., the degree of relationship between an independent variable and the dependent variable, or its predictive power.

Eta statistic
This statistic can be used for assessing the bivariate relationship between a predictor and the dependent variable. Eta squared (also called the correlation ratio) can be interpreted as the proportion of variance explained by the predictor.

Beta statistic

This statistic is an approximate measure of the relationship between a predictor and the dependent variable while holding all other predictors constant, i.e., assuming that in each category of the given predictor all other predictors are distributed as they are in the population at large. The rank order of the betas indicates the relative importance of the various predictors in their explanation of the variance in the dependent variable, if all other predictors were held constant.

To assess the marginal or unique explanatory power a predictor has over and above what can be explained by the other predictors, the following procedures are suggested.

1. One can remove the effects of the other predictors from the predictor in question, and then correlate the residuals of that predictor (actual minus predicted values) with the dependent variable. This part correlation asks whether there is any variability in X, not predictable from the other predictors, that helps to explain Y. In other words, one assesses the importance of a predictor in terms of the variance in the dependent variable marginally explainable by the predictor, relative to the total variance in the dependent variable. The squared part correlation can be obtained by carrying out two MCA analyses, with and without the predictor in question, since the squared part correlation is equal to the increase in the multiple R squared:

Squared part correlation = (R²adj with everything in) - (R²adj omitting one set)

2. One can remove the effects of the other predictors from both the dependent variable and the predictor in question, and correlate the two sets of residuals. This is the partial correlation coefficient. The squared partial correlation can be estimated from two multiple R-squares:

Squared partial correlation = [(R²adj with everything in) - (R²adj omitting one set)] / [1 - (R²adj omitting one set)]

Advantages of Multiple Classification Analysis
MCA can overcome some of the problems of Analysis of Variance, Multiple Regression and Discriminant Analysis. In the case of Analysis of Variance, the problem of correlated predictors must be considered, whereas in the case of Multiple Regression or Discriminant Analysis one is faced with predictors which are not interval-scale variables but categories, often with scales as weak as the nominal level. An important feature of MCA is its ability to show the effect of each predictor on the dependent variable, both before and after taking into account the effects of all other predictors. Multiple Regression and Discriminant Analysis can also do this, but under certain restrictive conditions: they usually require that all predictor variables be measured on interval scales and that the relationships be linear or linearized. MCA is not constrained by any of these conditions. The predictors are always treated as sets of classes or categories; hence it does not matter whether a particular set represents a nominal scale (categories), an ordinal scale (rankings) or an interval scale (classes of a numerical variable). Another important feature is the format in which the results are presented. All coefficients are expressed as deviations from the overall mean, not from the unknown mean of an excluded class in each set. The constant term in the predicting equation is the overall mean, not some composite sum of the means of excluded subclasses. Moreover, adjusted and unadjusted subgroup means are available in the same table, which can be used to detect the amount of intercorrelation between the predictors.

NON-PARAMETRIC STATISTICS

Non-parametric statistics allow the testing of hypotheses even when certain classical assumptions, such as interval-scale measurement or a normal distribution, are not met. In research practice, these classical assumptions are often strained. Basically, there is at least one non-parametric equivalent for each general type of parametric test. Non-parametric tests generally fall into the following groups:
- Tests of differences between groups
- Tests of differences between variables
- Tests of relationships between variables

Parametric versus non-parametric
A potential source of confusion in working out which statistics to use in analysing data is whether your data allow for parametric or non-parametric statistics. The importance of this issue cannot be overstated! If you get it wrong, you risk using an incorrect statistical procedure, or you may use a less powerful procedure. Non-parametric statistical procedures are less powerful because they use less information in their calculation. For example, a parametric correlation uses information about the mean and deviations from the mean, while a non-parametric correlation uses only the ordinal position of pairs of scores. The basic distinction for parametric versus non-parametric is:

- If your measurement scale is nominal or ordinal, then you use non-parametric statistics.
- If you are using interval or ratio scales, you use parametric statistics.

There are other considerations which have to be taken into account. You have to look at the distribution of your data: if your data are supposed to support parametric statistics, you should check that the distributions are approximately normal. The best way to do this is to check the skewness and kurtosis measures from the frequency output of SPSS. For a relatively normal distribution, the skewness and kurtosis should be no more than about +/-1. If a distribution deviates markedly from normality, you take the risk that the statistic will be inaccurate; the safest thing to do is to use the equivalent non-parametric statistic.

Non-parametric statistics

Descriptive
Name | For what | Notes
Mode | Central tendency | Greatest frequency
Median | Central tendency | 50% split of distribution
Range | Distribution | Lowest and highest value

Association
Name | For what | Notes
Spearman's Rho | Correlation | Based on rank order of data
Kendall's Tau | Correlation |
Chi square | | Tabled data
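The skewness/kurtosis screen described above can be applied with a few lines of Python; the following is a rough sketch only (the sample is simulated, and the +/-1 cut-off is the informal rule of thumb quoted above, not a formal test):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    sample = rng.normal(loc=50, scale=10, size=300)   # hypothetical, roughly normal scores

    skewness = stats.skew(sample)
    kurt = stats.kurtosis(sample)        # excess kurtosis: 0 for a normal distribution

    # Values well outside about +/-1 signal a marked departure from normality,
    # in which case an equivalent non-parametric procedure is safer.
    if abs(skewness) > 1 or abs(kurt) > 1:
        print("Markedly non-normal: prefer a non-parametric procedure", skewness, kurt)
    else:
        print("Approximately normal: parametric procedures are reasonable", skewness, kurt)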

Parametric versus non-parametric tests

 | Parametric | Non-parametric
Assumed distribution | Normal | Any
Assumed variance | Homogeneous | Any
Typical data | Ratio or Interval | Ordinal or Nominal
Data set relationships | Independent | Any
Usual central measure | Mean | Median
Benefits | Can draw more conclusions | Simplicity; less affected by outliers

Choosing a test
 | Choosing a parametric test | Choosing a non-parametric test
Correlation test | Pearson | Spearman
Independent measures, 2 groups | Independent-measures t-test | Mann-Whitney test
Independent measures, >2 groups | One-way, independent-measures ANOVA | Kruskal-Wallis test
Repeated measures, 2 conditions | Matched-pair t-test | Wilcoxon test
Repeated measures, >2 conditions | One-way, repeated-measures ANOVA | Friedman's test
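For the ">2 groups" rows of the table above, the parametric and non-parametric tests are both available in SciPy. The following sketch (simulated, hypothetical scores; illustrative only) runs one-way ANOVA against its non-parametric counterpart Kruskal-Wallis, and Friedman's test for repeated measures:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # Independent measures, >2 groups: one-way ANOVA vs. Kruskal-Wallis
    g1, g2, g3 = rng.normal(10, 2, 15), rng.normal(12, 2, 15), rng.normal(11, 2, 15)
    print(stats.f_oneway(g1, g2, g3))        # parametric
    print(stats.kruskal(g1, g2, g3))         # non-parametric equivalent

    # Repeated measures, >2 conditions: Friedman's test (non-parametric)
    cond1, cond2, cond3 = rng.normal(size=(3, 12)) + np.array([[0.0], [0.5], [1.0]])
    print(stats.friedmanchisquare(cond1, cond2, cond3))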

Table 37.1. Selecting a statistical test (columns give the type of data)

Goal | Measurement (from Gaussian Population) | Rank, Score, or Measurement (from Non-Gaussian Population) | Binomial (Two Possible Outcomes) | Survival Time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or Binomial test** |
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochrane Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** |
Predict value from another measured variable | Simple linear regression or Nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or Multiple nonlinear regression** | | Multiple logistic regression* | Cox proportional hazard regression*

Parametric Assumptions
- The observations must be independent.
- The observations must be drawn from normally distributed populations.
- These populations must have the same variances.
- The means of these normal and homoscedastic populations must be linear combinations of effects due to columns and/or rows.*

Nonparametric Assumptions
- Observations are independent.
- The variable under study has underlying continuity.

Measurement
What are the 4 levels of measurement?
1. Nominal or Classificatory Scale: gender, ethnic background
2. Ordinal or Ranking Scale: hardness of rocks, beauty, military ranks
3. Interval Scale: Celsius or Fahrenheit temperature
4. Ratio Scale: Kelvin temperature, speed, height, mass or weight

Nonparametric Methods
There is at least one nonparametric test equivalent to each parametric test. These tests fall into several categories:
1. Tests of differences between groups (independent samples)
2. Tests of differences between variables (dependent samples)
3. Tests of relationships between variables

For tests of differences between groups (independent samples), two samples are compared on the mean value of some variable of interest; the tests cover differences between independent groups and differences between dependent groups. For relationships between variables, where the two variables of interest are categorical, the parametric choice is the correlation coefficient, while the non-parametric alternatives include Spearman R, the Kendall Tau coefficient, Gamma, Chi square, the Phi coefficient, the Fisher exact test, and Kendall's coefficient of concordance.

Summary Table of Statistical Tests

Tests of Differences between Groups

1. Mann-Whitney U-Test: A non-parametric test equivalent to the t-test. It tests whether two independent samples come from the same population and requires an ordinal level of measurement. U is the number of times a value in the first group precedes a value in the second group when the values are ordered in ascending order. This test is the non-parametric substitute for the equal-variance t-test when the assumption of normality is not valid; when in doubt about normality, it is safer to use this test. Two fundamental assumptions of the test are that the distributions are at least ordinal in nature, and that the distributions are identical except for location (which implies that ties should not occur). The test is based on ranks and has good properties (asymptotic relative efficiency) for symmetric distributions. The Mann-Whitney test statistic, U, is defined as the total number of times a Y precedes an X in the configuration of the combined samples. It is directly related to the sum of ranks; this is why the test is sometimes called the Mann-Whitney U test and at other times the Wilcoxon Rank Sum test. The procedure calculates both UX and UY; the formula for UX is

UX = RX - nX(nX + 1)/2

where RX is the sum of the ranks of the X sample in the combined ranking and nX is the size of the X sample. The formula for UY is obtained by replacing X by Y in the above formula. (A correction for ties in the standard deviation of the test statistic makes little difference unless there are a lot of ties.)
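The rank-sum computation of U can be checked against SciPy's implementation. The following Python sketch uses two small hypothetical samples (invented for illustration; ties are handled by average ranks):

    import numpy as np
    from scipy import stats

    # Hypothetical ordinal-level scores for two independent samples
    x = np.array([3, 4, 2, 6, 2, 5])
    y = np.array([9, 7, 5, 10, 6, 8])

    # U from the rank-sum definition: U_X = R_X - n_X(n_X + 1)/2
    ranks = stats.rankdata(np.concatenate([x, y]))   # average ranks for ties
    r_x = ranks[:len(x)].sum()
    u_x = r_x - len(x) * (len(x) + 1) / 2            # times a Y precedes an X
    u_y = len(x) * len(y) - u_x
    print(u_x, u_y)

    # Same statistic from SciPy (two-sided test)
    print(stats.mannwhitneyu(x, y, alternative="two-sided"))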

Relationships between Variables

The non-parametric equivalents of the correlation coefficient are Spearman's correlation coefficient Rho, Kendall's Tau and Gamma.

1. Spearman's Correlation Coefficient is a commonly used non-parametric measure of correlation between two ordinal variables. It can be thought of in the same way as the regular product-moment correlation coefficient, in terms of the proportion of variability accounted for.

Pearson correlation is unduly influenced by outliers, unequal variances, non-normality, and nonlinearity. An important competitor of the Pearson correlation coefficient is Spearman's rank correlation coefficient. This latter correlation is calculated by applying the Pearson correlation formula to the ranks of the data rather than to the actual data values themselves. In so doing, many of the distortions that plague the Pearson correlation are reduced considerably. Pearson correlation measures the strength of the linear relationship between X and Y. In the case of nonlinear but monotonic relationships, a useful measure is Spearman's rank correlation coefficient, Rho, which is a Pearson-type correlation coefficient computed on the ranks of the X and Y values. It is computed by the following formula:

rs = 1 - (6 Σ di²) / [n (n² - 1)]
where di is the difference between the ranks of Xi and Yi, and n is the number of pairs. rs = +1 if there is a perfect agreement between the two sets of ranks; rs = -1 if there is a complete disagreement between the two sets of ranks.
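The rank-difference formula can be sketched as follows on a small set of invented paired scores (no ties), with scipy.stats.spearmanr as a cross-check.

```python
# Minimal sketch of Spearman's rho via the rank-difference formula (hypothetical data, no ties).
from scipy.stats import rankdata, spearmanr

x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]   # invented scores
y = [ 2, 20, 28,  27,  50,  29,   7,  17,   6,  12]

rx, ry = rankdata(x), rankdata(y)
d = rx - ry
n = len(x)
rs = 1 - 6 * sum(d**2) / (n * (n**2 - 1))   # rs = 1 - (6 Σ di²) / [n(n² - 1)]

rho_scipy, _ = spearmanr(x, y)
print("rs (formula) =", rs)
print("rs (scipy)   =", rho_scipy)
```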
2. Kendall's Tau is a non-parametric measure of association for ordinal or ranked variables. It is equivalent to Spearman's Rho with regard to the underlying assumptions. However, Spearman's Rho and Kendall's Tau are not identical in magnitude, since their underlying logic and computational formulae are quite different. Two different variants of Tau are computed: Tau b and Tau c. These measures differ only as to how tied ranks are handled. In most cases, these values are very similar, and when discrepancies occur, it is probably safer to interpret the lower value.

Kendall's Tau

This is a measure of correlation between two ordinal-level variables. It is most appropriate for square tables. For any sample of n observations, there are n(n - 1)/2 possible comparisons of points (Xi, Yi) and (Xj, Yj). Let C = the number of pairs that are concordant, and D = the number of pairs that are discordant.

Kendall's Tau = (C - D) / [n(n - 1)/2]

Obviously, Tau has the range -1 ≤ Tau ≤ +1. If Xi = Xj, or Yi = Yj, or both, the comparison is called a tie. Ties are not counted as concordant or discordant. If there are a large number of ties, then the denominator n(n - 1)/2 has to be replaced by

SQRT{ [n(n - 1)/2 - nX] [n(n - 1)/2 - nY] }

where nX is the number of tied pairs involving X, and nY is the number of tied pairs involving Y. In large samples, the statistic

3 Tau SQRT[n(n - 1)] / SQRT[2(2n + 5)]

has an approximately standard normal distribution, and therefore can be used as a test statistic for testing the null hypothesis of zero correlation. Kendall's Tau is equivalent to Spearman's Rho with regard to the underlying assumptions, but Spearman's Rho and Kendall's Tau are not identical in magnitude, since their underlying logic and computational formulae are quite different. The relationship between the two measures is given by

-1 ≤ (3 × Kendall's Tau) - (2 × Spearman's Rho) ≤ +1

In most cases, these values are very similar, and when discrepancies occur, it is probably safer to interpret the lower value. More importantly,

Kendall's Tau and Spearman's Rho imply different interpretations. Spearman's Rho is interpreted like the regular Pearson correlation coefficient, in terms of the proportion of variability accounted for, whereas Kendall's Tau represents a probability, i.e., the difference between the probability that the observed data are in the same order and the probability that the observed data are not in the same order. There are two different variants of Tau, viz. Tau-b and Tau-c. These measures differ only in how tied ranks are handled.
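As a sketch of the concordant/discordant counting described above, the following computes Tau by brute force on invented tie-free data and compares it with scipy.stats.kendalltau (which reduces to the same value when there are no ties).

```python
# Minimal sketch of Kendall's Tau by counting concordant and discordant pairs (hypothetical data, no ties).
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]        # invented ranks
y = [3, 4, 1, 2, 5, 7, 8, 6]

C = D = 0
for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
    s = (xi - xj) * (yi - yj)
    if s > 0:
        C += 1                       # concordant pair
    elif s < 0:
        D += 1                       # discordant pair

n = len(x)
tau = (C - D) / (n * (n - 1) / 2)    # Tau = (C - D) / [n(n - 1)/2]

tau_scipy, _ = kendalltau(x, y)
print("tau (counting) =", tau)
print("tau (scipy)    =", tau_scipy)
```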

Kendall's Tau-b
Kendall's Tau-b is a measure of association often used with, but not limited to, 2-by-2 tables. It is computed as the excess of concordant over discordant pairs (C - D), divided by a term representing the geometric mean of the number of pairs not tied on X and the number of pairs not tied on Y:
Tau-b = (C - D) / SQRT[(C + D + X0)(C + D + Y0)]
where X0 is the number of pairs tied on X (but not on Y) and Y0 is the number of pairs tied on Y (but not on X). There is no well-defined intuitive meaning for Tau-b, which is the surplus of concordant over discordant pairs as a percentage of concordant, discordant, and approximately one-half of tied pairs. The rationale for this is that if the direction of causation is unknown, then the surplus of concordant over discordant pairs should be compared with the total of all relevant pairs, where the relevant pairs are the concordant pairs, the discordant pairs, plus either the X-ties or the Y-ties but not both; since the direction is not known, the geometric mean is used as an estimate of the relevant tied pairs. Tau-b requires binary or ordinal data. It reaches 1.0 (or -1.0 for negative relationships) only for square tables when all entries are on one diagonal. Tau-b equals 0 under statistical independence for both square and nonsquare tables. Tau-c is used for non-square tables.

Tau-c
Kendall's Tau-c, also called Kendall-Stuart Tau-c, is a variant of Tau-b for larger tables. It equals the excess of concordant over discordant pairs, multiplied by a term representing an adjustment for the size of the table:
Tau-c = (C - D) × [2m / (n²(m - 1))]
where
m = the number of rows or columns, whichever is smaller;
n = the sample size.
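A sketch of the tie-handling difference between the two variants, on an invented 2 × 3 cross-tabulation expanded into paired ordinal observations; scipy.stats.kendalltau exposes both versions through its variant argument.

```python
# Minimal sketch comparing Tau-b and Tau-c on an invented 2 x 3 cross-tabulation.
import numpy as np
from scipy.stats import kendalltau

table = np.array([[10,  5,  2],     # rows: ordinal variable X (2 levels)
                  [ 3,  8, 12]])    # columns: ordinal variable Y (3 levels)

# Expand the table into paired ordinal observations (row score, column score).
x, y = [], []
for i, j in np.ndindex(table.shape):
    x += [i] * int(table[i, j])
    y += [j] * int(table[i, j])

tau_b, _ = kendalltau(x, y, variant="b")
tau_c, _ = kendalltau(x, y, variant="c")
print("Tau-b =", round(tau_b, 3), " Tau-c =", round(tau_c, 3))
```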


3. Gamma (Goodman-Kruskal Gamma). Another non-parametric measure of correlation is Gamma. In terms of the underlying assumptions, Gamma is equivalent to Spearman's Rho or Kendall's Tau. In terms of interpretation and computation, it is more similar to Kendall's Tau than to Spearman's Rho. The Gamma statistic is, however, preferable to Spearman's Rho and Kendall's Tau when the data contain many tied observations.

Another non-parametric measure of correlation is the Goodman-Kruskal Gamma (γ), which is based on the difference between concordant pairs (C) and discordant pairs (D). Gamma is computed as follows:
γ = (C - D) / (C + D)
Thus, Gamma is the surplus of concordant pairs over discordant pairs, as a percentage of all pairs, ignoring ties. Gamma defines perfect association as weak monotonicity. Under statistical independence, Gamma will be 0, but it can be 0 at other times as well (whenever concordant minus discordant pairs is 0). Gamma is a symmetric measure and yields the same coefficient value regardless of which variable is treated as the independent (column) variable. Its value ranges between -1 and +1. In terms of the underlying assumptions, Gamma is equivalent to Spearman's Rho or Kendall's Tau; but in terms of its interpretation and computation, it is more similar to Kendall's Tau than to Spearman's Rho. The Gamma statistic is, however, preferable to Spearman's Rho and Kendall's Tau when the data contain many tied observations.
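Since Gamma ignores ties, it can be computed directly from concordant/discordant counts; a minimal sketch on an invented ordinal cross-tabulation is shown below.

```python
# Minimal sketch of Goodman-Kruskal Gamma from a small invented cross-tabulation.
import numpy as np

table = np.array([[20, 10,  5],
                  [ 8, 15, 12],
                  [ 3,  7, 25]])    # rows and columns are both ordinal

C = D = 0
rows, cols = table.shape
for i in range(rows):
    for j in range(cols):
        nij = table[i, j]
        C += nij * table[i + 1:, j + 1:].sum()   # cells below and to the right -> concordant
        D += nij * table[i + 1:, :j].sum()       # cells below and to the left  -> discordant

gamma = (C - D) / (C + D)    # Gamma = (C - D)/(C + D), ties ignored
print("C =", C, " D =", D, " Gamma =", round(gamma, 3))
```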
4. Chi-square test: This goodness of fit test compares the observed and expected frequencies in each category to test whether all the categories contain the same proportion of values.

Another useful way of looking at the relationship between two nominal (or categorical) variables is to cross-classify the data and get a count of the number of cases sharing a given combination of levels (i.e., categories), and then create a contingency table (cross-tabulation) showing the levels and the counts. A contingency table lists the frequency of the joint occurrence of two levels (or possible outcomes), one level for each of the two categorical variables. The levels of one of the categorical variables correspond to the columns of the table, and the levels of the other categorical variable correspond to the rows of the table. The primary interest in constructing contingency tables is usually to determine whether there is any association (in terms of statistical dependence) between the two categorical variables whose counts are displayed in the table. A measure of the global association between the two categorical variables is the Chi-square statistic, which is computed as follows. Consider a contingency table with k rows and h columns. Let nij denote the observed frequency of cell (i, j), and let eij denote the expected frequency of that cell. The deviation between the observed and expected frequencies (nij - eij) characterizes the disagreement between the observation and the hypothesis of independence. The expected frequency for any cell can be calculated by the following formula:
eij = (RT × CT) / N
where eij = expected frequency in a given cell (i, j), RT = row total for the row containing that cell, CT = column total for the column containing that cell, and N = total number of observations. All the deviations can be studied by computing the quantity denoted by χ²:
χ² = Σ Σ (nij - eij)² / eij
where the summation extends over all cells of the table.

This statistic is distributed according to Pearson's Chi-square law with (k - 1)(h - 1) degrees of freedom. Thus, the statistical significance of the relationship between two categorical variables is tested by using the χ² test, which essentially finds out whether the observed frequencies in a distribution differ significantly from the frequencies that might be expected according to a certain hypothesis (say, the hypothesis of independence between the two variables).
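The expected-frequency and χ² computations above can be sketched as follows; the 2 × 3 table is invented, and scipy.stats.chi2_contingency performs the same calculation, including the degrees of freedom.

```python
# Minimal sketch of the chi-square test of independence on an invented 2 x 3 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 14, 6],
                     [12, 20, 18]])

row_totals = observed.sum(axis=1, keepdims=True)   # RT
col_totals = observed.sum(axis=0, keepdims=True)   # CT
N = observed.sum()

expected = row_totals * col_totals / N             # e_ij = (RT x CT) / N
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print("chi2 =", round(chi2, 3), " df =", dof)

# Cross-check with SciPy (Yates correction disabled so it matches the formula above).
chi2_sp, p, dof_sp, exp_sp = chi2_contingency(observed, correction=False)
print("scipy: chi2 =", round(chi2_sp, 3), " p =", round(p, 4), " df =", dof_sp)
```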

Assumptions

The χ² test requires that the expected frequencies are not very small. The reason for this assumption is that the Chi-square test inherently tests the underlying probabilities in each cell; when the expected cell frequencies are small, these probabilities cannot be estimated with sufficient precision. Hence, it is essential that the sample size be large enough to guarantee the similarity between the theoretical and the sampling distribution of the χ² statistic. In the formula for the computation of χ², the expected value of the cell frequency is in the denominator. If this value is too small, the χ² value would be overestimated and would result in the rejection of the null hypothesis. To avoid making incorrect inferences from the χ² test, the general rule is that an expected frequency of less than 5 in a cell is too small to use. When the contingency table contains more than one cell with an expected frequency < 5, one can combine them to get an expected frequency ≥ 5. However, in doing so, the number of categories would be reduced and one would get less information. It should be noted that the χ² test is quite sensitive to the sample size: if the sample size is too small, the χ² value is overestimated; if it is too large, the χ² value is underestimated. To overcome this problem, the following measures of association are suggested in the literature: Phi-square (Φ²), Cramér's V, and the Contingency Coefficient. Phi-square is computed as follows:
Φ² = χ² / N
where N is the total number of observations. For all contingency tables that are 2 × 2, 2 × k, or 2 × h, Phi-square has the very nice property that its value ranges from 0 (no relationship) to 1 (perfect relationship). However, Phi-square loses this property when both dimensions of the table are greater than 2. By a simple manipulation of Phi-square, we get a measure (Cramér's V) which ranges from 0 to 1 for any size of contingency table. Cramér's V is computed as follows:

V = SQRT[ χ² / (N(L - 1)) ]
where L = min(h, k).

Contingency coefficient

The coefficient of contingency is a Chi-square-based measure of the relation between two categorical variables (proposed by Pearson, the originator of the Chi-square test). It is computed by the following formula:
C = SQRT[ χ² / (χ² + N) ]

Its advantage over the ordinary Chi-square is that it is more easily interpreted, since its range is always limited to 0 through 1 (where 0 means complete independence). The disadvantage of this statistic is that its upper limit depends on the size of the table; the contingency coefficient can reach the limit of 1 only if the number of categories is unlimited.
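Continuing with the invented table used in the chi-square sketch above, the three chi-square-based measures of association follow directly from χ² and N.

```python
# Minimal sketch of Phi-square, Cramer's V, and the contingency coefficient (invented table).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 14, 6],
                     [12, 20, 18]])
N = observed.sum()
chi2, _, _, _ = chi2_contingency(observed, correction=False)

phi2 = chi2 / N                                   # Phi-square = chi2 / N
L = min(observed.shape)                           # L = min(h, k)
cramers_v = np.sqrt(chi2 / (N * (L - 1)))         # V = SQRT[chi2 / (N(L - 1))]
contingency_c = np.sqrt(chi2 / (chi2 + N))        # C = SQRT[chi2 / (chi2 + N)]

print("Phi-square =", round(phi2, 3))
print("Cramer's V =", round(cramers_v, 3))
print("Contingency coefficient =", round(contingency_c, 3))
```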
Lambda (λ)
This is a measure of association for cross-tabulations of nominal-level variables. It measures the percentage improvement in predictability of the dependent variable (row variable or column variable), given the value of the other variable (column variable or row variable). Using the contingency-table notation above (nij = cell frequencies, N = total number of observations, with row and column totals), the formulas are:

Lambda A (rows dependent):
λA = (Σj maxi nij - largest row total) / (N - largest row total)

Lambda B (columns dependent):
λB = (Σi maxj nij - largest column total) / (N - largest column total)

Symmetric Lambda
This is a weighted average of Lambda A and Lambda B. The formula is:
λ = (Σj maxi nij + Σi maxj nij - largest row total - largest column total) / (2N - largest row total - largest column total)
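A brute-force sketch of the proportional-reduction-in-error logic behind the three lambda formulas above, on an invented 3 × 3 table; the computation is coded directly from the formulas rather than taken from a library.

```python
# Minimal sketch of Goodman-Kruskal Lambda (rows dependent, columns dependent, symmetric) on an invented table.
import numpy as np

table = np.array([[25, 10,  5],
                  [ 6, 30,  9],
                  [ 4,  8, 28]])
N = table.sum()

max_row_total = table.sum(axis=1).max()     # largest row total
max_col_total = table.sum(axis=0).max()     # largest column total
sum_col_max = table.max(axis=0).sum()       # sum over columns of the largest cell in each column
sum_row_max = table.max(axis=1).sum()       # sum over rows of the largest cell in each row

lambda_a = (sum_col_max - max_row_total) / (N - max_row_total)        # rows dependent
lambda_b = (sum_row_max - max_col_total) / (N - max_col_total)        # columns dependent
lambda_sym = ((sum_col_max + sum_row_max - max_row_total - max_col_total)
              / (2 * N - max_row_total - max_col_total))              # symmetric

print("Lambda A =", round(lambda_a, 3))
print("Lambda B =", round(lambda_b, 3))
print("Symmetric Lambda =", round(lambda_sym, 3))
```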

Fisher's Exact Test

Fisher's exact test is a test for independence in a 2 × 2 table. It is designed to test the hypothesis that the two column percentages are equal, and it is particularly useful when sample sizes are small (even zero in some cells) and the Chi-square test is not appropriate. The test determines whether the two groups differ in the proportion with which they fall into the two classifications. The test is based on the probability of the observed outcome, and is given by the following formula:

p = [ (a + b)! (c + d)! (a + c)! (b + d)! ] / [ N! a! b! c! d! ]
where a, b, c, d represent the frequencies in the four cells, and N = total number of cases.
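For a 2 × 2 table with small counts, the probability formula above and the test itself can be sketched as follows; the cell counts a, b, c, d are invented.

```python
# Minimal sketch of Fisher's exact test on an invented 2 x 2 table with small counts.
from math import factorial
from scipy.stats import fisher_exact

a, b, c, d = 1, 9, 11, 3          # invented cell frequencies
N = a + b + c + d

# Probability of this particular table under the null hypothesis (the formula above).
p_table = (factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)) / (
    factorial(N) * factorial(a) * factorial(b) * factorial(c) * factorial(d))
print("P(observed table) =", round(p_table, 5))

# The test p-value sums such probabilities over all tables at least as extreme.
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="two-sided")
print("Fisher exact two-sided p =", round(p_value, 5))
```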

Wilcoxon Signed-Rank Test

This nonparametric test makes use of the sign and the magnitude of the rank of the differences between two related samples, and utilizes information about both the direction and the relative magnitude of the differences within pairs of variables.

Sum of ranks (W): The basic statistic for this test is the minimum of the sum of the positive ranks (ΣR+) and the sum of the negative ranks (ΣR-). This statistic is called W:
W = Minimum[ ΣR+, ΣR- ]

Mean of W: This is the mean of the sampling distribution of the sum of ranks for a sample of n items:
μW = n(n + 1) / 4

Standard deviation of W: This is the standard deviation of the sampling distribution of the sum of ranks. The formula is:
σW = SQRT[ n(n + 1)(2n + 1)/24 - Σ(ti³ - ti)/48 ]
where ti represents the number of times the i-th value occurs.

Number of zeros: If there are zero differences, they are thrown out and the number of pairs is reduced by the number of zeros.

Number of ties: Ties are treated by assigning the average rank to each member of the particular set of ties; this count is the number of sets of ties that occur in the data.

Approximations with (and without) continuity correction (z-value): If the sample size is 15 or more, a normal approximation may be used for the distribution of the sum of ranks. Although this method corrects for ties, it does not include the continuity correction factor. The z-value is as follows:
z = (W - μW) / σW

If the correction factor for continuity is used, the formula becomes:
z = (W - μW + 0.5) / σW
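A sketch of the W statistic and its normal approximation on invented paired data; scipy.stats.wilcoxon computes the same statistic and a p-value (its correction argument toggles the continuity correction).

```python
# Minimal sketch of the Wilcoxon signed-rank statistic W and its normal approximation (invented paired data).
import numpy as np
from scipy.stats import rankdata, wilcoxon

before = np.array([125, 115, 130, 140, 140, 115, 140, 125, 140, 135])   # invented pairs
after  = np.array([110, 122, 125, 120, 140, 124, 123, 137, 135, 145])

diff = after - before
diff = diff[diff != 0]                      # zero differences are discarded
ranks = rankdata(np.abs(diff))              # rank the absolute differences (average ranks for ties)
sr_plus  = ranks[diff > 0].sum()
sr_minus = ranks[diff < 0].sum()
W = min(sr_plus, sr_minus)

n = len(diff)
mu_w = n * (n + 1) / 4                               # mean of W
sigma_w = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # standard deviation of W (no tie correction here)
z = (W - mu_w) / sigma_w
print("W =", W, " z =", round(z, 3))

print(wilcoxon(after, before))              # cross-check with SciPy
```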

Correlation is a measure of the strength of the relationship between random variables. The (population) correlation between two variables X and Y is defined as:
ρ(X, Y) = Covariance(X, Y) / SQRT[ Variance(X) × Variance(Y) ]
where Covariance(X, Y) = E[(X - μX)(Y - μY)], and μX and μY are the expected values of X and Y respectively. ρ is called the Product Moment Correlation Coefficient, or simply the Correlation Coefficient. If X and Y tend to increase together, ρ is positive. If, on the other hand, one tends to increase as the other tends to decrease, ρ is negative. The value of the correlation coefficient lies between -1 and +1, inclusive.
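A short sketch of the sample analogue of this definition, on invented data, compared with numpy's built-in correlation matrix.

```python
# Minimal sketch of the (sample) product-moment correlation coefficient on invented data.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.1, 4.9, 6.2, 8.8])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))     # Covariance(X, Y) = E[(X - mu_X)(Y - mu_Y)]
rho = cov_xy / np.sqrt(x.var() * y.var())             # rho = Cov / SQRT[Var(X) Var(Y)]

print("rho (by definition) =", round(rho, 4))
print("rho (numpy)         =", round(np.corrcoef(x, y)[0, 1], 4))
```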
References: http://www.unesco.org/webworld/idams/advguide/Chapt6_2.htm
