JIGSAW ACADEMY
Analytics for Professionals

BASIC STATISTICAL CONCEPTS

CHI-SQUARE TESTS

We have looked at hypothesis testing using data from one sample or two samples.
- In one-sample tests, we test whether the sample mean is different from a hypothesized value or a population mean.
- In two-sample tests, we look at the difference between two means from two samples, or two proportions from two samples.

What if there are more than two samples? The best way to test differences in proportions for more than two samples is to use a Chi-Square test.

Chi-Square tests can also be used for:
1. Equality of more than two population means
2. Testing hypotheses about a population variance
3. Testing for independence of attributes in a population

CHI-SQUARE TESTS

We'll start with a very simple example that looks at categorical or tabular data.

You are a retailer looking at brand ROI to assess shelf space effectiveness. Let's say you look at a particular category, carbonated beverages. Based on past data, you know that Brand A, the category leader, has a 52% share of volume, followed by Brand B with a 35% share of volume, and all other brands (Brand C) with the remaining 13% share of volume.

You take a random sample of 300 purchases of carbonated beverages from a particular store. This is what you find:

Brand A: 177 purchases
Brand B: 78 purchases
Brand C: 45 purchases

Now, 177 purchases equates to a 59% share for Brand A; 78 purchases equates to 26% for Brand B, and 45 purchases to 15% for Brand C.

Do you believe that these proportions are statistically significantly different from the proportions observed in the population? Or is this store perhaps not representative of the population?

CHI-SQUARE TESTS

We introduce the concept of Observed Frequency vs Expected Frequency.

The idea is to check the difference between what you see in your sample and what you expected to see in your sample, and then assess the chances of seeing that difference purely by chance.

What would be the expected frequencies in our example?
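One way to answer this numerically is the minimal Python sketch below. It assumes the sample size is the 300 purchases that the three brand counts add up to, multiplies each population share by that sample size, and then forms the chi-square statistic that the following slides build by hand. The function and variable names are illustrative and not part of the course material, and the closed-form tail probability used here is valid only for exactly 2 degrees of freedom.

```python
import math

# Population volume shares in percent, as given on the slide.
population_share_pct = {"Brand A": 52, "Brand B": 35, "Brand C": 13}
# Purchases observed in the store sample.
observed = {"Brand A": 177, "Brand B": 78, "Brand C": 45}


def expected_counts(share_pct, n):
    """Expected frequency per category = population share * sample size."""
    return {brand: pct * n / 100 for brand, pct in share_pct.items()}


def chi_square_statistic(obs, exp):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)


n = sum(observed.values())          # total purchases in the sample
expected = expected_counts(population_share_pct, n)
statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1              # number of categories minus one

# For exactly 2 degrees of freedom the chi-square upper tail is
# P(X > x) = exp(-x / 2), so the 5% critical value is -2 * ln(0.05).
p_value = math.exp(-statistic / 2)
critical_value = -2 * math.log(0.05)

print("expected:", expected)
print("statistic:", round(statistic, 2), "df:", df)
print("critical value (5%):", round(critical_value, 3))
print("reject H0:", statistic > critical_value)
```

If the statistic exceeds the critical value, the store's brand mix differs from the population shares by more than chance alone would suggest at the 5% level.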
We were expecting 52% for Brand A out of 300 = 156.

So, for all brands, we can build a table like this:

            Brand A   Brand B   Brand C
Observed      177        78        45
Expected      156       105        39

CHI-SQUARE TESTS

The Chi-Square statistic is built as:

\chi^2 = \sum \frac{(O - E)^2}{E}

where O is the observed frequency and E is the expected frequency in each cell.

We also need the degrees of freedom to obtain the critical value for a Chi-Square test.

Degrees of Freedom = Number of cells - 1
In this case: 2 (3 - 1)

For our example,
Chi-Square Test Statistic = (177-156)^2/156 + (78-105)^2/105 + (45-39)^2/39 = 10.69
Degrees of Freedom = 2

CHI-SQUARE TESTS

What is the critical value here at the 95% confidence level (5% significance level)?
What is the conclusion based on your test statistic?
What was our null hypothesis?

CHI-SQUARE TESTS

Now let's look at a more complicated example.

Let's say you are looking at preferences for beverages by age, and you want to understand if there is an association between age and brand preference.

You do a survey on a random sample, and this is what you get (rows are brands, columns are age groups; the column labels and the last Sprite value are not legible in this copy):

Coke      33   45   51
Pepsi     26   33   28
Sprite    21   20   [illegible]

You can use a Chi-Square test to test the association between the row and column variables. In other words, you are testing the null hypothesis that what?

CHI-SQUARE TESTS

We know that the Chi-Square test statistic is a function of observed values and expected values. Where are the expected values?

You can actually calculate the expected values under the assumption that the null hypothesis is true. How is that done?

Expected Value = Row Total * Column Total / n, where n is the total number of observations in the sample

CHI-SQUARE TESTS

Now we can calculate the Chi-Square statistic:

\chi^2 = \sum \frac{(O - E)^2}{E}

Degrees of freedom are: (r-1)*(c-1), where r is the number of rows and c is the number of columns

In this case, the test statistic is 1.51, with 4 degrees of freedom (How did I get that?)

What is the conclusion?
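The expected-value rule and the degrees-of-freedom formula can be sketched in a few lines of Python. The counts in this sketch are hypothetical and are not the survey data, because the survey table is only partly legible, so the result will not match the 1.51 quoted in the slides. The function names are illustrative, and the critical value is the standard table entry for 4 degrees of freedom at the 5% level.

```python
def expected_table(observed):
    """Expected count per cell = row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    expected = expected_table(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical counts (NOT the survey data): rows are brands, columns are age groups.
hypothetical_survey = [
    [30, 40, 50],
    [25, 30, 35],
    [20, 20, 20],
]

statistic, df = chi_square_independence(hypothetical_survey)
CRITICAL_5PCT_DF4 = 9.488  # chi-square table value for df = 4, alpha = 0.05

print("statistic:", round(statistic, 3), "df:", df)
print("reject independence:", statistic > CRITICAL_5PCT_DF4)
```

A statistic below the critical value means the data give no reason to reject the null hypothesis that the row and column variables are independent.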
We fail to reject the null hypothesis: the survey shows no evidence of an association between age and choice of brand.

CHI-SQUARE TESTS - GOODNESS OF FIT

Another very popular use of Chi-Square is in Goodness of Fit tests; that is, testing whether the data follows a particular distribution or not.

Here is an example (from http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm):

A gambler is playing a new game in a casino, which involves rolling three dice at a time. Winnings are directly proportional to the number of 6's rolled. This is what is observed in 100 rolls of the dice:

[Table: observed number of rolls with 0, 1, 2 and 3 sixes; values not legible in this copy]

Would you have cause to believe that the gambler is maybe "too" lucky, and is playing with loaded dice?

CHI-SQUARE TESTS

What distribution would you expect the outcome of seeing a 6 on rolled dice to follow? Binomial.

What should the expected probabilities be for the number of 6's in three rolled dice? We can calculate that using the Binomial Distribution formula. In Excel, for example, the probability of rolling no 6's is:

=BINOM.DIST(0, 3, 1/6, FALSE)

CHI-SQUARE TESTS

So we now have both observed and expected frequencies, and can use the Chi-Square test to check if the difference between observed and expected suggests that the gambler is maybe too lucky.

How would you do that?

Chi-Square Statistic = 23.6
Degrees of Freedom = 3
Conclusion?

CHI-SQUARE TESTS

You can similarly test for a normal distribution or any other type of distribution; you need to calculate the expected distribution using the probability distribution formula.

If you are checking for normality, for example, follow these steps:
1. Calculate the mean and std deviation of your data
2. Bin the data into sub-intervals
3. Calculate the expected probability of those sub-intervals (using the normal probability function)
4. Compare that to the frequency observed in the data
5. Construct the Chi-Square statistic and test

CHI-SQUARE TESTS - EXCEL

[Screenshot: Excel worksheet applying CHITEST to the observed and expected counts for Coke, Pepsi and Sprite]

CHI-SQUARE TESTS - SAS

proc format;
   value ExpFmt 1='High Cholesterol Diet'
                0='Low Cholesterol Diet';
   value RspFmt 1='Yes'
                0='No';
run;

data FatComp;
   input Exposure Response Count;
   label Response='Heart Disease';
   /* The data rows are not legible on the slide: one "Exposure Response Count" triple per line */
   datalines;
;

proc sort data=FatComp;
   by descending Exposure descending Response;
run;

CHI-SQUARE TESTS - SAS

proc freq data=FatComp order=data;
   /* FORMAT and TABLES statements are not legible on the slide; these two lines are assumed */
   format Exposure ExpFmt. Response RspFmt.;
   tables Exposure*Response / chisq;
   weight Count;
   title 'Case-Control Study of High Fat/Cholesterol Diet';
run;

CHI-SQUARE TESTS - SAS

[Screenshot: SAS output, "Case-Control Study of High Fat/Cholesterol Diet"]

CHI-SQUARE TEST OF VARIANCE

The Chi-Square distribution can also be used for testing variance and std deviation. Remember that so far all the hypothesis testing we have reviewed was concerned with testing means. We cannot use the same tests to check for variance or std deviation (why not?)

The test statistic is computed as:

\chi^2 = \frac{(N - 1) s^2}{\sigma^2}

where
N = sample size
s = sample std deviation
\sigma = population std deviation

CHI-SQUARE TEST OF VARIANCE

A call center is experimenting with different approaches to improve customer experience, typically with the aim of consistent call resolution time. Currently the average resolution time is 6.5 minutes, with a std deviation of 4.5 minutes. A new approach has been tested and results in an average resolution time of 6 minutes, and a std deviation of 3 minutes, across 30 calls.

Is the new approach sufficiently different from the standard to justify investment in it?

If our aim is consistency, we should be checking if there is a significant reduction in the variability of resolution time. So,

H0: Std deviation = 4.5 minutes (variance = 4.5^2)
H1: Std deviation < 4.5 minutes (variance < 4.5^2)

Chi-Square Statistic = 29 * 3^2 / 4.5^2 = 12.89
DF = 29

CHI-SQUARE TEST OF VARIANCE

Conclusion?

This test does require that the underlying population is normally distributed.

ANOVA

Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations, or treatments. It involves:
- a continuous dependent, or response, variable
- a discrete independent variable, also called a predictor or explanatory variable

Let's look at an example:

Let's say that you work for a retailer who wants to understand how shelving height impacts sales.
That is, do sales of a particular brand change significantly if it is placed at eye level, or at lower or higher levels?

There are 4 shelves in an aisle, and the brand is placed on a different shelf in every aisle, with sales recorded for 10 days.

ANOVA

The table lists the total units sold every day for 10 days, when the brand was stocked on shelves at different heights.

[Table: units sold per day for 10 days, one column per shelf height (4 columns); the individual values are not legible in this copy]

We now need to determine if height has an impact on total sales, or, is there a difference in the group means by height?

ANOVA

In ANOVA, the overall variance of the population is divided into two parts:
- Within group variance
- Between group variance

[Figure: Partitioning Variability in ANOVA]

ANOVA

It can be established (mathematically) that there are two independent ways of estimating the standard error of the mean (essentially a measure of variance).

Intuition:
Approach 1: Use the sample variances to come up with an estimate of total variance - SSW
Approach 2: Use the comparison of group means - SSB

ANOVA looks at the ratio of the two methods of estimating variance; if the two estimates are similar, then the null hypothesis is unlikely to be rejected.

ANOVA

Another way of looking at ANOVA is that any observation in an experiment can be broken down into:
  the overall mean
  + (or -) how far the average of its group is from the overall mean - Between Group Variation
  + (or -) how far the observation is from the average of its group - Within Group Variation

If the independent variable has no impact, then within group variation and between group variation should be similar, with any small differences attributable to random sampling error.

ANOVA

So, the standard null hypothesis would be that all means are equal:

H0: Mean1 = Mean2 = Mean3 = Mean4 = Mean5
HA: Mean1 ≠ Mean2, or Mean3 ≠ Mean4, or Mean5 ≠ Mean1, and so on

The alternate: at least one pair of means is unequal.

What would be constructed as a test statistic?

If the variation between the groups is significantly larger than the variation within the groups, then we would expect to reject the null hypothesis.

ANOVA

How do we calculate the within group variation?
- Calculate the variance for each group, and then calculate an average across groups

Between group variation?
- Calculate the average of the squared deviations of each group mean from the mean for all the data (Grand Mean)

ANOVA - Within Group Variance

1. Calculate the mean for each group
2. Subtract the group mean from every score in that group
3. Square the differences
4. Add up all the squared differences

The SSW (Sum of Squares, Within) can be written as:

SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2

where x_{ij} is observation i in group j, \bar{x}_j is the mean of group j, n_j is the size of group j, and k is the number of groups.

ANOVA - Between Group Variance

1. Calculate a Grand Mean for all observations across all groups
2. Subtract the Grand Mean from each group mean
3. Square these differences
4. Multiply each squared difference by the sample size of its group
5. Add them all up

The SSB (Sum of Squares, Between) can be written as:

SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2

where \bar{x} is the Grand Mean.

ANOVA

Total Variation = Between Variance + Within Variance

Another way of looking at total variation is:
- Calculate the Grand Mean of all observations
- Calculate the difference between each observation and the Grand Mean
- Square the differences
- Add them all up

Should we check if total variation calculated this way adds up to the sum of between variance and within variance?

Calculated this way, SST (Total Sum of Squares) = 459.775
- SSB (Sum of Squares Between) = 13.875
- SSW (Sum of Squares Within) = 445.9

ANOVA

Therefore: SST = SSB + SSW
- What does this look like? The OLS regression equation. So, similar to OLS, we are trying to explain variation.

What does a large value of SSB mean?
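The three sums of squares can be checked numerically with the short Python sketch below. The shelf data are hypothetical, since the slide's sales table is not legible, and the function name is illustrative. The last lines reuse the SSB and SSW values reported on the slides, together with the 4 groups and 40 observations of the shelf example, to form the ratio MSB/MSW that the next slides introduce as the test statistic.

```python
from statistics import mean


def anova_sums_of_squares(groups):
    """Return (SSB, SSW, SST) for a list of groups of observations."""
    all_obs = [x for g in groups for x in g]
    grand_mean = mean(all_obs)
    group_means = [mean(g) for g in groups]
    # Within: squared distance of each observation from its own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # Between: squared distance of each group mean from the grand mean,
    # weighted by the group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Total: squared distance of each observation from the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_obs)
    return ssb, ssw, sst


# Hypothetical daily unit sales for three shelf heights (NOT the slide's data).
shelves = [
    [21, 20, 18, 19],
    [19, 18, 22, 21],
    [11, 12, 15, 10],
]
ssb, ssw, sst = anova_sums_of_squares(shelves)
print("SSB:", round(ssb, 3), "SSW:", round(ssw, 3), "SST:", round(sst, 3))
print("SST == SSB + SSW:", abs(sst - (ssb + ssw)) < 1e-9)

# Values reported on the slides for the shelf-height example.
slide_ssb, slide_ssw = 13.875, 445.9
slide_k, slide_n = 4, 40            # 4 shelf heights, 10 days each
slide_f = (slide_ssb / (slide_k - 1)) / (slide_ssw / (slide_n - slide_k))
print("slide SST:", round(slide_ssb + slide_ssw, 3), "ratio MSB/MSW:", round(slide_f, 4))
```

The identity SST = SSB + SSW holds for any grouping of the data, so the check is a quick sanity test on hand calculations.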
- As SSB becomes bigger, the chance of rejecting the null hypothesis becomes higher.

What does a large value of SSW mean?
- As SSW becomes smaller, the chance of rejecting the null hypothesis becomes higher; a large SSW makes rejection less likely.

ANOVA - TEST STATISTIC

What about the test statistic?
- The test stat for ANOVA = MSB/MSW, where MSB = SSB/DFB and MSW = SSW/DFW. DFB = k-1, where k is the number of groups, and DFW = n-k, where n = number of observations.
- The test stat follows an F-distribution.

Any random variate of the F-distribution can be characterized as the ratio of two Chi-Square variates, each divided by its degrees of freedom:

F = \frac{U_1 / d_1}{U_2 / d_2}

where U_1 and U_2 are Chi-Square distributed with d_1 and d_2 degrees of freedom.

[Figure: F-distribution density curves for several combinations of degrees of freedom (d1, d2)]

ANOVA - TEST STATISTIC

If we calculate for our example:
- N = 40
- K = ?
- MSB = SSB/DFB = ?
- MSW = ?

What do we assess from the F distribution table?
- Critical value = ?
- Null hypothesis to be ?

ANOVA - F DISTRIBUTION TABLE

[Table: F-distribution critical values, alpha = 0.05 in the right tail, by numerator and denominator degrees of freedom]

ANOVA - ASSUMPTIONS

- The populations from which the samples were obtained must be normally or approximately normally distributed.
- The samples must be independent.
- The variances of the populations must be equal.

ANOVA - EXCEL

Use Data Analysis in Tools. Results for the previous example:

Anova: Single Factor

SUMMARY
Groups      Count   Sum   Average   Variance
Column 1      10    177    17.7     5.566667
Column 2      10    181    18.1     14.32222
Column 3      10    174    17.4     14.71111
Column 4      10    165    16.5     14.94444

ANOVA
Source of Variation   SS        df   MS         F        P-value       F crit
Between Groups        13.875     3   4.625      0.3734   [illegible]   2.866
Within Groups         445.9     36   12.38611
Total                 459.775   39

ANOVA - SAS

PROC ANOVA
- SAS expects data to be in the following format: Group, Observation
- Syntax:

proc anova data=test;
   class group;
   model rate = group;
run;

TWO-WAY ANOVA

Let's now look at a more complex example.
- Let's say we are interested in understanding the impact of both shelf level and aisle placement on sales for Brand A.
- That is, not only the height at which the product is placed, but also the other brands/categories that the product is placed with are hypothesized to have an impact on Brand A sales.
- If there are three different aisles, we have 3*5 different placements for Brand A.
- How do we determine if mean sales rates are different between the groups?

TWO-WAY ANOVA

A Two-Way ANOVA is useful when we want to compare the effect of multiple levels of two factors and we have multiple observations at each level.

There are three null hypotheses that can be tested in a two-way ANOVA:
- The population means of the first factor are equal. This is like the one-way ANOVA for the row factor.
- The population means of the second factor are equal. This is like the one-way ANOVA for the column factor.
- There is no interaction between the two factors. This is similar to performing a test for independence with contingency tables.

TWO-WAY ANOVA - SAS/EXCEL

We do not need to get into all the calculation details.
- We can run a two-way analysis in Excel or SAS.
- In Excel, use Data Analysis.
- In SAS, use PROC GLM:

proc glm data=test;
   class shelf aisle;
   model rate = shelf aisle shelf*aisle;
run;

- What does the shelf*aisle term do?
- What if we leave it out?

TWO-WAY ANOVA - CASE STUDY

PROBLEM: You have been called in as a consultant to help the Pratt and Whitney plant determine the best method of applying the reflective stripe that is used to guide the Automated Guided Vehicles (AGVs) along their path. There are two ways of applying the stripe (paint or coated adhesive tape) and three types of flooring (linoleum and two types of concrete) in the facilities that use the AGVs. You have set up two identical "test tracks" on each type of flooring and applied the stripe using the two methods under study. You run 3 replications in random order and count the number of tracking errors per 1000 ft of track.

Use the Two-Way ANOVA options in Excel and SAS to assess if there is a significant difference in using paint or adhesive, by type of floor.

[Table: tracking errors per 1000 ft by stripe type (Paint, Adhesive) and floor type, 3 replications per cell; values not legible in this copy]

NON-PARAMETRIC TESTS

Most of the tests we have covered have required that the underlying data distribution that we take a sample from has a particular distribution type: normal, binomial, etc.

There are tests in statistics that do not require a distribution for your data; these are called non-parametric tests.

At first glance, non-parametric tests may seem "better" than parametric tests, because you are not bound to have a data distribution of a particular type.

We have already looked at one example of a non-parametric test - which one?

Why would we then use parametric tests at all?

Non-parametric tests are less powerful than parametric tests, in the sense that they use less of the information in the data, and they are sometimes less flexible in terms of testing different kinds of hypotheses.

Also, as sample size increases, it turns out that non-parametric test distributions approximate normal distributions.

JIGSAW ACADEMY
Analytics for Professionals
www.jigsawacademy.com
