Statistical Data Analysis Fundamentals

Summary
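Part 1
Population = the full set of research results (size N)
Sample = a part of the population (size n)
Types of data:
1. Categorical = responses that fall into categories (e.g. eye colour, level of satisfaction)
a) Nominal = categories without a natural order (e.g. yes/no answers)
b) Ordinal = categories with a natural order (e.g. 1. poor, 2. average, 3. good)
2. Numerical
a) Discrete = counting (whole numbers, e.g. age in completed years)
b) Continuous = measurement (e.g. weight)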
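Cumulative frequency distribution = running total of the frequencies (of a population or sample) up to a certain class
e.g.

Class                  Frequency  Percentage  Cumulative fr.  Cumulative p.
10 but less than 20    3          15%         3               15%
20 but less than 30    6          30%         9               45%
30 but less than 40    5          25%         14              70%

Cumulative fr. = adding the frequency values; cumulative p. = adding the percentage values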
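The running totals in the table can be reproduced with a few lines of Python. This is a minimal sketch: the total of 20 observations is an assumption inferred from the percentages (3 out of 20 is 15%), and the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies from the example table.
classes = ["10 but less than 20", "20 but less than 30", "30 but less than 40"]
frequencies = [3, 6, 5]
total = 20  # assumed total number of observations (3 / 20 = 15%)

# Running totals of the frequencies, then expressed as percentages.
cumulative_freq = list(accumulate(frequencies))
cumulative_pct = [100 * c / total for c in cumulative_freq]

for label, f, cf, cp in zip(classes, frequencies, cumulative_freq, cumulative_pct):
    print(f"{label}: frequency={f}, cumulative fr.={cf}, cumulative p.={cp:.0f}%")
```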
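$x_i$ = $i$-th value of the variable $X$
Arithmetic mean (population): $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$
$x_1, x_2, x_3, \dots$ = population values
Arithmetic mean (sample): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
$x_1, x_2, x_3, \dots$ = sample values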
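Median = the value that stands in the middle of the ordered data (e.g. for 1, 2, 2, 3, 3, 4, 5 the median is 3)
The position of the median is also calculated by: $\frac{n+1}{2}$
Note: if there is an even number of values, the median is the average of the two middle values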
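Variance (sample): $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Variance (population) = same formula, other symbols: $\sigma^2$ instead of $s^2$, $\mu$ instead of $\bar{x}$, $N$ instead of $n-1$, so $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard deviation (sample): $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Standard deviation (population) = same formula, other symbols (see above): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Basically, the standard deviation is for both: $\sqrt{\text{variance}}$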
Coefficient of variation: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Measures relative variation: it expresses the standard deviation as a percentage of the mean
Always given as a percentage
Can be used to compare two or more sets of data measured in different units
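As a quick check of the measures above, the following sketch uses Python's standard `statistics` module on the data set from the median example. The variable names are illustrative; `statistics.variance` and `statistics.stdev` use the sample formulas (divisor n - 1), while `statistics.pvariance` uses the population formula (divisor N).

```python
import statistics

data = [1, 2, 2, 3, 3, 4, 5]  # data set from the median example

data_mean = statistics.mean(data)
data_median = statistics.median(data)
sample_variance = statistics.variance(data)       # divides by n - 1
population_variance = statistics.pvariance(data)  # divides by N
sample_sd = statistics.stdev(data)                # square root of the sample variance

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = sample_sd / data_mean * 100

print(f"mean={data_mean:.3f}, median={data_median}, s^2={sample_variance:.3f}, CV={cv:.1f}%")
```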
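Covariance (sample): $Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
Covariance (population) = same formula, other symbols (see above): $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
Measures the linear relationship between two variables (the sign gives the direction; the size depends on the units)
Results of covariance:
$Cov(x, y) > 0$: x and y tend to move in the same direction
$Cov(x, y) = 0$: there is no linear relationship between x and y
$Cov(x, y) < 0$: x and y tend to move in opposite directions
Coefficient of correlation (sample): $r = \frac{Cov(x, y)}{s_x s_y}$
Coefficient of correlation (population) = same formula, other symbols (see above): $\rho = \frac{Cov(x, y)}{\sigma_x \sigma_y}$
Note: $-1 \le r \le 1$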
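The sample covariance and correlation formulas can be written out directly. The sketch below uses made-up data (the `x` and `y` lists are hypothetical, not from the notes) and hypothetical helper names.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y); s_x^2 is the covariance of x with itself."""
    s_x = sqrt(sample_covariance(x, x))
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]  # hypothetical data

cov_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(f"Cov(x, y) = {cov_xy:.3f}, r = {r:.4f}")
```

A positive covariance here means x and y tend to move in the same direction; r rescales it to the range from -1 to 1.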
Part 2
Discrete random variable = takes a countable number of values (toss a coin: X = number of heads => X is a discrete random variable)
$P(x) \ge 0$ for every value x
$\sum_x P(x) = 1$
Cumulative probability function = shows the probability that X is smaller than or equal to $x_0$
$F(x_0) = P(X \le x_0)$
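Expected value (or mean) of a discrete random variable:
$\mu = E(X) = \sum_x x P(x)$ => example: $E(X) = x_1 P(x_1) + x_2 P(x_2) + \dots$
Variance:
$\sigma^2 = \sum_x (x - \mu)^2 P(x)$
Standard deviation:
$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_x (x - \mu)^2 P(x)}$ => $\sigma = \sqrt{(x_1 - \mu)^2 P(x_1) + (x_2 - \mu)^2 P(x_2) + \dots}$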
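To make the three formulas concrete, here is a small sketch for X = number of heads in two tosses of a fair coin. The two-toss setup and the `pmf` dictionary are illustrative assumptions; the notes only mention tossing a coin.

```python
from math import sqrt

# Probability function of X = number of heads in two fair coin tosses (assumed example).
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# Every P(x) is non-negative and the probabilities sum to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1) < 1e-12

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # sum of (x - mu)^2 P(x)
std_dev = sqrt(variance)

print(mean, variance, std_dev)
```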
Functions of random variables
$P(x)$ is the probability function of X and $g(X)$ is a function of X
Expected value: $E(g(X)) = \sum_x g(x) P(x)$
If $g(x) = x$ we get the ordinary expected value (the mean)
If $g(x) = (x - \mu)^2$ we get the formula for the variance
Special case: if X always takes the same value a (a constant), then the mean is a and the variance is 0
If X is multiplied by a constant b, we just multiply the expected value by b:
$E(bX) = b\mu$
If X is multiplied by a constant b, we just square b and multiply it with the variance to get the variance of that expression:
$Var(bX) = b^2 \sigma^2$
Example:
Consider $Z = a + bX$, where X has a mean of $\mu_X$ and a variance of $\sigma_X^2$
=> $\mu_Z = E(a + bX) = a + b\mu_X$
=> $\sigma_Z^2 = Var(a + bX) = b^2 \sigma_X^2$ => standard deviation of Z: $\sigma_Z = |b| \sigma_X$
Special case (a bit complicated, but easy step by step):
$Z = \frac{X - \mu_X}{\sigma_X}$
Expected value of Z:
$\mu_Z = E\left(\frac{X - \mu_X}{\sigma_X}\right) = \frac{E(X) - \mu_X}{\sigma_X} = \frac{\mu_X - \mu_X}{\sigma_X} = \frac{0}{\sigma_X} = 0$
We are simply using the rule that lets us separate the variable X from the constants. Instead of $a + bX$ we have the opposite form, $(X - a)/b$, in which $a = \mu_X$ and $b = \sigma_X$ => we can use the rule.
Variance of Z:
$\sigma_Z^2 = Var\left(\frac{X - \mu_X}{\sigma_X}\right) = Var\left(\frac{1}{\sigma_X} X\right) = \left(\frac{1}{\sigma_X}\right)^2 Var(X) = \frac{\sigma_X^2}{\sigma_X^2} = 1$
It looks worse than it is; again we are using the rule. First we separate X from the constant $\mu_X / \sigma_X$, which we can drop, because adding or subtracting a constant does not change the variance. Then we take the factor $1/\sigma_X$ out of the variance by squaring it. Because this factor becomes $1/\sigma_X^2$ and $Var(X) = \sigma_X^2$, the two cancel and we get the value 1.
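The standardization result (mean 0, variance 1) can be checked numerically. The sketch below assumes a small hypothetical distribution for X and a hypothetical helper `expected` that implements $E(g(X)) = \sum_x g(x) P(x)$.

```python
from math import sqrt

pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # hypothetical distribution of X

def expected(g, pmf):
    """E(g(X)): sum of g(x) * P(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())

mu_x = expected(lambda x: x, pmf)
sigma_x = sqrt(expected(lambda x: (x - mu_x) ** 2, pmf))

def standardize(x):
    """Z = (X - mu_X) / sigma_X."""
    return (x - mu_x) / sigma_x

mu_z = expected(standardize, pmf)
var_z = expected(lambda x: (standardize(x) - mu_z) ** 2, pmf)

print(mu_z, var_z)  # should be 0 and 1 up to floating-point rounding
```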
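Bernoulli distribution:
Just two possibilities: success/failure
P = probability of success
1 - P = probability of failure
Random variable X is defined as 1 if success and 0 if failure
$P(X = 1) = P$ and $P(X = 0) = 1 - P$
For a single trial the mean is $\mu = P$ and the variance is $\sigma^2 = P(1 - P)$
For the number of successes in n independent trials:
Mean: $\mu = nP$
Variance: $\sigma^2 = nP(1 - P)$
The number of sequences with x successes in n trials: $C_x^n = \frac{n!}{x!(n-x)!}$
Binomial probability distribution (repeated Bernoulli trials):
Has to have a fixed number of trials n
The probabilities of success and failure add up to 1 and don't change during the experiment; the trials are independent of each other
$P(x) = \frac{n!}{x!(n-x)!} P^x (1 - P)^{n-x}$
=> probability of x successes in n trials with the probability of success P on each trial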
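The probability of x successes in n trials translates directly into code with `math.comb`. This is a sketch with hypothetical parameter values (n = 5, P = 0.3); it also checks numerically that the mean and variance of the number of successes are nP and nP(1 - P).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.3  # hypothetical number of trials and success probability

probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(probabilities))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))

print(f"sum={sum(probabilities):.6f}, mean={mean:.4f}, variance={variance:.4f}")
```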
Joint probability function
Gives the probability that X takes the specific value x and Y takes the value y, as a function of x and y
$P(x, y) = P(X = x \cap Y = y)$
Marginal probabilities are:
$P(x) = \sum_y P(x, y)$
$P(y) = \sum_x P(x, y)$
Conditional probability function
Y takes the value y when x is specified for X: $P(y \mid x) = \frac{P(x, y)}{P(x)}$
X takes the value x when y is specified for Y: $P(x \mid y) = \frac{P(x, y)}{P(y)}$
X and Y are independent when $P(x) P(y) = P(x, y)$ for all x and y
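Covariance: a measure of the linear relationship between X and Y
$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) P(x, y)$
Correlation: $\rho = Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
$\rho = 0$: no linear relationship
$\rho > 0$: positive relationship => when X is high, Y tends to be high as well
$\rho < 0$: negative relationship => when X is high, Y tends to be low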
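Cumulative distribution function (continuous random variable)
Expresses the probability that X doesn't exceed x: $F(x) = P(X \le x)$
Example: a and b are two values of X with a < b => $P(a < X < b) = F(b) - F(a)$
Mean: $\mu_X = E(X) = \int_{x_{min}}^{x_{max}} x f(x) \, dx$
Variance: $\sigma_X^2 = Var(X) = \int_{x_{min}}^{x_{max}} (x - E(X))^2 f(x) \, dx$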
Normal distribution function
Looks like a bell
Symmetrical
Mean, median and mode are equal
Location is determined by $\mu$ => changing it shifts the distribution to the left or right
Spread is determined by $\sigma$ => changing it widens or narrows the distribution
The random variable has an infinite range
Any normal distribution can be turned into the standardized normal distribution (Z): $Z = \frac{X - \mu}{\sigma}$ => the standardized normal distribution always has a mean of 0 and a variance of 1
Use Table 1 in the book to get the F(Z) value from a Z value
Joint cumulative distribution function
Suppose X and Y are continuous random variables
The function is written: $F(x, y)$
It gives the probability that X is less than x and, simultaneously, Y is less than y
$F(x, y) = P(X < x \cap Y < y)$
The random variables are independent if:
$P(X \le x, Y \le y) = P(X \le x) P(Y \le y)$, i.e. $F(x, y) = F(x) F(y)$
Covariance: $Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y))$
Note: if X and Y are independent, their covariance will be 0
Correlation: $Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
Rules for random variables
1. The mean of their difference is the difference of their means:
$E(X - Y) = \mu_X - \mu_Y$
2. If the covariance between X and Y is 0, then the variance of their difference is:
$Var(X - Y) = \sigma_X^2 + \sigma_Y^2$
3. If the covariance between X and Y is not 0, then the variance of their difference is:
$Var(X - Y) = \sigma_X^2 + \sigma_Y^2 - 2 Cov(X, Y)$
Linear combination of X and Y (where a and b are constants):
$W = aX + bY$
The mean of W is:
$\mu_W = E(W) = E(aX + bY) = a\mu_X + b\mu_Y$
The variance of W is: $\sigma_W^2 = a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab \, Corr(X, Y) \sigma_X \sigma_Y$
Note: if X and Y are jointly normally distributed, W is normally distributed as well
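The variance rule for a linear combination also covers the difference X - Y (take a = 1 and b = -1). The sketch below uses a hypothetical helper and made-up standard deviations to show that rules 2 and 3 follow from it.

```python
def linear_combination_variance(a, b, sigma_x, sigma_y, corr):
    """Var(aX + bY) = a^2 sx^2 + b^2 sy^2 + 2ab Corr(X, Y) sx sy."""
    return (a ** 2 * sigma_x ** 2 + b ** 2 * sigma_y ** 2
            + 2 * a * b * corr * sigma_x * sigma_y)

# Hypothetical inputs: sigma_x = 2, sigma_y = 3.
# Rule 2: covariance 0, so Var(X - Y) = 4 + 9.
var_diff_uncorrelated = linear_combination_variance(1, -1, 2, 3, 0.0)
# Rule 3: Corr = 0.5, so Cov(X, Y) = 0.5 * 2 * 3 = 3 and Var(X - Y) = 4 + 9 - 2 * 3.
var_diff_correlated = linear_combination_variance(1, -1, 2, 3, 0.5)

print(var_diff_uncorrelated, var_diff_correlated)
```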
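Part 3
Descriptive statistics: collecting, presenting and describing data
Inferential statistics: drawing conclusions or making decisions concerning a population based on sample data
Sampling distributions
A sampling distribution is the distribution of all possible values of a sample statistic (e.g. the sample mean) for samples of a given size from a population
The steps to develop a sampling distribution: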
1. List the given values (example: N = 4, X = age of the 4 individuals, values of X = 18, 20, 22, 24)
2. Calculate the mean and standard deviation (population)
a. $\mu = \frac{18 + 20 + 22 + 24}{4} = 21$
b. $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$
3. All possible samples of size n = 2 (drawn with replacement) in a table:

1st \ 2nd   18      20      22      24
18          18,18   18,20   18,22   18,24
20          20,18   20,20   20,22   20,24
22          22,18   22,20   22,22   22,24
24          24,18   24,20   24,22   24,24

4. Then draw a table of the sample means:

1st \ 2nd   18   20   22   24
18          18   19   20   21
20          19   20   21   22
22          20   21   22   23
24          21   22   23   24

5. => 16 sample means
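6. Summary of the sampling distribution (here N = 16 is the number of sample means):
a. $\mu_{\bar{X}} = \frac{\sum \bar{X}_i}{N} = \frac{18 + 19 + 19 + 20 + 20 + 20 + 21 + 21 + 21 + 21 + 22 + 22 + 22 + 23 + 23 + 24}{16} = 21$
b. $\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + (19 - 21)^2 + \dots + (24 - 21)^2}{16}} = 1.58$
7. Comparing the population and the sampling distribution:
a. Population:
i. N = 4
ii. $\mu = 21$
iii. $\sigma = 2.236$
b. Sample means:
i. n = 2
ii. $\mu_{\bar{X}} = 21$
iii. $\sigma_{\bar{X}} = 1.58$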
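The seven steps above can be sketched in a few lines of Python, assuming (as the table of 16 pairs implies) that samples of size 2 are drawn with replacement. `pstdev` is the population standard deviation (divisor N).

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [18, 20, 22, 24]  # ages of the 4 individuals
n = 2                          # sample size

# Steps 3 and 4: every ordered sample of size n (with replacement) and its mean.
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

# Steps 2 and 6: population parameters and summary of the sampling distribution.
mu, sigma = mean(population), pstdev(population)
mu_xbar, sigma_xbar = mean(sample_means), pstdev(sample_means)

print(f"population: mu={mu}, sigma={sigma:.3f}")
print(f"{len(sample_means)} sample means: mu={mu_xbar}, sigma={sigma_xbar:.2f}")
```

The spread of the sample means is smaller than the spread of the population by a factor of $\sqrt{n}$, which is the standard error relationship.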
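Expected value of the sample mean distribution
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Standard error of the mean
Describes the variability of the sample mean: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Decreases when the sample size increases
For the age example: $\mu_{\bar{X}} = 21$ and $\sigma_{\bar{X}} = \frac{2.236}{\sqrt{2}} = 1.58$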
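If the population is normal
The sampling distribution of $\bar{X}$ is also normally distributed
$\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Z value for sample mean distributions
$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
$\bar{X}$ = sample mean
$\mu$ = population mean
$\sigma$ = population standard deviation
n = sample size
If the population is not normal
The sampling distribution of $\bar{X}$ is approximately normal if n > 25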
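Example:
$\sigma = 3$
$\mu = 8$
n = 36
Probability that $\bar{X}$ lies between 7.8 and 8.2 = ?
n > 25 => approximately normal => $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
$\sigma_{\bar{X}} = \frac{3}{\sqrt{36}} = 0.5$
$P(7.8 < \bar{X} < 8.2) = P\left(\frac{7.8 - 8}{0.5} < \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < \frac{8.2 - 8}{0.5}\right) = P(-0.4 < Z < 0.4)$
=> $F(0.4) - (1 - F(0.4)) = 0.3108$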
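The table lookup in this example can be cross-checked with `statistics.NormalDist` from the Python standard library. This is only a sketch of the same calculation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8, 3, 36

standard_error = sigma / sqrt(n)  # sigma / sqrt(n) = 0.5
sampling_distribution = NormalDist(mu, standard_error)

# P(7.8 < X_bar < 8.2) = F(8.2) - F(7.8) under the sampling distribution.
probability = sampling_distribution.cdf(8.2) - sampling_distribution.cdf(7.8)

# Same result via the standardized values z = -0.4 and z = 0.4.
z = NormalDist()
probability_z = z.cdf(0.4) - z.cdf(-0.4)

print(f"{probability:.4f}")
```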
Sample variance
$x_1, x_2, x_3, \dots, x_n$ are a random sample from the population
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
The square root is called the sample standard deviation
Chi-square distribution
Depends on the degrees of freedom = n - 1 = d.f.
Use Table 7
$\chi_{n-1}^2 = \frac{(n-1) s^2}{\sigma^2}$
Example to find a probability:
A freezer has to hold its temperature with little variation
Standard deviation of no more than 4 => $\sigma = 4$
A sample of 14 freezers is tested => n = 14 => d.f. = 13
What is the probability that the sample variance exceeds 27.52? => $P(s^2 > 27.52) = ?$
$P(s^2 > 27.52) = P\left(\frac{(n-1) s^2}{\sigma^2} > \frac{(14 - 1) \cdot 27.52}{16}\right) = P(\chi_{13}^2 > 22.36) = 0.05$
Table 7: d.f. 13 and the value 22.36 correspond to $\alpha = 0.05$ => P = 0.05
Finding the chi-square value
n - 1 = 14 - 1 = 13 = d.f.
$\alpha = 0.05$
$\chi_{13, 0.05}^2 = 22.36$
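The table value can be sanity-checked without a statistics library. Python's standard library has no chi-square distribution, so this sketch integrates the chi-square density numerically with Simpson's rule; the function names are hypothetical, and the approach is only meant for d.f. greater than 2, where the density is 0 at x = 0.

```python
from math import exp, gamma

def chi2_pdf(x, df):
    """Chi-square density with df degrees of freedom (taken as 0 for x <= 0, valid for df > 2)."""
    if x <= 0:
        return 0.0
    return x ** (df / 2 - 1) * exp(-x / 2) / (2 ** (df / 2) * gamma(df / 2))

def chi2_upper_tail(x, df, steps=2000):
    """P(chi2_df > x) = 1 - integral of the density over [0, x], by Simpson's rule (steps even)."""
    h = x / steps
    total = chi2_pdf(0, df) + chi2_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, df)
    return 1 - total * h / 3

n, sigma_squared, threshold = 14, 16, 27.52      # values from the freezer example
statistic = (n - 1) * threshold / sigma_squared  # (n - 1) s^2 / sigma^2
probability = chi2_upper_tail(statistic, n - 1)

print(f"chi-square statistic = {statistic:.2f}, upper-tail probability = {probability:.3f}")
```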
Point and interval estimates
A point estimate is a single number
An interval estimate is the range from a lower, still reliable point to an upper, still reliable point; it is also known as a confidence interval
If $P(a < \theta < b) = 1 - \alpha$, then the interval from a to b is called a $100(1 - \alpha)\%$ confidence interval of $\theta$
The quantity $(1 - \alpha)$ is called the confidence level of the interval ($\alpha$ between 0 and 1), written as $a < \theta < b$ with $100(1 - \alpha)\%$ confidence
Suppose the confidence level = 95%
Also written as $(1 - \alpha) = 0.95$ => in repeated samples, 95% of all the intervals constructed this way will contain the unknown parameter
General formula: point estimate ± (reliability factor)(standard error)
Note: the reliability factor depends on the desired level of confidence
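t distribution
Consider a sample of n observations
with mean $\bar{x}$ and standard deviation s
from a normally distributed population with mean $\mu$
Then the variable $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ follows a t distribution with n - 1 degrees of freedom
We use the t distribution when the population standard deviation $\sigma$ is unknown and we use s (the sample standard deviation) instead
=> not that accurate, because s is only estimated from a sample
Assumptions:
Population standard deviation is unknown
Population is normally distributed
If the population is not normal, use a bigger sample
Use the t distribution
$(1 - \alpha)$ confidence interval estimate:
$\bar{x} - t_{n-1, \alpha/2} \frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{n-1, \alpha/2} \frac{s}{\sqrt{n}}$
t depends on the degrees of freedom
Use Table 8 for solving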
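Example
Sample: n = 25, $\bar{x} = 50$, s = 8; form a 95% confidence interval for $\mu$
d.f. = 25 - 1 = 24; $(1 - \alpha) = 0.95$ => $\alpha = 0.05$ => $\alpha/2 = 0.025$
$t_{n-1, \alpha/2} = t_{24, 0.025} = 2.0639$
$50 - (2.0639) \frac{8}{\sqrt{25}} < \mu < 50 + (2.0639) \frac{8}{\sqrt{25}}$
$46.69776 < \mu < 53.30224$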
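The arithmetic of this interval is easy to reproduce. The sketch below takes the critical value 2.0639 from the t table as an input, since the Python standard library has no t distribution; the variable names are illustrative.

```python
from math import sqrt

n, x_bar, s = 25, 50, 8
t_crit = 2.0639  # t_{24, 0.025}, read from the t table

# Point estimate +/- (reliability factor) * (standard error)
standard_error = s / sqrt(n)
margin = t_crit * standard_error
lower, upper = x_bar - margin, x_bar + margin

print(f"{lower:.5f} < mu < {upper:.5f}")
```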
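Confidence interval for the population proportion
The standard deviation of the sample proportion is $\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$
=> it is estimated from the sample by $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
=> $\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < P < \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$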
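To explain, I use the following example:
In a random sample of 100 people, 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers
$\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < P < \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
$\hat{p} = \frac{25}{100}$, n = 100
$z_{\alpha/2}$: 1 - 0.95 = 0.05 => 0.05/2 = 0.025
0.95 + 0.025 = 0.975 => Z table: look up F(Z) = 0.975 => z = 1.96
$\frac{25}{100} - 1.96 \sqrt{\frac{0.25(1 - 0.25)}{100}} < P < \frac{25}{100} + 1.96 \sqrt{\frac{0.25(1 - 0.25)}{100}}$
$0.1651 < P < 0.3349$
We can interpret that solution as follows:
We are 95% confident that the true percentage of left-handers in the population lies between 16.51% and 33.49%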
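The left-hander interval can be recomputed as follows. In this sketch the z value comes from `statistics.NormalDist().inv_cdf` instead of the Z table, so it is 1.95996... rather than the rounded 1.96; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, successes, confidence = 100, 25, 0.95

p_hat = successes / n
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)  # F(z) = 0.975

margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"{lower:.4f} < P < {upper:.4f}")
```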
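Part 4
Differences between two means
Goal:
Form a confidence interval for the difference between two population means, $\mu_x - \mu_y$
The samples need to be unrelated and independent (one sample doesn't affect the other)
The point estimate is the difference between the sample means, $\bar{x} - \bar{y}$
If $\sigma_x^2$ and $\sigma_y^2$ are known, use $z_{\alpha/2}$
If $\sigma_x^2$ and $\sigma_y^2$ are unknown, use the t distribution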
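$\sigma_x^2$ and $\sigma_y^2$ are known
Assumptions:
Samples are random and independent
Population distributions are normal
Population variances are known
$Var(\bar{X} - \bar{Y}) = \sigma_{\bar{X} - \bar{Y}}^2 = \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}$
and $Z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$ => standard normal distribution
The confidence interval is written as follows:
$(\bar{x} - \bar{y}) - z_{\alpha/2} \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}} < \mu_x - \mu_y < (\bar{x} - \bar{y}) + z_{\alpha/2} \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}$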
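$\sigma_x^2$ and $\sigma_y^2$ are unknown
Assumptions:
Samples are independent and random
Populations are normally distributed
Population variances are unknown and assumed unequal
Use a t distribution with v degrees of freedom:
$v = \frac{\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^2}{\frac{(s_x^2 / n_x)^2}{n_x - 1} + \frac{(s_y^2 / n_y)^2}{n_y - 1}}$
The confidence interval is described as follows:
$(\bar{x} - \bar{y}) - t_{v, \alpha/2} \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}} < \mu_x - \mu_y < (\bar{x} - \bar{y}) + t_{v, \alpha/2} \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}$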
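The degrees-of-freedom formula is easy to get wrong by hand, so a small sketch may help. All input numbers below are hypothetical, the function names are illustrative, and the critical value 2.101 is the usual table value for 18 degrees of freedom at $\alpha/2 = 0.025$.

```python
from math import sqrt

def unequal_variance_df(sx2, nx, sy2, ny):
    """Degrees of freedom v for two samples with unknown, unequal variances."""
    vx, vy = sx2 / nx, sy2 / ny
    return (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))

def difference_interval(x_bar, y_bar, sx2, nx, sy2, ny, t_crit):
    """(x_bar - y_bar) -/+ t_crit * sqrt(sx^2/nx + sy^2/ny)."""
    margin = t_crit * sqrt(sx2 / nx + sy2 / ny)
    diff = x_bar - y_bar
    return diff - margin, diff + margin

# Hypothetical samples: equal variances and sizes give v = nx + ny - 2 = 18.
v = unequal_variance_df(4.0, 10, 4.0, 10)
lower, upper = difference_interval(12.0, 10.0, 4.0, 10, 4.0, 10, t_crit=2.101)

print(f"v = {v:.1f}, interval: {lower:.3f} < mu_x - mu_y < {upper:.3f}")
```

In practice v is usually not a whole number; a common convention is to round it down before looking up the table value.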
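Confidence interval for the population variance
Goal: form a confidence interval for the population variance, $\sigma^2$
Based on the sample variance $s^2$
Population is normally distributed
Random variable: $\chi_{n-1}^2 = \frac{(n-1) s^2}{\sigma^2}$
$\chi_{n-1, \alpha}^2$ denotes the number for which: $P(\chi_{n-1}^2 > \chi_{n-1, \alpha}^2) = \alpha$
$\frac{(n-1) s^2}{\chi_{n-1, \alpha/2}^2} < \sigma^2 < \frac{(n-1) s^2}{\chi_{n-1, 1 - \alpha/2}^2}$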
For explanation, here is an example:
Suppose:
Sample size: 17
Sample mean: 3004
Sample standard deviation: 74
Population is normal
Determine the 95% confidence interval for $\sigma^2$
First determine everything:
n - 1 = 17 - 1 = 16; $\alpha/2 = (1 - 0.95)/2 = 0.025$; $1 - \alpha/2 = 0.975$; then find the chi-square values:
$\chi_{n-1, \alpha/2}^2 = \chi_{16, 0.025}^2 = 28.85$
$\chi_{n-1, 1 - \alpha/2}^2 = \chi_{16, 0.975}^2 = 6.91$
Then find $s^2 = 74^2$
Now fill it into the formula: $\frac{(n-1) s^2}{\chi_{n-1, \alpha/2}^2} < \sigma^2 < \frac{(n-1) s^2}{\chi_{n-1, 1 - \alpha/2}^2}$
$\frac{(17 - 1) \cdot 74^2}{28.85} < \sigma^2 < \frac{(17 - 1) \cdot 74^2}{6.91}$ => $3037 < \sigma^2 < 12680$
=> Now we have the interval for the population variance. From that we just have to take the square root, considering only the positive values as solutions => 55.1 and 112.6 as our limits, so we can state: we are 95% confident that the population standard deviation lies between 55.1 and 112.6
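The numbers in this example can be re-derived in a few lines. The sketch takes the two chi-square values from the table as inputs (the Python standard library has no chi-square quantile function); the variable names are illustrative.

```python
from math import sqrt

n, s = 17, 74
chi2_upper = 28.85  # chi-square value for 16 d.f. and alpha/2 = 0.025 (from the table)
chi2_lower = 6.91   # chi-square value for 16 d.f. and 1 - alpha/2 = 0.975 (from the table)

# (n - 1) s^2 / chi2_upper < sigma^2 < (n - 1) s^2 / chi2_lower
numerator = (n - 1) * s ** 2
variance_low, variance_high = numerator / chi2_upper, numerator / chi2_lower

# Interval for the standard deviation: positive square roots of the variance limits.
sd_low, sd_high = sqrt(variance_low), sqrt(variance_high)

print(f"{variance_low:.0f} < sigma^2 < {variance_high:.0f}")
print(f"{sd_low:.1f} < sigma < {sd_high:.1f}")
```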
Hypothesis tests
Null hypothesis
Always about a population parameter
The null hypothesis is the claim that we assume to be correct until the data show otherwise (example: the mean number of TVs in an American household is three => $H_0: \mu = 3$)
Refers to the status quo (not guilty)
Always contains "=", "$\le$" or "$\ge$"
May or may not be rejected
Alternative hypothesis
Assumes the opposite of $H_0$ (in our example: $H_1: \mu \ne 3$)
Always contains "$\ne$", "<" or ">"
May or may not be supported
Example: the population mean age is 50 => $H_0: \mu = 50$
Now we select a sample and calculate its mean. Let's suppose it was $\bar{X} = 20$ => it is unlikely that the null hypothesis is true
Level of significance
Defines the rejection region of the sampling distribution
Written as $\alpha$; typical values are 0.01, 0.05, 0.1
Is selected by the researcher
Provides the critical values
Types of tests (3 is an example for any number)
Two-tail test:
$H_0: \mu = 3$
$H_1: \mu \ne 3$
Upper-tail test:
$H_0: \mu \le 3$
$H_1: \mu > 3$
Lower-tail test:
$H_0: \mu \ge 3$
$H_1: \mu < 3$
Errors in making decisions
Type 1 error = rejecting a true null hypothesis
Its probability is $\alpha$, also called the level of significance
Type 2 error = failing to reject a false null hypothesis
Its probability is $\beta$
The actual situation is shown below:

Decision            $H_0$ true                   $H_0$ false
Do not reject $H_0$   No error ($1 - \alpha$)      Type 2 error ($\beta$)
Reject $H_0$          Type 1 error ($\alpha$)      No error ($1 - \beta$)
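Test of hypothesis for the mean ($\sigma$ known)
Convert the sample result ($\bar{x}$) to a z value
Consider the test:
$H_0: \mu = \mu_0$
$H_1: \mu > \mu_0$
The decision rule is:
Reject $H_0$ if $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} > z_\alpha$
Alternate rule:
Reject $H_0$ if $\bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$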
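P value
Probability of obtaining a test statistic as extreme as or more extreme than the observed sample value, given that $H_0$ is true
Also called the observed level of significance
Shows the smallest value of $\alpha$ for which $H_0$ can be rejected
Convert the sample result (e.g. $\bar{x}$) to a test statistic (e.g. z statistic)
Example of an upper-tail test:
Obtain the p value:
p value = $P\left(Z > \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \text{ given that } H_0 \text{ is true}\right)$ => $P\left(Z > \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \,\middle|\, \mu = \mu_0\right)$
Decision rule: compare the p value to $\alpha$
If p value < $\alpha$, reject $H_0$
If p value $\ge \alpha$, don't reject $H_0$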
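The p value rule for an upper-tail test with known $\sigma$ can be sketched as below. All input numbers are hypothetical (they are not an example from the notes), and `statistics.NormalDist` stands in for the Z table.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: H0: mu = 8 against H1: mu > 8, with sigma known.
mu_0, sigma, n = 8, 3, 36
x_bar = 8.9   # hypothetical observed sample mean
alpha = 0.05  # level of significance chosen by the researcher

z_stat = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z_stat)  # P(Z > z_stat), given that H0 is true

reject_h0 = p_value < alpha
print(f"z = {z_stat:.2f}, p value = {p_value:.4f}, reject H0: {reject_h0}")
```

The same decision follows from comparing `z_stat` with the critical value $z_\alpha$, because the p value is below $\alpha$ exactly when z exceeds $z_\alpha$.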
One-tail test
The alternative hypothesis focuses on one direction
If $H_1$ is ">" than something, it's an upper-tail test
If $H_1$ is "<" than something, it's a lower-tail test
Lower- and upper-tail tests have just one critical value, since the rejection area is in only one tail
Two-tail test
Two critical values defining the two areas of rejection
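t test of hypothesis for the mean ($\sigma$ unknown)
Convert the sample result ($\bar{x}$) to a t test statistic
Consider the test:
$H_0: \mu = \mu_0$
$H_1: \mu > \mu_0$
The decision rule is:
Reject $H_0$ if $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} > t_{n-1, \alpha}$
For a two-tailed test:
$H_0: \mu = \mu_0$
$H_1: \mu \ne \mu_0$
The decision rule is:
Reject $H_0$ if $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} < -t_{n-1, \alpha/2}$ or if $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} > t_{n-1, \alpha/2}$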
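Test of the population proportion
Involves categorical values
Two outcomes:
Success (a certain characteristic is present)
Failure (a certain characteristic is not present)
The proportion of the population in the success category is written as P
Sample size is large
The sample proportion in the success category is written as $\hat{p}$
$\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
If $nP(1 - P) > 9$, $\hat{p}$ can be seen as approximately normally distributed
Therefore the mean is $\mu_{\hat{p}} = P$
and the standard deviation is $\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$
Hypothesis test for the proportion ($nP(1 - P) > 9$)
Use a Z value, because $\hat{p}$ is approximately normally distributed
$Z = \frac{\hat{p} - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}}$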
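A two-tail test of a proportion might look like the sketch below. It reuses the left-hander sample (25 out of 100), but the null value $P_0 = 0.2$ and the significance level are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

n, successes = 100, 25  # sample from the left-hander example
p_0 = 0.2               # hypothetical null value: H0: P = 0.2, H1: P != 0.2
alpha = 0.05            # hypothetical level of significance

# The normal approximation is only used when n * P0 * (1 - P0) > 9.
approximation_ok = n * p_0 * (1 - p_0) > 9

p_hat = successes / n
z_stat = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)

# Two-tail p value: probability of a Z at least as far from 0 as z_stat.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_h0 = p_value < alpha

print(f"z = {z_stat:.2f}, p value = {p_value:.4f}, reject H0: {reject_h0}")
```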
