2/28/2015

Correlation and Linear Regression

Introduction
In this section we discuss correlation analysis, which is a technique used to quantify the
associations between two continuous variables. For example, we might want to quantify
the association between body mass index and systolic blood pressure, or between
hours of exercise per week and percent body fat. Regression analysis is a related
technique to assess the relationship between an outcome variable and one or more risk
factors or confounding variables (confounding is discussed later). The outcome variable
is also called the response or dependent variable, and the risk factors and
confounders are called the predictors, or explanatory or independent variables. In
regression analysis, the dependent variable is denoted "Y" and the independent
variables are denoted by "X".
[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to
predict even beyond the limits of the data. Also, the term "explanatory variable" might
give an impression of a causal effect in a situation in which inferences should be limited
to identifying associations. The terms "independent" and "dependent" variable are less
subject to these interpretations as they do not strongly imply cause and effect.]

Learning Objectives
After completing this module, the student will be able to:
1. Define and provide examples of dependent and independent variables in a study of a public health problem
2. Compute and interpret a correlation coefficient
3. Compute and interpret coefficients in a linear regression analysis

Correlation Analysis
In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation
coefficient. The sample correlation coefficient, denoted r,
ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation
between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher
levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the
strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a
weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.
It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient
does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical
displays are particularly useful to explore associations between variables.
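As a small illustration of why a correlation coefficient can miss a non-linear association, the Python sketch below computes r for a made-up data set in which y is completely determined by x, but along a U-shaped curve. The helper name pearson_r and the data values are assumptions chosen for illustration; they do not come from the module.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation from the covariance and the two variances."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y depends perfectly on x, but through a U-shaped curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(f"r = {r_curved:.2f}")  # the linear correlation is zero despite the perfect dependence
```

A scatter plot of these points would show the curve immediately, which is the reason for looking at the data before computing r.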
The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and
birth length.
Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase
with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at
which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r = -0.9) generally observed between the number of hours of aerobic exercise
per week and percent body fat.

Example: Correlation of Gestational Age and Birth Weight
A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth
weight, measured in grams.

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and
gestational age is the independent variable. Thus y = birth weight and x = gestational age. The data are displayed in a scatter diagram in the
figure below.

Each point represents an (x, y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that
the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot
shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born
with lower weights and infants with longer gestational ages are more likely to be born with higher weights.
The formula for the sample correlation coefficient is

\[ r = \frac{\mathrm{Cov}(x, y)}{\sqrt{s_x^2 \, s_y^2}} \]

where Cov(x, y) is the covariance of x and y, defined as

\[ \mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} \]

and s_x^2 and s_y^2 are the sample variances of x and y, defined as

\[ s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \qquad s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]

The variances of x and y measure the variability of the x scores and y scores around their respective sample means ($\bar{x}$ and $\bar{y}$, considered
separately). The covariance measures the variability of the (x, y) pairs around the mean of x and mean of y, considered simultaneously.

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight and also the
covariance of gestational age and birth weight.
We first summarize the gestational age data. The mean gestational age is:

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age
and the mean gestational age. The computations are summarized below.

The variance of gestational age is:

Next, we summarize the birth weight data. The mean birth weight is:

The variance of birth weight is computed just as we did for gestational age as shown in the table below.

The variance of birth weight is:

Next we compute the covariance, $\mathrm{Cov}(x, y) = \sum (x - \bar{x})(y - \bar{y}) / (n - 1)$.

To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the
deviation from the mean birth weight for each participant (i.e., $(x - \bar{x})(y - \bar{y})$).

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the
two tables above into the table below and multiply.

The covariance of gestational age and birth weight is:

We now compute the sample correlation coefficient:

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
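The sequence of steps described above (means, deviations from the means, variances, covariance, and finally r) can be sketched in a few lines of Python. The five (gestational age, birth weight) pairs below are hypothetical values invented for illustration; they are not the 17 infants in the study, so the printed results should not be read as the study's results.

```python
import math

# Hypothetical (gestational age in weeks, birth weight in grams) pairs.
# These are NOT the 17 infants from the study; they only illustrate the steps.
weeks = [34.0, 36.0, 38.0, 40.0, 42.0]
grams = [1900.0, 2600.0, 2900.0, 3300.0, 3400.0]

n = len(weeks)
x_bar = sum(weeks) / n
y_bar = sum(grams) / n

# Deviations of each observation from its sample mean.
dx = [x - x_bar for x in weeks]
dy = [y - y_bar for y in grams]

# Sample variances and covariance, each with n - 1 in the denominator.
var_x = sum(d ** 2 for d in dx) / (n - 1)
var_y = sum(d ** 2 for d in dy) / (n - 1)
cov_xy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
r = cov_xy / math.sqrt(var_x * var_y)

# Print the working table: observations, deviations, and their products.
print(f"{'x':>6} {'y':>7} {'x - mean':>9} {'y - mean':>9} {'product':>9}")
for x, y, a, b in zip(weeks, grams, dx, dy):
    print(f"{x:6.1f} {y:7.0f} {a:9.1f} {b:9.1f} {a * b:9.1f}")
print(f"var_x = {var_x:.1f}, var_y = {var_y:.1f}, cov = {cov_xy:.1f}, r = {r:.2f}")
```

The product column is the same copy-and-multiply step described for the covariance table: each row's deviation in x is multiplied by its deviation in y, and the products are summed and divided by n - 1.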

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or
practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether
an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an
observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.[1]

Regression Analysis
Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses
in subsequent modules.

Simple Linear Regression
Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and
one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in
milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared) where total
cholesterol is the dependent variable, and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the
independent variable is denoted X. So, in this case, Y = total cholesterol and X = BMI.
When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression
analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as
a curvilinear or exponential relationship, alternative regression analyses are performed.)
The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y)
pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable is on the
horizontal axis and the dependent variable on the vertical axis.
Figure: BMI and Total Cholesterol

The graph shows that there is a positive or direct association between BMI and total cholesterol: participants with lower BMI are more likely to
have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels. Now suppose
we examine the association between BMI and HDL cholesterol.
In contrast, the graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n = 20 participants.
Figure: BMI and HDL Cholesterol

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher
HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.
For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the
association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

\[ \hat{Y} = b_0 + b_1 X \]

where $\hat{Y}$ is the predicted or expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept, and b1 is the estimated slope. The
Y-intercept and slope are estimated from the sample data so as to minimize the sum of the squared differences between the observed and the
predicted values of the outcome, i.e., the estimates minimize:

\[ \sum (Y - \hat{Y})^2 \]

These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope
minimize the sum of the squared residuals, and are called the least squares estimates.[1]
Residuals
Conceptually, if the values of X provided a perfect prediction of Y then the sum of the squared
differences between observed and predicted values of Y would be 0. That would mean that
variability in Y could be completely explained by differences in X. However, if the differences
between observed and predicted values are not 0, then we are unable to entirely account for
differences in Y based on X, and there are residual errors in the prediction. The residual error
could result from inaccurate measurements of X or Y, or there could be other variables besides X
that affect the value of Y.
Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the
sum of the squared differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y)
when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one unit change in the
independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:

\[ b_1 = r \, \frac{s_y}{s_x} \qquad \text{and} \qquad b_0 = \bar{y} - b_1 \bar{x} \]

where
r is the sample correlation coefficient,
the sample means are $\bar{x}$ and $\bar{y}$,
and s_x and s_y are the standard deviations of the independent variable x and the dependent variable y, respectively.
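The two least squares formulas can be tried out on a small data set. The sketch below computes the slope as r times the ratio of the standard deviations and the intercept from the two means, then checks that moving either estimate away from those values makes the sum of squared residuals larger. The (BMI, total cholesterol) pairs and the function names are hypothetical, chosen only for illustration; they are not the n = 20 participants discussed in this module.

```python
from statistics import mean, stdev

# Hypothetical (BMI, total cholesterol) pairs, invented for illustration only.
bmi = [20, 23, 26, 29, 32]
chol = [160, 175, 200, 215, 240]

def least_squares(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted outcomes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

b0, b1 = least_squares(bmi, chol)
best = sum_squared_residuals(bmi, chol, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, sum of squared residuals = {best:.2f}")

# Nudging either estimate away from the least squares values increases the sum.
print(sum_squared_residuals(bmi, chol, b0 + 1.0, b1) > best)
print(sum_squared_residuals(bmi, chol, b0, b1 + 0.1) > best)
```

In practice a statistical package would normally be used for this; the point of writing it out is only to show that the two formulas are enough to obtain the fitted line.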

BMI and Total Cholesterol
The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 =
28.07 and b1 = 6.49. These are computed as follows:

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is
meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one
unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by
approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).
The equation of the regression line is as follows:

\[ \hat{Y} = 28.07 + 6.49 \, (\mathrm{BMI}) \]

The graph below shows the estimated regression line superimposed on the scatter diagram.

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a
participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32 mg/dL. The equation can also be used to
estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose
BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should
only be used to generate estimates of total cholesterol for persons with BMI in that range.
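One way to make the range restriction explicit is to build it into the function that applies the regression equation. The sketch below uses the estimates b0 = 28.07 and b1 = 6.49 and the sample BMI range of 20 to 32 given in the text; the function name and the decision to raise an error outside that range are illustrative assumptions, not part of the module.

```python
B0 = 28.07       # estimated Y-intercept from the text
B1 = 6.49        # estimated slope from the text
BMI_MIN = 20.0   # smallest BMI in the sample
BMI_MAX = 32.0   # largest BMI in the sample

def estimate_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) for a BMI inside the observed range."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        # Refuse to extrapolate beyond the data used to fit the line.
        raise ValueError(f"BMI {bmi} is outside the sample range {BMI_MIN}-{BMI_MAX}")
    return B0 + B1 * bmi

estimate_at_25 = estimate_total_cholesterol(25)
print(round(estimate_at_25, 2))  # 190.32

try:
    estimate_total_cholesterol(40)
except ValueError as err:
    print(err)
```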
There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly
different from zero. The test of most interest is usually H0: β1 = 0 versus H1: β1 ≠ 0, where β1 is the population slope. If the population slope is
significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent
variables.

BMI and HDL Cholesterol
The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as
follows: b0 = 111.77 and b1 = -2.35. These are computed as follows:

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in
HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL
cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows
the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable
(Y) at each value of the independent variable (X) is approximately normally distributed. Note, however, that the independent variable can be
continuous (e.g., BMI) or can be dichotomous (see below).

Comparing Mean HDL Levels With Regression Analysis
Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between
treatment groups statistically using a two independent samples t test. Here we consider an alternate approach. Summary data for the trial are
shown below:

Group       Sample Size    Mean HDL    Standard Deviation of HDL
New Drug    50             40.16       4.46
Placebo     50             39.21       3.91

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable.
Suppose the data on n = 100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the
independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for
participants who received the placebo. A simple linear regression equation is estimated as follows:

\[ \hat{Y} = 39.21 + 0.95X \]

where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the
participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0 = 39.21. The Y-intercept is the value of Y (HDL
cholesterol) when X is zero. In this example, X = 0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean
HDL level in the placebo group. The slope is estimated as b1 = 0.95. The slope represents the estimated change in Y (HDL cholesterol) relative
to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope
represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

\[ \hat{Y} = 39.21 + 0.95(1) = 40.16 \]
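The claim that the intercept equals the placebo mean and the slope equals the difference in group means can be checked numerically. The sketch below fits a least squares line to a 0/1 indicator variable using the sums of squares and cross-products, which is algebraically the same as computing the slope from r and the two standard deviations. The eight HDL values are hypothetical and are not the trial data, for which only summary statistics are given.

```python
from statistics import mean

# Hypothetical HDL values (mg/dL), invented for illustration; not the trial data.
placebo_hdl = [38.0, 39.0, 40.0, 41.0]
new_drug_hdl = [39.0, 40.0, 42.0, 43.0]

# X is the indicator variable: 0 = placebo, 1 = new drug.
xs = [0] * len(placebo_hdl) + [1] * len(new_drug_hdl)
ys = placebo_hdl + new_drug_hdl

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

b1 = sxy / sxx           # least squares slope
b0 = y_bar - b1 * x_bar  # least squares Y-intercept

print(b0, mean(placebo_hdl))                       # intercept vs. placebo mean
print(b1, mean(new_drug_hdl) - mean(placebo_hdl))  # slope vs. difference in means
```

Each printed pair should agree, which mirrors the result in the text: 39.21 is the placebo mean and 0.95 is the difference between the two group means.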

A study was conducted to assess the association between a person's intelligence and the
size of their brain. Participants completed a standardized IQ test and researchers used
Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information,
including the patient's gender, was also recorded.

The Controversy Over Environmental Tobacco Smoke Exposure
There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of
circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease
(i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung
cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be
quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it
possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per
day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as
"pack-years".
It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like
heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total
lifetime exposure. In addition, quantifying these risks is also complicated because of confounding factors. For example, ETS exposure is usually
classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke,
and inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even
if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone
so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort
study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart
disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been
widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)
Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the
annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order
to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.
Cigarettes Smoked Per Day    CVD Mortality/100,000 men/yr.    Lung Cancer Mortality/100,000 men/yr.
0 (never smoked)             572                              14
10 (actually 1-14)           802                              105
20 (actually 15-24)          892                              208
30 (actually >24)            1025                             355

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was
strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here: it represents
the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher
than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.
The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional
risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in
never smokers is quite low, but there is a finite risk: various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who
never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk
would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was
employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
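The correlations, slopes, and intercepts quoted above can be reproduced from the four rounded exposure levels and the mortality rates in the table. The sketch below is one way to do so in plain Python; the function name fit_line is an illustrative assumption, and the slope is computed from the sums of squares and cross-products, which is algebraically equivalent to r times the ratio of the standard deviations.

```python
import math
from statistics import mean

# Rounded exposure levels and annual mortality per 100,000 men,
# copied from the British Doctors table in the text.
cigs_per_day = [0, 10, 20, 30]
cvd_deaths = [572, 802, 892, 1025]
lung_deaths = [14, 105, 208, 355]

def fit_line(xs, ys):
    """Return (intercept, slope, r) for a least squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

cvd_b0, cvd_b1, cvd_r = fit_line(cigs_per_day, cvd_deaths)
lung_b0, lung_b1, lung_r = fit_line(cigs_per_day, lung_deaths)

print(f"CVD:         intercept={cvd_b0:.1f}  slope={cvd_b1:.2f}  r={cvd_r:.2f}")
print(f"Lung cancer: intercept={lung_b0:.1f}  slope={lung_b1:.2f}  r={lung_r:.2f}")
```

With only four aggregated data points, correlations this high are easy to obtain, so the fitted slopes are best read as a rough summary of the dose-response pattern rather than as precise estimates.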

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest
exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet
aggregation.

Summary
Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a
predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be
continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate
dichotomous independent variables.
The procedures described here assume that the association between the independent and dependent variables is linear. With some
adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic).
Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called
simple linear regression; "simple" in this case refers to the fact that there is a single independent variable. In the next module, we consider
regression analysis with several independent variables, or predictors, considered simultaneously.
