
bivariate EDA and regression analysis

ex: width vs. length
ex: weight of core vs. distance from quarry

[scatterplots of AG_C1_2 against AG_C1_1]

scatterplot matrix
[scatterplot matrix of AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1, AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2]

[scatterplots: AG_C2_1 against AG_C1_1; AG_C1_2 against AG_C1_1]

scatterplots
scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them
simpler kinds of summaries may be useful:
more compact; often capture less detail
may support more extended mathematical analyses
may reveal fundamental relationships

y = a + bx

b = slope = Δy/Δx = (y2-y1)/(x2-x1)
a = y-intercept

[figure: a line y = a + bx drawn through the points (x1,y1) and (x2,y2)]

y = a + bx
we can predict values of y from values of x
predicted values of y are called y-hat (ŷ):

ŷ = a + bx

the predicted values (ŷ) are often regarded as dependent on the (independent) x values
try to assign independent values to the x-axis, dependent values to the y-axis

y = a + bx
becomes a concise summary of a point distribution, and a model of a relationship
may have important explanatory and predictive value

how do we come up with these lines? various options:
by eye
calculating a Tukey Line (resistant to outliers)
locally weighted regression (LOWESS)
least squares regression

linear regression
linear regression and correlation analysis are generally concerned with fitting lines to real data
least squares regression is one of the main tools
attempts to minimize deviation of observed points from the regression line
maximizes its potential for prediction

standard approach minimizes the squared variation in y:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Note:
these are the vertical deviations
this is a sum-squared-error approach

calculating a line that minimizes this value is called regressing y on x
appropriate when we are trying to predict y from x
this is also called Model I Regression

regressing x on y would involve defining the line

$\hat{x}_i = c + d\,y_i$

by minimizing

$\sum_{i=1}^{n} (x_i - \hat{x}_i)^2$

start by calculating the slope (b):

$b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

(the numerator is the covariance)

once you have the slope, you can calculate the y-intercept (a):

$a = \bar{y} - b\,\bar{x} = \dfrac{\sum y_i - b \sum x_i}{n}$
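as an illustration, these formulas can be computed directly, e.g. in Python (the x and y data below are invented for illustration, not from the slides):

# least squares slope and intercept, following the formulas above
x = [5, 10, 20, 35, 50, 70]          # e.g. distances from a quarry (hypothetical)
y = [5.2, 4.1, 3.3, 2.0, 1.2, 0.6]   # e.g. obsidian densities (hypothetical)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar                  # a = ybar - b*xbar

y_hat = [a + b * xi for xi in x]     # predicted values (y-hat)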

regression pathologies
things to avoid in regression analysis

Tukey Line
resistant to outliers
divide cases into thirds, based on the x-axis
identify the median x and y values in the upper and lower thirds
slope: b = (My3-My1)/(Mx3-Mx1)
intercept: a = median of all values yi - b*xi
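a rough Python sketch of this recipe (one simple way of forming the thirds; the slides do not spell out how ties or uneven group sizes should be handled):

from statistics import median

def tukey_line(x, y):
    # sort cases by x and take the lower and upper thirds
    pairs = sorted(zip(x, y))
    n = len(pairs)
    k = n // 3
    lower, upper = pairs[:k], pairs[n - k:]

    # median x and y within the lower and upper thirds
    mx1, my1 = median(p[0] for p in lower), median(p[1] for p in lower)
    mx3, my3 = median(p[0] for p in upper), median(p[1] for p in upper)

    b = (my3 - my1) / (mx3 - mx1)                     # slope
    a = median(yi - b * xi for xi, yi in zip(x, y))   # intercept: median of yi - b*xi
    return a, b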

Correlation
regression concerns fitting a linear model to observed data
correlation concerns the degree of fit between observed data and the model...
if most points lie near the line:
the fit of the model is good
the two variables are strongly correlated
values of y can be well predicted from x

Pearson's r
this is assessed using the product-moment correlation coefficient:

$r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$

= covariance (the numerator), standardized by a measure of variation in both x and y

[figure: scatterplot with a point (xi, yi) and the means of x and y marked, illustrating the sign of the cross-products in the numerator of r]

unlike the covariance, r is unit-less
ranges between -1 and 1:
0 = no correlation
-1 and 1 = perfect negative and positive correlation (respectively)

correlation between x and y is the same as between y and x
r is symmetrical
no question of independence or dependence
recall, this symmetry is not true of regression
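a minimal Python sketch of the product-moment formula above:

from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # covariance term
    den = sqrt(sum((xi - xbar) ** 2 for xi in x) *
               sum((yi - ybar) ** 2 for yi in y))                  # variation in x and y
    return num / den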

regression/correlation
one can assess the strength of a relationship by seeing how knowledge of one variable improves the ability to predict the other

if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
if the y measurements are widely scattered, prediction errors will be greater than if they are close together
we can assess the dispersion of y values around their mean by:

$\sum_i (y_i - \bar{y})^2$

the corresponding dispersion around the regression line (the errors left after using x to predict y) is:

$\sum_i (y_i - \hat{y}_i)^2$

comparing the two gives:

$r^2 = \dfrac{\sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

coefficient of determination (r2)
describes the proportion of variation that is explained or accounted for by the regression line
r2 = .5:
half of the variation is explained by the regression
half of the variation in y is explained by variation in x
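a sketch of the same idea in Python, given a slope and intercept from a least squares fit:

def r_squared(x, y, a, b):
    # compare prediction errors around the line with those around y-bar
    ybar = sum(y) / len(y)
    ss_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_mean = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_line / ss_mean     # proportion of variation explained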

correlation and percentages


much of what we want to learn about association between variables can be learned from counts
ex: are high counts of bone needles associated with high counts of end scrapers?

sometimes, similar questions are posed of percent-standardized data


ex: are high proportions of decorated pottery associated with high proportions of copper bells?

caution
these are different questions and have different implications for formal regression
percents will show at least some level of correlation even if the underlying counts do not:
spurious correlation (negative)
closed-sum effect

case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
1      15    14    94    89    73     7    86    26    28    22
2      35     1    58    98    59    95    31    52    15     5
3      20    96    66    75    99    61    76    23    90    33
4      23    59    97    11    77    21    32    62    13    77
5      36    90    65    83    54    68    23    48    30    94
6      79     2     8    14    74    71    52    74    69    95
7      40    99     5     3    97     9    60    35    41    44
8      95    36    22    58     5    16    10    27    85    57
9      27     0    34    13    63    74   100    43    95    43
10     67    93    27    90     3    87    36    68    75    48

[figure: histograms of pairwise r (-1.0 to 1.0) for the original counts and for percents computed over 10, 5, 3, and 2 variables]

[figure: scatterplots of the second variable against the first for the original counts (C_V1, C_V2) and for percents based on 10 (P10_V1, P10_V2), 5 (T5_V1, T5_V2), 3 (T3_V1, T3_V2), and 2 (T2_V1, T2_V2) variables]
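the closed-sum effect is easy to reproduce by simulation; a sketch (numpy assumed; the slides' histograms summarize many r values, this draws a single random dataset):

import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(1, 101, size=(10, 10))   # 10 cases x 10 unrelated count variables

def r12(data):
    # correlation between the first two variables
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

print("counts: r =", round(r12(counts), 2))
for k in (10, 5, 3, 2):
    sub = counts[:, :k].astype(float)
    pct = 100 * sub / sub.sum(axis=1, keepdims=True)   # close the sum to 100%
    print(f"percents ({k} vars.): r =", round(r12(pct), 2))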

regression assumptions
both variables are measured at the interval scale or above
variation is the same at all points along the regression line (variation is homoscedastic)

residuals
vertical deviations of points around the regression
for case i, residual = yi - ŷi [= yi - (a + bxi)]
residuals in y should not show patterned variation either with x or ŷ
normally distributed around the regression line
residual error should not be autocorrelated (errors/residuals in y are independent)
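continuing the hypothetical x, y, a, b from the least squares sketch above, the residuals themselves are simple to compute; patterning is then usually checked visually:

# residuals: observed y minus predicted y-hat
y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# e.g. plot residuals against y-hat and look for structure:
# import matplotlib.pyplot as plt
# plt.scatter(y_hat, residuals); plt.axhline(0); plt.show()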

standard error of the regression

recall: the standard error of an estimate (SEE) is like a standard deviation
we can calculate an SEE for the residuals associated with a regression formula:

$SEE = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

to the degree that the regression assumptions hold, there is a 68% probability that true values of y lie within 1 SEE of ŷ
95% within 2 SEE
can plot lines showing the SEE: ŷ = a + bx ± SEE
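again using the hypothetical fit from the earlier sketch, the SEE and a one-SEE band could be computed like this (the n - 2 divisor follows the formula above):

from math import sqrt

n = len(x)
see = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# rough 68% band around the fitted line: y-hat +/- 1 SEE
upper = [a + b * xi + see for xi in x]
lower = [a + b * xi - see for xi in x]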

data transformations and regression


read Shennan, Chapter 9 (esp. pp. 151-173)

[figure: scatterplots of VAR2 against VAR1 and against the transformed variable VAR1T]

let VAR1T = sqr(VAR1)   (i.e., the square root of VAR1)

distribution and fall-off models
ex: density of obsidian vs. distance from the quarry:
[figure: scatterplot of DENSITY against DIST]

Plot of Residuals against Predicted Values
[figure: RESIDUAL plotted against ESTIMATE (the predicted values)]

LG_DENS = log(DENSITY)
[figure: scatterplots of DENSITY and of LG_DENS against DIST]

y = 1.70-.05x [remember y is logged density]
[figure: LG_DENS against DIST with the fitted regression line]

log y = 1.70-.05x
[figure: DENSITY against DISTANCE with the back-transformed (exponential) fall-off curve]

fplot y = exp(1.70-.05*x)

begin PLOT DENSITY*DISTANCE / FILL=1,0,0

fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL='' XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0 XMAX=80 YMIN=0 YMAX=6
end
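the commands above are SYSTAT; purely as an illustration, a rough numpy/matplotlib equivalent might look like this (the dist and dens arrays below are placeholders, not the actual obsidian data):

import numpy as np
import matplotlib.pyplot as plt

dist = np.array([5, 10, 20, 30, 40, 55, 70], dtype=float)   # placeholder distances
dens = np.array([4.8, 3.5, 2.1, 1.3, 0.8, 0.4, 0.2])        # placeholder densities

b, a = np.polyfit(dist, np.log(dens), 1)    # regress log(density) on distance
print(a, b)                                 # the slides' fit was log y = 1.70 - .05x

xs = np.linspace(0, 80, 200)
plt.scatter(dist, dens)
plt.plot(xs, np.exp(a + b * xs))            # back-transformed exponential fall-off curve
plt.xlabel("DISTANCE"); plt.ylabel("DENSITY")
plt.show()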

transformation summary
correcting left skew: x² (mild), x³ (strong), x⁴ (stronger)
correcting right skew: log(x) (mild), -1/x (strong), -1/x² (stronger)
(x itself: weak / no correction)
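as a sketch, one can try rungs of this ladder on a skewed variable and check the skewness statistic (numpy/scipy assumed; the lognormal variable here is artificial):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # an artificially right-skewed variable

print("x:      ", round(skew(x), 2))               # strong right skew
print("log(x): ", round(skew(np.log(x)), 2))       # 'mild' rung of the ladder
print("-1/x:   ", round(skew(-1 / x), 2))          # 'strong' rung
print("-1/x^2: ", round(skew(-1 / x**2), 2))       # 'stronger' rung (can overshoot into left skew)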
