Biostat HW 09MS009

LS321 - Take Home Assignment
ARITRA KR. MUKHOPADHYAY (09MS009) 3RD YEAR DEPARTMENT OF PHYSICAL SCIENCES (IISER-K) April 13, 2012
Software used: OCTAVE and QTIPLOT. The program codes are all attached to the email as a .tar.gz file with the assignment. Only some relevant portions of the codes (not the entire ones) are provided here.
Answer 1:
A program written in OCTAVE draws 50 normally distributed random samples, each having 136 members, from a population with mean 170 cm and standard deviation 8.2 cm. The mean of each sample is computed and a histogram of the means is plotted in QTIPLOT. I have drawn two histograms, one with step size 0.4 cm and the other with 0.3 cm. The better fit comes from the one with step size 0.3 cm, with an adjusted R^2 value of 0.944 (R^2 = 1 - SSE/TSE, here adjusted for the degrees of freedom). This is close to 1, indicating that the SSE (sum of squares of errors) is very small compared with the TSE (total sum of squares), hence a good fit. The distribution of the sample means is in accordance with the Central Limit Theorem, with most of the samples having a mean quite close to the population mean of 170 cm (from the plot the mean comes to 169.79 cm). The source code is pasted here:

    x = zeros(136, 50); % defining a 136 x 50 zero matrix
    % Aim: each of the 50 columns corresponds to one sample and has 136 elements
    % (so 136 rows), chosen randomly so that they are consistent with the
    % population parameters mean = 170 cm and standard deviation = 8.2 cm.
    % The function randn generates normally distributed random numbers
    % (mean = 0, s.d. = 1), so I transform it to s.d.*randn + mean to generate
    % normally distributed samples with mean = 170 and s.d. = 8.2.
    for i = 1:50
        x(:, i) = 8.2 * randn(136, 1) + 170;
    end
    m = mean(x); % calculates the mean of each column (each sample) and returns an array of 50 means
    dlmwrite("m.dat", m, "\n"); % exports the array of means into a data file m.dat

The graphs are shown below:
Histogram & normal fit of the means (step size 0.4 cm)
[Figure: histogram of Counts against Means (cm.) with the fitted normal curve]
Mean = 169.7920166847662 Standard Deviation = 1.539216886766 Chi^2/doF = 3.119772650346584e+00 R^2 = 0.912529738775329 Adjusted R^2 = 0.842553529795593

Histogram & normal fit of the means (step size 0.3 cm)
[Figure: histogram of Counts against Means (cm.) with the fitted normal curve]
Mean = 169.7949254594500 Standard Deviation = 1.45410560771153 Chi^2/doF = 6.264486427601768e-01 R^2 = 0.961668352278661 Adjusted R^2 = 0.94463206440251
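As a quick cross-check of the Central Limit Theorem claim in Answer 1, the spread of the 50 sample means can be compared with the theoretical standard error sigma/sqrt(n). The following is a minimal sketch and not part of the submitted program; the variable names and the fixed seed are assumptions chosen for illustration.

```octave
% Sketch: compare the spread of simulated sample means with sigma/sqrt(n).
mu = 170; sigma = 8.2;          % population parameters from the assignment (cm)
n = 136; k = 50;                % sample size and number of samples
randn("state", 42);             % fixed seed (an arbitrary choice) for repeatability
x = mu + sigma * randn(n, k);   % each column is one sample
m = mean(x);                    % row vector of the k sample means
se = sigma / sqrt(n);           % theoretical standard error of the mean
printf("theoretical SE = %.4f cm, observed SD of means = %.4f cm\n", se, std(m));
```

With only 50 means the observed value fluctuates around the theoretical one from run to run. A width reported by a curve fit to a binned histogram depends on the bin size and on how the fitting program defines its width parameter, so it need not match this number exactly.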
Answer 2:
First of all, the sample size of 7 is perhaps too small to comment on a good fit. The mean and variance of the sample come to 78.286 and 159.90, which are clearly not equal, whereas equality of the mean and the variance is a criterion for the Poisson distribution. So a priori the data do not seem to fit a Poisson distribution. But I draw 10,000 random Poisson samples, each having 7 units, with the mean set to that of the given sample, 78.286. Since the sample mean and the sample variance are unbiased estimators of the population mean and the population variance, their averages over the simulated samples are calculated and come to 78.277 and 78.521 (almost equal, as they should be for Poisson). But the variance of the sample variances is large at 2102.9, which corresponds to a standard deviation of about 45.9. The variance of the given sample, 159.90, therefore lies only about 1.8 such standard deviations above the expected 78.521, which is not extreme for samples of size 7, so we can conclude that the data can be modelled by a Poisson distribution. If the number of sample points were large, one could have done a Poisson fit and obtained a chi-square goodness-of-fit result, which would have been more accurate. The code is pasted here:

    x = [87, 53, 72, 90, 78, 85, 83]; % given data
    M = mean(x); % mean of the given data
    V = var(x); % variance of the data
    y = zeros(7, 10000);
    for i = 1:10000
        y(:, i) = randp(M, 7, 1); % generating 10,000 Poisson distributed samples having 7 elements each, with the mean set to the mean M of the given data
    end
    m = mean(y); % array of the mean of each of the samples
    v = var(y); % array of the variance of each of the samples
    em = sum(m) / 10000 % expectation of sample means
    ev = sum(v) / 10000 % expectation of sample variances
    vv = var(v) % variance of sample variances
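One way to make the comparison in Answer 2 more direct is to ask how often a simulated Poisson sample of 7 values has a variance at least as large as the observed one. The sketch below is an illustrative addition, not part of the submitted code; the name p_upper and the fixed seed are assumptions.

```octave
% Sketch: Monte Carlo tail probability for the observed sample variance.
x = [87, 53, 72, 90, 78, 85, 83];   % given data
M = mean(x);  V = var(x);           % observed mean and variance
nsim = 10000;
randp("state", 1);                  % fixed seed (arbitrary) for repeatability
y = randp(M, numel(x), nsim);       % each column: one Poisson sample of 7 values
v = var(y);                         % variance of each simulated sample
p_upper = mean(v >= V);             % fraction of simulated variances >= observed
printf("observed variance = %.2f, fraction at least as large = %.4f\n", V, p_upper);
```

A very small fraction would count against the Poisson model and a moderate one would not; with only 7 observations the check has little power either way.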
Answer 3:
The two graphs showing the change of the 5% critical value of the t-distribution are attached here: one a normal (linear) plot and the other a log-log plot. The critical value decreases as the degrees of freedom increase. As can be seen from the plot, the value decreases rapidly between 1 and about 10 degrees of freedom, then the slope flattens and the value ultimately saturates. The log-log plot is more informative, since it shows clearly both the fall at small degrees of freedom and the fall at the largest values close to 100, whereas the normal plot only gives an idea of the steep fall in the beginning and the slow decrease at the end. The relevant portion of the code and plots are attached here:

    x = 1:100; % generate an array of degree-of-freedom values
    t = tinv(0.95, x); % generate the 5% critical values of the t statistic for the degrees of freedom in x
    % In Octave, to get the y% critical value one needs to pass (1 - y/100)
    % as the first argument of tinv, hence the argument 0.95.
    plot(x, t, "r"); % plots the change of the critical value with degrees of freedom
    figure; % new window, so the second plot does not overwrite the first
    plot(log10(x), log10(t), "r"); % creates a log-log plot of the same
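The saturation described in Answer 3 is towards the corresponding quantile of the standard normal distribution. A minimal sketch of that comparison follows; it assumes the statistics package is installed (on recent Octave versions tinv is provided there), and the chosen degrees of freedom are just examples.

```octave
% Sketch: the 5% critical value of t approaches the normal quantile.
pkg load statistics                 % assumption: needed for tinv on recent Octave
dof = [1, 10, 100];
tc = tinv(0.95, dof);               % one-sided 5% critical values
z = sqrt(2) * erfinv(2 * 0.95 - 1); % 95% quantile of the standard normal
for j = 1:numel(dof)
  printf("dof = %3d   t = %.4f\n", dof(j), tc(j));
end
printf("normal limit  z = %.4f\n", z);
```

The normal quantile is computed from erfinv so that only the t quantiles depend on the package.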
Answer 4:
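A simple regression model is defined by

$$ Y_i = a + bX_i + \epsilon_i \tag{1} $$

where the $Y_i$ are the response values, the $X_i$ the predictor values, $a$ and $b$ are the regression parameters and the $\epsilon_i$ are the errors. For least squares estimation the target is to minimise the sum of the squares of these individual errors with respect to the regression parameters. That is,

$$ \frac{\partial}{\partial a}\sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial a}\sum_{i=1}^{n}(Y_i - a - bX_i)^2 = 0 \tag{2} $$

$$ \frac{\partial}{\partial b}\sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial b}\sum_{i=1}^{n}(Y_i - a - bX_i)^2 = 0 \tag{3} $$

The first equation gives

$$ \sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \tag{4} $$

$$ a = \bar{Y} - b\bar{X} \tag{5} $$

and the second one yields the equations

$$ \sum_{i=1}^{n} X_i(Y_i - a - bX_i) = 0 \tag{6} $$

$$ \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - a - bX_i) + \bar{X}\sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \tag{7} $$

$$ \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y} + b\bar{X} - bX_i) = 0 \tag{8} $$

$$ \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) - \sum_{i=1}^{n} b(X_i - \bar{X})^2 = 0 \tag{9} $$

$$ b = \frac{\mathrm{Cov}(X, Y)}{V(X)} \tag{10} $$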
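The closed forms (5) and (10) can be checked numerically against a direct least-squares solve. The following is a minimal sketch on a small hypothetical data set (the values of X and Y are made up for illustration), using only core Octave functions.

```octave
% Sketch: check a = Ybar - b*Xbar and b = Cov(X,Y)/V(X) against a direct solve.
X = [1; 2; 3; 4; 5];                 % hypothetical predictor values
Y = [2.1; 3.9; 6.2; 7.8; 10.1];      % hypothetical response values
Sxy = sum((X - mean(X)) .* (Y - mean(Y)));   % (n-1) * Cov(X, Y)
Sxx = sum((X - mean(X)) .^ 2);               % (n-1) * V(X)
b = Sxy / Sxx;                       % equation (10)
a = mean(Y) - b * mean(X);           % equation (5)
beta = [ones(size(X)), X] \ Y;       % least-squares solution of the same model
printf("closed form:  a = %.4f, b = %.4f\n", a, b);
printf("direct solve: a = %.4f, b = %.4f\n", beta(1), beta(2));
```

The common factor (n-1) cancels in the ratio, so the sums of products can be used in place of the covariance and variance themselves.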
For maximum likelihood estimation one assumes that the errors $\epsilon_i$ are independently and identically normally distributed with mean 0 and variance $\sigma^2$. Hence the $Y_i$ are also normally distributed, with mean $(a + bX_i)$ and variance $\sigma^2$. Under this assumption the joint pdf of the $Y_i$ is

$$ f(Y_1, \ldots, Y_n \mid a, b, \sigma^2) = \prod_{i=1}^{n} f(Y_i \mid a, b, \sigma^2) \tag{11} $$

$$ = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - a - bX_i)^2\right) \tag{12} $$

Taking the log of the likelihood function one has

$$ \log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - a - bX_i)^2 \tag{13} $$

The target is to maximise this with respect to the parameters $a$, $b$ and $\sigma^2$. Differentiating with respect to $a$ and $b$ and setting the results to zero gives the equations

$$ \sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \tag{14} $$

$$ \sum_{i=1}^{n} X_i(Y_i - a - bX_i) = 0 \tag{15} $$

which are exactly the same equations we obtained in the case of the least squares estimation of these parameters. Solving them therefore gives the same estimates as before. Hence the maximum likelihood estimates of the simple regression parameters are the same as their least squares estimates. (Proved)
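The log-likelihood (13) is also to be maximised over the variance. A sketch of that remaining step, using the same symbols as (13) and writing $\hat{a}$, $\hat{b}$ for the estimates already obtained (this hat notation is an assumption, not from the original):

```latex
\frac{\partial \log L}{\partial \sigma^2}
  = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - a - bX_i)^2 = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{a} - \hat{b}X_i)^2
```

This estimate divides by n rather than n - 2, so it is biased downward in finite samples; the estimates of a and b are unaffected by it.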
Answer 5:
To answer this question I have used a two-step procedure: 1.] ANOVA test 2.] t-test.

ANOVA: First, ANOVA was performed under the Null hypothesis that the mean weights of the control plants, those under treatment 1 and those under treatment 2 are all equal. The Alternate was that one or more of the means differ. The test was carried out at the 5% significance level (95% confidence) and the Null was rejected. Hence the means are not all the same, and the treatments DO HAVE SOME EFFECT on the plant weights.

t-tests: Secondly, it was to be decided what type of influence each treatment has. For this, two t-tests were performed to compare 1.] the means of the control and treatment 1 and 2.] the means of the control and treatment 2. Both tests were performed at the 5% significance level (95% confidence). The lecture notes of Dr. Partha Sarathi Mazumdar were used for the relevant formulas.

RESULTS FOR t-TEST for treatment 1
[A] Null hypothesis: the mean weights of the control and treatment-1 groups are equal. Alternate hypothesis: the plants under treatment 1 have decreased mean weights. Null Hypothesis accepted at 95% confidence limit.
[B] Null hypothesis: the mean weights of the control and treatment-1 groups are equal. Alternate hypothesis: the plants under treatment 1 have increased mean weights. Null Hypothesis accepted at 95% confidence limit.

RESULTS FOR t-TEST for treatment 2
[A] Null hypothesis: the mean weights of the control and treatment-2 groups are equal. Alternate hypothesis: the plants under treatment 2 have decreased mean weights. Null Hypothesis accepted at 95% confidence limit.
[B] Null hypothesis: the mean weights of the control and treatment-2 groups are equal. Alternate hypothesis: the plants under treatment 2 have increased mean weights. Null Hypothesis rejected at 95% confidence limit.

CONCLUSION: Treatment 1 does not affect the mean weights of the plants, whereas treatment 2 seems to increase the mean weights of the plants.

I am pasting the relevant part of the codes here, the rest being similar:

    load plant.dat; % this has the data in three columns, treated as control, treatment 1 and treatment 2 respectively
    x = plant;
    [p, f, dfb, dfw] = anova(x)
    % ANOVA is performed on the three groups. p gives the 1 - CDF value of the
    % F-distribution (the p-value), f gives the observed value of the F statistic,
    % dfb the degrees of freedom corresponding to the between-group variance (SSB)
    % and dfw the degrees of freedom corresponding to the within-group variance (SSW).
    if f > finv(0.95, dfb, dfw)
        disp("Null Hypothesis rejected at 95% confidence limit")
    else
        disp("Null Hypothesis accepted at 95% confidence limit")
    end

    % Defining a general function for the two t-tests. t gives the observed value
    % of the t statistic and df the degrees of freedom of the t-distribution that
    % the test statistic follows. It is assumed that the group variances are
    % UNEQUAL, hence Satterthwaite's approximate df value is computed
    % (formula from PSM's lecture slides).
    function [t, df] = ttest(y, z)
        ny = length(y);
        nz = length(z);
        my = mean(y);
        mz = mean(z);
        t = (my - mz) / sqrt(var(y)/ny + var(z)/nz);
        c = (var(y)/ny) / (var(y)/ny + var(z)/nz);
        dfn = (ny - 1) * (nz - 1) / ((nz - 1) * c^2 + (ny - 1) * (1 - c)^2);
        df = round(dfn);
    end

    [t1, df1] = ttest(x(:, 1), x(:, 2))
    if t1 > tinv(0.95, df1)
        disp("Null Hypothesis rejected at 95% confidence limit")
    else
        disp("Null Hypothesis accepted at 95% confidence limit")
    end
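The degrees-of-freedom expression in the function above, written in terms of c, is algebraically the same as the more commonly quoted Satterthwaite form. A minimal sketch checking this on hypothetical weights (the two data vectors are made up and are not the plant data) follows.

```octave
% Sketch: Welch t statistic and Satterthwaite degrees of freedom, two ways.
y = [4.2, 4.8, 5.1, 4.5, 4.9];        % hypothetical control weights
z = [5.0, 5.9, 6.4, 5.2, 6.1, 5.6];   % hypothetical treatment weights
ny = numel(y);  nz = numel(z);
sy = var(y) / ny;  sz = var(z) / nz;  % squared standard errors of the two means
t = (mean(y) - mean(z)) / sqrt(sy + sz);
% Commonly quoted form of the Satterthwaite approximation.
df_std = (sy + sz)^2 / (sy^2 / (ny - 1) + sz^2 / (nz - 1));
% Form used in the function above, written in terms of c.
c = sy / (sy + sz);
df_c = (ny - 1) * (nz - 1) / ((nz - 1) * c^2 + (ny - 1) * (1 - c)^2);
printf("t = %.4f, df (standard) = %.4f, df (c form) = %.4f\n", t, df_std, df_c);
```

The approximate df always lies between min(ny, nz) - 1 and ny + nz - 2, so rounding it to an integer, as the function does, changes the critical value only slightly.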
Answer 6:
A linear regression model was fitted to the data with record time as the dependent variable and distance and climb as the independent ones (in this order), separately for males and females. The regression coefficients for the males were 1.6554 × 10^-01 and 4.4466 × 10^-05 respectively, and for the females 2.9002 × 10^-01 and -1.1677 × 10^-04 respectively.

Observation: This shows that record times for males increase with increasing distance and climb (as the regression coefficients are positive), which agrees with our general notion. But for females the record times decrease with climb, although they increase with distance.

Transformation: The data was transformed and the record time was regressed on the logarithm of distance and climb so that the coefficients can be compared easily. The coefficients for the males turned out to be 0.79306 and 0.31727. For females they were 0.79124 and 0.31602. This is reasonable and shows that times do increase with both distance and climb, for both males and females.

Presence of outlier: But a glance at the plots of the data shows that one of the values is quite far from the trend for both the males and the females. This is the 19th data point (46 7500 8.3069 13.5478, that is distance, climb, male time and female time). Also, the regression coefficient corresponding to climb was negative for females, which is counter-intuitive. So naturally one suspects that this may be an outlier. I have repeated the same program with a modified data set without this data point.

Observation: The results were promising. The regression coefficients for the males were 9.4252 × 10^-02 and 1.7034 × 10^-04. For females they were 1.1061 × 10^-01 and 2.0217 × 10^-04. In the log-transformed case, the regression coefficients for the males were 0.75692 and 0.31848, and for females 0.72183 and 0.31874.

Final Conclusions: After the outlier elimination it is found that the record times for both males and females increase with distance and climb (on account of the positive regression coefficients, hence positive slope). The R^2 value and the F statistic were also compared, and the analysis without the outlier gave better results. The details are in the program files prb6.m and prb6a.m. Comparing the coefficients between males and females, it is found that females take more time to cover a given distance and climb.

The relevant portion of the code is given here:

    load hills.dat % data file with the outlier
    x = hills; % matrix with 1st column distance, 2nd column climb, 3rd column time for males and 4th column time for females
    [b1, bint1, r1, rint1, stats1] = regress(x(:, 3), [ones(size(x(:, 1))), x(:, 1), x(:, 2)]);
    % fits a regression of the time (males) on the distance and climb
    % b1 is the beta matrix (or the regression coefficients) in the model
    % bint1 is the confidence interval for b
    % r1 is a column vector of residuals
    % rint1 is the confidence interval for r
    % stats1 is a row vector containing:
    %   o The R2 statistic
    %   o The F statistic
    %   o The p value for the full model
    %   o The estimated error variance

The relevant graphs are attached here:
Figure 1: Data with outlier
Figure 2: Data with outlier
Figure 3: Data without outlier
Figure 4: Data without outlier
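The kind of fit used in Answer 6 can be sketched without the data file or the regress function. The example below uses entirely hypothetical distances, climbs and times (none of these numbers come from hills.dat) and solves the least-squares problem with the backslash operator from core Octave.

```octave
% Sketch: multiple regression by least squares on hypothetical data.
dist  = [2.5; 6; 6; 7.5; 8; 16];               % hypothetical distances
climb = [650; 2500; 900; 800; 3070; 3000];     % hypothetical climbs
noise = 0.01 * [1; -1; 1; -1; 1; -1];          % small made-up disturbance
time  = 0.05 + 0.1 * dist + 2e-4 * climb + noise;   % made-up record times
A = [ones(size(dist)), dist, climb];           % design matrix with intercept
b = A \ time;                                  % intercept, distance, climb coefficients
r = time - A * b;                              % residuals
R2 = 1 - sum(r .^ 2) / sum((time - mean(time)) .^ 2);
printf("coefficients: %.4g %.4g %.4g, R^2 = %.4f\n", b(1), b(2), b(3), R2);
```

Dropping a suspected outlier amounts to deleting the corresponding row of A and of time before solving, which is how the two analyses in the text can be compared.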
