
Research Method

1. Obtain information from the BNM website.
2. Obtain the raw data for the KLCI and the International Reserve (IR).
3. Transfer the data into Microsoft Excel (.xls file format).
4. Load the data into SAS.
5. From the SAS application, obtain the regression output and the correlation graph.
6. Analyze and interpret the results obtained.

All of the methods are carried out using SAS:

• Sample Data Series (1999-2009)

• OLS Regression

• Normality Test
Time Series Analysis Using SAS®

The purpose of this series is to discuss SAS programming techniques specifically designed to simulate the steps involved in time series data analysis. Time series data analysis has applications in many areas, including studying the relationships between wages and house prices, profits and dividends, and consumption and GDP.

Many analysts erroneously use the framework of linear regression (OLS) models to predict change over time or to extrapolate from present conditions to future conditions. Extreme caution is needed when interpreting the results of regression models estimated using time series data. Statisticians and analysts working with time series data uncovered a serious problem with standard analysis techniques applied to such data: estimating the parameters of an Ordinary Least Squares (OLS) regression can produce statistically significant results between time series that merely contain a trend and are otherwise unrelated (spurious regression).

Time series datasets are different from ordinary datasets in that their observations are recorded sequentially over equal time increments (daily, weekly, monthly, quarterly, annually, etc.).

There are two sets of conditions under which much of the theory is built:

• Stationary process

• Ergodicity

However, the notion of stationarity must be expanded to distinguish two important ideas: strict stationarity and second-order stationarity. Both models and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly specified. In addition, time series analysis can be applied where the series are seasonally stationary or non-stationary.

Time series stationarity is a statistical characteristic of a series' mean and variance over time. If both are constant over time, then the series is said to be a stationary process; otherwise, the series is described as a non-stationary process.
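Although the original study does not show it, a stationarity check could be run in SAS before the regression. The following is a minimal sketch, assuming SAS/ETS is licensed and that the monthly data have been loaded into a dataset named WORK.SERIES with a numeric variable KLCI (both names are assumptions for illustration):

proc arima data=series;
   /* augmented Dickey-Fuller unit-root tests at AR lags 0, 1 and 2 */
   identify var=klci stationarity=(adf=(0,1,2));
run;
quit;

A small ADF p-value would suggest the series is stationary; a large p-value would point to a unit root (non-stationarity).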

Ordinary Least Square Regression (OLS)

OLS regression is a technique for understanding the relationship between an outcome variable (also called the dependent variable) and predictor variables (also called independent variables). In this research, the dependent variable is the KLCI and the independent variable is the IR.

Regression analysis is a statistical technique that can be used to estimate relationships between variables. It also measures the changes in one variable that result from changes in other variables. Regression analysis is used frequently in attempts to identify the variables that affect a certain stock's price.

Regression models are commonly applied for planning and forecasting. Financial data, however, because of the trend, cyclical, and irregular components in its history, might not fit well with traditional regression models. In such cases, an autoregression model or a dynamic regression model is the better option.
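As a hedged illustration of the dynamic-regression alternative mentioned above (not part of the original analysis), PROC AUTOREG in SAS/ETS can fit the same KLCI-on-IR regression with an autoregressive error term; the dataset name WORK.SERIES is an assumption:

proc autoreg data=series;
   /* regression of KLCI on IR with an AR(1) correction for autocorrelated
      errors; DWPROB also prints the Durbin-Watson p-value */
   model klci = ir / nlag=1 method=ml dwprob;
run;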

Applications of regression analysis exist in almost every field. In economics, the


dependent variable might be a family's consumption expenditure and the independent
variables might be the family's income, number of children in the family, and other
factors that would affect the family's consumption patterns. In political science, the
dependent variable might be a state's level of welfare spending and the independent
variables measures of public opinion and institutional variables that would cause the
state to have higher or lower levels of welfare spending. In sociology, the dependent
variable might be a measure of the social status of various occupations and the
independent variables characteristics of the occupations (pay, qualifications, etc.). In
psychology, the dependent variable might be an individual's racial tolerance as measured on a standard scale, with indicators of social background as the independent variables.
In education, the dependent variable might be a student's score on an achievement
test and the independent variables characteristics of the student's family, teachers,
or school.

The regression equation deals with the following variables:

• The unknown parameters, denoted β. This may be a scalar or a vector of length k.

• The independent variables, X.

• The dependent variable, Y.

The regression equation is a function of the variables X and β.

When there is more than one independent variable, multiple linear regression is the appropriate technique. The linear regression model is a very useful tool for prediction, but it is also strict, requiring several conditions to be met. The first important one is the sample size. How big? It depends on a number of factors, including the desired power, alpha level, number of predictors, and expected effect size. As a rule of thumb, the bigger the sample size, the better the model will be, if processing time is ignored.
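For illustration only: if a second predictor were available (here a hypothetical variable X2, which is not part of this study), the PROC REG call would simply list both regressors on the MODEL statement:

proc reg data=series;
   /* multiple linear regression: KLCI on IR and a hypothetical X2 */
   model klci = ir x2;
run;
quit;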

Normality Test

In statistics, normality tests are used to determine whether a random variable is normally distributed.

An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution.

Tests are usually evaluated and compared in terms of their power. The power of a test is the probability that the test rejects the null hypothesis. Of course, we want to reject the null hypothesis when it is not true. A test is said to be more powerful than another when it has a higher probability of finding out that the null hypothesis is not true. In the specific case of tests of normality, one test is said to be more powerful than another when it has a higher probability of rejecting the hypothesis of normality when the distribution is not normal. To make a fair comparison, we want all tests to have the same probability of rejecting the null hypothesis when the distribution is truly normal (i.e., they have to have the same α or significance level). Such studies have been done for a wide variety of distributions.

A normal distribution has several characteristics, some of them are:

a) The form of f(x) and F(x) (distribution function) are known

b) The normal distribution has (Pearson's) skewness and kurtosis equal to 0 and 3
respectively.

c) If y ~ N(µ, σ²), then y can be written as y = µ + σx where x ~ N(0, 1).

Those who have defined tests for normality have focused on one of these characteristics. Tests for normality differ in terms of:

a) the characteristic of the normal distribution they are based on

b) the complexity of their calculation


c) the way they approach the comparison of the empirical distribution with the normal distribution: "summarize and compare" or "compare and summarize the comparison"

d) the distribution of the statistic of the test (some use a common distribution such
as the Chi-square or the normal and some others have ad-hoc distributions)

The main types of tests for normality are:

a) Empirical distribution function (EDF) tests

The Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (EDF). EDF tests compare the EDF with the CDF of the normal distribution, and the normality hypothesis is rejected if the difference between the EDF and the normal CDF is too large.

The two most popular EDF tests are the one defined by Kolmogorov (1933), which summarizes the comparison through the maximum difference, and the one by Anderson and Darling (1954), which involves a combination of all the differences. The second is generally more powerful than the first.

(Figure: empirical distribution function F(x), ranging from 0.0 to 1.0, plotted against x from -3 to 3.)
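Although the EDF tests are not run in this study, they are readily available in Base SAS. A minimal sketch, assuming the WORK.SERIES dataset used earlier:

proc univariate data=series normal;
   /* the NORMAL option prints the Shapiro-Wilk, Kolmogorov-Smirnov,
      Cramer-von Mises and Anderson-Darling statistics with p-values */
   var klci;
run;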
b) Skewness and kurtosis tests

There are several ways of measuring skewness and kurtosis, but the most well known are Pearson's (1905) skewness and kurtosis. Many tests have been defined using Pearson's skewness and kurtosis statistics. Skewness is 0 for the normal distribution, and Pearson's kurtosis is 3 for the normal distribution. So these tests basically calculate the skewness and kurtosis for the sample and compare them to the values 0 and 3. A statistic is calculated based on that comparison, and a p-value is found using a given distribution. The most recent test based on Pearson's skewness and kurtosis can be found in D'Agostino et al. (1990); it involves complicated transformations and uses the Chi-square distribution to find the p-values.

• If the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.

• If the excess kurtosis (which measures the "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the excess kurtosis of the normal distribution is 0 (Pearson's kurtosis is 3).
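As a quick sketch (again assuming a WORK.SERIES dataset), the sample skewness and kurtosis can be obtained with PROC MEANS; note that SAS reports excess kurtosis, so values near 0 are consistent with normality:

proc means data=series n mean std skewness kurtosis;
   /* SKEWNESS and KURTOSIS request the sample shape statistics;
      SAS kurtosis is excess kurtosis (normal = 0, not 3) */
   var klci;
run;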

c) Regression and correlation tests

Regression and correlation tests are naturally associated with probability plots, which can help the trained eye understand why the null hypothesis is rejected. Regression tests focus on the slope of the line when the order statistics of the sample are plotted against their expected values under normality. The most well known of the regression tests is the one defined by Shapiro and Wilk (1965).

Correlation tests focus on the strength of the linear relationship between the order statistics and their expected values under normality, but they are generally not as powerful as the regression tests.
d) Other special tests

There are some other tests, some of recent appearance, that cannot be classified into the previous three types, but in general they are not more powerful and they are somewhat restrictive in their application.

Which are more powerful?

If the purpose of testing for normality is simply to determine whether the distribution is normal or not, regression tests that use the expected values of order statistics under normality seem to be the best option from the point of view of power against a wide diversity of alternatives. Among those tests, the Chen-Shapiro QH* test based on normal spacings seems the best option due to the simplicity of its calculation; however, it needs special critical values. Tests that are based on order statistics are sensitive to rounding, but there are ways of correcting for that.

If the purpose of the test is identifying symmetric distributions with high kurtosis, the skewness-kurtosis tests have higher power. The way in which kurtosis is measured makes a difference not only in the power of the test against different types of distributions, but also in the way in which omnibus tests can be defined. Using kurtosis statistics different from Pearson's kurtosis, it is possible to define simpler omnibus tests that do not involve complicated transformations of the skewness and kurtosis statistics and whose distribution is quite close to the Chi-square distribution. These tests do not perform well at detecting distributions with kurtosis lower than the normal, but they have higher power when the distribution is more peaked than the normal.
Findings

• Sample Data series

An example of a time series dataset (raw data) is illustrated below. It covers January 1999 to January 2009 (see Appendix).

Time             KLCI (monthly)    International Reserve (IR)
1999  Jan.             591.43            106,214.8
      Feb.             542.23            109,023.9
      Mar.             502.82            105,266.3
      Apr.             674.96            108,672.2
      May              743.04            113,144.7
      Jun.             811.10            118,293.5
      Jul.             768.69            120,378.4
      Aug.             767.06            122,874.5
      Sep.             675.45            119,254.5
      Oct.             742.87            114,789.3
      Nov.             734.66            114,572.5
      Dec.             812.33            117,243.5

Each of KLCI and IR is called a series, while the combination of the two variables YEAR and MONTH represents the sequential, equal time increments. In this study, the dependent variable is the Kuala Lumpur Composite Index (KLCI) and the independent variable is the International Reserve (IR).
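A hedged sketch of how the Excel file could be read into SAS, assuming the XLS import engine is available; the file name and output dataset name below are assumptions for illustration, not taken from the study:

proc import datafile="klci_ir.xls"   /* hypothetical file name */
            out=series               /* dataset used in the later steps */
            dbms=xls
            replace;
   getnames=yes;                     /* first row holds the column names */
run;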
• Regression Analysis

This section shows an example regression analysis with footnotes explaining the output. The data were collected on two variables, KLCI and IR, with values running from 1999 to 2009. The dependent variable is KLCI and the independent variable is IR. The PROC REG procedure is used to perform the regression analysis. On the MODEL statement, we specify the regression model that we want to run, with the dependent variable (in this case, KLCI) on the left of the equals sign and the independent variables on the right-hand side.

proc print;                        /* list the raw data */
run;

proc reg;
   model klci = ir;                /* regress KLCI on IR */
run;
quit;

symbol1 i=none v=diamond c=red;    /* red diamonds, no joining line */
proc gplot;
   plot klci*ir;                   /* scatter plot of KLCI against IR */
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: KLCI

Number of Observations Read    122
Number of Observations Used    122

                        Analysis of Variance

                                 Sum of         Mean
Source               DF         Squares       Square    F Value    Pr > F
Model                 1         4173325      4173325     283.16    <.0001
Error               120         1768622        14739
Corrected Total     121         5941948

Root MSE          121.40231    R-Square    0.7023
Dependent Mean    877.31639    Adj R-Sq    0.6999
Coeff Var          13.83792

                        Parameter Estimates

                         Parameter       Standard
Variable      DF          Estimate          Error    t Value    Pr > |t|
Intercept      1         479.08101       26.09385      18.36      <.0001
IR             1           0.00189     0.00011254      16.83      <.0001
Analysis of variance (ANOVA table)

Analysis of variance is a statistical method for comparing variables by partitioning the variance of the observations between the effects of the different variables and comparing it with the underlying random variation.

                        Analysis of Variance

                                 Sum of         Mean
Source               DF         Squares       Square    F Value    Pr > F
Model                 1         4173325      4173325     283.16    <.0001
Error               120         1768622        14739
Corrected Total     121         5941948

a. Source - This is the source of variance: Model, Residual (Error), and Total. The total variation is partitioned into the variation which can be explained by the independent variables (Model) and the variation which is not explained by the independent variables (Residual, sometimes called Error). Note that the sums of squares for the Model and Residual add up to the Total sum of squares, reflecting the fact that the total variation is partitioned into Model and Residual components.

b. DF - These are the degrees of freedom associated with the sources of variance. The total variance has N - 1 degrees of freedom. In this case there were N = 122 observations, so the DF for Total is 121. The Model degrees of freedom equal the number of estimated parameters minus 1 (k - 1), which is simply the number of predictors; in this case there is one independent variable in the model (International Reserve, IR), so the Model DF is 1. The Residual degrees of freedom are the Total DF minus the Model DF: 121 - 1 = 120.

c. Sum of Squares - These are the sums of squares associated with the three sources of variance: Total, Model, and Residual. They can be computed in many ways. Conceptually, the formulas can be expressed as:

SS Total - the total variability around the mean: ∑(Y − Ȳ)².

SS Residual - the sum of squared errors in prediction: ∑(Y − Ŷ)².

SS Model - the improvement in prediction from using the predicted value of Y over just using the mean of Y. Hence, this is the sum of squared differences between the predicted value of Y and the mean of Y: ∑(Ŷ − Ȳ)².

Another way to think of this is SS Model = SS Total − SS Residual.

**Note that SS Total = SS Model + SS Residual.

**Note that SS Model / SS Total is equal to 0.7023, the value of R-Square. This is because R-Square is the proportion of the variance explained by the independent variables, and hence can be computed as SS Model / SS Total.
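As a quick arithmetic check of these identities against the ANOVA table above (the printed sums of squares are rounded, so the decomposition holds to within rounding):

SS Model + SS Residual = 4,173,325 + 1,768,622 ≈ 5,941,948 = SS Total

R-Square = SS Model / SS Total = 4,173,325 / 5,941,948 ≈ 0.7023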

d. Mean Square - These are the Mean Squares, the Sum of Squares divided by their
respective DF. For the model, 4173325 / 1 =4173325. For the Residual, the values
are 1768622 / 120 = 14738.51667 or 14739. These are computed so we can compute
the F ratio, dividing the Mean Square Model by the Mean Square Residual to test the
significance of the predictors in the model.

e. F Value and Pr > F - The F value is the Mean Square Model (4173325) divided by the Mean Square Residual (14739), yielding F = 283.16. The p-value associated with this F value is very small (<.0001). These values are used to answer the question "Does the independent variable reliably predict the dependent variable?" The p-value is compared to our alpha level (typically 0.05) and, if smaller, we can conclude "Yes, the independent variable, IR, reliably predicts the dependent variable, KLCI". We could say that the international reserve, IR, can be used to reliably predict KLCI (the dependent variable).

** If the p-value were greater than 0.05, we would say that the independent variable does not show a statistically significant relationship with the dependent variable, or that it does not reliably predict the dependent variable.
Note that this is an overall significance test assessing whether the independent variables jointly predict the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable.

Overall Model Fit

Root MSE          121.40231    R-Square    0.7023
Dependent Mean    877.31639    Adj R-Sq    0.6999
Coeff Var          13.83792

f. Root MSE - Root MSE is the standard deviation of the error term, and is the square
root of the Mean Square Residual (or Error). √14739 = 121.40 or √14738.51667 =
121.4023

g. Dependent Mean - This is the mean of the dependent variable, KLCI.

h. Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the Root MSE divided by the mean of the dependent variable, multiplied by 100: 100 × (121.40231 / 877.31639) = 13.83792.

i. R-Square - R-Square is the proportion of variance in the dependent variable (KLCI) which can be predicted from the independent variable (IR). This value indicates that 70.23% of the variance in KLCI can be predicted from the variable IR. Note that this is an overall measure of the strength of the relationship and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

j. Adj R-Sq - Adjusted R-square. As predictors are added to the model, each predictor will explain some of the variance in the dependent variable simply due to chance. The adjusted R-square attempts to yield a more honest estimate of the R-square for the population. The value of R-square was 0.7023, while the value of adjusted R-square was 0.6999. Adjusted R-square is computed using the formula 1 − ((1 − R²)(N − 1) / (N − k − 1)). For this research, 1 − ((1 − 0.7023)(122 − 1) / (122 − 1 − 1)) ≈ 0.6998, which matches the reported 0.6999 once the unrounded R-square is used.

From this formula, we can see that when the number of observations is small and the number of predictors is large, there will be a much greater difference between R-square and adjusted R-square, because the ratio (N − 1) / (N − k − 1) will be much greater than 1. By contrast, when the number of observations is very large compared to the number of predictors, the values of R-square and adjusted R-square will be much closer, because the ratio (N − 1) / (N − k − 1) will approach 1.

Parameter estimate

                        Parameter Estimates

                         Parameter       Standard
Variable      DF          Estimate          Error    t Value    Pr > |t|
Intercept      1         479.08101       26.09385      18.36      <.0001
IR             1           0.00189     0.00011254      16.83      <.0001

k. Variable - This column shows the predictor variables (here, the intercept and International Reserve, IR). The first row (Intercept) represents the constant, the Y intercept, the height of the regression line when it crosses the Y axis. In other words, this is the predicted value of KLCI when the other variables are 0.

m. DF - This column gives the degrees of freedom associated with each term.

n. Parameter Estimates - These are the values for the regression equation for
predicting the dependent variable from the independent variable.

These estimates tell us about the relationship between the independent variables and
the dependent variable. These estimates tell the amount of increase in KLCI value
that would be predicted by a 1 unit increase in the predictor.
IR - The coefficient for IR is 0.00189 (parameter estimate). Hence, for every one-unit increase in IR we expect a 0.00189 increase in the KLCI value. This coefficient is statistically significant.

Note: For the independent variables which are not significant, the coefficients are
not significantly different from 0, which should be taken into account when
interpreting the coefficients. (See the columns with the t-value and p-value about
testing whether the coefficients are significant).

o. Standard Error - These are the standard errors associated with the coefficients. The standard error is used to test whether the parameter is significantly different from 0 by dividing the parameter estimate by the standard error to obtain a t value (see the columns with t values and p-values). The standard errors can also be used to form a confidence interval for the parameter (confidence limits are not shown in this output).

p. t Value and Pr > |t|- These columns provide the t-value and 2 tailed p-value used
in testing the null hypothesis that the coefficient/parameter is 0. If you use a 2
tailed test, then you would compare each p-value to your preselected value of alpha.
Coefficients having p-values less than alpha are statistically significant.

For example, if we choose alpha to be 0.05, coefficients having a p-value of 0.05 or less would be statistically significant (i.e., we can reject the null hypothesis and say that the coefficient is significantly different from 0).

The coefficient for IR is significantly different from 0 using an alpha of 0.05 because its p-value is <.0001, which is smaller than 0.05.

The constant (Intercept) is significantly different from 0 at the 0.05 alpha level. However, a significant intercept is seldom of interest in itself.

Before running the regression, we would like to have a close look at the relationship between the dependent variable and the independent variable. The sample SAS code for this is as follows:

symbol1 i=none v=diamond c=red;    /* red diamonds, no joining line */
proc gplot;
   plot klci*ir;                   /* scatter plot of KLCI against IR */
run;
quit;

This will draw a diagram with the independent variable measured along the horizontal axis, the dependent variable along the vertical axis, and a dot marking each observation. This is what is called a scatter diagram or scatter plot. The figure below clearly demonstrates a relationship between KLCI and IR, and we can also see that an increase in IR is accompanied by an increase in KLCI. However, this could be misleading, because the data we are dealing with are time series data, and one of the common violations of linear regression assumptions is autocorrelation.
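One hedged way to check for this in SAS (not done in the original study) is to request the Durbin-Watson statistic on the PROC REG MODEL statement, again assuming the WORK.SERIES dataset:

proc reg data=series;
   /* DW prints the Durbin-Watson statistic for first-order
      autocorrelation of the residuals */
   model klci = ir / dw;
run;
quit;

A Durbin-Watson value far below 2 would indicate positive autocorrelation in the residuals, which is common with trending time series.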
You should always keep in mind when you build a linear regression model that the
assumptions of a linear regression analysis must be met. These assumptions include

i) the mean of the response variable is linearly related to the value of the predictor
variable

ii) the observations are independent

iii) the error terms for each value of the predictor variable are normally distributed, and

iv) the error variances for each value of the predictor variable are equal.

Accordingly, we may encounter the following three common problems with regression, which would violate these assumptions (see the diagnostic sketch after this list):

i) correlated errors

ii) non-constant variance and

iii) influential observations.
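A sketch of how these problems could be screened for graphically, assuming a reasonably recent SAS release with ODS Graphics and the WORK.SERIES dataset assumed earlier:

ods graphics on;
proc reg data=series plots=diagnostics;
   /* produces the fit-diagnostics panel: residuals vs. predicted values,
      Q-Q plot of residuals, leverage/influence plots, etc. */
   model klci = ir;
run;
quit;
ods graphics off;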

• Normality Test

In this study, we do not perform a formal normality test in SAS. However, from the graph we can conclude that the correlation is positive and that the variable appears to be approximately normally distributed.
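If a formal normality check were added later, one hedged approach would be to save the residuals from PROC REG and test them with PROC UNIVARIATE (the dataset and variable names below are assumptions):

proc reg data=series;
   model klci = ir;
   output out=resids r=residual;   /* save residuals to WORK.RESIDS */
run;
quit;

proc univariate data=resids normal;
   var residual;                   /* normality tests on the residuals */
run;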

Discussion

In order to make an appropriate judgment about the distributional assumptions, we also need to look at diagnostic plots, which often provide a picture of the overall distribution to accompany the statistical tests. Combining graphical methods and test statistics will improve our judgment on the normality of the data. We can fairly easily automate the whole analysis process based on the results of individual normality tests. However, it is a challenge from a programming point of view to incorporate such complicated visual assessment into our programs.
