TYPES OF DATA
a. Time series - A time series is a set of observations on the values that a variable
takes at different times. Such data may be collected at regular time intervals, such as
daily.
b. Cross-section - Cross-section data are data on one or more variables collected at
the same point in time, such as the census of population. Cross-sectional data too
have their own problems, specifically the problem of heterogeneity.
c. Pooled - In pooled, or combined, data there are elements of both time series and
cross-section data.
Data mining is the process of finding anomalies, patterns and correlations within
large data sets to predict outcomes. With a broad range of techniques, this
information can be used to increase revenues, cut costs, improve customer
relationships, reduce risks and more.
Model selection is the task of selecting a statistical model from a set of candidate
models (given data). Once the set of candidate models has been chosen, the
statistical analysis allows us to select the best of these models.
R² criterion - R² is a statistical measure of how close the data are to the fitted
regression line. It is also known as the coefficient of determination, or the
coefficient of multiple determination for multiple regression.
R-squared is always between 0 and 100%. In general, the higher the R-squared, the
better the model fits the data.
a. R-squared cannot determine whether the coefficient estimates and predictions are
biased, which is why you must assess the residual plots.
A residual plot is a graph that shows the residuals on the vertical axis and the
independent variable on the horizontal axis. If the points in a residual plot are
randomly dispersed around the horizontal axis, a linear regression model is
appropriate for the data; otherwise, a non-linear model is more appropriate.
b. R-squared does not indicate whether a regression model is adequate. You can have
a low R-squared value for a good model, or a high R-squared value for a model that
does not fit the data!
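As a sketch of how R² is computed, the following uses made-up data (the line 2 + 0.5x plus noise is a hypothetical example, not from the notes): R² is one minus the residual sum of squares over the total sum of squares.

```python
import numpy as np

# Hypothetical data: y roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line by OLS and compute R-squared.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

ss_res = np.sum(residuals ** 2)          # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```

Plotting `residuals` against `x` gives exactly the residual plot described above: random scatter around zero supports the linear specification.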
The t-value measures the size of the difference relative to the variation in your
sample data. Put another way, T is simply the calculated difference represented in
units of standard error. The greater the magnitude of T, the greater the evidence
against the null hypothesis. This means there is greater evidence that there is a
significant difference. The closer T is to 0, the more likely there isn't a significant
difference.
The t-value in your output is calculated from only one sample from the entire
population.
Misuse of t-statistic:
1) If the sample size is small (less than 15), the one-sample t-test should not be used
when the data are clearly skewed or outliers are present. A nonparametric test can be
performed instead.
2) If a group of subjects receives one treatment and the same subjects later receive
another treatment, this calls for a paired t-test, not a two-sample t-test.
! T and P are inextricably linked. The larger the absolute value of the t-value, the
smaller the p-value, and the greater the evidence against the null hypothesis !
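A minimal sketch of the t-value as "difference in units of standard error", using a made-up sample and a hypothetical null mean of 5.0:

```python
import numpy as np

# Hypothetical sample; test H0: population mean equals 5.0.
sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.4, 5.0])
mu0 = 5.0

n = sample.size
mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# t = calculated difference expressed in units of standard error.
t_value = (mean - mu0) / std_err
print(round(t_value, 3))
```

A t-value this close to 0 gives little evidence against the null; the further it moves from 0 (in either direction), the smaller the p-value.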
Regression
In very general terms, regression is concerned with describing and evaluating
the relationship between a given variable and one or more other variables. More
specifically, regression is an attempt to explain movements in a variable by
reference to movements in one or more other variables.
Regression analysis is concerned with the study of the dependence of one variable,
the dependent variable, on one or more other variables, the explanatory variables,
with a view to estimating and/or predicting the (population) mean or average value
of the former in terms of the known or fixed (in repeated sampling) values of the
latter.
From an econometric point of view, the presence of a deterministic trend (linear or not)
in the explanatory variables does not raise any problem. But many economic and
business time series are nonstationary even after eliminating deterministic trends due
to the presence of unit roots, that is, they are generated by integrated processes.
When nonstationary time series are used in a regression model one may obtain
apparently significant relationships from unrelated variables. This phenomenon is
called spurious regression.
Granger and Newbold (1974) estimated regression models of the type:
Yt = a1 + a2Xt + ut (4.54)
where Yt and Xt are independent random walks. It should be noted that the
autocorrelation of the random walk Yt is projected into ut, which, being a random
walk as well, is also highly autocorrelated. Following these results they suggest
that finding a high R² together with a low D-W statistic can be a signal of a
spurious regression.
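The Granger-Newbold finding is easy to reproduce in a small simulation. The sketch below (sample size and seed are arbitrary choices) regresses one random walk on an independent one and reports R² and the Durbin-Watson statistic of the residuals:

```python
import numpy as np

# Two independent random walks: unrelated by construction.
rng = np.random.default_rng(42)
n = 500
y = np.cumsum(rng.normal(size=n))
x = np.cumsum(rng.normal(size=n))

# Regress y on x by OLS.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# R-squared is often sizable despite the variables being unrelated.
r_squared = 1.0 - resid.var() / y.var()

# Durbin-Watson statistic: near 0 here, flagging heavy positive
# autocorrelation in the residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(r_squared, 3), round(dw, 3))
```

The combination of an apparently decent fit with a Durbin-Watson statistic far below 2 is exactly the spurious-regression signature.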
Correlation
The correlation between two variables measures the degree of linear association
between them. If it is stated that y and x are correlated, it means that y and x are
being treated in a completely symmetrical way. Thus, it is not implied that
changes in x cause changes in y, or indeed that changes in y cause changes in x.
Rather, it is simply stated that there is evidence for a linear relationship between
the two variables, and that movements in the two are on average related to an
extent given by the correlation coefficient.
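The symmetry of correlation can be seen directly in code; the data here are hypothetical:

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation treats x and y completely symmetrically:
# corr(x, y) equals corr(y, x), and no causal direction is implied.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
print(round(r_xy, 4))
```

Swapping the roles of x and y leaves the coefficient unchanged, which is precisely what distinguishes correlation from regression.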
Difference between regression and correlation
In regression analysis there is an asymmetry in the way the dependent and
explanatory variables are treated. The dependent variable is assumed to be
statistical, random, or stochastic, that is, to have a probability distribution. The
explanatory variables, on the other hand, are assumed to have fixed values (in
repeated sampling).
When there are non-stationary variables in a regression model we may get results that
are spurious. One way of resolving this is to difference the data to ensure the
stationarity of our variables. After doing this the regression model will be:
ΔYt = a1 + a2ΔXt + ut.
In this case, the regression model may give us correct estimates of the â1 and â2
parameters and the spurious equation problem has been resolved. However, what we
have from Equation above is only the short-run relationship between the two
variables. Knowing that economists are interested mainly in long-run relationships,
this constitutes a big problem, and the concept of cointegration and the ECM are very
useful to resolve this.
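As a sketch of why differencing helps, the following generates a hypothetical random walk (a unit-root process) and takes its first difference, which is stationary by construction:

```python
import numpy as np

# Hypothetical random walk: nonstationary (contains a unit root).
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=1000))

# First difference: dy_t = y_t - y_{t-1}, stationary in this example.
dy = np.diff(y)

# The level series wanders widely; the differenced series hovers
# around zero with roughly constant spread.
print(round(y.std(), 2), round(dy.std(), 2))
```

The cost noted above is that a regression in differences captures only the short-run relationship, which is where cointegration and the ECM come in.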
OLS = ordinary least squares; fitting a line to the data by minimising the sum of
squared residuals.
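Minimising the sum of squared residuals has a closed-form solution, the normal equations. A sketch on made-up data (the true line 1 + 2x is a hypothetical choice):

```python
import numpy as np

# Hypothetical data generated from a known line plus noise.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

# OLS normal equations: the minimiser of the sum of squared
# residuals solves (X'X) b = X'y for the coefficient vector b.
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # [intercept, slope]
print(beta)
```

The recovered coefficients should sit close to the true values 1.0 and 2.0.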
The Durbin Watson statistic is a number that tests for autocorrelation in the residuals
from a statistical regression analysis. The Durbin-Watson statistic is always between
0 and 4. A value of 2 means that there is no autocorrelation in the sample. Values
from 0 to less than 2 indicate positive autocorrelation and values from more than 2 to
4 indicate negative autocorrelation.
A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are relatively
normal. Any value outside this range could be a cause for concern.
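The statistic itself is just the sum of squared successive residual differences over the sum of squared residuals. The sketch below compares simulated independent residuals with a hypothetical positively autocorrelated AR(1) series:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)

# Independent residuals: DW should sit near 2 (no autocorrelation).
white = rng.normal(size=2000)

# Positively autocorrelated residuals (AR(1), rho = 0.9): DW well below 2.
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print(round(durbin_watson(white), 2), round(durbin_watson(ar), 2))
```

The first value falls in the "relatively normal" 1.5-2.5 band; the second falls well below it, signalling positive autocorrelation.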
The Bootstrap is a tool for making statistical inferences when standard parametric
assumptions are questionable. For example, it is well known that significance tests for
regression coefficients may be misleading when the errors are not normally
distributed. Another way of thinking about the bootstrap is that it is a method for
computing confidence intervals around just about any statistic one could possibly
want to estimate even when no formula exists for calculating a standard error.
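As a sketch of the percentile bootstrap, the following builds a confidence interval for the median (a statistic with no simple standard-error formula) from a made-up skewed sample; the exponential data and resample count are arbitrary choices:

```python
import numpy as np

# Hypothetical skewed sample; bootstrap a 95% CI for the median.
rng = np.random.default_rng(5)
sample = rng.exponential(scale=2.0, size=100)

# Resample with replacement many times and record the median each time.
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(2000)
])

# Percentile interval: the middle 95% of the bootstrap distribution.
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))
```

No normality assumption and no analytic standard error were needed; the same recipe works for almost any statistic.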