
Quantitative methods

TYPES OF DATA

a. Time series - A time series is a set of observations on the values that a variable
takes at different times. Such data may be collected at regular time intervals, such as
daily.

b. Cross-section - Cross-section data are data on one or more variables collected at
the same point in time, such as the census of population. Cross-sectional data too
have their own problems, specifically the problem of heterogeneity.

c. Pooled - In pooled, or combined, data there are elements of both time series and
cross-section data.

d. Panel, Longitudinal, or Micropanel Data - This is a special type of pooled data in
which the same cross-sectional unit (say, a family or a firm) is surveyed over time.

Data mining is the process of finding anomalies, patterns and correlations within
large data sets to predict outcomes. Using a broad range of techniques, you can
use this information to increase revenues, cut costs, improve customer
relationships, reduce risks and more.

Model selection is the task of selecting a statistical model from a set of candidate
models (given data). Once the set of candidate models has been chosen, the
statistical analysis allows us to select the best of these models.

What is meant by best is controversial. A good model selection technique will
balance goodness of fit with simplicity. Goodness of fit is generally determined
using a likelihood ratio approach, or an approximation of this, leading to a
chi-squared test. The complexity is generally measured by counting the
number of parameters in the model.
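The fit-versus-simplicity trade-off above can be sketched with the Akaike information criterion, which (under a Gaussian likelihood) combines the residual sum of squares with a penalty on the parameter count. This is an illustrative example with simulated data, not taken from the notes:

```python
import numpy as np

def aic(y, y_hat, k):
    """AIC under a Gaussian likelihood: n*ln(RSS/n) + 2k,
    where k counts the fitted parameters."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)   # data that are truly linear

# Candidate models: polynomials of increasing complexity.
scores = {}
for degree in (1, 2, 5):
    coefs = np.polyfit(x, y, degree)
    scores[degree] = aic(y, np.polyval(coefs, x), degree + 1)
    print(degree, round(scores[degree], 1))
```

Higher-degree polynomials always reduce the residual sum of squares, but the `2k` penalty typically tips the criterion back toward the simpler model when the extra parameters fit only noise.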

R² criterion = a statistical measure of how close the data are to the fitted regression
line. It is also known as the coefficient of determination, or the coefficient of multiple
determination for multiple regression.

R-squared is always between 0 and 100%. In general, the higher the R-squared, the
better the model fits the data.

a. R-squared cannot determine whether the coefficient estimates and predictions are
biased, which is why you must assess the residual plots.

A residual plot is a graph that shows the residuals on the vertical axis and the
independent variable on the horizontal axis. If the points in a residual plot are
randomly dispersed around the horizontal axis, a linear regression model is
appropriate for the data; otherwise, a non-linear model is more appropriate.
b. R-squared does not indicate whether a regression model is adequate. You can have
a low R-squared value for a good model, or a high R-squared value for a model that
does not fit the data!
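A minimal numpy sketch of the two ideas above, R² computed from the residuals and the residuals themselves being what a residual plot would display (illustrative simulated data, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 100)

# Fit a straight line by OLS and compute R-squared by hand.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
residuals = y - y_hat

ss_res = np.sum(residuals ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")

# A residual plot shows `residuals` against x; for a well-specified linear
# model the points scatter around zero with no visible pattern.
```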

The t-value measures the size of the difference relative to the variation in your
sample data. Put another way, T is simply the calculated difference represented in
units of standard error. The greater the magnitude of T, the greater the evidence
against the null hypothesis. This means there is greater evidence that there is a
significant difference. The closer T is to 0, the more likely there isn't a significant
difference.

The t-value in your output is calculated from only one sample of the entire population.

Misuse of t-statistic:

1) If the sample size is small (less than 15), the one-sample t-test should not be used if
the data are clearly skewed or outliers are present. A nonparametric test can be
performed instead.

2) If a group of subjects receives one treatment, and then the same subjects later
receive another treatment, this calls for a paired t-test, not a two-sample t-test.
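The paired-versus-two-sample distinction in point 2 can be made concrete by computing both statistics on the same before/after data. A numpy sketch with simulated measurements (the data and numbers are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
before = rng.normal(100, 10, 20)
after_ = before + rng.normal(2, 3, 20)   # same subjects, small treatment shift

# Paired t-test: work with the within-subject differences.
d = after_ - before
t_paired = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# The (inappropriate here) two-sample t-test treats the groups as
# independent; Welch form of the standard error:
se = np.sqrt(before.var(ddof=1) / len(before) + after_.var(ddof=1) / len(after_))
t_twosample = (after_.mean() - before.mean()) / se

print(round(t_paired, 2), round(t_twosample, 2))
```

Because pairing removes the large between-subject variation, the differences `d` have a much smaller spread than either group, which is why the paired test is the correct and more powerful choice for this design.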

P value = determines statistical significance in a hypothesis test. P values are
calculated based on the assumption that the null is true for the population and that the
difference in the sample is caused entirely by random chance. Consequently, P values
can't tell you the probability that the null is true or false, because it is assumed 100%
true from the perspective of the calculations.

- High P value = likely with a true null

- Low P value = unlikely with a true null

! T and P are inextricably linked. The larger the absolute value of the t-value, the
smaller the p-value, and the greater the evidence against the null hypothesis !

The F-test of overall significance determines whether the relationship between the
dependent variable and the full set of explanatory variables is statistically
significant. If the P value for the F-test of overall significance is less than your
significance level, you can reject the null hypothesis and conclude that your model
provides a better fit than the intercept-only model.
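The overall F-statistic can be computed directly from R², as a comparison of the fitted model against the intercept-only model. A hedged numpy sketch with simulated data (the variable names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

k = 1  # number of slope coefficients (the intercept-only model has none)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f"F = {f_stat:.1f}")
```

The P value for this statistic would come from the F(k, n-k-1) distribution; a large F means the explanatory variables jointly explain far more variation than chance alone.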

Regression
In very general terms, regression is concerned with describing and evaluating
the relationship between a given variable and one or more other variables. More
specifically, regression is an attempt to explain movements in a variable by
reference to movements in one or more other variables.

Regression analysis is concerned with the study of the dependence of one variable,
the dependent variable, on one or more other variables, the explanatory variables,
with a view to estimating and/or predicting the (population) mean or average value
of the former in terms of the known or fixed (in repeated sampling) values of the
latter.

The spurious regression

Many macroeconomic, finance, and monetary variables are nonstationary,
in most cases presenting trending behavior.

From an econometric point of view, the presence of a deterministic trend (linear or not)
in the explanatory variables does not raise any problem. But many economic and
business time series are nonstationary even after eliminating deterministic trends, due
to the presence of unit roots, that is, they are generated by integrated processes.

When nonstationary time series are used in a regression model one may obtain
apparently significant relationships from unrelated variables. This phenomenon is
called spurious regression.
Granger and Newbold (1974) estimated regression models of the type:

yt = β0 + β1xt + ut (4.54)

where yt and xt were unrelated random walks:

yt = yt−1 + εt, xt = xt−1 + νt

Since xt neither affects nor is affected by yt, one expects the coefficient β1 to
converge to zero and the coefficient of determination R² to also tend to zero.
However, they found that, frequently, the null hypothesis of no relationship is
rejected, along with very high R² and very low Durbin-Watson statistics. It should
be noted that the autocorrelation of the random walk yt is projected into ut, which,
being a random walk as well, is also highly autocorrelated. Following these results
they suggest that finding a high R² together with a low D-W statistic can be a signal
of a spurious regression.
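The Granger-Newbold experiment is easy to reproduce: regress one simulated random walk on another, independent one, and inspect R² and the Durbin-Watson statistic of the residuals. A numpy sketch (one simulated draw; the exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
y = np.cumsum(rng.normal(size=n))   # random walk 1
x = np.cumsum(rng.normal(size=n))   # random walk 2, independent of y

# Regress y on x even though the series are unrelated.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r2 = 1 - resid.var() / y.var()

# Durbin-Watson statistic of the residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"R^2 = {r2:.2f}, DW = {dw:.2f}")
```

Because the residuals inherit the random-walk persistence, the Durbin-Watson statistic comes out very close to zero, and across repeated draws the regression frequently produces a sizeable R² despite there being no true relationship.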

Correlation
The correlation between two variables measures the degree of linear association
between them. If it is stated that y and x are correlated, it means that y and x are
being treated in a completely symmetrical way. Thus, it is not implied that
changes in x cause changes in y, or indeed that changes in y cause changes in x.
Rather, it is simply stated that there is evidence for a linear relationship between
the two variables, and that movements in the two are on average related to an
extent given by the correlation coefficient.
Difference between regression and correlation
In regression analysis there is an asymmetry in the way the dependent and
explanatory variables are treated. The dependent variable is assumed to be
statistical, random, or stochastic, that is, to have a probability distribution. The
explanatory variables, on the other hand, are assumed to have fixed values (in
repeated sampling).

Error correction model = Short-run model that incorporates a mechanism which
restores a variable to its long-term relationship from a disequilibrium position. The
ECM links the long-run equilibrium relationship between two time series implied by
cointegration with the short-run dynamic adjustment mechanism.

When there are non-stationary variables in a regression model we may get results that
are spurious. One way of resolving this is to difference the data to ensure stationarity
of our variables. After doing this the regression model will be: ΔYt = a1 + a2ΔXt + Δut.

In this case, the regression model may give us correct estimates of the â1 and â2
parameters and the spurious regression problem has been resolved. However, what we
have from the equation above is only the short-run relationship between the two
variables. Since economists are interested mainly in long-run relationships, this
constitutes a big problem, and the concepts of cointegration and the ECM are very
useful in resolving it.
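A minimal sketch of the Engle-Granger two-step approach behind the ECM: first estimate the long-run relation in levels, then regress the differenced series on the lagged error-correction term. The data, the no-intercept simplification, and the variable names are illustrative assumptions, not the notes' own example:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x = np.cumsum(rng.normal(size=n))    # I(1) driver
y = 0.5 * x + rng.normal(size=n)     # cointegrated with x

# Step 1: estimate the long-run relation y = theta*x + u
# (no-intercept OLS, for brevity).
theta = np.sum(x * y) / np.sum(x ** 2)
ect = y - theta * x                  # error-correction term

# Step 2: short-run ECM  dy_t = b*dx_t + g*ect_{t-1} + e_t
dy, dx = np.diff(y), np.diff(x)
Z = np.column_stack([dx, ect[:-1]])
b, g = np.linalg.lstsq(Z, dy, rcond=None)[0]
print(round(b, 2), round(g, 2))
```

The adjustment coefficient `g` comes out negative: when y sits above its long-run relation with x, the ECM pulls it back down in the next period, which is exactly the disequilibrium-correction mechanism described above.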

Classical linear regression model

yt = α + βxt + ut

OLS = ordinary least squares; fitting a line to the data by minimising the sum of
squared residuals
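For the bivariate model above, minimising the sum of squared residuals has a closed-form solution, sketched here with numpy on simulated data (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, 30)

# OLS closed form: beta = cov(x, y) / var(x), alpha = ybar - beta * xbar
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

residuals = y - (alpha + beta * x)
print(round(alpha, 2), round(beta, 2))
# With an intercept included, the OLS residuals sum to (numerically) zero.
```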

Durbin Watson test

The Durbin-Watson statistic is a number that tests for autocorrelation in the residuals
from a statistical regression analysis. The Durbin-Watson statistic is always between
0 and 4. A value of 2 means that there is no autocorrelation in the sample. Values
from 0 to less than 2 indicate positive autocorrelation, and values from more than 2 to
4 indicate negative autocorrelation.

A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are relatively
normal. Any value outside this range could be a cause for concern.
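The statistic itself is simple to compute from a residual series. A numpy sketch contrasting independent errors (DW near 2) with highly persistent ones (DW near 0); the series are simulated for illustration:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); near 2 means no
    first-order autocorrelation, near 0 positive, near 4 negative."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(5)
white = rng.normal(size=1000)      # independent errors -> DW near 2
persistent = np.cumsum(white)      # random-walk errors -> DW near 0

dw_white = durbin_watson(white)
dw_rw = durbin_watson(persistent)
print(round(dw_white, 2), round(dw_rw, 2))
```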

The Bootstrap is a tool for making statistical inferences when standard parametric
assumptions are questionable. For example, it is well known that significance tests for
regression coefficients may be misleading when the errors are not normally
distributed. Another way of thinking about the bootstrap is that it is a method for
computing confidence intervals around just about any statistic one could possibly
want to estimate even when no formula exists for calculating a standard error.
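As a concrete instance of computing such a confidence interval without a standard-error formula, here is a percentile-bootstrap sketch for the mean of a skewed sample (simulated data, illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
sample = rng.exponential(scale=2.0, size=200)   # skewed, non-normal data

# Percentile bootstrap: resample with replacement, recompute the
# statistic, and take empirical quantiles of the resampled statistics.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The same recipe works for medians, regression coefficients, or almost any other statistic: only the line that computes the statistic on each resample changes.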

Bootstrap Confidence level
