ESTIMATION THEORY
The Principle of Least Squares
The best-fitting line is the one with the smallest sum of squared
residuals.
Assumptions:
- No uncertainty in x.
Regression
Linear Regression: Y = a + bX + u
Multiple Regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where:
Y= the variable that we are trying to predict
X= the variable that we are using to predict Y
a= the intercept
b= the slope
u= the regression residual.
In multiple regression the separate variables are differentiated by using subscripted
numbers.
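As a concrete illustration of these definitions, the sketch below fits Y = a + bX + u by least squares on a small made-up data set (the numbers are purely hypothetical):

```python
import numpy as np

# Hypothetical sample data for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares slope and intercept for Y = a + bX + u.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

# u, the regression residuals; least squares minimizes sum(residuals**2).
residuals = Y - (a + b * X)
print(a, b, np.sum(residuals ** 2))
```

By construction the residuals sum to zero, since the fitted line passes through the point of means.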
Introduction:
This article attempts to take a close look at the process and techniques in Regression
Testing.
If a piece of software is modified for any reason, testing needs to be done to ensure
that it works as specified and that the change has not negatively impacted any
functionality that it offered previously. This is known as Regression Testing.
Regression Testing plays an important role in any scenario where a change has been
made to previously tested software code. Regression Testing is hence an important
aspect of various software methodologies where software changes and enhancements
occur frequently.
Any Software Development Project is invariably faced with requests for changing
Design, code, features or all of them.
Each change implies more Regression Testing needs to be done to ensure that the
System meets the Project Goals.
Why is Regression Testing important?
Each change can affect the quality and reliability of the system. Regression Testing,
which aims to verify that existing behavior remains intact, is therefore very important.
Every time a change occurs one or more of the following scenarios may occur:
- More Functionality may be added to the system
- More complexity may be added to the system
- New bugs may be introduced
- New vulnerabilities may be introduced in the system
- System may tend to become more and more fragile with each change
After the change the new functionality may have to be tested along with all the
original functionality.
With each change Regression Testing could become more and more costly.
To make the Regression Testing Cost Effective and yet ensure good coverage one or
more of the following techniques may be applied:
- Test Automation:
If the test cases are automated, they may be executed using scripts after
each change is introduced in the system. Executing test cases in this way helps
eliminate oversight and human error. It may also result in faster and cheaper execution
of test cases. However, there is a cost involved in building the scripts.
- Selective Testing:
Some teams choose to execute the test cases selectively. They do not execute all the
test cases during Regression Testing; they test only what they decide is relevant.
This helps reduce the testing time and effort.
Since Regression Testing verifies the software application after a change has been
made, everything that may be impacted by the change should be tested during
Regression Testing.
PARTIAL CORRELATION
Partial correlation analysis involves studying the linear relationship between two
variables after excluding the effect of one or more independent factors.
For example, study of partial correlation between price and demand would involve
studying the relationship between price and demand excluding the effect of money
supply, exports, etc.
Generally, a large number of factors simultaneously influence all social and natural
phenomena. Correlation and regression studies aim at studying the effects of a large
number of factors on one another.
In simple correlation, we measure the strength of the linear relationship between two
variables, without taking into consideration the fact that both these variables may be
influenced by a third variable.
For example, when we study the correlation between price (dependent variable) and
demand (independent variable), we completely ignore the effect of other factors like
money supply, import and exports etc. which definitely have a bearing on the price.
RANGE
The correlation co-efficient between two variables X1 and X2, studied partially after
eliminating the influence of the third variable X3 from both of them, is the partial
correlation co-efficient r12.3.
Simple correlation between two variables is called the zero order co-efficient since in
simple correlation, no factor is held constant. The partial correlation studied between
two variables by keeping the third variable constant is called a first order co-efficient,
as one variable is kept constant. Similarly, we can define a second order co-efficient
and so on. The partial correlation co-efficient varies between -1 and +1. Its
calculation is based on the simple correlation co-efficient.
The partial correlation analysis assumes great significance in cases where the
phenomena under consideration have multiple factors influencing them, especially in
physical and experimental sciences, where it is possible to control the variables and
the effect of each variable can be studied separately. This technique is of great use in
various experimental designs where various interrelated phenomena are to be
studied.
LIMITATIONS
However, this technique suffers from some limitations, some of which are stated
below.
The calculation of the partial correlation co-efficient is based on the simple correlation
co-efficient. However, the simple correlation co-efficient assumes a linear relationship.
Generally this assumption is not valid, especially in the social sciences, as linear
relationships rarely exist in such phenomena.
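The first-order coefficient r12.3 described above is computed from the simple correlations as r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)). The sketch below applies this formula to simulated price, demand, and money-supply data; the variable names and data-generating equations are purely illustrative assumptions.

```python
import numpy as np

def partial_corr(x1, x2, x3):
    """First-order partial correlation r12.3: correlation of x1 and x2
    with the linear influence of x3 removed from both."""
    r12 = np.corrcoef(x1, x2)[0, 1]
    r13 = np.corrcoef(x1, x3)[0, 1]
    r23 = np.corrcoef(x2, x3)[0, 1]
    return (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Illustrative data: money supply influences both price and demand.
rng = np.random.default_rng(0)
m = rng.normal(size=200)                          # money supply
price = 2 * m + rng.normal(size=200)              # price partly driven by m
demand = -1.5 * price + m + rng.normal(size=200)  # demand depends on both
print(partial_corr(price, demand, m))
```

Here the partial correlation isolates the negative price-demand relationship that remains after the common influence of money supply is excluded.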
As the order of the partial correlation co-efficient goes up, its reliability goes down.
MULTIPLE CORRELATION
For the two-predictor problem, let ry1 and ry2 denote the simple correlations of the
criterion Y with the predictors X1 and X2, and let r12 denote the correlation between
the two predictors. The standardized regression (beta) weights are
b1 = (ry1 - ry2 r12) / (1 - r12²)   (1)
b2 = (ry2 - ry1 r12) / (1 - r12²)   (2)
B. Multiple R can be computed several ways. From the simple
correlations, as
R² = (ry1² + ry2² - 2 ry1 ry2 r12) / (1 - r12²)   (3)
or, equivalently, from the beta weights, as
R² = b1 ry1 + b2 ry2   (4)
The semipartial correlation of Y with X1 (the correlation of Y with the part of X1
that is independent of X2) is
sr1 = (ry1 - ry2 r12) / sqrt(1 - r12²)   (5)
So the relationship between the semipartial correlation and the beta
weight from Equations 1 and 5 is
sr1 = b1 sqrt(1 - r12²)   (6)
The partial correlation of Y with X1, with X2 partialled out of both, is
pr1 = (ry1 - ry2 r12) / sqrt((1 - ry2²)(1 - r12²))   (7)
The relation between partial correlations and beta weights for the two
predictor problem turns out to be
pr1 = b1 sqrt((1 - r12²) / (1 - ry2²))   (8)
E. Following Cohen and Cohen (1975, p. 80), we can think of all these in
terms of what they call Ballentines (we can call Mickeys): overlapping circles
representing the variances of Y, X1 and X2. The squared semipartial correlation is
the area of Y overlapped uniquely by X1; the squared partial correlation is that same
unique area taken as a proportion of the Y variance not accounted for by X2.
This should remind the reader of stepwise multiple regression, where each new
variable is entered while controlling for the variance explained by earlier entered
variables. Therefore, if we could compute the higher order partial correlations,
we could do multiple regression by hand. A recurrence relationship allows us
to do just that:
r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²))   (9)
Consider for now a rather abstract model where μi = xi′β for some predictors xi. How do
we estimate the parameters β and σ²?
2.2.1 Estimation of β
The likelihood principle instructs us to pick the values of the parameters that maximize
the likelihood, or equivalently, the logarithm of the likelihood function. If the
observations are independent, then the likelihood function is a product of normal
densities of the form given in Equation 2.1. Taking logarithms we obtain the normal log-
likelihood
log L(β, σ²) = -(n/2) log(2πσ²) - (1/2) Σ (yi - μi)² / σ²
(2.5)
where μi = xi′β. The most important thing to notice about this expression is that
maximizing the log-likelihood with respect to the linear parameters β for a fixed value of
σ² is exactly equivalent to minimizing the sum of squared differences between observed
and expected values, or residual sum of squares
RSS(β) = Σ (yi - μi)² = (y - Xβ)′(y - Xβ)
(2.6)
In other words, we need to pick values of β that make the fitted values μ̂i = xi′β̂ as close
as possible to the observed values yi.
Taking derivatives of the residual sum of squares with respect to β and setting the
derivative equal to zero leads to the so-called normal equations for the maximum-
likelihood estimator β̂:
X′X β̂ = X′y.
If the model matrix X is of full column rank, so that no column is an exact linear
combination of the others, then the matrix of cross-products X′X is of full rank and can
be inverted to solve the normal equations. This gives an explicit formula for the ordinary
least squares (OLS) or maximum likelihood estimator of the linear parameters:
β̂ = (X′X)⁻¹ X′y.
(2.7)
If X is not of full column rank one can use generalized inverses, but interpretation of the
results is much more straightforward if one simply eliminates redundant columns. Most
current statistical packages are smart enough to detect and omit redundancies
automatically.
There are several numerical methods for solving the normal equations, including
methods that operate on XX, such as Gaussian elimination or the Choleski
decomposition, and methods that attempt to simplify the calculations by factoring the
model matrix X, including Householder reflections, Givens rotations and the Gram-
Schmidt orthogonalization. We will not discuss these methods here, assuming that you
will trust the calculations to a reliable statistical package. For further details see
McCullagh and Nelder (1989, Section 3.8) and the references therein.
The foregoing results were obtained by maximizing the log-likelihood with respect to β
for a fixed value of σ². The result obtained in Equation 2.7 does not depend on σ², and is
therefore a global maximum.
For the null model X is a vector of ones, X′X = n and X′y = Σ yi are scalars, and β̂ =
ȳ, the sample mean. For our sample data ȳ = 14.3. Thus, the calculation of a
sample mean can be viewed as the simplest case of maximum likelihood estimation in a
linear model.
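A quick numerical check of Equation 2.7 is easy to sketch. The simulated data below are an illustrative assumption, not the text's sample; the check also confirms that the null model reproduces the sample mean.

```python
import numpy as np

# Simulate y = 3 + 2x + noise (coefficients chosen for illustration).
rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Model matrix with a constant column; solve the normal equations X'X b = X'y.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Null model: X is a column of ones, so beta_hat reduces to the sample mean.
X0 = np.ones((n, 1))
beta_null = np.linalg.solve(X0.T @ X0, X0.T @ y)
print(beta_hat, beta_null[0], y.mean())
```

Solving the normal equations directly (rather than forming (X′X)⁻¹ explicitly) is the numerically preferable route in practice.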
The OLS estimator is unbiased:
E(β̂) = β.
(2.8)
It can also be shown that if the observations are uncorrelated and have constant
variance σ², then the variance-covariance matrix of the OLS estimator is
var(β̂) = (X′X)⁻¹ σ².
(2.9)
This result follows immediately from the fact that β̂ is a linear function of the data y
(see Equation 2.7), and the assumption that the variance-covariance matrix of the data
is var(Y) = σ² I, where I is the identity matrix.
A further property of the estimator is that it has minimum variance among all unbiased
estimators that are linear functions of the data, i.e. it is the best linear unbiased
estimator (BLUE). Since no other unbiased estimator can have lower variance for a fixed
sample size, we say that OLS estimators are fully efficient.
Finally, it can be shown that the sampling distribution of the OLS estimator β̂ in
large samples is approximately multivariate normal with the mean and variance given
above, i.e.
β̂ ~ Np(β, (X′X)⁻¹ σ²).
Applying these results to the null model we see that the sample mean ȳ is an
unbiased estimator of μ, has variance σ²/n, and is approximately normally distributed in
large samples.
All of these results depend only on second-order assumptions concerning the mean,
variance and covariance of the observations, namely the assumption that E(Y) = Xβ and
var(Y) = σ² I.
2.2.3 Estimation of σ²
Substituting the OLS estimator β̂ into the log-likelihood in Equation 2.5 gives a profile
likelihood for σ²
log L(σ²) = -(n/2) log(2πσ²) - RSS(β̂) / (2σ²).
Differentiating this expression with respect to σ² (not σ) and setting the derivative to zero
leads to the maximum likelihood estimator
σ̂² = RSS(β̂)/n.
This estimator happens to be biased, but the bias is easily corrected by dividing by n - p
instead of n, where p is the number of linear parameters. The situation is exactly
analogous to the use of n - 1 instead of n when estimating a variance. In fact, the
estimator of σ² for the null model is the sample variance, since β̂ = ȳ and the
residual sum of squares is RSS = Σ (yi - ȳ)².
Under the assumption of normality, the ratio RSS(β̂)/σ² of the residual sum of squares to the
true parameter value has a chi-squared distribution with n - p degrees of freedom and is
independent of the estimator of the linear parameters. You might be interested to know
that using the chi-squared distribution as a likelihood to estimate σ² (instead of the
normal likelihood to estimate both β and σ²) leads to the unbiased estimator.
For the sample data the RSS for the null model is 2650.2 on 19 d.f. and therefore σ̂
= 11.81, the sample standard deviation.
Maximum Likelihood
Maximum likelihood, also called the maximum likelihood method, is the procedure of
finding the value of one or more parameters for a given statistic which makes the known
likelihood distribution a maximum. The maximum likelihood estimate for a parameter μ is
denoted μ̂.
For a normal distribution with mean μ and known standard deviation σ, the likelihood of
the observations x1, ..., xn is
L(μ) = Π (2πσ²)^(-1/2) exp[-(xi - μ)² / (2σ²)],   (1)
so maximum likelihood occurs for μ̂ = x̄, the sample mean. If σ is not known ahead of
time, the likelihood function is
L(μ, σ) = (2πσ²)^(-n/2) exp[-Σ (xi - μ)² / (2σ²)],   (2)
with log-likelihood
ln L = -(n/2) ln(2πσ²) - Σ (xi - μ)² / (2σ²).   (3)
Setting the partial derivative with respect to μ to zero gives
∂ ln L/∂μ = Σ (xi - μ) / σ² = 0.   (4)
Rearranging gives
Σ xi = nμ,   (5)
so
μ̂ = (1/n) Σ xi.   (6)
Similarly, setting the partial derivative with respect to σ to zero,
∂ ln L/∂σ = -n/σ + Σ (xi - μ)² / σ³ = 0,   (7)
gives
σ̂² = (1/n) Σ (xi - μ̂)².   (8)
Note that in this case, the maximum likelihood standard deviation is the sample
standard deviation (with divisor n rather than n - 1), which is a biased estimator for the
population standard deviation.
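To make the derivation concrete, here is a small numerical check that the closed-form estimates μ̂ = x̄ and σ̂² = (1/n) Σ (xi - x̄)² do maximize the normal log-likelihood. The simulated data and true parameter values are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
n = len(x)

def log_likelihood(mu, sigma2):
    # Normal log-likelihood: -(n/2) ln(2*pi*sigma^2) - sum((x-mu)^2)/(2*sigma^2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Closed-form maximum likelihood estimates derived above.
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # divisor n, hence biased

# The closed form should beat nearby parameter values in likelihood.
best = log_likelihood(mu_hat, sigma2_hat)
for dmu in (-0.1, 0.1):
    for ds in (-0.1, 0.1):
        assert log_likelihood(mu_hat + dmu, sigma2_hat + ds) < best
print(mu_hat, sigma2_hat)
```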
Method of Moments
The method of moments equates sample moments to parameter estimates. When
moment methods are available, they have the advantage of simplicity. The disadvantage
is that they are often not available and they do not have the desirable optimality
properties of maximum likelihood and least squares estimators.
The primary use of moment estimates is as starting values for the more precise
maximum likelihood and least squares estimates.
The Method
Suppose that we have a basic random experiment with an observable, real-valued
random variable X. The distribution of X has k unknown parameters, or equivalently, a
parameter vector a = (a1, a2, ..., ak)
taking values in a parameter space A ⊆ Rk. As usual, we repeat the experiment n times
to generate a random sample of size n from the distribution of X.
Thus, X1, X2, ..., Xn are independent random variables, each with the distribution of X.
Let
μi(a) = E(X^i | a)
denote the i'th moment of X about 0. Note that we are emphasizing the dependence of
these moments on the vector of parameters a. Note also that μ1(a) is just the mean of
X, which we usually denote by μ. Next, let
Mi(X) = (1/n) Σj=1,...,n Xj^i
denote the i'th sample moment. Note that we are emphasizing the dependence of the
sample moments on the sample X. Note also that M1(X) is just the ordinary sample
mean, which we usually just denote by Mn.
To construct estimators W1, W2, ..., Wk for our unknown parameters a1, a2, ..., ak,
respectively, we attempt to solve the set of simultaneous equations
μi(W1, W2, ..., Wk) = Mi(X),  i = 1, 2, ..., k,
for W1, W2, ..., Wk in terms of X1, X2, ..., Xn. Note that we have k equations in k
unknowns, so there is hope that the equations can be solved.
Estimates for the Mean and Variance
1. Suppose that (X1, X2, ..., Xn) is a random sample of size n from a distribution with
unknown mean μ and variance σ². Show that the method of moments estimators for μ
and σ² are, respectively,
a. Mn = (1/n) Σj=1,...,n Xj.
b. Tn² = (1/n) Σj=1,...,n (Xj - Mn)².
Note that Mn is just the ordinary sample mean, but Tn² = [(n - 1)/n] Sn², where Sn² is the
usual sample variance. In the remainder of this subsection, we will compare the
estimators Sn² and Tn².
4. Show that E(Tn²) = [(n - 1)/n] σ², so that Tn² is biased downward.
6. Suppose that the sampling distribution is normal. Show that in this case
var(Sn²) = 2σ⁴/(n - 1) while MSE(Tn²) = (2n - 1)σ⁴/n².
Thus, Sn² and Tn² are multiples of one another; Sn² is unbiased but Tn² has smaller mean
square error.
7. Run the normal estimation experiment 1000 times, updating every 10 runs, for
several values of the parameters. Compare the empirical bias and mean square error
of Sn² and of Tn² to their theoretical values. Which estimator is better in terms of bias?
Which estimator is better in terms of mean square error?
There are several important one-parameter families of distributions for which the
parameter is the mean, including the Bernoulli distribution with parameter p and the
Poisson distribution with parameter λ. For these families, the method of moments
estimator of the parameter is Mn, the sample mean. Similarly, the parameters of the
normal distribution are μ and σ², so the method of moments estimators are Mn and Tn².
Additional Exercises
8. Suppose that (X1, X2, ..., Xn) is a random sample from the gamma distribution with
shape parameter k and scale parameter b. Show that the method of moments estimators
of k and b are respectively
a. U = Mn² / Tn².
b. V = Tn² / Mn.
9. Run the gamma estimation experiment 1000 times, updating every 10 runs for
several different values of the shape and scale parameter. Record the empirical bias and
mean square error in each case.
10. Suppose that (X1, X2, ..., Xn) is a random sample from the beta distribution with
parameters a and 1. Show that the method of moments estimator of a is Un = Mn / (1 -
Mn).
11. Run the beta estimation experiment 1000 times, updating every 10 runs, for several
different values of a. Record the empirical bias and mean square error in each case.
Draw graphs of the empirical bias and mean square error as a function of a.
12. Suppose that (X1, X2, ..., Xn) is a random sample from the Pareto distribution with
shape parameter a > 1. Show that the method of moments estimator of a is Un = Mn /
(Mn - 1).
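As a check on Exercise 8, the sketch below simulates gamma data and applies the moment estimators U = Mn²/Tn² and V = Tn²/Mn. The true parameter values and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
k_true, b_true = 3.0, 2.0                         # shape k, scale b
x = rng.gamma(shape=k_true, scale=b_true, size=5000)

Mn = x.mean()
Tn2 = np.mean((x - Mn) ** 2)                      # divisor-n sample variance

# Method of moments estimators: gamma has mean k*b and variance k*b^2,
# so matching moments gives U = Mn^2/Tn2 for k and V = Tn2/Mn for b.
U = Mn**2 / Tn2
V = Tn2 / Mn
print(U, V)
```

With 5000 observations the estimates land close to the true (k, b) = (3, 2), though moment estimates are typically less efficient than maximum likelihood.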
UNIT-5
TIME SERIES
Moving average is an indicator used in technical analysis that shows a stock's average
price over a certain period of time. It is useful for showing a stock's "momentum" and its
propensity to move above or below a given level. Generally the moving average is
plotted on a graph alongside the stock's price (for example, Dell's 50-day moving
average plotted along with its price).
There are different types of moving average, and therefore different formulas to
calculate it. However, for any given moment in time, the moving average is the average
of the stock prices over the past x days, where x is the period that you are measuring.
For example, if the stock price on Monday was $3, the price on Tuesday was $5, and the
price on Wednesday was $7, the three-day moving average on Thursday would be $5, or
the average of the past three days.
Why is it important?
The moving average can be a better indicator of a stock's typical price over time than the
raw price itself. The moving average curve is a much smoother version of the price curve
because it smooths out the sharp bumps caused by short-term deviations.
Two of the most important types of moving average are the linear moving average (this is
generally just called the "moving average") and the exponential moving average. The linear
moving average is the simpler one: it is just the average obtained by summing the stock
prices over the period and dividing by the number of prices. In an exponential
moving average (EMA), more recent days are given exponentially more weight, and a
more complicated formula is used to find the average. There is no single formula because
the days can be weighted differently. For example, in a ten-day EMA, you could give the
last day a weight of 20%, the second-to-last day a weight of 14%, and so on. Some common
time frames for both moving averages are 5, 10, 20, 50, 100, and 200 days.
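The two averages described above can be sketched in a few lines. The price series is made up, and the 2/(span+1) weighting in the EMA is one common convention among the many possible weightings mentioned in the text.

```python
import numpy as np

prices = np.array([3.0, 5.0, 7.0, 6.0, 8.0, 9.0, 7.0, 8.0])  # hypothetical closes

def simple_ma(p, window):
    """Linear (simple) moving average: unweighted mean of the last `window` prices."""
    return np.convolve(p, np.ones(window) / window, mode="valid")

def ema(p, span):
    """Exponential moving average using the common smoothing factor 2/(span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [p[0]]                                   # seed with the first price
    for price in p[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return np.array(out)

print(simple_ma(prices, 3))
print(ema(prices, 3))
```

The first three-day average here is (3 + 5 + 7)/3 = 5, matching the Monday-to-Wednesday example in the text.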
This is a very popular scheme for producing a smoothed time series. Whereas in a single
moving average the past observations are weighted equally, exponential smoothing
assigns exponentially decreasing weights as the observations get older.
In other words, recent observations are given relatively more weight in forecasting than
the older observations.
In the case of moving averages, the weights assigned to the observations are the same
and are equal to 1/N. In exponential smoothing, however, there are one or more
smoothing parameters to be determined (or estimated) and these choices determine the
weights assigned to the observations.
Exponential smoothing
Let's now dig into the most popular forecasting methods: the simple exponential
smoothing model, which forecasts the next period for a time series without trend, and the
double exponential smoothing model, which takes a trend effect into account.
The single exponential smoothing is somewhat similar to the moving average methods
except that there is a single weighting factor called alpha, which can take a value
between 0 and 1.
The next period forecast (Fn+1) is then calculated as follows:
Fn+1 = alpha × Dn + (1 - alpha) × Fn
Where: Dn is the actual value (e.g. sales) observed in period n, and Fn is the forecast
that was made for period n.
If alpha is near 0, the next period forecast is essentially equal to the previous forecast,
which means that the model is less reactive.
Conversely, if alpha is near 1, the next period forecast is essentially equal to the previous
period's actual sales, meaning that the model is highly reactive.
In summary, the reactivity of the model depends on alpha: when close to 1, the model is
very reactive and the latest periods are more important and when alpha is close to 0,
the model is less reactive so former periods are more important.
Tip: When alpha = 2 / (N + 1), where N is the number of periods, the exponential
smoothing model behaves comparably to the straight N-period moving average.
You can determine alpha either empirically or scientifically; its value lets you fine-tune
the model's sensitivity.
The empirical way is basically you in front of your computer, playing with the model until it
fits the actual sales based on historical data. This way you get your golden number for
alpha.
The scientific way uses the standard error method for estimating alpha.
Let's take the Dow Jones index to see the results with two alpha values.
We can see that the green curve with alpha equal to 0.2 is certainly not precise, while
the purple curve with alpha equal to 0.9 looks much better.
Applying the standard error method confirms that 0.9 is the best value of alpha for
fitting this very unstable data.
You can also see that the forecast data lag the actual data in the model, whatever the
value of alpha.
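The recursion Fn+1 = alpha·Dn + (1 - alpha)·Fn can be sketched as follows. The sales figures are made up, and seeding the first forecast with the first actual value is one common initialization convention among several.

```python
def ses(demand, alpha):
    """Simple exponential smoothing: F(t+1) = alpha*D(t) + (1 - alpha)*F(t),
    with the initial forecast seeded by the first observation."""
    forecast = demand[0]
    history = []
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        history.append(forecast)
    return history          # history[-1] is the forecast for the next period

sales = [100, 110, 105, 120, 118]     # hypothetical actuals
print(ses(sales, alpha=0.2))          # sluggish: less reactive
print(ses(sales, alpha=0.9))          # tracks the latest actuals closely
```

Comparing the two printed series shows exactly the behavior described above: with alpha = 0.2 the forecasts drift slowly toward the actuals, while with alpha = 0.9 they chase the most recent observation.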
Autoregressive Process
The solution is
(8.45)
h(u)
(8.46)
The effect is the same as applying a filter to the input signal Z(t), and the
variance spectrum of X(t) can then be determined as in the preceding example.
In the case where Z(t) is a random variable or "white noise" source, the variance
spectrum from this source would be constant with frequency:
(8.47)
(8.49)
And therefore the autocorrelation function, obtained from the Fourier transform
of (8.49), is a decaying exponential:
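The decaying-exponential autocorrelation is easy to verify by simulation for the simplest case. The sketch below assumes a first-order autoregressive model X(t) = a·X(t-1) + Z(t) driven by white noise; this particular model form and coefficient are illustrative assumptions, since the chapter's exact equations are not reproduced here. For that model the theoretical autocorrelation at lag k is a^k.

```python
import numpy as np

rng = np.random.default_rng(3)
a = 0.8                         # AR(1) coefficient (assumed model X_t = a*X_{t-1} + Z_t)
n = 20000
z = rng.normal(size=n)          # white-noise input Z(t)

x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + z[t]

def autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    s = series - series.mean()
    return np.dot(s[:-lag], s[lag:]) / np.dot(s, s)

# Theory for this assumed model: rho(k) = a**k, a decaying exponential.
for k in (1, 2, 3):
    print(k, autocorr(x, k), a**k)
```

The sample autocorrelations at lags 1, 2, 3 track 0.8, 0.64, 0.512 closely, illustrating the exponential decay that the text derives via the Fourier transform of the variance spectrum.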