Matteo Pelagatti
October 24, 2013
Definitions
Let us define the main object of this book: the time series.
Digital computers are finite-state machines and, thus, cannot record continuous-time
time series.
A process $\{Y_t\}$ is a Gaussian process if, for all finite subsets $\{t_1, t_2, \ldots, t_m\}$ of time points, the joint distribution of $(Y_{t_1}, \ldots, Y_{t_m})$ is normal.
Stationary processes
As customary in time series analysis, in the rest of the book the terms
stationarity and stationary will be used with the meaning of weak stationarity
and weakly stationary respectively.
Proof. Trivial.
Notice that the above definitions of stationarity assume that the process $\{Y_t\}$ is defined for $t \in \mathbb{Z}$ (i.e. the process originates in the infinite past and ends in the infinite future). This is a mathematical abstraction that is useful to derive some results (e.g. limit theorems), but the definitions can be easily adapted to the case of time series with $t \in \mathbb{N}$ or $t \in \{1, 2, \ldots, n\}$ by changing the domains of $t$, $h$ and $k$ accordingly.
The most elementary (non-trivial) stationary process is the white noise.
Definition 6 (White noise). A stochastic process is white noise if it has zero mean, finite variance $\sigma^2$, and covariance function
$$\gamma(h) = \begin{cases} \sigma^2 & \text{for } h = 0,\\ 0 & \text{for } h \neq 0.\end{cases}$$
As the next example clarifies, white noise processes and independent and identically distributed (i.i.d.) sequences are not equivalent.
Example 1 (White noise and i.i.d. sequences). Let $\{X_t\}$ be a sequence of independent and identically distributed (i.i.d.) random variables and $\{Z_t\}$ be white noise.
The process $\{X_t\}$ is strictly stationary since the joint distribution for any $k$-tuple of time points is the product (by independence) of the common marginal distribution (by identical distribution), say $F(\cdot)$,
$$\Pr\{X_{t_1} \le x_1, X_{t_2} \le x_2, \ldots, X_{t_k} \le x_k\} = \prod_{i=1}^{k} F(x_i),$$
and this does not depend on $t$. $\{X_t\}$ is not necessarily weakly stationary since its first two moments may not exist (e.g. when $X_t$ is Cauchy-distributed).
The process $\{Z_t\}$ is weakly stationary since its mean and covariance are time-independent, but it is not necessarily strictly stationary since its marginal and joint distributions may depend on $t$ even when the first two moments are time-invariant.
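The following is a minimal simulation sketch (not taken from the book) of a process that is white noise without being i.i.d.: its marginal distribution alternates between a normal and a rescaled uniform, so the first two moments are constant over time while the distribution itself is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# White noise but not i.i.d.: at even times Z_t is standard normal, at odd
# times it is uniform on [-sqrt(3), sqrt(3)], so both marginals have mean 0
# and variance 1 but the distribution changes with t.
normal_part = rng.standard_normal(n)
uniform_part = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
z = np.where(np.arange(n) % 2 == 0, normal_part, uniform_part)

# The population mean is 0 and the variance is 1 at every t by construction;
# the sample autocovariances at lags h >= 1 should be close to zero.
print("sample mean ~", z.mean(), "sample variance ~", z.var())
for h in (1, 2, 3):
    print(f"sample autocovariance at lag {h}:",
          np.mean((z[h:] - z.mean()) * (z[:-h] - z.mean())))
```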
The function $\gamma(h)$, which characterises a weakly stationary process, is called the autocovariance function and enjoys the following properties.
$$\sum_{i=1}^{m}\sum_{j=1}^{m} a_i\,\gamma(i-j)\,a_j \ge 0, \qquad \forall\, m \in \mathbb{N}.$$
For the proof of this theorem, and in many other places in this book, we will make use of the covariance matrix of the vector of $n$ consecutive observations of a stationary process, say $\mathbf{Y} := (Y_1, Y_2, \ldots, Y_n)^\top$:
$$\Gamma_n := \begin{bmatrix}
\gamma(0) & \gamma(1) & \cdots & \gamma(n-1)\\
\gamma(1) & \gamma(0) & \cdots & \gamma(n-2)\\
\vdots & \vdots & \ddots & \vdots\\
\gamma(n-1) & \gamma(n-2) & \cdots & \gamma(0)
\end{bmatrix}.$$
Like any covariance matrix, $\Gamma_n$ is symmetric with respect to the main diagonal and nonnegative definite but, as the reader can easily verify, $\Gamma_n$ is also symmetric with respect to the secondary diagonal. Furthermore, the element of the matrix with indexes $(i, j)$ equals the element with indexes $(i + 1, j + 1)$ (i.e. $\Gamma_n$ is a Toeplitz matrix).
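As an illustration, the sketch below builds $\Gamma_n$ from an assumed autocovariance function (here that of an AR(1) process, chosen only to have concrete numbers) and checks the properties just listed; the helper `gamma` and the parameter `phi` are illustrative, not part of the text.

```python
import numpy as np
from scipy.linalg import toeplitz

# Illustrative autocovariance: AR(1) with coefficient phi and unit innovation
# variance, gamma(h) = phi**|h| / (1 - phi**2).  (Assumption, for the example.)
phi = 0.6
def gamma(h):
    return phi ** abs(h) / (1.0 - phi ** 2)

n = 5
gammas = np.array([gamma(h) for h in range(n)])

# Gamma_n is symmetric Toeplitz: constant along diagonals, built from
# gamma(0), ..., gamma(n-1).
Gamma_n = toeplitz(gammas)
print(Gamma_n)

# Nonnegative definiteness: all eigenvalues are >= 0.
print("eigenvalues:", np.linalg.eigvalsh(Gamma_n))
```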
Proof. The first two properties are well-known properties of variance and
covariance. The third property follows from stationarity and the symmetry
of the arguments of the covariance:
$$\gamma(h) = \operatorname{Cov}(X_t, X_{t-h}) = \operatorname{Cov}(X_{t+h}, X_t) = \operatorname{Cov}(X_t, X_{t+h}) = \gamma(-h).$$
As for the fourth property, let $\mathbf{y} := (Y_1, Y_2, \ldots, Y_m)^\top$ be $m$ consecutive observations of the stationary process with autocovariance $\gamma(\cdot)$; then for any real $m$-vector of constants $\mathbf{a}$, the random variable $\mathbf{a}^\top \mathbf{y}$ has variance
$$\operatorname{Var}(\mathbf{a}^\top \mathbf{y}) = \mathbf{a}^\top \Gamma_m \mathbf{a} = \sum_{i=1}^{m}\sum_{j=1}^{m} a_i\,\gamma(i-j)\,a_j,$$
which cannot be negative.
As shown in the following theorem, the PACF can be derived as a linear transformation of the ACF.
Theorem 3 (Durbin–Levinson algorithm). Let $\{Y_t\}$ be a stationary process with mean $\mu$ and autocovariance function $\gamma(h)$; then its PACF is given by
$$\alpha(0) = 1, \qquad \alpha(1) = \gamma(1)/\gamma(0),$$
$$\alpha(h) = \frac{\gamma(h) - \sum_{j=1}^{h-1} \phi_{h-1,j}\,\gamma(h-j)}{\gamma(0) - \sum_{j=1}^{h-1} \phi_{h-1,j}\,\gamma(j)}, \qquad \text{for } h = 2, 3, \ldots, \qquad (2)$$
where $\phi_{h-1,j}$ denotes the $j$-th element of the vector $\boldsymbol{\phi}_{h-1} := \Gamma_{h-1}^{-1}\boldsymbol{\gamma}_{h-1}$ with $\boldsymbol{\gamma}_{h-1} := [\gamma(1), \ldots, \gamma(h-1)]^\top$ and $\Gamma_{h-1}$ as in equation (1).
The coefficients $\boldsymbol{\phi}_h$ can be recursively computed as
$$\phi_{h,h} = \alpha(h), \qquad \phi_{h,j} = \phi_{h-1,j} - \alpha(h)\,\phi_{h-1,h-j}, \qquad \text{for } j = 1, \ldots, h-1.$$
The first part of the theorem shows how partial autocorrelations relate to autocovariances, while the second part provides recursions to efficiently compute the PACF without the need to explicitly invert the matrices $\Gamma_{h-1}$.
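A possible implementation of the recursions of Theorem 3 is sketched below; the function name `pacf_durbin_levinson` and the MA(1) example used to test it are illustrative choices, not from the book.

```python
import numpy as np

def pacf_durbin_levinson(gamma, max_lag):
    """Compute alpha(0), ..., alpha(max_lag) from the autocovariances
    gamma[0], ..., gamma[max_lag] with the Durbin-Levinson recursions."""
    alpha = np.zeros(max_lag + 1)
    alpha[0] = 1.0
    phi_prev = np.zeros(0)           # phi_{h-1,1}, ..., phi_{h-1,h-1}
    v = gamma[0]                     # v_0 = gamma(0)
    for h in range(1, max_lag + 1):
        # numerator: gamma(h) - sum_j phi_{h-1,j} gamma(h-j)
        num = gamma[h] - np.dot(phi_prev, gamma[h - 1:0:-1])
        a = num / v
        alpha[h] = a
        # phi_{h,j} = phi_{h-1,j} - alpha(h) * phi_{h-1,h-j}, then phi_{h,h} = alpha(h)
        phi_prev = np.concatenate([phi_prev - a * phi_prev[::-1], [a]])
        v = v * (1.0 - a ** 2)       # v_h = v_{h-1} (1 - alpha(h)^2)
    return alpha

# Example: MA(1) with theta = 0.5, so gamma(0) = 1 + theta^2, gamma(1) = theta.
theta = 0.5
gam = np.array([1 + theta ** 2, theta, 0.0, 0.0, 0.0])
print(pacf_durbin_levinson(gam, 4))   # alpha(2) should equal -theta^2/(1+theta^2+theta^4)
```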
Proof. The correlation of a random variable with itself is 1, so $\alpha(0) = 1$; and since no variables intervene between $Y_t$ and $Y_{t-1}$, $\alpha(1) = \rho(1) = \gamma(1)/\gamma(0)$.
In order to lighten the notation, let us assume without loss of generality that $\mathrm{E}Y_t = 0$. First, notice that by the properties of the optimal linear predictor (1. and 2. of Theorem ??),
$$\mathrm{E}\bigl(Y_t - P[Y_t|Y_{t-1:t-h+1}]\bigr)\bigl(Y_{t-h} - P[Y_{t-h}|Y_{t-1:t-h+1}]\bigr) = \mathrm{E}\bigl(Y_t - P[Y_t|Y_{t-1:t-h+1}]\bigr)Y_{t-h}.$$
Thus, by definition of PACF,
$$\alpha(h) = \frac{\mathrm{E}\bigl(Y_t - P[Y_t|Y_{t-1:t-h+1}]\bigr)Y_{t-h}}{\mathrm{E}\bigl(Y_t - P[Y_t|Y_{t-1:t-h+1}]\bigr)Y_t} = \frac{\gamma(h) - \sum_{j=1}^{h-1} \phi_{h-1,j}\,\gamma(h-j)}{\gamma(0) - \sum_{j=1}^{h-1} \phi_{h-1,j}\,\gamma(j)}, \qquad (3)$$
since $P[Y_t|Y_{t-1:t-h+1}] = \sum_{j=1}^{h-1} \phi_{h-1,j} Y_{t-j}$ with $\phi_{h-1,j}$ the $j$-th element of the vector $\boldsymbol{\phi}_{h-1} = \Gamma_{h-1}^{-1}[\gamma(1), \ldots, \gamma(h-1)]^\top$.
Consider now the numerator of $\alpha(h+1)$ and substitute
$$P[Y_t|Y_{t-1:t-h}] = \sum_{j=1}^{h-1}\phi_{h-1,j}Y_{t-j} + \alpha(h)\Bigl(Y_{t-h} - \sum_{j=1}^{h-1}\phi_{h-1,h-j}Y_{t-j}\Bigr),$$
obtaining
$$\gamma(h+1) - \sum_{j=1}^{h-1}\phi_{h-1,j}\,\gamma(h+1-j) - \alpha(h)\Bigl(\gamma(1) - \sum_{j=1}^{h-1}\phi_{h-1,h-j}\,\gamma(h+1-j)\Bigr).$$
But, by equation (3), we have the alternative formula for the numerator of $\alpha(h+1)$,
$$\gamma(h+1) - \sum_{j=1}^{h}\phi_{h,j}\,\gamma(h+1-j),$$
and equating the coefficients with the same order of autocovariance we obtain
$$\phi_{h,h} = \alpha(h), \qquad \phi_{h,j} = \phi_{h-1,j} - \alpha(h)\,\phi_{h-1,h-j}, \qquad \text{for } j = 1, \ldots, h-1.$$
Let us denote by $v_{h-1}$ the denominator of $\alpha(h)$ in equation (3) and repeat the reasoning for the denominator of $\alpha(h+1)$, which will be named $v_h$:
$$\begin{aligned}
v_h &= \mathrm{E}\Bigl(Y_t - P[Y_t|Y_{t-1:t-h+1}] - \alpha(h)\bigl(Y_{t-h} - P[Y_{t-h}|Y_{t-1:t-h+1}]\bigr)\Bigr)Y_t\\
&= v_{h-1} - \alpha(h)\Bigl(\gamma(h) - \sum_{j=1}^{h-1}\phi_{h-1,j}\,\gamma(h-j)\Bigr)\\
&= v_{h-1} - \alpha(h)^2\Bigl(\gamma(0) - \sum_{j=1}^{h-1}\phi_{h-1,j}\,\gamma(j)\Bigr)\\
&= v_{h-1}\bigl(1 - \alpha(h)^2\bigr).
\end{aligned}$$
Since the population mean $\mu$, the autocovariances $\gamma(h)$, the autocorrelations $\rho(h)$ and the partial autocorrelations $\alpha(h)$ are generally unknown quantities, they need to be estimated from a time series. If we do not have a specific parametric model for our time series, the natural estimators for $\mu$ and $\gamma(h)$ are their sample counterparts:
$$\bar{Y}_n := \frac{1}{n}\sum_{t=1}^{n} Y_t, \qquad \hat\gamma(h) := \frac{1}{n}\sum_{t=h+1}^{n} (Y_t - \bar{Y}_n)(Y_{t-h} - \bar{Y}_n).$$
Note that in the sample autocovariance the divisor is $n$ and not $n - h$ (or $n - h - 1$) as one would expect from classical statistical inference; indeed, the latter divisors do not guarantee that the sample autocovariance function is nonnegative definite. Instead, the sample autocovariance matrix, whose generic element is the above defined $\hat\gamma(i - j)$, can be expressed as the product of a matrix and its transpose and, therefore, is always nonnegative definite:
$$\mathbf{Y} := \begin{bmatrix}
Y_1 - \bar{Y}_n & 0 & 0 & \cdots & 0\\
Y_2 - \bar{Y}_n & Y_1 - \bar{Y}_n & 0 & \cdots & 0\\
Y_3 - \bar{Y}_n & Y_2 - \bar{Y}_n & Y_1 - \bar{Y}_n & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
Y_n - \bar{Y}_n & Y_{n-1} - \bar{Y}_n & Y_{n-2} - \bar{Y}_n & \cdots & Y_1 - \bar{Y}_n\\
0 & Y_n - \bar{Y}_n & Y_{n-1} - \bar{Y}_n & \cdots & Y_2 - \bar{Y}_n\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & Y_n - \bar{Y}_n
\end{bmatrix},$$
a matrix with $2n-1$ rows and $n$ columns, so that the sample autocovariance matrix equals $n^{-1}\mathbf{Y}^\top\mathbf{Y}$.
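A small numerical check of this claim might look as follows; the function `sample_acvf` and the simulated Gaussian series are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

def sample_acvf(y, max_lag):
    """Sample autocovariances gamma_hat(0), ..., gamma_hat(max_lag), divisor n."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    return np.array([np.sum(d[h:] * d[:n - h]) / n for h in range(max_lag + 1)])

rng = np.random.default_rng(1)
y = rng.standard_normal(50)
n = len(y)

gam = sample_acvf(y, n - 1)
Gamma_hat = toeplitz(gam)        # matrix with generic element gamma_hat(i - j)

# The same matrix as (1/n) * Y'Y, with Y the (2n-1) x n matrix of shifted,
# demeaned observations padded with zeros.
d = y - y.mean()
Y = np.zeros((2 * n - 1, n))
for j in range(n):
    Y[j:j + n, j] = d

print(np.allclose(Gamma_hat, Y.T @ Y / n))             # True
print(np.linalg.eigvalsh(Gamma_hat).min() >= -1e-10)   # nonnegative definite
```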
Theorem 4 (Properties of the sample mean). Let $\{Y_t\}$ be a stationary process with mean $\mu$ and autocovariance function $\gamma(\cdot)$. Then the sample mean $\bar{Y}_n$ enjoys the following properties:
1. (Unbiasedness) $\mathrm{E}\bar{Y}_n = \mu$;
2. (Variance) $\operatorname{Var}(\bar{Y}_n) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\bigl(1 - \frac{|h|}{n}\bigr)\gamma(h)$;
3. (Normality) if $\{Y_t\}$ is a Gaussian process, then $\bar{Y}_n \sim N\bigl(\mu, \operatorname{Var}(\bar{Y}_n)\bigr)$;
4. (Consistency) if $\gamma(h) \to 0$ as $h \to \infty$, then $\bar{Y}_n$ converges to $\mu$ in mean square;
5. (Asymptotic variance) if $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then $n\,\mathrm{E}[\bar{Y}_n - \mu]^2 \to \sum_{h=-\infty}^{\infty}\gamma(h)$;
6. (Asymptotic normality) if $Y_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j Z_{t-j}$ with $Z_t \sim \mathrm{IID}(0, \sigma^2)$, $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty}\psi_j \neq 0$, then
$$\sqrt{n}\,(\bar{Y}_n - \mu) \xrightarrow{d} N\Bigl(0, \sum_{h=-\infty}^{\infty}\gamma(h)\Bigr).$$
Proof.
Unbiasedness. $\mathrm{E}\bar{Y}_n = \frac{1}{n}\sum_{t=1}^{n}\mathrm{E}Y_t = \mu$.
Variance.
$$\operatorname{Var}(\bar{Y}_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(Y_i, Y_j) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\gamma(i-j) = \frac{1}{n^2}\sum_{h=-n+1}^{n-1}\bigl(n - |h|\bigr)\gamma(h) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma(h).$$
Normality. The normality of the mean follows from the assumption of joint
Gaussianity of (Y1 , . . . , Yn ).
Consistency. In order to obtain the (mean-square) consistency of $\bar{Y}_n$, the quantity $\mathrm{E}(\bar{Y}_n - \mu)^2 = \operatorname{Var}(\bar{Y}_n)$ has to converge to zero. A sufficient condition for this to happen is $\gamma(h) \to 0$ as $h$ diverges. In fact, in this case we can always fix a small positive $\varepsilon$ and find a positive integer $N$ such that for all $h > N$, $|\gamma(h)| < \varepsilon$. Therefore, for $n > N + 1$,
$$\operatorname{Var}(\bar{Y}_n) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma(h) \le \frac{1}{n}\sum_{h=-n+1}^{n-1}|\gamma(h)| = \frac{1}{n}\sum_{h=-N}^{N}|\gamma(h)| + \frac{2}{n}\sum_{h=N+1}^{n-1}|\gamma(h)| \le \frac{1}{n}\sum_{h=-N}^{N}|\gamma(h)| + 2\varepsilon.$$
As $n$ diverges the first addend converges to zero (it is a finite quantity divided by $n$), while the second addend can be made arbitrarily small, and so, by the very definition of limit, $\operatorname{Var}(\bar{Y}_n) \to 0$.
Asymptotic variance. After multiplying the variance of $\bar{Y}_n$ by $n$, we have the following inequalities:
$$n\,\mathrm{E}[\bar{Y}_n - \mu]^2 = \sum_{h=-n+1}^{n-1}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma(h) \le \sum_{h=-n+1}^{n-1}|\gamma(h)|.$$
Therefore, a sufficient condition for the asymptotic variance to converge as $n \to \infty$ is that $\sum_{h=-n+1}^{n-1}|\gamma(h)|$ converges. Furthermore, by the Cesàro theorem,
$$\lim_{n\to\infty}\sum_{h=-n+1}^{n-1}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma(h) = \lim_{n\to\infty}\sum_{h=-n+1}^{n-1}\gamma(h).$$
Asymptotic normality. This is the central limit theorem for linear processes; for the proof refer to Brockwell and Davis (1991, Sec. 7.3), for instance.
Of course, if $\{Y_t\}$ is a Gaussian process, the sample mean is also Gaussian. If the process is not Gaussian, the distribution of the sample mean can be approximated by a normal only if some central limit theorem (CLT) for dependent processes applies; in the theorem above we provide one for linear processes, but there are alternative CLTs under weaker conditions (in particular under mixing conditions).
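A small Monte Carlo sketch of point 6 of the theorem, using an MA(1) with (non-Gaussian) centred exponential innovations as an illustrative linear process, is given below; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear process Y_t = mu + Z_t + psi1 * Z_{t-1}, with Z_t exponential(1) - 1,
# i.e. IID(0, 1) but not Gaussian.  (Illustrative choices.)
mu, psi1, n, n_rep = 1.0, 0.5, 500, 2000

means = np.empty(n_rep)
for r in range(n_rep):
    z = rng.exponential(1.0, size=n + 1) - 1.0
    y = mu + z[1:] + psi1 * z[:-1]
    means[r] = y.mean()

# Theorem: sqrt(n)(Ybar_n - mu) is approximately N(0, sum_h gamma(h)),
# and for this MA(1) sum_h gamma(h) = (1 + psi1)^2.
print("simulated variance of sqrt(n)(Ybar - mu):", n * means.var())
print("asymptotic variance:", (1.0 + psi1) ** 2)
```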
For the sample autocorrelation $\hat\rho(h) := \hat\gamma(h)/\hat\gamma(0)$ we provide the following result without proof; the proof is rather lengthy and cumbersome and can be found in Brockwell and Davis (2002, Sec. 7.3).
Theorem 5 (Asymptotic distribution of the sample ACF). Let $\{Y_t\}$ be the stationary process
$$Y_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j Z_{t-j}, \qquad Z_t \sim \mathrm{IID}(0, \sigma^2).$$
If $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and either $\mathrm{E}Z_t^4 < \infty$ or $\sum_{j=-\infty}^{\infty}\psi_j^2|j| < \infty$, then for each $h \in \{1, 2, \ldots\}$
$$\sqrt{n}\left(\begin{bmatrix}\hat\rho(1)\\ \vdots\\ \hat\rho(h)\end{bmatrix} - \begin{bmatrix}\rho(1)\\ \vdots\\ \rho(h)\end{bmatrix}\right) \xrightarrow{d} N(\mathbf{0}, \mathbf{V}),$$
with the generic $(i, j)$-th element of the covariance matrix $\mathbf{V}$ being
$$v_{ij} = \sum_{k=1}^{\infty}\bigl[\rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k)\bigr]\bigl[\rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k)\bigr].$$
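As a quick illustration, for an i.i.d. series the theorem gives $\mathbf{V} = \mathbf{I}$, so each $\hat\rho(h)$ is approximately $N(0, 1/n)$ and about 95% of the sample autocorrelations should fall within $\pm 1.96/\sqrt{n}$; the following sketch (with arbitrary sample size and number of lags) checks this on simulated data.

```python
import numpy as np

rng = np.random.default_rng(6)
n, H = 500, 20
y = rng.standard_normal(n)         # i.i.d., so rho(h) = 0 for h >= 1 and V = I

d = y - y.mean()
acvf = np.array([np.sum(d[h:] * d[:n - h]) / n for h in range(H + 1)])
rho_hat = acvf[1:] / acvf[0]

# Roughly 95% of the rho_hat(h) should fall inside +-1.96/sqrt(n).
band = 1.96 / np.sqrt(n)
print("fraction inside the band:", np.mean(np.abs(rho_hat) <= band))
```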
The $h$-th addend of the Ljung–Box statistic differs from the corresponding addend of the Box–Pierce statistic by
$$\frac{n(n+2)}{n-h}\,\hat\rho(h)^2 - n\,\hat\rho(h)^2 = n\,\hat\rho(h)^2\Bigl(\frac{n+2}{n-h} - 1\Bigr),$$
which converges in probability to zero as $n$ diverges.
The Ljung–Box statistic is more popular than the Box–Pierce since it approximates the asymptotic distribution better in small samples.
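As a sketch, the two portmanteau statistics in their usual textbook form (Box–Pierce $Q = n\sum_{h=1}^{H}\hat\rho(h)^2$, Ljung–Box $Q = n(n+2)\sum_{h=1}^{H}\hat\rho(h)^2/(n-h)$) can be computed as follows; the simulated white-noise series and the choice $H = 10$ are illustrative, and no degrees-of-freedom correction for estimated parameters is applied.

```python
import numpy as np
from scipy.stats import chi2

def sample_acf(y, max_lag):
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    acvf = np.array([np.sum(d[h:] * d[:n - h]) / n for h in range(max_lag + 1)])
    return acvf / acvf[0]

rng = np.random.default_rng(2)
y = rng.standard_normal(200)       # under the null hypothesis: white noise
n, H = len(y), 10

rho = sample_acf(y, H)[1:]         # rho_hat(1), ..., rho_hat(H)
box_pierce = n * np.sum(rho ** 2)
ljung_box = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, H + 1)))

# Both statistics are compared with a chi-square with H degrees of freedom.
print("Box-Pierce:", box_pierce, "p-value:", chi2.sf(box_pierce, df=H))
print("Ljung-Box:", ljung_box, "p-value:", chi2.sf(ljung_box, df=H))
```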
We conclude this section on stationary processes with a celebrated result that is also used as a justification for adopting the class of ARMA models as approximations to any stationary process.
Before presenting the result we need the concept of (linearly) deterministic processes.
A stationary process $\{Y_t\}$ is (linearly) deterministic if
$$\lim_{k\to\infty} \mathrm{E}\bigl(Y_t - P[Y_t|Y_{t-1}, \ldots, Y_{t-k}]\bigr)^2 = 0.$$
For example, consider the two processes
$$W_t = W \qquad \text{and} \qquad V_t = X\cos(\omega t) + Y\sin(\omega t),$$
where $W$, $X$ and $Y$ are random variables with zero means and finite variances; furthermore, $\operatorname{Var}(X) = \operatorname{Var}(Y) = \sigma^2$ and $\operatorname{Cov}(X, Y) = 0$. The two processes are stationary: in fact, their means are zero for all $t$ and their autocovariance functions,
$$\mathrm{E}W_tW_{t-h} = \mathrm{E}W^2, \qquad \mathrm{E}V_tV_{t-h} = \sigma^2\bigl[\cos(\omega t)\cos(\omega(t-h)) + \sin(\omega t)\sin(\omega(t-h))\bigr] = \sigma^2\cos(\omega h),$$
are invariant with respect to $t$.
Their linear predictions based on the past are
$$P[W_t|W_{t-1}] = W_{t-1} = W, \qquad P[V_t|V_{t-1}, V_{t-2}] = 2\cos(\omega)V_{t-1} - V_{t-2}$$
(the reader is invited to derive the latter formula by computing the optimal linear prediction and applying trigonometric identities).
For the first process it is evident that the prediction (W ) is identical to
the outcome (W ). In the second case, we need to show that the prediction
is equal to the process outcome:
$$\begin{aligned}
P[V_t|V_{t-1}, V_{t-2}] &= 2\cos(\omega)V_{t-1} - V_{t-2}\\
&= 2\cos(\omega)\bigl[X\cos(\omega(t-1)) + Y\sin(\omega(t-1))\bigr] - \bigl[X\cos(\omega(t-2)) + Y\sin(\omega(t-2))\bigr]\\
&= X\bigl[2\cos(\omega)\cos(\omega(t-1)) - \cos(\omega(t-2))\bigr] + Y\bigl[2\cos(\omega)\sin(\omega(t-1)) - \sin(\omega(t-2))\bigr]\\
&= X\cos(\omega t) + Y\sin(\omega t) = V_t,
\end{aligned}$$
where the last line is obtained by applying well-known trigonometric identities.
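The perfect predictability of the harmonic process can also be verified numerically, as in the following sketch (the values of $\omega$, $\sigma$ and the sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, omega, n = 1.0, 0.7, 200

# Draw X and Y once: the whole path is then fixed, hence deterministic given the past.
X, Y = rng.normal(0.0, sigma, size=2)
t = np.arange(n)
V = X * np.cos(omega * t) + Y * np.sin(omega * t)

# Linear prediction from the two previous values: 2 cos(omega) V_{t-1} - V_{t-2}.
prediction = 2 * np.cos(omega) * V[1:-1] - V[:-2]
print("max prediction error:", np.max(np.abs(V[2:] - prediction)))  # ~ 0
```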
Theorem (Wold decomposition). Every stationary process $\{Y_t\}$ can be decomposed as
$$Y_t = \sum_{j=0}^{\infty}\psi_j Z_{t-j} + V_t,$$
where
1. $\psi_0 = 1$ and $\sum_{j=0}^{\infty}\psi_j^2 < \infty$;
2. $Z_t$ is white noise;
3. $\operatorname{Cov}(Z_t, V_s) = 0$ for all $t$ and $s$;
4. $V_t$ is (linearly) deterministic;
5. $\lim_{k\to\infty}\mathrm{E}\bigl(Z_t - P[Z_t|Y_t, Y_{t-1}, \ldots, Y_{t-k}]\bigr)^2 = 0$;
6. $\lim_{k\to\infty}\mathrm{E}\bigl(V_t - P[V_t|Y_s, Y_{s-1}, \ldots, Y_{s-k}]\bigr)^2 = 0$ for all $t$ and $s$.
The message of this theorem is that every stationary process can be seen as the sum of two orthogonal components: one, the deterministic component, is perfectly predictable using a linear function of the past of the process (point 6.); the other, the purely non-deterministic component, is expressible as a (possibly) infinite linear combination of past and present observations of a white noise process $Z_t$. This process is the prediction error of $Y_t$ based on its (possibly) infinite past,
$$Z_t = Y_t - P[Y_t|Y_{t-1}, Y_{t-2}, \ldots],$$
and is generally termed the innovation of the process $Y_t$. Thus, the coefficients $\psi_j$ are the projection coefficients of $Y_t$ on its past innovations $Z_{t-j}$:
$$\psi_j = \mathrm{E}[Y_t Z_{t-j}]/\mathrm{E}[Z_{t-j}^2].$$
A process $\{Y_t\}$ is a linear process if it can be written as
$$Y_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j Z_{t-j}, \qquad t \in \mathbb{Z},$$
where $Z_t \sim \mathrm{WN}(0, \sigma^2)$, $\mu$ is a constant mean, and $\{\psi_j\}$ is a sequence of constants such that $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. If $\psi_j = 0$ for all $j < 0$, then $\{Y_t\}$ is a causal linear process. The autocovariance function of a linear process is
$$\gamma(h) = \sigma^2\sum_{j=-\infty}^{\infty}\psi_j\,\psi_{j+h},$$
with the summation starting from $j = 0$ and $h \ge 0$ in the causal case. One advantage of the absolute summability condition is that it implies the absolute summability of the autocovariance function:
$$\sum_{h=-\infty}^{\infty}|\gamma(h)| \le \sigma^2\sum_{h=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}|\psi_j|\,|\psi_{j+h}| = \sigma^2\Bigl(\sum_{j=-\infty}^{\infty}|\psi_j|\Bigr)^2 < \infty.$$
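A quick numerical illustration of the autocovariance formula, using a causal linear process with only three nonzero coefficients (an MA(2), an arbitrary choice), might look as follows.

```python
import numpy as np

# Causal linear process with psi_0 = 1, psi_1, psi_2 and all other psi_j = 0
# (an MA(2), chosen only as an illustration).
sigma2 = 2.0
psi = np.array([1.0, 0.4, -0.3])

def gamma(h, psi, sigma2):
    """gamma(h) = sigma^2 * sum_j psi_j psi_{j+h}, for the causal case with h >= 0."""
    h = abs(h)
    if h >= len(psi):
        return 0.0
    return sigma2 * np.sum(psi[:len(psi) - h] * psi[h:])

print([gamma(h, psi, sigma2) for h in range(4)])

# Monte Carlo check against the sample autocovariances of a simulated path.
rng = np.random.default_rng(4)
z = rng.normal(0.0, np.sqrt(sigma2), size=200_000)
y = np.convolve(z, psi, mode="valid")        # Y_t = sum_j psi_j Z_{t-j}
d = y - y.mean()
print([np.sum(d[h:] * d[:len(d) - h]) / len(d) for h in range(4)])
```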
References
Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods (2nd ed.). Springer.
Brockwell, P. J. and R. A. Davis (2002). Introduction to Time Series and Forecasting (2nd ed.). Springer.