
Time Series Analysis:

Theory, Data Analysis and Computation


by
Edward J. Wegman
Center for Computational Statistics
George Mason University
Fairfax, VA 22030

(c) Copyright 1990, 1994, 1996, Edward J. Wegman, All Rights Reserved

Chapter 1
Introduction

Time series analysis deals with probabilistic and structural inference about a sequence of data evolving through time. Normally, we shall write $X_1, \ldots, X_n$ for the time series observations. Frequently, this will be a sample from a continuous stochastic process $X_t(\omega)$, so that $X_k$ really means $X_{k\Delta t}$ where $\Delta t$ is the sampling interval. Our basic model conceptually is a stochastic process $X_t(\omega)$, where $t$ is the time index and $\omega \in \Omega$, where $(\Omega, \mathcal{F}, P)$ is a probability space and $P$ is a probability measure. In simple terms, we are interested in discovering $P$ based on a time series $X_1, \ldots, X_n$. Fortunately, or unfortunately, this is a ridiculously hard problem in general. Hence, we typically make simplifying assumptions. For the purposes of our discussion, we will usually only consider finite dimensional distributions, i.e.

$F_{X_{t_1},\ldots,X_{t_k}}(u_1,\ldots,u_k) = P(X_{t_1} \le u_1, \ldots, X_{t_k} \le u_k), \quad \forall k,\ \forall t_1,\ldots,t_k,\ \forall u_1,\ldots,u_k,$

which in the simple one-dimensional case is

$F_{X_t}(u) = P(X_t \le u).$

The well-known Kolmogorov Extension Theorem basically says that knowledge of all finite-dimensional distributions is sufficient to reconstruct the probability measure $P$. This theorem is well beyond the scope of our course, however, and hence will not be discussed any further.
We actually go even further in making assumptions and assume that the finite dimensional distributions are all multivariate normal (or Gaussian). In this case, it is sufficient to know only the first and second moments, because the first and second moments of a Gaussian distribution are necessary and sufficient to characterize that distribution. We let $\mu_t = E X_t$ and $\gamma_{s,t} = \operatorname{cov}(X_s, X_t)$. If we know $\mu_t$ and $\gamma_{s,t}$, then we have enough information to reconstruct all of the finite dimensional distributions and hence the probability measure, $P$.
We actually begin our discussion with even one more simplification. We basically want our process to
be invariant with a time shift, i.e. time homogeneous so that we will assume, at least initially, that the
probability structure does not shift with time. We write this as follows:
(1.1) $\quad F_{X_{t_1},\ldots,X_{t_k}}(u_1,\ldots,u_k) = F_{X_{t_1+\tau},\ldots,X_{t_k+\tau}}(u_1,\ldots,u_k)$

for every $t_1,\ldots,t_k$, for every $u_1,\ldots,u_k$, for every $k$ and for every $\tau$. The parameter $\tau$ is the constant time shift.
If this equation holds we say the process $X_t$ is strictly stationary.
A corollary to our Gaussian assumption is that strictly stationary implies means and covariances are
invariant with time shifts. In particular,
(1.2) $\quad \mu_t = \mu_{t+\tau}$

and

(1.3) $\quad \gamma_{t_1,t_2} = \gamma_{t_1+\tau,\,t_2+\tau}.$

Thus if we let $\tau = -t$ in equation (1.2), we have $\mu_t = \mu_0 = \mu$, a constant, and if we let $\tau = -t_2$ in equation (1.3), we have $\gamma_{t_1,t_2} = \gamma_{t_1-t_2,\,0}$.

Thus, the mean is constant; we will usually assume $\mu = 0$ for convenience. Moreover, the covariance depends only on the difference between the indices, called the lag. $\gamma_{t_1-t_2}$ is called the autocovariance or lag covariance. A process which has a constant first moment and a covariance function which depends only on the lag is said to be weakly stationary or wide-sense stationary. A process may be strictly stationary and not weakly stationary if it has no second-order moments. Conversely, a process may be weakly stationary, but not strictly stationary if, say, its third moments change with time. Usually there is no confusion since under Gaussian distribution assumptions, these two notions are equivalent. In this case we simply say the process is stationary.
The problem of estimating $P$ has been reduced to the much easier problem of estimating $\mu$ and $\gamma_\tau$. This is the probabilistic or statistical inference problem. We are also interested in structural inference, i.e. finding the functional structure of the $X_t$ process. We can do this in the frequency domain or the time domain.
In the frequency domain, we consider the function

(1.4) $\quad f(\omega) = \frac{1}{2\pi}\int e^{-i\omega\tau}\gamma_\tau\,d\tau.$

This is called the spectral density or power spectral density. The spectral density is the Fourier transform of the
covariance function and hence carries the same information as the covariance function.
The time domain models, usually assumed to be linear and stationary, are of the form:

(1.5) $\quad X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \epsilon_t + \theta_1\epsilon_{t-1} + \cdots + \theta_q\epsilon_{t-q},$

where $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ are parameters and $\epsilon_t$ is a white noise. This is called the autoregressive-moving average model (ARMA) and is the most general linear model of a stationary stochastic process. As we shall later see, the parameters $\phi = (\phi_1, \ldots, \phi_p)$ and $\theta = (\theta_1, \ldots, \theta_q)$ determine and are determined by the covariance function. Moreover, we will see that we can write the spectral density in terms of $\phi$ and $\theta$, so that we have a fundamental triad given in the figure below.
[Figure: the fundamental triad. Probabilistic domain: $\gamma_\tau$ (and $P$); Frequency domain: $f(\omega)$; Time domain: $(\phi, \theta)$.]
Basically given one corner, we can derive the other two. Hence, under the stationary, Gaussian
assumptions, analyses in the probability domain, the frequency domain, or the time domain are all basically
equivalent and fundamentally dependent upon the mean and covariance structure of the process. The nature of
the problem and what we are trying to infer will generally lead us to prefer one mode over another.
Generally speaking, engineering applications tend to succeed somewhat better with a frequency domain
model while business and economic applications tend to be somewhat better represented by time domain
models. This, of course, is not a hard and fast rule by any means. We will consider both frequency domain and
time domain models.
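To make the triad concrete, the following minimal sketch simulates a simple ARMA(1,1) process of the form (1.5) and computes its sample mean and sample autocovariances. The parameter values, function names, and the choice of an order-(1,1) model are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def simulate_arma11(n, phi=0.6, theta=0.3, sigma=1.0, seed=0):
    """Simulate X_t = phi*X_{t-1} + eps_t + theta*eps_{t-1} (illustrative ARMA(1,1))."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n + 1)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t + 1] + theta * eps[t]
    return x

def sample_autocovariance(x, max_lag):
    """gamma_hat(tau) = (1/n) * sum (X_t - Xbar)(X_{t+tau} - Xbar)."""
    n = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[:n - k] * xc[k:]) / n for k in range(max_lag + 1)])

x = simulate_arma11(500)
gamma = sample_autocovariance(x, 10)
print("sample mean:", x.mean())
print("sample autocovariances (lags 0..10):", np.round(gamma, 3))
```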

Chapter 2
Frequency Models
The major focus of the first part of our discussion is structural inference for time series in the frequency
domain. Historically, one of the earliest attempts to model this time series structure is known as the method of
hidden periodicities. The premise for such a technique is that the time series may be modeled by a linearly
superimposed sum of sinusoids together with some Gaussian errors. We will develop such a model in several
stages.
2.1. A Single Sinusoid in Noise. Consider the model $X_t = R\cos(\omega t + \phi)$, where the real-valued $\omega$ is called the (angular) frequency, $\phi \in [-\pi, \pi]$ is called the phase, and $R \ge 0$ is called the amplitude. The stochastic model is

(2.1.1) $\quad X_t = \mu + R\cos(\omega t + \phi) + \epsilon_t.$
Recall the trigonometric identity

(2.1.2) $\quad \cos(\alpha + \beta) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta).$

We may use (2.1.2) to obtain

(2.1.3) $\quad R\cos(\omega t + \phi) = R\cos(\omega t)\cos(\phi) - R\sin(\omega t)\sin(\phi).$

Equivalently then, we may rewrite (2.1.1) as

(2.1.4) $\quad X_t = \mu + A\cos(\omega t) + B\sin(\omega t) + \epsilon_t,$

where $A = R\cos(\phi)$ and $B = -R\sin(\phi)$.

Least Squares Solution with 0 Mean.


We use least squares for the estimation of $A$ and $B$. Set $\mu = 0$ temporarily and form the sum of squares (SS),

(2.1.5) $\quad \mathrm{SS} = \sum_{t=0}^{n-1}\big(X_t - A\cos(\omega t) - B\sin(\omega t)\big)^2 = \sum_{t=0}^{n-1}\epsilon_t^2.$
Taking partial derivatives with respect to $A$ and $B$, we obtain

$\frac{\partial\mathrm{SS}}{\partial A} = -\sum_{t=0}^{n-1}2\cos(\omega t)\big(X_t - A\cos(\omega t) - B\sin(\omega t)\big)$

and

$\frac{\partial\mathrm{SS}}{\partial B} = -\sum_{t=0}^{n-1}2\sin(\omega t)\big(X_t - A\cos(\omega t) - B\sin(\omega t)\big).$

Setting the partials to 0 and collecting coefficients,

$A\sum_{t=0}^{n-1}\cos^2(\omega t) + B\sum_{t=0}^{n-1}\cos(\omega t)\sin(\omega t) - \sum_{t=0}^{n-1}X_t\cos(\omega t) = 0,$

$A\sum_{t=0}^{n-1}\cos(\omega t)\sin(\omega t) + B\sum_{t=0}^{n-1}\sin^2(\omega t) - \sum_{t=0}^{n-1}X_t\sin(\omega t) = 0.$

Here we are letting $a = \sum_{t=0}^{n-1}\cos^2(\omega t)$, $b = \sum_{t=0}^{n-1}\cos(\omega t)\sin(\omega t)$ and $c = \sum_{t=0}^{n-1}\sin^2(\omega t)$, so that we may rewrite the normal equations as

$Aa + Bb = \sum_{t=0}^{n-1}X_t\cos(\omega t) \quad\text{and}\quad Ab + Bc = \sum_{t=0}^{n-1}X_t\sin(\omega t).$

Multiplying the first equation by $b$, the second by $a$, and subtracting the top from the bottom gives

$B(ac - b^2) - a\sum_{t=0}^{n-1}X_t\sin(\omega t) + b\sum_{t=0}^{n-1}X_t\cos(\omega t) = 0.$

Solving for $B$, we have

(2.1.6) $\quad \hat{B} = \frac{a\sum_{t=0}^{n-1}X_t\sin(\omega t) - b\sum_{t=0}^{n-1}X_t\cos(\omega t)}{ac - b^2}.$

Similarly, multiplying the first equation by $c$ and the second by $b$ and subtracting,

$A(ac - b^2) - c\sum_{t=0}^{n-1}X_t\cos(\omega t) + b\sum_{t=0}^{n-1}X_t\sin(\omega t) = 0,$

which yields

(2.1.7) $\quad \hat{A} = \frac{c\sum_{t=0}^{n-1}X_t\cos(\omega t) - b\sum_{t=0}^{n-1}X_t\sin(\omega t)}{ac - b^2}.$

Thus, writing out fully,

(2.1.8) $\quad \hat{A} = \frac{\sum\sin^2(\omega t)\sum X_t\cos(\omega t) - \sum\sin(\omega t)\cos(\omega t)\sum X_t\sin(\omega t)}{D}$

and

$\qquad\ \hat{B} = \frac{\sum\cos^2(\omega t)\sum X_t\sin(\omega t) - \sum\sin(\omega t)\cos(\omega t)\sum X_t\cos(\omega t)}{D},$

where

$D = ac - b^2 = \sum_{t=0}^{n-1}\sin^2(\omega t)\sum_{t=0}^{n-1}\cos^2(\omega t) - \Big(\sum_{t=0}^{n-1}\sin(\omega t)\cos(\omega t)\Big)^2.$

Closed Form of the Least Squares Solution with 0 Mean


To develop a closed form least squares solution, we may consider the identity

$\cos(\alpha)\cos(\beta) = \tfrac{1}{2}\{\cos(\alpha+\beta) + \cos(\alpha-\beta)\}.$

Letting $\alpha = \beta = \omega t$,

$\sum_{t=0}^{n-1}\cos^2(\omega t) = \tfrac{1}{2}\sum_{t=0}^{n-1}\{\cos(2\omega t) + \cos(0)\} = \tfrac{1}{2}\sum_{t=0}^{n-1}\cos(2\omega t) + \frac{n}{2}.$

To find a closed form expression for $\sum_{t=0}^{n-1}\cos(2\omega t)$, we recall the fundamental identity

$\cos(\omega t) = \frac{e^{i\omega t} + e^{-i\omega t}}{2},$

so that

$\sum_{t=0}^{n-1}\cos(\omega t) = \tfrac{1}{2}\Big[\sum_{t=0}^{n-1}e^{i\omega t} + \sum_{t=0}^{n-1}e^{-i\omega t}\Big].$

Each of these is a geometric series, so that

$\sum_{t=0}^{n-1}e^{i\omega t} = \frac{1 - e^{i\omega n}}{1 - e^{i\omega}} \quad\text{and}\quad \sum_{t=0}^{n-1}e^{-i\omega t} = \frac{1 - e^{-i\omega n}}{1 - e^{-i\omega}}.$

Combining these last two expressions, we obtain

$\sum_{t=0}^{n-1}\cos(\omega t) = \tfrac{1}{2}\Big[\frac{1 - e^{i\omega n}}{1 - e^{i\omega}} + \frac{1 - e^{-i\omega n}}{1 - e^{-i\omega}}\Big].$

Factoring $e^{i\omega n/2}$ out of the numerator of the first term, $e^{i\omega/2}$ out of its denominator, and $e^{-i\omega n/2}$ and $e^{-i\omega/2}$ out of the numerator and denominator of the second term, we have

$\sum_{t=0}^{n-1}\cos(\omega t) = \tfrac{1}{2}\Big[e^{i\omega(n-1)/2} + e^{-i\omega(n-1)/2}\Big]\frac{\sin(n\omega/2)}{\sin(\omega/2)},$

or

$\sum_{t=0}^{n-1}\cos(\omega t) = \cos\Big(\frac{(n-1)\omega}{2}\Big)\frac{\sin(n\omega/2)}{\sin(\omega/2)}.$

Thus, replacing $\omega$ by $2\omega$,

$\sum_{t=0}^{n-1}\cos(2\omega t) = \cos\big((n-1)\omega\big)\frac{\sin(n\omega)}{\sin(\omega)}.$

Finally we have

(2.1.9) $\quad \sum_{t=0}^{n-1}\cos^2(\omega t) = \frac{n}{2} + \frac{1}{2}\cos\big((n-1)\omega\big)\frac{\sin(n\omega)}{\sin(\omega)} = \frac{n}{2}\big\{1 + D_n(\omega)\cos\big((n-1)\omega\big)\big\}.$

The function $D_n(\omega) = \dfrac{\sin(n\omega)}{n\sin(\omega)}$ is called the Dirichlet Kernel. By similar arguments,

(2.1.10) $\quad \sum_{t=0}^{n-1}\cos(\omega t)\sin(\omega t) = \frac{n}{2}D_n(\omega)\sin\big((n-1)\omega\big)$

and

(2.1.11) $\quad \sum_{t=0}^{n-1}\sin^2(\omega t) = \frac{n}{2}\big\{1 - D_n(\omega)\cos\big((n-1)\omega\big)\big\}.$

Notice as verification

$\sum_{t=0}^{n-1}\sin^2(\omega t) + \sum_{t=0}^{n-1}\cos^2(\omega t) = \frac{n}{2}\big\{1 - D_n(\omega)\cos((n-1)\omega)\big\} + \frac{n}{2}\big\{1 + D_n(\omega)\cos((n-1)\omega)\big\} = n.$

Now we can write $\hat{A}$ and $\hat{B}$ in terms of the Dirichlet Kernel in closed form by substituting (2.1.9) through (2.1.11) in (2.1.8). We have defined $a$, $b$ and $c$ as follows:

(2.1.12) $\quad a = \frac{n}{2}\big\{1 + D_n(\omega)\cos\big((n-1)\omega\big)\big\}, \quad b = \frac{n}{2}D_n(\omega)\sin\big((n-1)\omega\big), \quad\text{and}\quad c = \frac{n}{2}\big\{1 - D_n(\omega)\cos\big((n-1)\omega\big)\big\}.$

Least Squares Solutions in Amplitude-Phase Form with 0 Mean


Next recall that $A = R\cos(\phi)$ and $B = -R\sin(\phi)$, so that

$\hat{A}^2 + \hat{B}^2 = R^2\big(\cos^2(\phi) + \sin^2(\phi)\big) = R^2.$

Therefore $\hat{R} = \sqrt{\hat{A}^2 + \hat{B}^2}$. Notice also that

$-\frac{B}{A} = \frac{R\sin(\phi)}{R\cos(\phi)} = \tan(\phi).$

We must be careful in calculating $\hat\phi$ since the arctangent is not unique. There are several cases to be considered.

Case 1. If $A > 0$, then $\cos(\phi) > 0$, which implies $-\pi/2 < \phi < \pi/2$. Thus we are on the principal branch, so that $\hat\phi = \arctan(-\hat{B}/\hat{A})$.

Case 2. If $A < 0$, $B < 0$, then $\cos(\phi) < 0$ and $\sin(\phi) > 0$, which implies $\phi \in (\pi/2, \pi)$. In this case $\hat\phi = \arctan(-\hat{B}/\hat{A}) + \pi$.

Case 3. If $A < 0$, $B > 0$, then $\cos(\phi) < 0$ and $\sin(\phi) < 0$, so that $\phi \in (-\pi, -\pi/2)$ and $\hat\phi = \arctan(-\hat{B}/\hat{A}) - \pi$.

Case 4. If $B = 0$ and $A > 0$, then $\sin(\phi) = 0$ and $\cos(\phi) > 0$, so $\hat\phi = 0$; if $B = 0$ and $A < 0$, then $\hat\phi = \pi$.

Case 5. If $A = 0$, $B > 0$, then $\sin(\phi) < 0$ and $\hat\phi = -\pi/2$.

Case 6. If $A = 0$, $B < 0$, then $\sin(\phi) > 0$ and $\hat\phi = \pi/2$.

Thus we can determine $\hat{R}$ and $\hat\phi$ from $\hat{A}$ and $\hat{B}$, which can in turn be estimated from the data.
Least Squares Solution with Non-Zero Mean
If $\mu$, the mean, is not equal to 0, then we may construct normal equations as follows. Consider again the sum of squares,

$\mathrm{SS} = \sum_{t=0}^{n-1}\epsilon_t^2 = \sum_{t=0}^{n-1}\big(X_t - \mu - A\cos(\omega t) - B\sin(\omega t)\big)^2.$

Taking partial derivatives and equating to 0 as before, we have

(2.1.13) $\quad \frac{\partial\mathrm{SS}}{\partial A} = -2\sum_{t=0}^{n-1}\big\{(X_t - \mu) - A\cos(\omega t) - B\sin(\omega t)\big\}\cos(\omega t) = 0,$

(2.1.14) $\quad \frac{\partial\mathrm{SS}}{\partial B} = -2\sum_{t=0}^{n-1}\big\{(X_t - \mu) - A\cos(\omega t) - B\sin(\omega t)\big\}\sin(\omega t) = 0,$

and

(2.1.15) $\quad \frac{\partial\mathrm{SS}}{\partial\mu} = -2\sum_{t=0}^{n-1}\big\{(X_t - \mu) - A\cos(\omega t) - B\sin(\omega t)\big\} = 0.$

Equations (2.1.13) and (2.1.14) are the same as our previous normal equations with $X_t - \mu$ substituted for $X_t$. Thus the solutions for $A$ and $B$ are

(2.1.16) $\quad \hat{A} = \frac{\sum\sin^2(\omega t)\sum(X_t - \hat\mu)\cos(\omega t) - \sum\sin(\omega t)\cos(\omega t)\sum(X_t - \hat\mu)\sin(\omega t)}{D}$

and

$\qquad\ \hat{B} = \frac{\sum\cos^2(\omega t)\sum(X_t - \hat\mu)\sin(\omega t) - \sum\sin(\omega t)\cos(\omega t)\sum(X_t - \hat\mu)\cos(\omega t)}{D},$

where again

$D = \sum_{t=0}^{n-1}\sin^2(\omega t)\sum_{t=0}^{n-1}\cos^2(\omega t) - \Big(\sum_{t=0}^{n-1}\sin(\omega t)\cos(\omega t)\Big)^2.$

Returning to (2.1.15), we have

$\sum_{t=0}^{n-1}X_t - n\mu - A\sum_{t=0}^{n-1}\cos(\omega t) - B\sum_{t=0}^{n-1}\sin(\omega t) = 0,$

which we can solve for $\mu$ by

(2.1.17) $\quad \hat\mu = \bar{X} - \frac{1}{n}\sum_{t=0}^{n-1}\big(\hat{A}\cos(\omega t) + \hat{B}\sin(\omega t)\big).$

Notice two dilemmas: 1. As formulated, the solution for $\hat{A}$ and $\hat{B}$ depends on $\hat\mu$, and similarly the solution for $\hat\mu$ depends on $\hat{A}$ and $\hat{B}$. 2. Notice also that the estimator for $\mu$ is not $\bar{X}$. Thus as things stand now we have a recursive, but not direct, solution for the case that $\mu \ne 0$. Remember that our first inclination would be to centralize the data by subtracting $\bar{X}$, but (2.1.17) tells us that we may not do this either and still have a least squares solution.
The Matrix Formulation of the General Least Squares Problem
Letting $a$, $b$ and $c$ have their previous meaning and now defining $d = \sum_{t=0}^{n-1}\cos(\omega t)$ and $e = \sum_{t=0}^{n-1}\sin(\omega t)$, we can write

(2.1.18) $\quad Aa + Bb + \mu d = \sum_{t=0}^{n-1}X_t\cos(\omega t),$

$\qquad\quad Ab + Bc + \mu e = \sum_{t=0}^{n-1}X_t\sin(\omega t),$

and

$\qquad\quad Ad + Be + \mu n = \sum_{t=0}^{n-1}X_t.$

In matrix form, we may write

(2.1.19) $\quad \begin{pmatrix} a & b & d \\ b & c & e \\ d & e & n \end{pmatrix}\begin{pmatrix} \hat{A} \\ \hat{B} \\ \hat\mu \end{pmatrix} = \begin{pmatrix} \sum_{t=0}^{n-1}X_t\cos(\omega t) \\ \sum_{t=0}^{n-1}X_t\sin(\omega t) \\ \sum_{t=0}^{n-1}X_t \end{pmatrix}.$

Taking the inverse matrix and solving for the unknown parameters, we have

(2.1.20) $\quad \begin{pmatrix} \hat{A} \\ \hat{B} \\ \hat\mu \end{pmatrix} = \begin{pmatrix} a & b & d \\ b & c & e \\ d & e & n \end{pmatrix}^{-1}\begin{pmatrix} \sum_{t=0}^{n-1}X_t\cos(\omega t) \\ \sum_{t=0}^{n-1}X_t\sin(\omega t) \\ \sum_{t=0}^{n-1}X_t \end{pmatrix}.$

This is a linear model.
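A minimal numerical sketch of the matrix formulation (2.1.19)-(2.1.20): for a fixed trial frequency $\omega$ it forms the $3\times 3$ normal-equation matrix and solves for $(\hat A, \hat B, \hat\mu)$. The simulated data and the particular frequency value are illustrative assumptions.

```python
import numpy as np

def fit_single_sinusoid(x, omega):
    """Exact least squares for X_t = mu + A cos(omega t) + B sin(omega t) + eps_t,
    following the 3x3 normal equations (2.1.19)-(2.1.20)."""
    n = len(x)
    t = np.arange(n)
    ct, st = np.cos(omega * t), np.sin(omega * t)
    a, b, c = np.sum(ct**2), np.sum(ct * st), np.sum(st**2)
    d, e = np.sum(ct), np.sum(st)
    M = np.array([[a, b, d],
                  [b, c, e],
                  [d, e, n]])
    rhs = np.array([np.sum(x * ct), np.sum(x * st), np.sum(x)])
    A_hat, B_hat, mu_hat = np.linalg.solve(M, rhs)
    return A_hat, B_hat, mu_hat

# Illustrative data: a noisy sinusoid observed at an assumed known frequency.
rng = np.random.default_rng(1)
n = 200
omega0 = 2 * np.pi * 12 / n
t = np.arange(n)
x = 1.5 + 2.0 * np.cos(omega0 * t + 0.7) + rng.normal(0, 1, n)
print(fit_single_sinusoid(x, omega0))
```

With $R\cos(\omega t + \phi) = A\cos(\omega t) + B\sin(\omega t)$, the fitted amplitude is $\hat R = \sqrt{\hat A^2 + \hat B^2}$ and the phase follows from the case analysis above.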

A Computationally Quick and Dirty Approximation


While the matrix formulation gives an exact solution, it does involve a matrix inversion. This is really
not a serious impediment to calculation of a solution.
Nonetheless, a simple quick approximation may be given. Consider the Dirichlet Kernel, $D_n(\omega) = \dfrac{\sin(n\omega)}{n\sin(\omega)}$. Now

(2.1.21) $\quad nD_n(\omega) = \frac{\sin(n\omega)}{\sin(\omega)} \le \frac{1}{\sin(\omega)},$

which is bounded for $\omega$ bounded away from 0 and $\pi$. Thus

(2.1.22) $\quad \sum_{t=0}^{n-1}\cos^2(\omega t) \approx \frac{n}{2}, \quad \sum_{t=0}^{n-1}\sin(\omega t)\cos(\omega t) \approx 0, \quad\text{and}\quad \sum_{t=0}^{n-1}\sin^2(\omega t) \approx \frac{n}{2}.$

Thus we obtain a quick approximate solution as follows:

(2.1.23) $\quad \hat{A} \approx \frac{2}{n}\sum_{t=0}^{n-1}(X_t - \hat\mu)\cos(\omega t), \quad \hat{B} \approx \frac{2}{n}\sum_{t=0}^{n-1}(X_t - \hat\mu)\sin(\omega t),$

and

$\qquad\quad \hat\mu \approx \bar{X}.$
Least Squares Estimation of the Optimal Frequency, $\omega$, in Model (2.1.1)

Using this formulation of the quick and dirty estimates of $\hat{A}$, $\hat{B}$ and $\hat\mu$, we may perform a calculation for the residual sum of squares,

$\mathrm{SS(residuals)} = \sum_{t=0}^{n-1}\big(X_t - \hat\mu - \hat{A}\cos(\omega t) - \hat{B}\sin(\omega t)\big)^2$

$= \sum_{t=0}^{n-1}(X_t - \hat\mu)^2 - 2\sum_{t=0}^{n-1}(X_t - \hat\mu)\big(\hat{A}\cos(\omega t) + \hat{B}\sin(\omega t)\big) + \sum_{t=0}^{n-1}\big(\hat{A}\cos(\omega t) + \hat{B}\sin(\omega t)\big)^2.$

Using $\sum(X_t - \hat\mu)\cos(\omega t) = \tfrac{n}{2}\hat{A}$ and $\sum(X_t - \hat\mu)\sin(\omega t) = \tfrac{n}{2}\hat{B}$ from (2.1.23), together with (2.1.22),

$\mathrm{SS(residuals)} \approx \sum_{t=0}^{n-1}(X_t - \hat\mu)^2 - n\hat{A}^2 - n\hat{B}^2 + \hat{A}^2\Big(\frac{n}{2}\Big) + \hat{B}^2\Big(\frac{n}{2}\Big) + 2\hat{A}\hat{B}\cdot 0.$

Therefore,

$\mathrm{SS(residuals)} \approx \sum_{t=0}^{n-1}(X_t - \hat\mu)^2 - \frac{n}{2}\big(\hat{A}^2 + \hat{B}^2\big).$

We may restructure this equation slightly into

$\mathrm{SS(corrected)} = \sum_{t=0}^{n-1}(X_t - \hat\mu)^2 = \mathrm{SS(residuals)} + \mathrm{SS(due\ to\ model\ at\ }\omega),$

where we define

$\mathrm{SS(due\ to\ model\ at\ }\omega) = \frac{n}{2}\big(\hat{A}^2 + \hat{B}^2\big) = \frac{n}{2}\hat{R}^2.$

We recall that $R^2$ is the amplitude squared of the cosine model (2.1.1) and note that we can regard $\hat{R}^2$ as a function of the frequency, $\hat{R}^2(\omega)$. This suggests that we can also estimate $\omega$. Thus, the least squares estimate of $\omega$ is the value that minimizes SS(residuals) or, equivalently, maximizes $\hat{R}^2(\omega)$. $\hat{R}^2(\omega)$, it turns out, plays an interesting role in frequency domain analysis. As we shall see later, $I_n(\omega) = \dfrac{n}{8\pi}\hat{R}^2(\omega)$ is called the periodogram.
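The following minimal sketch scans a grid of candidate frequencies, computes the quick-and-dirty $\hat A(\omega)$, $\hat B(\omega)$, and hence $\hat R^2(\omega)$, and picks the frequency maximizing $\hat R^2(\omega)$ (equivalently minimizing the residual sum of squares). The grid choice and the simulated data are illustrative assumptions.

```python
import numpy as np

def r_squared(x, omega):
    """Quick-and-dirty amplitude squared: R^2(omega) = A(omega)^2 + B(omega)^2."""
    n = len(x)
    t = np.arange(n)
    xc = x - x.mean()
    A = (2.0 / n) * np.sum(xc * np.cos(omega * t))
    B = (2.0 / n) * np.sum(xc * np.sin(omega * t))
    return A * A + B * B

rng = np.random.default_rng(2)
n = 256
t = np.arange(n)
true_omega = 2 * np.pi * 20 / n
x = 3.0 * np.cos(true_omega * t + 1.0) + rng.normal(0, 1, n)

# Scan the Fourier frequencies 2*pi*j/n, j = 1, ..., n/2 - 1.
grid = 2 * np.pi * np.arange(1, n // 2) / n
r2 = np.array([r_squared(x, w) for w in grid])
omega_hat = grid[np.argmax(r2)]
print("true omega:", true_omega, "estimated omega:", omega_hat)
```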

2.2 The Generalized Hidden Periodicities Model


We now consider the generalized model which is the sum of sinusoidal components,

(2.2.1) $\quad X_t = \mu + \sum_{j=1}^{m}\{A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\} + \epsilon_t, \quad t = 0, \ldots, n-1.$

In this case,

(2.2.2) $\quad \mathrm{SS} = \sum_{t=0}^{n-1}\epsilon_t^2 = \sum_{t=0}^{n-1}\Big(X_t - \mu - \sum_{j=1}^{m}\{A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\}\Big)^2.$

Taking derivatives and equating to 0, we find we have $2m + 1$ simultaneous equations:

$\frac{\partial\mathrm{SS}}{\partial A_k} = -2\sum_{t=0}^{n-1}\Big(X_t - \mu - \sum_{j=1}^{m}\{A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\}\Big)\cos(\omega_k t) = 0, \quad k = 1, \ldots, m,$

$\frac{\partial\mathrm{SS}}{\partial B_k} = -2\sum_{t=0}^{n-1}\Big(X_t - \mu - \sum_{j=1}^{m}\{A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\}\Big)\sin(\omega_k t) = 0, \quad k = 1, \ldots, m,$

and

$\frac{\partial\mathrm{SS}}{\partial\mu} = -2\sum_{t=0}^{n-1}\Big(X_t - \mu - \sum_{j=1}^{m}\{A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\}\Big) = 0.$

Letting $a_{jk}$, $b_{jk}$, $c_{jk}$, $d_k$ and $e_k$ be defined as follows:

(2.2.3) $\quad a_{jk} = \sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t), \quad b_{jk} = \sum_{t=0}^{n-1}\sin(\omega_j t)\cos(\omega_k t), \quad c_{jk} = \sum_{t=0}^{n-1}\sin(\omega_j t)\sin(\omega_k t),$

$\qquad\quad d_k = \sum_{t=0}^{n-1}\cos(\omega_k t) \quad\text{and}\quad e_k = \sum_{t=0}^{n-1}\sin(\omega_k t).$
We may then rewrite the normal equations as

(2.2.4) $\quad \sum_{j=1}^{m}A_j a_{jk} + \sum_{j=1}^{m}B_j b_{jk} + \mu d_k - \sum_{t=0}^{n-1}X_t\cos(\omega_k t) = 0, \quad k = 1, \ldots, m,$

$\qquad\quad \sum_{j=1}^{m}A_j b_{jk} + \sum_{j=1}^{m}B_j c_{jk} + \mu e_k - \sum_{t=0}^{n-1}X_t\sin(\omega_k t) = 0, \quad k = 1, \ldots, m,$

and

$\qquad\quad \sum_{j=1}^{m}A_j d_j + \sum_{j=1}^{m}B_j e_j + n\mu - \sum_{t=0}^{n-1}X_t = 0.$

In matrix form we can write

(2.2.5) $\quad \mathbf{a} = (a_{jk}), \quad \mathbf{b} = (b_{jk}), \quad \mathbf{c} = (c_{jk}), \quad \mathbf{d} = (d_1, \ldots, d_m)^T, \quad \mathbf{e} = (e_1, \ldots, e_m)^T,$

$\qquad\quad \mathbf{A} = (A_1, \ldots, A_m)^T, \quad \mathbf{B} = (B_1, \ldots, B_m)^T,$

and finally,

(2.2.6) $\quad \mathbf{S} = \Big(\sum_{t=0}^{n-1}X_t\cos(\omega_1 t), \ldots, \sum_{t=0}^{n-1}X_t\cos(\omega_m t),\ \sum_{t=0}^{n-1}X_t\sin(\omega_1 t), \ldots, \sum_{t=0}^{n-1}X_t\sin(\omega_m t),\ \sum_{t=0}^{n-1}X_t\Big)^T.$

In block matrix form, we may then write

(2.2.7) $\quad \begin{pmatrix} \mathbf{a} & \mathbf{b} & \mathbf{d} \\ \mathbf{b}^T & \mathbf{c} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & n \end{pmatrix}\begin{pmatrix} \hat{\mathbf{A}} \\ \hat{\mathbf{B}} \\ \hat\mu \end{pmatrix} = \mathbf{S}.$

Thus, taking inverses, we may write

(2.2.8) $\quad \begin{pmatrix} \hat{\mathbf{A}} \\ \hat{\mathbf{B}} \\ \hat\mu \end{pmatrix} = \begin{pmatrix} \mathbf{a} & \mathbf{b} & \mathbf{d} \\ \mathbf{b}^T & \mathbf{c} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & n \end{pmatrix}^{-1}\mathbf{S}.$
Closed form expressions for $a_{jk}$, $b_{jk}$ and $c_{jk}$.

It is clear that expressions of the form $a_{jk} = \sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t)$ will play a critical role in evaluating equation (2.2.8). Let us temporarily divert our attention to evaluating expressions of this sort. Let us restrict our attention to $0 \le \omega_j \le 2\pi$, and we will eventually want to consider $\omega_j = 2\pi(j/n)$, $j = 0, 1, \ldots, n-1$. Thus, just as we consider sampled times $t = 0, \ldots, n-1$, we consider sampled frequencies $\omega_j$, $j = 0, \ldots, n-1$.

Recall the identity we used earlier,

$\cos(a)\cos(b) = \tfrac{1}{2}\{\cos(a+b) + \cos(a-b)\}.$

We may thus write

$\sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t) = \tfrac{1}{2}\sum_{t=0}^{n-1}\cos\big((\omega_j + \omega_k)t\big) + \tfrac{1}{2}\sum_{t=0}^{n-1}\cos\big((\omega_j - \omega_k)t\big).$

Using the earlier closed form solution and the Dirichlet formula, we may write

$\sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t) = \frac{n}{2}\cos\Big(\frac{(n-1)(\omega_j + \omega_k)}{2}\Big)D_n\Big(\frac{\omega_j + \omega_k}{2}\Big) + \frac{n}{2}\cos\Big(\frac{(n-1)(\omega_j - \omega_k)}{2}\Big)D_n\Big(\frac{\omega_j - \omega_k}{2}\Big),$

where $D_n(x) = \dfrac{\sin(nx)}{n\sin(x)}$. Other cross-product terms may be derived in a similar manner.


If, in particular, $\omega_j = 2\pi j/n$, then $\omega_j + \omega_k = 2\pi(j+k)/n$. If $j \ne k \bmod(n)$, then the first Dirichlet function $D_n\big(\pi(j+k)/n\big)$ has a numerator equal to $\sin\big(\pi(j+k)\big) = 0$ since $j+k$ is always an integer. In the same way, $D_n\big(\pi(j-k)/n\big)$ has a numerator equal to $\sin\big(\pi(j-k)\big) = 0$ since $j-k$ is also always an integer. If $j = k \bmod(n)$, then $k = j + nm$ for some integer $m$. Thus

(2.2.9) $\quad \cos(\omega_k t) = \cos(2\pi kt/n) = \cos\big(2\pi(j+nm)t/n\big) = \cos(2\pi jt/n) = \cos(\omega_j t).$

Since we have $\omega_j = 2\pi j/n$,

$\sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t) = \sum_{t=0}^{n-1}\cos^2(\omega_j t) = \frac{n}{2}\big\{1 + D_n(2\pi j/n)\cos\big((n-1)\omega_j\big)\big\}.$

Again $D_n(2\pi j/n)$ has a numerator equal to $\sin(2\pi j) = 0$. Hence, for $\omega_j = 2\pi j/n$,

(2.2.10) $\quad \sum_{t=0}^{n-1}\cos(\omega_j t)\cos(\omega_k t) = \frac{n}{2}, \quad j = k \bmod(n).$

Finally we may conclude that, for $\omega_j = 2\pi j/n$,

(2.2.11) $\quad a_{jk} = \begin{cases} n/2, & j = k \\ 0, & j \ne k. \end{cases}$

Similarly,

(2.2.12) $\quad b_{jk} = 0 \quad\text{for all } j, k,$

and

(2.2.13) $\quad c_{jk} = \begin{cases} n/2, & j = k \\ 0, & j \ne k. \end{cases}$

Finally we note that for $\omega_j = 2\pi j/n$, $d_j = e_j = 0$.
Generalized Quick and Dirty Methodology
Based on substituting the results listed above as (2.2.11) through (2.2.13) into equation (2.2.8), it is not difficult to see that the quick and dirty formulae we had for the single sinusoid case generalize directly to the multiple sinusoid case. In particular, the matrix in equation (2.2.8) becomes a diagonal matrix and the set of equations decouples into

$\hat{A}_j = \frac{2}{n}\sum_{t=0}^{n-1}(X_t - \hat\mu)\cos(\omega_j t), \quad 0 < j \le n-1,$

and

$\hat{B}_j = \frac{2}{n}\sum_{t=0}^{n-1}(X_t - \hat\mu)\sin(\omega_j t), \quad 0 < j \le n-1,$

and, because of symmetry,

$\hat\mu = \hat{A}_0 = \frac{1}{n}\sum_{t=0}^{n-1}X_t = \bar{X},$

which are as before the quick and dirty estimators.

Chapter 3
Fourier Transforms and Harmonic Analysis
With the basic theory of frequency domain models now in hand, it is important to develop diagnostic
tools to structure these models. One of the most powerful of these diagnostic tools is harmonic analysis which
may be intuitively thought of as the decomposition of a time series into its sinusoidal components. Roughly
speaking, our general frequency domain model (2.2.1) suggests that we conceive of a time series as the
superposition of sinusoidal components. The determination of which of these frequency components, $\omega_j$, are the critical ones is the point of harmonic analysis, and the basic mathematical tool for doing this is the Fourier transform. We begin by considering complex sinusoids. Again we will be assuming throughout the next several sections that $\omega_j = 2\pi j/n$, as we did in the last section.
Complex Sinusoids
Recall that $e^{i\theta} = \cos(\theta) + i\sin(\theta)$, which is the so-called complex sinusoid. From this it is easy to see that $e^{-i\theta} = \cos(-\theta) + i\sin(-\theta) = \cos(\theta) - i\sin(\theta)$. From this it is easy to derive the fundamental identity,

(3.1.1) $\quad \cos(\theta) = \frac{e^{i\theta} + e^{-i\theta}}{2},$

by addition. Similarly, by subtraction,

(3.1.2) $\quad \sin(\theta) = \frac{e^{i\theta} - e^{-i\theta}}{2i},$

an identity we used earlier. These are called the Eulerian identities.

Now

$\sum_{t=0}^{n-1}e^{i\omega_j t}e^{-i\omega_k t} = \sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \sum_{t=0}^{n-1}\cos\big((\omega_j - \omega_k)t\big) + i\sum_{t=0}^{n-1}\sin\big((\omega_j - \omega_k)t\big).$

By our earlier derivation of sums of sines and cosines, we have

$\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \Big[\cos\Big(\frac{(n-1)(\omega_j - \omega_k)}{2}\Big) + i\sin\Big(\frac{(n-1)(\omega_j - \omega_k)}{2}\Big)\Big]\, n\, D_n\Big(\frac{\omega_j - \omega_k}{2}\Big).$

But $\omega_j - \omega_k = \dfrac{2\pi(j-k)}{n}$, so that $\sin\Big(\dfrac{n(\omega_j - \omega_k)}{2}\Big) = \sin\big(\pi(j-k)\big) = 0$, since the sine of an integer multiple of $\pi$ is 0. Thus the ratio

$nD_n\Big(\frac{\omega_j - \omega_k}{2}\Big) = \sin\Big(\frac{n(\omega_j - \omega_k)}{2}\Big)\Big/\sin\Big(\frac{\omega_j - \omega_k}{2}\Big)$

is 0 unless $\omega_j - \omega_k = 0$, in which case the ratio is undefined. Since this ratio is 0, it follows that

$\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = 0 \quad\text{unless}\quad \omega_j - \omega_k = 0.$

Now if $\omega_j - \omega_k = 0$, then

$\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \sum_{t=0}^{n-1}1 = n.$

Thus we have

(3.1.3) $\quad \sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \begin{cases} n, & j = k \bmod(n) \\ 0, & \text{otherwise.} \end{cases}$

This is called the orthogonality property.


The Discrete Fourier Transform
We now consider a time series, $X_0, \ldots, X_{n-1}$, and define

(3.1.4) $\quad J_j = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t}, \quad j = 0, 1, \ldots, n-1.$

The set of possibly complex numbers $J_0, \ldots, J_{n-1}$ is the discrete Fourier transform (DFT) of $X_0, \ldots, X_{n-1}$. To give a bit of intuition to our interpretation of the DFT, suppose $X_t$ and $Y_t$ are two time series, each with 0 mean. Then the expression

$\widehat{\operatorname{cov}}(X, Y) = \frac{1}{n}\sum_{t=0}^{n-1}X_t Y_t$

is the sample covariance of the two time series. Now if we take $Y_t = e^{-i\omega_j t} = \cos(\omega_j t) - i\sin(\omega_j t)$ as the fundamental complex sinusoid of frequency $\omega_j$, then $J_j$ measures the degree of "correlation" of $X_t$ with this fundamental sinusoid of frequency $\omega_j$. Of course, this is not a true correlation in the strict statistical sense since $J_j$ may be complex valued. However, if we look at the magnitude of $J_j$, i.e. $|J_j|$, we do indeed get something very close to a correlation. We shall examine this in more detail later. Let us notice now, however, that $\omega_{n-j} = 2\pi(n-j)/n = 2\pi - 2\pi j/n$. Thus $\exp(-i\omega_{n-j}t) = \exp(-2\pi it)\exp(2\pi ijt/n)$. Because $-2\pi t$ is an integer multiple of $2\pi$, we have $\exp(-2\pi it) = 1$ and $\exp(-i\omega_{n-j}t) = \exp(i\omega_j t)$. Thus

(3.1.5) $\quad J_{n-j} = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_{n-j}t} = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{i\omega_j t}.$

Moreover, if $X_t$ is real, then $X_t^* = X_t$ ($*$ will be used to indicate the complex conjugate). Thus, if $z = a + ib$, then $z^* = a - ib$. Consider then, for real $X_t$,

(3.1.6) $\quad J_j^* = \Big(\frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t}\Big)^* = \frac{1}{n}\sum_{t=0}^{n-1}X_t^* e^{i\omega_j t} = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{i\omega_j t}.$

Combining (3.1.5) and (3.1.6), we have $J_{n-j} = J_j^*$. This is called the Hermitian or symmetry property. Now let $X_t = Y_t + Z_t$. Then

$J_{X,j} = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t} = \frac{1}{n}\sum_{t=0}^{n-1}(Y_t + Z_t)e^{-i\omega_j t} = \frac{1}{n}\sum_{t=0}^{n-1}Y_t e^{-i\omega_j t} + \frac{1}{n}\sum_{t=0}^{n-1}Z_t e^{-i\omega_j t} = J_{Y,j} + J_{Z,j}.$
This is called the linear or superposition property. Next we claim that

(3.1.7) $\quad X_t = \sum_{j=0}^{n-1}J_j e^{i\omega_j t} \quad\text{if}\quad J_j = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t}.$

We consider then

(3.1.8) $\quad \frac{1}{n}\sum_{t=0}^{n-1}\sum_{j=0}^{n-1}J_j e^{i\omega_j t}e^{-i\omega_k t} = \frac{1}{n}\sum_{j=0}^{n-1}J_j\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t}.$

Now by (3.1.3),

(3.1.9) $\quad \frac{1}{n}\sum_{j=0}^{n-1}J_j\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \frac{1}{n}\sum_{j=0}^{n-1}J_j\times\begin{cases} n, & j = k \bmod(n) \\ 0, & j \ne k \bmod(n) \end{cases} = J_k.$

Now because $J_k = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_k t}$, we may identify $X_t$ in (3.1.8) with $\sum_{j=0}^{n-1}J_j e^{i\omega_j t}$. These are Fourier transform pairs.
An interesting connection may be made with model (2.2.1). Choose this model with $\mu = 0$ and recall the quick and dirty solution for the $A_j$ and the $B_j$. In particular,

(3.1.10) $\quad \hat{A}_j = \frac{2}{n}\sum_{t=0}^{n-1}X_t\cos(\omega_j t), \quad 0 < j \le n-1,$

and

(3.1.11) $\quad \hat{B}_j = \frac{2}{n}\sum_{t=0}^{n-1}X_t\sin(\omega_j t), \quad 0 < j \le n-1.$

Consider the discrete Fourier transform

$J_j = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t} = \frac{1}{n}\sum_{t=0}^{n-1}X_t\cos(\omega_j t) - \frac{i}{n}\sum_{t=0}^{n-1}X_t\sin(\omega_j t).$

Thus we have

(3.1.12) $\quad J_j = \tfrac{1}{2}\hat{A}_j - \tfrac{i}{2}\hat{B}_j.$

Consider

(3.1.13) $\quad |J_j|^2 = J_j J_j^* = \tfrac{1}{4}\big(\hat{A}_j - i\hat{B}_j\big)\big(\hat{A}_j + i\hat{B}_j\big) = \tfrac{1}{4}\big(\hat{A}_j^2 + \hat{B}_j^2\big) = \tfrac{1}{4}\hat{R}_j^2.$

Thus the square modulus of the Fourier transform is related to the fundamental model. Recall from before that we had defined $R_j$ by the expression

(3.1.14) $\quad R_j\cos(\omega_j t + \phi_j) = A_j\cos(\omega_j t) + B_j\sin(\omega_j t).$

Thus the quick and dirty solution to (2.2.1) can be computed in terms of the real and imaginary components of the Fourier transform of the time series, i.e. $\hat{A}_j = 2\,\mathrm{real}(J_j)$ and $\hat{B}_j = -2\,\mathrm{imaginary}(J_j)$.

The use of $J_j$ as a diagnostic tool is to be explored a bit further. Basically we gave a "correlation"-type interpretation to $J_j$. Because $4J_jJ_j^* = \hat{R}_j^2$ is the amplitude squared of the frequency component at $\omega_j$, by plotting $\hat{R}_j^2$ we can get an idea of the power of the frequency component at $\omega_j$.
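A small numerical sketch of the DFT definition (3.1.4) and its relation to the quick-and-dirty amplitude estimates, $A_j = 2\,\mathrm{real}(J_j)$, $B_j = -2\,\mathrm{imaginary}(J_j)$ and $R_j^2 = 4|J_j|^2$. The simulated data are illustrative; the $1/n$ scaling follows the convention used here, whereas library FFT routines usually omit it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 128
t = np.arange(n)
x = 2.0 * np.cos(2 * np.pi * 10 * t / n + 0.5) + rng.normal(0, 0.5, n)

# DFT with the 1/n convention of (3.1.4): J_j = (1/n) sum_t X_t exp(-i*omega_j*t).
J = np.fft.fft(x) / n

A = 2 * J.real           # A_j = 2 Re(J_j)
B = -2 * J.imag          # B_j = -2 Im(J_j)
R2 = 4 * np.abs(J)**2    # R_j^2 = A_j^2 + B_j^2

j = np.argmax(R2[1:n // 2]) + 1
print("dominant Fourier frequency index:", j)        # expected 10 for this example
print("estimated amplitude:", np.sqrt(R2[j]))        # roughly 2
print("Hermitian check, J[n-j] == conj(J[j]):", np.allclose(J[n - j], np.conj(J[j])))
```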
Fourier Transforms for a Continuous Range of Frequencies
A natural question concerns the extension of the concept of Fourier transforms to a continuous range of frequencies. Consider

(3.1.15) $\quad J(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t}$

as the natural extension of $J_j$. Let us consider the following facts. If $n = 0$,

$\int_{-\pi}^{\pi}e^{i\omega n}\,d\omega = 2\pi.$

However, if $n \ne 0$ and $n$ is an integer,

$\int_{-\pi}^{\pi}\cos(\omega n)\,d\omega = \frac{\sin(\pi n)}{n} - \frac{\sin(-\pi n)}{n} = \frac{0 - 0}{n} = 0.$

Also,

$\int_{-\pi}^{\pi}\sin(\omega n)\,d\omega = -\frac{\cos(\pi n) - \cos(-\pi n)}{n} = 0.$

Thus

$\int_{-\pi}^{\pi}e^{i\omega n}\,d\omega = 0.$

Combining these results we have

(3.1.16) $\quad \int_{-\pi}^{\pi}e^{i\omega n}\,d\omega = \begin{cases} 2\pi, & n = 0 \\ 0, & n \ne 0 \text{ and an integer.} \end{cases}$

This result is analogous to the orthogonality result in the discrete case. We may derive

(3.1.17) $\quad X_t = \frac{n}{2\pi}\int_{-\pi}^{\pi}J(\omega)e^{i\omega t}\,d\omega, \quad 0 \le t < n,$

so that $J(\omega)$ and $X_t$ are Fourier transform pairs as before. To see this, consider

$\frac{n}{2\pi}\int_{-\pi}^{\pi}J(\omega)e^{i\omega t}\,d\omega = \frac{n}{2\pi}\int_{-\pi}^{\pi}\frac{1}{n}\sum_{u=0}^{n-1}X_u e^{-i\omega u}e^{i\omega t}\,d\omega$

by definition. Re-ordering,

$\frac{n}{2\pi}\int_{-\pi}^{\pi}J(\omega)e^{i\omega t}\,d\omega = \frac{1}{2\pi}\sum_{u=0}^{n-1}X_u\int_{-\pi}^{\pi}e^{i\omega(t-u)}\,d\omega.$

But by (3.1.16),

$\frac{n}{2\pi}\int_{-\pi}^{\pi}J(\omega)e^{i\omega t}\,d\omega = \frac{1}{2\pi}\sum_{u=0}^{n-1}X_u\times\begin{cases} 2\pi, & t = u \\ 0, & t \ne u \end{cases} = \frac{1}{2\pi}X_t\,2\pi = X_t.$

Thus we have shown (3.1.17). As before, if $X_t$ is real,

(3.1.18) $\quad J(-\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i(-\omega)t} = \Big(\frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t}\Big)^* = J(\omega)^*.$

Parseval's Relation
Now let us consider

$\sum_{t=0}^{n-1}|X_t|^2 = \sum_{t=0}^{n-1}X_t X_t^* = \sum_{t=0}^{n-1}\Big(\sum_{j=0}^{n-1}J_j e^{i\omega_j t}\Big)\Big(\sum_{k=0}^{n-1}J_k^* e^{-i\omega_k t}\Big).$

We may reorder the summations to obtain

$\sum_{t=0}^{n-1}|X_t|^2 = \sum_{j=0}^{n-1}\sum_{k=0}^{n-1}J_j J_k^*\sum_{t=0}^{n-1}e^{i(\omega_j - \omega_k)t} = \sum_{j=0}^{n-1}\sum_{k=0}^{n-1}J_j J_k^*\times\begin{cases} n, & j = k \bmod(n) \\ 0, & j \ne k \bmod(n). \end{cases}$

A minor simplification yields

$\sum_{t=0}^{n-1}|X_t|^2 = n\sum_{j=0}^{n-1}J_j J_j^* = n\sum_{j=0}^{n-1}|J_j|^2.$

The equation

(3.1.19) $\quad \sum_{t=0}^{n-1}|X_t|^2 = n\sum_{j=0}^{n-1}|J_j|^2$

is called Parseval's relation. Similarly,

$\int_{-\pi}^{\pi}|J(\omega)|^2\,d\omega = \int_{-\pi}^{\pi}\frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t}\,\frac{1}{n}\sum_{u=0}^{n-1}X_u^* e^{i\omega u}\,d\omega.$

Interchanging summations and integration yields

$\int_{-\pi}^{\pi}|J(\omega)|^2\,d\omega = \frac{1}{n^2}\sum_{t=0}^{n-1}\sum_{u=0}^{n-1}X_t X_u^*\int_{-\pi}^{\pi}e^{i\omega(u-t)}\,d\omega.$

Using (3.1.16),

$\int_{-\pi}^{\pi}|J(\omega)|^2\,d\omega = \frac{1}{n^2}\sum_{t=0}^{n-1}\sum_{u=0}^{n-1}X_t X_u^*\times\begin{cases} 2\pi, & t = u \\ 0, & \text{otherwise.} \end{cases}$

From this it follows that

(3.1.20) $\quad \int_{-\pi}^{\pi}|J(\omega)|^2\,d\omega = \frac{2\pi}{n^2}\sum_{t=0}^{n-1}X_t X_t^* = \frac{2\pi}{n^2}\sum_{t=0}^{n-1}|X_t|^2.$
This is also called Parseval's relation. We shall see later that there is yet another form of this relation in
continuous time. Basically this relation says that the sum of squares in time domain is equal to the sum of
squares in frequency domain. One important implication is that the least squares solutions for a fitting problem
may be done in either frequency or time domain depending on which is more convenient.
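A quick numerical check of the discrete Parseval relation (3.1.19), $\sum_t|X_t|^2 = n\sum_j|J_j|^2$, under the $1/n$ DFT convention used here; the random data are merely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)

J = np.fft.fft(x) / len(x)             # J_j with the 1/n convention of (3.1.4)
lhs = np.sum(np.abs(x)**2)             # sum of squares in the time domain
rhs = len(x) * np.sum(np.abs(J)**2)    # n * sum of squares in the frequency domain
print(lhs, rhs, np.isclose(lhs, rhs))
```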
Some Examples of Fourier Transforms
In the following discussion we calculate some specific Fourier transforms. First recall that we had earlier shown

$\sum_{t=0}^{n-1}\cos(\omega t) = \cos\Big(\frac{(n-1)\omega}{2}\Big)\frac{\sin(n\omega/2)}{\sin(\omega/2)}.$

In a similar way,

(3.1.21) $\quad \sum_{t=0}^{n-1}e^{i\omega t} = e^{i(n-1)\omega/2}\,\frac{\sin(n\omega/2)}{\sin(\omega/2)}.$

We may use (3.1.21) to calculate the Fourier transform in a number of interesting cases.

Cosine Wave: First, let

(3.1.22) $\quad X_t = R\cos(\omega_0 t + \phi) = R\Big[\frac{e^{i(\omega_0 t + \phi)} + e^{-i(\omega_0 t + \phi)}}{2}\Big].$

Using (3.1.22) in (3.1.4),

$J_j = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t} = \frac{R}{n}\sum_{t=0}^{n-1}\cos(\omega_0 t + \phi)e^{-i\omega_j t}.$

Using the Eulerian identity for cosine, we have

$J_j = \frac{R}{2n}\sum_{t=0}^{n-1}e^{i(\omega_0 t - \omega_j t + \phi)} + \frac{R}{2n}\sum_{t=0}^{n-1}e^{-i(\omega_0 t + \omega_j t + \phi)}.$

Collecting terms and simplifying,

$J_j = \frac{R}{2n}e^{i\phi}\sum_{t=0}^{n-1}e^{i(\omega_0 - \omega_j)t} + \frac{R}{2n}e^{-i\phi}\sum_{t=0}^{n-1}e^{-i(\omega_0 + \omega_j)t}.$

These are each geometric series and may, as before, be written in terms of the Dirichlet kernel. Thus we have

$J_j = \frac{R}{2n}e^{i\phi}e^{i(n-1)(\omega_0 - \omega_j)/2}\,\frac{\sin\big(n(\omega_0 - \omega_j)/2\big)}{\sin\big((\omega_0 - \omega_j)/2\big)} + \frac{R}{2n}e^{-i\phi}e^{-i(n-1)(\omega_0 + \omega_j)/2}\,\frac{\sin\big(n(\omega_0 + \omega_j)/2\big)}{\sin\big((\omega_0 + \omega_j)/2\big)},$

which may be rewritten as

$J_j = \frac{R}{2}e^{i\phi}e^{i(n-1)(\omega_0 - \omega_j)/2}\,D_n\Big(\frac{\omega_0 - \omega_j}{2}\Big) + \frac{R}{2}e^{-i\phi}e^{-i(n-1)(\omega_0 + \omega_j)/2}\,D_n\Big(\frac{\omega_0 + \omega_j}{2}\Big).$

Or finally,

(3.1.23) $\quad J_j = \frac{R}{2}\Big[e^{i\big(\frac{(n-1)(\omega_0 - \omega_j)}{2} + \phi\big)}\,D_n\Big(\frac{\omega_0 - \omega_j}{2}\Big) + e^{-i\big(\frac{(n-1)(\omega_0 + \omega_j)}{2} + \phi\big)}\,D_n\Big(\frac{\omega_0 + \omega_j}{2}\Big)\Big].$

This illustrates the leakage phenomenon. Leakage refers to the fact that when one takes a Fourier transform of a single sinusoid, i.e. all energy concentrated at a single frequency, the Fourier transform suggests that energy has leaked out to neighboring frequencies.
A Step Function: Let

(3.1.24) $\quad X_t = \begin{cases} 1, & 0 \le t < m \\ 0, & m \le t < n. \end{cases}$

In this case

$J_j = \frac{1}{n}\sum_{t=0}^{m-1}e^{-i\omega_j t} = \frac{1}{n}e^{-i(m-1)\omega_j/2}\,\frac{\sin(m\omega_j/2)}{\sin(\omega_j/2)}.$

Rewriting in terms of the Dirichlet kernel,

(3.1.25) $\quad J_j = \frac{m}{n}e^{-i(m-1)\omega_j/2}\,D_m(\omega_j/2).$

Thus

(3.1.26) $\quad \hat{R}^2(\omega_j) = 4J_jJ_j^* = 4\Big(\frac{m}{n}\Big)^2\big|D_m(\omega_j/2)\big|^2.$
The Linear Process: Now let

(3.1.27) $\quad X_t = t - \frac{n-1}{2}, \quad t = 0, \ldots, n-1 \quad\text{(chosen to give a mean of 0).}$

Let us first recall, for a geometric series,

(3.1.28) $\quad \sum_{t=0}^{N}a^t = \frac{1 - a^{N+1}}{1 - a}.$

Taking the derivative of (3.1.28) with respect to $a$,

$\sum_{t=0}^{N}t a^{t-1} = \frac{d}{da}\Big(\frac{1 - a^{N+1}}{1 - a}\Big) = \frac{(1-a)\big(-(N+1)a^N\big) + \big(1 - a^{N+1}\big)}{(1-a)^2}.$

Finally, replacing $N$ by $n-1$,

(3.1.29) $\quad \sum_{t=0}^{n-1}t a^{t-1} = \frac{\big(1 - a^n\big) - n(1-a)a^{n-1}}{(1-a)^2}.$

Thus

$J_j = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega_j t} = \frac{1}{n}\sum_{t=0}^{n-1}t e^{-i\omega_j t} - \frac{n-1}{2n}\sum_{t=0}^{n-1}e^{-i\omega_j t}.$

Writing $a = e^{-i\omega_j}$ and using (3.1.29) together with the Dirichlet form of the second sum,

(3.1.30) $\quad J_j = \frac{e^{-i\omega_j}\big[\big(1 - e^{-in\omega_j}\big) - n\big(1 - e^{-i\omega_j}\big)e^{-i(n-1)\omega_j}\big]}{n\big(1 - e^{-i\omega_j}\big)^2} - \frac{n-1}{2}e^{-i(n-1)\omega_j/2}D_n(\omega_j/2).$

At the Fourier frequencies $\omega_j = 2\pi j/n$, $j \ne 0$, we have $e^{-in\omega_j} = 1$ and $D_n(\omega_j/2) = \sin(\pi j)/\big(n\sin(\pi j/n)\big) = 0$, so this simplifies to $J_j = -1/\big(1 - e^{-i\omega_j}\big)$.

Other Properties of the Fourier Transform


Time Shift: Consider

(3.1.31) $\quad Y_t = X_{t+h}, \quad t = 0, \ldots, n-1.$

Then we may let

$J_{Y,j} = \frac{1}{n}\sum_{t}Y_t e^{-i\omega_j t} = \frac{1}{n}\sum_{t}X_{t+h}e^{-i\omega_j t}.$

Now we change the variable of summation by letting $u = t + h$, so that $t = u - h$, to obtain

(3.1.32) $\quad J_{Y,j} = \frac{1}{n}\sum_{u}X_u e^{-i\omega_j(u-h)} = \frac{e^{i\omega_j h}}{n}\sum_{u}X_u e^{-i\omega_j u} = e^{i\omega_j h}J_{X,j}.$

Symmetry: Let

(3.1.33) $\quad Y_t = X_{n-t}, \quad t = 0, \ldots, n.$

Thus we have that

$J_{Y,j} = \frac{1}{n+1}\sum_{t=0}^{n}Y_t e^{-i\omega_j t} = \frac{1}{n+1}\sum_{t=0}^{n}X_{n-t}e^{-i\omega_j t}.$

Again we change variables by letting $u = n - t$, so that $t = n - u$. We thus obtain

$J_{Y,j} = \frac{1}{n+1}\sum_{u=0}^{n}X_u e^{-i\omega_j(n-u)}.$

Recall that $\omega_j n = 2\pi j$, so that $e^{-i\omega_j n} = 1$ and

(3.1.34) $\quad J_{Y,j} = \frac{1}{n+1}\sum_{u=0}^{n}X_u e^{i\omega_j u}.$

If $X_t$ is a real-valued time series, we may take the complex conjugate of (3.1.34) to obtain

(3.1.35) $\quad J_{Y,j}^* = \frac{1}{n+1}\sum_{u=0}^{n}X_u^*\big(e^{i\omega_j u}\big)^* = \frac{1}{n+1}\sum_{u=0}^{n}X_u e^{-i\omega_j u} = J_{X,j}.$

Roughness Measures:

Let us finally consider

(3.1.36) $\quad \sum_{t}\big(X_t - X_{t-1}\big)^2$

as a measure of roughness of the process. This is essentially the variation of the process. Now,

$X_t - X_{t-1} = \sum_{j=0}^{n-1}J_j\big(e^{i\omega_j t} - e^{i\omega_j(t-1)}\big) = \sum_{j=0}^{n-1}J_j e^{i\omega_j t}\big(1 - e^{-i\omega_j}\big).$

Thus, we may write

(3.1.37) $\quad \sum_{t}\big(X_t - X_{t-1}\big)^2 = \sum_{j=0}^{n-1}\sum_{k=0}^{n-1}J_j J_k^*\big(1 - e^{-i\omega_j}\big)\big(1 - e^{i\omega_k}\big)\sum_{t}e^{i(\omega_j - \omega_k)t},$

which by (3.1.3) yields

(3.1.38) $\quad \sum_{t}\big(X_t - X_{t-1}\big)^2 = n\sum_{j=0}^{n-1}|J_j|^2\,\big|1 - e^{i\omega_j}\big|^2.$

But

$\big(1 - e^{i\omega_j}\big)\big(1 - e^{-i\omega_j}\big) = 2 - \big(e^{i\omega_j} + e^{-i\omega_j}\big) = 2 - 2\cos(\omega_j) = 2\big(1 - \cos(\omega_j)\big).$

Thus

$\sum_{t}\big(X_t - X_{t-1}\big)^2 = 2n\sum_{j=0}^{n-1}|J_j|^2\big(1 - \cos(\omega_j)\big).$

But by elementary trigonometry, $1 - \cos(\omega_j) = 2\sin^2(\omega_j/2)$, so that

(3.1.39) $\quad \sum_{t}\big(X_t - X_{t-1}\big)^2 = 4n\sum_{j=0}^{n-1}|J_j|^2\sin^2(\omega_j/2).$

We may use Parseval's relation, $\sum_t X_t^2 = n\sum_j|J_j|^2$, to scale equation (3.1.39) to obtain

(3.1.40) $\quad \frac{\sum_{t}\big(X_t - X_{t-1}\big)^2}{\sum_{t}X_t^2} = \frac{4\sum_{j=0}^{n-1}\sin^2(\omega_j/2)\,|J_j|^2}{\sum_{j=0}^{n-1}|J_j|^2}.$

This equation may be used as a measure of frequency content. If low frequencies predominate, $X_t$ is close to $X_{t-1}$ and the ratio is small. If high frequencies are plentiful, the ratio will be large.
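A small sketch of the roughness ratio (3.1.40): it compares the first-difference sum of squares with the weighted frequency-domain form, for a smooth and a rough illustrative series. Differences are taken circularly so that the DFT identity holds exactly; that wrap-around is an assumption of the sketch.

```python
import numpy as np

def roughness_ratio_time(x):
    """sum (X_t - X_{t-1})^2 / sum X_t^2, with a circular first difference."""
    d = x - np.roll(x, 1)
    return np.sum(d**2) / np.sum(x**2)

def roughness_ratio_freq(x):
    """Frequency-domain form (3.1.40): 4*sum sin^2(w_j/2)|J_j|^2 / sum |J_j|^2."""
    n = len(x)
    J = np.fft.fft(x) / n
    w = 2 * np.pi * np.arange(n) / n
    p = np.abs(J)**2
    return 4 * np.sum(np.sin(w / 2)**2 * p) / np.sum(p)

t = np.arange(200)
smooth = np.cos(2 * np.pi * 3 * t / 200)    # low-frequency series
rough = np.cos(2 * np.pi * 90 * t / 200)    # high-frequency series
for x in (smooth, rough):
    print(roughness_ratio_time(x), roughness_ratio_freq(x))
```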

Chapter 4
Fast Fourier Transforms

Let us return to our consideration of the discrete Fourier transform (DFT) of $X_t$. Let $\omega_j = 2\pi(j/n)$ as before for $j = 0, 1, \ldots, n-1$. Recall that

(4.1.1) $\quad nJ_j = \sum_{t=0}^{n-1}X_t e^{-i\omega_j t}, \quad j = 0, 1, \ldots, n-1.$

In this computation, there are $n-1$ (complex) multiplies (one for each term with $t = 1, \ldots, n-1$; the $t = 0$ term has $e^0 = 1$ so that no multiply is involved). There are also $n-1$ additions (adding $n$ terms means $n-1$ additions). Since there are a total of $n$ equations, to compute the full Fourier transform there are $n(n-1)$ complex additions and $n(n-1)$ complex multiplies.

Now if we notice that $e^{-i\omega_j t} = e^{-2\pi ijt/n}$, $t$ and $j = 0, 1, \ldots, n-1$, and observe that $e^{-2\pi ik} = 1$ for all integer $k$, there are only $n$ distinct values of $e^{-i\omega_j t}$. The idea of the fast Fourier transform (FFT) is basically to collect the coefficients of the $n$ distinct $e^{-i\omega_j t}$ depending on $jt \bmod(n)$ and do only $n$ multiplications. The simplest case is the case that $n$ factors, say $n = n_1\cdot n_2$. If $n$ is prime there is no FFT. The generalized situation is $n = n_1\cdot n_2\cdots n_p$. The most useful case is the special case that $n = 2^k$, i.e. all factors are 2. They need not be, however; the factors only need be prime. (Actually, primes are required for the most efficient implementation. The FFT can be done with non-prime factors, but a more efficient prime-based FFT can always be done.) Let us illustrate with the case that $n = n_1\cdot n_2$.

Write $t = t_1 n_2 + t_2$, $0 \le t_1 < n_1$, $0 \le t_2 < n_2$, to generate a rectangular array, e.g. $n = 55 = 5\times 11$:

$\begin{array}{cccc} X_0 & X_1 & \cdots & X_{n_2-1} \\ X_{n_2} & X_{n_2+1} & \cdots & X_{2n_2-1} \\ \vdots & \vdots & & \vdots \\ X_{(n_1-1)n_2} & X_{(n_1-1)n_2+1} & \cdots & X_{n-1} \end{array}$

or, in general,

$nJ_j = \sum_{t=0}^{n-1}X_t e^{-i\omega_j t}.$

Let $Y(t_1, t_2) = X_{t_1 n_2 + t_2}$ and let $j = j_1 n_1 + j_2$, $0 \le j_1 < n_2$, $0 \le j_2 < n_1$. For integer $\ell$, let $w^\ell = e^{-2\pi i\ell/n}$, i.e. $w = e^{-2\pi i/n}$, so that $e^{-i\omega_j t} = w^{jt}$. Then

(4.1.2) $\quad nJ_j = \sum_{t_1=0}^{n_1-1}\sum_{t_2=0}^{n_2-1}X_{t_1 n_2 + t_2}\,w^{(j_1 n_1 + j_2)(t_1 n_2 + t_2)},$

or equivalently

(4.1.3) $\quad nJ_j = \sum_{t_1=0}^{n_1-1}\sum_{t_2=0}^{n_2-1}Y(t_1, t_2)\,w^{j_1 t_1 n_1 n_2 + j_2 t_1 n_2 + j_1 t_2 n_1 + j_2 t_2}.$

But $w^{j_1 t_1 n_1 n_2} = \big(e^{-2\pi i}\big)^{j_1 t_1} = 1$. Thus (4.1.3) simplifies to

(4.1.4) $\quad nJ_j = \sum_{t_1=0}^{n_1-1}\sum_{t_2=0}^{n_2-1}Y(t_1, t_2)\,w^{j_2 t_1 n_2}\,w^{j_1 t_2 n_1}\,w^{j_2 t_2}.$

Reversing the order of summation in (4.1.4),

(4.1.5) $\quad nJ_j = \sum_{t_2=0}^{n_2-1}w^{j_1 t_2 n_1}\,w^{j_2 t_2}\sum_{t_1=0}^{n_1-1}Y(t_1, t_2)\,w^{j_2 t_1 n_2}.$

Consider the last summation in (4.1.5),

(4.1.6) $\quad Z(j_2, t_2) = \sum_{t_1=0}^{n_1-1}Y(t_1, t_2)\,w^{j_2 t_1 n_2}.$

This is a column-by-column transform of the original data. In this transform, there are $n_1 - 1$ multiplies and $n_1 - 1$ adds. Now

(4.1.7) $\quad nJ_j = \sum_{t_2=0}^{n_2-1}\big[w^{j_2 t_2}Z(j_2, t_2)\big]\,w^{j_1 t_2 n_1}.$

For a fixed $j$ there are $n_1 - 1$ adds in equation (4.1.6) and $n_2 - 1$ adds in equation (4.1.7). Thus there are $n_1 + n_2 - 2$ adds for each $j$ and a total of $n(n_1 + n_2 - 2) = n_1 n_2(n_1 + n_2 - 2)$ adds. For multiplies, we have $n_1 - 1$ multiplies in (4.1.6). In addition, in (4.1.7) we have $n_2 - 1$ multiplies for a fixed $j$, so that we have $n_1 + n_2 - 2$ for each fixed $j$. Since we have $n$ values of $j$, we have $n(n_1 + n_2 - 2)$ multiplies. The multiplication by the twiddle factor $w^{j_2 t_2}$ requires $n$ additional multiplies ($j_2 = 0, \ldots, n_1-1$; $t_2 = 0, \ldots, n_2-1$). Thus the total number of multiplies for the factored form is

$n(n_1 + n_2 - 2) + n = n_1 n_2(n_1 + n_2 - 1).$

Example: Let $n = 50 = 5\times 10$ and consider the full transform DFT. Here $50(50-1) = 50\times 49 = 2450$, so that we have 2450 adds and 2450 multiplies. For the FFT, we have $50(15-2) = 50\times 13 = 650$ adds and $50(15-1) = 50\times 14 = 700$ multiplies. This gives an FFT/DFT ratio of $650/2450 = .2653$ for adds and $700/2450 = .2857$ for multiplies.

Example: Consider this time $n = 49 = 7\times 7$ in the full DFT. Here $49(49-1) = 49\times 48 = 2352$. Thus we have 2352 adds and 2352 multiplies. Correspondingly, for the FFT we have $49(7+7-2) = 49\times 12 = 588$ adds and $49(14-1) = 49\times 13 = 637$ multiplies. Thus the FFT/DFT ratio is $588/2352 = .25$ for adds and $637/2352 = .2708$ for multiplies. We minimize these ratios when $n$, the number of observations, is fully factored.

As we indicated before, if $n = n_1 n_2\cdots n_k$, we may arrange the observations over a $k$-dimensional table. In this case the number of adds (and multiplies) is $n_1 n_2\cdots n_k\big(\sum_{i=1}^{k}n_i - c\big)$, where $c$ depends on exact details. But, in any case, $n_1 n_2\cdots n_k\big(\sum_{i=1}^{k}n_i\big)$ is an upper bound on the number of adds and the number of multiplies. Thus we may consider further examples.

Example: Now let $n = 50 = 5\times 5\times 2$ so that $n$ is fully factored. Then $50(12)$ is an upper bound, and actually in this case $50(5+5+2-3)$ is the full correct computation for the number of adds. Notice that $50\times 12 = 600$, but $50(9) = 450$. Thus the prime factorization results in a savings of 200 adds compared with the 650 adds of the $5\times 10$ factorization, and an efficiency ratio of $450/2450 = .1837$.

Frequently, we consider $n = 2^k$, i.e. $\log_2 n = k$. Then an upper bound is $n(n_1 + \cdots + n_k) = n(2 + \cdots + 2) = 2kn = 2n\log_2 n$ for the FFT, which is compared to $n^2$ for the standard DFT. The FFT/DFT efficiency ratio is thus $(2\log_2 n)/n$ when the sample size is a power of two.

Example: Consider now $n = 64$ so that $k = 6$ and the FFT/DFT efficiency ratio is $12/64 = .1875$. Thus an FFT for 64 terms would take $2\times 64\times 6 = 768$ adds and multiplies. Recall for $n = 50 = 10\times 5$, we had 700 multiplies and 650 adds. One can conclude from this example that not much is lost in absolute operation count by using an FFT with excess terms, since we have essentially the same operation counts for an FFT of 64 terms and a factored transform of only 50 terms.
Example: Consider an even more extreme example with $n = 1024$, so that $k = 10$, making the FFT/DFT efficiency ratio $20/1024 \approx .019$, i.e. $20\times 1024 = 20{,}480$ compared to $1024(1023) = 1{,}047{,}552$. Suppose now we have $X_0, \ldots, X_{n-1}$ with $n < 2^k = n'$. Let $X_j' = X_j$, $j = 0, 1, \ldots, n-1$ and let $X_j' = 0$, $j = n, \ldots, n'-1$, so that we fill the data vector out with zeros. Then

$J_j' = \frac{1}{n'}\sum_{t=0}^{n'-1}X_t'\exp(-i\omega_j' t) = \frac{1}{n'}\sum_{t=0}^{n-1}X_t\exp(-i\omega_j' t) = \frac{n}{n'}\cdot\frac{1}{n}\sum_{t=0}^{n-1}X_t\exp(-i\omega_j' t) = \frac{n}{n'}J(\omega_j').$

Here we let $\omega_j' = 2\pi j/n'$. Thus $J(\omega_j') = \frac{n'}{n}J_j'$. The FFT of the extended series is the Fourier transform of the original series evaluated at $\omega_j'$ instead of $\omega_j$. Of course, $J_j = J(\omega_j) \ne J(\omega_j')$, but since $n' > n$, $\omega_j' = 2\pi j/n' < 2\pi j/n = \omega_j$. Thus we evaluate $J(\omega)$ on a denser grid when we fill out the FFT with zeros.

Example (Worst case): Let $n = 1025$, so that $n' = 1024$ is too small. Extend the sample size to $n' = 2048$ with 1023 zeros filled in. Then the number of FFT multiplies for $n' = 2048 = 2^{11}$ is $2\times 11\times 2048 = 45{,}056$. The number of FFT multiplies for $n = 1024 = 2^{10}$ is $2\times 10\times 1024 = 20{,}480$. However, the number of DFT multiplies for $n = 1025$ is $1025\times 1024 = 1{,}049{,}600$, so that even in the worst case scenario the FFT is dramatically better than the DFT. Consider for a moment the corresponding evaluation grids. For the $n' = 2048$ FFT case, $\omega_j' = 2\pi j/2048$, $j = 0, \ldots, 2047$, so that the evaluation points are $0, .00307, .00614, .00921, .01227, \ldots$. The evaluation grid for the $n = 1025$ DFT case is computed based on $\omega_j = 2\pi j/1025$, $j = 0, \ldots, 1024$. This gives a grid of evaluation points $0, .00613, .01226, \ldots$, which is roughly twice the spacing of the more desirable FFT grid.
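A short sketch of the zero-padding idea: padding the series to length $n'$ evaluates the same $J(\omega)$ on a denser grid, up to the $n/n'$ rescaling noted above. The data and lengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_pad = 50, 64
x = rng.normal(size=n)

J_n = np.fft.fft(x) / n                          # J at omega_j = 2*pi*j/n
x_padded = np.concatenate([x, np.zeros(n_pad - n)])
J_pad = np.fft.fft(x_padded) / n_pad             # J' at omega'_j = 2*pi*j/n_pad

# J(omega'_j) = (n_pad/n) * J'_j : recompute J(omega) directly to confirm.
omega1 = 2 * np.pi * 1 / n_pad
J_direct = np.sum(x * np.exp(-1j * omega1 * np.arange(n))) / n
print(np.allclose(J_direct, (n_pad / n) * J_pad[1]))   # True
```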

Chapter 5
Leakage and Aliasing
As we have already seen, harmonic analysis can lead to substantial insight into the frequency structure
of a time series. Harmonic methods based on Fourier transforms are not, however, universal panaceas for
understanding the frequency content of a time series. Problems exist with so-called leakage and aliasing and
with lack of controllability of the variance of raw Fourier transforms. In this chapter we shall look into these
problems in somewhat more detail.
5.1 Leakage
Leakage refers to the fact that if we take the Fourier transform of a pure sinusoidal function, we will get non-zero amplitude for nearby frequencies. To see this, let $X_t = \exp(i\omega_0 t)$, the complex sinusoid with unit amplitude and zero phase. Then

(5.1.1) $\quad J(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t},$

so that

(5.1.2) $\quad J(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}e^{i(\omega_0 - \omega)t}.$

Using results from Chapter 3,

(5.1.3) $\quad J(\omega) = e^{i(n-1)(\omega_0 - \omega)/2}\,\frac{\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2\big)}.$

Taking the square modulus of the expression in (5.1.3), we have

(5.1.4) $\quad R^2(\omega) = J(\omega)J(\omega)^* = \Big[\frac{\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2\big)}\Big]^2.$

It is easy to see from (5.1.4) that there is spectral energy near $\omega_0$. $R^2(\omega)$ is plotted in Figure ****. Let us further consider $K$ defined by

(5.1.5) $\quad K(\omega) = J(\omega)\,e^{-i(n-1)(\omega_0 - \omega)/2},$

which may be rewritten

(5.1.6) $\quad K(\omega) = \frac{\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2\big)}.$

We may shift the argument of $K$ by $2\pi/n$ to obtain

(5.1.7) $\quad K\Big(\omega - \frac{2\pi}{n}\Big) = \frac{\sin\Big(n\big[\omega_0 - \big(\omega - \frac{2\pi}{n}\big)\big]/2\Big)}{n\sin\Big(\big[\omega_0 - \big(\omega - \frac{2\pi}{n}\big)\big]/2\Big)}.$

We may simplify to obtain

(5.1.8) $\quad K\Big(\omega - \frac{2\pi}{n}\Big) = \frac{\sin\big(n(\omega_0 - \omega)/2 + \pi\big)}{n\sin\big((\omega_0 - \omega)/2 + \pi/n\big)} = \frac{-\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2 + \pi/n\big)},$

so that approximately

(5.1.9) $\quad K\Big(\omega - \frac{2\pi}{n}\Big) \approx \frac{-\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2\big)}.$

Similarly, we can show that

(5.1.10) $\quad K\Big(\omega + \frac{2\pi}{n}\Big) \approx \frac{-\sin\big(n(\omega_0 - \omega)/2\big)}{n\sin\big((\omega_0 - \omega)/2\big)}.$

Let us define $\tilde{K}(\omega)$ by

(5.1.11) $\quad \tilde{K}(\omega) = \tfrac{1}{4}K\Big(\omega - \frac{2\pi}{n}\Big) + \tfrac{1}{2}K(\omega) + \tfrac{1}{4}K\Big(\omega + \frac{2\pi}{n}\Big).$

Notice that for $\omega$ away from $\omega_0$, $\tilde{K}(\omega) \approx 0$ for $n$ sufficiently large for the Dirichlet approximation to hold. In reality we are interested in $J(\omega)$, not $K(\omega)$, so we transform back by taking $\tilde{J}$ to be defined by

(5.1.12) $\quad \tilde{K}(\omega) = \tilde{J}(\omega)\,e^{-i(n-1)(\omega_0 - \omega)/2},$

or equivalently

(5.1.13) $\quad \tilde{J}(\omega) = \tilde{K}(\omega)\,e^{i(n-1)(\omega_0 - \omega)/2}.$

This weighted sum is called the Hanning window, and by it we introduce the idea that we can reduce leakage by taking weighted sums. A little algebraic manipulation will lend some further insight. First, since (5.1.11) gives us

$\tilde{K}(\omega) = \tfrac{1}{4}K\Big(\omega - \frac{2\pi}{n}\Big) + \tfrac{1}{2}K(\omega) + \tfrac{1}{4}K\Big(\omega + \frac{2\pi}{n}\Big),$

we have

$\tilde{K}(\omega) = \tfrac{1}{4}J\Big(\omega - \frac{2\pi}{n}\Big)e^{-i(n-1)(\omega_0 - \omega + 2\pi/n)/2} + \tfrac{1}{2}J(\omega)e^{-i(n-1)(\omega_0 - \omega)/2} + \tfrac{1}{4}J\Big(\omega + \frac{2\pi}{n}\Big)e^{-i(n-1)(\omega_0 - \omega - 2\pi/n)/2}.$

Substituting for the $J(\cdot)$, we have

(5.1.14) $\quad \tilde{K}(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t}\,e^{-i(n-1)(\omega_0 - \omega)/2}\Big[\tfrac{1}{2} + \tfrac{1}{4}e^{i\frac{2\pi}{n}\big(t - \frac{n-1}{2}\big)} + \tfrac{1}{4}e^{-i\frac{2\pi}{n}\big(t - \frac{n-1}{2}\big)}\Big].$

Simplifying, we have

(5.1.15) $\quad \tilde{K}(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t e^{-i\omega t}\,e^{-i(n-1)(\omega_0 - \omega)/2}\Big[\tfrac{1}{2} + \tfrac{1}{2}\cos\Big(\frac{2\pi}{n}\Big(t - \frac{n-1}{2}\Big)\Big)\Big].$

Since $\cos\big(\frac{2\pi}{n}(t - \frac{n-1}{2})\big) = -\cos\big(2\pi(t + \frac{1}{2})/n\big)$, we may write

(5.1.16) $\quad \tilde{J}(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}X_t\Big[\frac{1 - \cos\big(2\pi(t + \frac{1}{2})/n\big)}{2}\Big]e^{-i\omega t}.$

We may define a weighted data set by

(5.1.17) $\quad Y_t = X_t\Big[\frac{1 - \cos\big(2\pi(t + \frac{1}{2})/n\big)}{2}\Big].$

Thus we are led to the idea of weighting the data to reduce this bias introduced by leakage. We have seen that we can do this in the frequency domain by taking weighted sums of Fourier transforms,

$\tilde{K}(\omega) = \tfrac{1}{4}K\Big(\omega - \frac{2\pi}{n}\Big) + \tfrac{1}{2}K(\omega) + \tfrac{1}{4}K\Big(\omega + \frac{2\pi}{n}\Big),$

or in the time domain by weighting the data,

$Y_t = X_t\,\frac{1 - \cos\big(2\pi(t + \frac{1}{2})/n\big)}{2}.$

The latter is called the data window; the former is usually called the spectral window. We shall investigate windowing in much more detail later on.
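A brief sketch comparing the squared modulus of the raw DFT with that of the Hann-weighted data (5.1.17), for a sinusoid whose frequency falls between Fourier frequencies; it illustrates how the data window suppresses leakage into distant frequencies. The parameters and the range of "distant" indices are illustrative assumptions.

```python
import numpy as np

n = 128
t = np.arange(n)
omega0 = 2 * np.pi * (20.5 / n)           # deliberately off the Fourier grid
x = np.cos(omega0 * t)

hann = 0.5 * (1 - np.cos(2 * np.pi * (t + 0.5) / n))   # data window (5.1.17)
J_raw = np.fft.fft(x) / n
J_win = np.fft.fft(x * hann) / n

# Energy far away from the signal frequency (indices 40..60, illustrative):
far = slice(40, 61)
print("raw leakage energy:     ", np.sum(np.abs(J_raw[far])**2))
print("windowed leakage energy:", np.sum(np.abs(J_win[far])**2))
```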
5.2 Convolutions
One interesting little sidelight to note concerns convolutions. Let $Z_t$ be defined by

(5.2.1) $\quad Z_t = \sum_{u=0}^{n-1}X_u Y_{t-u}, \quad t = 0, 1, \ldots, n-1.$

Then,

(5.2.2) $\quad J_Z(\omega) = \sum_{t=0}^{n-1}Z_t e^{-i\omega t}.$

Writing out $Z_t$, we have

(5.2.3) $\quad J_Z(\omega) = \sum_{t=0}^{n-1}\sum_{u=0}^{n-1}X_u Y_{t-u}e^{-i\omega t}.$

Interchanging the order of summation,

(5.2.4) $\quad J_Z(\omega) = \sum_{u=0}^{n-1}X_u\sum_{t=0}^{n-1}Y_{t-u}e^{-i\omega t}.$

Let $v = t - u$, so that $v + u = t$. Then we have

(5.2.5) $\quad J_Z(\omega) = \sum_{u=0}^{n-1}X_u\sum_{v}Y_v e^{-i\omega v}e^{-i\omega u}.$

We can separate variables to obtain

(5.2.6) $\quad J_Z(\omega) = \sum_{u=0}^{n-1}X_u e^{-i\omega u}\sum_{v}Y_v e^{-i\omega v} = J_X(\omega)\cdot J_Y(\omega).$

Thus the Fourier transform of a convolution of two time series is the product of the Fourier transforms of the two time series and, of course, vice versa. The leakage phenomenon can be understood more easily in terms of convolutions. Leakage really arises because an infinite set of data, namely $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$, is multiplied by a finite function, to be precise the function

(5.2.7) $\quad I_{[0,n-1]}(j) = \begin{cases} 1 & \text{if } j \in [0, n-1] \\ 0 & \text{if } j \notin [0, n-1]. \end{cases}$

The calculation we have just performed shows that we can reduce leakage bias by replacing $I_{[0,n-1]}(j)$ with a more tapered window, say

(5.2.8) $\quad H(j) = \frac{1 - \cos\big(2\pi(j + \frac{1}{2})/n\big)}{2}.$

Of course, this is not the only window we could use, but it is a computationally simple one. We shall use this smoothing window idea again in later discussions.
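A quick numerical check of the convolution property (5.2.6): for a circular convolution, the unnormalized DFT of the convolution equals the product of the unnormalized DFTs. The circular wrap-around is an assumption that makes the identity exact for finite series.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 64
x = rng.normal(size=n)
y = rng.normal(size=n)

# Circular convolution Z_t = sum_u X_u Y_{(t-u) mod n}.
z = np.array([np.sum(x * np.roll(y[::-1], t + 1)) for t in range(n)])

lhs = np.fft.fft(z)                  # unnormalized DFT of the convolution
rhs = np.fft.fft(x) * np.fft.fft(y)  # product of the unnormalized DFTs
print(np.allclose(lhs, rhs))
```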

5.3 Aliasing
We have just seen that one problem in dealing with DFTs involves frequency smearing. That is, the DFT of a single frequency signal yields a frequency spectrum which spreads some energy of the single frequency signal over frequencies adjacent to that of the signal. Another interesting phenomenon involves aliasing. An alias is another name for something. In the frequency analysis of a continuous time signal we shall see that potentially many frequencies of a signal will appear in the same place, i.e. many frequencies are labeled with the same frequency designation. This is the origin of the name, alias.

Let us begin by considering a single frequency process

(5.3.1) $\quad X_t = \cos(\omega t)$

which is sampled at time increments $\Delta t$, so that $X_j = \cos(\omega j\Delta t)$. As we vary $\omega$ toward zero the cosine wave becomes lower in frequency and the wavelength, $L$, increases. However, as $\omega$ gets larger, specifically at $\omega = \pi/\Delta t$,

(5.3.2) $\quad X_j = \cos\Big(\frac{\pi}{\Delta t}j\Delta t\Big) = \cos(\pi j) = \begin{cases} 1 & \text{when } j \text{ is even} \\ -1 & \text{when } j \text{ is odd,} \end{cases}$

which is the most rapid oscillation a sampled series can exhibit. For $\pi/\Delta t \le \omega \le 2\pi/\Delta t$, let

(5.3.3) $\quad \omega' = \frac{2\pi}{\Delta t} - \omega, \quad\text{so that}\quad \omega = \frac{2\pi}{\Delta t} - \omega'.$

Substituting for $\omega$,

(5.3.4) $\quad X_j = \cos\Big(\Big[\frac{2\pi}{\Delta t} - \omega'\Big]j\Delta t\Big),$

which may be simplified to

(5.3.5) $\quad X_j = \cos\big(2\pi j - \omega'j\Delta t\big) = \cos\big(-\omega'j\Delta t\big) = \cos\big(\omega'j\Delta t\big).$

That is, based on this sampling rate, we can't distinguish between $\omega$ and $\omega'$. Next consider $2\pi/\Delta t \le \omega \le 3\pi/\Delta t$, and let $\omega'' = \omega - 2\pi/\Delta t$, so that $\omega = \omega'' + 2\pi/\Delta t$. By a similar computation,

$X_j = \cos(\omega j\Delta t) = \cos\Big(\Big[\omega'' + \frac{2\pi}{\Delta t}\Big]j\Delta t\Big) = \cos\big(\omega''j\Delta t + 2\pi j\big) = \cos\big(\omega''j\Delta t\big).$

Graphically, frequencies are folded back upon each other. See Figures ****. Because of this folding pattern (and, of course, because Nyquist first articulated this idea), the frequency $\pi/\Delta t$ is called the Nyquist folding frequency. For a sampled time series, frequencies higher than the Nyquist frequency are indistinguishable from (alias as) lower frequencies. This process is called aliasing.
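A small numerical illustration of aliasing: two cosines whose frequencies are mirror images about the Nyquist frequency $\pi/\Delta t$ produce identical samples. The specific frequencies and sampling interval are illustrative assumptions.

```python
import numpy as np

dt = 1.0                      # sampling interval
j = np.arange(50)
nyquist = np.pi / dt

omega = 1.8 * nyquist                 # above the Nyquist frequency
omega_alias = 2 * np.pi / dt - omega  # its alias, 0.2 * nyquist

x_high = np.cos(omega * j * dt)
x_low = np.cos(omega_alias * j * dt)
print(np.allclose(x_high, x_low))     # True: the sampled series are identical
```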

Chapter 6
Random Process Models
6.1 Introduction to Random Processes
To this point we have said nothing of the statistical structure of the process $X_t$, especially with respect to the DFT of $X_t$. The DFT is of course itself a random process if $X_t$ is a random process. The implication is that a different realization of $X_t$ yields a different realization of $J(\omega)$. In order to make the DFT a statistically meaningful diagnostic tool, we must investigate the stochastic structure of $X_t$ a bit more thoroughly. Basically, the use of the DFT straight away on the data implies we believe $X_t$ is essentially deterministic.

We now want to regard $X_t$ as a stochastic process. That is to say, we let $\Omega$ be a sample space and $P$ a probability on that sample space satisfying the axioms given below.

Axiom 1. For $A \subset \Omega$, $0 \le P(A) \le 1$.

Axiom 2. For $A_1, A_2, \ldots \subset \Omega$, $A_i$ disjoint, $\sum_i P(A_i) = P\big(\bigcup_i A_i\big)$.

Axiom 3. $P(\Omega) = 1$.
In this section it is often convenient to regard a random process, $X$, as complex valued rather than real valued. Of course, since all real-valued processes are trivially complex-valued, this is no real loss of generality. We shall define $X_t(\omega) = X(t, \omega)$ to be a random process if

$X(t, \cdot): \Omega \to \mathbb{R}$, the real numbers (or $\mathbb{C}$, the complex numbers).

That is to say, for $t$ fixed, $X(t, \cdot)$ is a mapping of the sample space into the real line (the complex plane). For each fixed $t$, $X(t, \cdot)$ is an ordinary random variable. Usually we write $X_t(\cdot)$ for $X(t, \cdot)$. It is common to suppress the explicit dependence on $\omega$, although, of course, it is always there implicitly.

The variable, $t$, is usually thought of as belonging to an index set, $\mathcal{T}$, and usually is thought of as time.
As we had earlier discussed in Chapter 1,
(6.1.1) $\quad F_{X_{t_1},\ldots,X_{t_n}}(u_1, \ldots, u_n) = P(X_{t_1} \le u_1, \ldots, X_{t_n} \le u_n)$

is the finite dimensional distribution. $X_t$ is strictly stationary if

(6.1.2) $\quad F_{X_{t_1},\ldots,X_{t_n}}(u_1, \ldots, u_n) = F_{X_{t_1+\tau},\ldots,X_{t_n+\tau}}(u_1, \ldots, u_n)$

for every $n$, $t_1, \ldots, t_n$, $u_1, \ldots, u_n$ and $\tau$. Again recall that if $\mu_t \equiv \mu$, a constant (usually 0), and $\operatorname{cov}(X_t, X_{t+\tau})$ depends only on $\tau$, then we say $X_t$ is weakly stationary.

Example: Let $X_t = \sum_{n}\{A_n\cos(\omega_n t) + B_n\sin(\omega_n t)\}$. Here the $A_n$ and $B_n$ are independent normal random variables with mean 0 and variance $\sigma_n^2$, written as $N(0, \sigma_n^2)$. Taking expectations we have

$E(X_t) = \sum_{n}\{E(A_n)\cos(\omega_n t) + E(B_n)\sin(\omega_n t)\} = 0.$

Consider now the product term and take the expectation to get

$E\{X_{t_1}X_{t_2}\} = E\Big[\sum_{n}\sum_{m}\big(A_n\cos(\omega_n t_1) + B_n\sin(\omega_n t_1)\big)\big(A_m\cos(\omega_m t_2) + B_m\sin(\omega_m t_2)\big)\Big].$

Expanding, and using the independence of the coefficients, we obtain

$E\{X_{t_1}X_{t_2}\} = \sum_{n}\big\{E(A_n^2)\cos(\omega_n t_1)\cos(\omega_n t_2) + E(B_n^2)\sin(\omega_n t_1)\sin(\omega_n t_2)\big\},$

so that

$E\{X_{t_1}X_{t_2}\} = \sum_{n}\sigma_n^2\big[\cos(\omega_n t_1)\cos(\omega_n t_2) + \sin(\omega_n t_1)\sin(\omega_n t_2)\big].$

From our previous trigonometric identities,

$E\{X_{t_1}X_{t_2}\} = \sum_{n}\sigma_n^2\cos\big(\omega_n(t_1 - t_2)\big).$

This process is thus weakly stationary. Also, since a linear combination of normal random variables is normal, the process is also normal and hence it is also strictly stationary.
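A short simulation sketch of the example above: many independent realizations of $X_t = \sum_n A_n\cos(\omega_n t) + B_n\sin(\omega_n t)$ are generated and the Monte Carlo value of $E\{X_{t_1}X_{t_2}\}$ is compared with $\sum_n\sigma_n^2\cos(\omega_n(t_1 - t_2))$. The frequencies, variances, and time points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
omegas = np.array([0.3, 0.7, 1.4])      # fixed frequencies (illustrative)
sigmas = np.array([1.0, 0.5, 0.25])     # standard deviations of A_n, B_n

def realize(t, size):
    """Draw `size` independent realizations of X_t at the time points t."""
    A = rng.normal(0, sigmas, size=(size, len(omegas)))
    B = rng.normal(0, sigmas, size=(size, len(omegas)))
    return A @ np.cos(np.outer(omegas, t)) + B @ np.sin(np.outer(omegas, t))

t1, t2 = 2.0, 5.0
x = realize(np.array([t1, t2]), size=200_000)
mc_cov = np.mean(x[:, 0] * x[:, 1])                       # Monte Carlo E{X_t1 X_t2}
theory = np.sum(sigmas**2 * np.cos(omegas * (t1 - t2)))   # sum sigma_n^2 cos(omega_n (t1-t2))
print(mc_cov, theory)
```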
In the discussion which follows, we shall assume that the reader is familiar with Riemann-Stieltjes
integration. There is a discussion of Riemann-Stieltjes integration in Appendix A.
A bivariate function $\gamma(s, t)$ is non-negative definite if for any set of time points $t_1, \ldots, t_n$ and any set of complex numbers $z_1, \ldots, z_n$,

(6.1.3) $\quad \sum_{j=1}^{n}\sum_{k=1}^{n}\gamma(t_j, t_k)\,z_j z_k^* \ge 0.$

If the inequality is strict, the function is said to be positive definite. In general we will use the $*$ to indicate the complex conjugate. That is, if $z$ is a complex number given by $a + bi$, then $z^*$ is $a - bi$. Notice that if $\gamma(s, t) = E(X_s X_t^*)$ is the covariance function of a zero mean process, then

(6.1.4) $\quad \sum_{j=1}^{n}\sum_{k=1}^{n}\gamma(t_j, t_k)\,z_j z_k^* = \sum_{j=1}^{n}\sum_{k=1}^{n}E\big(X_{t_j}X_{t_k}^*\big)z_j z_k^* = E\Big[\sum_{j=1}^{n}X_{t_j}z_j\sum_{k=1}^{n}X_{t_k}^* z_k^*\Big].$

Rewriting (6.1.4) as a modulus squared, we have

(6.1.5) $\quad \sum_{j=1}^{n}\sum_{k=1}^{n}\gamma(t_j, t_k)\,z_j z_k^* = E\Big|\sum_{j=1}^{n}X_{t_j}z_j\Big|^2 \ge 0.$

Thus the covariance function of a zero-mean stochastic process is a non-negative definite function. A trivial modification shows, in fact, that every covariance function is non-negative definite. The converse is also true, i.e. if $\gamma$ is non-negative definite, then there is a stochastic process for which it is a covariance function.
If we assume $X_t$ is complex valued and stationary, then without any loss of generality we may assume $E(X_t) = 0$. Notice in this case, as we did in Chapter 1, we have

(6.1.6) $\quad \gamma(s, t) = E(X_s X_t^*) = E(X_{s+\tau}X_{t+\tau}^*).$

Now let $\tau = -t$, so that $\gamma(s, t) = E(X_{s-t}X_0^*) = \gamma(s - t, 0)$. It is customary to write $\gamma_t = \gamma(t, 0)$. Thus we may define a one-dimensional function $\gamma_t$ to be non-negative definite if and only if $\gamma(t, t') = \gamma_{t-t'}$ is non-negative definite. Notice that if $\gamma_\tau = E(X_{t+\tau}X_t^*)$, then

(6.1.7) $\quad \gamma_{-\tau} = E(X_{t-\tau}X_t^*) = E(X_t X_{t+\tau}^*) = \gamma_\tau^*.$

The property $\gamma_{-\tau} = \gamma_\tau^*$ we have encountered before in connection with $J(-\omega) = J(\omega)^*$. This is called the Hermitian property. If the time series is real-valued, then, of course, the Hermitian property becomes a symmetry property.
Because $\gamma_{-s} = \gamma_s$, the covariance matrix associated with this stationary, real-valued time series is

$\Gamma_n = \begin{pmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{n-2} \\ \gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{n-3} \\ \vdots & \vdots & \vdots & & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \gamma_{n-3} & \cdots & \gamma_0 \end{pmatrix}.$

Note that the matrix is symmetric with constant elements on any diagonal. The corresponding correlation matrix is $R_n = \Gamma_n/\gamma_0$.

One of the consequences of positive definiteness is that the determinant and all of its principal minors are greater than 0. Hence

$\begin{vmatrix} \gamma_0 & \gamma_s \\ \gamma_s & \gamma_0 \end{vmatrix} \ge 0, \quad\text{or}\quad \gamma_0^2 - \gamma_s^2 \ge 0, \quad\text{or}\quad |\gamma_s| \le \gamma_0, \quad\text{or}\quad |\rho_s| \le 1,$

where $\rho_s$ is the correlation of lag $s$. Similarly,

$\begin{vmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 & \rho_1 \\ \rho_2 & \rho_1 & 1 \end{vmatrix} \ge 0,$

which implies

$-1 < \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2} < 1.$

6.2 Basic Spectral Theorems


The discussion we have had earlier focussed on the the discrete Fourier transform of the raw data.
Because the raw data itself is a stochastic process, its Fourier transform will also be a stochastic process. While
this is interesting from a probabilistic point of view, the fact that our putative spectral distribution is a random
process makes it somewhat unsatisfactory for being used a diagnostic tool. In order to stabilize the notion of a
spectrum, we need to take the Fourier transform of a deterministic function. Thus, we would like to replace X
with the covariance function,  . The following series of results are designed to lead us to understanding and
interpreting the transform of the covariance function.

Bochner's Theorem: A continuous function $\gamma(t)$ is non-negative definite if and only if it can be represented by an integral of the form

(6.2.1) $\quad \gamma(t) = \int_{-\infty}^{\infty}e^{i\omega t}\,dF(\omega),$

where $F$ is a real-valued, non-decreasing and bounded function. The function, $F$, will be the analog of a probability distribution function, as we shall see shortly. Notice that (6.2.1) strongly resembles the relationship for the inverse Fourier transform we had discussed earlier. Recall

(6.2.2) $\quad X_t = \frac{n}{2\pi}\int_{-\pi}^{\pi}e^{i\omega t}J(\omega)\,d\omega.$

The proof of Bochner's Theorem is beyond the scope of these lecture notes and will be left as an unproven assertion. Clearly, to employ Bochner's Theorem as it might apply to a covariance function, $\gamma_\tau$, we need only show that $\gamma_\tau$ is continuous.
Lemma: If $\gamma_\tau$ is continuous at $\tau = 0$, then it is continuous at all $\tau$.

Proof: We first consider $\gamma$ evaluated at $\tau$ and $\tau'$. Now

$|\gamma_\tau - \gamma_{\tau'}| = \big|E\{X_{t+\tau}X_t^*\} - E\{X_{t+\tau'}X_t^*\}\big| = \big|E\{(X_{t+\tau} - X_{t+\tau'})X_t^*\}\big|.$

Since $X_t$ is stationary, we have by the Schwartz Inequality,

$|\gamma_\tau - \gamma_{\tau'}| = \big|E\{(X_{t+\tau} - X_{t+\tau'})X_t^*\}\big| \le \big(E|X_t|^2\big)^{1/2}\big(E|X_{t+\tau} - X_{t+\tau'}|^2\big)^{1/2}.$

But

$E\big|X_{t+\tau} - X_{t+\tau'}\big|^2 = E\{X_{t+\tau}X_{t+\tau}^*\} - E\{X_{t+\tau}X_{t+\tau'}^*\} - E\{X_{t+\tau'}X_{t+\tau}^*\} + E\{X_{t+\tau'}X_{t+\tau'}^*\} = 2\gamma_0 - \gamma_{\tau-\tau'} - \gamma_{\tau'-\tau}.$

As $\tau - \tau' \to 0$, $\gamma_{\tau-\tau'} \to \gamma_0$ and $\gamma_{\tau'-\tau} \to \gamma_0$ by assumption. Hence $E|X_{t+\tau} - X_{t+\tau'}|^2 \to 0$, so that
$|\gamma_\tau - \gamma_{\tau'}| \to 0$. Thus $\gamma_\tau$ is continuous for all $\tau$.

We are now in a position to combine Bochner's Theorem with the basic Lemma we have just shown to obtain the basic spectral representation theorem.

Spectral Representation Theorem: If $\gamma_\tau$ is the covariance function of $X_t$ and $\gamma_\tau$ is continuous at 0, then there is a real-valued, non-decreasing, bounded function $F$ such that

(6.2.3) $\quad \gamma_\tau = \int_{-\infty}^{\infty}e^{i\omega\tau}\,dF(\omega).$

The function, $F$, is called the spectral distribution function. The proof follows from Bochner's Theorem and the fact that $\gamma_\tau$ is non-negative definite and continuous.

Notice that $\gamma_0 = \int_{-\infty}^{\infty}dF(\omega) = F(\infty) - F(-\infty) = \operatorname{var}(X_t)$. Since $F$ is defined up to an additive constant, choose $F(-\infty) = 0$, so that $\gamma_0 = F(\infty)$. As a convention, we choose $F(\omega) = F(\omega^+)$, i.e. we assume continuity on the right. If $F$ has a derivative, then

(6.2.4) $\quad f(\omega) = F'(\omega) = \frac{dF(\omega)}{d\omega} \quad\text{and}\quad \gamma_\tau = \int_{-\infty}^{\infty}e^{i\omega\tau}f(\omega)\,d\omega.$

The function $f(\omega)$ is called the spectral density of $X_t$. The set of all discontinuities of $F$ is called the point spectrum of $X_t$.

The spectral representation theorem we have just seen expresses the covariance function in terms of the spectral distribution function or the spectral density function. Of major interest is the situation when the spectral distribution or the spectral density can be represented in terms of the covariance function.

Inverse Fourier Transform Theorem: If $\int_{-\infty}^{\infty}|\gamma_t|\,dt < \infty$, then $f(\omega) = F'(\omega)$ exists and

(6.2.5) $\quad f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-i\omega t}\gamma_t\,dt.$

Again we shall leave this inverse result unproven. Recall, however,

(6.2.6) $\quad J(\omega) = \frac{1}{n}\sum_{t=0}^{n-1}e^{-i\omega t}X_t.$

This is a rather striking analogy. Of course, we are dealing with continuous-time processes in the present discussion. If $X_t$ is not a continuous-time process but rather a discrete-time process, the analogies are even more striking.
To see this, let $\gamma_\tau$ be the covariance function for a wide-sense stationary discrete time process, $X_t$, $t = 0, \pm 1, \pm 2, \ldots$. The discrete analog of the Spectral Representation Theorem is known as the Herglotz Lemma and is given below.

Herglotz Lemma: If $\gamma_\tau$ is the covariance function indicated above, then there is a real-valued, non-decreasing, bounded function, $F$, defined on the interval $(-\pi, \pi)$ such that

(6.2.7) $\quad \gamma_\tau = \int_{-\pi}^{\pi}e^{i\omega\tau}\,dF(\omega).$

Recall in the case of the discrete Fourier transform we studied earlier that

(6.2.8) $\quad X_t = \frac{n}{2\pi}\int_{-\pi}^{\pi}e^{i\omega t}J(\omega)\,d\omega,$

so that we have an even more striking analogy. As in the continuous case, $F$ is defined up to an additive constant. Choose $F(-\pi) = 0$, so that $F(\pi) = \gamma_0$, and as before we take $F(\omega) = F(\omega^+)$, i.e. choose $F$ to be continuous from the right. The discrete version of the inverse Fourier transform can then be given.

Inverse Fourier Transform Theorem (Discrete Version): If $\sum_{\tau=-\infty}^{\infty}|\gamma_\tau| < \infty$, then $f(\omega) = F'(\omega)$ exists and

(6.2.9) $\quad f(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}e^{-i\omega\tau}\gamma_\tau, \quad -\pi \le \omega < \pi.$

Recall for the discrete Fourier transform we had

(6.2.10) $\quad J(\omega_j) = \frac{1}{n}\sum_{t=0}^{n-1}e^{-i\omega_j t}X_t.$

To this point, we have been assuming that $X_t$ is a complex-valued process. If, in fact, $X_t$ is a real-valued process, $\gamma_\tau = \gamma_{-\tau}$. In this case

(6.2.11) $\quad \gamma_\tau = \int_{-\infty}^{\infty}e^{i\omega\tau}\,dF(\omega) \quad\text{and}\quad \gamma_{-\tau} = \int_{-\infty}^{\infty}e^{-i\omega\tau}\,dF(\omega).$

Summing the expressions for $\gamma_\tau$ and $\gamma_{-\tau}$, we have

$2\gamma_\tau = \gamma_\tau + \gamma_{-\tau} = \int_{-\infty}^{\infty}\big(e^{i\omega\tau} + e^{-i\omega\tau}\big)\,dF(\omega).$

Thus

(6.2.12) $\quad \gamma_\tau = \int_{-\infty}^{\infty}\frac{e^{i\omega\tau} + e^{-i\omega\tau}}{2}\,dF(\omega) = \int_{-\infty}^{\infty}\cos(\omega\tau)\,dF(\omega).$

This implies that

(6.2.13) $\quad \int_{-\infty}^{\infty}\sin(\omega\tau)\,dF(\omega) = 0.$

Similarly,

$f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-i\omega t}\gamma_t\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\cos(\omega t)\gamma_t\,dt.$

Because of the even symmetry of the cosine function,

(6.2.14) $\quad f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\cos(-\omega t)\gamma_t\,dt = f(-\omega).$

Thus for a real-valued stochastic process, the spectral density is symmetric about the 0 frequency. A practical implication of this fact is that we really need not plot the spectral density function over the whole range from $-\pi$ to $\pi$; the range from 0 to $\pi$ is sufficient. If $X_t$ is real, so that $\gamma_\tau$ is symmetric about 0, then we may define $G(\omega) = 2F(\omega) - \gamma_0$. In this case

(6.2.15) $\quad \gamma_\tau = \int_{-\infty}^{\infty}e^{i\omega\tau}\,dF(\omega) = \int_{-\infty}^{\infty}\cos(\omega\tau)\,dF(\omega) = \int_{0}^{\infty}\cos(\omega\tau)\,dG(\omega).$

The function G is sometimes called the spectral distribution in real form.


Note that the spectral representation theorem could be (and often is) formulated in many different ways depending on whether or not the constant $2\pi$ is included in the equation representing $\gamma_\tau$ or the equation representing $f(\omega)$. A few of the more common representations are

(6.2.16) $\quad \gamma_\tau = \int_{-\pi}^{\pi}e^{i\omega\tau}\,dF(\omega),$

(6.2.17) $\quad \gamma_\tau = \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{i\omega\tau}\,dF(\omega),$

or

(6.2.18) $\quad \gamma_\tau = \int_{-1/2}^{1/2}e^{2\pi if\tau}\,dF(f).$

In this latter expression, $f$ is the so-called natural frequency, whereas $\omega$ is often called the angular frequency. The relationship between the two is $\omega = 2\pi f$. All three equations (6.2.16), (6.2.17) and (6.2.18) are equivalent, but not identical. Care must be taken to insure the use of the correct constants.
6.3 Interpreting Spectral Densities
We have shown that we may write the covariance function as a Fourier transform of the spectral
distribution function (the spectral density) and, as we did with the DFT applied to the raw data, we can interpret
the spectral density function as indicating the intensity of the sinusoidal components in the covariance function.
That is to say, $f(\omega)$ is the intensity of the fundamental sinusoid at frequency $\omega$ in the covariance, $\gamma_\tau$, much the
same way that $|J(\omega)|^2 = R_n(\omega)$ was the intensity of the fundamental sinusoid at frequency $\omega$ in the data $X_t$. Of
course $J(\omega)$ is a random process while $f(\omega)$ is a deterministic function. As we indicated earlier, this is
the reason for exploiting Bochner's Theorem.
We now give a fundamental theorem which will illustrate the relationship between the Fourier
transform of $X_t$ and the Fourier transform of $\gamma_\tau$. Clearly we want to have the Fourier transform of $\gamma_\tau$ represent
in some sense the structure of the Fourier transform of $X_t$. We will only give the statement of the following
theorem.
Let $X(t) = X_t$ be a zero mean (continuous) process with spectral distribution function, $F(\omega)$. A
process $K(\omega)$ is said to have orthogonal increments if for every set $\omega_1 < \omega_2 \le \omega_3 < \omega_4$, we have

(6.3.1)    $E\,\big[K(\omega_2) - K(\omega_1)\big]\big[K(\omega_4) - K(\omega_3)\big]^* = 0.$

Spectral Representation Theorem: For every stationary process, X(t), there is a stochastic process, $K(\omega)$, with
orthogonal increments, such that for each fixed t

(6.3.2)    $X(t) = \int_{-\infty}^{\infty} e^{i\omega t}\, dK(\omega).$

$K(\omega)$ is defined up to an additive random variable. We may fix this additive random variable by taking
$K(-\infty) = 0$. In this case,

$E[K(\omega)] = 0, \qquad E\{|K(\omega)|^2\} = F(\omega), \qquad \text{and} \qquad E\{|dK(\omega)|^2\} = dF(\omega).$

The process, $K(\omega)$, is called the spectral representation process. (The DFT process, $J(\omega)$, is an approximation
to $K(\omega)$ based on a time series.) This result may be interpreted to say that the expected sinusoidal oscillations in
$X_t$ are the same sinusoidal oscillations in $\gamma_\tau$, i.e. the structure we expect to find in $X_t$ is found in $\gamma_\tau$. We say
this in the sense that the respective spectral densities are the same. This gives us confidence that using $\gamma_\tau$ as a
deterministic representative of $X_t$ is very satisfactory since (at least as far as frequency content is concerned) the
structure of $\gamma_\tau$ is reflective of the structure we expect to find for $X_t$.

Chapter 7
Estimating Spectral Densities
7.1 Estimating the Mean of a Stationary Time Series. I want to make a slight change of notation
here. Instead of writing the time series as $X_0, \ldots, X_{n-1}$, I will write the series as $X_1, \ldots, X_n$. Previously, in our
discussion of complex sinusoids, it has been convenient to start with an index of zero so as to make the first
term in the DFT series one. For reasons of symmetry in our present discussion it will be more convenient to
make the first index one. The major effect of this shift is to create a phase shift in the computation of the DFT,
i.e. multiplication by a complex constant whose modulus is one. This will be irrelevant to the use of the DFT as
a diagnostic tool.
As before, let $X_t$ be a real-valued stationary process with constant mean, $\mu$, and covariance function,
$\gamma_\tau$. We assume that we have a real-valued time series $X_1, \ldots, X_n$. We consider the sample mean,

$\bar{X} = \frac{1}{n} \sum_{t=1}^{n} X_t.$

Notice that $E(\bar{X}) = \frac{1}{n} \sum_{t=1}^{n} E(X_t) = \mu$ so that $\bar{X}$ is an unbiased estimate of $\mu$. Recall that $\bar{X}$ is not
the maximum likelihood estimate for the mean, $\mu$, in the hidden periodicities model. Here, however, we are
only prepared to assume a stationary model for the process. Thus we are not yet able to compute all other
properties of $\bar{X}$. We are interested in computing the variance of $\bar{X}$. Unlike the usual independent and
identically distributed (i.i.d.) case, the computation is not straightforward. We start with a lemma.
Lemma: If $a_j$, $j = 0, \pm 1, \pm 2, \ldots$, then

(7.1.1)    $\sum_{t=1}^{m} \sum_{s=1}^{m} a_{t-s} = \sum_{t=-m+1}^{m-1} (m - |t|)\, a_t.$

Similarly, if h(t) is a real-valued function,

(7.1.2)    $\int_0^T \int_0^T h(t - s)\, ds\, dt = \int_{-T}^{T} (T - |t|)\, h(t)\, dt.$

Proof: We prove the discrete case. The other follows in a similar way. Consider

$\sum_{t=1}^{m} \sum_{s=1}^{m} a_{t-s} = \sum_{t=1}^{m} \sum_{j=t-m}^{t-1} a_j$

where we are letting $t - s = j$. Interchanging the order of summation (see **** figure), we obtain

$\sum_{t=1}^{m} \sum_{s=1}^{m} a_{t-s} = \sum_{j=-m+1}^{0} \sum_{t=1}^{m+j} a_j + \sum_{j=1}^{m-1} \sum_{t=j+1}^{m} a_j.$

Rewriting yields

$\sum_{t=1}^{m} \sum_{s=1}^{m} a_{t-s} = \sum_{j=-m+1}^{0} (m + j)\, a_j + \sum_{j=1}^{m-1} (m - j)\, a_j = \sum_{j=-m+1}^{m-1} (m - |j|)\, a_j.$

This gives us the result in the discrete case. The continuous case follows in a similar way although actually the
computation is slightly easier.

Applying this lemma to the computation of the variance of $\bar{X}$, we have

$\mathrm{var}(\bar{X}) = E\{(\bar{X} - \mu)^2\} = E\Big\{\Big[\tfrac{1}{n}\sum_{t=1}^{n}(X_t - \mu)\Big]^2\Big\} = \tfrac{1}{n^2}\, E\Big\{\sum_{t=1}^{n}\sum_{s=1}^{n}(X_t - \mu)(X_s - \mu)\Big\}.$

Then

$\mathrm{var}(\bar{X}) = \frac{1}{n^2} \sum_{t=1}^{n}\sum_{s=1}^{n} \gamma_{t-s} = \frac{1}{n^2} \sum_{t=-n+1}^{n-1} (n - |t|)\, \gamma_t.$

Canceling an n yields

(7.1.3)    $\mathrm{var}(\bar{X}) = \frac{1}{n} \sum_{t=-n+1}^{n-1} \Big(1 - \frac{|t|}{n}\Big)\, \gamma_t.$

We are now in a position to state the following theorem.

Theorem: Let $\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t$. Then $\bar{X}$ is unbiased for $\mu$ with variance equal to

$\mathrm{var}(\bar{X}) = \frac{1}{n} \sum_{t=-n+1}^{n-1} \Big(1 - \frac{|t|}{n}\Big)\, \gamma_t.$

If $\sum_{t=-\infty}^{\infty} |\gamma_t| < \infty$, then

(7.1.4)    $\mathrm{var}(\bar{X}) \sim \frac{1}{n} \sum_{t=-\infty}^{\infty} \gamma_t$ as $n \to \infty$.

In particular, $\mathrm{var}(\bar{X}) \to 0$ as $n \to \infty$. Further, if $X_t$ is a discrete process,

(7.1.5)    $\sum_{t=-\infty}^{\infty} \gamma_t = 2\pi f(0)$

where f is the spectral density of $X_t$. The expression, $\sim$, means "is asymptotic to."


Proof: The first part has already been shown. To see the second part, consider

$n\, \mathrm{var}(\bar{X}) = \sum_{t=-n+1}^{n-1} \Big(1 - \frac{|t|}{n}\Big)\gamma_t = \sum_{t=-n+1}^{n-1} \gamma_t - \frac{1}{n}\sum_{t=-n+1}^{n-1} |t|\, \gamma_t.$

The second term can be rewritten as

$\frac{1}{n}\sum_{t=-n+1}^{n-1} |t|\, \gamma_t = \sum_{t=-\infty}^{\infty} I_n(t)\, \frac{|t|}{n}\, \gamma_t$

where term by term we define $I_n$ by

$I_n(t) = \begin{cases} 1, & -n+1 \le t \le n-1 \\ 0, & \text{otherwise.} \end{cases}$

Notice that

$\Big| I_n(t)\, \frac{|t|}{n}\, \gamma_t \Big| \le |\gamma_t|.$

By results on series from advanced calculus, we know

$\lim_{n\to\infty} \sum_{t=-\infty}^{\infty} I_n(t)\, \frac{|t|}{n}\, \gamma_t = \sum_{t=-\infty}^{\infty} \lim_{n\to\infty} I_n(t)\, \frac{|t|}{n}\, \gamma_t = \sum_{t=-\infty}^{\infty} 0 = 0.$

Hence $n\, \mathrm{var}(\bar{X}) \to \sum_{t=-\infty}^{\infty} \gamma_t$. Finally note that

(7.1.6)    $f(0) = \frac{1}{2\pi} \sum_{t=-\infty}^{\infty} e^{-i\,0\,t}\, \gamma_t = \frac{1}{2\pi} \sum_{t=-\infty}^{\infty} \gamma_t,$

so that $n\, \mathrm{var}(\bar{X}) \to 2\pi f(0)$. Similar results hold for the continuous case. Specifically,

$E(\bar{X}) = \mu,$

(7.1.7)    $\mathrm{var}(\bar{X}) = \frac{1}{T} \int_{-T}^{T} \Big(1 - \frac{|t|}{T}\Big)\, \gamma(t)\, dt$

with

(7.1.8)    $T\, \mathrm{var}(\bar{X}) \to \int_{-\infty}^{\infty} \gamma(t)\, dt = 2\pi f(0).$
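The variance formula (7.1.3) and its large-sample approximation $2\pi f(0)/n$ are easy to check numerically. The following is a minimal sketch, not part of the original development; it assumes an MA(1)-type covariance with $\gamma_0 = 1 + \theta^2$ and $\gamma_{\pm 1} = -\theta$, and the function name is hypothetical.

```python
import numpy as np

def var_xbar_exact(gamma, n):
    """Exact var(Xbar) from (7.1.3): (1/n) * sum_{|t|<n} (1 - |t|/n) * gamma_t.
    `gamma` maps |lag| -> autocovariance (zero beyond the keys given)."""
    return sum((1 - abs(t) / n) * gamma.get(abs(t), 0.0) for t in range(-n + 1, n)) / n

# Assumed MA(1)-type covariance with theta = 0.6 and unit innovation variance
theta = 0.6
gamma = {0: 1 + theta**2, 1: -theta}

n = 500
exact = var_xbar_exact(gamma, n)
# Asymptotic value from (7.1.4)-(7.1.5): (1/n) * sum_t gamma_t = 2*pi*f(0)/n
asymptotic = (gamma[0] + 2 * gamma[1]) / n
print(exact, asymptotic)   # the two values should be close for large n
```

For a summable covariance the two numbers agree to several decimal places once n is moderately large, which is exactly the content of the theorem above.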

7.2 Estimation of Variances and Covariances. Let $X_1, \ldots, X_n$ be a time series from $X_t$, which is a
stationary process with mean $\mu$ and covariance function $\gamma_s$. We want now to estimate $\gamma_s$. A first
approximation variance estimator might be

(7.2.1)    $\hat{\gamma}_0 = \frac{1}{n} \sum_{t=1}^{n} (X_t - \bar{X})^2,$

the usual estimator of variance, with the general stationary covariance estimator

(7.2.2)    $\hat{\gamma}_s = \frac{1}{n-s} \sum_{t=1}^{n-s} (X_t - \bar{X})(X_{t+s} - \bar{X}).$

This, like other ad hoc estimators, has some potential drawbacks. One might object that the first factor in the
summation given in (7.2.2) involves the observations $X_1, \ldots, X_{n-s}$ and yet we subtract $\bar{X}$ which is computed
from the whole set of observations, $X_1, \ldots, X_n$. Similarly, the second factor involves $X_{s+1}, \ldots, X_n$ and again we
subtract $\bar{X}$. In spite of the symmetry, this clearly will cause the cross products we expect to cancel in quadratic
form computations not to cancel. Hence, a second proposed estimator might be

(7.2.3)    $\tilde{\gamma}_s = \frac{1}{n-s} \sum_{t=1}^{n-s} (X_t - \bar{X}_{1,s})(X_{t+s} - \bar{X}_{2,s})$

where

(7.2.4)    $\bar{X}_{1,s} = \frac{1}{n-s} \sum_{t=1}^{n-s} X_t \qquad \text{and} \qquad \bar{X}_{2,s} = \frac{1}{n-s} \sum_{t=s+1}^{n} X_t,$

so that each factor is centered at the mean of the observations it actually involves. In this case, expanding the
cross product, we have

$E(\tilde{\gamma}_s) = \frac{1}{n-s}\, E\Big\{\sum_{t=1}^{n-s} \big[X_t X_{t+s} - \bar{X}_{1,s} X_{t+s} - \bar{X}_{2,s} X_t + \bar{X}_{1,s}\bar{X}_{2,s}\big]\Big\};$

(7.2.5)    $E(\tilde{\gamma}_s) = \frac{1}{n-s}\, E\Big\{\sum_{t=1}^{n-s} X_t X_{t+s} - 2(n-s)\bar{X}_{1,s}\bar{X}_{2,s} + (n-s)\bar{X}_{1,s}\bar{X}_{2,s}\Big\}.$

Canceling similar terms, we have

(7.2.6)    $E(\tilde{\gamma}_s) = \frac{1}{n-s}\, E\Big\{\sum_{t=1}^{n-s} X_t X_{t+s} - (n-s)\bar{X}_{1,s}\bar{X}_{2,s}\Big\}.$

This may be rewritten as

(7.2.7)    $E(\tilde{\gamma}_s) = \frac{1}{n-s}\, E\Big\{\sum_{t=1}^{n-s} (X_t - \mu)(X_{t+s} - \mu) + (n-s)\big[\mu\bar{X}_{1,s} + \mu\bar{X}_{2,s} - \mu^2 - \bar{X}_{1,s}\bar{X}_{2,s}\big]\Big\}$

which in turn yields

(7.2.8)    $E(\tilde{\gamma}_s) = \frac{1}{n-s} \sum_{t=1}^{n-s} E\big[(X_t - \mu)(X_{t+s} - \mu)\big] - E\big[(\bar{X}_{1,s} - \mu)(\bar{X}_{2,s} - \mu)\big].$

Finally this becomes

(7.2.9)    $E(\tilde{\gamma}_s) = \gamma_s - \mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}).$

But

(7.2.10)    $\mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) = E\Big\{\frac{1}{n-s}\sum_{t=1}^{n-s}(X_t - \mu) \cdot \frac{1}{n-s}\sum_{u=1}^{n-s}(X_{u+s} - \mu)\Big\}$

which in turn may be written

$\mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) = \frac{1}{(n-s)^2} \sum_{t=1}^{n-s}\sum_{u=1}^{n-s} E\big[(X_t - \mu)(X_{u+s} - \mu)\big]$

so that

$\mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) = \frac{1}{(n-s)^2} \sum_{t=1}^{n-s}\sum_{u=1}^{n-s} \gamma_{u+s-t}.$

Using (7.1.1) from the Lemma, we have

(7.2.11)    $\mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) = \frac{1}{(n-s)^2} \sum_{j=-(n-s)+1}^{(n-s)-1} (n - s - |j|)\, \gamma_{j+s}.$

As before, if $\sum_{j=-\infty}^{\infty} |\gamma_j| < \infty$, then $\mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) \to 0$ as $n \to \infty$. We may compute the asymptotic
bias by using (7.2.9) to obtain

(7.2.12)    $n\big[E(\tilde{\gamma}_s) - \gamma_s\big] = -n\, \mathrm{cov}(\bar{X}_{1,s}, \bar{X}_{2,s}) = \frac{-n}{n-s} \sum_{j=-(n-s)+1}^{(n-s)-1} \Big(1 - \frac{|j|}{n-s}\Big)\, \gamma_{j+s}.$

Letting $n \to \infty$, we have

(7.2.13)    $n\big[E(\tilde{\gamma}_s) - \gamma_s\big] \to -\sum_{j=-\infty}^{\infty} \gamma_{j+s} = -\sum_{j=-\infty}^{\infty} \gamma_j.$

This shows that $\tilde{\gamma}_s$ is asymptotically unbiased. Computing the asymptotic mean of $\tilde{\gamma}_s$ is not a trivial task.
Computing the corresponding mean of $\hat{\gamma}_s$ is a real mess! We will not show the algebra of the computation, but
for the record,

(7.2.14)    $E(\hat{\gamma}_s) = \gamma_s + \mathrm{var}(\bar{X}) - \frac{1}{n(n-s)}\Big\{\sum_{k}(n-k)\,\gamma_k + \sum_{k}(n-s)\,\gamma_k + \sum_{k}(n-s+k)\,(\gamma_k + \gamma_{k+s} + \gamma_{k-s})\Big\}.$

Taking limits as $n \to \infty$ and simplifying,

(7.2.15)    $E(\hat{\gamma}_s) \approx \gamma_s + \mathrm{var}(\bar{X}) - \frac{1}{n}\, r_s, \qquad \text{where} \quad \lim_{n\to\infty} r_s = 2\sum_{j=-\infty}^{\infty}\gamma_j,$

so that

(7.2.16)    $n\big[E(\hat{\gamma}_s) - \gamma_s\big] \not\to 0.$

Thus $E(\hat{\gamma}_s) - \gamma_s$ behaves like $\mathrm{var}(\bar{X}) \to 0$ as $n \to \infty$, but, in contrast with $\tilde{\gamma}_s$, $\hat{\gamma}_s$ is in general not asymptotically unbiased in the stronger sense that n times its bias converges.




A simpler estimator assumes we start with a zero mean process. In this case a natural estimator of $\gamma_s$
is

(7.2.17)    $\hat{\gamma}_s = \frac{1}{n} \sum_{t=1}^{n-s} X_t X_{t+s}.$

It is straightforward to see

(7.2.18)    $E(\hat{\gamma}_s) = \frac{1}{n} \sum_{t=1}^{n-s} \gamma_s = \frac{n-s}{n}\, \gamma_s \to \gamma_s$ as $n \to \infty$.

Whereas the computation of the variance of $\hat{\gamma}_s$ and $\tilde{\gamma}_s$ is all but impossible in general, some results are known about the
variance of this estimator. We may get some idea by considering a special case. The following results were developed in
****. Let $X_t = \sum_j b_j e_{t-j}$. We shall later see this as an infinite moving average where the $e_t$ are independent
random variables with $E(e_t) = 0$, $\mathrm{var}(e_t) = 1$, $E(e_t^4) < \infty$ and $|b_j| \le K c^j$, $0 \le c < 1$. Let $\hat{\rho}_s = \hat{\gamma}_s / \hat{\gamma}_0$. Then

(7.2.19)    $\lim_{n\to\infty} n\, E\{(\hat{\rho}_s - \rho_s)^2\} = \sum_{j=-\infty}^{\infty}\big(\rho_j^2 + \rho_{j+s}\rho_{j-s} - 4\rho_s\rho_j\rho_{j-s} + 2\rho_j^2\rho_s^2\big),$

so that, writing $V_s$ for this limit,

(7.2.20)    $\lim_{n\to\infty} n\, \mathrm{var}(\hat{\rho}_s) = \sum_{j=-\infty}^{\infty}\big(\rho_j^2 + \rho_{j+s}\rho_{j-s} - 4\rho_s\rho_j\rho_{j-s} + 2\rho_j^2\rho_s^2\big) = V_s.$

Then

(7.2.21)    $\lim_{n\to\infty} n\, \mathrm{cov}(\hat{\rho}_s, \hat{\rho}_{s+v}) = \sum_{j=-\infty}^{\infty}\big(\rho_j\rho_{j+v} + \rho_{j+s+v}\rho_{j-s} - 2\rho_{s+v}\rho_j\rho_{j-s} - 2\rho_s\rho_j\rho_{j-s-v} + 2\rho_s\rho_{s+v}\rho_j^2\big) = U_{s,s+v}.$

From this the correlation is seen to be

(7.2.22)    $\lim_{n\to\infty} \mathrm{corr}(\hat{\rho}_s, \hat{\rho}_{s+v}) = \frac{U_{s,s+v}}{(V_s V_{s+v})^{1/2}}.$

The asymptotic distribution is calculated by

(7.2.23)    $\sqrt{n}\,(\hat{\rho}_s - \rho_s) \xrightarrow{\ \text{dist}\ } N(0, V_s).$

Thus $\hat{\rho}_s$ is asymptotically unbiased, consistent and asymptotically normal. Under appropriate assumptions, $\hat{\rho}_s$
(and $\hat{\gamma}_s$ by implication) seem to be reasonable estimators of the correlation, $\rho_s$, and the covariance,
$\gamma_s$, respectively. Notice that if we consider $\lim_{s\to\infty} V_s$, then we have

(7.2.24)    $V_\infty = \lim_{s\to\infty} \lim_{n\to\infty} n\, \mathrm{var}(\hat{\rho}_s) = \sum_{j=-\infty}^{\infty} \rho_j^2.$

Hence even if n is large, when we try to estimate $\rho_s$ for large lags, the variance of the estimate $\hat{\rho}_s$ does not
converge to 0.
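The estimator (7.2.17) and its normalized version are simple to compute directly. The following is an illustrative sketch only (the function names are assumptions); it assumes a zero-mean series and uses the 1/n divisor of (7.2.17).

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat_s = (1/n) * sum_{t=1}^{n-s} x_t x_{t+s},
    as in (7.2.17), for a zero-mean series x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - s], x[s:]) / n for s in range(max_lag + 1)])

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat_s = gamma_hat_s / gamma_hat_0."""
    g = sample_acvf(x, max_lag)
    return g / g[0]

# Example: for white noise the nonzero-lag correlations should hover near zero,
# with standard error roughly 1/sqrt(n), consistent with (7.2.23).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print(sample_acf(x, 5))
```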

7.3 The Periodogram and Other Smoothers. We restrict our attention to real-valued, wide-sense
stationary time series with a spectral density function. Let $X_1, \ldots, X_n$ be the time series. We want to estimate
the spectral density function, $f(\omega)$, $|\omega| \le \pi$. We assume in this discussion that $\sum_{j=-\infty}^{\infty} |\gamma_j| < \infty$. Recall by
definition

(7.3.1)    $f(\omega) = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma_j\, e^{-i\omega j}.$

By analogy to (7.3.1), we can consider an empirical density estimator called the periodogram and given as

(7.3.2)    $I_n(\omega) = \frac{1}{2\pi} \sum_{j=-n+1}^{n-1} \hat{\gamma}_j\, e^{-i\omega j},$

where $\hat{\gamma}_j = \frac{1}{n}\sum_{t=1}^{n-|j|} X_t X_{t+|j|}$ is the estimator of the covariance function, $\gamma_j$. From our earlier discussions, we
considered a spectral estimator defined by $R_n(\omega) = J(\omega)J^*(\omega) = \big|\frac{1}{n}\sum_{t=1}^{n} X_t e^{-i\omega t}\big|^2$. To see that these are actually
closely related results, we have the following theorem.


Periodogram Theorem:

(7.3.3)    $I_n(\omega) = \frac{1}{2\pi n} \Big|\sum_{t=1}^{n} X_t\, e^{-i\omega t}\Big|^2 = \frac{1}{2\pi} \sum_{s=-n+1}^{n-1} \hat{\gamma}_s\, e^{-i\omega s}$

and

(7.3.4)    $\hat{\gamma}_s = \int_{-\pi}^{\pi} e^{i\omega s}\, I_n(\omega)\, d\omega.$

Proof: Consider the second part, (7.3.4), first. Recall that $I_n(\omega) = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} \hat{\gamma}_j e^{-i\omega j}$ so that

$\int_{-\pi}^{\pi} e^{i\omega s}\, I_n(\omega)\, d\omega = \int_{-\pi}^{\pi} e^{i\omega s}\, \frac{1}{2\pi} \sum_{j=-n+1}^{n-1} \hat{\gamma}_j\, e^{-i\omega j}\, d\omega.$

Interchanging the order of summation and integration, we have that

$\int_{-\pi}^{\pi} e^{i\omega s}\, I_n(\omega)\, d\omega = \sum_{j=-n+1}^{n-1} \hat{\gamma}_j\, \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\omega(s-j)}\, d\omega = \sum_{j=-n+1}^{n-1} \hat{\gamma}_j\, \delta_{sj} = \hat{\gamma}_s.$

Here $\delta_{sj}$ plays the role of the Dirac delta function. That this integral is either 0 or 1 follows from the orthogonality result
(3.1.16). Hence the second part follows immediately from (7.3.2). Now consider

$I_n(\omega) = \frac{1}{2\pi n} \Big|\sum_{t=1}^{n} X_t\, e^{-i\omega t}\Big|^2.$

Simplifying, we have

$I_n(\omega) = \frac{1}{2\pi n} \Big\{\sum_{t=1}^{n} X_t\, e^{-i\omega t}\Big\}\Big\{\sum_{u=1}^{n} X_u^*\, e^{+i\omega u}\Big\} = \frac{1}{2\pi n} \sum_{t=1}^{n}\sum_{u=1}^{n} e^{-i\omega(t-u)}\, X_t X_u^*.$

Now letting $s = t - u$ so that $u = t - s$, we have

$I_n(\omega) = \frac{1}{2\pi n} \sum_{t=1}^{n} \sum_{s=t-n}^{t-1} e^{-i\omega s}\, X_t X_{t-s}^*.$

We want to interchange summations. For $s \le 0$, the range on t is 1 to $n + s$. For $s > 0$, the range on t is
$s + 1$ to n, so that

$I_n(\omega) = \frac{1}{2\pi} \sum_{s=-n+1}^{n-1} e^{-i\omega s}\, \Big\{\frac{1}{n}\sum_{t} X_t X_{t-s}^*\Big\}.$

Here because the range of summation on t is dependent on s, we have deliberately left the summation
ambiguous. For $s \le 0$, we consider

$\frac{1}{n}\sum_{t=1}^{n+s} X_t X_{t-s}^*.$

Since s is negative, $-s$ is positive and this sum has precisely the definition of $\hat{\gamma}_{-s}$; for real-valued processes $X^* = X$,
and letting $u = t - s$ shows it equals $\hat{\gamma}_{|s|}$. For $s > 0$, we consider

$\frac{1}{n}\sum_{t=s+1}^{n} X_t X_{t-s}^*,$

and letting $u = t - s$ so that $t = u + s$ gives $\frac{1}{n}\sum_{u=1}^{n-s} X_{u+s} X_u^*$, which is the definition of $\hat{\gamma}_s$. In either case,

$I_n(\omega) = \frac{1}{2\pi} \sum_{s=-n+1}^{n-1} \hat{\gamma}_s\, e^{-i\omega s}.$

We define the function $I_n(\omega)$ to be the periodogram. Thus the periodogram, the square modulus of the DFT of
the data, is also the Fourier transform of the covariance estimate in the formula $\hat{\gamma}_s = \frac{1}{n}\sum_{t=1}^{n-s} X_t X_{t+s}$. We
develop this connection in more detail. Letting $X_t = \sum_p \big[A_p \cos(\omega_p t) + B_p \sin(\omega_p t)\big]$, we know from the
quick and dirty formulae that

(7.3.5)    $A_p = \frac{2}{n}\sum_{t=1}^{n} X_t \cos(\omega_p t) \qquad \text{and} \qquad B_p = \frac{2}{n}\sum_{t=1}^{n} X_t \sin(\omega_p t).$

Consider then

(7.3.6)    $A_p^2 + B_p^2 = \frac{4}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n} X_t X_s\, \big[\cos(\omega_p s)\cos(\omega_p t) + \sin(\omega_p s)\sin(\omega_p t)\big]$

which may be simplified to

$A_p^2 + B_p^2 = \frac{4}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n} X_t X_s \cos\big(\omega_p (t - s)\big).$

We have assumed that $X_t$ is real valued. We may take the real part of $A_p^2 + B_p^2$ to have

$A_p^2 + B_p^2 = \mathrm{Real}\Big\{\frac{4}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n} X_t X_s\, e^{i\omega_p(t-s)}\Big\}.$

This is simplified to

$A_p^2 + B_p^2 = \mathrm{Real}\Big\{\frac{4}{n^2}\sum_{t=1}^{n} X_t\, e^{i\omega_p t}\sum_{s=1}^{n} X_s\, e^{-i\omega_p s}\Big\} = \frac{4}{n^2}\Big|\sum_{t=1}^{n} X_t\, e^{-i\omega_p t}\Big|^2.$

Using the results of the Periodogram Theorem, we find that

(7.3.7)    $A_p^2 + B_p^2 = \frac{4}{n^2}\Big|\sum_{t=1}^{n} X_t\, e^{-i\omega_p t}\Big|^2 = \frac{4}{n^2}\cdot 2\pi n\, I_n(\omega_p) = \frac{8\pi}{n}\, I_n(\omega_p).$

Hence, it is only a slight and obvious generalization to find that

(7.3.8)    $R_n(\omega) = \frac{A^2(\omega) + B^2(\omega)}{4} = \frac{2\pi}{n}\, I_n(\omega)$

where

(7.3.9)    $I_n(\omega) = \frac{1}{2\pi n}\Big|\sum_{t=1}^{n} X_t\, e^{-i\omega t}\Big|^2$

or

(7.3.10)    $I_n(\omega) = \frac{1}{2\pi} \sum_{s=-n+1}^{n-1} \hat{\gamma}_s\, e^{-i\omega s},$

with $A(\omega) = \frac{2}{n}\sum_{t=1}^{n} X_t\cos(\omega t)$ and $B(\omega) = \frac{2}{n}\sum_{t=1}^{n} X_t\sin(\omega t)$, and with $\hat{\gamma}_s = \frac{1}{n}\sum_{t=1}^{n-s} X_t X_{t+s}$ where $X_t$ is real valued. Thus
we have a fundamental linkage between the generalized model of hidden periodicities, the DFT, the spectral
density and the covariance function. This gives additional intuitive support to the use of $R_n(\omega)$ as a diagnostic
tool for fitting the model of hidden periodicities.
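The identity in (7.3.3) — the periodogram is both the squared modulus of the DFT and the Fourier transform of the sample autocovariances — can be verified numerically. The following is an illustrative sketch only (function names are assumptions), evaluating both forms on a small simulated series.

```python
import numpy as np

def periodogram_dft(x, omegas):
    """I_n(w) = (1/(2*pi*n)) * |sum_t x_t exp(-i w t)|^2, with t = 1..n, as in (7.3.9)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1)
    return np.array([np.abs(np.sum(x * np.exp(-1j * w * t))) ** 2 / (2 * np.pi * n)
                     for w in omegas])

def periodogram_acvf(x, omegas):
    """I_n(w) = (1/2pi) * sum_{|s|<n} gamma_hat_s exp(-i w s), as in (7.3.10)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.array([np.dot(x[: n - s], x[s:]) / n for s in range(n)])   # gamma_hat_s, s >= 0
    s = np.arange(1, n)
    return np.array([(g[0] + 2 * np.sum(g[1:] * np.cos(w * s))) / (2 * np.pi)
                     for w in omegas])

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
omegas = 2 * np.pi * np.arange(1, 5) / len(x)
print(periodogram_dft(x, omegas))
print(periodogram_acvf(x, omegas))   # the two computations agree, as the theorem asserts
```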
Next let us consider some of the statistical properties of the periodogram. It is not hard to see that

(7.3.11)    $E[I_n(\omega)] = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} e^{-i\omega j}\, E[\hat{\gamma}_j] = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} e^{-i\omega j}\, \gamma_j\, \Big(\frac{n - |j|}{n}\Big).$

The latter equality follows from a slight adaptation of (7.2.18). A little algebra yields

(7.3.12)    $E[I_n(\omega)] = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} \Big(1 - \frac{|j|}{n}\Big)\, e^{-i\omega j}\, \gamma_j \;\to\; \frac{1}{2\pi}\sum_{j=-\infty}^{\infty} e^{-i\omega j}\, \gamma_j = f(\omega), \quad \text{as } n \to \infty.$

From this it follows that $I_n(\omega)$ is asymptotically unbiased. Consider the variance of $I_n(\omega)$. In general, this is not
a very tractable calculation so that we consider the special case in which $X_1, \ldots, X_n$ are independent random
variables. Then

(7.3.13)    $\mathrm{var}[I_n(\omega)] = E\Big\{\Big[\frac{1}{2\pi}\sum_{j=-n+1}^{n-1} e^{-i\omega j}\,\Big(\hat{\gamma}_j - \Big(1 - \frac{|j|}{n}\Big)\gamma_j\Big)\Big]^2\Big\}.$


Realistically there is not much hope for a simple solution based on (7.3.10). The formulation (7.3.9), however,
yields a more tractable computation, so consider

(7.3.14)    $E[I_n^2(\omega)] = E\Big\{\frac{1}{4\pi^2 n^2}\Big|\sum_{t=1}^{n} X_t\, e^{-i\omega t}\Big|^4\Big\}.$

Expanding the quadruple sums yields

(7.3.15)    $E[I_n^2(\omega)] = \frac{1}{4\pi^2 n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\sum_{u=1}^{n}\sum_{v=1}^{n} E[X_s X_t X_u X_v]\, e^{-i\omega s} e^{+i\omega t} e^{-i\omega u} e^{+i\omega v}.$

Now let the argument of the quadruple sum be $E_{stuv}$. Depending on the values of s, t, u and v, we will get
different sums as follows.
If $s = t \ne u = v$, $E_{stuv} = \sigma^4$ and there are $n(n-1)$ terms of this form.
If $s = u \ne t = v$, $E_{stuv} = \sigma^4 e^{-2i\omega(s - t)}$ and there are $n(n-1)$ terms of this form.
If $s = v \ne t = u$, $E_{stuv} = \sigma^4$ and there are $n(n-1)$ terms of this form.
If $s = t = u = v$, $E_{stuv} = E[X_t^4] = \eta$ and there are n terms of this form.
Otherwise, $E_{stuv} = 0$.
We may use these 5 cases to form the equation

(7.3.16)    $E[I_n^2(\omega)] = \frac{1}{4\pi^2 n^2}\Big\{2n(n-1)\sigma^4 + n\eta + \sigma^4\Big[\Big|\sum_{t=1}^{n} e^{-2i\omega t}\Big|^2 - n\Big]\Big\}.$

The $-n$ term in the expression in braces is meant to account for the case $s = t$ which is double counted in the
square modulus. Using our earlier results related to the Dirichlet kernel,

(7.3.17)    $E[I_n^2(\omega)] = \frac{\sigma^4}{2\pi^2} - \frac{\sigma^4}{2\pi^2 n} + \frac{\eta}{4\pi^2 n} + \frac{\sigma^4}{4\pi^2}\Big[\frac{\sin(n\omega)}{n\sin(\omega)}\Big]^2 - \frac{\sigma^4}{4\pi^2 n}.$

Collecting terms we obtain

(7.3.18)    $E[I_n^2(\omega)] = \frac{\sigma^4}{2\pi^2} + \frac{1}{4\pi^2 n}\big(\eta - 3\sigma^4\big) + \frac{\sigma^4}{4\pi^2}\Big[\frac{\sin(n\omega)}{n\sin(\omega)}\Big]^2.$

But $E[I_n(\omega)] \to f(\omega) = \sigma^2/2\pi$ for a time series which is an independent and identically distributed sequence
(called white noise). We may combine this with (7.3.18) to obtain an approximation for the variance:

(7.3.19)    $\mathrm{var}[I_n(\omega)] \approx \frac{\sigma^4}{4\pi^2} + \frac{1}{4\pi^2 n}\big(\eta - 3\sigma^4\big) + \frac{\sigma^4}{4\pi^2}\Big[\frac{\sin(n\omega)}{n\sin(\omega)}\Big]^2.$

Simplifying (7.3.19) and recognizing that the second term of (7.3.19) approximates 0 for large n, we obtain

(7.3.20)    $\mathrm{var}[I_n(\omega)] \approx \frac{\sigma^4}{4\pi^2}\Big\{1 + \Big[\frac{\sin(n\omega)}{n\sin(\omega)}\Big]^2\Big\}.$

If $\omega \ne 0$ and $\omega \ne \pm\pi$, then $\frac{\sin(n\omega)}{n\sin(\omega)} \to 0$ as $n \to \infty$ and $\mathrm{var}[I_n(\omega)] \approx \frac{\sigma^4}{4\pi^2}$ for n large. If $\omega = 0$ or $\omega = \pm\pi$, then
$\frac{\sin(n\omega)}{n\sin(\omega)} \to 1$ and $\mathrm{var}[I_n(\omega)] \approx \frac{\sigma^4}{2\pi^2}$ for n large. Thus

(7.3.21)    $\mathrm{var}[I_n(\omega)] \approx \begin{cases} \dfrac{\sigma^4}{4\pi^2}, & \omega \ne 0 \text{ and } \omega \ne \pm\pi \\[1ex] \dfrac{\sigma^4}{2\pi^2}, & \omega = 0 \text{ or } \omega = \pm\pi. \end{cases}$

Thus, even in this simple case of white noise, the variance of the periodogram is asymptotically not equal to
zero. In general,

(7.3.22)    $\mathrm{var}[I_n(\omega)] \to \begin{cases} f^2(\omega), & \omega \ne 0 \text{ and } \omega \ne \pm\pi \\ 2f^2(\omega), & \omega = 0 \text{ or } \omega = \pm\pi. \end{cases}$

Repeating arguments along these lines, we may obtain

(7.3.23)    $\mathrm{cov}[I_n(\omega), I_n(\lambda)] \approx \Big\{\Big[\frac{\sin\big(n(\omega-\lambda)/2\big)}{n\sin\big((\omega-\lambda)/2\big)}\Big]^2 + \Big[\frac{\sin\big(n(\omega+\lambda)/2\big)}{n\sin\big((\omega+\lambda)/2\big)}\Big]^2\Big\}\, f(\omega)f(\lambda) \;\to\; 0 \quad \text{as } n \to \infty.$

In general, the periodogram is too rough, i.e. its variance is too large. We want to smooth the periodogram. We
do this with a kernel smoother.
7.4 Kernel Smoothers. We construct a smoothed periodogram by a convolution of the raw periodogram
with a smoother function which we will label $K(\omega)$. Let $K(\omega)$ be a real-valued, non-negative function such that

(7.4.1)    $K(\omega) = K(-\omega) \qquad \text{and} \qquad \int_{-\infty}^{\infty} K(\omega)\, d\omega = 1.$

Consider

(7.4.2)    $\int_{-\infty}^{\infty} K(\lambda - \omega)\, I_n(\omega)\, d\omega = \int_{-\infty}^{\infty} K(\lambda - \omega)\, \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} \hat{\gamma}_j\, e^{-i\omega j}\, d\omega.$

Let $s = \lambda - \omega$, so that $ds = -d\omega$ and $\omega = \lambda - s$. Then

$\int_{-\infty}^{\infty} K(\lambda - \omega)\, I_n(\omega)\, d\omega = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} \Big[\int_{-\infty}^{\infty} K(s)\, e^{isj}\, ds\Big]\, \hat{\gamma}_j\, e^{-i\lambda j}.$

If we define $a_j = \int_{-\infty}^{\infty} K(s)\, e^{isj}\, ds$, then

(7.4.3)    $\int_{-\infty}^{\infty} K(\lambda - \omega)\, I_n(\omega)\, d\omega = \frac{1}{2\pi}\sum_{j=-n+1}^{n-1} a_j\, \hat{\gamma}_j\, e^{-i\lambda j}.$

Thus we can weight the covariance function $\hat{\gamma}_j$ by $a_j$ to smooth $I_n(\omega)$. We consider therefore estimates of the
form

(7.4.4)    $\hat{f}(\omega) = \frac{1}{2\pi}\sum_{j=-m}^{m} k(j/m)\, \hat{\gamma}_j\, e^{-i\omega j}$

where $a_j = k(j/m)$, $j = -m, -m+1, \ldots, m-1, m$, and $a_j = 0$ elsewhere. The function, k, is called the
lag window and m is called the truncation point.

(7.4.5)    $K_m(\omega) = \frac{1}{2\pi}\sum_{j=-m}^{m} k(j/m)\, e^{-i\omega j}$

is called the spectral window. $K(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} k(t)\, e^{-i\omega t}\, dt$ is called the spectral window generator. Note that
$K_m(\omega) \approx m\, K(m\omega)$ if K is concentrated on $(-\pi, \pi)$. We may begin with k(t) and calculate K or $K_m$ as
desired. In general, however, we will want to weight the covariance and then use the formula

(7.4.6)    $\hat{f}(\omega) = \frac{1}{2\pi}\sum_{j=-m}^{m} k(j/m)\, \hat{\gamma}_j\, \cos(\omega j)$

as the computational form. We are assuming that the time series $X_t$ is real valued so that the imaginary part of
(7.4.4) is zero, yielding the very convenient form (7.4.6).
Examples of lag windows: There are several classical windows for smoothing the periodogram. We may
think of the smoothed periodogram as a nonparametric spectral density estimator. It is nonparametric because
we are not assuming any parametric form for the spectral density. We will later study time-domain models
which are indeed parametric. Thus we will also be able to estimate spectral densities parametrically. The
smoothed periodogram is in a sense an ad hoc estimator. It is not chosen against any optimality criterion. The
practical consideration involves reducing the variance so as to make $\hat{f}(\omega)$ a statistically consistent estimator of
the density $f(\omega)$. The trade-off is that the bias of $\hat{f}(\omega)$ will increase. For this reason, a fairly large number of
different functional forms for the kernel, k, have been developed.
Bartlett window.

(7.4.7)    $k(x) = \begin{cases} 1 - |x|, & |x| \le 1 \\ 0, & \text{otherwise.} \end{cases}$

The Bartlett window is desirable because of its particularly simple computational form.
Daniell window.

(7.4.8)    $k(x) = \frac{\sin(\pi x)}{\pi x}, \qquad -\infty < x < \infty.$

Tukey window.

(7.4.9)    $k(x) = \begin{cases} 1 - 2a + 2a\cos(\pi x), & |x| \le 1 \\ 0, & \text{otherwise.} \end{cases}$

For $a = \tfrac{1}{4}$ and $|x| \le 1$, $k(x) = \tfrac{1}{2}\big(1 + \cos(\pi x)\big)$ which, of course, is a particularly simple form. The value
$a = .23$ was at one time a recommendation made by John Tukey.
Parzen window.

(7.4.10)    $k(x) = \begin{cases} 1 - 6x^2 + 6|x|^3, & 0 \le |x| \le .5 \\ 2(1 - |x|)^3, & .5 \le |x| \le 1 \\ 0, & \text{otherwise.} \end{cases}$
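As a concrete illustration of the computational form (7.4.6), the sketch below computes a lag-window spectral estimate with the Tukey window at $a = 1/4$ (the simple cosine form noted above). It is a minimal, illustrative implementation only; the function names and the starting choice m = n/10 are assumptions, not part of the original text.

```python
import numpy as np

def tukey_window(x, a=0.25):
    """Tukey lag window (7.4.9): 1 - 2a + 2a*cos(pi*x) for |x| <= 1, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, 1 - 2 * a + 2 * a * np.cos(np.pi * x), 0.0)

def smoothed_periodogram(x, m, omegas, window=tukey_window):
    """Lag-window estimate (7.4.6):
       f_hat(w) = (1/2pi) * sum_{|j|<=m} k(j/m) gamma_hat_j cos(w j)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # center, since gamma_hat assumes a zero-mean series
    n = len(x)
    g = np.array([np.dot(x[: n - j], x[j:]) / n for j in range(m + 1)])
    k = window(np.arange(m + 1) / m)
    j = np.arange(1, m + 1)
    return np.array([(k[0] * g[0] + 2 * np.sum(k[1:] * g[1:] * np.cos(w * j)))
                     / (2 * np.pi) for w in omegas])

# Example usage with the assumed truncation point of about 10% of n
rng = np.random.default_rng(2)
x = rng.standard_normal(500)
omegas = np.linspace(0, np.pi, 8)
print(smoothed_periodogram(x, m=len(x) // 10, omegas=omegas))  # roughly 1/(2*pi) for white noise
```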
Generally speaking the choice of a window will have a comparatively minor effect on the spectral density
estimator. The choice of the truncation point, m, is more critical. An optimal window can actually be computed
by minimizing

(7.4.11)    $\frac{1}{n}\sum_{j=-n+1}^{n-1}\Big[I_n(\omega_j) - f(\omega_j)\Big]^2 + \lambda \int_{-\pi}^{\pi}\big[L f(\omega)\big]^2\, d\omega, \qquad \omega_j = \frac{2\pi j}{n},$

where f is chosen from a class of smooth functions which we shall not attempt to define here. The parameter, $\lambda$,
is a bandwidth parameter and L is a differential operator. If L is chosen as $\frac{d^2}{d\omega^2}$, then the solution to (7.4.11) is a
cubic spline. The expression is, needless to say, a rather complex one and will not be treated here. A
fuller discussion may be found in Wegman (1980, 1981).
7.5 Statistical Properties of Smoothed Periodograms. Choice of kernels and truncation points are, of
course, factors which will affect the statistical properties of our proposed kernel smoothed periodograms. We
state some results here which can be used to establish choices of both.
Bias. If both the spectral density, $f(\omega)$, and the lag window, $k(x)$, are twice differentiable, then

(7.5.1)    $\lim_{n, m \to \infty} m^2\big\{E[\hat{f}(\omega) - f(\omega)]\big\} = -k_2\, f''(\omega)$

where $k_2 = -\tfrac{1}{2}k''(0) = \tfrac{1}{2}\int_{-\infty}^{\infty} \omega^2 K(\omega)\, d\omega$. For the Tukey window $k_2 = \pi^2/4 \approx 2.5$. For the Parzen window
$k_2 = 6$, so that the Tukey window has better bias properties. Equation (7.5.1) may be rewritten as $E[\hat{f}(\omega)] \approx
f(\omega) - \frac{k_2}{m^2}\, f''(\omega)$.

Variance. The corresponding expression for the variance can be computed as

(7.5.2)    $\lim_{n\to\infty} \frac{n}{m}\, \mathrm{var}[\hat{f}(\omega)] = \begin{cases} f^2(\omega)\displaystyle\int_{-\infty}^{\infty} k^2(u)\, du, & 0 < |\omega| < \pi \\[1ex] 2 f^2(\omega)\displaystyle\int_{-\infty}^{\infty} k^2(u)\, du, & \omega = 0 \text{ or } \pm\pi. \end{cases}$

Thus

(7.5.3)    $\mathrm{var}[\hat{f}(\omega)] \approx \frac{m}{n}\, f^2(\omega)\int_{-\infty}^{\infty} k^2(u)\, du, \qquad 0 < |\omega| < \pi.$

We want to think of m as a function of n. We write $m_n$. In order to make the smoothed periodogram
asymptotically unbiased, using (7.5.1) it is clear that we must choose $m_n \to \infty$. In order that the variance of
$\hat{f}(\omega)$ converge to 0, it is also clear from (7.5.3) that we must choose $m_n/n \to 0$. Moreover, in order that neither
the bias nor the variance dominate the mean square error, both the bias squared and the variance must converge
to 0 at the same rate. If this happens, then the mean square error will be

(7.5.4)    $\mathrm{MSE} \approx \frac{c_1}{m_n^4} + \frac{c_2\, m_n}{n}$

where the $c_1$-term represents the (squared) bias and the $c_2$-term represents the variance. We want these terms to
converge to 0 at the same rate (be of the same order) so that we may equate them to obtain

(7.5.5)    $\frac{c_1}{m_n^4} = \frac{c_2\, m_n}{n}.$

In this case $m_n^5/n$ must be a constant so that $m_n^5 = c_3\, n$. This implies that $m_n = c\, n^{1/5}$ is the optimal choice of m in
order to have the best rate for mean square error.

7.6 Practical Spectral Estimation. There are 3 choices associated with the estimation of the spectral
density by a smoothed periodogram, i.e. by

(7.6.1)    $\hat{f}(\omega) = \frac{1}{2\pi}\sum_{j=-m}^{m} k\Big(\frac{j}{m}\Big)\, \hat{\gamma}_j\, \cos(j\omega),$

namely the choice of a lag window, k, the choice of a truncation point, m, and the choice of a set of frequencies,
$\omega_j$, at which to evaluate (7.6.1). We comment on each in turn. For a choice of a lag window, the Tukey window
or the Parzen window are usually quite satisfactory. Window choice is not so critical. For the choice of
bandwidth, m, there is a much more difficult choice here as m determines the degree of smoothing. The choice
of $m = c\, n^{1/5}$ is an optimal asymptotic choice, but we usually have a fixed sample size which means that c is
the critical choice. I recommend that you try m = 10% of n to start, but be prepared for some trial and error.
Advanced techniques of cross-validation and other types of error criteria such as the Akaike Information Criterion
(AIC) can help but are really beyond our discussion here.
The choice of frequencies, $\omega_j$, at which to evaluate is another fairly critical choice. Here we want a
choice that is reasonably dense along the frequency axis. The Fourier frequencies $\omega_j = \frac{2\pi j}{n}$ are not a bad starting
point, but are probably too dense for large data sets. Some refinement of that choice with a less fine grid may be
in order. In general, there is no optimal theory. This is a nonparametric technique and so some
experimentation is in order.

Chapter 8
Filtering
8.1 Linear Filters. We now turn our attention to filters. A filter maps one stochastic process into
another. It can conveniently be, and usually is, thought of as a black box which has an input and an output,
respectively $X_t$ as input and $Y_t$ as output. We can use the conceptual black box to model a wide
variety of systems such as an amplifier, a communication channel, the ocean or atmosphere or any number of
models which may transform one signal into another. Indeed, estimating the filter structure is known in the
electrical engineering literature as the system identification problem. From a more mathematical perspective,
let $\mathcal{F}$ be a set of complex-valued functions. Let $\Lambda : \mathcal{F} \to \mathcal{F}$. $\Lambda$ is a functional or an operator in an abstract
mathematical setting. For us $\Lambda(X_t) = Y_t$ is a filter.
The filter, $\Lambda$, is linear if for every $a_1, a_2$ complex numbers, with $X_{1t}, X_{2t} \in \mathcal{F}$ such that
$a_1 X_{1t} + a_2 X_{2t} \in \mathcal{F}$, then $\Lambda(a_1 X_{1t} + a_2 X_{2t}) = a_1\Lambda(X_{1t}) + a_2\Lambda(X_{2t})$. The filter, $\Lambda$, is called time invariant if for
any $X_t \in \mathcal{F}$, whenever $X_h(t) = X(t+h)$ and $Y_h(t) = Y(t+h)$ with $\Lambda(X) = Y$, we have $\Lambda(X_h) = Y_h$.
Examples of Filters. Let $X_t$ be a stochastic process.
Backward Shift Filter: $Y_t = X_{t-1}$, $t \in J$.
Difference Filter: $Y_t = X_t - X_{t-1} = \nabla X_t$, $t \in J$.
Moving Average Filter: $Y_t = \sum_j a_j X_{t-j}$, $t \in J$.
Exponential Filter: $Y_t = \int_{-\infty}^{t} e^{-\beta(t-s)}\, X_s\, ds$, $t \in J$, for some $\beta > 0$.
Differential Filter: $\Lambda(X_t) = X_t' = \frac{dX_t}{dt}$, $t \in J$.
General Linear Filter: $Y_t = \int_{-\infty}^{\infty} w(t - s)\, X_s\, ds$, $t \in J$. If $w(t) = 0$ for $t < 0$, then $t - s < 0$ implies that
$w(t - s) = 0$. Thus $w(t - s) \ne 0$ only when $t - s \ge 0$, that is, when $t \ge s$. Then

$Y_t = \int_{-\infty}^{\infty} w(t-s)\, X_s\, ds = \int_{-\infty}^{t} w(t-s)\, X_s\, ds.$

Since we deal with values of the process in the past with respect to t, we say that the filter is physically
realizable.
General Physically Realizable Filter: $Y_t = \int_{-\infty}^{t} w(t - s)\, X_s\, ds$, $t \in J$.
The function, w(t), is called the impulse response function. We require $\int_{-\infty}^{\infty} |w(t)|\, dt < \infty$.
The domain, J, of $X_t$ and $Y_t$ is, in general, a subset of $(-\infty, \infty)$. This subset may be $(-\infty, \infty)$, $(0, \infty)$, $\{0,
\pm 1, \pm 2, \ldots\}$ for example. The backward shift, the difference and the moving average are examples of
discrete-time filters. The exponential, differential, general linear filters and the general physically realizable
filters are examples of continuous-time filters.
We may define physically realizable filters in the discrete-time case as well. For the filter, $Y_t =
\sum_{j=-\infty}^{\infty} c_j X_{t-j}$, if $c_j = 0$ for $j < 0$, then $Y_t = \sum_{j=0}^{\infty} c_j X_{t-j}$ and $Y_t$ depends only on past values of $X_t$. This filter is
physically realizable also.
In the discrete case we consider $g(\omega) = \sum_{j=-\infty}^{\infty} c_j e^{-i\omega j}$, for $|\omega| \le \pi$. In the continuous case, we let
$g(\omega) = \int_{-\infty}^{\infty} e^{-i\omega t}\, w(t)\, dt$, $\omega \in (-\infty, \infty)$. The function, $g(\omega)$, is called the transfer function of the filter, while
the function, $|g(\omega)|^2$, is often called the
frequency response function of the filter. Notice that the transfer function is the Fourier transform of the
impulse response function.
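To make the discrete-time case concrete, the sketch below applies a simple physically realizable moving-average filter and evaluates its transfer function $g(\omega) = \sum_j c_j e^{-i\omega j}$. This is illustrative only; the particular coefficients and function names are assumptions.

```python
import numpy as np

# An assumed, physically realizable filter with coefficients c_0, c_1, c_2 (zero for j < 0, j > 2)
c = np.array([0.5, 0.3, 0.2])

def apply_filter(x, c):
    """Y_t = sum_{j>=0} c_j X_{t-j}; values before the start of the series are taken as 0."""
    y = np.zeros_like(x, dtype=float)
    for j, cj in enumerate(c):
        y[j:] += cj * x[: len(x) - j]
    return y

def transfer_function(c, omega):
    """g(omega) = sum_j c_j exp(-i omega j)."""
    j = np.arange(len(c))
    return np.sum(c * np.exp(-1j * omega * j))

x = np.random.default_rng(3).standard_normal(10)
print(apply_filter(x, c))
print(abs(transfer_function(c, 0.0)) ** 2)   # frequency response at 0: (0.5 + 0.3 + 0.2)^2 = 1
```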
8.2 Transfer Functions and Spectral Densities. Now consider a stationary process $X_t$ with $E[X_t] = 0$
and spectral density $f_X(\omega)$. The process $X_t$ is possibly complex valued. Let $Y_t = \sum_{j=-\infty}^{\infty} c_j X_{t-j}$ with
$\sum_{j=-\infty}^{\infty} |c_j| < \infty$. Then we have

(8.2.1)    $E[Y_t] = E\Big\{\sum_{j=-\infty}^{\infty} c_j X_{t-j}\Big\} = \sum_{j=-\infty}^{\infty} c_j\, E[X_{t-j}] = 0.$

Thus the mean of a filtered process is 0 if the original process has mean 0. Consider the expression for the
covariance,

(8.2.2)    $E[Y_{t+\tau} Y_t^*] = E\Big\{\sum_{j=-\infty}^{\infty} c_j X_{t+\tau-j} \sum_{k=-\infty}^{\infty} c_k^* X_{t-k}^*\Big\}.$

We may move the expectation operator through the summation signs to obtain

$E[Y_{t+\tau} Y_t^*] = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} c_j c_k^*\, E[X_{t+\tau-j} X_{t-k}^*].$

The expectation, however, is just the covariance of the $X_t$ process so that we have

(8.2.3)    $E[Y_{t+\tau} Y_t^*] = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} c_j c_k^*\, \gamma_{\tau - j + k}$

which doesn't depend on t. Hence $Y_t$ is a stationary stochastic process. This is true in general. A time-invariant
linear filtered version of a stationary process is still stationary. We may continue the development of equation
(8.2.3) by replacing the covariance function by the Fourier transform of the spectral density to obtain

(8.2.4)    $E[Y_{t+\tau} Y_t^*] = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} c_j c_k^* \int_{-\pi}^{\pi} e^{i\omega(\tau - j + k)}\, f_X(\omega)\, d\omega.$
We may reorder the integration and summations

$E[Y_{t+\tau} Y_t^*] = \int_{-\pi}^{\pi} e^{i\omega\tau}\Big[\sum_{j=-\infty}^{\infty} c_j e^{-i\omega j}\Big]\Big[\sum_{k=-\infty}^{\infty} c_k^* e^{+i\omega k}\Big]\, f_X(\omega)\, d\omega$

so that

(8.2.5)    $E[Y_{t+\tau} Y_t^*] = \int_{-\pi}^{\pi} e^{i\omega\tau}\Big|\sum_{j=-\infty}^{\infty} c_j e^{-i\omega j}\Big|^2 f_X(\omega)\, d\omega = \gamma_\tau^Y.$

But by (6.2.5), we know that $\gamma_\tau^Y = \int_{-\pi}^{\pi} e^{i\omega\tau} f_Y(\omega)\, d\omega$. Thus

(8.2.6)    $\int_{-\pi}^{\pi} e^{i\omega\tau}\Big|\sum_{j=-\infty}^{\infty} c_j e^{-i\omega j}\Big|^2 f_X(\omega)\, d\omega = \int_{-\pi}^{\pi} e^{i\omega\tau} f_Y(\omega)\, d\omega.$

By equating integrands in (8.2.6) we have

(8.2.7)    $f_Y(\omega) = f_X(\omega)\, \Big|\sum_{j=-\infty}^{\infty} c_j e^{-i\omega j}\Big|^2$

or equivalently

(8.2.8)    $f_Y(\omega) = f_X(\omega)\, |g(\omega)|^2.$

Thus the spectral density of the Y process may be computed in terms of the spectral density of the X process
and the transfer function of the filter. Frequently we may measure both the X process and the Y process.
Thus by computing the ratio of the spectral densities, we may estimate the frequency response function of the
filter.

The continuous time computation is similar. If $Y_t = \int_{-\infty}^{\infty} w(t - s)\, X_s\, ds$ with $\int_{-\infty}^{\infty} |w(t)|\, dt < \infty$, we have

(8.2.9)    $f_Y(\omega) = f_X(\omega)\, \Big|\int_{-\infty}^{\infty} w(x)\, e^{-i\omega x}\, dx\Big|^2$

so that

(8.2.10)    $f_Y(\omega) = f_X(\omega)\, |g(\omega)|^2.$

Thus we have shown the following theorem.

Transfer Function Theorem. If a zero mean stationary process $X_t$ with spectral density, $f_X(\omega)$, is
passed through a filter with transfer function, $g(\omega)$, then

$f_Y(\omega) = f_X(\omega)\, |g(\omega)|^2.$

The nature of the frequency response function largely determines the properties of the filter. A filter such that

(8.2.11)    $|g(\omega)|^2 = \begin{cases} 1, & \omega_1 \le |\omega| \le \omega_2 \\ 0, & \text{elsewhere} \end{cases}$

is called a band-pass filter. A filter such that

(8.2.12)    $|g(\omega)|^2 = \begin{cases} 1, & |\omega| \le \omega_0 \\ 0, & \text{elsewhere} \end{cases}$

is called a low-pass filter. Finally a filter such that

(8.2.13)    $|g(\omega)|^2 = \begin{cases} 1, & |\omega| \ge \omega_0 \\ 0, & \text{elsewhere} \end{cases}$

is called a high-pass filter.


Recall now that a white noise process (that is, a process which is a sequence of independent and
identically distributed random variables), say $e_t$, has covariance $\gamma_0 = \sigma^2$ and $\gamma_s = 0$ for $s \ne 0$.
Thus the spectral density of a white noise process is

(8.2.14)    $f_e(\omega) = \frac{1}{2\pi}\sum_{s=-\infty}^{\infty} e^{-i\omega s}\, \gamma_s = \frac{\sigma^2}{2\pi}$

where $\sigma^2$ is the common variance of $e_t$. It is not too hard to think of filtering the white noise with a time invariant
linear filter to obtain

(8.2.15)    $X_t = \sum_{j=-\infty}^{\infty} c_j\, e_{t-j}.$

By the transfer function theorem, $X_t$ has the spectral density function

(8.2.16)    $f_X(\omega) = \frac{\sigma^2}{2\pi}\Big|\sum_{j=-\infty}^{\infty} c_j\, e^{-i\omega j}\Big|^2.$

This is a general linear model for a stochastic process known as an infinite moving average. If

(8.2.17)    $X_t = \sum_{j=0}^{q} c_j\, e_{t-j},$

then

(8.2.18)    $f_X(\omega) = \frac{\sigma^2}{2\pi}\Big|\sum_{j=0}^{q} c_j\, e^{-i\omega j}\Big|^2.$

The finite moving average representation of a stationary stochastic process, $X_t$, yields a powerful parametric
representation in the time domain for a time series. Moreover we can see that the spectral density given in
(8.2.18) has a parametric representation with $q + 1$ parameters. This motivates a general discussion of linear,
time-domain models of time series with a finite number of parameters. These are known in the statistical
literature as either Box-Jenkins models or autoregressive-moving average (ARMA) models and in the electrical
engineering literature as linear recursive models. With this motivation, we shall begin our discussion of time domain
modeling.
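Equation (8.2.18) is easy to check by simulation: filter white noise with a short moving average and compare a periodogram-based estimate with $(\sigma^2/2\pi)\,|\sum_j c_j e^{-i\omega j}|^2$. The sketch below is illustrative only; the filter coefficients are arbitrary assumptions and the block averaging is used just to tame the periodogram's variance, in the spirit of Section 7.4.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4096
e = rng.standard_normal(n)                      # white noise, sigma^2 = 1
c = np.array([1.0, 0.5, 0.25])                  # assumed filter coefficients c_0, c_1, c_2
x = np.convolve(e, c, mode="same")              # X_t = sum_j c_j e_{t-j} (edge effects ignored)

# Raw periodogram at the Fourier frequencies: I_n(w_k) = |DFT|^2 / (2*pi*n)
I = np.abs(np.fft.rfft(x)) ** 2 / (2 * np.pi * n)
omega = 2 * np.pi * np.arange(len(I)) / n

# Theoretical spectral density from the transfer function theorem: f_X = (1/2pi)|g(w)|^2
g = np.array([np.sum(c * np.exp(-1j * w * np.arange(len(c)))) for w in omega])
f_theory = np.abs(g) ** 2 / (2 * np.pi)

# Crude smoothing: average over blocks of 64 neighboring frequencies
k = 64
m = (len(I) // k) * k
print(I[:m].reshape(-1, k).mean(axis=1)[:5])
print(f_theory[:m].reshape(-1, k).mean(axis=1)[:5])  # the two rows should be comparable
```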

Chapter 9
Linear Processes
9.1 Structural Models. A structural model describes the relationship among two or more random
variables in contrast to a statistical model which describes the probability structure of one or more random
variables. To be sure, there are very intimate connections which, as we indicated earlier, are captured in the time
series case by the covariance structure of a stationary time series. Nonetheless, in the next chapters we are
interested in describing structural models appropriate to time series. These time domain models are, in essence,
analogs to linear regression models and, in fact, share some rather extensive theoretical connections. They do,
however, also have features which are peculiar to time series analysis.
Let us first consider a discrete time process of the form

$X_t = e_t + \psi_1 e_{t-1} + \psi_2 e_{t-2} + \cdots$

or in slightly more convenient form

(9.1.1)    $X_t = \sum_{j=0}^{\infty} \psi_j\, e_{t-j}, \qquad \psi_0 = 1.$

The $e_t$'s are uncorrelated random variables with zero means and constant variance, $\sigma^2$. As defined earlier, such
a series $\{e_t\}$ is called white noise and the process $\{X_t\}$ is called a general linear process. Notice that the
covariance function for $e_t$ is given by

(9.1.2)    $\gamma_k = \begin{cases} \sigma^2, & k = 0 \\ 0, & k \ne 0 \end{cases}$

and autocorrelation by

(9.1.3)    $\rho_k = \begin{cases} 1, & k = 0 \\ 0, & k \ne 0. \end{cases}$

Equation (9.1.1) may be re-written under certain invertibility conditions, which we describe later, in the alternate
form

$X_t = \pi_1 X_{t-1} + \pi_2 X_{t-2} + \cdots + e_t$

or again in a slightly more compact form

(9.1.4)    $X_t = \sum_{j=1}^{\infty} \pi_j\, X_{t-j} + e_t.$

These infinite summation forms, (9.1.1) and (9.1.4), while elegant, do not yield a convenient calculus for
manipulating time series. It will be convenient to introduce difference equation notation, for this will give us the
tools for manipulating the time domain models more easily. Let B denote the backward shift operator so that
$B X_t = X_{t-1}$ and $B^k X_t = X_{t-k}$. Similarly F is the forward shift operator so that $F X_t = X_{t+1}$ and, of course,
$F^k X_t = X_{t+k}$. Thus we may write $X_t = \big(1 + \sum_{j=1}^{\infty}\psi_j B^j\big)e_t$ as

(9.1.5)    $X_t = \psi(B)\, e_t$

where, of course, the operator $\psi(B)$ is defined by $\big(1 + \sum_{j=1}^{\infty}\psi_j B^j\big)$.
In a similar way, we may write $e_t = \big(1 - \sum_{j=1}^{\infty}\pi_j B^j\big)X_t$ as

(9.1.6)    $e_t = \pi(B)\, X_t.$

Here the operator $\pi(B) = \big(1 - \sum_{j=1}^{\infty}\pi_j B^j\big)$. These operator forms of the infinite series representations of the
time series allow us to develop a calculus. Consider applying $\psi(B)$ to both sides of (9.1.6) to obtain

$\psi(B)\pi(B)\, X_t = \psi(B)\, e_t = X_t$

from which we have the identity

(9.1.7)    $\psi(B)\pi(B) = I.$

Here I is the identity operator so that $I X_t = X_t$. Notice also that $FB\, X_t = BF\, X_t = X_t$, so that $FB = BF
= I$, that is, $F = B^{-1}$. Combining this result with (9.1.7) we may conclude that $\psi^{-1}(B)
= \pi(B)$ or equivalently $\pi^{-1}(B) = \psi(B)$, where inverse means the inverse operator.
9.2 The Autocovariance Generating Function and the Spectral Density Function. With the
introduction of the notion of back shift and forward shift operators, we are now in a position to develop some
simple difference equation theory. Let's first use (9.1.1) to notice that

$E[X_t X_{t+k}] = E\Big\{\sum_{j=0}^{\infty}\sum_{h=0}^{\infty} \psi_j\psi_h\, e_{t-j}\, e_{t+k-h}\Big\}.$

We may interchange the expectation operator with the summation operators to obtain

$E[X_t X_{t+k}] = \sum_{j=0}^{\infty}\sum_{h=0}^{\infty} \psi_j\psi_h\, E[e_{t-j}\, e_{t+k-h}].$

Because the $e_t$ process is white noise, only products with identical subscripts will contribute to the summation.
Thus

(9.2.1)    $E[X_t X_{t+k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j\, \psi_{j+k}.$

We first observe that the right hand side of (9.2.1) does not depend on t. Because $e_t$ is a 0 mean process, $X_t$ is
also a 0 mean process. Thus (9.2.1) describes the covariance function $\gamma_k$ of a stationary process. We may thus
write

(9.2.2)    $\gamma_k = E[X_t X_{t+k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j\, \psi_{j+k}.$

Our only concern is with the convergence of the infinite series represented in (9.2.2). We assume that
$\sum_{j=0}^{\infty} \psi_j^2 < \infty$. Using Schwartz's inequality, we may conclude that $\big|\sum_j \psi_j\psi_{j+k}\big| < \infty$.

We may define $\gamma(s) = \sum_{k=-\infty}^{\infty} \gamma_k\, s^k$ to be the autocovariance generating function. We may substitute the
expression (9.2.2) for $\gamma_k$ in the definition of the autocovariance generating function to obtain

(9.2.3)    $\gamma(s) = \sigma^2 \sum_{k=-\infty}^{\infty}\sum_{j=0}^{\infty} \psi_j\, \psi_{j+k}\, s^k.$

As is the usual routine, we interchange the order of summation

$\gamma(s) = \sigma^2 \sum_{j=0}^{\infty}\sum_{k=-j}^{\infty} \psi_j\, \psi_{j+k}\, s^k$

where it is understood that $\psi_j = 0$, $j < 0$. We may now re-index by letting $j + k = h$ so that $k = h - j$.
With this we have

$\gamma(s) = \sigma^2 \sum_{j=0}^{\infty}\sum_{h=0}^{\infty} \psi_j\, \psi_h\, s^{h-j}.$

We may now separate the sums to obtain

$\gamma(s) = \sigma^2 \Big[\sum_{h=0}^{\infty} \psi_h\, s^h\Big]\Big[\sum_{j=0}^{\infty} \psi_j\, s^{-j}\Big].$

Finally, re-writing in terms of $\psi(\,\cdot\,)$, we have

(9.2.4)    $\gamma(s) = \sigma^2\, \psi(s)\, \psi(s^{-1}).$

As before, $B^{-1}B = I$ and also $FB = I$ so that $F = B^{-1}$. Thus, substituting B for s,

(9.2.5)    $\gamma(B) = \sigma^2\, \psi(B)\, \psi(F).$
 (B) y  (B)(F)
As an example, consider a time series process whose structural equation is X! y ! +!^ . We may re-write this

as X! y (1+B)! . Thus (B) y I ^ B and  (B) y  (I+B)(I+B^ ). Notice here we adopt the convention that
 y 1 is replaced with B y I. We may expand this polynomial in B to obtain  (B) y   (I+B+B^ +BB^ ) or
s


 (B) y  (2 I+B+B^ ). We may use the autocovariance generating function to reconstruct the covariance


function. Reading off the coefficients, we may deduce that  y 2 ,  y ^ y  , and finally  y 0,
| | } 2.
s

The autocovariance generating function may be used in another clever way. To see this application, let
$s = e^{-i\omega}$. Recall by definition, $\gamma(s) = \sum_{k=-\infty}^{\infty}\gamma_k s^k$. Thus when we substitute $e^{-i\omega}$ for s, we obtain the spectral
density function given below as

$\gamma\big(e^{-i\omega}\big) = \sum_{k=-\infty}^{\infty}\gamma_k\, e^{-i\omega k} = 2\pi f(\omega), \qquad -\pi \le \omega \le \pi.$

The covariance generating function is related to the spectral density function by the equation

(9.2.6)    $f(\omega) = \frac{1}{2\pi}\, \gamma\big(e^{-i\omega}\big), \qquad -\pi \le \omega \le \pi.$

In our previous example, $X_t = e_t + e_{t-1}$, $\gamma(s) = \sigma^2(2 + s + s^{-1})$. Thus

$f(\omega) = \frac{\gamma(e^{-i\omega})}{2\pi} = \frac{\sigma^2}{2\pi}\Big[2 + \big(e^{-i\omega} + e^{+i\omega}\big)\Big], \qquad -\pi \le \omega \le \pi.$

We may simplify this to

$f(\omega) = \frac{\sigma^2}{2\pi}\cdot 2\big(1 + \cos(\omega)\big) = \frac{\sigma^2\big(1 + \cos(\omega)\big)}{\pi}, \qquad -\pi \le \omega \le \pi,$

which is a particularly simple form of spectral density. Another informative example is the case of white noise,
$X_t = e_t$. In this case $\gamma(s) = \sigma^2$ so that $f(\omega) = \frac{\sigma^2}{2\pi}$, a constant for $-\pi \le \omega \le \pi$. This gives insight into the use
of the phrase, white noise. As with white light, white noise contains all frequencies with equal amplitude. The
fact that the spectral density function is constant verifies that all frequencies are of equal amplitude.

The general spectral density is computed by first observing that the autocovariance generating function
is given by

$\gamma(s) = \sigma^2\, \psi(s)\, \psi(s^{-1}).$

Substituting $e^{-i\omega}$ for s, we have

(9.2.7)    $f(\omega) = \frac{\sigma^2}{2\pi}\, \psi\big(e^{-i\omega}\big)\, \psi\big(e^{i\omega}\big) = \frac{\sigma^2}{2\pi}\, \big|\psi\big(e^{-i\omega}\big)\big|^2.$

Thus if we know the infinite moving average form of the general linear model, then we may calculate the
spectral density function directly by (9.2.7). This is a very powerful equation as it allows us to move from a
time domain model to a frequency domain model easily. It also has the enormous advantage that if we have a
finite parameter time domain model, we may easily calculate the spectral density parametrically instead of
having to use the nonparametric smoothing window techniques we studied in Chapter 7.
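For instance, (9.2.7) applied to the example $X_t = e_t + e_{t-1}$ should reproduce $f(\omega) = \sigma^2(1 + \cos\omega)/\pi$. A minimal sketch of that check follows; the helper name is an assumption.

```python
import numpy as np

def spectral_density_from_psi(psi, omega, sigma2=1.0):
    """f(w) = (sigma^2/2pi) * |psi(e^{-iw})|^2 with psi(s) = sum_j psi_j s^j, as in (9.2.7)."""
    j = np.arange(len(psi))
    psi_val = np.sum(np.asarray(psi) * np.exp(-1j * omega * j))
    return sigma2 * np.abs(psi_val) ** 2 / (2 * np.pi)

omega = 0.7
print(spectral_density_from_psi([1.0, 1.0], omega))   # psi weights of X_t = e_t + e_{t-1}
print((1 + np.cos(omega)) / np.pi)                    # closed form sigma^2 (1 + cos w)/pi, sigma^2 = 1
```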
9.3 Stationarity and Invertibility. We have already determined that for the infinite moving average
representation, $X_t$ is always stationary provided that $\sum_{j=0}^{\infty}\psi_j^2 < \infty$. Suppose we assume that $\sum_{j=0}^{\infty}|\psi_j| < \infty$. For this
series to be convergent, $\psi_j \to 0$, so that eventually $|\psi_j| < 1$ and hence $|\psi_j|^2 \le |\psi_j|$. This
implies that $\sum_j \psi_j^2 \le \sum_j |\psi_j| < \infty$, so that

$\psi(s) = \sum_{j=0}^{\infty}\psi_j\, s^j$

exists on and within the unit circle, $|s| \le 1$.
If a time series process with $\psi$-weights can be changed to a process with converging $\pi$-weights, it is
said to be invertible. A series may not be invertible because the $\pi$ weights increase instead of decrease.
Consider for example $X_t = (I - \theta B)e_t$. The inverse operator is $(I - \theta B)^{-1} = (I + \theta B + \theta^2 B^2 + \cdots)$ so that

$X_t = e_t - \theta X_{t-1} - \theta^2 X_{t-2} - \cdots.$

Thus $\pi_j = -\theta^j$, so if $|\theta| > 1$, the $\pi$ weights increase. We say $X_t$ is invertible if the generating function of the $\pi$
weights,

$\pi(s) = \sum_{j=0}^{\infty}\pi_j\, s^j,$

exists on and within the unit circle, $|s| \le 1$.
In the special case of the $\pi$-weight representation when we have only a finite number of $\pi$ weights, we
call the process autoregressive and write

(9.3.1)    $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + e_t.$

We denote such a process by AR(p) and p is called the order of the process. We can write $(I - \phi_1 B -
\cdots - \phi_p B^p)X_t = e_t$ or in operator form

(9.3.2)    $\phi(B)\, X_t = e_t$

where, of course, $\phi(B) = (I - \phi_1 B - \cdots - \phi_p B^p)$.
If only a finite number of the $\psi$ weights are non-zero, we call the process a moving average process
and write

(9.3.3)    $X_t = e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \cdots - \theta_q e_{t-q}.$

We denote such a process by MA(q) where now q is called the order of the process. We can write $X_t =
(I - \theta_1 B - \cdots - \theta_q B^q)e_t$ or in operator form

(9.3.4)    $X_t = \theta(B)\, e_t.$

If we have more complicated data, it is sometimes necessary to use a mixed model which is given in operator
form by

(9.3.5)    $\phi(B)\, X_t = \theta(B)\, e_t.$

This is called a mixed autoregressive moving average process of order (p, q) and abbreviated ARMA(p, q).
These three classes of models represent the most general form of finite parameter linear models for the structure
of a time series process and are commonly known as Box-Jenkins models. We will study these time domain
models in more detail. We note in closing this section that autoregressive processes are always invertible by
definition and that moving average processes are always stationary. Autoregressive and ARMA processes may
be stationary, and moving average and ARMA processes may be invertible. Conditions under which they are are
the subject of our continued study.

Chapter 10
Autoregressive Models
10.1 Autoregressive Models. Let us now consider a general p-th order autoregressive model,
$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + e_t$ or $(I - \phi_1 B - \cdots - \phi_p B^p)X_t = \phi(B)X_t = e_t$. Clearly this is an invertible
process, so the question in this case is whether it is stationary. For stationarity, $\phi^{-1}(s)$ must converge for $|s| \le
1$. We know by definition that $\phi(s)$ is a p-th order polynomial. Suppose we write $\phi$ in terms of its factors

(10.1.1)    $\phi(s) = (1 - u_1 s)(1 - u_2 s)\cdots(1 - u_p s).$

We may expand (10.1.1) in terms of partial fractions as follows

(10.1.2)    $\phi^{-1}(s) = \frac{1}{(1 - u_1 s)\cdots(1 - u_p s)} = \frac{k_1}{1 - u_1 s} + \cdots + \frac{k_p}{1 - u_p s}$

so that

$X_t = \phi^{-1}(B)\, e_t = \sum_{i=1}^{p} \frac{k_i}{1 - u_i B}\, e_t.$

Now expand $\frac{1}{1 - u_i B}\, e_t$ as $(1 + u_i B + u_i^2 B^2 + \cdots)e_t$, which is stationary if $|u_i| < 1$. Hence the p-th order
autoregressive process is stationary if $|u_i| < 1$ for $i = 1, 2, \ldots, p$. Setting $\phi(s) = 0$, we find $(1 - u_1 s)\cdots
(1 - u_p s) = 0$. Hence the roots of the polynomial $\phi(s)$ are $s_i = u_i^{-1}$. Thus $|s_i| > 1$ if and only if
$|u_i| < 1$, so that an autoregressive process is stationary if and only if the roots of the equation, $\phi(s) = 0$, lie
outside the unit circle. The equation, $\phi(s) = 0$, is called the auxiliary equation and has much practical
importance in time series analysis.
Let us again consider the autoregressive model, $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + e_t$. We wish to develop an
expression for the autocovariance function. Multiplying through this expression by $X_{t-k}$, we obtain

(10.1.3)    $X_t X_{t-k} = \phi_1 X_{t-1}X_{t-k} + \cdots + \phi_p X_{t-p}X_{t-k} + e_t X_{t-k}.$

Taking expectations, we see that

(10.1.4)    $\gamma_k = \phi_1\gamma_{k-1} + \cdots + \phi_p\gamma_{k-p}, \qquad k > 0.$

Note that $X_{t-k}$ depends only on $e_{t-k}, e_{t-k-1}, \ldots$ and so is uncorrelated with $e_t$. Thus $E[e_t X_{t-k}] = 0$.
Dividing (10.1.4) by $\gamma_0$, we obtain an expression for the autocorrelation function

$\rho_k = \phi_1\rho_{k-1} + \cdots + \phi_p\rho_{k-p}, \qquad k > 0,$

or equivalently

(10.1.5)    $\phi(B)\, \rho_k = 0.$

This latter expression is a homogeneous linear difference equation whose solution is given as a linear
combination of a function of the roots of the auxiliary equation.

In particular if $\phi(s) = (1 - u_1 s)(1 - u_2 s)\cdots(1 - u_p s)$ and all the $u_i$ are distinct, then

(10.1.6)    $\rho_k = A_1 u_1^k + A_2 u_2^k + \cdots + A_p u_p^k$

where k is an integer. Consider, for example, the term $A_1 u_1^k$. Operate on it with the operator $(I - u_1 B)$. Then
$(I - u_1 B)A_1 u_1^k = A_1 u_1^k - u_1 A_1 u_1^{k-1} = 0$. Similarly the term $A_2 u_2^k$ is eliminated by $(I - u_2 B)$. If $u_1$ is a
double root, i.e. if $\phi(s)$ contains $(1 - u_1 s)^2$, then the solution becomes $A_1 u_1^k + A_{p+1}\, k\, u_1^k$. Note that $(I - u_1 B)^2
A_{p+1}\, k\, u_1^k = (I - u_1 B)\big[A_{p+1} k u_1^k - u_1 A_{p+1}(k-1)u_1^{k-1}\big] = (I - u_1 B)\big[A_{p+1}u_1^k\big] = A_{p+1}u_1^k - u_1 A_{p+1}u_1^{k-1} = 0.$
Similarly for higher order multiple roots.
Now returning to $\rho_k = \phi_1\rho_{k-1} + \cdots + \phi_p\rho_{k-p}$, noting the symmetry $\rho_{-k} = \rho_k$, and letting $k =
1, 2, \ldots, p$, we obtain a series of equations

(10.1.7)    $\begin{aligned} \rho_1 &= \phi_1 + \phi_2\rho_1 + \cdots + \phi_p\rho_{p-1} \\ \rho_2 &= \phi_1\rho_1 + \phi_2 + \cdots + \phi_p\rho_{p-2} \\ &\;\;\vdots \\ \rho_p &= \phi_1\rho_{p-1} + \phi_2\rho_{p-2} + \cdots + \phi_p. \end{aligned}$

This set of equations is known as the Yule-Walker equations and they relate the autoregressive coefficients to the
autocorrelations (or equivalently the autocovariances). In matrix form, write $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)'$,
$\boldsymbol{\rho} = (\rho_1, \ldots, \rho_p)'$, and let $P_p$ be the $p \times p$ matrix with ones on the diagonal and $(i, j)$ entry $\rho_{|i-j|}$,

$P_p = \begin{pmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & & \ddots & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{pmatrix},$

so that the Yule-Walker equations become

(10.1.8)    $\boldsymbol{\phi} = P_p^{-1}\, \boldsymbol{\rho}.$
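The Yule-Walker system (10.1.8) is a small linear solve once the autocorrelations are available. The sketch below estimates AR coefficients from given autocorrelations; it is illustrative only, and the function name is an assumption.

```python
import numpy as np

def yule_walker(rho):
    """Solve phi = P_p^{-1} rho, where rho = (rho_1, ..., rho_p) and P_p is the
    Toeplitz matrix of autocorrelations, as in (10.1.8)."""
    rho = np.asarray(rho, dtype=float)
    p = len(rho)
    col = np.concatenate(([1.0], rho[: p - 1]))          # first column of P_p
    P = np.array([[col[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(P, rho)

# Example: an AR(1) with phi = 0.6 has rho_k = 0.6^k, and Yule-Walker recovers phi
print(yule_walker([0.6]))          # [0.6]
print(yule_walker([0.6, 0.36]))    # approximately [0.6, 0.0] for the order-2 fit
```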


We have used the basic autoregressive difference equation to derive the $\phi$ coefficients in terms of the
autocorrelations $\rho_k$. Next we wish to derive the variance, $\sigma_X^2$, in terms of the autocorrelations and the
autoregressive coefficients. Let's consider again $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + e_t$. Multiplying by $X_t$, we obtain

$X_t^2 = \phi_1 X_{t-1}X_t + \cdots + \phi_p X_{t-p}X_t + e_t X_t.$

Noting that $E[e_t X_t] = E[e_t\, \psi(B)e_t] = E[e_t^2] = \sigma^2$, we obtain

(10.1.9)    $\gamma_0 = \phi_1\gamma_1 + \cdots + \phi_p\gamma_p + \sigma^2.$

Now dividing by $\gamma_0 = \sigma_X^2$,

$1 = \phi_1\rho_1 + \cdots + \phi_p\rho_p + \frac{\sigma^2}{\sigma_X^2}$

or

$\sigma_X^2 = \frac{\sigma^2}{1 - \phi_1\rho_1 - \cdots - \phi_p\rho_p}.$

Also noting $\psi(s) = \phi^{-1}(s)$, the spectral density becomes

$f(\omega) = \frac{\sigma^2}{2\pi}\, \big|\psi(e^{-i\omega})\big|^2 = \frac{\sigma^2}{2\pi}\, \frac{1}{\big|1 - \phi_1 e^{-i\omega} - \cdots - \phi_p e^{-i\omega p}\big|^2}.$


10.2 First Order Autoregressive Process (Gauss-Markov Process). The first order autoregressive
process is $X_t = \phi_1 X_{t-1} + e_t = e_t + \phi_1 e_{t-1} + \phi_1^2 e_{t-2} + \cdots$ where $-1 < \phi_1 < 1$ for stationarity. The auxiliary
equation is $\phi(s) = (1 - \phi_1 s) = 0$, so that the difference equation $\rho_k = \phi_1\rho_{k-1}$ has solution $\rho_k = \phi_1^k$, $k \ge 0$. In
particular $\rho_1 = \phi_1 = \phi$ so that often the process is written $X_t = \phi X_{t-1} + e_t$. The autocorrelation function either
decays exponentially ($\phi \ge 0$) or oscillates, but the oscillations decay exponentially. The variance $\sigma_X^2$ is given by

(10.2.1)    $\sigma_X^2 = \frac{\sigma^2}{1 - \phi_1\rho_1} = \frac{\sigma^2}{1 - \phi^2}.$

The spectral density $f(\omega)$ is given by

(10.2.2)    $f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{\big|1 - \phi e^{-i\omega}\big|^2}, \qquad -\pi \le \omega \le \pi.$

This may be expanded as

$f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi^2 - \phi\big(e^{-i\omega} + e^{+i\omega}\big)}, \qquad -\pi \le \omega \le \pi,$

or

$f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi^2 - 2\phi\cos(\omega)}.$
10.3 Second Order Autoregressive Process. The second order autoregressive process is

(10.3.1)    $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + e_t.$

For stationarity, the roots of the auxiliary equation, $\phi(s) = 1 - \phi_1 s - \phi_2 s^2 = 0$, must be outside the unit
circle. A necessary and sufficient condition for this is that i. $\phi_1 + \phi_2 < 1$, ii. $\phi_2 - \phi_1 < 1$, and iii.
$-1 < \phi_2 < 1$. This is a triangular region of admissibility for the coefficients $\phi_1$ and $\phi_2$. **** Assuming
distinct roots, the solution to the difference equation, $\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$, is

(10.3.2)    $\rho_k = A_1 u_1^k + A_2 u_2^k.$

We need some initial conditions to solve for the unknown coefficients, $A_1$ and $A_2$. Recall that $\rho_0 = 1$ and,
from the Yule-Walker equations, we have $\rho_1 = \phi_1 + \phi_2\rho_1$. Hence $\rho_1 = \frac{\phi_1}{1 - \phi_2}$. Now since

$1 - \phi_1 s - \phi_2 s^2 = (1 - u_1 s)(1 - u_2 s) = 1 - (u_1 + u_2)s + u_1 u_2 s^2,$

we can write $\phi_1 = u_1 + u_2$ and $\phi_2 = -u_1 u_2$. Thus

$\rho_1 = \frac{u_1 + u_2}{1 + u_1 u_2}.$

Using this together with the condition that $\rho_0 = 1$, we can write

(10.3.3)    $\rho_0 = A_1 + A_2 = 1 \qquad \text{and} \qquad \rho_1 = A_1 u_1 + A_2 u_2 = \frac{u_1 + u_2}{1 + u_1 u_2}.$

Note the symmetry in $u_1$ and $u_2$, $A_1$ and $A_2$. Now

$A_1 u_1 + (1 - A_1)u_2 = \frac{u_1 + u_2}{1 + u_1 u_2}$

which implies

$A_1(u_1 - u_2) = \frac{u_1 + u_2 - u_2 - u_1 u_2^2}{1 + u_1 u_2}$

so that

(10.3.4)    $A_1 = \frac{u_1(1 - u_2^2)}{(u_1 - u_2)(1 + u_1 u_2)}.$

By symmetry,

(10.3.5)    $A_2 = \frac{-u_2(1 - u_1^2)}{(u_1 - u_2)(1 + u_1 u_2)},$

hence

(10.3.6)    $\rho_k = \frac{(1 - u_2^2)\, u_1^{k+1} - (1 - u_1^2)\, u_2^{k+1}}{(u_1 - u_2)(1 + u_1 u_2)}.$

In this case $\rho_k$ is a mixture of exponentially decaying functions, possibly oscillating if either of $u_1$ or $u_2$ is
negative.
If the roots are real and equal, say $u = u_1 = u_2$, then $\rho_k = A_1 u^k + A_2\, k\, u^k$. Again using the initial conditions,
$\rho_0 = A_1 = 1$ and $\rho_1 = A_1 u + A_2 u = \frac{2u}{1 + u^2}$. Thus

$A_2 + 1 = \frac{2}{1 + u^2}$

so that

(10.3.7)    $A_2 = \frac{1 - u^2}{1 + u^2}.$

Hence

(10.3.8)    $\rho_k = \Big[1 + \frac{(1 - u^2)}{1 + u^2}\, k\Big]\, u^k.$

Thus $\rho_k$ either decays exponentially or is dominated by a decaying exponential.

Finally, if the roots are complex conjugate pairs, we can write $u_1 = |u|e^{i\theta}$ and $u_2 = |u|e^{-i\theta}$. Since
$|u_i| < 1$, we also have $|u| < 1$. Now

$\rho_k = A_1 |u|^k e^{ik\theta} + A_2 |u|^k e^{-ik\theta}.$

Taking complex conjugates,

$\bar{\rho}_k = \rho_k = \bar{A}_1 |u|^k e^{-ik\theta} + \bar{A}_2 |u|^k e^{ik\theta}.$

Since $\rho_k$ is real, we can identify $\bar{A}_1 = A_2$ and $\bar{A}_2 = A_1$. Let $A_1 = \tfrac{1}{2}(a - bi)$ and $A_2 = \tfrac{1}{2}(a + bi)$. Then

$\rho_k = \tfrac{1}{2}(a - bi)\,|u|^k e^{ik\theta} + \tfrac{1}{2}(a + bi)\,|u|^k e^{-ik\theta}.$

Collecting coefficients, we have

(10.3.9)    $\rho_k = |u|^k\Big[a\, \frac{e^{ik\theta} + e^{-ik\theta}}{2} + b\, \frac{e^{ik\theta} - e^{-ik\theta}}{2i}\Big].$

Using our earlier trigonometric identities,

(10.3.10)    $\rho_k = |u|^k\big[a\cos(k\theta) + b\sin(k\theta)\big]$

where a and b can be written in terms of $A_1$. Once again we find $\rho_k$ is dominated by a decaying exponential.
Using the Yule-Walker equations,

(10.3.11)    $\rho_1 = \phi_1 + \phi_2\rho_1 \qquad \text{and} \qquad \rho_2 = \phi_1\rho_1 + \phi_2,$

we may relate the autocorrelations and the autoregressive coefficients. These two equations may be combined to
solve for the $\phi$'s. Since $\phi_1 = \rho_1(1 - \phi_2)$, substituting into the second equation gives $\rho_2 = \rho_1^2(1 - \phi_2) + \phi_2$, so that

(10.3.12)    $\phi_2 = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}$

and

(10.3.13)    $\phi_1 = \frac{\rho_1(1 - \rho_2)}{1 - \rho_1^2}.$

Solving in turn for $\rho_1$ and $\rho_2$,

(10.3.14)    $\rho_1 = \frac{\phi_1}{1 - \phi_2}$

and

(10.3.15)    $\rho_2 = \phi_2 + \frac{\phi_1^2}{1 - \phi_2}.$

The variance $\sigma_X^2$ is given by

$\sigma_X^2 = \frac{\sigma^2}{1 - \phi_1\rho_1 - \phi_2\rho_2}.$

Substituting (10.3.14) and (10.3.15) in this expression,

$\sigma_X^2 = \frac{\sigma^2}{1 - \dfrac{\phi_1^2}{1 - \phi_2} - \phi_2\Big[\phi_2 + \dfrac{\phi_1^2}{1 - \phi_2}\Big]}.$

Simplifying by multiplying top and bottom by $(1 - \phi_2)$,

$\sigma_X^2 = \frac{(1 - \phi_2)\,\sigma^2}{(1 - \phi_2)(1 - \phi_2^2) - \phi_1^2(1 + \phi_2)}.$

Now collecting coefficients,

(10.3.16)    $\sigma_X^2 = \frac{1 - \phi_2}{1 + \phi_2}\cdot \frac{\sigma^2}{(1 - \phi_2)^2 - \phi_1^2}.$

Finally the spectral density is given by

(10.3.17)    $f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{\big|1 - \phi_1 e^{-i\omega} - \phi_2 e^{-2i\omega}\big|^2}.$

Expanding the expression in the denominator, we obtain

$f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi_1^2 + \phi_2^2 + 2\phi_1(\phi_2 - 1)\dfrac{e^{i\omega} + e^{-i\omega}}{2} - 2\phi_2\dfrac{e^{2i\omega} + e^{-2i\omega}}{2}}.$

Using trigonometric identities once again, we obtain the closed form for the second order autoregressive spectral
density

(10.3.18)    $f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi_1^2 + \phi_2^2 - 2\phi_1(1 - \phi_2)\cos(\omega) - 2\phi_2\cos(2\omega)}.$

Thus the theory of autoregressive processes is quite complete with respect to expressing the autoregressive
parameters $\phi_1, \phi_2, \ldots, \phi_p$ and, by extension, $f(\omega)$, in terms of the autocovariances, $\gamma_k$. Since we have the ability
to estimate $\gamma_k$ directly from the data, we have a reasonably satisfactory approach to estimating the
autoregressive structure in general. Two facts are evident in this discussion. First, we are estimating functions
of the moments, that is, functions of the covariances. This is essentially applying the method of moments,
which, although it has a long history, is certainly not guaranteed to give us any statistical optimality. It would
be better to consider least squares or maximum likelihood techniques if available. We shall do this later. Also,
left unsaid was how to determine the order, p, of the process. We address this issue by the use of partial
autocorrelations.
10.4 Partial Autocorrelations. The autocorrelation, $\rho_k$, is a valuable diagnostic tool in fitting time
series models. As we have just seen, $\rho_k$ is a function which either is a decreasing exponential function or at
least is dominated by a decreasing exponential function whenever $X_t$ is an autoregressive process. A second
function which will serve as a diagnostic aid is the partial autocorrelation function of lag k, $\phi_{kk}$. To define this,
let $\phi_{kj}$ be the j-th coefficient of an autoregressive process of order k. Thus, our standard k-th order autoregressive
model becomes

$\rho_j = \phi_{k1}\rho_{j-1} + \phi_{k2}\rho_{j-2} + \cdots + \phi_{kk}\rho_{j-k}, \qquad j = 1, 2, \ldots, k.$

From the Yule-Walker equations, we have

(10.4.1)    $(\phi_{k1}, \ldots, \phi_{kk})' = P_k^{-1}\, \boldsymbol{\rho}_k$

so that $\phi_{kk}$ can be easily picked out. Note $\phi_{11} = \rho_1$. Also worth noting is that if $X_t$ is truly an autoregressive
process of order p, then $\phi_{kk}$ will be non-zero for $k \le p$ and zero thereafter.
The partial autocorrelations may be estimated by fitting autoregressive processes of orders 1, 2, 3, $\ldots$
and picking out the estimates $\hat{\phi}_{11}, \hat{\phi}_{22}, \hat{\phi}_{33}, \ldots$. If $X_t$ is an autoregressive process of order p, then for $k \ge
p+1$, $\hat{\phi}_{kk}$ should approximate 0 and its variance is given by

$\mathrm{var}[\hat{\phi}_{kk}] \approx \frac{1}{n}$

so that the standard error of $\hat{\phi}_{kk}$ is approximated by

(10.4.2)    $\mathrm{SE}[\hat{\phi}_{kk}] \approx \frac{1}{\sqrt{n}}, \qquad k \ge p+1.$

As a diagnostic tool, we can thus plot the $\hat{\phi}_{kk}$ as a function of k. By drawing in the SE limits we can compare
the values of $\hat{\phi}_{kk}$ to the standard error estimates. If $\hat{\phi}_{kk}$ lies well within the two standard errors limit, it will
probably be estimating $\phi_{kk} = 0$ and hence the process is of order less than k.
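The recipe just described — fit AR(1), AR(2), ... by Yule-Walker and keep the last coefficient of each fit — is easy to code. The sketch below is illustrative only; the function name is an assumption, and it takes the theoretical (or estimated) autocorrelations as input.

```python
import numpy as np

def pacf(rho, max_order):
    """Partial autocorrelations phi_kk, k = 1..max_order, via successive Yule-Walker fits.
    `rho` is the sequence rho_0, rho_1, ... with rho_0 = 1."""
    out = []
    for k in range(1, max_order + 1):
        r = np.asarray(rho[1 : k + 1], dtype=float)
        col = np.concatenate(([1.0], r[: k - 1]))
        P = np.array([[col[abs(i - j)] for j in range(k)] for i in range(k)])
        out.append(np.linalg.solve(P, r)[-1])     # keep phi_kk, the last coefficient
    return np.array(out)

# For an AR(1) with phi = 0.6 (rho_k = 0.6^k), phi_11 = 0.6 and phi_kk = 0 for k >= 2
rho = [0.6 ** k for k in range(6)]
print(pacf(rho, 4))
```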

Chapter 11
Moving Average Processes
11.1 Moving Average Processes. We now turn our attention to a moving average process of order q given
by the expression, $X_t = e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}$. Expressing this in terms of a backwards shift operator, we
obtain

(11.1.1)    $X_t = (I - \theta_1 B - \cdots - \theta_q B^q)\, e_t = \theta(B)\, e_t.$

Recall that a moving average process is always stationary, but may not be invertible. However, in its inverted
form we would have $e_t = \theta^{-1}(B)\, X_t$. If we let

$\theta(s) = \prod_{j=1}^{q}(1 - v_j s)$

and expand as we did in the autoregressive case using partial fractions, we obtain

$\pi(s) = \theta^{-1}(s) = \sum_{j=1}^{q} \frac{m_j}{1 - v_j s}.$

As with the stationarity conditions for an autoregressive process, $\frac{1}{1 - v_j B} = 1 + v_j B + v_j^2 B^2 + \cdots$
provided $|v_j| < 1$. Hence $X_t$ will be invertible if $|v_j| < 1$ for each $j = 1, 2, \ldots, q$. Or equivalently, $X_t =
\theta(B)e_t$ is invertible if and only if the roots of the auxiliary equation, $\theta(s) = 0$, lie outside the unit circle.
Just as the autoregressive process was always invertible because $\pi(s) = \phi(s)$ is a finite polynomial for all s, so too is the
moving average process always stationary because $\psi(s) = \theta(s)$ is a finite polynomial for all s. It is worthwhile to note this
duality between moving average and autoregressive processes. It recurs frequently.
To compute the autocorrelation function for an MA(q) process, consider

$E[X_t X_{t-k}] = E\big[(e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q})(e_{t-k} - \theta_1 e_{t-k-1} - \cdots - \theta_q e_{t-k-q})\big].$

Hence, since only terms with similar subscripts on the e's will contribute to the covariance, we have

(11.1.2)    $\gamma_0 = (1 + \theta_1^2 + \cdots + \theta_q^2)\,\sigma^2$

and

(11.1.3)    $\gamma_k = \begin{cases} (-\theta_k + \theta_1\theta_{k+1} + \cdots + \theta_{q-k}\theta_q)\,\sigma^2, & k = 1, 2, \ldots, q \\ 0, & k \ge q+1. \end{cases}$

Thus the autocorrelation is

(11.1.4)    $\rho_k = \begin{cases} \dfrac{-\theta_k + \theta_1\theta_{k+1} + \cdots + \theta_{q-k}\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2}, & k = 1, 2, \ldots, q \\[1ex] 0, & k \ge q + 1. \end{cases}$

Notice for the MA(q) process, the autocorrelations behave like the partial autocorrelations do for the AR(p) in
the sense that they are non-zero up to some cut-off point q and then zero from then on.
Similarly, for the moving average process, the partial autocorrelations will be dominated by an
exponential function. Again the duality between autoregressive and moving average shows up. Comparison of
these two functions, the autocorrelation and the partial autocorrelation, frequently allows us to distinguish
between data which follow autoregressive and moving average processes. Since
$\psi(s) = \theta(s) = 1 - \theta_1 s - \cdots - \theta_q s^q$, the spectral density function is

(11.1.5)    $f(\omega) = \frac{\sigma^2}{2\pi}\,\big|\theta(e^{-i\omega})\big|^2 = \frac{\sigma^2}{2\pi}\,\big|1 - \theta_1 e^{-i\omega} - \cdots - \theta_q e^{-i\omega q}\big|^2.$

11.2 The First Order Moving Average Process. It is instructive to consider several simple cases of the
moving average process. We begin with the first order moving average, $X_t = e_t - \theta e_{t-1}$. In the operator form
we have,

(11.2.1)    $X_t = (I - \theta B)\, e_t = \theta(B)\, e_t.$

For invertibility, $-1 < \theta < 1$ since $\frac{1}{\theta}$ is the root of the auxiliary equation $\theta(s) = 0$. The variance is given
by

(11.2.2)    $\gamma_0 = (1 + \theta^2)\,\sigma^2$

and the autocorrelation function is

(11.2.3)    $\rho_k = \begin{cases} \dfrac{-\theta}{1 + \theta^2}, & k = 1 \\[1ex] 0, & k \ge 2. \end{cases}$

Of course, as always, $\rho_0 = 1$. We note from this equation, $\rho_1 = (-\theta)/(1 + \theta^2)$, that $\theta$ and $\rho_1$ are related in a
non-linear fashion. In fact if $\theta$ is a solution then so is $\frac{1}{\theta}$ since

$\frac{-\tfrac{1}{\theta}}{1 + \tfrac{1}{\theta^2}} = \frac{-\theta}{\theta^2 + 1} = \frac{-\theta}{1 + \theta^2}.$

The spectral density function for the first order moving average process is

(11.2.4)    $f(\omega) = \frac{\sigma^2}{2\pi}\,\big|1 - \theta e^{-i\omega}\big|^2.$

Expanding the square modulus,

$f(\omega) = \frac{\sigma^2}{2\pi}\Big\{1 + \theta^2 - \theta\big(e^{i\omega} + e^{-i\omega}\big)\Big\}$

which reduces to

(11.2.5)    $f(\omega) = \frac{\sigma^2}{2\pi}\big\{1 + \theta^2 - 2\theta\cos(\omega)\big\}.$

We can calculate the partial autocorrelations from the Yule-Walker equations and our expressions for
$\rho_k$. In particular, for the MA(1) the matrix $P_k$ has ones on the diagonal, $\rho_1 = -\theta/(1 + \theta^2)$ on the first
off-diagonals, and zeros elsewhere, while the right-hand side is $(\rho_1, 0, \ldots, 0)'$, so that

$P_k\,(\phi_{k1}, \ldots, \phi_{kk})' = \Big(\frac{-\theta}{1 + \theta^2},\, 0,\, \ldots,\, 0\Big)'.$

Solving this for $\phi_{kk}$ yields

(11.2.6)    $\phi_{kk} = \frac{-\theta^k(1 - \theta^2)}{1 - \theta^{2(k+1)}}$

which is dominated by a decreasing exponential. Notice that the partial autocorrelation for the moving
average process has the damped exponential property that the autocorrelation of the autoregressive process has,
and simultaneously the autocorrelation of the moving average process has the cutoff property that the partial
autocorrelation of the autoregressive process has.
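Two facts noted above are easy to verify numerically: $\theta$ and $1/\theta$ produce the same $\rho_1$, and (11.2.6) reduces to $\rho_1$ itself at k = 1. A minimal, illustrative sketch:

```python
import numpy as np

theta = 0.5
rho1 = lambda th: -th / (1 + th**2)
print(rho1(theta), rho1(1 / theta))        # identical: theta and 1/theta give the same rho_1

# (11.2.6) at k = 1: -theta (1 - theta^2) / (1 - theta^4) = -theta / (1 + theta^2) = rho_1
k = 1
print(-theta**k * (1 - theta**2) / (1 - theta ** (2 * (k + 1))))
```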
11.3 The Second Order Moving Average Process. The second order moving average process is
$X_t = e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2}$. Of course it is automatically stationary. In analogy with the AR(2) process, it will
be invertible if and only if the roots of the equation, $1 - \theta_1 s - \theta_2 s^2 = 0$, lie outside the unit circle.
Equivalently, again in analogy to the stationarity conditions for the AR(2) case, if and only if i. $\theta_1 + \theta_2 < 1$, ii.
$\theta_2 - \theta_1 < 1$ and iii. $-1 < \theta_2 < 1$. The variance of the second order moving average process is

(11.3.1)    $\gamma_0 = \sigma^2\,(1 + \theta_1^2 + \theta_2^2)$

and the autocorrelation is

(11.3.2)    $\rho_1 = \frac{-\theta_1(1 - \theta_2)}{1 + \theta_1^2 + \theta_2^2}, \qquad \rho_2 = \frac{-\theta_2}{1 + \theta_1^2 + \theta_2^2},$

and

$\rho_k = 0, \qquad k \ge 3.$

The spectral density is

(11.3.3)    $f(\omega) = \frac{\sigma^2}{2\pi}\,\big|1 - \theta_1 e^{-i\omega} - \theta_2 e^{-2i\omega}\big|^2$

which may be written in real terms as

(11.3.4)    $f(\omega) = \frac{\sigma^2}{2\pi}\big\{1 + \theta_1^2 + \theta_2^2 - 2\theta_1(1 - \theta_2)\cos(\omega) - 2\theta_2\cos(2\omega)\big\}, \qquad -\pi \le \omega \le \pi.$

Again it is clear that the basic theory for moving average time series is in hand. The major difficulty is that, in
general, the relationship between the moving average coefficients, $\theta_j$, and the covariance function, $\gamma_k$, is
nonlinear, whereas the relationship between the autoregressive coefficients, $\phi_j$, and the covariance function, $\gamma_k$,
is given in a linear fashion by the Yule-Walker equations. For this reason, as well as their analogy to differential
equations modeling physical systems, the autoregressive models seem to be somewhat preferred. Nonetheless,
the autoregressive models may not be a very economical representation. The most general finite parameter
representation of a linear process is given by the mixed autoregressive-moving average models, which we study
in the next section.
11.4 Mixed Processes-Autoregressive-Moving Average Processes In order to achieve a representation
of a time series data set with a finite number of parameters, we can consider a mixed model

(11.4.1)    X_t = φ_1 X_{t-1} + ... + φ_p X_{t-p} + ε_t - θ_1 ε_{t-1} - ... - θ_q ε_{t-q}

or, rewriting in operator form,

(11.4.2)    φ(B)X_t = θ(B)ε_t.

We refer to this as an ARMA (p, q) process. Noting that

    X_t = φ^{-1}(B) θ(B) ε_t,

we infer that the process will be stationary if φ^{-1}(B) exists, that is, if the roots of the auxiliary equation,
φ(s) = 0, lie outside the unit circle. By the same token,

    θ^{-1}(B) φ(B) X_t = ε_t,

so that the process is invertible if θ^{-1}(B) exists, that is, if the roots of the auxiliary equation, θ(s) = 0, lie
outside the unit circle.
Next, we multiply (11.4.1) by X_{t-k} to obtain

(11.4.3)    X_t X_{t-k} = φ_1 X_{t-1} X_{t-k} + ... + φ_p X_{t-p} X_{t-k}
                          + ε_t X_{t-k} - θ_1 ε_{t-1} X_{t-k} - ... - θ_q ε_{t-q} X_{t-k}.

Notice that E[ε_{t-j} X_{t-k}] = 0 provided t - j > t - k, that is, provided j < k. Write E[ε_{t-j} X_{t-k}] = γ_{εx}(k - j), so
that

    γ_{εx}(k) = 0,  k > 0,    and    γ_{εx}(k) ≠ 0,  k ≤ 0.

Taking expectations in (11.4.3) we obtain

(11.4.4)    γ_k = φ_1 γ_{k-1} + ... + φ_p γ_{k-p} + γ_{εx}(k) - θ_1 γ_{εx}(k - 1) - ... - θ_q γ_{εx}(k - q).

Hence if k - q > 0, that is, k > q,

(11.4.5)    γ_k = φ_1 γ_{k-1} + ... + φ_p γ_{k-p},

and so

(11.4.6)    ρ_k = φ_1 ρ_{k-1} + ... + φ_p ρ_{k-p},  k > q.

We may write this in operator form as

(11.4.7)    φ(B)ρ_k = 0,  k > q.

Thus there are q autocorrelations, ρ_1, ..., ρ_q, which depend on θ_1, ..., θ_q as well as φ_1, ..., φ_p. The remaining
autocorrelations satisfy the usual difference equation, φ(B)ρ_k = 0, and, hence, will be exponentially decaying
functions or at least dominated by an exponentially decaying function. When k = 0,

    γ_0 = φ_1 γ_1 + ... + φ_p γ_p + σ² - θ_1 γ_{εx}(-1) - ... - θ_q γ_{εx}(-q).

Hence to solve for γ_0 we must solve simultaneously with the p equations (11.4.4) for k = 1, 2, ..., p.
Using the covariance generating function, the spectral density is

(11.4.8)    f(λ) = (σ²/2π) |θ(e^{-iλ})|² / |φ(e^{-iλ})|²,  -π ≤ λ ≤ π.

Because φ(s) and θ(s) are polynomials, this is sometimes referred to as a rational spectral density (rational in
polynomials). It is easy to see that rational spectral densities can approximate a very large class of functions, so
that the ARMA model is, indeed, a very powerful one. We may expand these polynomials to

(11.4.9)    f(λ) = (σ²/2π) |1 - θ_1 e^{-iλ} - ... - θ_q e^{-iqλ}|² / |1 - φ_1 e^{-iλ} - ... - φ_p e^{-ipλ}|².

Finally, since ε_t = θ^{-1}(B)φ(B)X_t and θ^{-1}(s) is an infinite series, the partial autocorrelation is infinite in extent
and eventually behaves like the partial autocorrelation of a pure moving average process. It is perhaps worth
examining the ARMA (1, 1) in a bit more detail.
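
To make the rational form (11.4.9) concrete, here is a minimal sketch that evaluates it for arbitrary coefficient vectors; the particular ARMA(2, 1) values below are assumptions chosen for illustration only, not parameters discussed in the text.

    import numpy as np

    def arma_spectral_density(phi, theta, sigma2, lam):
        """Rational spectral density (11.4.9) of an ARMA(p, q) process."""
        phi, theta = np.asarray(phi, float), np.asarray(theta, float)
        zq = np.exp(-1j * np.outer(lam, np.arange(1, len(theta) + 1)))
        zp = np.exp(-1j * np.outer(lam, np.arange(1, len(phi) + 1)))
        num = np.abs(1 - zq @ theta) ** 2         # |theta(e^{-i lam})|^2
        den = np.abs(1 - zp @ phi) ** 2           # |phi(e^{-i lam})|^2
        return sigma2 / (2 * np.pi) * num / den

    lam = np.linspace(-np.pi, np.pi, 501)
    f = arma_spectral_density(phi=[0.75, -0.5], theta=[0.4], sigma2=1.0, lam=lam)
    print(bool(np.all(f >= 0)), round(float(f[len(lam) // 2]), 4))   # nonnegative; value at lam = 0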

11.5 The ARMA (1, 1) Process To understand the generality of the ARMA (p, q) process, it is
worthwhile examining the ARMA (1, 1) process. This is particularly useful since the parameter estimation
problem for the ARMA (p, q) is in general quite messy and the ARMA (1, 1) illustrates the principles a bit more
clearly. The ARMA (1, 1) is given by

(11.5.1)    X_t - φ X_{t-1} = ε_t - θ ε_{t-1}.

The process is stationary if -1 < φ < 1 and invertible if -1 < θ < 1.

The autocovariance is given by

(11.5.2)    γ_0 = φ γ_1 + σ² - θ γ_{εx}(-1),
            γ_1 = φ γ_0 - θ σ²,
            γ_k = φ γ_{k-1},  k ≥ 2.

Also note

    X_t ε_{t-1} = φ X_{t-1} ε_{t-1} + ε_t ε_{t-1} - θ ε_{t-1}²,

so that taking expectations we obtain

(11.5.3)    γ_{εx}(-1) = φ σ² - θ σ² = (φ - θ) σ².

Thus, we may substitute the expression in (11.5.3) for γ_{εx}(-1) into (11.5.2) to obtain

(11.5.4)    γ_0 = φ γ_1 + σ² - θ (φ - θ) σ²,

so that using simple algebra

(11.5.5)    γ_0 - φ γ_1 = σ² (1 - θφ + θ²).

Also from (11.5.2) we have

(11.5.6)    φ γ_1 - φ² γ_0 = -φ θ σ².

Eliminating the γ_1 term by adding (11.5.5) to (11.5.6),

    γ_0 (1 - φ²) = σ² (1 - θφ + θ² - φθ),

so that solving for γ_0 we obtain

(11.5.7)    γ_0 = σ² (1 + θ² - 2φθ)/(1 - φ²).

By a simple substitution,

(11.5.8)    γ_1 = σ² (1 - φθ)(φ - θ)/(1 - φ²).

By division of these relationships,

    ρ_1 = (1 - φθ)(φ - θ)/(1 + θ² - 2φθ)

and

    ρ_k = φ ρ_{k-1},  k ≥ 2.
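
The following sketch (NumPy assumed; φ = 0.7 and θ = 0.4 are illustrative values, not values from the text) evaluates (11.5.7) and (11.5.8) and compares them with the sample moments of a long simulated ARMA(1, 1) path, which is one simple way to sanity-check the algebra.

    import numpy as np

    phi, theta, sigma2 = 0.7, 0.4, 1.0      # illustrative parameter values (assumptions)

    # closed forms (11.5.7) and (11.5.8), plus rho_1
    gamma0 = sigma2 * (1 + theta**2 - 2 * phi * theta) / (1 - phi**2)
    gamma1 = sigma2 * (1 - phi * theta) * (phi - theta) / (1 - phi**2)
    rho1 = gamma1 / gamma0

    # Monte Carlo check on a simulated path of the same process
    rng = np.random.default_rng(0)
    n = 100_000
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t] - theta * eps[t - 1]

    print(round(gamma0, 4), round(float(np.var(x)), 4))                        # should be close
    print(round(rho1, 4), round(float(np.corrcoef(x[1:], x[:-1])[0, 1]), 4))   # should be close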

The above discussions allow us to estimate the parameters of an autoregressive process, a moving
average process or a mixed process once we have settled on a model. The model building process is not a trivial
one, and fundamental questions as to model order and type must be addressed. We have suggested the use of the
autocorrelation function and the partial autocorrelation function as tools to this end. The following table
summarizes the behavior of the different linear models.

                              Autoregressive               Moving Average               Mixed (ARMA)

Models in terms               φ(B)X_t = ε_t                θ^{-1}(B)X_t = ε_t           θ^{-1}(B)φ(B)X_t = ε_t
of X's

Models in terms               X_t = φ^{-1}(B)ε_t           X_t = θ(B)ε_t                X_t = φ^{-1}(B)θ(B)ε_t
of ε's

π Weights                     Finite Series                Infinite Series              Infinite Series

ψ Weights                     Infinite Series              Finite Series                Infinite Series

Stationarity                  Roots of φ(s) = 0 lie        Always Stationary            Roots of φ(s) = 0 lie
Conditions                    Outside Unit Circle                                       Outside Unit Circle

Invertibility                 Always Invertible            Roots of θ(s) = 0 lie        Roots of θ(s) = 0 lie
Conditions                                                 Outside Unit Circle          Outside Unit Circle

Autocorrelation               Infinite, Tails Off          Finite, Cuts Off             Infinite, Tails Off
Function

Partial                       Finite, Cuts Off             Infinite, Tails Off          Infinite, Tails Off
Autocorrelation
Function

Chapter 12
Nonstationary Processes and Time Series
12.1 Nonstationary Time Series Models In the previous chapter, we introduced the ARMA model

(12.1.1)    φ(B)X_t = θ(B)ε_t

with φ(s) and θ(s) polynomials in s of degree p and q respectively. We assumed that φ(s) = 0 had roots outside
the unit circle so that the resulting process would be stationary. Suppose now we allow the possibility of roots
on or inside the unit circle. For simplicity consider the AR(1) model

    (1 - φB)X_t = ε_t.

If φ is taken to be two, so that the root is 1/2 and lies inside the unit circle, then, for example, the series increases
or decreases exponentially essentially with no effect from the random input, ε_t. We will often refer to the ε_t as
random shocks or just shocks. Similarly, in any model for which one or more of the roots of φ(s) = 0 lie strictly
within the unit circle, the process will behave in this exponential manner essentially independently of the ε_t's. In
this circumstance the process is essentially deterministic and there is little need for a statistical analysis. We
consider then only those cases where the roots lie on the unit circle.
In particular, we now entertain the model

(12.1.2)    φ(B)(I - B)^d X_t = θ(B)ε_t.

Here φ(B) is an ordinary stationary autoregressive operator. We write I - B = ∇, the difference operator. Thus
our model can be written φ(B)∇^d X_t = θ(B)ε_t or φ(B)Y_t = θ(B)ε_t with Y_t = ∇^d X_t. After proper differencing
of the process X_t we have the usual ARMA (p, q) model. Now X_t = S^d Y_t, where S is the infinite summation
operator defined by

(12.1.3)    S X_t = Σ_{j=0}^∞ X_{t-j} = (1 + B + B² + B³ + ...) X_t,

which may be written

(12.1.4)    S X_t = (1 - B)^{-1} X_t = ∇^{-1} X_t.

Thus the summation operator is the inverse of the difference operator. Similarly

    S² X_t = S X_t + S X_{t-1} + S X_{t-2} + ... = Σ_{j=0}^∞ Σ_{i=0}^∞ X_{t-j-i}

and

    S³ X_t = Σ_{k=0}^∞ Σ_{j=0}^∞ Σ_{i=0}^∞ X_{t-k-j-i}.

Differencing is very much analogous to differentiation, while summing is analogous to integration. For this
reason the model (12.1.2) is called the autoregressive integrated moving average process of order (p, d, q) and is
abbreviated ARIMA(p, d, q).
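
Since the difference operator ∇ and the summation operator S play a central role in what follows, a tiny sketch may help fix ideas. It assumes only NumPy and an arbitrary simulated series, and simply checks that cumulative summation undoes differencing up to the initial value.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=10).cumsum() + 5.0        # an arbitrary, nonstationary-looking path

    y = np.diff(x)                                # y_t = (I - B) x_t = x_t - x_{t-1}

    # applying the summation operator to y recovers x, apart from the starting value x_0
    x_recovered = x[0] + np.concatenate(([0.0], np.cumsum(y)))
    print(bool(np.allclose(x, x_recovered)))      # True: S acts as the inverse of the difference

    y2 = np.diff(x, n=2)                          # second difference, as in an ARIMA(p, 2, q)
    print(len(x), len(y), len(y2))                # each difference shortens the series by one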

One important feature of the nonstationary processes of the type we seek to analyze with the ARIMA
model is that the local behavior must be independent of the level of the process. That is, we want

(12.1.5)    φ(B)(X_t + c) = φ(B)X_t.

Note that by adding the ∇ operator in (12.1.5), we obtain

    φ(B)∇(X_t + c) = φ(B)[X_t + c - X_{t-1} - c] = φ(B)∇X_t,

which is independent of level. Thus, for a process which is homogeneous except in level, the model

(12.1.6)    φ(B)∇X_t = θ(B)ε_t

is appropriate. If, in contrast, a series has neither fixed level nor fixed slope, we require

    φ(B)(X_t + mt + c) = φ(B)X_t.

By introducing ∇² into this equation, we have

    φ(B)∇²(X_t + mt + c) = φ(B)∇²X_t.

In this case an ARIMA (p, 2, q) is appropriate.
It is sometimes useful to consider a slight extension of the ARIMA process,

(12.1.7)    φ(B)∇^d X_t = θ_0 + θ(B)ε_t.

The operator φ(B) is the autoregressive operator, and it is assumed that its roots lie outside the unit circle. The
operator φ(B)∇^d is the generalized autoregressive operator, and finally θ(B) is the moving average operator,
and it is assumed that its roots also lie outside the unit circle. If we wish to include a deterministic
function of time as a trend, this can be done by choosing θ_0 non-zero. For example, if d = 1, we can estimate a
deterministic linear trend in the presence of nonstationary noise.
12.2 Some Important Special Cases We consider the following special cases in some detail: i. ARIMA
(0, 1, 1), ∇X_t = ε_t - θ ε_{t-1} = (I - θB)ε_t; ii. ARIMA (0, 2, 2), ∇²X_t = ε_t - θ_1 ε_{t-1} - θ_2 ε_{t-2} =
(I - θ_1 B - θ_2 B²)ε_t; and iii. ARIMA (1, 1, 1), ∇X_t - φ ∇X_{t-1} = ε_t - θ ε_{t-1} or
(I - φB)∇X_t = (I - θB)ε_t. Let us now write, for the generalized autoregressive operator,

(12.2.1)    Φ(B) = φ(B)∇^d.

We will write the model in three forms.
Let Φ(B) = I - Φ_1 B - Φ_2 B² - ... - Φ_{p+d} B^{p+d}, so that the general model (θ_0 = 0) can be written

(12.2.2)    X_t = Φ_1 X_{t-1} + ... + Φ_{p+d} X_{t-p-d} - θ_1 ε_{t-1} - ... - θ_q ε_{t-q} + ε_t.

For example, the ARIMA (1, 1, 1) is (1 - φB)(1 - B)X_t = (1 - θB)ε_t, where we drop the suffix 1 on φ and θ.
This ARIMA (1, 1, 1) model may be expanded as

    {I - (1 + φ)B + φB²}X_t = (I - θB)ε_t

or

    X_t = (1 + φ)X_{t-1} - φX_{t-2} + ε_t - θ ε_{t-1}.
It is sometimes convenient to represent the ARIMA models as an infinite moving average, that is, with ψ
weights. For nonstationary time series models, the ψ-weight representation does not converge. Thus, strictly
speaking, the following discussion is only a formal computation. It may be used to suggest relationships among
the ψ-weights, the π-weights, the φ-weights and the θ-weights, but because we are manipulating nonconvergent
series, these relationships must be verified by more rigorous calculations. We may formally write
X_t in the infinite moving average form

    X_t = ε_t + ψ_1 ε_{t-1} + ψ_2 ε_{t-2} + ...,

so that

    X_t = ψ(B)ε_t.

Operating with Φ(B) on both sides yields

    Φ(B)X_t = Φ(B)ψ(B)ε_t.

But since

    Φ(B)X_t = θ(B)ε_t,

we can identify θ(B) = Φ(B)ψ(B). Thus,

    (I - Φ_1 B - ... - Φ_{p+d} B^{p+d})(I + ψ_1 B + ψ_2 B² + ...) = (I - θ_1 B - ... - θ_q B^q).

Notice that if j is sufficiently large, namely j ≥ max(p + d - 1, q), then the ψ weights must satisfy

    Φ(B)ψ_j = 0.

Thus for j sufficiently large, the ψ_j satisfy the same homogeneous difference equation as the covariance
function, γ_j, and, hence, can be written as a mixture of polynomials, exponentials and sinusoids in the
argument j.
Consider for example the ARIMA (1, 1, 1). Here

    Φ(B) = I - (1 + φ)B + φB²,

so that

    {I - (1 + φ)B + φB²}(1 + ψ_1 B + ψ_2 B² + ...) = I - θB.

Thus

    ψ_1 - (1 + φ) = -θ

and, for j ≥ 2,

    ψ_j - (1 + φ)ψ_{j-1} + φψ_{j-2} = 0.

This is a second order difference equation with roots 1 and φ. Thus ψ_j = A_0 + A_1 φ^j, j ≥ 1. Solving this
simultaneously with ψ_1 - (1 + φ) = -θ yields

    A_0 = (1 - θ)/(1 - φ)    and    A_1 = (θ - φ)/(1 - φ).

Thus the ARIMA (1, 1, 1) can be written

    X_t = Σ_{j=0}^∞ (A_0 + A_1 φ^j) ε_{t-j}

with the above values for A_0 and A_1. Notice that for A_0 ≠ 0, this is not a convergent series and so, in some
sense, this infinite series representation is nonsense. Even so, it will have some utility in establishing forecasting
procedures for the nonstationary time series models.
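
A brief numerical check of this example is easy to carry out. The sketch below (φ = 0.5 and θ = 0.3 are assumed, illustrative values; NumPy assumed) runs the ψ-weight recursion and compares it with the closed form A_0 + A_1 φ^j.

    import numpy as np

    phi, theta = 0.5, 0.3    # illustrative ARIMA(1, 1, 1) coefficients (assumed values)

    # recursion: psi_0 = 1, psi_1 = 1 + phi - theta,
    # and psi_j = (1 + phi) psi_{j-1} - phi psi_{j-2} for j >= 2
    psi = [1.0, 1.0 + phi - theta]
    for j in range(2, 10):
        psi.append((1 + phi) * psi[-1] - phi * psi[-2])

    A0 = (1 - theta) / (1 - phi)
    A1 = (theta - phi) / (1 - phi)
    closed = [A0 + A1 * phi**j for j in range(1, 10)]

    print(bool(np.allclose(psi[1:], closed)))     # True
    print(np.round(psi, 4))                       # the weights approach A0; they do not die out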
Finally we can write the ARIMA time series model in its inverted form, π(B)X_t = ε_t, or

    X_t = π_1 X_{t-1} + π_2 X_{t-2} + ... + ε_t.

As before, operating on both sides with θ(B),

    θ(B)π(B)X_t = θ(B)ε_t = Φ(B)X_t.

Thus we have θ(B)π(B) = Φ(B), or

    (I - Φ_1 B - ... - Φ_{p+d} B^{p+d}) = (I - θ_1 B - ... - θ_q B^q)(I - π_1 B - π_2 B² - ...).

For j > max(p + d, q), θ(B)π_j = 0. We again return to our ARIMA (1, 1, 1) as an example. We have

(12.2.3)    θ(B)π(B) = Φ(B)

or

    π(B) = θ^{-1}(B)Φ(B) = {I - (1 + φ)B + φB²}(I + θB + θ²B² + ...),

so that by equating corresponding coefficients

    -π_1 = θ - (1 + φ),
    -π_2 = φ + θ² - (1 + φ)θ,

and

    -π_j = φθ^{j-2} + θ^j - (1 + φ)θ^{j-1},  j ≥ 3.

Rewriting and solving for the π-weights,

    π_1 = 1 + φ - θ,
    π_2 = (θ - φ)(1 - θ),
    π_j = (θ - φ)(1 - θ)θ^{j-2},  j ≥ 3.

Let us also observe that if d is greater than 0, since Φ(s) = φ(s)(1 - s)^d, we may conclude that Φ(1) = 0.
Substituting in (12.2.3),

    0 = Φ(1) = θ(1)π(1).

Since θ(s) = 0 has roots outside the unit circle, θ(1) ≠ 0. Thus π(1) = 0, so that

    1 - Σ_{j=1}^∞ π_j = 0,

or equivalently

(12.2.4)    Σ_{j=1}^∞ π_j = 1.

Thus if d > 0, the π-weights must sum to 1. They are not necessarily positive, so that they may not form a
probability distribution.
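
The sum-to-one property is easy to verify numerically. The following sketch (again with assumed values φ = 0.5 and θ = 0.3) builds the π-weights of the ARIMA(1, 1, 1) from the formulas above and sums a long truncation of them.

    import numpy as np

    phi, theta = 0.5, 0.3    # illustrative coefficients (assumed); |theta| < 1 for invertibility

    pi = [1.0 + phi - theta]                                          # pi_1
    pi += [(theta - phi) * (1 - theta) * theta**(j - 2) for j in range(2, 200)]

    print(bool(np.isclose(sum(pi), 1.0)))     # True: with d = 1 the pi-weights sum to 1
    print(np.round(pi[:5], 4))                # note they need not all be positive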
12.3 Integrated Moving Average Process Models In this section we wish to consider integrated moving
average processes (IMA) of the form ∇^d X_t = θ(B)ε_t. Two models of particular interest are the ARIMA (0, 1,
1) = IMA (1, 1), X_t = X_{t-1} + ε_t - θ ε_{t-1}, and the ARIMA (0, 2, 2) = IMA (2, 2), X_t =
2X_{t-1} - X_{t-2} + ε_t - θ_1 ε_{t-1} - θ_2 ε_{t-2}.
First the IMA (1, 1). Let us note that we may write

    I - θB = (1 - θ)B + (I - B) = (1 - θ)B + ∇ = λB + ∇,

with

    λ = 1 - θ.

We can thus reparametrize the process as

    ∇X_t = λ ε_{t-1} + ∇ε_t.

Again the following computation is purely formal. Applying ∇^{-1} = S to both sides,

    X_t = λ S ε_{t-1} + ε_t.

But ∇^{-1} = (1 - B)^{-1} = 1 + B + B² + ..., so that

    X_t = λ Σ_{j=1}^∞ ε_{t-j} + ε_t.

The ψ-weights are ψ_0 = 1 and ψ_j = λ = 1 - θ, j ≥ 1. To consider the inverted form of the model, write

    X_t = Σ_{j=1}^∞ π_j X_{t-j} + ε_t = X̄_{t-1}(λ) + ε_t,

where we define X̄_{t-1}(λ) to be Σ_{j=1}^∞ π_j X_{t-j}. We notice

    (I - θB)π(B) = I - B.

Equivalently,

    π(s) = (1 - s)/(1 - θs).

We may simplify this to obtain

    π(s) = {1 - θs - (1 - θ)s}/(1 - θs) = 1 - (1 - θ)s/(1 - θs).

Expanding the denominator as an infinite series,

    π(s) = 1 - (1 - θ)s(1 + θs + θ²s² + ...)

or

    π(s) = 1 - (1 - θ)(s + θs² + θ²s³ + ...).

But π_j is the negative of the coefficient of s^j, so that

    π_j = (1 - θ)θ^{j-1} = λθ^{j-1},  j ≥ 1,

except for the coefficient of s^0, which is 1. Thus X_t is an exponentially weighted moving average (EWMA) of previous values.
This is sometimes also called an exponential smoother or an exponential filter. For this case, we can write

    X̄_{t-1}(λ) = Σ_{j=1}^∞ λ(1 - λ)^{j-1} X_{t-j},

so that, factoring out the first term,

    X̄_{t-1}(λ) = λX_{t-1} + Σ_{j=2}^∞ λ(1 - λ)^{j-1} X_{t-j}.

We can make a change of index by letting k = j - 1 so that

    X̄_{t-1}(λ) = λX_{t-1} + Σ_{k=1}^∞ λ(1 - λ)^k X_{t-1-k}.

Factoring out (1 - λ) we have

    X̄_{t-1}(λ) = λX_{t-1} + (1 - λ) Σ_{k=1}^∞ λ(1 - λ)^{k-1} X_{t-1-k}.

The remaining sum is just X̄_{t-2}(λ), so that X̄_{t-1}(λ) = λX_{t-1} + (1 - λ)X̄_{t-2}(λ), or, advancing the time
index by one,

    X̄_t(λ) = λX_t + (1 - λ)X̄_{t-1}(λ).

The function, X̄_t(λ), may be thought of as the level of the process at time t. Since -1 < θ < 1, we have that
0 < λ < 2. In most cases 0 < λ < 1. If λ is close to 0, the level is almost constant. If λ is close to 1, the
level is determined almost exclusively by the most recent observation, X_t. Since

    X_t = X̄_{t-1}(λ) + ε_t

and

    X̄_t(λ) = λX_t + (1 - λ)X̄_{t-1}(λ),

we have

    X̄_t(λ) = λX̄_{t-1}(λ) + λε_t + (1 - λ)X̄_{t-1}(λ)

or

    X̄_t(λ) = X̄_{t-1}(λ) + λε_t.

Thus while the process itself has the random shock, ε_t, the level of the process has random shock, λε_t, which
may be more or less than ε_t depending on λ.
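
The recursion X̄_t(λ) = λX_t + (1 - λ)X̄_{t-1}(λ) is simple to implement. The sketch below is an illustration only, with an assumed smoothing constant λ = 0.25 and an arbitrary simulated series; it also confirms that the recursion reproduces the explicit geometric weighting of past observations.

    import numpy as np

    def ewma(x, lam, level0=None):
        """Exponential smoother: level_t = lam * x_t + (1 - lam) * level_{t-1}."""
        level = x[0] if level0 is None else level0
        out = []
        for xt in x:
            level = lam * xt + (1 - lam) * level
            out.append(level)
        return np.array(out)

    rng = np.random.default_rng(2)
    lam = 0.25                             # illustrative smoothing constant, lam = 1 - theta
    x = rng.normal(size=300).cumsum()      # a random-walk-like series, purely for demonstration
    level = ewma(x, lam)

    # the recursion agrees with the explicit weighted sum lam * (1 - lam)^j of past values
    j = np.arange(300)
    weights = lam * (1 - lam) ** j
    explicit = np.sum(weights * x[::-1]) + (1 - lam) ** 300 * x[0]
    print(bool(np.isclose(level[-1], explicit)))    # True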

Let us now turn to the ARIMA (0, 2, 2). We write

    I - θ_1 B - θ_2 B² = (λ_0 ∇ + λ_1)B + ∇²,

where

    λ_0 = 1 + θ_2    and    λ_1 = 1 - θ_1 - θ_2,

or in inverse form

    θ_1 = 2 - λ_0 - λ_1    and    θ_2 = λ_0 - 1.

We can then write the ARIMA (0, 2, 2) as

    ∇²X_t = (λ_0 ∇ + λ_1)ε_{t-1} + ∇²ε_t.

Again we make a formal computation. Taking the inverse ∇^{-1} = S twice,

    X_t = λ_0 S ε_{t-1} + λ_1 S² ε_{t-1} + ε_t.

But

    S ε_{t-1} = ε_{t-1} + ε_{t-2} + ε_{t-3} + ...,

so that

    S² ε_{t-1} = S ε_{t-1} + S ε_{t-2} + S ε_{t-3} + ... .

We may expand to get

    S² ε_{t-1} = ε_{t-1} + ε_{t-2} + ε_{t-3} + ...
                         + ε_{t-2} + ε_{t-3} + ...
                                   + ε_{t-3} + ... .

Collecting like terms, we have

    S² ε_{t-1} = ε_{t-1} + 2ε_{t-2} + 3ε_{t-3} + ... .

Thus

    X_t = ε_t + (λ_0 + 1·λ_1)ε_{t-1} + (λ_0 + 2λ_1)ε_{t-2} + (λ_0 + 3λ_1)ε_{t-3} + ... .

The ψ weights are

    ψ_0 = 1,    ψ_j = λ_0 + jλ_1,  j ≥ 1.

To find the inverted form, we equate coefficients in

    (1 - 2s + s²) = (1 - θ_1 s - θ_2 s²)(1 - π_1 s - π_2 s² - ...).

Thus

    π_1 = 2 - θ_1 = λ_0 + λ_1,
    π_2 = θ_1 (2 - θ_1) - (1 + θ_2) = λ_0 + 2λ_1 - (λ_0 + λ_1)²,

and

    (1 - θ_1 B - θ_2 B²)π_j = 0,  j ≥ 3.
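
As a check on the ψ-weights just derived, the sketch below (assumed, illustrative values θ_1 = 0.9 and θ_2 = -0.4) solves (1 - B)²ψ(B) = θ(B) coefficient by coefficient and compares the result with ψ_j = λ_0 + jλ_1.

    import numpy as np

    theta1, theta2 = 0.9, -0.4             # illustrative IMA(2, 2) coefficients (assumed values)
    lam0, lam1 = 1 + theta2, 1 - theta1 - theta2

    # coefficients of theta(B) = 1 - theta1*B - theta2*B^2, padded with zeros
    nterms = 10
    c = [1.0, -theta1, -theta2] + [0.0] * (nterms - 3)

    # (1 - 2B + B^2) psi(B) = theta(B)  =>  psi_j = c_j + 2*psi_{j-1} - psi_{j-2}
    psi = []
    for j in range(nterms):
        val = c[j]
        if j >= 1:
            val += 2 * psi[j - 1]
        if j >= 2:
            val -= psi[j - 2]
        psi.append(val)

    closed = [1.0] + [lam0 + j * lam1 for j in range(1, nterms)]
    print(bool(np.allclose(psi, closed)))   # True: psi_j = lam0 + j*lam1 for j >= 1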

Chapter 13
Least Squares Forecasts
13.1 Linear Forecasting Functions. We now wish to consider forecasting based on the general ARIMA
process model

    Φ(B)X_t = φ(B)∇^d X_t = θ(B)ε_t.

We suppose we are currently at time t and we wish to forecast m steps ahead, X_{t+m}. We shall write the
forecast of X_{t+m} as X̂_t(m). We wish to write X̂_t(m) as a linear function of the past values of the time series data
we know, that is, X_t, X_{t-1}, X_{t-2}, ... . Equivalently, we may write X̂_t(m) as a linear function of ε_t, ε_{t-1}, ε_{t-2},
... . We will assume an infinite moving average representation exists. For a nonstationary model this
representation doesn't converge. Hence, for nonstationary models, the following discussion is a formal
representation, but not a rigorous proof.
We wish to make a linear representation of the process depending on the past values ε_t, ε_{t-1}, ε_{t-2}, ...,
which we do by setting

(13.1.1)    X̂_t(m) = ψ*_m ε_t + ψ*_{m+1} ε_{t-1} + ψ*_{m+2} ε_{t-2} + ... .

Here, the ψ*_j are weights on the ε_t which may or may not be related to the ψ weights. We do know, however,
that

(13.1.2)    X_{t+m} = ε_{t+m} + ψ_1 ε_{t+m-1} + ... + ψ_m ε_t + ψ_{m+1} ε_{t-1} + ... .

Subtracting X̂_t(m) from X_{t+m}, squaring and then taking expectations yields

    E[X_{t+m} - X̂_t(m)]² = (1 + ψ_1² + ψ_2² + ... + ψ_{m-1}²)σ² + Σ_{j=0}^∞ (ψ_{m+j} - ψ*_{m+j})² σ².

This may be minimized by choosing ψ*_{m+j} = ψ_{m+j}. Thus the forecast is

(13.1.3)    X̂_t(m) = ψ_m ε_t + ψ_{m+1} ε_{t-1} + ψ_{m+2} ε_{t-2} + ... .

We then may write

    e_t(m) = ε_{t+m} + ψ_1 ε_{t+m-1} + ... + ψ_{m-1} ε_{t+1},

so that

(13.1.4)    X_{t+m} = e_t(m) + X̂_t(m).

We call e_t(m) the forecast error of lead time m. Let us write E[X_{t+m} | X_t, X_{t-1}, ...] as E[X_{t+m} | t]. We first
notice that

    E[X_{t+m} | t] = E[Σ_{j=0}^∞ ψ_j ε_{t+m-j} | t] = Σ_{j=0}^∞ ψ_j E[ε_{t+m-j} | t] = Σ_{j=m}^∞ ψ_j ε_{t+m-j} = X̂_t(m).

Hence X̂_t(m) is the best linear estimator of X_{t+m} given the data up to time t. We may now examine

(13.1.5)    E[e_t(m) | t] = E[ε_{t+m} + ψ_1 ε_{t+m-1} + ... + ψ_{m-1} ε_{t+1} | t] = 0.

Hence the expected forecast error is 0, and X̂_t(m) is the minimum mean square error estimator and is unbiased.
Hence X̂_t(m) is the best estimator in the Rao-Blackwell sense. Let us also notice that

    var[e_t(m) | t] = (1 + ψ_1² + ... + ψ_{m-1}²)σ².
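
In code, the lead-m forecast-error variance is a one-liner once the ψ-weights are known. The sketch below uses the AR(1)-style ψ-weights with φ = 0.8 as an assumed example, not a model from the text, and shows the variance growing toward γ_0 as the lead time increases.

    import numpy as np

    def forecast_error_variance(psi, m, sigma2=1.0):
        """(1 + psi_1^2 + ... + psi_{m-1}^2) * sigma^2; psi[0] holds psi_1."""
        psi = np.asarray(psi, dtype=float)
        return sigma2 * (1.0 + np.sum(psi[:m - 1] ** 2))

    # illustrative psi-weights: a stationary AR(1) with phi = 0.8 has psi_j = 0.8**j
    psi = 0.8 ** np.arange(1, 50)
    print([round(forecast_error_variance(psi, m), 3) for m in (1, 2, 5, 20)])
    # the values increase toward gamma_0 = 1 / (1 - 0.8**2) = 2.778 as m grows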

Forecasting Linear Combinations of X
Consider now the array, X_{t+1}, ..., X_{t+m}, defined as follows:

    X_{t+1} = ε_{t+1} + ψ_1 ε_t + ψ_2 ε_{t-1} + ...
    ⋮
    X_{t+m} = ε_{t+m} + ψ_1 ε_{t+m-1} + ... + ψ_m ε_t + ψ_{m+1} ε_{t-1} + ... .

We may define Q by

    Q = Σ_{j=1}^m w_j X_{t+j} = w_1 X_{t+1} + w_2 X_{t+2} + ... + w_m X_{t+m}.

Writing the X_{t+j} in terms of their ψ-weight representations, we have

    Q = w_m ε_{t+m} + (w_{m-1} + ψ_1 w_m)ε_{t+m-1} + ... + (w_1 + ψ_1 w_2 + ... + ψ_{m-1} w_m)ε_{t+1}
        + (ψ_1 w_1 + ... + ψ_m w_m)ε_t + (ψ_2 w_1 + ... + ψ_{m+1} w_m)ε_{t-1} + ... .

Suppose we want a linear forecast of Q in the form Q̂ = c_0 ε_t + c_1 ε_{t-1} + ... . Then, as before,

    E[Q - Q̂]² = [w_m² + (w_{m-1} + ψ_1 w_m)² + ... + (w_1 + ψ_1 w_2 + ... + ψ_{m-1} w_m)²]σ²
                + [(ψ_1 w_1 + ... + ψ_m w_m - c_0)² + (ψ_2 w_1 + ... + ψ_{m+1} w_m - c_1)² + ...]σ².

This may be minimized by choosing c_0 = (ψ_1 w_1 + ... + ψ_m w_m), c_1 = (ψ_2 w_1 + ... + ψ_{m+1} w_m), and so on.
Thus the minimum mean square error forecast is

    Q̂ = (ψ_1 w_1 + ... + ψ_m w_m)ε_t + (ψ_2 w_1 + ... + ψ_{m+1} w_m)ε_{t-1} + ... .

But from our earlier discussion, we know

    X̂_t(1) = ψ_1 ε_t + ψ_2 ε_{t-1} + ...
    X̂_t(2) = ψ_2 ε_t + ψ_3 ε_{t-1} + ...
    ⋮
    X̂_t(m) = ψ_m ε_t + ψ_{m+1} ε_{t-1} + ... .

From this we may conclude that

(13.1.6)    Q̂ = w_1 X̂_t(1) + ... + w_m X̂_t(m).

Thus the least squares forecast of w_1 X_{t+1} + ... + w_m X_{t+m} is w_1 X̂_t(1) + ... + w_m X̂_t(m).

Correlations of Forecast Errors. Now let us consider the correlations of forecast errors. First we note
that

    e_t(1) = X_{t+1} - X̂_t(1) = ψ_0 ε_{t+1} = ε_{t+1}.

The one-step-ahead forecast errors are uncorrelated. Also we note

    e_t(m) = X_{t+m} - X̂_t(m) = ε_{t+m} + ψ_1 ε_{t+m-1} + ... + ψ_{m-1} ε_{t+1}

and

    e_{t-j}(m) = X_{t+m-j} - X̂_{t-j}(m) = ε_{t+m-j} + ψ_1 ε_{t+m-j-1} + ... + ψ_{m-1} ε_{t-j+1}.

Thus

    E[e_t(m) e_{t-j}(m)] = [ψ_j + ψ_1 ψ_{j+1} + ... + ψ_{m-1-j} ψ_{m-1}]σ² = σ² Σ_{i=0}^{m-1-j} ψ_i ψ_{i+j},

with ψ_0 = 1. Thus

    corr[e_t(m), e_{t-j}(m)] = Σ_{i=0}^{m-1-j} ψ_i ψ_{i+j} / Σ_{i=0}^{m-1} ψ_i²,  0 ≤ j < m,

and, of course, 0 elsewhere.
Finally, we observe

    e_t(m + j) = X_{t+m+j} - X̂_t(m + j)
               = ε_{t+m+j} + ψ_1 ε_{t+m+j-1} + ... + ψ_j ε_{t+m} + ... + ψ_{m+j-1} ε_{t+1}.

Thus

    E[e_t(m) e_t(m + j)] = [ψ_j + ψ_1 ψ_{j+1} + ... + ψ_{m-1} ψ_{m+j-1}]σ² = σ² Σ_{i=0}^{m-1} ψ_i ψ_{i+j}.

From this we may compute the correlation,

    corr[e_t(m), e_t(m + j)] = Σ_{i=0}^{m-1} ψ_i ψ_{i+j} / {Σ_{i=0}^{m-1} ψ_i² Σ_{i=0}^{m+j-1} ψ_i²}^{1/2}.
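
These correlations are easy to tabulate from the ψ-weights. The short sketch below does so for the same assumed AR(1)-style weights used above; it is an illustration only.

    import numpy as np

    def fe_correlation(psi, m, j):
        """corr[e_t(m), e_{t-j}(m)] for 0 <= j < m; psi[0] = psi_0 = 1."""
        psi = np.asarray(psi, dtype=float)[:m]          # psi_0, ..., psi_{m-1}
        num = np.sum(psi[:m - j] * psi[j:m])            # sum of psi_i * psi_{i+j}
        return num / np.sum(psi ** 2)

    psi = 0.8 ** np.arange(0, 20)                       # illustrative psi-weights (assumed example)
    print([round(fe_correlation(psi, m=5, j=j), 3) for j in range(5)])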







13.2 Other Forms of Forecast Functions. We have so far considered forecasts from the ψ-weight
formulation of the general linear model. Let us now consider some alternatives. We first observe that

    E[X_{t-j} | t] = X_{t-j},  j = 0, 1, 2, ...,
    E[X_{t+j} | t] = X̂_t(j),  j = 1, 2, 3, ...,
    E[ε_{t-j} | t] = ε_{t-j} = X_{t-j} - X̂_{t-j-1}(1),  j = 0, 1, 2, ...,

and finally

    E[ε_{t+j} | t] = 0,  j = 1, 2, 3, ... .

Consider the mixed formulation

    X_{t+m} = Φ_1 X_{t+m-1} + ... + Φ_{p+d} X_{t+m-p-d} - θ_1 ε_{t+m-1} - ... - θ_q ε_{t+m-q} + ε_{t+m}.

Thus we have

(13.2.1)    X̂_t(m) = Φ_1 E[X_{t+m-1} | t] + ... + Φ_{p+d} E[X_{t+m-p-d} | t]
                      - θ_1 E[ε_{t+m-1} | t] - ... - θ_q E[ε_{t+m-q} | t] + E[ε_{t+m} | t].

Using the above forms of the conditional expectations, we can use (13.2.1) as a forecast function. To make this
clearer, it is worth considering some examples. Consider for example (I - .8B)(I - B)X_{t+m} = ε_{t+m}. Then

    X_{t+m} = 1.8 X_{t+m-1} - .8 X_{t+m-2} + ε_{t+m}.

Then

    X̂_t(1) = 1.8 X_t - .8 X_{t-1},
    X̂_t(2) = 1.8 X̂_t(1) - .8 X_t,

and

    X̂_t(m) = 1.8 X̂_t(m - 1) - .8 X̂_t(m - 2),  m = 3, 4, 5, ... .
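
For a purely autoregressive operator like this one, the recursion can be coded directly. The sketch below simulates the same model only to have some data to work with, and then produces the forecasts X̂_t(1), X̂_t(2), ... from the last two observed values.

    import numpy as np

    def forecast_difference_equation(x, m_max):
        """Forecasts for (I - .8B)(I - B)X_t = eps_t via X_hat(m) = 1.8*prev - 0.8*prev2."""
        hist = [x[-2], x[-1]]                  # X_{t-1}, X_t
        fcst = []
        for _ in range(m_max):
            fcst.append(1.8 * hist[-1] - 0.8 * hist[-2])
            hist.append(fcst[-1])              # forecasts feed back into the recursion
        return fcst

    rng = np.random.default_rng(3)             # simulate the model itself, purely for demonstration
    eps = rng.normal(size=500)
    x = np.zeros(500)
    for t in range(2, 500):
        x[t] = 1.8 * x[t - 1] - 0.8 * x[t - 2] + eps[t]

    print(np.round(forecast_difference_equation(x, 5), 3))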

As another example, consider

    ∇²X_{t+m} = (I - .9B + .5B²)ε_{t+m}.

Then

    X_{t+m} = 2 X_{t+m-1} - X_{t+m-2} + ε_{t+m} - .9 ε_{t+m-1} + .5 ε_{t+m-2}.

We may thus apply the conditional expectations to obtain

    X̂_t(1) = 2 X_t - X_{t-1} - .9 [X_t - X̂_{t-1}(1)] + .5 [X_{t-1} - X̂_{t-2}(1)],
    X̂_t(2) = 2 X̂_t(1) - X_t + .5 [X_t - X̂_{t-1}(1)],

and

    X̂_t(m) = 2 X̂_t(m - 1) - X̂_t(m - 2),  m = 3, 4, 5, ... .

Notice that in both cases, eventually, Φ(B)X̂_t(m) = 0, where the operator B operates on the index m. Thus
eventually the forecast X̂_t(m) has the same structure as the covariance, γ_k. Finally we note that we can write

    X_{t+m} = Σ_{j=1}^∞ π_j X_{t+m-j} + ε_{t+m},

so that

(13.2.2)    X̂_t(m) = Σ_{j=1}^{m-1} π_j X̂_t(m - j) + Σ_{j=m}^∞ π_j X_{t+m-j}.

13.3 Practical Forecasting We have now developed three forms of the forecasting function, specifically

(13.3.1)    X̂_t(m) = ψ_m E[ε_t | t] + ψ_{m+1} E[ε_{t-1} | t] + ...,

(13.3.2)    X̂_t(m) = Φ_1 E[X_{t+m-1} | t] + ... + Φ_{p+d} E[X_{t+m-p-d} | t]
                      - θ_1 E[ε_{t+m-1} | t] - ... - θ_q E[ε_{t+m-q} | t] + E[ε_{t+m} | t],

and

(13.3.3)    X̂_t(m) = Σ_{j=1}^{m-1} π_j X̂_t(m - j) + Σ_{j=m}^∞ π_j X_{t+m-j}.

Equation (13.3.2) is a useful computational algorithm if the process is purely autoregressive in form. However,
if there is a nontrivial moving average component, the computation of E[ε_{t-j} | t] involves X_{t-j} - X̂_{t-j-1}(1).
This leads to an infinite regression and, so, creates a problem with initializing the algorithm. The same is
basically true of (13.3.1). Equation (13.3.3) can be used, but of course we would generally only have data
X_1, ..., X_t, so that

    X̂_t(m) = Σ_{j=1}^{m-1} π_j X̂_t(m - j) + Σ_{j=m}^{t+m-1} π_j X_{t+m-j}

is the practical formula. Because of the truncation of the second series, this would only be practical in settings
where the π-weights converged to 0 sufficiently rapidly that the truncated part of the series were negligible.
Clearly the model for the time series must be invertible for this to be true, but even more, the time series must be
sufficiently long so that we can reach sufficiently far back into the tail of the second series.
To calculate the π-weights, we use the standard device we have used earlier. Since Φ(B)X_t = θ(B)ε_t
and also π(B)X_t = ε_t, we can substitute the second in the first to obtain

    Φ(B)X_t = θ(B)π(B)X_t.

By equating operators, we get Φ(B) = θ(B)π(B). We can then equate the coefficients of the B's to solve for the π_j.
In general,

    π_1 = Φ_1 - θ_1,
    π_2 = Φ_2 - θ_2 + θ_1(Φ_1 - θ_1),
    ⋮
    θ(B)π_j = 0,  j > max{p + d, q}.

In the last difference equation, the operator θ(B) operates on the j subscript. Notice that if the roots of the
auxiliary equation, θ(s) = 0, lie outside the unit circle, then the π-weights will be exponentially damped (or
dominated by something which is exponentially damped) and, hence, we will be able to use the π-weight
representation to initiate our forecasting.
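
The coefficient-matching recursion for the π-weights is compact in code. The sketch below implements it for a general (Φ, θ) pair and then applies it to an assumed ARIMA(1, 1, 1) with φ = 0.5 and θ = 0.3, recovering the values found in Section 12.2 together with the sum-to-one property.

    import numpy as np

    def pi_weights(Phi, theta, nterms):
        """pi-weights from Phi(B) = theta(B) pi(B) by equating coefficients of B^j."""
        Phi = list(Phi) + [0.0] * nterms        # generalized autoregressive coefficients Phi_j
        theta = list(theta) + [0.0] * nterms    # moving average coefficients theta_j
        pi = []
        for j in range(1, nterms + 1):
            val = Phi[j - 1] - theta[j - 1]
            val += sum(theta[i - 1] * pi[j - i - 1] for i in range(1, j))
            pi.append(val)
        return np.array(pi)

    # ARIMA(1, 1, 1) with phi = 0.5, theta = 0.3 (assumed values):
    # (1 - 0.5B)(1 - B) = 1 - 1.5B + 0.5B^2, so Phi_1 = 1.5 and Phi_2 = -0.5
    pi = pi_weights(Phi=[1.5, -0.5], theta=[0.3], nterms=60)
    print(np.round(pi[:4], 4))                  # 1.2, -0.14, -0.042, ...
    print(bool(np.isclose(pi.sum(), 1.0)))      # True, since d = 1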
13.4 Updating Forecasts One issue of substantial interest is an updating procedure. That is, if we know
the time series up to time t, e.g. X_1, ..., X_t, and we add a new observation, X_{t+1}, it is desirable not to have to
recompute the whole forecast from scratch. To see how to update, recall that

    X̂_{t+1}(m) = ψ_m ε_{t+1} + ψ_{m+1} ε_t + ψ_{m+2} ε_{t-1} + ...

and

    X̂_t(m + 1) = ψ_{m+1} ε_t + ψ_{m+2} ε_{t-1} + ψ_{m+3} ε_{t-2} + ... .

Subtracting, we have

    X̂_{t+1}(m) - X̂_t(m + 1) = ψ_m ε_{t+1}

or

    X̂_{t+1}(m) = X̂_t(m + 1) + ψ_m ε_{t+1}.

Notice that X̂_{t+1}(m) is an updated forecast of X_{t+m+1} with the shift in origin from t to t + 1; X̂_{t+1}(m) and
X̂_t(m + 1) both forecast X_{t+m+1}. Thus to update a forecast, we only need to add a single term, ψ_m ε_{t+1}, to the
old forecast. Recall also that ε_{t+1} = X_{t+1} - X̂_t(1). Thus the update term is

    ψ_m [X_{t+1} - X̂_t(1)].

To calculate the ψ-weights, we use the same standard device as in the previous section,

    Φ(B)ψ(B) = θ(B),

so that

    (I - Φ_1 B - ... - Φ_{p+d} B^{p+d})(I + ψ_1 B + ψ_2 B² + ...) = (I - θ_1 B - ... - θ_q B^q).

Equating coefficients, as before,

    ψ_1 = Φ_1 - θ_1,
    ψ_2 = Φ_1 ψ_1 + Φ_2 - θ_2,
    ⋮
    ψ_j = Φ_1 ψ_{j-1} + ... + Φ_{p+d} ψ_{j-p-d} - θ_j,

where ψ_0 = 1, ψ_j = 0 for j < 0, and θ_j = 0 for j > q.
For example, in our example in the previous section, we were considering (I - .8B)(I - B)X_t = ε_t. The
computation thus is

    ψ_1 = 1.8,
    ψ_2 = 1.8 ψ_1 - .8,
    ⋮
    ψ_j = 1.8 ψ_{j-1} - .8 ψ_{j-2}.

From this it is simple to compute ψ_0 = 1, ψ_1 = 1.8, ψ_2 = 2.44, ψ_3 = 2.95, ψ_4 = 3.36, ... . From this it is a
simple step to update our forecasts. Notice that the ψ-weights are increasing. This is as it should be since this
is a nonstationary model and the ψ-weights will not converge.
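
The update rule is equally short in code. The sketch below reuses the ψ-weights just computed and applies X̂_{t+1}(m) = X̂_t(m+1) + ψ_m[X_{t+1} - X̂_t(1)]; the numerical forecasts and the new observation are made-up values, used only to show the mechanics.

    import numpy as np

    def psi_weights(n):
        """psi-weights for (I - .8B)(I - B)X_t = eps_t: psi_j = 1.8*psi_{j-1} - .8*psi_{j-2}."""
        psi = [1.0, 1.8]
        while len(psi) < n:
            psi.append(1.8 * psi[-1] - 0.8 * psi[-2])
        return np.array(psi[:n])

    def update(old_forecasts, x_new, psi):
        """Shift origin t -> t+1: new m-step forecast = old (m+1)-step forecast + psi_m * shock."""
        shock = x_new - old_forecasts[0]                 # eps_{t+1} = X_{t+1} - X_hat_t(1)
        m = np.arange(1, len(old_forecasts))             # m = 1, ..., M-1
        return old_forecasts[1:] + psi[m] * shock

    psi = psi_weights(10)
    print(np.round(psi[:5], 2))                          # 1.0, 1.8, 2.44, 2.95, 3.36
    old = np.array([10.0, 10.4, 10.7, 10.9])             # hypothetical X_hat_t(1), ..., X_hat_t(4)
    print(np.round(update(old, x_new=10.3, psi=psi), 3))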
13.5 Confidence Limits for Forecasts Finally, we would like to address the question of confidence
limits for the forecasts. Let us consider X_{t+m}. We first observe that E[X_{t+m} | t] = X̂_t(m), so that

    var[X_{t+m} | t] = E[{X_{t+m} - X̂_t(m)}² | t] = E[{e_t(m)}² | t] = (1 + ψ_1² + ... + ψ_{m-1}²)σ².

Assuming the ε_t are normally distributed, then so is X_{t+m} - X̂_t(m). Thus upper and lower probability limits
for X_{t+m} are

    X_{t+m}^{(+)} = X̂_t(m) + p_{α/2} {1 + Σ_{j=1}^{m-1} ψ_j²}^{1/2} σ

and

    X_{t+m}^{(-)} = X̂_t(m) - p_{α/2} {1 + Σ_{j=1}^{m-1} ψ_j²}^{1/2} σ,

where p_{α/2} is the (1 - α/2) × 100 percentile of a standard normal distribution. If we know the model exactly,
then, of course, σ is known.
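
A sketch of these limits follows. It assumes SciPy is available for the normal percentile, and the forecast value, σ and the ψ-weights are illustrative stand-ins rather than quantities from the text.

    import numpy as np
    from scipy.stats import norm      # used only for the normal percentile p_{alpha/2}

    def forecast_limits(x_hat_m, psi, m, sigma, alpha=0.05):
        """(1 - alpha) probability limits for X_{t+m}; psi[0] holds psi_1."""
        p = norm.ppf(1 - alpha / 2)                  # about 1.96 for alpha = 0.05
        half = p * sigma * np.sqrt(1.0 + np.sum(np.asarray(psi[:m - 1], float) ** 2))
        return x_hat_m - half, x_hat_m + half

    psi = 0.8 ** np.arange(1, 50)                    # illustrative psi-weights (assumed example)
    print(forecast_limits(x_hat_m=10.0, psi=psi, m=5, sigma=1.0))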
In closing this section, it is worth noting that all of this chapter presumes that we know the model
exactly. Thus, none of the coefficients Φ_j, φ_j, θ_j or ψ_j are estimated. None of this theory really works if, when
we take the conditional expectations, E[ · | t], we must account for coefficients Φ_j, φ_j, θ_j and ψ_j which are
random variables depending on the observations, X_1, ..., X_t. In practice, of course, we use estimates of Φ_j, φ_j,
θ_j and ψ_j based on the data and then forecast into the future based on these estimated coefficients. This is,
strictly speaking, a questionable practice from a theoretical point of view and is truly a weakness of the theory
just developed. The situation is improved somewhat if the data used to estimate the coefficients, Φ_j, φ_j, θ_j and
ψ_j, is different from the data used for forecasting X_{t+m}. In this case the conditional expectations are at least
legitimate and the minimum mean square error forecast holds at least conditionally. Unfortunately, this is often
not possible, for example, with economic forecasting. In any case, the procedures described above seem to have
value even though the proofs given may not be entirely legitimate in the setting in which these procedures are
often used.

Appendix A
Riemann-Stieltjes Integrals
In our discussion of random process models in Chapter 6, it is convenient to use the concept of
Riemann-Stieltjes integration. Consider an interval (α, β) and let x_0 < x_1 < ... < x_n be a partition of (α, β) so
that

(A.1)    α = x_0 < x_1 < ... < x_n = β.

Let Δx = max_j (x_j - x_{j-1}) and consider the sum

(A.2)    Σ_{j=1}^n g(x_j)(F(x_j) - F(x_{j-1})).

The Riemann-Stieltjes integral is defined as

(A.3)    ∫_α^β g(x) dF(x) = lim_{Δx→0} Σ_{j=1}^n g(x_j)(F(x_j) - F(x_{j-1})).

If F is the identity function, the Riemann-Stieltjes integral is the ordinary integral as defined in elementary
calculus. The function F in general must be of bounded variation, but for our purposes, we will make the more
restrictive assumption that F is non-decreasing. The usual properties of integrals hold.

Property 1.    ∫ (g(x) + h(x)) dF(x) = ∫ g(x) dF(x) + ∫ h(x) dF(x).

Property 2.    ∫_α^β dF(x) = F(β) - F(α).

Property 3.    If f(x) = dF(x)/dx, then ∫ g(x) dF(x) = ∫ g(x) f(x) dx.

Property 4.    If g(x) ≡ 1, then ∫_α^β f(x) dx = F(β) - F(α) and, hence, F(x) = ∫_{-∞}^x f(u) du.

If F is a function of two variables, x and y, and

    α = x_0 < x_1 < ... < x_n = β    and    γ = y_0 < y_1 < ... < y_m = δ,

and also Δx = max_j (x_j - x_{j-1}) and Δy = max_k (y_k - y_{k-1}), then

    lim_{Δx→0, Δy→0} Σ_j Σ_k g(x_j, y_k) (F(x_j, y_k) - F(x_{j-1}, y_k) - F(x_j, y_{k-1}) + F(x_{j-1}, y_{k-1}))
        = ∫∫ g(x, y) F(dx, dy).
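
For completeness, a small numerical sketch of the sum (A.2) follows. It assumes NumPy, uses F(x) = x² on (0, 1) as an example of a non-decreasing function, and checks Property 3 by comparing against the ordinary integral of g(x)·f(x) with f(x) = 2x.

    import numpy as np

    def riemann_stieltjes(g, F, a, b, n=10_000):
        """Approximate the Riemann-Stieltjes integral of g with respect to F over (a, b)."""
        x = np.linspace(a, b, n + 1)
        return float(np.sum(g(x[1:]) * np.diff(F(x))))   # sum of g(x_j) * (F(x_j) - F(x_{j-1}))

    g = np.cos
    lhs = riemann_stieltjes(g, lambda x: x**2, 0.0, 1.0)

    # Property 3: since dF/dx = 2x here, the value should match the ordinary integral of cos(x)*2x
    x = np.linspace(0.0, 1.0, 10_001)
    rhs = float(np.sum(g(x[1:]) * 2 * x[1:] * np.diff(x)))
    print(round(lhs, 5), round(rhs, 5))     # both approximately 2*(sin(1) + cos(1) - 1)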

WOLFER SUNSPOT NUMBERS: YEARLY*

    1770  101     1795   21     1820   16     1845   40
    1771   82     1796   16     1821    7     1846   62
    1772   66     1797    6     1822    4     1847   98
    1773   35     1798    4     1823    2     1848  124
    1774   31     1799    7     1824    8     1849   96
    1775    7     1800   14     1825   17     1850   66
    1776   20     1801   34     1826   36     1851   64
    1777   92     1802   45     1827   50     1852   54
    1778  154     1803   43     1828   62     1853   39
    1779  125     1804   48     1829   67     1854   21
    1780   85     1805   42     1830   71     1855    7
    1781   68     1806   28     1831   48     1856    4
    1782   38     1807   10     1832   28     1857   23
    1783   23     1808    8     1833    8     1858   55
    1784   10     1809    2     1834   13     1859   94
    1785   24     1810    0     1835   57     1860   96
    1786   83     1811    1     1836  122     1861   77
    1787  132     1812    5     1837  138     1862   59
    1788  131     1813   12     1838  103     1863   44
    1789  118     1814   14     1839   86     1864   47
    1790   90     1815   35     1840   63     1865   30
    1791   67     1816   46     1841   37     1866   16
    1792   60     1817   41     1842   24     1867    7
    1793   47     1818   30     1843   11     1868   37
    1794   41     1819   24     1844   15     1869   74

*100 Observations
